Chapter
Multimodal Learning with Audio and Video
The idea of multimodal learning is to take audio and video signals and learn a common embedding space where the two modalities can be closely related. By doing so, the learned representation can be used to recognize human actions and different types of sounds in downstream tasks.
Clips
In multimodal learning, both video and audio signals are used to learn a common feature space where related modalities can be together.
1:30:44 - 1:33:27 (02:42)
Summary
In multimodal learning, both video and audio signals are used to learn a common feature space where related modalities can be together. This video network can be used to recognize human actions or different types of sounds.
ChapterMultimodal Learning with Audio and Video
Episode#206 – Ishan Misra: Self-Supervised Deep Learning in Computer Vision
PodcastLex Fridman Podcast
Neural networks can learn to associate sounds with objects through correlation and detect the location of where the sound is coming from.
1:33:27 - 1:35:19 (01:51)
Summary
Neural networks can learn to associate sounds with objects through correlation and detect the location of where the sound is coming from. Additionally, it can detect the location of a person’s mouth based on their voice.