Chapter

Multimodal Learning with Audio and Video
1:30:44 - 1:35:19 (04:34)

The idea of multimodal learning is to take audio and video signals and learn a common embedding space in which matching content from the two modalities lies close together. The learned representation can then be used in downstream tasks such as recognizing human actions and different types of sounds.

Clips
In multimodal learning, video and audio signals are used together to learn a common feature space in which related content from the two modalities lies close together.
1:30:44 - 1:33:27 (02:42)
Multimodal Learning
Summary

In multimodal learning, video and audio signals are used together to learn a common feature space in which related content from the two modalities lies close together. The resulting networks can then be used to recognize human actions or different types of sounds.
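
One common way to learn such a shared space is a contrastive objective that pulls the audio and video embeddings of the same clip together and pushes embeddings of different clips apart. The clip does not say which loss is used, so the following is only a minimal NumPy sketch under that assumption; the function name `audio_video_contrastive_loss` and the `temperature` parameter are hypothetical, and the embeddings are assumed to come from separate video and audio encoders that are not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def audio_video_contrastive_loss(video_emb, audio_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired clips (illustrative only).

    video_emb, audio_emb: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same clip.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    # logits[i, j] = similarity between video i and audio j, scaled by temperature.
    logits = v @ a.T / temperature
    idx = np.arange(len(v))
    # Each video should pick out its own audio track, and vice versa.
    loss_video_to_audio = -log_softmax(logits)[idx, idx].mean()
    loss_audio_to_video = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_video_to_audio + loss_audio_to_video)
```

The loss is low when each video embedding is most similar to the audio embedding from its own clip, which is the sense in which related content ends up close together in the common space.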

Chapter
Multimodal Learning with Audio and Video
Episode
#206 – Ishan Misra: Self-Supervised Deep Learning in Computer Vision
Podcast
Lex Fridman Podcast
Neural networks can learn to associate sounds with objects through correlation and to locate where in the scene a sound is coming from.
1:33:27 - 1:35:19 (01:51)
Artificial Intelligence
Summary

Neural networks can learn to associate sounds with objects through correlation and to locate where in the scene a sound is coming from. For example, they can locate a person's mouth in the frame based on the sound of their voice.
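
As a rough illustration of how localization can fall out of a shared embedding space, the sketch below compares one audio embedding against the visual feature at every spatial position and takes the best match as the sound's location. This is an assumed approach and not something the clip describes in detail; `localize_sound` and its arguments are hypothetical names, and the visual feature map is assumed to come from a video encoder that keeps a spatial grid.

```python
import numpy as np


def localize_sound(visual_features, audio_embedding, eps=1e-8):
    """Return a similarity map and the grid cell that best matches the audio.

    visual_features: array of shape (height, width, dim), one embedding per cell.
    audio_embedding: array of shape (dim,), in the same embedding space.
    """
    height, width, dim = visual_features.shape
    flat = visual_features.reshape(height * width, dim)
    # Normalize both sides so the dot product is a cosine similarity.
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)
    audio = audio_embedding / (np.linalg.norm(audio_embedding) + eps)
    similarity = (flat @ audio).reshape(height, width)
    # The most audio-like cell is taken as the estimated sound source.
    row, col = np.unravel_index(np.argmax(similarity), similarity.shape)
    return similarity, (int(row), int(col))
```

In the mouth example above, the cells covering the speaker's mouth would be expected to score highest against the embedding of their voice.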
