Clip

Learning Common Features for Video and Audio Signals
In multimodal learning, video and audio signals are used together to learn a common feature space in which related inputs from the two modalities lie close together. The resulting network can then be used to recognize human actions or different types of sounds.
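As a rough illustration of what a common feature space means in practice, the sketch below projects video and audio feature vectors of different sizes into one shared space and scores how well matching pairs line up. Everything in it is an assumption made for illustration: the text does not specify an architecture or a training objective, so the linear projections, the dimensions, the random placeholder features, the names (`project`, `contrastive_loss`, `W_video`, `W_audio`) and the InfoNCE-style contrastive loss are all hypothetical choices, not the method described here.

```python
import math
import random


def project(x, W):
    """Linear projection of feature vector x by weight matrix W (a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def l2_normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(video_embs, audio_embs, temperature=0.1):
    """InfoNCE-style loss (assumed objective): video i should match audio i.

    For each video embedding, this is the cross-entropy of picking its own
    audio embedding among all audio embeddings in the batch.
    """
    n = len(video_embs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(video_embs[i], audio_embs[j]) / temperature
            for j in range(n)
        ]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n


# Hypothetical sizes: the raw video and audio features have different
# dimensions, but both are mapped into the same SHARED_DIM-dimensional space.
random.seed(0)
VIDEO_DIM, AUDIO_DIM, SHARED_DIM, BATCH = 6, 4, 3, 4

W_video = [[random.gauss(0, 1) for _ in range(VIDEO_DIM)] for _ in range(SHARED_DIM)]
W_audio = [[random.gauss(0, 1) for _ in range(AUDIO_DIM)] for _ in range(SHARED_DIM)]

# Random placeholder features standing in for real encoder outputs.
video_feats = [[random.gauss(0, 1) for _ in range(VIDEO_DIM)] for _ in range(BATCH)]
audio_feats = [[random.gauss(0, 1) for _ in range(AUDIO_DIM)] for _ in range(BATCH)]

video_embs = [project(x, W_video) for x in video_feats]
audio_embs = [project(x, W_audio) for x in audio_feats]

loss = contrastive_loss(video_embs, audio_embs)
print("contrastive loss with untrained projections:", loss)
```

In a real system the two projections would be learned encoders, trained so that this loss (or a comparable objective) decreases. Once matching video and audio clips land near each other, a classifier for actions or sound types can operate on the shared embeddings.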