We propose a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional methods for ASD usually operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among candidates. This potentially limits performance, especially in challenging scenarios such as low-resolution faces and multiple candidate speakers. Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among the candidates and contrast their audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on such information, our model optimizes all candidates in a unified process for robust and reliable ASD. A thorough ablation study is performed on several challenging ASD benchmarks under different settings. In particular, our method outperforms the state-of-the-art by a large margin of about 15% mean Average Precision (mAP) absolute on two challenging subsets: one with three candidate speakers, and the other with faces smaller than 64 pixels. Together, our UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, surpassing 90% for the first time on this challenging dataset at the time of submission.
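As a rough illustration of the spatial context cue mentioned above, the sketch below renders one candidate's face position and scale as a 2D Gaussian heatmap that a small CNN could then embed. The function name, map resolution, and exact parameterization are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def spatial_context_map(bbox, frame_w, frame_h, out_size=64):
    """Render one candidate's face position/scale as a 2D Gaussian heatmap.

    `bbox` is (x1, y1, x2, y2) in pixels; the Gaussian is centered on the
    face center and its spread follows the face size, so a small peripheral
    face and a large central one produce clearly different maps.
    Hypothetical helper; the paper's exact parameterization may differ.
    """
    x1, y1, x2, y2 = bbox
    cx = (x1 + x2) / 2 / frame_w          # normalized face center
    cy = (y1 + y2) / 2 / frame_h
    sx = max((x2 - x1) / frame_w, 1e-3)   # normalized width/height as spread
    sy = max((y2 - y1) / frame_h, 1e-3)

    xs = np.linspace(0, 1, out_size)
    ys = np.linspace(0, 1, out_size)
    gx, gy = np.meshgrid(xs, ys)
    heatmap = np.exp(-0.5 * (((gx - cx) / sx) ** 2 + ((gy - cy) / sy) ** 2))
    return heatmap.astype(np.float32)     # (out_size, out_size), values in (0, 1]

# Example: a 200x200 face near the top-left of a 1280x720 frame
m = spatial_context_map((100, 80, 300, 280), frame_w=1280, frame_h=720)
print(m.shape, m.max())
```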
Model overview: Given a video, we first extract face track features and audio features. Face scale and position information is encoded as 2D Gaussians and embedded with CNN layers; we refer to this as spatial context. Next, we construct contextual audio-visual representations for each candidate through intermediate visual relational context and audio-visual relational context modules. For the visual relational context, we introduce a permutation-equivariant layer that refines each speaker's visual representation by incorporating pairwise relationships and long-term temporal context. For the audio-visual relational context, we model each candidate's audio-visual affinity over time and then suppress non-active speakers by contrasting their affinity features with the others'. The final refined visual and audio-visual representations are concatenated and passed through a shared prediction layer to estimate a confidence score for each visible candidate.
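Below is a minimal, hedged sketch of how such a pipeline could be wired together in PyTorch. It is not the paper's released code: the module names (PermEquivariantLayer, UniConHead), the Deep-Sets-style mean pooling over candidates, the GRU used for temporal aggregation, and the mean-subtraction used to contrast audio-visual affinities are all assumptions chosen to mirror the description above.

```python
import torch
import torch.nn as nn

class PermEquivariantLayer(nn.Module):
    """Deep-Sets-style permutation-equivariant update over the candidate axis.

    Each candidate's feature is refined with its own projection plus a
    projection of the mean over all co-occurring candidates, so the output
    does not depend on how the candidates are ordered.
    """
    def __init__(self, dim):
        super().__init__()
        self.self_proj = nn.Linear(dim, dim)
        self.ctx_proj = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):                     # x: (batch, num_candidates, time, dim)
        ctx = x.mean(dim=1, keepdim=True)     # pool over candidates
        return self.act(self.self_proj(x) + self.ctx_proj(ctx))


class UniConHead(nn.Module):
    """Illustrative relational fusion + shared prediction layer (not the official model)."""
    def __init__(self, dim):
        super().__init__()
        self.visual_rel = PermEquivariantLayer(dim)
        self.temporal = nn.GRU(dim, dim, batch_first=True)   # long-term temporal context
        self.affinity = nn.Linear(2 * dim, dim)               # audio-visual affinity per candidate
        self.classifier = nn.Linear(2 * dim, 1)               # shared across all candidates

    def forward(self, vis, aud):
        # vis: (B, N, T, D) per-candidate visual features (with spatial context added)
        # aud: (B, T, D)    shared audio features
        B, N, T, D = vis.shape
        v = self.visual_rel(vis)                              # visual relational context
        v, _ = self.temporal(v.reshape(B * N, T, D))
        v = v.reshape(B, N, T, D)

        a = aud.unsqueeze(1).expand(B, N, T, D)
        av = self.affinity(torch.cat([v, a], dim=-1))         # audio-visual affinity features
        av = av - av.mean(dim=1, keepdim=True)                # contrast each candidate against the others

        logits = self.classifier(torch.cat([v, av], dim=-1))  # (B, N, T, 1)
        return logits.squeeze(-1).sigmoid()                   # per-candidate, per-frame scores


# Toy usage: batch of 2 clips, 3 candidates, 16 frames, 128-dim features
scores = UniConHead(dim=128)(torch.randn(2, 3, 16, 128), torch.randn(2, 16, 128))
print(scores.shape)  # torch.Size([2, 3, 16])
```

The design point this sketch tries to mirror is that all candidates in a scene are scored jointly by one shared head, so the relational and contrastive terms can suppress non-active speakers relative to the active one.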
We took part in the International Challenge on Activity Recognition (ActivityNet) CVPR 2021 Workshop and won first place in the AVA-ActiveSpeaker task with 93.44% mAP, using an extended version of UniCon that sets a new state of the art on AVA-ActiveSpeaker. Below is the talk we gave at the virtual workshop. (Our part starts at 32:28.)
Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, Xilin Chen, "UniCon: Unified Context Network for Robust Active Speaker Detection", 29th ACM International Conference on Multimedia (ACM Multimedia 2021), Chengdu, China, October 20-24, 2021. [bibtex]
@inproceedings{zhang2021unicon,
  title={UniCon: Unified Context Network for Robust Active Speaker Detection},
  author={Zhang, Yuanhang and Liang, Susan and Yang, Shuang and Liu, Xiao and Wu, Zhongqin and Shan, Shiguang and Chen, Xilin},
  booktitle={{ACM} Multimedia},
  publisher={{ACM}},
  year={2021}
}
This work was partially supported by the National Key R&D Program of China (No. 2017YFA0700804) and the National Natural Science Foundation of China (No. 61876171).