YingSound: Video-Guided Sound Effects Generation
with Multi-modal Chain-of-Thought Controls

Zihao Chen1, Haomin Zhang1, Xinhan Di1, Haoyu Wang1,3, Sizhe Shan1,3, Junjie Zheng1, Yunming Liang1, Yihan Fan1,4, Xinfa Zhu1,2,
Wenjie Tian1,2, Yihua Wang1, Chaofan Ding1, and Lei Xie2

1AI Lab, Giant Network
2Northwestern Polytechnical University
3Zhejiang University
4East China University of Science and Technology

[arXiv] [Promotional Video]

Contents

Promotional Video
Abstract
Method
V2A Generation Results Visualization
V2A Generation Examples
Audio Generation for Games
Audio Generation for Animation
Audio Generation for Real-World Videos
Audio Generation for Long Videos
Audio Generation for AI-Generated Videos
Audio Generation Comparison with Prior Work
Text Control

Promotional Video

Abstract

Generating sound effects for videos requires producing industry-standard audio for diverse scenes in which labeled data is scarce, i.e., high-quality audio generation in few-shot settings. To address this problem, we introduce YingSound, a foundation model for video-guided sound generation that supports high-quality audio generation in few-shot settings. The model consists of two modules. The first is a conditional flow matching transformer that builds a fine-grained, learnable Audio-Vision Aggregator (AVA), which integrates high-resolution visual features with the corresponding audio features across multiple stages. The second is a multi-modal visual-audio chain-of-thought (CoT) framework that leverages advanced audio generation techniques to produce high-quality audio in few-shot settings. Finally, we present an industry-standard video-to-audio (V2A) dataset that encompasses a diverse array of real-world scenarios. Through both automated evaluations and human studies, we show that YingSound generates high-quality, well-synchronized sounds across diverse conditional inputs, surpassing existing methods.
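The page gives no pseudocode for the first module. Purely as an illustration of the conditional flow matching objective it names, one training-loss step can be sketched as follows; this is a minimal NumPy sketch of the standard linear-interpolant formulation, and all names (`cfm_loss`, `predict_velocity`) are hypothetical, not from the paper:

```python
import numpy as np

def cfm_loss(x0, x1, t, predict_velocity):
    """One conditional flow matching loss evaluation.

    x0: noise samples, x1: data (audio latents), t: per-example time in [0, 1].
    The model's predicted velocity at the interpolated point x_t is regressed
    onto the straight-line target velocity x1 - x0.
    """
    xt = (1.0 - t) * x0 + t * x1        # linear interpolant between noise and data
    target = x1 - x0                    # constant velocity of the straight path
    pred = predict_velocity(xt, t)      # model prediction (e.g., a transformer)
    return np.mean((pred - target) ** 2)  # MSE flow matching loss
```

In practice `predict_velocity` would be the conditional transformer, additionally conditioned on the aggregated visual features; here it is left as a plain callable.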

Method

Overview of YingSound

Figure 1: Overview of YingSound.

YingSound comprises two key components: Conditional Flow Matching with Transformers and Adaptive Multi-modal Chain-of-Thought-Based Audio Generation.
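The internal design of the Audio-Vision Aggregator is not detailed on this page. As an illustration only, a single-stage aggregator built from cross-attention (a common choice for this kind of audio-visual fusion) might look like the NumPy sketch below; the function names, weight layout, and residual-fusion choice are all assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def audio_vision_aggregate(audio, vision, Wq, Wk, Wv):
    """Cross-attention fusion: audio frames (queries) attend over
    visual features (keys/values), one stage of aggregation.

    audio: (Ta, d) audio features; vision: (Tv, d) visual features.
    """
    q = audio @ Wq                            # (Ta, d) queries from audio
    k = vision @ Wk                           # (Tv, d) keys from vision
    v = vision @ Wv                           # (Tv, d) values from vision
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    attn = softmax(scores, axis=-1)           # each audio frame weights Tv frames
    return audio + attn @ v                   # residual fusion of modalities
```

Stacking such a block at multiple stages of the transformer would match the page's description of integrating high-resolution visual features with audio features across stages.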

V2A Generation Results Visualization

Balloon
Yellow Dog
Lion
Gun

Figure 2: Temporal Alignment Comparison.

V2A Generation Examples

Audio Generation for Games

Audio Generation for Animation

Audio Generation for Real-World Videos

Audio Generation for Long Videos

Audio Generation for AI-Generated Videos

Audio Generation Comparison with Prior Work

Ours
GT
FoleyCrafter
Diff-Foley

Text Control

Without Prompt
Prompt: motorcycle engine
Prompt: car horn
Without Prompt
Prompt: bird song
Without Prompt
Prompt: thunder
Without Prompt
Prompt: subway driving