<p>This is a Plain English Papers summary of a research paper presenting SkyReels-V4, a multi-modal video-audio generation, inpainting, and editing model.</p> <h2>The problem with video generation today</h2> <p>For years, video generation and audio generation have been strangers in separate labs. Current video models have become genuinely impressive, capable of synthesizing photorealistic scenes with complex motion and rich detail. Yet they operate in a vacuum, treating audio as optional decoration or ignoring it entirely.</p> <p>This creates a concrete problem: temporal misalignment. When you generate a video of rain hitting a metal roof, the audio (if present at all) was created independently. A door slam in the video doesn't sync with a door slam in the audio. A character's dialogue doesn't match their lip movements. The result feels uncanny, like a dubbed film where something is always slightly off.</p> <p>The deeper issue is architectural. Most multimodal models treat text as the sole conductor, with everything else serving it. But in real film production, video and audio inform each other constantly. A tight shot of rain isn't just about pixels; it's about acoustics. A crowded market scene needs audio that tells you which conversations matter. The cinematographer and the sound engineer need to collaborate, not work sequentially.</p> <h2>Why sound needs to be born with vision, not added later</h2> <p>Imagine two musicians in a darkened room, unable to see each other but listening intently. One plays strings, one plays percussion. They share a conductor (the text prompt) and a reference recording (the scene description). They can't see each other, but they hear themselves making music, and they stay in time. That's the architectural insight of SkyReels-V4.</p> <p>Audio doesn't get generated after video here. Instead, both branches generate in parallel, conditioning each other.
The video branch learns that an audio reference contains a dog barking, so it synthesizes motion matching that bark's timing and energy. The audio branch sees that the video contains a dog, so it generates sounds consistent with that animal's presence. This is fundamentally different from approaches that bolt audio onto video as an afterthought.</p> <p>When two generative processes share the same input understanding, they can be orchestrated. They're not independent models handed off sequentially; they're two parts of one unified thought.</p> <h2>Architecture: dual streams with a shared mind</h2> <p>SkyReels-V4 uses a <strong>Dual-stream Multimodal Diffusion Transformer (MMDiT)</strong> in which one branch synthesizes video and the other generates audio, while both draw on a shared conceptual foundation. Here's how the pieces fit together.</p> <p>The video branch synthesizes frames in a learned latent space using diffusion, accepting rich visual conditioning: text descriptions, reference images, masks for inpainting, even full video clips. The audio branch generates sound spectrograms via the same diffusion process, conditioned on text and audio references. Both branches are grounded in a <strong>Multimodal Large Language Model (MMLM)</strong>-based text encoder that understands visual concepts as well as language. When you describe a "thunderstorm over a wheat field," this encoder captures both the visual richness and the sonic expectations embedded in that description.</p>
<p><em>Figure: Overview of the SkyReels-V4 architecture: dual-stream video and audio generation branches generating simultaneously, conditioned by a shared multimodal text encoder.</em></p>
<p>Information flows from the text prompt into the shared encoder, is decomposed into a multimodal understanding, and that understanding flows into both branches. The branches don't wait for each other, but they're orchestrated by the same conceptual input.</p>
<p>Diffusion models are ideal for this joint generation because both video and audio benefit from step-by-step refinement. At each diffusion step, the video branch can be gently nudged by the audio branch's current estimate, and vice versa. It's like two musicians refining their performance in real time, each listening and adjusting to the other.</p>
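<p>The cross-conditioned refinement loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy denoisers, the plain subtraction update, and every name here are assumptions standing in for the real MMDiT branches and noise schedule.</p>

```python
import numpy as np

def joint_denoise_step(z_v, z_a, t, text_emb, video_net, audio_net, alpha=0.1):
    # Each branch predicts noise conditioned on the shared text embedding
    # AND on the other branch's current latent: the cross-modal "nudge".
    eps_v = video_net(z_v, t, text_emb, z_a)   # video listens to audio
    eps_a = audio_net(z_a, t, text_emb, z_v)   # audio watches video
    # Simplified update; a real sampler follows a proper noise schedule.
    return z_v - alpha * eps_v, z_a - alpha * eps_a

# Toy stand-ins for the two transformer branches (hypothetical signatures).
video_net = lambda z, t, txt, cross: z + 0.01 * cross.mean()
audio_net = lambda z, t, txt, cross: z + 0.01 * cross.mean()

z_v = np.ones((4, 8))   # "video latent": 4 frames x 8 dims
z_a = np.ones((2, 8))   # "audio latent": 2 windows x 8 dims
for t in reversed(range(10)):
    z_v, z_a = joint_denoise_step(z_v, z_a, t, None, video_net, audio_net)
```

<p>The structural point survives the simplification: neither branch ever runs ahead of the other, so synchronization is enforced at every step rather than patched on afterward.</p>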
<h2>One interface for generation, editing, and inpainting</h2>
<p>Here's where architectural elegance becomes practical power. Most video models require separate code paths for "generate from scratch," "edit this video," and "extend this clip." SkyReels-V4 unifies all of these under a single mechanism: channel concatenation.</p>
<p>The trick is deceptively simple. Different input channels can be filled with different content, or left masked:</p> <ul> <li><strong>Text-to-video generation:</strong> all input channels are empty (masked), so the model generates everything from scratch.</li> <li><strong>Image-to-video:</strong> a starting image is embedded into certain channels, the others remain empty, and the model generates the video that follows.</li> <li><strong>Video extension:</strong> existing video frames fill some channels, the others are masked, and the model generates what comes next.</li> <li><strong>Inpainting:</strong> a video with masked regions is provided, those regions' channels are empty, and the model fills the gaps coherently.</li> <li><strong>Vision-referenced editing:</strong> both a video to edit and a reference image showing the desired style are embedded as conditioning, and the model edits accordingly.</li> </ul>
<p>Traditional approaches require different models or training procedures for each task. SkyReels-V4 learns one unified diffusion process. During training, it sees random combinations of filled and empty channels and learns to inpaint intelligently. This unified treatment extends naturally to complex scenarios where multiple references guide the generation, something crucial for cinema-level production.</p>
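<p>The channel-concatenation idea can be illustrated with a toy helper. This is a sketch under assumed shapes and names, not the model's actual input pipeline; the real latents, channel counts, and mask layout will differ.</p>

```python
import numpy as np

def build_conditioning(latent_shape, frames=None, mask=None):
    """Pack optional known content plus a mask into extra input channels."""
    T, C, H, W = latent_shape
    # Unknown content defaults to zeros; the mask records what is actually given.
    content = np.zeros((T, C, H, W)) if frames is None else frames
    if mask is None:
        mask = np.zeros((T, 1, H, W))      # nothing given: pure text-to-video
    # Channel concatenation: [content | mask] rides alongside the noisy latent
    # on the channel axis, so one network sees every task the same way.
    return np.concatenate([content, mask], axis=1)

shape = (8, 4, 16, 16)                     # (frames, channels, height, width)

# Text-to-video: every channel masked, generate everything from scratch.
t2v = build_conditioning(shape)

# Image-to-video: only the first frame is given.
frames = np.zeros(shape); frames[0] = 1.0
mask = np.zeros((8, 1, 16, 16)); mask[0] = 1.0
i2v = build_conditioning(shape, frames, mask)
```

<p>Inpainting, extension, and vision-referenced editing then become nothing more than different fill patterns for <code>frames</code> and <code>mask</code>, which is why a single training procedure can cover all of them.</p>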
<h2>Making cinema resolution computationally feasible</h2>
<p>Generating 1080p video at 32 frames per second for 15 seconds is computationally expensive. You can't simply make the diffusion process bigger and hope for feasible inference times. Instead, SkyReels-V4 uses a three-stage strategy that maintains quality where it matters most while reducing computational cost elsewhere.</p>
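<p>A quick back-of-envelope count (simple arithmetic, not a figure from the paper) shows the scale of the problem before any latent compression:</p>

```python
# Raw pixel budget for one 15-second, 1080p, 32 fps clip.
fps, seconds = 32, 15
width, height = 1920, 1080
frames = fps * seconds                 # 480 frames
pixels_per_frame = width * height      # 2,073,600 pixels
total_pixels = frames * pixels_per_frame
print(total_pixels)                    # 995,328,000: roughly a billion pixels
```

<p>Nearly a billion raw pixels per clip is why naively scaling up the diffusion process is infeasible and a staged strategy is needed.</p>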
...