We present a research preview of Self-Flow, a scalable approach for training multi-modal generative models. Multi-modal generation requires end-to-end learning across modalities (image, video, audio, and text) without being limited by external models for representation learning. Self-Flow addresses this with self-supervised flow matching that scales efficiently across modalities.

Results:
• Up to 2.8x faster convergence across modalities
• Improved temporal consistency in video
• Sharper text rendering and typography

This is foundational research on our path toward multi-modal visual intelligence.
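For context on the technique named above, here is a minimal sketch of a standard conditional flow-matching training step. This illustrates the general family of methods, not Self-Flow's own objective (which is not described here); the `model(xt, t)` velocity-network signature and batch shapes are illustrative assumptions.

```python
import torch

def flow_matching_loss(model, x1):
    """One conditional flow-matching step on a data batch x1 of shape [B, ...]."""
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over data dims
    xt = (1 - t_) * x0 + t_ * x1                   # linear interpolation path
    v_target = x1 - x0                             # constant target velocity
    v_pred = model(xt, t)                          # predicted velocity field (assumed signature)
    return torch.mean((v_pred - v_target) ** 2)    # velocity regression loss
```

The model learns a velocity field along noise-to-data paths; generation then integrates that field from noise, which is what lets one objective cover image, video, and audio tokens alike.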
Self-Flow improves temporal consistency in video generation. 4B-parameter multi-modal model trained on 6M videos.
Cleaner typography and text rendering. 4B-parameter multi-modal model trained on 200M images.
Joint video-audio generation from a single model (sound on). 4B-parameter multi-modal model trained on 2M audio-video pairs.
Self-Flow opens a path toward world models, combining visual scalability with semantic abstraction for planning and understanding. Here's action prediction from a 675M-parameter model.