Alibaba shipped four Qwen 3.5 small models with a trick borrowed from their 397B model: Gated DeltaNet hybrid attention.
Three layers of linear attention for every one layer of full attention.
The linear layers do the routine work with constant memory use, while the interleaved full-attention layers step in where precision matters.
This 3:1 ratio keeps memory flat while quality stays high, which is why even the 0.8B model supports a 262,000-token context window.
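Here's a minimal PyTorch sketch of that interleaving pattern. The two block classes are placeholders standing in for Gated DeltaNet and standard softmax attention, not Qwen's actual layers; the point is only the 3:1 layout.

```python
# Sketch of a 3:1 hybrid stack. LinearAttentionBlock and FullAttentionBlock are
# stand-ins, not Qwen's Gated DeltaNet or attention kernels.
import torch
import torch.nn as nn

class LinearAttentionBlock(nn.Module):
    """Stand-in for a Gated-DeltaNet-style layer: fixed-size state, memory stays flat."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Real linear attention maintains a constant-size recurrent state;
        # this placeholder just applies a per-token projection.
        return x + self.proj(x)

class FullAttentionBlock(nn.Module):
    """Stand-in for standard softmax self-attention, whose KV cache grows with context."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x, need_weights=False)
        return x + out

def build_hybrid_stack(dim: int, n_layers: int, ratio: int = 3) -> nn.Sequential:
    """Every (ratio+1)-th layer is full attention; the rest are linear attention."""
    layers = []
    for i in range(n_layers):
        if (i + 1) % (ratio + 1) == 0:
            layers.append(FullAttentionBlock(dim))
        else:
            layers.append(LinearAttentionBlock(dim))
    return nn.Sequential(*layers)

model = build_hybrid_stack(dim=512, n_layers=24)  # 18 linear + 6 full-attention layers
x = torch.randn(1, 1024, 512)                     # (batch, tokens, dim)
print(model(x).shape)                             # torch.Size([1, 1024, 512])
```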
Every model handles text, images, and video natively.
No adapter bolted on afterward. The vision encoder uses 3D convolutions to capture motion in video, then merges features from multiple layers instead of just the final one.
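A rough sketch of those two ideas, with made-up module names and sizes rather than Qwen's actual vision tower: a 3D convolution patchifies across time as well as space, and features tapped from several encoder depths are fused instead of taking only the final layer.

```python
# Toy video encoder illustrating (1) 3D-conv patchification over frames and
# (2) merging hidden states from multiple depths. Sizes are arbitrary.
import torch
import torch.nn as nn

class TinyVideoEncoder(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 8, tap_layers=(3, 5, 7)):
        super().__init__()
        # Kernel spans 2 frames x 14 x 14 pixels, so each token sees motion across frames.
        self.patchify = nn.Conv3d(3, dim, kernel_size=(2, 14, 14), stride=(2, 14, 14))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth)
        )
        self.tap_layers = set(tap_layers)
        self.merge = nn.Linear(dim * len(tap_layers), dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, channels, frames, height, width)
        x = self.patchify(video)                    # (B, dim, T', H', W')
        x = x.flatten(2).transpose(1, 2)            # (B, tokens, dim)
        taps = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i in self.tap_layers:
                taps.append(x)                      # keep intermediate features
        return self.merge(torch.cat(taps, dim=-1))  # fuse multi-depth features

encoder = TinyVideoEncoder()
clip = torch.randn(1, 3, 8, 224, 224)               # 8 RGB frames at 224x224
print(encoder(clip).shape)                           # torch.Size([1, 1024, 256])
```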
The 9B beats GPT-5-Nano by 13 points on multimodal understanding, 17 points on visual math, and 30 points on document parsing. The 0.8B runs on a phone and processes video. The 4B fits in 8GB of VRAM and acts as a multimodal agent. All four are Apache 2.0.
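For a sense of where that 8GB figure comes from, a weights-only back-of-the-envelope estimate (assuming 16-bit weights, which isn't specified here; KV cache and activations come on top, so tight fits usually rely on quantization):

```python
# Rough weights-only memory estimate for a 4B-parameter model at fp16.
params = 4e9
bytes_per_param_fp16 = 2
print(f"{params * bytes_per_param_fp16 / 1e9:.0f} GB")  # -> 8 GB before KV cache
```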
If this architecture holds, the small model space just became a capability race instead of a size race.
A year ago, running a multimodal model locally meant a 13B+ model and a serious GPU.
Now a 4B model with 262K context handles text, images, and video from consumer hardware.
The gap between edge models and flagship models is closing faster than the gap between flagships and humans.
