On 12 February, Chinese AI developer Mianbi Intelligence published a new attention architecture called SALA — a sparse‑linear attention hybrid — and a 9‑billion‑parameter text model, MiniCPM‑SALA, built on that design. The company says the model does not rely on speculative sampling or other throughput tricks and achieves a 3.5× inference speed advantage over Qwen3‑8B at a 256k‑token context on cloud inference chips. Mianbi also claims MiniCPM‑SALA can run contexts as long as one million tokens both on cloud accelerators and on consumer‑grade GPUs, an unusually large context window for a model of this size.
Architecturally, SALA mixes sparse attention, which restricts each token's full softmax attention to a selected subset of positions, with linear attention techniques whose cost grows more gently with sequence length. The combination aims to avoid the quadratic memory and compute growth that typically restricts transformer models to short contexts. By blending the two approaches, Mianbi seeks to retain enough expressive power for long‑range dependencies while keeping the model's parameter count and its per‑token compute and memory cost relatively low.
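Mianbi has not published SALA's exact formulation alongside the announcement, so the mechanics can only be sketched in generic terms. The PyTorch snippet below is a minimal illustration of the general idea of pairing a sparse attention path with a linear attention path; the sliding‑window pattern, the elu‑based feature map, and the simple averaged combination are assumptions chosen for clarity, not details confirmed to match SALA.

```python
# Purely illustrative sketch of a sparse + linear attention hybrid.
# SALA's real formulation is not described in the announcement, so the window
# size, the elu(x)+1 feature map, and the 50/50 output blend below are all
# assumptions made for demonstration only.

import torch
import torch.nn.functional as F


def sliding_window_attention(q, k, v, window: int):
    """Exact softmax attention in which each query only sees the `window`
    most recent keys (a common sparse-attention pattern). For clarity this
    materialises the full n x n score matrix; efficient kernels compute only
    the banded region, giving O(n * window) work."""
    n = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    idx = torch.arange(n, device=q.device)
    causal = idx[None, :] <= idx[:, None]          # no attending to future keys
    local = idx[:, None] - idx[None, :] < window   # only the recent window
    scores = scores.masked_fill(~(causal & local), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


def linear_attention(q, k, v):
    """Causal linear attention with an elu(x)+1 feature map: the whole prefix
    is summarised by running (d x d) sums, so cost grows linearly with
    sequence length instead of quadratically."""
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv_state = torch.einsum("...nd,...ne->...nde", phi_k, v).cumsum(dim=-3)
    k_state = phi_k.cumsum(dim=-2)
    num = torch.einsum("...nd,...nde->...ne", phi_q, kv_state)
    den = torch.einsum("...nd,...nd->...n", phi_q, k_state).clamp(min=1e-6)
    return num / den.unsqueeze(-1)


def hybrid_attention(q, k, v, window: int = 256):
    """Toy hybrid: exact attention over a local window plus a linear-attention
    summary of the full prefix, averaged 50/50. Production hybrids typically
    use learned gating or interleave the two mechanisms across layers/heads."""
    return 0.5 * sliding_window_attention(q, k, v, window) + 0.5 * linear_attention(q, k, v)


if __name__ == "__main__":
    q = torch.randn(1, 8, 1024, 64)   # (batch, heads, seq_len, head_dim)
    k = torch.randn(1, 8, 1024, 64)
    v = torch.randn(1, 8, 1024, 64)
    print(hybrid_attention(q, k, v).shape)   # torch.Size([1, 8, 1024, 64])
```

The point of the pairing is the cost profile: the windowed term only ever needs attention scores within a fixed‑width band, while the linear term compresses the entire prefix into a fixed‑size running state, so neither component's per‑token cost grows with the full context length the way dense attention's does.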
If the performance claims hold up under independent benchmarking, the practical implications are straightforward: organisations could process extremely long documents, codebases, or multi‑hour transcripts without resorting to expensive, very large models or heavy prompt engineering. For businesses and research teams, a 9B model that handles 100k–1M token contexts cheaply could be more attractive than much larger models that remain limited to tens of thousands of tokens or that require shard‑heavy cloud infrastructure.
The announcement also sheds light on an emerging strategy in China’s AI ecosystem: software and algorithmic innovation to stretch the capabilities of mid‑sized models, combined with support for domestically available inference hardware. This approach reduces dependence on ever‑larger parameter counts and on foreign model architectures, while making advanced long‑context features achievable for cloud providers and edge deployments alike.
Caveats remain. Speed and maximum context length are not the same as model quality: latency improvements do not automatically translate into better accuracy, coherence, or safety, or into fewer hallucinations, when the context runs to hundreds of thousands of tokens. Independent audits and standardized benchmarks will be needed to assess MiniCPM‑SALA's performance on downstream tasks and to establish whether SALA degrades attention fidelity on particular problem types.
Still, the release is notable in a crowded field where rival Chinese models and international players are racing to solve the long‑context problem. Whether SALA becomes a widely adopted building block will depend on transparent benchmarks, open implementations, and how the architecture juggles throughput, memory footprint, and answer quality at extreme context lengths.
