As Cloud Giants Duel, China’s Mianbi Pushes a 9‑Billion‑Parameter Multimodal ‘Brain’ for Edge Devices

Mianbi Intelligence has launched MiniCPM‑o 4.5, a 9‑billion‑parameter multimodal model that can ingest continuous audio, video and text while producing simultaneous outputs, and introduced a Jetson‑based developer board called Pinea Pi. The company positions the stack as an early example of on‑device, embodied AI aimed at robotics, automotive and personal devices, arguing that hybrid cloud‑edge deployments will better meet latency, privacy and stability needs than cloud‑only approaches.


Key Takeaways

  • Mianbi released MiniCPM‑o 4.5, a ~9B‑parameter full‑duplex multimodal model, and the Pinea Pi developer board aimed at on‑device AI.
  • The model supports simultaneous ingestion of video, audio and text streams while producing text and speech output, enabling continuous perception in real time.
  • Pinea Pi is a Jetson‑based dev kit for education and prototyping; it supports offline model execution to lower token costs and improve privacy and latency.
  • Mianbi warns of deep architectural challenges in unifying continuous visual/audio understanding with generative pipelines, and says multimodal data is plentiful but hard to convert into generalisable capabilities.
  • The company expects a hybrid cloud‑edge future and is pursuing hardware partnerships to overcome endpoint constraints.

Editor's Desk

Strategic Analysis

Mianbi’s launch illustrates a consequential split in the AI landscape: one track pursues ever‑larger cloud models, while another focuses on tightly optimised models designed to run at the edge. For governments, device makers and enterprises, the latter is attractive because it can deliver lower latency, reduce recurring inference costs and keep sensitive data local. But winning that space requires more than a compact model: success depends on robust tooling, efficient multimodal pretraining pipelines, and alliances with chip and device manufacturers to achieve acceptable power, thermal and cost profiles. If startups such as Mianbi can demonstrate compelling real‑world use cases in robotics, automotive or consumer devices, they will carve out durable niches that complement, rather than simply compete with, cloud supermodels.


Chinese start‑up Mianbi Intelligence this month unveiled a new strategy to push large models off the cloud and onto end devices. The company released MiniCPM‑o 4.5, a roughly 9‑billion‑parameter multimodal model that ingests video, audio and text streams while producing continuous text and speech output, and paired it with a developer‑focused hardware board called Pinea Pi. Founder Li Dahai framed the move as a bet on a long runway for startups despite fierce competition among cloud incumbents: the future, he argues, will be a hybrid of cloud and capable on‑device models.

MiniCPM‑o 4.5 is billed as a “full‑duplex, full‑modal” model: it can continue to receive multimodal inputs while generating responses and maintain awareness of environmental events without interrupting output. Mianbi demonstrated the capability with an assistive navigation scenario—continuously listening and alerting a blind user to bus arrivals or changes in traffic lights—claiming the model itself judges timing and relevance rather than relying on piecemeal engineering tricks such as separate voice‑activity detectors. The company’s chief multimodal scientist, Yao Yuan, describes this architecture as closer to an AI‑native solution for sustained interaction and perception.
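To make the "full‑duplex" idea concrete, the sketch below simulates the control flow such a system implies: input frames keep arriving while a separate output loop decides when to speak. The stub model, frame format and timings are illustrative assumptions for this sketch, not Mianbi's implementation.

```python
# Minimal sketch of a full-duplex interaction loop (illustrative only).
# A real system would stream camera/microphone frames into the model;
# here a stub "model" decides when an utterance is warranted.
import asyncio
import random
from collections import deque

context = deque(maxlen=256)           # rolling multimodal context window

def stub_model_decide(ctx):
    """Stand-in for the model's own 'should I speak now?' judgement."""
    if ctx and ctx[-1].startswith("event"):
        return f"Heads up: {ctx[-1]}"
    return None                       # stay silent, keep listening

async def ingest_stream():
    """Continuously receive (simulated) audio/video frames."""
    for i in range(20):
        frame = f"event: bus arriving ({i})" if random.random() < 0.2 else f"frame {i}"
        context.append(frame)
        await asyncio.sleep(0.05)     # frames keep arriving even while speaking

async def respond_stream():
    """Generate output without pausing ingestion (full duplex)."""
    for _ in range(20):
        utterance = stub_model_decide(context)
        if utterance:
            print(utterance)
        await asyncio.sleep(0.05)

async def main():
    await asyncio.gather(ingest_stream(), respond_stream())

asyncio.run(main())
```

The point of the structure is that perception and generation run as concurrent tasks over a shared context, rather than as a turn-taking pipeline gated by a separate voice-activity detector.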

Mianbi’s explicit rationale for a 9B‑parameter target is pragmatic: the model is small enough to be considered for robots, in‑car systems and consumer PCs yet expressive enough to support continuous multimodal tasks. The Pinea Pi development board, built on NVIDIA’s Jetson modules and equipped with a camera, microphones and multiple interfaces, is positioned as an educational and prototyping kit for offline multimodal assistants, embodied intelligence prototypes and programming pedagogy. Running models locally, Mianbi argues, avoids ongoing token costs, improves latency and stability for complex interactive tasks, and reduces certain privacy risks compared with an exclusive cloud approach.
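The token-cost argument can be made concrete with a back-of-envelope calculation. All figures below (cloud price per million tokens, daily token volume, board cost) are hypothetical placeholders rather than Mianbi or NVIDIA pricing; the point is simply that a recurring per-token fee is traded for a one-time hardware outlay.

```python
# Back-of-envelope comparison of recurring cloud inference fees versus a
# one-time edge device purchase. All numbers are hypothetical placeholders.
CLOUD_PRICE_PER_M_TOKENS = 1.00      # USD per million tokens (hypothetical)
TOKENS_PER_DAY = 2_000_000           # a continuously listening assistant (hypothetical)
EDGE_BOARD_COST = 500.00             # one-time hardware cost (hypothetical)

daily_cloud_cost = TOKENS_PER_DAY / 1_000_000 * CLOUD_PRICE_PER_M_TOKENS
break_even_days = EDGE_BOARD_COST / daily_cloud_cost

print(f"Daily cloud spend: ${daily_cloud_cost:.2f}")
print(f"Break-even after:  {break_even_days:.0f} days of continuous use")
```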

But technical hurdles remain. Mianbi and other researchers note a persistent split at the architectural level: visual understanding typically leans on continuous representations, while high‑quality generative output often uses diffusion‑based methods—two modelling paradigms that are not straightforward to unify. Attempts to discretize continuous modalities for unified autoregressive modelling introduce information loss that can harm OCR and fine‑grained visual tasks, and current unified architectures still struggle to outpace modality‑specific specialists when compute and data budgets are matched. On the data side, Yao argues that multimodal training data is still far from exhaustion—video and audio volumes are expanding rapidly online—but converting that raw multimedia into broadly generalisable learning signals is the core challenge.
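A toy example illustrates the discretisation trade-off described here: snapping continuous features to a small codebook, VQ-style, discards detail that a downstream autoregressive model can never recover. The synthetic features and codebook below are illustrative assumptions, not MiniCPM's actual tokeniser.

```python
# Toy illustration of information loss from discretising continuous features.
# Each continuous 2-D feature is snapped to its nearest codebook entry
# (vector quantisation); the reconstruction error is detail the discrete
# token sequence no longer carries.
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 2))   # continuous visual features (synthetic)
codebook = rng.normal(size=(16, 2))     # small codebook of discrete tokens

# Assign each feature to its nearest code, then "reconstruct" from the codes
dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
tokens = dists.argmin(axis=1)
reconstruction = codebook[tokens]

mse = np.mean((features - reconstruction) ** 2)
print(f"Mean squared quantisation error: {mse:.3f}")  # detail lost to discretisation
```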

Mianbi’s product and roadmap are deliberately measured: Pinea Pi is described as a first‑stage developer kit rather than a finished consumer product, and pricing will be set mainly according to hardware costs. The firm sees the long term as a cooperative terrain where on‑device models and cloud supermodels complement each other, and it is already seeking partnerships with chipmakers to ease hardware bottlenecks. For global markets and competitors the development is notable: it underscores a parallel track in which smaller, nimble teams pursue hardware‑software co‑design to deliver low‑latency, privacy‑sensitive multimodal experiences even as hyperscale cloud models remain dominant for many tasks.
