Chinese start‑up Mianbi Intelligence this month unveiled a new strategy to push large models off the cloud and onto end devices. The company released MiniCPM‑o 4.5, a roughly 9‑billion‑parameter multimodal model that ingests video, audio and text streams while producing continuous text and speech output, and paired it with a developer‑focused hardware board called Pinea Pi. Founder Li Dahai framed the move as a bet on a long runway for startups despite fierce competition among cloud incumbents: the future, he argues, will be a hybrid of cloud and capable on‑device models.
MiniCPM‑o 4.5 is billed as a “full‑duplex, full‑modal” model: it can continue to receive multimodal inputs while generating responses and maintain awareness of environmental events without interrupting output. Mianbi demonstrated the capability with an assistive navigation scenario—continuously listening and alerting a blind user to bus arrivals or changes in traffic lights—claiming the model itself judges timing and relevance rather than relying on piecemeal engineering tricks such as separate voice‑activity detectors. The company’s chief multimodal scientist, Yao Yuan, describes this architecture as closer to an AI‑native solution for sustained interaction and perception.
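To make the "full duplex" idea concrete, the sketch below shows, in deliberately simplified Python, how such a loop might be organised: input chunks keep streaming into the model on one path while the model itself decides, step by step, whether the moment warrants speaking. Every name here (StreamingModel, full_duplex_loop, the SILENCE sentinel) is a hypothetical illustration, not Mianbi's API or MiniCPM-o's internals.

```python
# Deliberately simplified sketch of a full-duplex loop: inputs keep streaming
# into the model while the model itself decides, step by step, whether to
# speak. Every name here (StreamingModel, full_duplex_loop, SILENCE) is a
# hypothetical illustration, not Mianbi's API or MiniCPM-o's internals.
import queue
import threading
import time

SILENCE = None  # stands in for the model choosing to stay quiet this step


class StreamingModel:
    """Toy stand-in for a full-duplex multimodal model."""

    def __init__(self):
        self._context = []  # rolling multimodal context

    def ingest(self, chunk):
        # A real model would update internal state / KV caches here.
        self._context.append(chunk)

    def step(self):
        # The model judges timing and relevance itself, rather than being
        # gated by an external voice-activity detector.
        if self._context and "bus approaching" in str(self._context[-1]):
            self._context.append("(alert already delivered)")
            return "Your bus is arriving at the stop ahead."
        return SILENCE


def full_duplex_loop(model, inputs, stop_event):
    """Keep ingesting inputs and emitting speech without blocking either side."""
    while not stop_event.is_set():
        try:
            model.ingest(inputs.get(timeout=0.05))  # input flows even mid-response
        except queue.Empty:
            pass
        utterance = model.step()
        if utterance is not SILENCE:
            print(f"[assistant] {utterance}")


if __name__ == "__main__":
    inputs, stop = queue.Queue(), threading.Event()
    worker = threading.Thread(
        target=full_duplex_loop, args=(StreamingModel(), inputs, stop)
    )
    worker.start()
    for event in ["street noise", "traffic light: red", "bus approaching"]:
        inputs.put(event)
        time.sleep(0.1)
    time.sleep(0.2)
    stop.set()
    worker.join()
```

The point of the sketch is the decoupling: ingestion never waits for generation, and no separate voice-activity detector gates the output, which is the contrast Mianbi draws with "piecemeal engineering tricks".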
Mianbi’s rationale for a roughly 9‑billion‑parameter target is pragmatic: the model is small enough to be a realistic candidate for robots, in‑car systems and consumer PCs, yet expressive enough to support continuous multimodal tasks. The Pinea Pi development board, built on NVIDIA’s Jetson modules and equipped with a camera, microphones and multiple interfaces, is positioned as an educational and prototyping kit for offline multimodal assistants, embodied‑intelligence experiments and programming education. Running models locally, Mianbi argues, avoids ongoing token costs, improves latency and stability for complex interactive tasks, and reduces certain privacy risks compared with an exclusively cloud‑based approach.
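A rough sizing calculation shows why a ~9‑billion‑parameter model is a plausible on‑device target: weight memory is roughly parameter count times bytes per weight, so quantisation largely determines whether the model fits in the RAM of a Jetson‑class board or a consumer PC. The figures below are an illustrative back‑of‑envelope estimate, not Mianbi's published deployment numbers, and they ignore activations, KV caches and the encoder stacks.

```python
# Back-of-envelope weight-memory estimate for a roughly 9-billion-parameter
# model at common precisions. Illustrative only: real deployments also need
# memory for activations, KV caches and the vision/audio encoders.
PARAMS = 9e9
BYTES_PER_WEIGHT = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in BYTES_PER_WEIGHT.items():
    gib = PARAMS * nbytes / 2**30
    print(f"{precision:>9}: ~{gib:.1f} GiB of weights")

# Approximate output:
#  fp16/bf16: ~16.8 GiB
#       int8: ~8.4 GiB
#       int4: ~4.2 GiB
```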
But technical hurdles remain. Mianbi and other researchers note a persistent split at the architectural level: visual understanding typically leans on continuous representations, while high‑quality generative output often uses diffusion‑based methods—two modelling paradigms that are not straightforward to unify. Attempts to discretize continuous modalities for unified autoregressive modelling introduce information loss that can harm OCR and fine‑grained visual tasks, and current unified architectures still struggle to outpace modality‑specific specialists when compute and data budgets are matched. On the data side, Yao argues that multimodal training data is still far from exhaustion—video and audio volumes are expanding rapidly online—but converting that raw multimedia into broadly generalisable learning signals is the core challenge.
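The discretisation trade‑off can be illustrated with a generic vector‑quantisation step, one common way continuous features are turned into discrete tokens for autoregressive modelling: each feature is snapped to its nearest codebook entry, and the residual between the two is precisely the information the downstream model never sees, which is where fine‑grained signals such as small glyph details in OCR suffer. The code below is a toy illustration of that general technique, not a description of MiniCPM‑o's architecture.

```python
# Toy vector quantisation: continuous features are snapped to their nearest
# entry in a discrete codebook so they can be fed to an autoregressive model
# as token ids. The residual left behind is the information loss the article
# refers to; it shrinks as the codebook grows but never disappears.
# Generic illustration only, not MiniCPM-o's actual architecture.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
features = rng.normal(size=(1500, dim))   # stand-in for continuous visual features


def quantise(feats, codebook):
    """Map each feature to its nearest codebook entry (its discrete token)."""
    dists = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=-1)
    token_ids = dists.argmin(axis=1)
    return token_ids, codebook[token_ids]


for codebook_size in (16, 64, 256):
    # Codebook drawn from the data itself; real systems learn it (VQ-VAE style),
    # but the qualitative trade-off is the same.
    codebook = features[rng.choice(len(features), codebook_size, replace=False)]
    _, reconstructed = quantise(features, codebook)
    err = np.linalg.norm(features - reconstructed, axis=-1).mean()
    print(f"codebook size {codebook_size:>3}: mean quantisation error {err:.2f}")
```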
Mianbi’s product and roadmap are deliberately measured: Pinea Pi is described as a first‑stage developer kit rather than a finished consumer product, and pricing will be set mainly according to hardware costs. The firm expects on‑device models and large cloud models to complement each other over the long term, and it is already seeking partnerships with chipmakers to ease hardware bottlenecks. For global markets and competitors the development is notable: it underscores a parallel track in which smaller, nimble teams pursue hardware‑software co‑design to deliver low‑latency, privacy‑sensitive multimodal experiences even as hyperscale cloud models remain dominant for many tasks.
