Inside the Magic Box: how a screen-free AI toy actually works
A plain-language teardown of our own shipping module (mic array to wake word to voice model to speaker) and where the privacy boundaries actually sit.
Every partner call reaches the same moment: someone holds up our module and asks what's actually inside. Fair. If you're going to put this in your product, you should know how it works, not at the marketing level but at the block-diagram level. This is that walkthrough, for our own shipping hardware. No magic. A lot of box.
The hardware: what fits in a palm
The Kheelona Magic Box is a custom PCB around an ESP32-class microcontroller. On the board: a microphone array (more than one mic, so the toy can hear a child across a noisy room, not just a child shouting at it), an amplifier and speaker driver, USB-C charging with battery management, an antenna tuned to survive being inside a plush body, and thermal design for the same reason: fabric is a blanket, and electronics don't love blankets.
Form factor matters more than spec-sheet bravado here. The module has to drop into a plush seam, a crib rail or a pet's belly without redesigning the shell around it. That single constraint (one module, any body) drives almost every engineering decision, and it's why the same board family runs a $10 talking plush and a $100 crib brain. Tiers, not products.
The round-trip: from "Lumi?" to an answer
Step one happens before anything you'd call AI: the wake word. The toy is not recording the room. The mic array runs a tiny always-on detector that listens for one thing, its own name, and discards everything else. On our mid and top tiers this runs on the device itself; nothing leaves the toy until the toy is actually spoken to.
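To make the gating idea concrete, here is a minimal sketch in Python (not our firmware). It assumes audio arrives as a stream of frames and that some detector can say whether a frame contains the wake word. The names `gate_on_wake_word` and `is_wake_word` are hypothetical, and a real detector would score a rolling window of audio rather than a single frame.

```python
from typing import Callable, Iterable, Iterator

Frame = bytes  # stand-in for a short chunk of microphone audio


def gate_on_wake_word(
    frames: Iterable[Frame],
    is_wake_word: Callable[[Frame], bool],  # hypothetical detector
) -> Iterator[Frame]:
    """Yield only the frames that arrive after the wake word is heard.

    Frames seen before the wake word are dropped on the spot: they are
    never buffered, stored or forwarded.
    """
    awake = False
    for frame in frames:
        if awake:
            yield frame          # the toy has been spoken to: pass audio on
        elif is_wake_word(frame):
            awake = True         # the wake frame itself is not forwarded
        # otherwise: fall through and let the frame be discarded


if __name__ == "__main__":
    stream = [b"room", b"noise", b"lumi", b"tell", b"story"]
    heard = list(gate_on_wake_word(stream, lambda f: f == b"lumi"))
    print(heard)
```

The sketch leaves out the other half of the gate, deciding when the child has stopped talking and the toy should go back to sleep.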
Then the voice trip. On the entry tier, audio goes to the cloud, where speech becomes meaning, meaning becomes a reply, and the reply becomes a voice. On higher tiers, parts of that pipeline run at the edge, on the module, which buys two things: lower latency and less data leaving the device. Our model layer is a voice-to-voice small language model (SLM) under a billion parameters: small enough to be fast and cheap on toy-grade hardware, constrained enough to be auditable. There is no text transcript bouncing through three vendors' APIs. Voice in, voice out.
The latency budget is brutal and non-negotiable. Past roughly a second of silence, a small child decides the toy is broken and starts shaking it. Every engineering choice (model size, edge inference, where the wake word runs) answers to that second.
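The budget is easiest to see as arithmetic. The sketch below adds up per-stage timings for a cloud route and an edge route and checks each against the one-second ceiling. Every number in it is a made-up placeholder for illustration, not a measurement from our hardware, and the stage names are assumptions too.

```python
BUDGET_MS = 1000  # "roughly a second" of tolerable silence

# Hypothetical per-stage timings in milliseconds (illustrative only).
CLOUD_ROUTE = {
    "capture": 150,         # hear the end of the child's sentence
    "uplink": 200,          # audio travels to the cloud
    "model": 350,           # voice in, voice out
    "downlink": 200,        # reply audio travels back
    "playback_start": 50,   # speaker starts talking
}
EDGE_ROUTE = {
    "capture": 150,
    "model": 500,           # slower chip, but no network hops
    "playback_start": 50,
}


def total_ms(route: dict[str, int]) -> int:
    """Sum the stage timings for one route."""
    return sum(route.values())


def headroom_ms(route: dict[str, int], budget: int = BUDGET_MS) -> int:
    """Milliseconds left before the budget is blown (negative = too slow)."""
    return budget - total_ms(route)


if __name__ == "__main__":
    for name, route in (("cloud", CLOUD_ROUTE), ("edge", EDGE_ROUTE)):
        print(name, total_ms(route), "ms used,", headroom_ms(route), "ms spare")
```

With these placeholder figures the cloud route squeaks in with 50 ms to spare while the edge route keeps 300 ms in hand. That is the shape of the trade-off, even if the real numbers differ.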
The two filters nobody sees
Notice the diagram has a safety filter on both sides of the model. Input filtering decides what the model should even be asked; output filtering checks every reply against age-graded policies before the speaker gets it. The model never has open internet access, so "the toy googled something horrifying" is not a failure mode that exists. We red-team the whole pipe before any release ships (a build doesn't go out until the red team gives up), and the amended COPPA Rule, whose compliance deadline falls in 2026 and which now treats a child's voiceprint as personal information, made this architecture the legal floor rather than a nice-to-have.
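The filter-model-filter sandwich can be sketched in a few lines. This is an illustration of the pattern, not our policy engine: it uses text strings as a stand-in for audio (the real pipeline is voice-to-voice), and the word-list policy, the age bands and the names `BLOCKED`, `allowed` and `guarded_reply` are all hypothetical.

```python
from typing import Callable

FALLBACK = "Let's talk about something else."

# Hypothetical age-graded policy: words blocked for each age band.
BLOCKED = {
    "3-5": {"knife", "ghost"},
    "6-8": {"knife"},
}


def allowed(text: str, age_band: str) -> bool:
    """True if none of the band's blocked words appear in the text."""
    words = set(text.lower().split())
    return not (words & BLOCKED[age_band])


def guarded_reply(utterance: str, age_band: str,
                  model: Callable[[str], str]) -> str:
    """Run the model only between an input check and an output check."""
    if not allowed(utterance, age_band):
        return FALLBACK            # input filter: the model is never asked
    reply = model(utterance)
    if not allowed(reply, age_band):
        return FALLBACK            # output filter: the speaker never gets it
    return reply


if __name__ == "__main__":
    def toy_model(u: str) -> str:
        return "a ghost story" if "story" in u else "hello"

    print(guarded_reply("tell a story", "3-5", toy_model))  # blocked on output
    print(guarded_reply("tell a story", "6-8", toy_model))  # passes both checks
```

The same reply can pass for one age band and fail for another, which is the point of grading the output policy by age rather than using one global list.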
The parts you don't hear
- The parent app. Consent lives here: onboarding, topic controls, time limits, conversation review and one-tap delete. Under the updated rules, training data needs its own separate opt-in; that flow ships in the app, not in fine print.
- Region pinning. A child's data stays in the child's region: US in the US, EU in the EU, India in India. It's a deployment setting, not a promise in a PDF. More on the safety page.
- OTA updates. The toy on a shelf this Diwali gets smarter by next Diwali: new stories, new languages, security patches. The same channel carries the kill-switch: if a fleet ever misbehaves, it can be remotely disabled. We'd rather brick a toy than headline a news cycle.
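
Taken together, the three items above behave like a small per-device policy record. The sketch below is one way to picture it, under loud assumptions: the field names, the `DevicePolicy` class and the `example.invalid` hostnames are all invented for illustration and are not the PlayOS schema.

```python
from dataclasses import dataclass

# Hypothetical regional endpoints; the real hostnames are not shown here.
REGION_ENDPOINTS = {
    "US": "us.example.invalid",
    "EU": "eu.example.invalid",
    "IN": "in.example.invalid",
}


@dataclass(frozen=True)
class DevicePolicy:
    region: str                    # pinned when the device is deployed
    conversation_consent: bool     # parent completed onboarding in the app
    training_opt_in: bool = False  # separate opt-in, off by default
    disabled: bool = False         # kill-switch flag delivered over OTA


def endpoint_for(policy: DevicePolicy) -> str:
    """Return the pinned region's endpoint; never fall back to another one."""
    return REGION_ENDPOINTS[policy.region]  # unknown region raises KeyError


def may_converse(policy: DevicePolicy) -> bool:
    """The toy talks only with consent and only if it has not been disabled."""
    return policy.conversation_consent and not policy.disabled


def may_use_for_training(policy: DevicePolicy) -> bool:
    """Training use needs its own opt-in on top of conversation consent."""
    return policy.conversation_consent and policy.training_opt_in


if __name__ == "__main__":
    p = DevicePolicy(region="EU", conversation_consent=True)
    print(endpoint_for(p), may_converse(p), may_use_for_training(p))
```

Raising an error on an unknown region, rather than defaulting to some other endpoint, is what makes pinning a setting instead of a best effort.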
Why no screen, on purpose
Cheapest interactivity in consumer electronics? A screen. It's also the thing parents are actively fleeing. A voice-only architecture is harder to build (you can't paper over a slow model with a loading spinner), but it's the entire premise. The toy has to hold a conversation the way a friend does: with its voice, its memory and (in Lumi's case) a gentle breathing motion. The 10 families testing Lumi daily tell us their kids ask for her before the TV. That doesn't happen if she is a TV.
What this costs, since you're wondering
PlayOS itself (firmware, cloud, parent app, SDK) is free. The module is the only hardware you buy, and the full pricing logic is on the platform page. We publish it because the alternative ("contact us for pricing") is how this industry signals a number you won't like.