Engineering · June 12, 2026 · 7 min read · Kheelona Team

Inside the Magic Box: how a screen-free AI toy actually works

A plain-language teardown of our own shipping module (mic array to wake word to voice model to speaker) and where the privacy boundaries actually sit.

Every partner call reaches the same moment: someone holds up our module and asks what's actually inside. Fair. If you're going to put this in your product, you should know how it works, not at the marketing level but at the block-diagram level. This is that walkthrough, for our own shipping hardware. No magic. A lot of box.

The hardware: what fits in a palm

The Kheelona Magic Box is a custom PCB around an ESP32-class microcontroller. On the board: a microphone array (more than one mic, so the toy can hear a child across a noisy room, not just a child shouting at it), an amplifier and speaker driver, USB-C charging with battery management, an antenna tuned to survive being inside a plush body, and thermal design for the same reason: fabric is a blanket, and electronics don't love blankets.

Form factor matters more than spec-sheet bravado here. The module has to drop into a plush seam, a crib rail or a pet's belly without redesigning the shell around it. That single constraint (one module, any body) drives almost every engineering decision, and it's why the same board family runs a $10 talking plush and a $100 crib brain. Tiers, not products.

The round-trip: from "Lumi?" to an answer

The whole trip runs in about a second. A four-year-old's patience runs out in about two.

Step one happens before anything you'd call AI: the wake word. The toy is not recording the room. The mic array runs a tiny always-on detector that listens for one thing, its own name, and discards everything else. This detector runs on the device itself, from the entry tier up (the pricing table lists "wake word on device" even for Talk & Respond); nothing leaves the toy until the toy is actually spoken to.
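For the curious, the gating idea can be sketched in a few lines. This is a minimal illustration, not our firmware: the `Frame` type, the `is_wake_word` callback, and the `window` and `utterance_frames` sizes are all assumptions invented for the example, and a real detector runs on the microcontroller rather than in Python. What it shows is the shape of the guarantee: audio sits in a short rolling buffer that keeps overwriting itself, and only the frames that follow a detection are ever handed onward.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List

Frame = bytes  # hypothetical: one short chunk of microphone audio


def gate_audio(
    frames: Iterable[Frame],
    is_wake_word: Callable[[Deque[Frame]], bool],  # hypothetical detector callback
    window: int = 3,             # assumed size of the rolling buffer, in frames
    utterance_frames: int = 4,   # assumed length of one captured utterance
) -> List[List[Frame]]:
    """Return only the audio that follows a wake word; the rest is dropped."""
    recent: Deque[Frame] = deque(maxlen=window)  # old frames fall off the end
    forwarded: List[List[Frame]] = []
    capture: List[Frame] = []
    capturing = False

    for frame in frames:
        if capturing:
            capture.append(frame)
            if len(capture) == utterance_frames:
                forwarded.append(capture)
                capture, capturing = [], False
            continue
        recent.append(frame)
        if is_wake_word(recent):
            capturing = True
            recent.clear()  # the wake word itself is not forwarded

    if capture:  # stream ended mid-utterance: forward what was captured
        forwarded.append(capture)
    return forwarded


# Toy stand-in for a detector: fires when the newest frame is the name.
room = [b"tv", b"chatter", b"lumi", b"what", b"is", b"rain", b"?", b"more", b"noise"]
heard = gate_audio(room, lambda buf: buf[-1] == b"lumi")
```

In this run `heard` holds a single utterance, the four frames after `b"lumi"`. The television, the chatter and the trailing noise never leave the function.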

Then the voice trip. On the entry tier, audio goes to the cloud, where speech becomes meaning, meaning becomes a reply, and the reply becomes a voice. On higher tiers, parts of that pipeline run at the edge, on the module, which buys two things: lower latency and less data leaving the device. Our model layer is a voice-to-voice SLM under a billion parameters: small enough to be fast and cheap on toy-grade hardware, constrained enough to be auditable. There is no text transcript bouncing through three vendors' APIs. Voice in, voice out.

The latency budget is brutal and non-negotiable. Past roughly a second of silence, a small child decides the toy is broken and starts shaking it. Every engineering choice (model size, edge inference, where the wake word runs) answers to that second.
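A budget like this is easiest to reason about as plain arithmetic: list the stages, add them up, see what is left. The sketch below does that. The one-second budget is the only number taken from this post; every per-stage figure is a placeholder made up for illustration, not a measurement of our pipeline, and the stage names are assumptions too.

```python
from typing import Dict, Tuple

BUDGET_MS = 1000  # "about a second", the only figure taken from the text

# Placeholder timings in milliseconds. These are NOT measured values.
EXAMPLE_STAGES_MS: Dict[str, int] = {
    "wake_word": 50,
    "uplink": 150,
    "input_filter": 25,
    "voice_model": 450,
    "output_filter": 25,
    "downlink_and_playback": 200,
}


def check_budget(stages: Dict[str, int], budget_ms: int = BUDGET_MS) -> Tuple[int, int]:
    """Return (total latency, remaining headroom); negative headroom means over budget."""
    total = sum(stages.values())
    return total, budget_ms - total


total_ms, headroom_ms = check_budget(EXAMPLE_STAGES_MS)
```

With these made-up figures the trip totals 900 ms and leaves 100 ms of headroom. The useful part is the habit, not the numbers: any stage that grows has to be paid for by shrinking another one.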

The two filters nobody sees

Notice that the pipeline has a safety filter on both sides of the model. Input filtering decides what the model should even be asked; output filtering checks every reply against age-graded policies before the speaker gets it. The model never has open internet access, so "the toy googled something horrifying" is not a failure mode that exists. We red-team the whole pipeline before any release ships (a build doesn't go out until the red team gives up), and the 2026 COPPA Rule, which now treats a child's voiceprint as personal information, made this architecture the legal floor rather than a nice-to-have.
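The structure of two filters around a model can be sketched as follows. Treat it as a shape, not our implementation: `Policy`, `violates`, `respond` and the `FALLBACK` line are hypothetical names, the keyword match is a crude stand-in for whatever classifier a real filter would use, and text strings stand in for audio only to keep the example readable. The point it illustrates is that both checks are hard gates outside the model, so a blocked question never reaches the model and a blocked reply never reaches the speaker.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Hypothetical safe redirect spoken whenever either filter trips.
FALLBACK = "Let's talk about something else. Want to hear a story?"


@dataclass(frozen=True)
class Policy:
    age_band: str                  # e.g. "3-5"; the banding is an assumption
    blocked_terms: FrozenSet[str]  # stand-in for a real policy model


def violates(text: str, policy: Policy) -> bool:
    """Crude keyword check, used here only as a placeholder for a real classifier."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not words.isdisjoint(policy.blocked_terms)


def respond(utterance: str, model: Callable[[str], str], policy: Policy) -> str:
    # Input filter: decide whether the model should be asked at all.
    if violates(utterance, policy):
        return FALLBACK
    reply = model(utterance)
    # Output filter: check the reply before the speaker gets it.
    if violates(reply, policy):
        return FALLBACK
    return reply


policy = Policy(age_band="3-5", blocked_terms=frozenset({"weapons", "gun"}))
asked = []


def stub_model(text: str) -> str:
    """Fake model that records what it was asked."""
    asked.append(text)
    return "Rain comes from clouds."


blocked_in = respond("Tell me about weapons?", stub_model, policy)
allowed = respond("Why does it rain?", stub_model, policy)
blocked_out = respond("What is that?", lambda _: "That is a gun.", policy)
```

Here `blocked_in` and `blocked_out` are both the fallback line, `allowed` is the stub's answer, and `asked` contains only the rain question: the weapons question was stopped before the model saw it.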

The honest test for any AI toy, ours included: ask the vendor what happens when a child asks the toy about weapons. If the answer involves the phrase "the model usually...", walk away. "Usually" is not a safety architecture.

The parts you don't hear

The parent app. Consent lives here: onboarding, topic controls, time limits, conversation review and one-tap delete. Under the updated rules, training data needs its own separate opt-in; that flow ships in the app, not in fine print.
Region pinning. A child's data stays in the child's region: US in the US, EU in the EU, India in India. It's a deployment setting, not a promise in a PDF. More on the safety page.
OTA updates. The toy on a shelf this Diwali gets smarter by next Diwali: new stories, new languages, security patches. The same channel carries the kill-switch: if a fleet ever misbehaves, it can be remotely disabled. We'd rather brick a toy than headline a news cycle.
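These three controls are all settings rather than promises, which means they can be written down as data. A minimal sketch, with everything in it hypothetical: the `.example` endpoints, the `Device` and `Consent` fields, and the function names are assumptions for illustration, not the PlayOS API. It shows three rules from the list above: an unknown region is an error instead of a silent default, training use needs its own opt-in on top of service consent, and a disabled fleet stops running.

```python
from dataclasses import dataclass
from typing import AbstractSet

# Hypothetical per-region endpoints; ".example" is a reserved placeholder domain.
REGION_ENDPOINTS = {
    "US": "https://us.playos.example",
    "EU": "https://eu.playos.example",
    "IN": "https://in.playos.example",
}


@dataclass(frozen=True)
class Consent:
    service: bool = False   # parent agreed to the toy working at all
    training: bool = False  # separate opt-in; never implied by `service`


@dataclass(frozen=True)
class Device:
    device_id: str
    region: str  # pinned at deployment
    fleet: str   # hypothetical grouping used for remote disable


def endpoint_for(device: Device) -> str:
    """Resolve the pinned region; refuse rather than fall back to another region."""
    try:
        return REGION_ENDPOINTS[device.region]
    except KeyError:
        raise ValueError(f"no endpoint pinned for region {device.region!r}") from None


def may_train_on(consent: Consent) -> bool:
    """Training data requires both consents, each given explicitly."""
    return consent.service and consent.training


def is_enabled(device: Device, disabled_fleets: AbstractSet[str]) -> bool:
    """Kill-switch check: a device in a disabled fleet does not run."""
    return device.fleet not in disabled_fleets


toy = Device(device_id="lumi-0001", region="IN", fleet="plush-2026a")
```

For this device, `endpoint_for(toy)` resolves to the India endpoint, `may_train_on(Consent(service=True))` is false because the training box was never ticked, and `is_enabled(toy, {"plush-2026a"})` is false once that fleet is switched off.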

Why no screen, on purpose

Cheapest interactivity in consumer electronics? A screen. It's also the thing parents are actively fleeing. A voice-only architecture is harder to build (you can't paper over a slow model with a loading spinner), but it's the entire premise. The toy has to hold a conversation the way a friend does: with its voice, its memory and (in Lumi's case) a gentle breathing motion. The 10 families testing Lumi daily tell us their kids ask for her before the TV. That doesn't happen if she is a TV.

What this costs, since you're wondering

Tier	What runs where	Per unit
Talk & Respond	Wake word on device · voice in the cloud	$10
Real Conversation	Wake word + more audio on device · memory + languages	mid-band
Sees & Understands	Camera + edge inference on the module	$50
Baby-care band	Cry detection → soothing → sleep analytics + motors	$30–100

PlayOS itself (firmware, cloud, parent app, SDK) is free. The module is the only hardware you buy, and the full pricing logic is on the platform page. We publish it because the alternative ("contact us for pricing") is how this industry signals a number you won't like.

Want the block diagram for your specific shell? Bring it to a call, and an engineer will be on the line, not a deck.
