What the Kumma bear failure teaches everyone building AI toys
FoloToy's Kumma bear showed what a raw LLM in a toy does. The root cause, the layered guardrail stack, and the diligence checklist for any AI toy vendor.
A teddy bear came off the market in November 2025. Not for lead paint or a loose seam, but for the things it said to children. The bear was Kumma, made by FoloToy, and the architecture that failed inside it is sitting inside other AI toys on shelves right now.
We should declare our stake before we argue anything. We build PlayOS, a toy operating system that competes directly with the approach that failed here. That makes this a vendor's article. Read it that way. Pressure-test every claim, and when you reach the checklist near the end, run it on us first.
What happened, without embellishment
The facts are short. Kumma was a plush bear with a microphone, a speaker and a connection to a general-purpose language model. In November 2025, researchers at the PIRG Education Fund published their annual Trouble in Toyland report, which covered AI toys. They reported that Kumma would tell them where in a home to find knives, pills, matches and plastic bags, and that it would drift into explicit sexual topics. They also reported something quieter and more important: the bear's guardrails grew weaker the longer a conversation ran.
CNN reported that the bear used OpenAI's GPT-4o by default, that FoloToy suspended sales and announced a safety audit, and that OpenAI cut the developer's access to its models. The story circled the world in days. Kumma took the headlines, but it was not the only toy being tested that season; NBC News ran its own tests on several other AI toys ahead of the gift cycle.
The root cause fits in one sentence
Kumma was a pipe. Microphone in, frontier model in the middle, speaker out. Between the model and the child there was no layer built specifically for children, so the model vendor's general safety settings were the entire child-safety system. That is the architecture to interrogate, and it fails for three reasons that have nothing to do with bad luck.
Trained on the wrong audience
A frontier model learns from the public internet, which is written by adults, for adults. Its default reader is a grown-up. Point it at a four-year-old and every reply is a translation it was never graded on. Most of the time the translation holds. The failures are what make the news.
Children are accidental jailbreakers
Guardrails on a general model are instructions layered on top of its training, not removals from it. Adults defeat them with crafted prompts. Children defeat them by accident. A child asks the same question forty times, invents role-play, insists the bear is allowed to answer. PIRG watched Kumma's protections wear down across long sessions, and a child's conversation is nothing but a long session.
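The wear-down pattern can be written as a regression test. The sketch below is illustrative only; it is not PIRG's method or any vendor's code. The names `stateless_toy`, `eroding_toy` and `first_failure` are hypothetical, the keyword check is a stand-in for a real safety check, and the twenty-turn threshold is an invented number chosen to make the failure visible.

```python
from typing import Callable, List, Optional

REFUSAL = "Let's ask a grown-up about that."
UNSAFE = "<unsafe answer>"  # placeholder for whatever a broken guardrail lets through

Toy = Callable[[List[str], str], str]


def stateless_toy(history: List[str], message: str) -> str:
    """Hypothetical toy whose check ignores how long the session has run."""
    lowered = message.lower()
    if "knife" in lowered or "knives" in lowered:
        return REFUSAL
    return "Tell me more!"


def eroding_toy(history: List[str], message: str) -> str:
    """Hypothetical toy whose guardrail gives way after twenty earlier turns."""
    if len(history) >= 20:
        return UNSAFE
    return stateless_toy(history, message)


def first_failure(toy: Toy, message: str, turns: int = 40) -> Optional[int]:
    """Ask the same question `turns` times; return the first turn that is not refused."""
    history: List[str] = []
    for turn in range(1, turns + 1):
        if toy(history, message) != REFUSAL:
            return turn
        history.append(message)
    return None


if __name__ == "__main__":
    question = "Where are the knives?"
    print(first_failure(stateless_toy, question))  # None: the refusal held for all 40 turns
    print(first_failure(eroding_toy, question))    # 21: the guardrail broke on turn 21
```

A test that stops after five turns would pass both toys. The point of the loop is that the pass condition covers every turn, not only the first.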
One policy cannot fit every age
What suits a nine-year-old does not suit a four-year-old. A general model ships one safety policy for the whole world, tuned for adult users, with no idea how old the listener is. A toy needs the opposite: responses graded by age band, enforced by policy rather than by the model's judgment in the moment. No frontier model offers that as a default.
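To make "enforced by policy rather than by the model's judgment" concrete, here is a minimal sketch of a policy lookup. Everything in it is an assumption made for illustration: the two age bands, the cut-off at seven, the topic labels and the three actions are hypothetical and do not describe any shipping product. The design point is that the table decides, and that anything not in the table falls back to the most cautious action.

```python
from enum import Enum


class AgeBand(Enum):
    YOUNGER = "younger"  # hypothetical band, under seven
    OLDER = "older"      # hypothetical band, seven and up


class Action(Enum):
    ANSWER = "answer"      # the model may answer
    SCRIPTED = "scripted"  # use a human-written response
    REDIRECT = "redirect"  # steer away and suggest asking a grown-up


# Hypothetical policy table: (topic, age band) -> action.
POLICY = {
    ("thunderstorms", AgeBand.YOUNGER): Action.ANSWER,
    ("thunderstorms", AgeBand.OLDER): Action.ANSWER,
    ("pet_death", AgeBand.YOUNGER): Action.SCRIPTED,
    ("pet_death", AgeBand.OLDER): Action.ANSWER,
}


def band_for(age_years: int) -> AgeBand:
    """Map an age to a band; the cut-off at seven is an assumption."""
    return AgeBand.YOUNGER if age_years < 7 else AgeBand.OLDER


def decide(topic: str, age_years: int) -> Action:
    """Look up the action for a topic and age; unknown pairs are redirected."""
    return POLICY.get((topic, band_for(age_years)), Action.REDIRECT)


if __name__ == "__main__":
    print(decide("pet_death", 4))       # Action.SCRIPTED
    print(decide("pet_death", 9))       # Action.ANSWER
    print(decide("unlisted_topic", 9))  # Action.REDIRECT
```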
None of this is an argument against large models. It is an argument against putting one inside a toy unwrapped. The wrapping is the product.
What a real guardrail stack looks like
Here is the stack we build into PlayOS. We are describing our own product, and we have already told you we have a horse in this race, so every layer below comes with a way to verify it. The shape matters more than the logo. If your vendor is not us, ask them to point at their version of each layer.
- A constrained model under one billion parameters. Ours is called Voice SLM, and it is trained for one job: conversation with children. A frontier model holds most of the internet and tries to fence it off. A small model never held it. You cannot recite what you never read, and a small model can be audited in a way a frontier model cannot.
- Filters on both sides of the model, on-device and in the cloud. What a child says is screened before the model sees it. What the model produces is screened again before the speaker plays it. The checks run in two places, so a miss on the device can be caught in the cloud, and vice versa.
- Age-graded response policies. The same question gets a different answer at four than at nine, and the policy decides which, not the model in the moment.
- No open internet. The toy cannot browse, search or fetch. Nothing reaches the child that was not reviewed before release. This closes the fastest road by which an AI toy goes off-script.
- Human-reviewed escalation paths. When a child brings up something heavy, like being hurt or a secret an adult told them to keep, the system stops improvising. It moves to responses written and reviewed by humans, and it flags the moment to the parent.
- Red-teaming as a release gate. Before a toy ships, people are paid to break it, using everything from weapon questions to the long wear-down sessions PIRG used. The toy does not ship until the red team gives up. The output is a dated report a buyer can ask to read.
- Parent visibility, wellbeing flags and SOS. The parent app shows what the toy talks about, raises a flag when a pattern deserves attention, and carries an SOS path for moments that cannot wait. Recordings can be deleted with one tap. A toy that talks to your child must never be a black box to you.
- An over-the-air kill-switch. If a deployed toy misbehaves, we can switch it off remotely, one unit or the whole fleet. This last layer exists because we assume every layer above it can fail. That assumption is the safety culture.
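The order in which several of these layers run can be sketched in a few lines. This is a shape, not an implementation, and it is not PlayOS code: `respond`, `Turn` and the term lists are hypothetical, and plain keyword matching stands in for what would in practice be trained classifiers running on the device and in the cloud. The model is passed in as a function so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical term lists; real filters would not be keyword matches.
BLOCKED_TERMS = {"knife", "knives", "pills", "matches"}
ESCALATION_TERMS = {"hurt me", "secret"}

SAFE_REDIRECT = "That's one to ask a grown-up about. Want to hear a story instead?"
ESCALATION_SCRIPT = "Thank you for telling me. Let's talk to a grown-up you trust."


@dataclass
class Turn:
    reply: str
    parent_flag: bool  # whether this moment is flagged to the parent
    stage: str         # which layer produced the reply


def contains_any(text: str, terms: Iterable[str]) -> bool:
    lowered = text.lower()
    return any(term in lowered for term in terms)


def respond(child_text: str, model: Callable[[str], str],
            kill_switch_on: bool = False) -> Turn:
    # Kill-switch first: a disabled toy says nothing at all.
    if kill_switch_on:
        return Turn("", False, "kill_switch")
    # Heavy topics leave the model entirely and use a human-written script.
    if contains_any(child_text, ESCALATION_TERMS):
        return Turn(ESCALATION_SCRIPT, True, "escalation")
    # Input filter: screen what the child said before the model sees it.
    if contains_any(child_text, BLOCKED_TERMS):
        return Turn(SAFE_REDIRECT, False, "input_filter")
    draft = model(child_text)
    # Output filter: screen the model's draft before the speaker plays it.
    if contains_any(draft, BLOCKED_TERMS):
        return Turn(SAFE_REDIRECT, False, "output_filter")
    return Turn(draft, False, "model")


if __name__ == "__main__":
    story_model = lambda text: "Once upon a time there was a kind fox."
    leaky_model = lambda text: "The pills are in the cabinet."
    print(respond("Where are the knives?", story_model).stage)   # input_filter
    print(respond("Tell me a story", leaky_model).stage)         # output_filter
    print(respond("I have to keep a secret", story_model).stage) # escalation
```

The pipe described earlier is the last two lines of `respond` on their own: call the model, play the draft. Every other branch is wrapping.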
Around the stack sit the privacy defaults: onboarding that requires parental consent, data residency pinned to your region, no ads, and no resale of data, ever. The full list lives on our safety page.
The checklist your diligence team should run
Boards started asking about AI toys the week the Kumma story broke. If a vendor pitch is on your desk, the questions below separate a safety stack from a safety slide. Good answers are documents and live demos. Adjectives are not answers.
That last row is current law, not a future requirement. The amended COPPA Rule took effect on June 23, 2025, and its compliance deadline was April 22, 2026: a child's voiceprint now counts as personal information, and using children's data to train AI requires its own separate, verifiable parental consent. A vendor whose consent flow has not changed since spring is behind the rule, not ahead of you.
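In a data model, "separate consent" means two flags, not one. The sketch below is an illustration under assumptions, not legal guidance and not a description of any vendor's consent flow: `ConsentRecord`, its field names and the two helper functions are hypothetical. It also assumes training consent only counts when consent to use the service is in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRecord:
    """Hypothetical record of what a parent has verifiably agreed to."""
    service_use: bool = False  # consent to process the child's voice to run the toy
    ai_training: bool = False  # separate consent to use the child's data for training


def may_process_voice(consent: ConsentRecord) -> bool:
    return consent.service_use


def may_train_on(consent: ConsentRecord) -> bool:
    # Agreeing to use the toy never implies agreeing to training.
    return consent.service_use and consent.ai_training


if __name__ == "__main__":
    parent = ConsentRecord(service_use=True)
    print(may_process_voice(parent))  # True
    print(may_train_on(parent))       # False
```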
Run the table on us too. Bring it to a partnership call and go in order; we will answer with the documents open. A vendor who resents these questions is telling you something useful.
Safety is the license to operate
The toy business runs on trust that takes decades to build and seconds to spend. In this category, the downside case is not a one-star review. It is a thirty-second clip of your product saying something terrible to someone's child, and that clip does not expire. FoloToy went from listing to suspension inside a news cycle. The lesson is not that one company was careless. The lesson is that the default architecture fails quietly in the demo and publicly in the field.
We hold our own products to the stack above: Lumi, our companion, is live with ten families; Lori, our baby monitor, is production-ready; partner integrations run four to six weeks on the same eight layers, none of them optional. Safety in this category is not a feature to compare on a grid. It is the license to operate, and every toy on the shelf is carrying everyone else's license too.