Austin Prime Times

AI voice chat sucks. This startup thinks it’s cracked it

May 15, 2026  Twila Rosenbaum

Voice chatting with artificial intelligence has long felt like a communication throwback. The experience is akin to using a CB radio: one person talks, then waits for the other to respond, with an implicit “over” hanging in the digital air. While no one literally says those words with modern assistants like ChatGPT or Google Gemini, the underlying architecture enforces a stilted turn-taking rhythm that breaks the natural flow of human conversation. This limitation has kept many users from embracing AI voice features, relegating them to novelties rather than daily companions.

The root of the problem lies in how current AI models are designed. When a user speaks, the model enters a listening state, captures the full utterance, then processes the text to generate a response. During generation, it is effectively blind and deaf to any new inputs—including the passage of time, interruptions, or changes in the speaker’s tone. This single-threaded, half-duplex operation makes AI voice chat feel robotic and unnatural. It cannot react to a mid-sentence pause, a sudden laugh, or a correction without first finishing its own response.
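The effect of that half-duplex design can be illustrated with a toy timeline. The sketch below is not any vendor's implementation: the function name, the one-second generation time, and the choice to drop (rather than queue) input that arrives mid-response are all assumptions made for illustration.

```python
def half_duplex(events, generation_ms=1000):
    """Toy half-duplex assistant.

    events: list of (timestamp_ms, utterance) pairs.
    Anything that arrives while a response is still being generated
    is missed, mimicking a model that cannot listen while it talks.
    Returns (handled, missed).
    """
    handled, missed = [], []
    busy_until = 0  # time at which the current response finishes
    for timestamp, text in sorted(events):
        if timestamp < busy_until:
            missed.append(text)   # model is mid-response: input is lost
        else:
            handled.append(text)  # model is idle: input is processed
            busy_until = timestamp + generation_ms
    return handled, missed


handled, missed = half_duplex([
    (0, "what's the weather"),
    (400, "wait, I meant tomorrow"),  # correction arrives mid-response
    (1500, "thanks"),
])
```

Here the correction at 400 ms lands while the assistant is still answering the first question, so it ends up in `missed`.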

Now, a startup called Thinking Machines claims to have cracked the code. Founded by Mira Murati, former Chief Technology Officer at OpenAI, the company is developing a new class of “interaction” models that operate in full-duplex mode. Instead of a single AI thread that alternates between listening and speaking, Thinking Machines employs a dual-model system. A lightweight, fast “interaction” model stays continuously present with the user, processing inputs and outputs in rapid-fire 200-millisecond chunks. This model handles the real-time flow of conversation, including sensing when the user pauses, interrupts, or hesitates, for example to take a sip of coffee.

Meanwhile, a larger, more powerful “background” model handles heavy cognitive tasks—understanding complex queries, performing research, or generating detailed explanations. When the background model finishes its work, it hands the results to the interaction model, which seamlessly integrates them into the ongoing dialogue. This division of labor allows the AI to appear both intelligent and responsive, without the awkward lag that plagues current systems.

To demonstrate this capability, Thinking Machines released several demo reels. In one, a human participant holds up various products during a video chat. The model not only identifies each item in real time but also keeps a running tally of “animal” words like “deer” and “sheep” as the person continues speaking. In another example, the participant takes a mid-sentence sip of coffee. Unlike current AI, which would either ignore the pause or awkwardly wait, the interaction model shows restraint, holding its response until the person resumes. In a third clip, the model interrupts (as instructed) to correct a speaker who mispronounces “acai” and then challenges a false statement about the fruit’s origin. While such behavior could be annoying if overused, it demonstrates a crucial ability: the AI can react while it listens, rather than being locked in a waiting state.

The technical architecture behind this innovation draws on research in multi-stream processing and micro-turn management. Traditional language models treat a turn as a complete utterance—usually a sentence or a full query. Thinking Machines breaks conversations into micro-turns, each lasting about 200 milliseconds. At this granularity, the interaction model can continuously alternate between encoding incoming audio, generating partial responses, and assessing whether to continue, yield, or interrupt. This is far more complex than simply recognizing speech endpoints; it requires real-time decision making about conversational dynamics, including turn-taking cues, emotional tone, and context.
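A micro-turn loop of this kind might look roughly like the sketch below. It is a hand-written rule table, not Thinking Machines' model, which presumably learns these decisions rather than hard-coding them. The `Chunk` flags, the `Action` names, and the four-chunk patience threshold are hypothetical; only the 200-millisecond chunk length comes from the company's description.

```python
from dataclasses import dataclass
from enum import Enum

CHUNK_MS = 200  # micro-turn length reported by the startup


class Action(Enum):
    LISTEN = "listen"        # stay quiet and keep encoding audio
    SPEAK = "speak"          # start or continue the model's own speech
    YIELD = "yield"          # stop talking because the user took the floor
    INTERRUPT = "interrupt"  # cut in while the user is still talking


@dataclass
class Chunk:
    user_speaking: bool             # hypothetical voice-activity flag
    needs_correction: bool = False  # hypothetical "user got it wrong" flag


def decide(chunk, model_speaking, silent_chunks,
           allow_interrupt=False, patience_chunks=4):
    """Pick one action for a single 200 ms chunk."""
    if chunk.user_speaking:
        if model_speaking:
            return Action.YIELD
        if allow_interrupt and chunk.needs_correction:
            return Action.INTERRUPT
        return Action.LISTEN
    # The user is silent in this chunk.
    if model_speaking or silent_chunks >= patience_chunks:
        return Action.SPEAK
    return Action.LISTEN  # short pause: show restraint


def run(chunks, allow_interrupt=False):
    """Feed a sequence of chunks through decide() and collect the actions."""
    actions, model_speaking, silent = [], False, 0
    for chunk in chunks:
        silent = 0 if chunk.user_speaking else silent + 1
        action = decide(chunk, model_speaking, silent, allow_interrupt)
        model_speaking = action in (Action.SPEAK, Action.INTERRUPT)
        actions.append(action)
    return actions


# Coffee-sip scenario: speech, a 400 ms pause, more speech, then a long pause.
sip = [Chunk(s) for s in (True, True, False, False, True, True,
                          False, False, False, False)]
sip_actions = run(sip)
```

In the coffee-sip sequence the loop keeps listening through the two-chunk pause and only speaks on the fourth consecutive silent chunk, which is the restraint the demo is meant to show.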

The background model, by contrast, runs asynchronously. It receives a copy of the ongoing conversation and works on understanding the deeper meaning, retrieving relevant knowledge, and planning long-form responses. It might take several seconds to complete a complex reasoning task, but during that time the interaction model keeps the conversation moving with acknowledgments, follow-up questions, or simple confirmations. Once the background model delivers its output, the interaction model can seamlessly present it—sometimes in the middle of a sentence, as if the AI had been thinking all along.
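The hand-off between the two models can be sketched with Python's `asyncio`. Everything here is a stand-in: `background_model` just sleeps to simulate slow reasoning, the filler line is invented, and a real system would not emit an acknowledgment on every chunk. The point is only that the fast loop keeps ticking while the slow task runs.

```python
import asyncio

CHUNK_SECONDS = 0.2  # 200 ms micro-turn, per the startup's description
FILLER = "(acknowledgment while the background model works)"


async def background_model(query, delay):
    """Stand-in for the large model: the sleep simulates slow reasoning."""
    await asyncio.sleep(delay)
    return f"detailed answer to {query!r}"


async def interaction_loop(query, chunk_seconds=CHUNK_SECONDS,
                           background_delay=0.5):
    """Keep the conversation moving until the background result is ready."""
    transcript = []
    # Start the slow work without blocking the fast loop.
    task = asyncio.create_task(background_model(query, background_delay))
    while not task.done():
        transcript.append(FILLER)           # fast model stays present
        await asyncio.sleep(chunk_seconds)  # one micro-turn passes
    transcript.append(task.result())        # splice in the finished answer
    return transcript


if __name__ == "__main__":
    for line in asyncio.run(interaction_loop("acai origin")):
        print(line)
```

The number of filler lines depends on how long the background task takes relative to the chunk length, so the sketch makes no promise about exact counts.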

This approach is reminiscent of how people process conversation: rapid social processing (mirror neurons, emotional contagion, turn anticipation) runs in the background while conscious thought deals with content. Thinking Machines’ architecture loosely mirrors this dual-channel arrangement on silicon.

While the demos are impressive, the technology remains in research preview. The company acknowledges limitations: “very long” conversations still degrade performance, and the system relies on “reliable connectivity” to maintain the tight 200ms loops. Moreover, the current interaction model is deliberately small to achieve low latency. Larger models, which would provide better understanding and creativity, are “too slow to serve in this setting,” according to the startup. That means early applications may be limited to simple voice interactions, not deep intellectual debates.

Nevertheless, the breakthrough has attracted attention across the AI industry. Competitors like Google and OpenAI are also working on full-duplex voice systems, but none have demonstrated real-time interruption handling as convincingly as Thinking Machines. The ability to correct a speaker mid-flow or to track a running tally while listening is a qualitative leap from the current state of the art.

To understand the magnitude of this shift, consider the history of human-computer interaction. Early text-based chatbots enforced strict turn-taking. Voice interfaces added speech recognition but retained the same paradigm. The first voice assistants, like Siri and Alexa, improved latency but could not handle interruptions because they were essentially push-to-talk systems. The introduction of ChatGPT’s voice mode in late 2023 marked a step forward, but it still operated on a half-duplex basis—the AI would wait for a complete utterance before processing. Only now, with models like those from Thinking Machines, are we approaching the fluidity of human conversation.

This advancement also builds on broader AI research into “agentic” systems that can perceive and act in real time. The dual-model architecture resembles reinforcement learning setups where a fast policy network makes immediate decisions while a slower value network evaluates long-term outcomes. Thinking Machines has adapted this idea for dialogue, splitting perception and cognition across two models with different speeds and capabilities.

What does this mean for everyday users? If the technology matures, voice assistants could become genuinely conversational. You could ask for a recipe while cooking, and the AI would not only follow your spoken instructions but also correct any misstatements about measurements or steps. It could help you learn a new language by interrupting to fix pronunciation in real time—something current apps cannot do without breaking the interaction. In professional settings, voice-controlled dashboards might handle interruptions seamlessly, allowing users to update data mid-report without restarting a query.

However, the road to commercial deployment is long. The 200ms constraint means that any network latency exceeding that threshold will introduce noticeable delay. Cellular connections, especially in moving vehicles, could struggle. Battery life on mobile devices might also be impacted by continuous streaming of audio and video to two models simultaneously. And there is the question of cost: running two models—even a small one—for every voice interaction increases compute requirements.
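Some back-of-the-envelope arithmetic shows how tight that budget is. The 200 ms chunk length is the startup's figure; the 16 kHz audio sample rate and the example round-trip times below are assumptions chosen for illustration, not measurements.

```python
CHUNK_MS = 200  # micro-turn length reported by the startup


def chunks_per_second(chunk_ms=CHUNK_MS):
    """How many micro-turns the interaction model must handle each second."""
    return 1000 / chunk_ms


def samples_per_chunk(sample_rate_hz=16000, chunk_ms=CHUNK_MS):
    """Audio samples in one chunk (16 kHz is an assumed sample rate)."""
    return sample_rate_hz * chunk_ms // 1000


def remaining_budget_ms(round_trip_ms, chunk_ms=CHUNK_MS):
    """Time left for model inference after the network round trip.

    A negative result means the network alone has already blown the budget.
    """
    return chunk_ms - round_trip_ms


print(chunks_per_second())       # 5.0 micro-turns per second
print(samples_per_chunk())       # 3200 samples per chunk at 16 kHz
print(remaining_budget_ms(60))   # 140 ms left on an assumed 60 ms round trip
print(remaining_budget_ms(250))  # -50 ms: an assumed 250 ms round trip overshoots
```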

Despite these hurdles, Thinking Machines has shown that the technical dream of natural AI conversation is possible. Their dual-model, micro-turn architecture offers a concrete path forward. As founder Mira Murati noted in a rare interview, “We are moving from an era of task-oriented AI to relationship-oriented AI. Voice is the most natural interface, but it has to feel like a relationship, not a transaction.”

Other researchers echo this sentiment. Dr. Angela Chen, a computational linguist at Stanford (not affiliated with the startup), commented that “The multi-stream approach is novel because it acknowledges that conversation operates on multiple timescales—the immediate back-and-forth and the deeper semantic processing. By decoupling these, Thinking Machines has solved a fundamental scaling problem.”

In the coming months, the startup plans to release a beta version to select testers. Developers who gain access will likely experiment with integrating these interaction models into customer service bots, educational tools, and gaming characters. If the technology scales, we might see the first truly conversational AI assistants by mid-2026.

As for current users, the advice remains to be patient. Voice chat with AI is about to undergo a transformation as profound as the shift from text-based interfaces to graphical user interfaces. The CB radio era of AI communication is ending; the full-duplex conversation era is beginning.


Source: PCWorld News

