Hypercue

Lucifer Blue

The Listener at the Shoulder

TL;DR

Hypercue is a live-speaking assistant that listens to a speaker in real time, reads their script and slides, and whispers cues into their earbud at the moments they need one. This essay is about why that problem is structurally different from conversational voice AI, and what that difference forces at the system design level. The short version: silence has to be the default state of the product, which inverts the economics of every decision in the pipeline, which in turn forces a dual-track architecture separating the decision to speak from the generation of what to say, with the user's prepared script acting as a strong prior that makes the prediction problem much better-behaved than the open-domain case. Most of the engineering is downstream of that one interaction constraint. If you only read one more section, make it “Silence as an economic constraint.”

There is a particular kind of silence that kills a live talk. It is the half-second after you lose your place, the pause where your eyes flick down, your cadence stumbles, and the audience, without quite knowing why, leans back. Everyone who has ever spoken in front of a room knows that half-second. It is the enemy.

Hypercue was built against it. The product listens to a speaker in real time, reads the script and slides they have prepared, and whispers a cue into their ear at the exact moment they need one. It is built for people whose thoughts are often better than their speech: the introverted engineer who loses a meeting to the loudest voice in the room, the non-native manager whose proposal is clearer than a competitor's but less fluent, and more generally the part of human knowledge that dies in the step between thinking something and saying it out loud. This essay is about the engineering that makes that whisper possible, and about the reasoning that got us from an interaction problem to a stack.

The shape of the problem

Start with what is actually happening in the user's head. A live speaker is cognitively committed to a demanding real-time task. Their attention budget is spoken for. Most of it is on the room: reading faces, modulating pace, listening for the question that is about to come. A smaller portion is on the material: the next beat, the structure of the argument, the slide transition. Whatever is left, and it is not much, is available for anything else. Into that remaining sliver, you want to insert help. How do you do it without destroying the thing you are trying to help?

This is a classic problem in human-computer interaction. Eric Horvitz and colleagues at Microsoft Research spent years on the economics of interruption in attention-aware interfaces, and the conclusion their line of work keeps returning to is that the expected value of a notification is positive far less often than product designers assume. Once you take that seriously, interruption stops being a timing detail and becomes a first-class design constraint. It reshapes everything downstream of it.

The cleanest way to name what this produces is to notice that most AI products are prompt-based, and ours is not. A prompt-based product waits for the user to initiate. Chat, search, coding assistants, voice interfaces of the conversational kind: in all of them, the user knows they want help and asks for it, and the design problem is to make the response fast, accurate, and well-formed. Hypercue is cue-based. The user does not initiate, because the user is cognitively committed to something else and does not have the attention budget to initiate. The system initiates on the user's behalf, silently and most of the time not at all, and the work of the entire stack is to make those silent initiations trustworthy enough that the user is willing to hand over the decision of when help is needed. Prompt-based AI optimizes for responsiveness. Cue-based AI optimizes for restraint. They look similar from the outside and are very different products underneath.

A moment in a classroom

One moment from early testing is worth naming, not as proof of anything but as a description of what the system is aiming at. In a classroom session at Carnegie Mellon, a member of our team gave a five-minute business analysis in English, which is a second language for the speaker, using Hypercue. The instructor, who spent two decades as a VP at Bank of America and HP and now chairs WITI's advisory board, stopped the class to note that the speaker seemed to have become a different person while wearing it, with cadence and intonation that finally matched the native speakers in the room. What the instructor was describing, in our reading, is the interaction working: cues landing at the right micro-moments, restraint holding at every other moment, the speaker staying in flow without ever being pulled out of the room by the help they were getting. That is the target. Everything below is about how to hit it.

Silence as an economic constraint

Once you take silence-by-default seriously as an interaction principle, it starts to have teeth at the system design level, because it inverts the economics of every compute decision in the pipeline. In a conversational voice product, almost every turn ends with the system generating a response, and the cost of reasoning, retrieval, and speech synthesis is paid on almost every turn. In ours, the system generates audio in only a small fraction of moments. If we paid the full reasoning cost on every moment only to discard the output most of the time, we would be spending orders of magnitude more compute than the product can economically support. The structural answer is to reorganize what work happens when.

The reorganization we arrived at is a dual-track control plane. One track runs continuously at cadence and answers a single cheap question: given everything we know right now about the speaker's acoustic state and their position in the script, would a cue help in the next few seconds? This is a classification decision, not a generation decision, and it runs on features cheap enough to be evaluated at the speed of the audio stream. In our current implementation the decision loop on this track runs in the tens of milliseconds end-to-end, well inside the window of a single audio frame, which means it can be evaluated continuously without perceptibly loading the system. The other track runs asynchronously and answers the expensive question: if the first track were to decide a cue is warranted soon, what should that cue actually say? The second track's output is staged, not spoken. It is parked in a cache where the first track can retrieve it in constant time on the rare occasions when a cue is needed.
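To make the split concrete, here is a minimal, synchronous Python sketch of the dual-track pattern. Everything in it is illustrative rather than the production implementation: the feature names (`silence_ms`, `position_conf`), the thresholds, and the `CueCache` shape are assumptions, and the real reasoning track runs asynchronously rather than inline.

```python
from dataclasses import dataclass, field

@dataclass
class CueCache:
    """Cues staged by the reasoning track, keyed by script position."""
    staged: dict = field(default_factory=dict)

    def stage(self, position: int, cue: str) -> None:
        self.staged[position] = cue

    def take(self, position: int):
        # Constant-time lookup on the hot path; None means cold path.
        return self.staged.pop(position, None)

def reflex_should_fire(silence_ms: int, position_conf: float) -> bool:
    """Cheap per-frame decision: would a cue help in the next few seconds?
    Classification only -- this track never generates text."""
    return silence_ms > 700 and position_conf > 0.85

def reasoning_stage_ahead(cache: CueCache, script: list, pos: int,
                          lookahead: int = 3) -> None:
    """Asynchronous in the real system; here a synchronous stand-in that
    drafts cues for the next few likely script positions."""
    for p in range(pos + 1, min(pos + 1 + lookahead, len(script))):
        cache.stage(p, f"Next beat: {script[p]}")

# One simulated tick: reasoning runs ahead, reflex fires from the cache.
script = ["opening", "problem", "demo", "pricing", "close"]
cache = CueCache()
reasoning_stage_ahead(cache, script, pos=0)
if reflex_should_fire(silence_ms=900, position_conf=0.92):
    cue = cache.take(1)   # warm path: the cue was staged before the decision
```

The design point the sketch preserves is that `reflex_should_fire` never pays generation cost, and `take` is a constant-time read of work that was done speculatively.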

The point of this split is not that fast-plus-slow is a clever pattern. The point is that the economics of silence-by-default force a separation between the decision to speak and the generation of what to say. If those two decisions share a computation, you pay generation cost on every decision, and the product is uneconomic. If you separate them, you pay generation cost only on the moments where a cue is actually about to land, and the rest of the time the reasoning track is doing speculative work that is useful if the speaker stays on their expected path and cheap to throw away if they do not. In practice, on the common case where the speaker stays roughly aligned with their prepared material, the vast majority of cues that reach the user's ear come from work that had already been staged before the decision to fire was made, which is what makes the experienced latency between “a cue is needed” and “a cue is heard” sit comfortably inside the window that feels, to the speaker, like no latency at all. The moments where a cue has to be generated from scratch are the exception, and the exception path is where most of our engineering effort on graceful degradation has gone.

The script as a prior

Our users bring scripts. Sometimes it is a fully written speech, more often a deck with speaker notes, a set of talking points, a sales playbook, a rehearsed pitch with memorized beats. Whatever form it takes, the prepared material gives us a bounded, high-prior estimate of what the speaker intends to say over the next several minutes. We do not have to guess topics from conversation history. We have the document.

This changes the shape of the prediction problem, and it is where most of our engineering investment has gone. The question stops being “what topic might come up next” and becomes “where in the known script is the speaker right now, and what cues should be staged for the next few likely positions along their path.” Dual-process architectures of this general kind have started to appear in the voice AI literature (Salesforce Research's VoiceAgentRAG paper from earlier this year is a recent example), and those systems solve an open-domain version of the problem where the background predictor is essentially guessing next topics from conversation history. With a script as the prior, the prediction space is bounded, and the ceiling is set by how well we can track the speaker's position against a known document rather than by the entropy of the next open-domain turn. This is a much better-behaved problem when you have the text.

The sensor that makes script position tracking work is read-along matching. We align the live streaming transcript to the prepared script phonetically, not character by character, using a symbolic phonetic matcher (Metaphone) on both sides. This lets us stay aligned through the specific failure modes that break naive alignment in live speech: mumbled words, improvised paraphrases, proper nouns the recognizer has never seen, accent-driven transcription errors, the general noisiness of real professional audio. Metaphone is not an acoustic model and we do not use it as one. What it is, given a script as prior, is the right tool for robust sound-level alignment to that script, a well-understood technique in computer-assisted language learning and audiobook synchronization. With a script it is a sharp tool. Without a script it would buy us almost nothing. The combination is the point.
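A toy version of the idea, in Python. The `sound_key` function below is a deliberately crude stand-in for Metaphone (it is not the real algorithm), and the sliding-window alignment is a simplification of what the production matcher does; the point is only to show why sound-level keys survive a mis-transcription that exact text matching would not.

```python
import re

def sound_key(word: str) -> str:
    """Crude stand-in for a Metaphone-style phonetic key (NOT the real
    algorithm): keep the first letter, strip later vowels, collapse doubles."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return ""
    key = w[0] + re.sub(r"[aeiou]", "", w[1:])
    return re.sub(r"(.)\1+", r"\1", key)

def align(transcript_words: list, script_words: list, window: int = 4) -> int:
    """Return the script index whose next `window` words best match the
    tail of the live transcript, comparing phonetic keys, not spelling."""
    tail = [sound_key(w) for w in transcript_words[-window:]]
    best_pos, best_score = 0, -1
    for i in range(len(script_words) - window + 1):
        keys = [sound_key(w) for w in script_words[i:i + window]]
        score = sum(a == b for a, b in zip(tail, keys))
        if score > best_score:
            best_pos, best_score = i, score
    return best_pos

# A mangled "Kubernetes" survives phonetic matching where exact text fails.
script = "we deploy the service on Kubernetes every night".split()
heard  = "we deploy the service on kubernetties".split()
pos = align(heard, script)   # lands on "the service on Kubernetes"
```

The same mechanism is what lets the matcher hold alignment through accent-driven transcription errors: both sides are reduced to sound before comparison, so the recognizer's spelling of a proper noun stops mattering.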

With confident script position as the primary feature, the reasoning track's job becomes tractable. It walks the script ahead of the speaker, generates candidate cues for the next several likely beats using the relevant slide context, and stages them in a cache keyed by script position. In the off-script case, where the speaker has improvised meaningfully beyond what the script anticipates, we fall back to an open-domain semantic cache. But the common case, which is most of real usage, runs through the position-keyed path, and that path is where the latency and economics of the product come from. In our internal evaluation on a representative set of prepared talks, the position-keyed path accounts for the large majority of cues delivered during normal speaking, and the end-to-end perceived latency on that path is multiples below what the same pipeline would produce without the script prior, for reasons that are structural rather than incidental.
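The two-tier lookup can be sketched as follows. The bag-of-words similarity here stands in for whatever embedding the real semantic cache uses, and the cache entries, field names, and confidence threshold are all illustrative:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity over word counts; a toy stand-in for embeddings."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def lookup_cue(position, position_conf, staged, semantic_cache, recent_words):
    """Two-tier lookup: position-keyed path when we trust the position,
    else nearest entry in the open-domain semantic fallback cache."""
    if position_conf >= 0.85 and position in staged:
        return staged[position]          # the common, cheap, fast path
    query = Counter(recent_words)
    best = max(semantic_cache,
               key=lambda e: cosine(query, Counter(e["keys"])),
               default=None)
    return best["cue"] if best else None
```

The structural property worth noticing is that the expensive similarity search only runs when position confidence has already collapsed, which is exactly the off-script case the essay describes.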

When the user wants more, not less

There is a version of all of this that we did not expect to become as important as it did, and it is worth being honest about. For a meaningful subset of our users, the mode of the product that matters most is not the one where cues arrive sparingly at the edges of their attention. It is the opposite. They want the system to walk them through their prepared material continuously, reading ahead of them, beat by beat, in something closer to a shadowing mode than a cueing mode. Non-native English speakers giving high-stakes talks are one clear example of this, but the same pattern shows up in native speakers preparing for the most consequential moments of their careers, and more generally in anyone whose cognitive budget is already fully allocated to something other than remembering the next sentence.

What shadowing mode quietly addresses is a number most speakers never say out loud. The preparation time for a serious spoken moment is, for most people, roughly thirty times its actual duration. A fifteen-minute keynote is not fifteen minutes of work. It is seven or eight hours of pacing around hotel rooms and empty offices and parked cars, saying the same lines over and over until they feel safe enough to take on stage. That labor is almost always invisible, because it happens alone and because the people doing it do not want to admit how much of it they need. What a trustworthy continuous voice in the ear changes is not whether the speaker prepares, but how deeply they need to burn each line into muscle memory before they trust themselves in front of other people. The depth of preparation required drops when the safety net in the room is real. That, more than any individual cue, is the economic argument for the shadowing mode, and it applies well beyond stages. It applies to any in-person moment where the stakes are high enough that the speaker has been silently paying the thirty-times tax, which turns out to be most of the moments that matter.

It would be easy to read all of this as a contradiction of everything above. The essay has spent several sections arguing that silence is the default state of the product, and now here is a mode where the product speaks almost continuously. The resolution is that silence-by-default was never really a claim about the literal acoustic behavior of the system. It was a claim about the symbiosis between the system and the user. The principle is that the system's voice is never operating outside the envelope the user has opened for it. In the default cueing mode, that envelope is narrow: the user hands the system the decision of when help is needed, and the system is conservative about firing inside that delegation. In the shadowing mode, the envelope is wide: the user has said, in effect, that for the duration of this moment they want a continuous supporting voice alongside them, and the system operates inside that standing invitation. The difference between the two modes is not whether the system is silent or speaking. The difference is the shape of the symbiosis the user has asked for.

The mistake would be to think that because the shadowing mode looks, from the outside, like a simpler problem of walking a script and synthesizing ahead, the architectural complexity of the rest of this essay goes quiet in that mode. It does not. A speaker in shadowing mode is still interrupted. Someone asks a question from the audience. The venue's microphone fails and they have to improvise a bridge. Their own thought takes them three sentences off the prepared path because something in the room just now is more important than the next scripted beat. A heckler, a laugh, a silence where they expected a reaction. Every single one of these moments requires the same intelligence that the default cueing mode is built around: tracking where the speaker actually is versus where the script expected them to be, deciding whether to keep going or to pause, knowing when to re-align to a later position in the script rather than dragging the speaker back to a place they have already left behind. The dual-track control plane, the position estimator with its asymmetric confidence thresholds, the off-script semantic fallback, all of it is doing exactly the same work in shadowing mode that it does in cueing mode. What changes is the default state. What does not change is the intelligence required to leave that default state gracefully when the room demands it.

Seen this way, Hypercue is not a single-mode product with an awkward second mode grafted onto it. It is a product where the intensity of system presence is a parameter under continuous user control, ranging from near-zero at one end to continuous at the other, and where the architecture's job is to make every point on that range trustworthy under the full set of disruptions that real in-person moments bring. Both ends of the range, and every point in between, rely on the same underlying machinery. The shadowing end of the range is not architecturally cheaper. It is architecturally the same problem under a different standing contract with the user.

There is a deeper observation under this that is worth naming. In high-cognitive-load, high-stakes moments, what users most need from a tool is often not its cleverest capability but its most dependable one. Koby Conrad, the founder of the YC-backed sobriety app Sunflower, has talked publicly about discovering that the feature his users loved most was not the AI companion but the sobriety timer, because a timer is something that is always there, always truthful, and never lets you down. Our own early signal from speakers in high-stakes moments points the same direction. The most valuable thing we can do is not to be at our cleverest at the one moment they might blank out. It is to be continuously present in a way that makes blanking out much less likely in the first place, and that does not abandon them when the room suddenly asks something the script did not anticipate. A cleverer system is less valuable than a more dependable one, when the user's cognitive budget is already spent on something terrifying. The architecture above exists to make both shapes of dependability possible without compromising either.

Where it breaks, and what we do about it

A real system earns its keep in the places where its assumptions stop holding. Three of those places are worth naming, because they are the ones that shaped the architecture after the first few weeks of putting Hypercue in front of real speakers.

The first is the speaker who goes meaningfully off-script. Every prior we have (position tracking, staged cues, anticipated beats) is conditioned on the speaker being somewhere on a path the script knows about. When the speaker improvises, takes an audience question that pulls them three minutes sideways, or reorders their deck on the fly, the position estimator's confidence decays, and the staged cues at the expected next positions become the wrong cues. The reflex track has to notice this and do two things at once: stop firing cues from the stale staged set, and hand the decision over to the open-domain fallback path. The failure mode we had to engineer against was not the fallback being slow, which we expected. It was the reflex track continuing to fire confidently against stale context for a beat or two after the speaker had already left the script, because position confidence is a smoothed signal and smoothing lags reality. The fix is unglamorous: the restraint gate's threshold for firing is coupled to the position estimator's confidence in a way that makes it asymmetric, quicker to withhold than to commit, so that the system's default response to "I'm not sure where the speaker is right now" is silence rather than a confident wrong cue. A wrong cue is much more expensive, in the user's trust, than a missed one.
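The asymmetric coupling is easy to show in miniature. In this hedged sketch, confidence losses are applied to the gate immediately while gains are smoothed, so the gate closes in a single frame and reopens only after a sustained run of on-script evidence; the rates and thresholds are illustrative, not production values.

```python
class RestraintGate:
    """Firing gate coupled to position confidence. Asymmetric on purpose:
    drops are applied instantly, recovery is smoothed, so the gate is
    quicker to withhold than to commit."""

    def __init__(self, fire_threshold: float = 0.8, recover_rate: float = 0.2):
        self.fire_threshold = fire_threshold
        self.recover_rate = recover_rate
        self.confidence = 0.0

    def update(self, raw_confidence: float) -> None:
        if raw_confidence < self.confidence:
            self.confidence = raw_confidence   # lose trust in one frame
        else:
            # regain it slowly: smoothing applies only on the way up
            self.confidence += self.recover_rate * (raw_confidence - self.confidence)

    def may_fire(self) -> bool:
        return self.confidence >= self.fire_threshold

gate = RestraintGate()
for c in [0.9, 0.95, 0.95]:   # speaker on script: confidence climbing
    gate.update(c)
gate.update(0.3)               # speaker veers off: gate closes this frame
```

The inversion of the usual exponential smoothing (which lags in both directions) is the whole trick: the lag that caused the stale-context misfires exists only in the safe direction.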

The second is the ASR upstream of us misbehaving on the specific words that matter most. Streaming ASR systems handle conversational English well and handle the long tail of proper nouns, domain jargon, and non-native pronunciations much less well, which is exactly the vocabulary that carries the most weight in a business pitch or a technical talk. A mis-transcribed product name in casual conversation is a curiosity; a mis-transcribed product name in the sentence where the speaker is introducing their company is a cue that fires at the wrong moment or fails to fire at the right one. Our mitigation runs in two layers. The script itself gives us a closed vocabulary of the words we most care about getting right, which we inject as recognition hints into the ASR layer through the keyword-boosting interfaces the providers expose. On top of that, the phonetic read-along matcher is deliberately tolerant of the specific error shapes that ASR produces on those words, because matching on sound is more robust to that failure class than matching on text. Neither mitigation is perfect, and we treat residual ASR error as a load-bearing reason to keep the restraint gate conservative: if the matcher's confidence on the last several seconds of alignment drops, the gate tightens, because we would rather miss a cue than fire on a hallucinated transcript.
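Extracting that closed vocabulary from the script is the simple half of the mitigation. A sketch, with an invented stopword list and a frequency-ranking heuristic; how the resulting terms are passed to the keyword-boosting interfaces of the ASR providers is a separate, provider-specific step not shown here.

```python
import re
from collections import Counter

# Sentence-initial capitals we don't want to boost; list is illustrative.
COMMON = {"The", "A", "An", "We", "Our", "This", "It", "In", "On", "At", "I"}

def boost_vocabulary(script: str, max_terms: int = 50) -> list:
    """Extract the closed vocabulary worth boosting: capitalized tokens
    that are not common sentence-starters, ranked by script frequency."""
    words = re.findall(r"\b[A-Z][A-Za-z0-9]+\b", script)
    counts = Counter(w for w in words if w not in COMMON)
    return [w for w, _ in counts.most_common(max_terms)]

script = "Hypercue listens. Hypercue whispers cues. We use Metaphone."
terms = boost_vocabulary(script)   # product names float to the top
```

A production extractor would do better than a capitalization regex (speaker notes contain lowercase jargon too), but even this crude version captures the property that matters: the boost list comes from the user's own document, so it is small, closed, and exactly the vocabulary the talk depends on.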

The third is the hardest one, and it is the one we spend the most time on: the cold-path cue, where the reflex track decides a cue is needed and the reasoning track has not yet finished staging the right cue for that position. This is the case where the separation between the two tracks stops buying us anything, and the user's experienced latency is determined by the slowest link in the synthesis chain. We handle it with two mechanisms working together. The first is the hint-initiator phrasing layer described below, which lets us begin speaking a deliberately low-information opening while the completing half of the cue is still resolving behind it. The second is that the reasoning track, when it is idle, prefetches cues for positions the speaker has not yet reached but is likely to, which is only possible because the script gives us a meaningful prior over “likely next positions” that open-domain systems do not have. The residual failure, when both mechanisms miss, is a cue that arrives a beat later than it should. We measure this case, we know roughly how often it happens on our internal eval, and we treat the rate at which it happens as one of the handful of numbers that tells us whether the product is getting better or worse week over week. It is not zero and we do not pretend it is zero, but the architecture is organized so that when it does happen, the failure mode is a late cue rather than a wrong cue, and a late cue is something a speaker can absorb while a wrong cue is something that breaks their trust in the system.

None of these three failure modes is fully solved. We do not think any of them are fully solvable, in the sense that a real-time system interacting with an unpredictable human will always have a tail where its assumptions lose. The engineering commitment we have made is to ensure that the failures are the right shape: silence instead of wrongness, lateness instead of noise, conservative instead of confident. Those are the choices you only make after watching the wrong version of each of them hurt a real speaker in front of a real audience.

A few details that earn their place

Two other mechanisms outside the dual-track core are worth mentioning, because they are the kind of thing you only build after watching real users break earlier versions.

The first is a prosodic pre-trigger. We begin walking the script position pointer forward based on cadence and pitch contour, before the current sentence is acoustically complete. In script-anchored terms this is anticipatory position advance, not anticipatory response generation, and it means that by the time the reflex track decides a cue is warranted, the right cached cue is already sitting at the expected next position. The effect is that the product feels like it was ready before it knew it had to be.
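In miniature, anticipatory position advance looks something like the following; the prosodic features and every threshold are placeholders for whatever the production detector actually computes.

```python
def prosodic_pretrigger(pitch_slope: float, rate_ratio: float,
                        words_remaining: int) -> bool:
    """Anticipatory position advance (not response generation): a falling
    pitch contour plus decelerating cadence near the sentence tail is read
    as an early end-of-sentence signal. Thresholds are illustrative."""
    falling = pitch_slope < -0.5     # pitch contour heading down
    slowing = rate_ratio < 0.8       # current vs. baseline syllable rate
    return falling and slowing and words_remaining <= 2

position = 4
if prosodic_pretrigger(pitch_slope=-0.9, rate_ratio=0.6, words_remaining=1):
    position += 1   # walk the pointer forward before the sentence ends
```

Note that the only thing advanced here is the position pointer; nothing is generated or spoken, which is what keeps the pre-trigger cheap enough to evaluate continuously.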

The second is a hint-initiator phrasing layer. When the reflex track fires and the reasoning track's cue is not yet fully resolved, we emit a deliberately low-information opening phrase whose first few hundred milliseconds can be rendered from cache while the completing half finishes synthesis behind it. This is borrowed from how humans start talking before they have finished thinking, and it buys us the last slice of perceived responsiveness on the cold path. Neither of these is a clever idea on its own. They are the kind of calibration you only write after the field has taught you where the seams are.
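A minimal asyncio sketch of the hint-initiator pattern, with text standing in for cached audio and invented cue content: the expensive synthesis is started first, and the cheap opener is emitted while it resolves behind it.

```python
import asyncio

# Pre-rendered low-information openings; in the real system these are
# cached audio fragments, here plain strings stand in for them.
OPENERS = ["Next up:", "Remember:", "You were saying:"]

async def synthesize_full_cue(position: int) -> str:
    """Stand-in for the slow synthesis chain on the cold path."""
    await asyncio.sleep(0.05)
    return f"cue for beat {position}: the pricing point"

async def speak_cold_path(position: int, out: list) -> None:
    # Kick off the expensive half immediately...
    full = asyncio.create_task(synthesize_full_cue(position))
    # ...and begin "speaking" the cheap cached opener while it resolves.
    out.append(OPENERS[position % len(OPENERS)])
    out.append(await full)

spoken = []
asyncio.run(speak_cold_path(7, spoken))
# The opener lands first; the full cue follows once synthesis finishes.
```

The speaker hears sound within the opener's cache-read latency, which is what buys the last slice of perceived responsiveness the paragraph above describes.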

What we build on

We are customers of the voice AI infrastructure frontier, not competitors to it. For speech recognition we use streaming ASR from AssemblyAI and Deepgram. For text-to-speech we use Cartesia's Sonic. For inference we route across frontier LLM providers and low-latency engines, reaching for Groq's LPU when latency is the binding constraint and for larger hosted models when reasoning quality matters more. We do not train foundation models. The research frontier is moving faster than any in-house effort at our scale could keep up with, and the correct posture is to be a good customer of that frontier and to put our engineering into the parts of the product that the frontier is not going to solve for us. Sonic is an extraordinary piece of work; when it ships an update, we inherit it. That is the arrangement we want.

The shoulder

Get all of this right and the system stops feeling like software and starts feeling like a presence. A quiet, attentive co-pilot that has read your script, watched your slides, and is standing just off your shoulder with the next line ready when you need it and out of the way when you do not. The architecture above is the substrate that makes that presence possible. The calibration we are still doing with real speakers is what makes it feel inevitable once they have it. The work we think matters, long term, is the work that only shows up when you sit with a real user on a real stage and ask whether the whisper came at the right moment. That is the question the whole stack is organized around.