Most voice agents are glorified IVR systems. They follow rigid scripts, fall apart at the first unexpected input, and sound like they're reading from a teleprompter. We wanted something different — an agent that could hold a genuine sales conversation, adapt to objections, and know when to push and when to listen.
> The quality of a conversation is determined by the quality of the questions asked, not the answers given.
The technical challenge wasn't speech recognition or synthesis — those are commodity APIs now. The real problem was dialogue state management. How do you build a system that tracks where it is in a conversation, what it's learned about the prospect, and what to say next — all in real-time?
Architecture
We settled on a 4-chain architecture using LangGraph. Each chain handles a distinct concern:
- Understanding chain — parses intent, extracts entities, classifies objections
- Strategy chain — decides the next conversational move given current state
- Generation chain — produces natural language from the strategic decision
- Guard chain — validates output before sending to TTS
```typescript
import { StateGraph, START, END } from '@langchain/langgraph';

const graph = new StateGraph({
  channels: {
    // messages and objections accumulate across turns (reducer + default);
    // prospect and stage hold the latest written value
    messages: { value: (prev: any[], next: any[]) => prev.concat(next), default: () => [] },
    prospect: { value: null },
    stage: { value: null, default: () => 'opening' },
    objections: { value: (prev: any[], next: any[]) => prev.concat(next), default: () => [] },
  },
})
  .addNode('understand', understandChain)
  .addNode('strategize', strategyChain)
  .addNode('generate', generationChain)
  .addNode('guard', guardChain)
  .addEdge(START, 'understand')
  .addEdge('understand', 'strategize')
  .addEdge('strategize', 'generate')
  .addEdge('generate', 'guard')
  .addEdge('guard', END);

const app = graph.compile();
```

The key insight was separating understanding from strategy. Early prototypes combined these — the LLM both interpreted what was said and decided what to do next. This created a coupling problem: improving intent classification would sometimes degrade strategic decisions, and vice versa.
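The decoupling comes down to a typed boundary between the two chains. As a sketch (the type and function names here are illustrative, not from our codebase), the strategy chain consumes only a structured interpretation, never the raw transcript, so either side can be tuned in isolation:

```typescript
// Hypothetical types illustrating the understand → strategize boundary.
interface Understanding {
  intent: 'question' | 'objection' | 'agreement' | 'smalltalk';
  entities: Record<string, string>;        // e.g. { budget: '$5k/mo' }
  objectionType?: 'price' | 'timing' | 'authority' | 'need';
}

interface StrategyDecision {
  move: 'probe' | 'reframe' | 'empathize' | 'advance' | 'close';
  rationale: string;
}

// Strategy sees only the structured Understanding, not what was literally said.
function strategize(u: Understanding, stage: string): StrategyDecision {
  if (u.intent === 'objection' && u.objectionType === 'price') {
    return { move: 'reframe', rationale: 'anchor on value before discussing cost' };
  }
  if (u.intent === 'agreement' && stage === 'pitch') {
    return { move: 'advance', rationale: 'prospect is receptive; move toward close' };
  }
  return { move: 'probe', rationale: 'need more information' };
}
```

With this boundary in place, improving the understanding chain means improving how reliably it fills in `Understanding`; the strategy chain never needs to change in response.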
The Voice Pipeline
The voice layer runs on Pipecat, a real-time audio framework. It follows an inversion-of-control pattern that took a while to grok: the framework discovers and calls your function once per connection, not the other way around.
Two things called "runner" do completely different jobs:
- The transport runner manages WebRTC connections
- The pipeline runner orchestrates the audio processing chain
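The inversion-of-control shape is easier to see in miniature. This is not Pipecat's actual API (Pipecat is a Python framework); it's a simplified TypeScript analogue showing the part that took a while to grok — you register a function, and the framework calls it once per incoming connection:

```typescript
// Hypothetical sketch of the per-connection callback pattern.
type BotHandler = (connectionId: string) => Promise<void>;

class TransportRunner {
  private handler?: BotHandler;

  // You hand the framework a function...
  register(handler: BotHandler) {
    this.handler = handler;
  }

  // ...and the framework invokes it when a connection arrives.
  // You never call your own handler directly.
  async onConnection(connectionId: string) {
    if (this.handler) await this.handler(connectionId);
  }
}

const transport = new TransportRunner();
transport.register(async (id) => {
  // The pipeline is constructed here, *after* WebRTC negotiation has
  // completed — which is exactly where the startup-silence gap comes from.
  console.log(`building pipeline for ${id}`);
});
```

Because pipeline construction happens inside the callback, any heavy initialization you do there lands directly in the silent gap the prospect hears.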
WebRTC negotiation completes before the pipeline even exists. This means the prospect hears silence (or a holding message) while the pipeline initializes. Getting this gap under 500ms was a significant engineering challenge.
What Broke
The guard chain was supposed to catch hallucinations and off-brand responses. In practice, it caught about 60% of problems and added 200ms of latency to every turn. We ended up moving most of the guard logic into the generation prompt itself — faster, simpler, and paradoxically more reliable.
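Folding the guard into generation roughly means stating the constraints up front instead of checking them afterward. A sketch of the idea (the wording and constraint list here are hypothetical, not our production prompt):

```typescript
// Illustrative: guard rules expressed as hard constraints inside
// the generation prompt, replacing a separate validation pass.
function buildGenerationPrompt(decision: string, transcript: string): string {
  return [
    `You are a sales agent on a live call. Execute this conversational move: ${decision}.`,
    `Conversation so far:`,
    transcript,
    '',
    'Hard constraints (violations are unacceptable):',
    '- Never quote a price that does not appear in the conversation.',
    '- Never commit to features, timelines, or discounts.',
    '- Stay under 40 words; this will be spoken aloud.',
  ].join('\n');
}
```

The trade is explicit: a prompt constraint can't give a hard guarantee the way a post-hoc check can, but it costs zero extra latency and, in our experience, steered the model better than rejecting its output after the fact.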
The other surprise: stage transitions. We initially modeled conversation stages as a linear funnel (opening → discovery → pitch → close). Real conversations don't work that way. Prospects loop back, skip stages, and sometimes start at the close.
We switched to a state machine with weighted transitions — any stage can reach any other stage, but the weights bias toward forward progress. This solved the rigidity problem without losing the structure.
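The weighted transitions can be sketched in a few lines. The specific weights below are illustrative, and the per-stage `signal` input (how strongly the conversation currently looks like each stage) is a hypothetical stand-in for whatever classifier produces it:

```typescript
// Minimal sketch: any stage can reach any other, but weights bias forward.
type Stage = 'opening' | 'discovery' | 'pitch' | 'close';

const ORDER: Stage[] = ['opening', 'discovery', 'pitch', 'close'];

// Structural prior on moving from one stage to another.
function transitionWeight(from: Stage, to: Stage): number {
  const delta = ORDER.indexOf(to) - ORDER.indexOf(from);
  if (delta === 1) return 1.0;  // the natural next step
  if (delta > 1) return 0.4;    // skipping ahead: allowed, less likely
  if (delta === 0) return 0.6;  // staying put
  return 0.3;                   // looping back: allowed, least likely
}

// Combine the prior with observed evidence of where the prospect is.
function nextStage(current: Stage, signal: Record<Stage, number>): Stage {
  let best: Stage = current;
  let bestScore = -Infinity;
  for (const s of ORDER) {
    const score = transitionWeight(current, s) * (signal[s] ?? 0);
    if (score > bestScore) {
      bestScore = score;
      best = s;
    }
  }
  return best;
}
```

The bias toward forward progress comes from the prior, but a strong enough signal overrides it — a prospect who jumps straight to pricing pulls the machine to `close` regardless of where the funnel says they should be.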
Lessons
Three things we'd do differently next time:
- Start with the voice pipeline, not the dialogue logic. We built sophisticated conversation management on text, then discovered half our design assumptions didn't survive the transition to real-time audio. Latency constraints reshape everything.
- Instrument from day one. We added observability late and spent weeks debugging issues that proper tracing would have caught in minutes.
- Test with real humans earlier. Synthetic test conversations are poor predictors of how real prospects behave. The first real conversation broke three assumptions in the first thirty seconds.