Building a Voice Sales Agent with LangGraph
Deep dive into 4-chain dialogue architecture for conversational agents. How we built it, what broke, and what we'd do differently.
From state machine to spoken conversation
Most voice agents are glorified IVR systems. They follow rigid scripts, fall apart at the first unexpected input, and sound like they're reading from a teleprompter. We wanted something different — an agent that could hold a genuine sales conversation, adapt to objections, and know when to push and when to listen.
The quality of a conversation is determined by the quality of the questions asked, not the answers given.
The technical challenge wasn't speech recognition or synthesis — those are commodity APIs now. The real problem was dialogue state management. How do you build a system that tracks where it is in a conversation, what it's learned about the prospect, and what to say next — all in real time?
We settled on a 4-chain architecture using LangGraph. Each chain handles one distinct concern: understanding what was said, deciding strategy, generating the response, and guarding the output.
```typescript
const graph = new StateGraph({
  channels: {
    messages: { value: [] },
    prospect: { value: null },
    stage: { value: 'opening' },
    objections: { value: [] },
  },
})
  .addNode('understand', understandChain)
  .addNode('strategize', strategyChain)
  .addNode('generate', generationChain)
  .addNode('guard', guardChain)
  .addEdge('understand', 'strategize')
  .addEdge('strategize', 'generate')
  .addEdge('generate', 'guard');
```

The key insight was separating understanding from strategy. Early prototypes combined these: the LLM both interpreted what was said and decided what to do next. This created a coupling problem — improving intent classification would sometimes degrade strategic decisions, and vice versa.
The voice layer runs on Pipecat, a real-time audio framework. Its inversion-of-control pattern took a while to grok: the framework discovers and calls your bot function once per connection, not the other way around.
Two things called "runner" do completely different jobs: Pipecat's development runner is the small server that handles transport setup and invokes your bot entrypoint once per connection, while PipelineRunner is the object that actually executes the frame-processing pipeline inside that entrypoint.
WebRTC negotiation completes before the pipeline even exists. This means the prospect hears silence (or a holding message) while the pipeline initializes. Getting this gap under 500ms was a significant engineering challenge.
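Most of that gap came from doing per-connection setup steps sequentially. A sketch of the fix, with `negotiateWebRtc` and `buildPipeline` as hypothetical stand-ins for the real work: run both concurrently so the perceived delay is the slower of the two steps, not their sum.

```typescript
// Sketch: overlap pipeline construction with WebRTC negotiation.
// `negotiateWebRtc` and `buildPipeline` are hypothetical stand-ins
// for the real per-connection setup work.

type Pipeline = { ready: boolean };

const delay = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function negotiateWebRtc(): Promise<void> {
  await delay(30); // stands in for SDP exchange, ICE, DTLS handshake
}

async function buildPipeline(): Promise<Pipeline> {
  await delay(40); // stands in for loading prompts, warming LLM/TTS clients
  return { ready: true };
}

// Sequential awaits would cost the sum of both steps; Promise.all costs
// only the slower one, which is what closes most of the startup gap.
async function connect(): Promise<Pipeline> {
  const [, pipeline] = await Promise.all([negotiateWebRtc(), buildPipeline()]);
  return pipeline;
}

connect().then((p) => console.log(p.ready)); // prints "true"
```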
The guard chain was supposed to catch hallucinations and off-brand responses. In practice, it caught about 60% of problems and added 200ms of latency to every turn. We ended up moving most of the guard logic into the generation prompt itself — faster, simpler, and paradoxically more reliable.
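What that migration looks like in sketch form (the guardrail rules and regex patterns are invented for illustration): render the rules directly into the system prompt, and keep only a near-zero-cost check for the few hard failures still worth catching after generation.

```typescript
// Sketch: guardrails rendered into the generation prompt, plus a cheap
// residual check. The rules and patterns below are invented examples.

const guardrails = [
  'Never quote a specific price; route pricing questions to the quote flow.',
  'Never claim integrations that are not on the approved list.',
  'Stay on the sales topic; hand support issues to the support queue.',
];

function buildSystemPrompt(base: string): string {
  return `${base}\n\nHard rules:\n${guardrails.map((g) => `- ${g}`).join('\n')}`;
}

// The few failure modes worth checking after generation, at ~0ms cost.
const hardFailures: RegExp[] = [/\$\d/, /guarante+d/i];

function cheapGuard(reply: string): boolean {
  return !hardFailures.some((re) => re.test(reply));
}

console.log(cheapGuard('We can put together a quote for you.')); // prints "true"
```

This trades a second model call per turn for a longer prompt, which is where both the latency win and the reliability win came from.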
The other surprise: stage transitions. We initially modeled conversation stages as a linear funnel (opening → discovery → pitch → close). Real conversations don't work that way. Prospects loop back, skip stages, and sometimes start at the close.
We switched to a state machine with weighted transitions — any stage can reach any other stage, but the weights bias toward forward progress. This solved the rigidity problem without losing the structure.
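A sketch of the weighted-transition idea (the weights and scoring scheme are illustrative, not our tuned values): every stage can reach every other, the understand step scores candidate stages from the last turn, and the prior weights bias the combined score toward forward progress.

```typescript
type Stage = 'opening' | 'discovery' | 'pitch' | 'close';

const stages: Stage[] = ['opening', 'discovery', 'pitch', 'close'];

// Prior transition weights: any stage can reach any other, but forward
// moves are favored. Values here are illustrative, not tuned.
const weights: Record<Stage, Record<Stage, number>> = {
  opening:   { opening: 0.2, discovery: 1.0, pitch: 0.4, close: 0.1 },
  discovery: { opening: 0.2, discovery: 0.5, pitch: 1.0, close: 0.3 },
  pitch:     { opening: 0.1, discovery: 0.6, pitch: 0.5, close: 1.0 },
  close:     { opening: 0.1, discovery: 0.4, pitch: 0.6, close: 1.0 },
};

// `signals` are per-stage scores from the understand step (how strongly the
// last utterance suggests each stage); combine them with the prior weights.
function nextStage(current: Stage, signals: Record<Stage, number>): Stage {
  let best: Stage = current;
  let bestScore = -Infinity;
  for (const s of stages) {
    const score = weights[current][s] * signals[s];
    if (score > bestScore) {
      bestScore = score;
      best = s;
    }
  }
  return best;
}

// A strong enough buying signal mid-discovery jumps straight to close.
console.log(nextStage('discovery', { opening: 0, discovery: 0.1, pitch: 0.2, close: 0.9 })); // prints "close"
```

With ambiguous signals the forward bias wins (e.g. `opening` with flat scores moves to `discovery`), which preserves the funnel structure without ever making a jump impossible.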
Three things we'd do differently next time: