26,000 Lines of Code I Couldn't Explain

Four days, three AI agents, and 26,000 lines of code. 858 tests passing. A 4-chain LangGraph dialogue system with SPIN selling methodology, PDF ingestion, email integration, and a WebRTC voice pipeline. I couldn't explain how a single API endpoint worked.

I sat in front of the codebase and tried to trace a request from the browser to the server. I couldn't. Not because the code was obfuscated or badly written — but because I'd never read it. Three AI agents had written every line, and I'd accepted every line, and somewhere in the middle of all that accepting I'd confused output with understanding.

Code that you can't explain in an interview is worse than no code.

I'd already written about this problem in the abstract. The Acceleration Paradox argued from research that speed without understanding is a liability. Then I proved it on myself.

This is that story.

The Machine

In early March 2026, I built a multi-agent coding rig based on Jeffrey Emanuel's Agent Flywheel. The setup: a Contabo VPS running Named Tmux Manager (NTM), spawning Claude and Codex instances in parallel terminal panes. Agent Mail handled inter-agent messaging. Beads tracked tasks. I pointed the whole thing at a single goal: build a voice-enabled sales agent.

The Flywheel marketing gives the impression of a self-coordinating agent swarm. The reality is a collection of CLI tools that you orchestrate manually. NTM opens terminal panes. Agent Mail lets agents declare file reservations — but there's no enforcement. You assign work, monitor progress, handle conflicts, and merge results. The "flywheel" effect comes from the tools working well together, not from autonomous coordination.

I didn't fully grasp this at the time. I was too busy watching code appear.

The Identity Crisis

The first real problem surfaced on day two. After running for a few hours, agents would hit context limits and undergo compaction — the process where the LLM compresses its conversation history to fit new context. Post-compaction, agents forgot who they were. An agent assigned to work on authentication would wake up and start modifying the frontend. Another would duplicate work a teammate had already finished.

I vibe-coded a fix: a three-layer identity persistence system. Agent Mail registration gave each agent a memorable name — GreenCastle, BlueLake. An IDENTITY.md file defined their role and decision-making authority. A MEMORY.md file accumulated learnings across sessions. A session protocol in the main CLAUDE.md instructed agents to reload their identity files after compaction.

The tmux window ID became the lookup key — the one thing that survived compaction. It worked. Agents would come back from compaction, read their identity files, and resume their assigned work.
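A minimal sketch of that lookup, assuming a hypothetical layout where each agent has its own folder holding IDENTITY.md and MEMORY.md, and a registry that maps tmux window IDs to agent names. The function name, the registry shape, and the directory layout are illustrative assumptions, not the actual Agent Mail or NTM interfaces.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AgentIdentity:
    name: str    # memorable name, e.g. "GreenCastle"
    role: str    # contents of IDENTITY.md
    memory: str  # contents of MEMORY.md ("" if none yet)


def restore_identity(window_id: str, registry: dict[str, str], root: Path) -> AgentIdentity:
    """Reload an agent's identity after compaction, keyed by tmux window ID.

    `registry` maps window IDs to agent names (hypothetical shape).
    `root` is a hypothetical directory with one sub-folder per agent.
    """
    name = registry[window_id]  # KeyError means the window was never registered
    agent_dir = root / name
    role = (agent_dir / "IDENTITY.md").read_text(encoding="utf-8")
    memory_file = agent_dir / "MEMORY.md"
    # MEMORY.md accumulates over sessions, so it may not exist on the first one.
    memory = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    return AgentIdentity(name=name, role=role, memory=memory)
```

The window ID itself could be read with something like `tmux display-message -p '#{window_id}'`. The point of the design is that the key comes from the terminal, not from the agent's own context, so compaction cannot erase it.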

I solved a real systems design problem. The identity persistence architecture was sound. But I solved it the same way I was building everything else — by telling an agent what I wanted and accepting what it produced. I didn't write the session protocol. I didn't design the three-layer persistence model from first principles. I described the problem, and an agent designed the solution, and I said "yes" and moved on.

The Output

Four days in, the codebase had:

26,000 lines of code
858 passing tests
A 4-chain LangGraph dialogue system (Understanding → Strategy → Generation → Guard)
SPIN selling methodology encoded as a state machine
PDF ingestion for product knowledge
Email integration
WebRTC voice pipeline via Pipecat and Daily.co
Full component stack with TypeScript types

On paper, this was a production-ready sales agent. In practice, it was a collection of components that had never been connected by a human who understood them.

The Uncomfortable Truth

The agents built the beginning and the end, but not the middle.

Each component worked in isolation. The understanding chain parsed intent correctly. The strategy chain made reasonable conversational decisions. The generation chain produced natural-sounding responses. The guard chain caught most hallucinations. The tests proved all of this.
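The four-stage flow can be pictured as a chain of functions passing a shared state. The sketch below uses plain Python, not LangGraph, and every function body is a placeholder assumption. It shows only the shape Understanding → Strategy → Generation → Guard, not the real chains.

```python
from typing import Callable, TypedDict


class TurnState(TypedDict, total=False):
    user_text: str
    intent: str
    strategy: str
    draft: str
    reply: str


Stage = Callable[[TurnState], TurnState]


def understanding(state: TurnState) -> TurnState:
    # Placeholder intent detection: a real chain would call an LLM.
    intent = "question" if state["user_text"].rstrip().endswith("?") else "statement"
    return {**state, "intent": intent}


def strategy(state: TurnState) -> TurnState:
    # Placeholder policy: answer questions, otherwise probe further.
    return {**state, "strategy": "answer" if state["intent"] == "question" else "probe"}


def generation(state: TurnState) -> TurnState:
    draft = "Here is what I know." if state["strategy"] == "answer" else "Can you tell me more?"
    return {**state, "draft": draft}


def guard(state: TurnState) -> TurnState:
    # Placeholder guard: block empty drafts; a real guard would check claims.
    reply = state["draft"] if state["draft"].strip() else "Sorry, could you repeat that?"
    return {**state, "reply": reply}


PIPELINE: list[Stage] = [understanding, strategy, generation, guard]


def run_turn(user_text: str) -> TurnState:
    state: TurnState = {"user_text": user_text}
    for stage in PIPELINE:
        state = stage(state)
    return state
```

Even in this toy version, each stage can pass its own tests while the keys one stage writes and the next one reads remain an unstated contract between them.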

What was missing: the connective tissue. Config management was scattered across environment variables, hardcoded strings, and YAML files — three different systems that nobody had deliberately chosen. Error handling followed no consistent pattern. Modules referenced each other through implicit contracts that would break the moment you moved a file. The voice pipeline initialized after WebRTC negotiation completed, creating a silence gap that nobody had designed around — it just happened to be short enough that I hadn't noticed.
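One way to replace three accidental config systems with a single deliberate one is a typed settings object loaded in one place. This is a sketch under assumptions: the field names and environment variable names (`DAILY_API_KEY`, `STT_MODEL`, `VOICE_TIMEOUT_S`) are hypothetical and not taken from the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    daily_api_key: str      # hypothetical required secret
    stt_model: str          # hypothetical optional setting
    voice_timeout_s: float  # hypothetical optional setting


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read every setting from one source and fail loudly on gaps."""
    required = ("DAILY_API_KEY",)
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return Settings(
        daily_api_key=env["DAILY_API_KEY"],
        stt_model=env.get("STT_MODEL", "default"),
        voice_timeout_s=float(env.get("VOICE_TIMEOUT_S", "10")),
    )
```

Passing the mapping in as a parameter keeps the loader testable without touching the real environment, and the frozen dataclass makes it harder for a module to quietly invent its own setting.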

I wrote a document called HOW-IT-ACTUALLY-WORKS.md to explain the system to myself. The honest summary landed on a single sentence that I still think about:

You are the orchestrator. Tools reduce friction, not remove human orchestration.

The Flywheel tools reduced the friction of spawning agents, managing tmux sessions, and passing messages. They did not remove the need for a human who understands the architecture, makes integration decisions, and catches the gaps between components. I'd been treating velocity as a proxy for progress, and velocity had produced 26,000 lines of code and zero architectural understanding.

Mario Zechner, creator of the Pi coding agent harness, has a way of framing this that stung when I heard it:

"So you get enterprise grade complexity within two weeks with just two humans and 10 agents. Congratulations."

That was exactly what had happened. Enterprise-grade complexity, startup-grade comprehension.

The Pivot

I stopped coding and went back to school.

Not literally. But I stopped producing and started learning. I enrolled in Full Stack Open to build a proper foundation in React, TypeScript, and web fundamentals. I started the Odin Project for JavaScript foundations. I began reading framework documentation instead of asking agents to summarise it for me.

There's a cognitive science analogy that captures what this felt like. When you hear a symphony, you experience the whole — the melody, the harmony, the emotional arc. You can recognise it, hum along, feel something. But you can't reproduce it. You can't sit at a piano and play the violin part, because recognising music and reading sheet music are entirely different cognitive abilities. One is pattern matching. The other is structural understanding.

I'd been pattern-matching my way through 26,000 lines of code. Recognising that the output looked right. Never building the structural understanding that would let me reproduce it, modify it, or explain it.

Building the Design Methodology

Before writing another line of code, I built the process I should have started with.

I created a DESIGN-FRAMEWORK.md that codified how to approach system design: data model first, permission model second, core user journeys third, system boundaries fourth. I adopted C4 diagramming — Context, Container, Component, Code — as a way to think about architecture at progressive levels of detail. I fell into a rabbit hole learning to render Mermaid diagrams and building a viewer that could display C4 diagrams inline. The rabbit hole was worth it: I now had a visual language for describing systems before building them.

I wrote 12 Architecture Decision Records for the salesbot — each one forcing me to articulate why I was choosing a technology, not just what. Why Pipecat over LiveKit. Why Daily.co for WebRTC transport. Why Cartesia for text-to-speech instead of ElevenLabs. Each ADR was a comprehension gate: if I couldn't explain the trade-offs in my own words, I didn't understand the decision well enough to make it.

The comprehension gates became the core discipline. At three points in the workflow — Design-to-Plan, Plan-to-Build, Build-to-Next-Version — I had to pass a test: can I explain this without reading the document? If not, I stop and fill the gap before moving forward.
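The gate rule is simple enough to write down. The sketch below is illustrative only: the `Gate` names mirror the three transitions above, while the `can_explain_unaided` flag and the `open_gaps` list are hypothetical fields for recording the result of the self-test.

```python
from dataclasses import dataclass, field
from enum import Enum


class Gate(Enum):
    DESIGN_TO_PLAN = "Design-to-Plan"
    PLAN_TO_BUILD = "Plan-to-Build"
    BUILD_TO_NEXT_VERSION = "Build-to-Next-Version"


@dataclass
class GateCheck:
    gate: Gate
    can_explain_unaided: bool  # explained without reading the document?
    open_gaps: list[str] = field(default_factory=list)

    def may_advance(self) -> bool:
        # Advance only when the explanation succeeded and no gaps remain.
        return self.can_explain_unaided and not self.open_gaps
```

For example, `GateCheck(Gate.PLAN_TO_BUILD, True, ["why Daily?"]).may_advance()` is false: one unanswered question is enough to stop the workflow.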

This was the direct opposite of the Flywheel approach, where I'd been filling gaps by throwing more agents at them.

V0

I threw out the 26,000-line codebase and started from zero.

Not recklessly. I kept the architectural insights — the C1 context diagram, the C2 container boundaries, the understanding of how Pipecat's inversion-of-control pattern works. But every line of code in the new build would be written fresh, and I would understand every line before it shipped.

The scope reduction was radical. The original codebase had SPIN selling, PDF ingestion, email integration, a guard chain, and a weighted state machine for conversation stage transitions. V0 would do one thing: connect, speak, hear a response. That's it. Three actions. Nothing else.

I watched the Pipecat conference talk and built a reference document explaining the framework's two-layer model — transport layer (WebRTC, audio codecs, real-time streams) and pipeline layer (processors that transform frames of audio, text, and metadata). I documented the gotchas I found before writing code: the runner's inversion-of-control pattern, the fact that two things called "runner" serve completely different purposes, the WebRTC negotiation completing before the pipeline exists.
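The pipeline layer can be pictured as a chain of processors, each taking a frame and emitting a frame. The sketch below is plain Python written for illustration. The class names are assumptions and deliberately not Pipecat's real API, and the stub stages stand in for speech-to-text, the language model, and text-to-speech.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Frame:
    kind: str     # "audio" or "text"
    payload: str  # stand-in for raw audio bytes or text


class Processor(Protocol):
    def process(self, frame: Frame) -> Frame: ...


class StubSTT:
    def process(self, frame: Frame) -> Frame:
        # Pretend to transcribe: audio in, text out.
        return Frame("text", frame.payload.removeprefix("audio:"))


class StubLLM:
    def process(self, frame: Frame) -> Frame:
        # Pretend to respond: text in, text out.
        return Frame("text", f"You said: {frame.payload}")


class StubTTS:
    def process(self, frame: Frame) -> Frame:
        # Pretend to synthesise: text in, audio out.
        return Frame("audio", f"audio:{frame.payload}")


def run_pipeline(processors: list[Processor], frame: Frame) -> Frame:
    # In a real system the transport layer would deliver `frame`
    # and play back the frame this function returns.
    for processor in processors:
        frame = processor.process(frame)
    return frame
```

Calling `run_pipeline([StubSTT(), StubLLM(), StubTTS()], Frame("audio", "audio:hello"))` walks one frame through all three stages. The transport layer never appears in this code, which is the separation the two-layer model describes.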

Then I designed. DESIGN.md grew section by section: C1 context showing the browser, the bot server, and the three external services (Daily, Deepgram, Cartesia). C2 containers showing the frontend SPA, the Pipecat backend, and their communication patterns. C3 components showing every processor in the voice pipeline.

Twelve ADRs forced understanding of every decision. Three nested loops governed the work:

Loop 1 — Design (outer): Generate architecture for the current version. Gate out when Richard can explain what each component does and why.

Loop 2 — Planning (middle): Break the design into sequenced milestones. Gate out when the first task is actionable.

Loop 3 — Implementation (inner): Build task by task. Discoveries feed back into design. Blockers update the plan. Every implementation parity-checked against the design: does what I built match what I designed?

The constraint governing all three loops: VOX must not outrun Richard. If the AI generates a design I can't explain, it's failed. Pace is set by human understanding, not AI output speed.

Where It Stands

Milestone 0.0 (Design): complete. Ten ADRs, C3 diagrams, interface contracts.

Milestone 0.1 (Frontend): complete. Seven components — Navbar, DualWaveform, TranscriptPanel, StatusTabs, PipelineProgress, ConnectionBar, and a mobile variant. A neo-brutalist theme merging the arlabs.tech design language with a "Dark Instrument" aesthetic I prototyped through design probes. Not generated by an agent and accepted without review. Designed, reviewed, and understood.

Milestone 0.2 (Voice Loop): complete. End-to-end voice roundtrip working locally — speak into the mic, Deepgram transcribes, OpenAI responds, Cartesia synthesises, audio plays in the browser. DailyTransport handles WebRTC. I can trace a request from the Connect button click through RTVI client negotiation, Daily room creation, pipeline assembly, frame processing, and audio playback. I can explain it without reading notes.

Milestone 0.3 (Deploy to VPS): active. Nine sub-milestones mapping the path from local demo to salesbot.arlabs.tech serving live traffic.

The pace is slower. Orders of magnitude slower than the Flywheel. Four days produced 26,000 lines I couldn't explain. Several weeks have produced perhaps 2,000 lines I can explain completely. But the codebase is clean. The architecture is documented. The tests test what matters. And when something breaks, I know where to look, because I designed the system and an agent didn't.

The Close

The experience produced a project I didn't expect: arch-eng/, a repository of engineering methodology. Design frameworks, C4 diagramming guidelines, BDD patterns, specification templates. The kind of process documentation that seems bureaucratic until you've shipped 26,000 lines of mystery code and spent weeks trying to figure out what you built.

The Flywheel wasn't wasted. Wrong-path evidence is still evidence. Every architectural insight in the V0 design (the two-layer transport and pipeline model, the inversion-of-control runner pattern, the comprehension gates) exists because I ran the experiment and documented what failed. The 26,000 lines taught me more about what not to do than any tutorial could.

There's a broader argument here about AI coding tools, and I made that argument in The Acceleration Paradox. But arguments from research are abstractions. This is what it actually feels like: the moment you realise that everything you've built is hollow, that you've been moving fast and arriving nowhere, and that the only fix is to go back and build the engineering judgment you skipped.

The correction mechanism is slow. It's supposed to be.

The salesbot isn't finished. But for the first time, I can explain every line that exists.