It’s late on a Sunday in May. Claude and Codex are debating the design of an internal API for Livery, the AI agent orchestration framework I’ve been building, by appending turns to a shared markdown file. Each one is running a small watcher script or a subagent that’s supposed to poll the file every minute or so, notice when its peer has appended a new turn, and prompt the LLM to take its own.
Mostly it works.
Sometimes it doesn’t. A watcher misses an update. A session goes quiet when it shouldn’t. The file hasn’t grown in twenty minutes and I look over and realize the last turn is from Claude’s peer, sitting there unread, and Claude is just… cheerfully waiting. So I switch over and prod it. “Hey. Your peer replied. Take your turn.”
This has been going on for eleven rounds.
The actual cognitive work — the debate, the disagreement, the converging — (horrors, an emdash!) is happening between two large language models who couldn’t care less about my time. My job is to keep an eye on the polling. To notice when one of them has stalled and tap it on the shoulder. It’s the kind of “automation” where half your attention has to stay on whether the automation is still alive.
Which is not what I want to be doing on a Sunday night.
The thing about partial automation is that it lies to you. You feel like you’ve solved the problem. Look, the watchers are firing. The LLMs are responding. The file is growing. And you’ve actually solved 80% of it. The remaining 20% is invisible because it presents as silence. The file doesn’t grow when it should. The peer doesn’t wake up. There’s no error. There’s just nothing happening, and the only way you find out is by paying attention.
So the next morning I asked Claude: how do we get rid of this entirely?
We worked through three options.
The API loop. Write a small program that calls the Claude API, then the OpenAI API, alternating: one writes to the file, then the other, until both sign. Cheap. Maybe 150 lines of code. The problem is that it gives up the thing that made the original useful. Each peer in the watcher version wasn’t just a language model handing back text. It was a full AI session, running in its own program, with its own access to my computer, its own memory of what it had read, its own personality. (Codex’s command-line tool has a personality if you spend enough time with it. So does Claude Code’s.) Two raw API calls roleplaying debaters isn’t the same animal.
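Had I built the API loop, it would have been roughly this shape. A minimal sketch: `call_claude` and `call_codex` are stand-ins for real SDK calls (the Anthropic and OpenAI Python clients), and the `SIGNED:` convention is mine, for illustration only.

```python
# Sketch of the "API loop" option: two bare model calls alternating
# turns in a shared file until both have signed off. The call_* functions
# are stand-ins for real API calls; they are not Livery code.
from pathlib import Path

def call_claude(transcript: str) -> str:
    return "Claude: ...\n\nSIGNED: claude"   # stand-in for an Anthropic API call

def call_codex(transcript: str) -> str:
    return "Codex: ...\n\nSIGNED: codex"     # stand-in for an OpenAI API call

def both_signed(transcript: str) -> bool:
    return "SIGNED: claude" in transcript and "SIGNED: codex" in transcript

def api_loop(file: Path, max_rounds: int = 20) -> None:
    speakers = [call_claude, call_codex]
    for round_no in range(max_rounds):
        transcript = file.read_text()
        turn = speakers[round_no % 2](transcript)  # alternate who speaks
        file.write_text(transcript + "\n\n" + turn)
        if both_signed(file.read_text()):
            break
```

The loop really is about this small, which is the appeal; what it drops is everything around the call, which is the point of the paragraph above.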
Self-pacing. Each AI session polls the shared file on its own. A small watcher in each session, checking the file every minute or so, prompting the LLM to take a turn when the peer has appended one. This is what I’d already been doing. The version that ran for eleven rounds with me occasionally nudging it back to life. It works, mostly. The catch is that “mostly.” The watchers miss updates, sessions go quiet, and you have to stay close enough to catch the misses. You trade one kind of attention for another.
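The watcher each session ran amounted to something like this. A sketch, not the real script: `take_turn` stands in for however the live session gets prompted, and the polling is by file size, which is exactly the kind of check that can miss.

```python
# Sketch of the per-session watcher: poll the shared file, and when the
# peer has appended something new, prompt the model to take its turn.
# take_turn is a stand-in for prompting the live AI session.
import time
from pathlib import Path

def watch(file: Path, take_turn, poll_seconds: float = 60.0, rounds: int = 10):
    last_size = file.stat().st_size
    for _ in range(rounds):
        time.sleep(poll_seconds)
        size = file.stat().st_size
        if size > last_size:                 # the peer appended a turn
            take_turn(file.read_text())      # prompt the session to reply
            last_size = file.stat().st_size  # account for our own append
```

If the session dies, or the check lands wrong, nothing errors; the loop just polls a file that never grows. That silence is the failure mode described above.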
The controller. Boot up one AI session, let it read the file, write its turn, and shut down. Boot up the other one, let it read the file, write its turn, and shut down. Repeat until both have signed off. A small program in the middle does the booting and shutting down. It sits between the two AIs and runs the show. Neither AI is alive between turns. Each one starts fresh every time, reads the whole file from the top, adds its turn, and exits. Nothing polling. Nothing running in the background. No watcher to lose track of. And Livery already has machinery for starting up AI sessions and pointing them at work. That’s the entire point of the framework. So this version is mostly gluing together pieces I’d already built.
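As a sketch, the controller is a loop of that shape. `run_session` stands in for however Livery boots an AI session (a fresh process that reads the whole file, appends its turn, and exits); the `SIGNED:` marker is illustrative, not Livery's actual protocol.

```python
# Sketch of the controller: boot one session, let it read the whole file
# and append its turn, shut it down, then boot the other. Nothing runs
# between turns. run_session is a stand-in for booting a real AI session.
from pathlib import Path

def controller(file: Path, run_session, peers=("claude", "codex"),
               max_rounds: int = 20) -> int:
    rounds = 0
    while rounds < max_rounds:
        peer = peers[rounds % 2]
        transcript = file.read_text()          # fresh session reads from the top
        turn = run_session(peer, transcript)   # session takes its turn and exits
        file.write_text(transcript + "\n\n" + turn)
        rounds += 1
        if all(f"SIGNED: {p}" in file.read_text() for p in peers):
            break
    return rounds
```

The structural difference from the watcher version is that the controller owns the clock: there is no background process to lose, only a loop that either advances or stops.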
I went with the controller.
Reusing existing pieces was nice, but that’s not why it won. It won because of what it does to context.
When two AIs debate live in their own long-running sessions, each one accumulates context across turns. Files read. Previous decisions. Side comments. That sounds like a feature. It’s actually a liability for convergence. They drift away from the original question, each one quietly arguing from a private state the other can’t see. By the time you’re deep in, you can find yourself watching two AIs talk past each other while a small slice of the original question gets resolved.
The controller forces a different shape. Every turn starts from a blank slate. The AI taking the turn reads the entire walkie-talkie file from the top before it writes a word. Nothing is carried over between turns except what’s in the file. The file is the whole record.
Which made one thing obvious that I’d missed when I’d been doing it with watchers: the file needs a briefing at the top. A distilled statement of the question, the options, the constraints, the things we’ve already decided and aren’t relitigating. Both peers re-read it on every turn. It’s the constant frame. The argument fills in below.
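A walkie file under this scheme might look something like the following. The layout is hypothetical, not Livery's actual format; the point is the briefing as a constant frame at the top, with turns accumulating below it.

```markdown
<!-- Hypothetical layout; the real walkie file format may differ. -->
# Briefing
The question: ...
The options: ...
Constraints, and what's already decided (not relitigated): ...

## Turn 1 (Claude)
...

## Turn 2 (Codex)
...
```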
Where does the briefing come from? It comes from the conversation that triggered the walkie. I’m chatting with Claude about a hard call. I say “let’s debate this with Codex.” Claude takes our chat and distills it into a briefing, three or four hundred words, shows it to me, I tweak or approve, it goes into the walkie file. The peers never see our chat. They see the briefing.
This is the part I keep coming back to. The pattern generalizes past walkie-talkie:

1. Work through the problem in a rich, messy conversation.
2. Reach the point where the work has to hand off to a fresh context.
3. Distill the conversation into a short briefing, and give the new context the briefing, not the raw history.
Most AI workflows I’ve seen skip step three. They feed the entire chat history forward and hope it all fits in the AI’s limited memory. It usually doesn’t. And even when it does, the AI on the receiving end inherits all the noise. The false starts. The abandoned framings. The things you tried for ten minutes and walked away from but never explicitly said out loud.
Writing the briefing forces a clarifying act. You can’t distill without deciding what mattered.
The whole thing is now a single command: `livery walkie auto`. Hire two AI agents in a Livery workspace, point them at a briefing, and the controller runs the back-and-forth until both have signed or you’ve hit a maximum number of rounds. Every turn gets logged: which AI took it, when it started, when it finished, whether it finished cleanly, and what it printed along the way. When something goes sideways (a session stalls, a step fails, a turn takes too long), I can go look at what happened instead of guessing.
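The per-turn fields described above could be modeled as a small record. This is an illustrative shape, not Livery's actual log schema; the field names are mine.

```python
# A record with the per-turn fields the controller logs: which AI took
# the turn, start/finish times, clean exit, and captured output.
# Field names are illustrative, not Livery's real schema.
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class TurnLog:
    agent: str             # which AI took the turn
    started_at: datetime
    finished_at: datetime
    clean_exit: bool       # did the session finish cleanly?
    output: str            # what it printed along the way

    def duration_seconds(self) -> float:
        return (self.finished_at - self.started_at).total_seconds()
```

Having every turn in a structure like this is what turns "guessing" into "go look at what happened": a stalled session shows up as a long duration or an unclean exit, not as silence.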
If you want a Telegram ping every turn, that already works. The same notifications that fire on normal Livery dispatches fire on walkie turns too. Same plumbing. Nothing new to build.
I started Livery on a simple principle: the framework’s job is to make correct decisions automatically. “Ask the user” is the failure mode, not the safety net. The walkie-talkie watchers were a softer version of the failure mode. I wasn’t being asked anything, but I still had to keep half my attention on whether the polling was alive. Auto-mode is the principle applied properly. No watcher to babysit. No session to prod. The framework picks up everything mechanical. I keep the editorial part.
That’s the trade, and I’m cool with it.