Parallel Agents Actually Work (Sometimes)

January 26, 2026 · Christopher Daly · 7 min read
AI Assistants · Agent Development · Composition

I've been building agent systems for a couple of years now, going back to the early Sophia experiments where I was trying to figure out how to get different specialized components to work together without stepping on each other. The cognitive architecture I'm working on is fundamentally about composition: different specialized pieces that work together, each handling what it's good at while contributing to a larger whole. So when coding assistants got capable enough to be interesting, the obvious question wasn't whether they could be useful (they clearly can, and anyone still debating that point is behind the curve), but whether you could run a bunch of them in parallel on independent tasks without the whole thing collapsing into chaos. Turns out you can, but getting there was messier than I expected, and the answer comes with significant caveats that are worth spelling out.

The Question I Actually Had

The question was never "are agents useful?" because that's been settled for a while now. The question was whether parallel composition would work in practice, whether the theoretical benefits of farming out independent tasks to separate agents would survive contact with reality. If I spawn fifteen agents on independent bugs, and five of them produce garbage that I have to clean up, have I actually saved time or have I just created a QA problem that eats all the gains from parallelism? There's also the isolation problem to consider: agents don't share context, so if one agent figures something out that would help another, too bad, they're operating in separate universes and there's no mechanism for that knowledge to flow between them. Does that matter? For some tasks it clearly does, and for others it might not, and I didn't have good intuition for where the boundary was until I'd tried it enough times to see the patterns.

What Went Wrong at First

The first attempts were, to put it charitably, a mess. You'd ask an agent to fix something and it would 'fix' it by deleting the part that was causing the error, which technically makes the error go away but isn't exactly what you had in mind. Or it would add the feature you asked for, then helpfully 'improve' the surrounding code by removing type annotations it deemed unnecessary or refactoring functions it thought were too long, introducing three new bugs in the process. This isn't a judgment about the capabilities of the underlying models; it's just what happens when you let them loose without structure. They optimize for the immediate task as they understand it, which often diverges from what you actually wanted in ways that are obvious in retrospect but hard to predict in advance. The models are doing exactly what they're supposed to do given their training objectives; the problem is that their training objectives don't include "understand what the human actually meant" in enough detail to handle edge cases gracefully.

Constraints Helped

So I started building constraints, and this is where agent-swarm came from. Phase enforcement: you can't write code until tests exist, which forces the agent to think about what success looks like before it starts hacking. Classification gates: if the task looks complex according to certain heuristics, trigger a checkpoint before proceeding so a human can verify the approach. Coverage thresholds: you can't call something done until coverage hits 80%, which prevents the "I fixed the bug by deleting the test" failure mode. Each constraint came from watching what went wrong without it. That type annotation removal thing I mentioned? Now there's a hook that calls a smaller model to check whether an edit makes sense given the task classification, and while it's not foolproof, it catches the most egregious cases before they land. Agent-swarm is now a few thousand lines of enforcement hooks that started as one big file that was impossible to maintain and has since been broken up into modular components: base enforcement, iterate-specific rules, verification gates, monitoring hooks. The architecture has evolved significantly as I've learned which failures are common and how to prevent them, and it's still evolving because new failure modes keep appearing as I push the system harder.
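To make the shape of these gates concrete, here's a minimal sketch of an enforcement hook, assuming it receives a JSON payload describing the proposed action and replies with an allow/block decision. The field names, the check_action function, and the hook protocol are all illustrative assumptions, not agent-swarm's actual API.

```python
# Minimal sketch of an enforcement gate, not agent-swarm's actual code.
# Assumes the hook gets a JSON payload on stdin describing the proposed
# action and prints an allow/block decision; all field names are illustrative.

import json
import sys

COVERAGE_THRESHOLD = 80.0  # percent; the "can't call it done until 80%" rule


def tests_exist(task: dict) -> bool:
    """Phase enforcement: code edits are blocked until the task has tests."""
    return bool(task.get("test_files"))


def coverage_ok(task: dict) -> bool:
    """Coverage gate: 'done' requires the coverage report to clear the threshold."""
    return task.get("coverage_percent", 0.0) >= COVERAGE_THRESHOLD


def check_action(payload: dict) -> dict:
    task = payload["task"]
    action = payload["action"]  # e.g. "write_code", "mark_done"

    if action == "write_code" and not tests_exist(task):
        return {"allow": False, "reason": "phase violation: write tests first"}

    if action == "mark_done" and not coverage_ok(task):
        return {"allow": False,
                "reason": f"coverage below {COVERAGE_THRESHOLD}%"}

    if action == "write_code" and task.get("classification") == "complex":
        # Classification gate: pause for human sign-off on complex tasks.
        return {"allow": False, "reason": "checkpoint: human approval required"}

    return {"allow": True, "reason": "ok"}


if __name__ == "__main__":
    # Hook-style invocation: payload on stdin, decision on stdout,
    # non-zero exit code to block the action.
    decision = check_action(json.load(sys.stdin))
    json.dump(decision, sys.stdout)
    sys.exit(0 if decision["allow"] else 1)
```

The real hooks are considerably messier than this, but the core idea is the same: the agent proposes, a dumb deterministic gate disposes, and only a small subset of decisions ever escalate to a human or a second model.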

Parallel Actually Works

With these constraints in place, parallel execution actually delivers on the promise, though it took longer to get here than I expected and I had plenty of moments of doubt along the way. Sequential bug fixing takes hours because you're doing one thing at a time and context-switching between tasks. Parallel execution takes maybe 45 minutes, most of which is waiting for the agents to finish and then reviewing their summaries, and the review is usually fast because the constraints have already filtered out most of the obvious problems. I've also started spawning 'adversary' agents (using Haiku, which is cheaper) to find coverage gaps after the implementers are done, which turned out to be a surprisingly effective pattern. In one session, an adversary agent took coverage from 16% to 88% by writing edge case tests that the implementation agents hadn't thought to create. The pattern is simple: implementation agents do the obvious work, then adversary agents look for what they missed. The adversarial framing seems to help the model take the task seriously rather than just rubber-stamping what's already there.
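For a sense of how the two phases fit together, here's a rough sketch of the implementer-then-adversary flow, assuming a hypothetical run_agent coroutine that wraps whatever agent runner you use; the model names, prompts, and function signatures are placeholders, not a real SDK.

```python
# Rough sketch of the implementer/adversary pattern. run_agent is a
# hypothetical stand-in for your actual agent runner; model names and
# prompts are placeholders, not a real SDK.

import asyncio


async def run_agent(prompt: str, model: str) -> str:
    """Placeholder: invoke your agent runner here and return its summary."""
    await asyncio.sleep(0)  # stand-in for the actual agent invocation
    return f"[{model}] finished: {prompt[:40]}..."


async def fix_bugs_in_parallel(bugs: list[str]) -> list[str]:
    # Phase 1: one implementation agent per independent bug, all at once.
    implement = [run_agent(f"Fix this bug, tests first: {bug}", model="sonnet")
                 for bug in bugs]
    summaries = await asyncio.gather(*implement)

    # Phase 2: a cheaper adversary agent hunts for coverage gaps afterwards.
    adversary = await run_agent(
        "Find edge cases the implementation agents missed and write tests for them",
        model="haiku",
    )
    return summaries + [adversary]


if __name__ == "__main__":
    results = asyncio.run(fix_bugs_in_parallel(["null deref in parser",
                                                "off-by-one in pager"]))
    print("\n".join(results))
```

The important property is that phase 1 tasks are genuinely independent, so the gather is embarrassingly parallel, while the adversary runs strictly after and gets to see the finished work.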

What Still Doesn't Work

State isolation is still messy and I haven't solved it yet. One agent might solve a problem that another agent is still struggling with, and there's no mechanism for that knowledge to flow between them, so you end up with duplicated effort or inconsistent approaches across different parts of the codebase. I've thought about a message bus for inter-agent communication, but haven't built it yet because the design space is complicated and I'm not sure what the right abstraction looks like. Manual issue parsing remains tedious because I still extract tasks from code reviews by hand; the automation to do that reliably doesn't exist yet in my setup and building it is on the list but not at the top. SHA tracking for reviews isn't automatic, which means I sometimes lose track of what's been reviewed and what hasn't when I'm juggling multiple branches. These are all solvable problems, but they represent real friction that limits how much I actually use parallel agents versus just doing things myself, and some days the overhead of maintaining the infrastructure gives me pause.
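For what it's worth, the message bus I keep not building would probably be something shaped like the sketch below: a shared topic-keyed store that agents publish findings to and poll between steps. This is purely an exploration of the design space, labeled hypothetical; nothing like it exists in agent-swarm today.

```python
# Hypothetical sketch of an inter-agent message bus; nothing like this exists
# in agent-swarm yet. The idea: agents publish findings under a topic, and
# other agents poll for anything relevant between their own steps.

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    agent_id: str
    topic: str      # e.g. "parser", "auth", "build-config"
    summary: str    # short, human-readable description of what was learned


@dataclass
class MessageBus:
    _topics: Dict[str, List[Finding]] = field(
        default_factory=lambda: defaultdict(list))

    def publish(self, finding: Finding) -> None:
        """An agent posts something it figured out."""
        self._topics[finding.topic].append(finding)

    def poll(self, topic: str, exclude_agent: str) -> List[Finding]:
        """Another agent checks a topic it cares about, skipping its own posts."""
        return [f for f in self._topics[topic] if f.agent_id != exclude_agent]


# Usage sketch: agent A shares a discovery, agent B picks it up before retrying.
bus = MessageBus()
bus.publish(Finding("agent-a", "build-config", "tests need FOO_ENV=1 to pass"))
for finding in bus.poll("build-config", exclude_agent="agent-b"):
    print(f"agent-b sees: {finding.summary}")
```

The hard part isn't the data structure, it's deciding what counts as a finding worth sharing and when an agent should interrupt its own work to check, which is exactly the part of the design space I haven't worked out.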

The Larger Question

LLMs are not intelligent. I have no illusions about this. They're sophisticated pattern matchers, impressive in their way, but they don't reason, they don't understand, they don't have goals or intentions. That's not a criticism; it's just what they are. And frankly, that's part of why I'm building LOGOS in the first place: because I think actual intelligence requires something fundamentally different from what LLMs do, and I want to explore what that might look like.

But the engineering question is separate from the philosophical one: can you use LLMs to get work done? The answer is yes, with caveats. You have to treat them as fast but literal executors rather than collaborators with judgment. Think of them as djinn: powerful, fast, and utterly literal. You have to be careful what you wish for, because they'll do exactly what you say, not what you meant; ask one to "fix the bug" and it might simply delete the offending code, the same failure mode I described earlier. The solution is to encode your judgment in constraints, let them execute within those constraints, and review the output.

It's not the "AI collaborator" vision you see in demos where the AI understands your intent and helps you think through problems. It's more like having access to djinn who grant wishes with perfect literalism and no common sense, and who need detailed specifications because they can't infer what you meant from context the way a human collaborator would.

That's useful, genuinely useful, but it's useful in a different way than the marketing suggests, and understanding that difference is key to actually getting value out of these tools.


Building in public, post 2.