I Use AI, and I'm Not Sorry

February 13, 2026 · Christopher Daly · 6 min read
Process · AI Tools · Building in Public

Series: Building in Public, Part 3 of 3

I use AI agents. A lot of them. They write code, run tests, explore research papers, draft blog posts that I rewrite until they sound like me. If you've read the other posts in this series and assumed a team was behind this, it's one person and a swarm of agents.

I describe a component, say a memory promotion mechanism with TTL-based expiration and salience gating, and agents write the code. They scaffold tests, search for papers, catch mistakes. They're good at the mechanical work of software engineering and they let me move at a pace that would otherwise require a team I don't have. I'm a solo researcher building a system that spans five services, three databases, a knowledge graph, an embedding pipeline, and a perception system. Agents are how that's possible.

The architectural decisions are mine. The hypothesis that an LLM is a codec and not a mind, the three-tier memory with salience-based filtering, the idea that a knowledge graph should grow its own ontology, the bet on unified embeddings. Agents implement those ideas. They don't originate them, and they've hallucinated entire subsystems I didn't ask for often enough that I've learned to keep a close eye on them. They are tools, and like any tool, they're useful to the extent that you know what you're building.

Approximating Determinism

The interesting problem isn't whether to use AI agents. It's how to get deterministic quality out of a non-deterministic process. LLMs don't produce the same output twice. They hallucinate. They lose track of context. They confidently implement the wrong thing and tell you it's correct. If you just let them run, you get chaos, which is what happened to me early on and what I wrote about in an earlier post.

I've built two systems to deal with this. The first, agent-swarm, constrains individual agents. The second, parallel-orchestration, coordinates groups of them.

Constraining Individual Agents

agent-swarm enforces workflows at the tool level. When an agent gets a task, a classifier determines whether it's trivial, simple, or complex before any work begins. Trivial tasks execute immediately. Simple tasks get a confirm-implement-verify cycle. Complex tasks route through a full workflow with phase gates: you don't move from planning to implementation without approval, and you don't move from implementation to completion without passing verification. This is enforced by hooks that intercept tool calls, not by instructions the agent can ignore.
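
To make that concrete, here's a minimal sketch of what a phase gate can look like when it's enforced in code rather than in a prompt. The `Phase`/`PhaseGate` names and the tool lists are mine, not agent-swarm's actual interface.

```python
from enum import Enum, auto

class Phase(Enum):
    PLANNING = auto()
    IMPLEMENTATION = auto()
    VERIFICATION = auto()
    COMPLETE = auto()

# Illustrative mapping of which tools an agent may use in each phase.
ALLOWED_TOOLS = {
    Phase.PLANNING: {"read_file", "search"},
    Phase.IMPLEMENTATION: {"read_file", "search", "write_file", "run_command"},
    Phase.VERIFICATION: {"read_file", "run_command"},
}

class PhaseGate:
    """Intercepts tool calls, so the agent cannot skip ahead by ignoring instructions."""

    def __init__(self):
        self.phase = Phase.PLANNING
        self.approvals = set()

    def approve(self, phase: Phase):
        """A human (or the orchestrator) signs off on leaving a phase."""
        self.approvals.add(phase)

    def advance(self):
        if self.phase not in self.approvals:
            raise PermissionError(f"cannot leave {self.phase.name}: no approval recorded")
        self.phase = Phase(self.phase.value + 1)

    def check_tool_call(self, tool_name: str):
        """Called by the hook before every tool invocation."""
        if tool_name not in ALLOWED_TOOLS.get(self.phase, set()):
            raise PermissionError(f"{tool_name} is not allowed during {self.phase.name}")
```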

The orchestration layer builds a task queue from whatever input it gets, whether that's a spec, a set of requirements, or a code review with comments. Each task is a self-contained work order with exact interfaces, expected behavior, and explicit file paths. The orchestrator makes architectural decisions; subagents execute them. This is a deliberate constraint. Early on I let agents make design choices, and they'd optimize locally in ways that broke the larger system. Now they get opinionated instructions and work within those boundaries.
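
A work order in this scheme is just structured data the orchestrator fills in and a subagent consumes. A hypothetical sketch (field names are mine, not agent-swarm's schema):

```python
from dataclasses import dataclass, field

@dataclass
class WorkOrder:
    """A self-contained task: a subagent should need nothing beyond this to start."""
    task_id: str
    description: str                                   # expected behavior, stated explicitly
    interface: str                                     # exact signatures to implement against
    files: list[str] = field(default_factory=list)     # explicit paths the agent may touch
    depends_on: list[str] = field(default_factory=list)
    status: str = "pending"                            # pending -> in_progress -> done

order = WorkOrder(
    task_id="memory-promotion-01",
    description="Promote working-memory items to the long-term store when salience exceeds the threshold.",
    interface="def promote(item: MemoryItem, salience: float) -> bool",
    files=["memory/promotion.py", "tests/test_promotion.py"],
)
```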

Each task group gets its own worktree on a separate branch. Agents work in isolation, push when done, and the orchestrator opens a PR. If an agent dies or gets stuck, the task resets to pending and gets dispatched again. No special retry logic, just the same workflow from the top.
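
The dispatch loop is deliberately unremarkable. A sketch, assuming a `run_agent` callable that returns success or failure; both the name and the worktree layout are placeholders:

```python
import subprocess

def dispatch(order, run_agent):
    """Give a task its own worktree on its own branch; on failure, reset and requeue."""
    branch = f"task/{order.task_id}"
    worktree = f".worktrees/{order.task_id}"
    subprocess.run(["git", "worktree", "add", "-b", branch, worktree], check=True)

    order.status = "in_progress"
    try:
        if run_agent(order, cwd=worktree):
            subprocess.run(["git", "push", "origin", branch], cwd=worktree, check=True)
            order.status = "done"       # the orchestrator opens the PR from here
        else:
            order.status = "pending"    # same workflow from the top, no special retry path
    except Exception:
        order.status = "pending"        # agent died or got stuck: requeue
```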

Coordinating Groups of Agents

parallel-orchestration handles the case where you have ten or fifteen independent tasks and want them all running at once. You define a YAML manifest that specifies each task: what it does, where the code goes, where the tests go, minimum test count, and what it depends on. The system resolves the dependency graph, spawns TDD subagents for everything that's unblocked, and monitors completion.
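
The manifest is plain data, and deciding what's unblocked is a small graph problem. A sketch with a made-up two-task manifest; the field names approximate the description above, not parallel-orchestration's actual schema:

```python
import yaml

MANIFEST = """
tasks:
  - id: embeddings
    description: unified embedding pipeline
    code: services/embeddings/pipeline.py
    tests: tests/test_pipeline.py
    min_tests: 8
    depends_on: []
  - id: salience-filter
    description: salience-based memory filtering
    code: services/memory/salience.py
    tests: tests/test_salience.py
    min_tests: 12
    depends_on: [embeddings]
"""

def unblocked(manifest: dict, done: set[str]) -> list[dict]:
    """Tasks whose dependencies are all complete and that aren't done themselves."""
    return [
        t for t in manifest["tasks"]
        if t["id"] not in done and all(dep in done for dep in t["depends_on"])
    ]

manifest = yaml.safe_load(MANIFEST)
ready = unblocked(manifest, done=set())   # spawn a TDD subagent for each of these
```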

Each subagent follows a strict TDD sequence: write tests first, run them to confirm they fail, implement the code, run tests again, commit only when everything passes. This isn't a suggestion. Gateway conditions enforce it. The system checks that branches exist, that enough test functions are written, that all tests pass, that branches merge cleanly. If a subagent fails, it gets retried with the error context from the previous attempt so it can fix what went wrong rather than starting blind.
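
Those gateway conditions reduce to checks the orchestrator can run itself. A sketch of what "commit only when everything passes" looks like as code rather than instruction; the paths, thresholds, and two-checkout layout are illustrative:

```python
import re
import subprocess

def gateway_passed(branch: str, test_file: str, min_tests: int,
                   worktree: str, main_checkout: str) -> bool:
    """Prerequisite checks a subagent's work must satisfy before it counts as done."""
    # 1. The branch actually exists.
    if subprocess.run(["git", "rev-parse", "--verify", branch],
                      cwd=worktree, capture_output=True).returncode != 0:
        return False

    # 2. Enough test functions were written (counts top-level `def test_...` only).
    with open(f"{worktree}/{test_file}") as f:
        if len(re.findall(r"^def test_", f.read(), re.MULTILINE)) < min_tests:
            return False

    # 3. All tests pass inside the worktree.
    if subprocess.run(["pytest", test_file], cwd=worktree).returncode != 0:
        return False

    # 4. The branch merges cleanly into the integration branch (dry run, then abort).
    merged = subprocess.run(["git", "merge", "--no-commit", "--no-ff", branch],
                            cwd=main_checkout, capture_output=True)
    subprocess.run(["git", "merge", "--abort"], cwd=main_checkout, capture_output=True)
    return merged.returncode == 0
```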

Once all tasks complete, the system merges branches in dependency order and runs the full test suite. If verification fails, you get options: fix and retry, rollback, or continue anyway. The whole pipeline, from a design document through planning, manifest generation, parallel orchestration, to branch completion, runs as a single invocation with checkpoints where a human can intervene.
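
The final assembly is the same pattern one level up. A rough sketch of the merge-and-verify step, assuming the dependency-ordered branch list from the manifest; the interactive prompt stands in for the real checkpoint:

```python
import subprocess

def merge_and_verify(branches_in_dep_order: list[str]) -> None:
    """Merge completed branches in dependency order, then run the full suite once."""
    for branch in branches_in_dep_order:
        subprocess.run(["git", "merge", "--no-ff", branch], check=True)

    if subprocess.run(["pytest"]).returncode == 0:
        return                                        # clean: ready to open the PR

    # Verification failed: surface the choice instead of deciding silently.
    choice = input("full suite failed: [f]ix and retry / [r]ollback / [c]ontinue anyway? ")
    if choice == "r":
        subprocess.run(["git", "reset", "--hard", "origin/main"], check=True)
    # "f" re-enters the fix loop; "c" proceeds with the failure on record.
```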

Adversary Agents

The piece I think has the most potential is adversarial verification. After an implementation agent finishes, a separate agent reviews the work with an explicitly adversarial mandate: find gaps in test coverage, find side effects, find things that look correct but aren't. The adversary doesn't care about style or convention. It's looking for substantive problems, and because it's a different agent with a different context window, it's not subject to the same blind spots as the implementer.
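
The adversary is just a second pass with a different mandate and a fresh context window. A sketch of how that mandate might read, assuming a hypothetical `call_llm` helper:

```python
ADVERSARY_MANDATE = """You are reviewing another agent's work. Your job is to find problems.
Do not comment on style or convention. Look for:
- behavior the tests never exercise (edge cases, error paths, concurrency)
- side effects: files, global state, or network calls the spec did not ask for
- code that looks correct but violates the stated interface or expected behavior
Report each finding with the file, the line, and a test that would expose it."""

def adversarial_review(diff: str, spec: str, call_llm) -> str:
    """Fresh context window: the adversary never sees the implementer's reasoning."""
    return call_llm(system=ADVERSARY_MANDATE,
                    user=f"Specification:\n{spec}\n\nDiff:\n{diff}")
```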

In one session, an adversary agent took coverage from 16% to 88% by writing edge case tests the implementers hadn't considered. That's not a fluke. It happens consistently because the adversary is specifically looking for what's missing, while the implementer was focused on making the happy path work.

What I've Learned

The biggest insight is that the constraint system matters more than the model. A weaker model inside a good workflow produces more reliable output than a stronger model with no constraints. The model provides capability. The workflow provides reliability. You need both, but if you had to choose, reliability wins.

The second insight is that LLMs are better at verification than generation. An agent that writes code will produce something plausible. An agent that reviews code against a specific checklist will catch real problems. This asymmetry is useful: make the generation step cheap and fast, then invest the quality budget in verification. Multiple verification passes with different mandates (structural correctness, test coverage, side effect analysis) catch different classes of problems.
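
In practice that means running the same diff through several cheap review passes instead of one expensive one. A sketch, reusing the hypothetical `call_llm` helper from the adversary example; the mandate wording is illustrative:

```python
MANDATES = {
    "structural": "Check that the code matches the specified interfaces exactly: names, signatures, return types.",
    "coverage": "List behaviors in the diff that no test exercises, especially error paths and boundary values.",
    "side_effects": "List every file write, network call, global, or environment variable the spec did not ask for.",
}

def verify(diff: str, spec: str, call_llm) -> dict[str, str]:
    """One verification pass per mandate; different mandates catch different problems."""
    return {
        name: call_llm(system=mandate, user=f"Specification:\n{spec}\n\nDiff:\n{diff}")
        for name, mandate in MANDATES.items()
    }
```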

The third insight is structural. Non-deterministic processes can produce bounded, useful output if the constraints are right. Phase gates prevent premature transitions. Classification gates route tasks to appropriate workflows. Gateway conditions enforce prerequisites. Adversarial review catches what slipped through. Each layer narrows the variance. None of them makes the process deterministic, but stacked together they keep results within a quality band you can build on.
