AI Swarm Audited My 840-Post Blog

Technical Content Has a Half-Life

The moment you publish a how-to, the clock starts. Not dramatically — more like a slow exhale. A config key gets renamed in a minor release. An SDK drops a deprecated method in the next major version. A base image tag goes EOL six months later. The self-hosted app you documented last year quietly swapped its backend between releases.

None of this is anyone’s fault. Time just happens to documentation. The problem is that every one of these small drifts is invisible from the outside — the article looks fine, gets found in search, and confidently walks a reader into an error they didn’t expect.

Now imagine that library is 840 articles deep — a back catalog you’ve written and piled up over several years. Reading every one of them manually, checking current docs against each claim, and making surgical corrections isn’t a weekend project. It’s a multi-month slog nobody actually does.

So I didn’t do it manually.

The Setup: One Agent Per Document, Run in Parallel

The idea is pretty simple once you say it out loud: if you have a large language model that can read a technical document, check its claims against current knowledge (with the occasional targeted web search for anything version-sensitive), and propose minimal corrections — you just need to run that at scale.

The architecture I landed on was a fan-out swarm: one independent agent per article, no shared state between them, run in parallel batches. Each agent gets exactly one job — read its assigned article, check it for three things, and either report “nothing to fix” or produce a diff of minimal corrections.

The three things every agent checks:

Factual/technical accuracy — Are code samples, CLI invocations, and API signatures consistent with current versions? Does the config shown actually match how the tool works today?
Freshness/drift — Has anything in the ecosystem moved since this was written? Version numbers, project status, renamed resources, deprecated patterns?
Metadata hygiene — Tags, description length, frontmatter fields, SEO cleanliness. The stuff that’s easy to let slide when you’re in writing mode.
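A minimal sketch of that contract, to make "report nothing or produce a diff" concrete. This is not the prompt or schema from the actual run: the `AuditResult` fields, the `Risk` levels, and the wording in `build_brief` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple


class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass
class AuditResult:
    """What one agent hands back for one article (hypothetical schema)."""
    article_path: str
    diff: str = ""                 # minimal corrections; empty means "nothing to fix"
    risk_level: Risk = Risk.LOW
    rationale: str = ""

    @property
    def needs_edit(self) -> bool:
        return bool(self.diff.strip())


# The three checks from the list above, condensed.
CHECKS: Tuple[str, ...] = (
    "Factual/technical accuracy: code samples, CLI invocations, API signatures, config.",
    "Freshness/drift: version numbers, project status, renamed resources, deprecated patterns.",
    "Metadata hygiene: tags, description length, frontmatter fields.",
)


def build_brief(article_path: str) -> str:
    """Assemble the per-article instructions (wording is illustrative)."""
    lines = [f"Audit exactly one article: {article_path}", "Check three things:"]
    lines += [f"{i}. {check}" for i, check in enumerate(CHECKS, start=1)]
    lines.append("Return a minimal diff, or an empty diff if nothing needs fixing.")
    return "\n".join(lines)
```

Making "nothing to fix" an empty diff rather than a special status keeps the happy path boring: the caller only ever has to check `needs_edit`.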

The practical constraint was concurrency: I ran the agents in parallel batches of about 70 at a time. Rolling usage limits meant spreading the work across a few days, five rounds in total, to get through the full catalog.
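The fan-out itself can be sketched with nothing but `asyncio`. Treat this as one possible shape under stated assumptions, not the real harness: `fake_audit` is a stand-in for whatever call launches an agent, and `run_in_batches` and `chunk` are hypothetical helper names.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunk(items: List[T], size: int) -> Iterable[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def run_in_batches(
    articles: List[T],
    audit_one: Callable[[T], Awaitable[R]],
    batch_size: int = 70,
) -> List[R]:
    """Run one independent audit per article, one batch at a time."""
    results: List[R] = []
    for batch in chunk(articles, batch_size):
        # No shared state: each coroutine only ever sees its own article.
        results.extend(await asyncio.gather(*(audit_one(a) for a in batch)))
    return results


async def fake_audit(path: str) -> str:
    """Stand-in for a real agent call."""
    await asyncio.sleep(0)
    return f"audited {path}"


if __name__ == "__main__":
    paths = [f"posts/{n}.md" for n in range(150)]
    out = asyncio.run(run_in_batches(paths, fake_audit))
    print(len(out))  # 150, in the same order as `paths`
```

Waiting for a whole batch before starting the next is slower than a sliding window, but it gives you a natural checkpoint after every 70 articles.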

2,441 agent runs. That’s the number once you account for reruns and the articles that needed another pass. The orchestration layer isn’t counted as a run; it shows up in the token totals instead.

What Drift Actually Looks Like

Here’s the thing: the stale content isn’t usually wrong in an obvious “this never worked” way. It worked perfectly when it was written. The ecosystem just moved on without updating the article.

A few categories that show up constantly in any aging technical library:

Renamed CRD kinds. Kubernetes controllers version their custom resources, and a kind name occasionally changes between operator releases. An article whose manifest still references the old resource name is silently broken for anyone on a current version — same concept, but the YAML a reader copy-pastes no longer matches what the cluster actually expects.

SDK method signatures. A secrets manager SDK that ships a v2 has often rewritten its client initialization and its get-secret call along the way. An article using the old pattern compiles, runs, and fails in a way that’s confusing if you don’t know to look for the version bump. These are death by a thousand Stack Overflow visits.

Compose examples pointing at deprecated backends. A self-hosted service might have migrated from MongoDB to Postgres+Redis between major versions. An article with a Compose file referencing the old backend just… doesn’t work anymore. Pinned EOL base image tags are in the same bucket — they keep working until they don’t, and the failure mode is rarely obvious.
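Pinned tags are the one item in this bucket you can catch without a model at all. Here is a rough sketch, assuming you keep your own list of end-of-life image references. The three entries in `EOL_IMAGES` are illustrative, `find_eol_images` is a hypothetical helper, and matching is exact, so a variant tag like `node:14-alpine` would need its own entry.

```python
import re
from typing import List, Set, Tuple

# Illustrative entries only; maintain your own list from upstream EOL notices.
EOL_IMAGES: Set[str] = {"node:14", "python:3.7", "debian:stretch"}

# Matches Dockerfile `FROM image:tag` and Compose `image: name:tag` lines.
IMAGE_REF = re.compile(
    r"^\s*(?:FROM|image:)\s+[\"']?([\w./-]+:[\w.-]+)",
    re.IGNORECASE | re.MULTILINE,
)


def find_eol_images(text: str, eol: Set[str] = EOL_IMAGES) -> List[Tuple[int, str]]:
    """Return (line_number, image_ref) for every pinned image on the EOL list."""
    hits = []
    for match in IMAGE_REF.finditer(text):
        ref = match.group(1)
        if ref in eol:
            # Count newlines before the reference to get a 1-based line number.
            line_no = text.count("\n", 0, match.start(1)) + 1
            hits.append((line_no, ref))
    return hits
```

A cheap pre-scan like this also gives the agent a head start: you can pass the hits along as hints rather than asking it to rediscover them.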

Forward-auth endpoint changes. Auth tools like Authelia and Authentik have both changed their forward-auth endpoint paths between releases. An Nginx or Caddy config using the old path can fail authentication in a way that looks like a networking problem, not a version mismatch.

Obsolete Compose v1 syntax. The old docker-compose CLI (v1, the standalone Python one — not the docker compose plugin, which is alive and well) is dead. The version: key at the top of a Compose file is deprecated and will nag at you on current Docker. These are low-stakes but they add up to a wall of yellow warnings in CI, and they’re the kind of thing readers notice.
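Both of those signals are mechanical enough to lint for. A small sketch, assuming you feed it the text of a Compose snippet or a shell example pulled out of an article. `compose_drift` is a hypothetical helper, and the subcommand list in `V1_CLI` is deliberately partial.

```python
import re
from typing import List

# A top-level `version:` key starts at column 0; nested keys are indented.
TOP_LEVEL_VERSION = re.compile(r"^version\s*:", re.MULTILINE)

# The standalone v1 binary is invoked with a hyphen; the plugin uses a space.
# Requiring a subcommand avoids flagging the `docker-compose.yml` filename.
V1_CLI = re.compile(r"\bdocker-compose\s+(?:up|down|pull|build|logs|ps|exec|run)\b")


def compose_drift(text: str) -> List[str]:
    """Return human-readable findings for obsolete Compose v1 patterns."""
    findings = []
    if TOP_LEVEL_VERSION.search(text):
        findings.append("top-level `version:` key is obsolete")
    if V1_CLI.search(text):
        findings.append("`docker-compose` (v1) invocation; current CLI is `docker compose`")
    return findings
```

The regex for `version:` will also fire on other YAML with a top-level `version` field, so only run it over blocks you already know are Compose files.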

Model names. These age faster than almost anything else in the AI/LLM space. An article recommending a specific model from 18 months ago might be pointing at something that’s two generations stale. The agents flagged these for human review rather than auto-correcting — model comparisons involve judgment calls, not just version lookups.

None of this is embarrassing. It’s just the universal half-life of a technical how-to. The only question is whether you do anything about it.

The Guardrails That Matter

Running 2,400 agents loose on your content without guardrails is how you turn a maintenance pass into a disaster. There were four things I wouldn’t compromise on:

Surgical, not rewrite. Every agent was instructed clearly: fix only what’s technically wrong or demonstrably stale. Preserve the voice, the structure, the examples, the jokes. Do not improve the prose. If nothing genuinely needs fixing, output nothing. This sounds obvious but it has to be explicit — models have a natural tendency to “help” more than you want.

A build gate after every batch. After each batch of commits landed, the full site rebuilt: type-check, static build, Pagefind search index over all ~840 posts. Nothing ships if the build is red. This caught a handful of cases where an agent introduced a fenced code block with an unsupported language identifier (build error, easy fix) or tweaked frontmatter in a way that broke schema validation. The gate makes the whole thing safe to run against a live site.
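For the gate itself, a thin wrapper around `subprocess` is enough. This is a sketch under assumptions, not my actual CI: the commands in `SITE_STEPS` are placeholders (your script names and output directory will differ), and `BuildResult` is a hypothetical type.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class BuildResult:
    success: bool
    errors: List[str] = field(default_factory=list)


def run_site_build(steps: Sequence[Sequence[str]]) -> BuildResult:
    """Run each build step in order; stop at the first failure."""
    for cmd in steps:
        label = " ".join(cmd)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            return BuildResult(False, [f"{label}: command not found"])
        if proc.returncode != 0:
            detail = proc.stderr.strip() or f"exit {proc.returncode}"
            return BuildResult(False, [f"{label}: {detail}"])
    return BuildResult(True)


# Placeholder commands: swap in your real type-check, build, and index steps.
SITE_STEPS = [
    ["npm", "run", "typecheck"],
    ["npm", "run", "build"],
    ["npx", "pagefind", "--site", "dist"],
]

if __name__ == "__main__":
    # Self-check with a command that always exists: the running interpreter.
    print(run_site_build([[sys.executable, "-c", "pass"]]).success)  # True
```

Stopping at the first red step keeps the error list short, which matters when you are reading it between batches rather than in a CI dashboard.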

Batch commits for resumability. Commits happened in batches locally, not one per article. If the session hit a usage limit mid-run, you could pick up from the last committed batch without re-auditing work that was already done. The five-round structure wasn’t planned — it emerged from this.
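The commits were the checkpoint in my run. If you would rather make that checkpoint explicit, a small ledger file does the same job. Everything here is hypothetical (the file name, the JSON format, the helper names), and it assumes you call `mark_done` only after a batch has been committed.

```python
import json
from pathlib import Path
from typing import Iterable, List, Set


def load_done(ledger: Path) -> Set[str]:
    """Read the set of already-audited article paths (empty if no ledger yet)."""
    if not ledger.exists():
        return set()
    return set(json.loads(ledger.read_text(encoding="utf-8")))


def mark_done(ledger: Path, batch: Iterable[str]) -> None:
    """Record a finished batch. Call this only after the batch commit succeeds."""
    done = load_done(ledger) | set(batch)
    tmp = ledger.with_suffix(".tmp")
    tmp.write_text(json.dumps(sorted(done)), encoding="utf-8")
    # Write to a temp file, then rename over the old ledger, so an
    # interrupted write does not clobber the existing checkpoint.
    tmp.replace(ledger)


def pending(all_articles: List[str], ledger: Path) -> List[str]:
    """Articles still waiting for an audit, in their original order."""
    done = load_done(ledger)
    return [a for a in all_articles if a not in done]
```

On restart, `pending(all_articles, ledger)` is the only thing the loop needs to ask before chunking the next batch.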

Human-review flags for anything risky. Any edit above a low risk threshold, or touching anything that looked like a substantive technical judgment call (as opposed to a clear version mismatch), got flagged rather than auto-applied. The agent would still propose the change, but it went into a review queue instead of directly committing. In practice this was maybe 8–10% of total edits.

Here’s the rough shape of the orchestration loop:

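from more_itertools import chunked  # third-party; any chunking helper works

BATCH_SIZE = 70

# pending_articles, LOW, BuildGateFailure, and every helper called below
# are stand-ins for your own tooling.
for batch_num, batch in enumerate(chunked(pending_articles, BATCH_SIZE), start=1):
    results = run_agents_parallel(batch)  # fan-out: one agent per article

    for article, result in results:
        if not result.needs_edit:
            continue  # agent reported "nothing to fix"
        if result.risk_level > LOW:
            # Risky edits go to the review queue instead of being applied.
            flag_for_review(article, result.rationale)
        else:
            apply_patch(article, result.diff)

    git_commit(f"audit batch {batch_num}: {len(batch)} articles")
    build_result = run_site_build()

    # Build gate: a red build undoes the batch and stops the run.
    if not build_result.success:
        rollback_batch()
        raise BuildGateFailure(build_result.errors)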

Not the actual code, but that’s the mental model. Fan out, collect, patch what passes the threshold, commit the batch, build, gate.

The Numbers, Honestly

Here’s where it all went. Five rounds, spread across a few days as usage limits reset:

Round	Agents	Tokens
1	821	7.4M
2	680	8.7M
3	488	8.6M
4	318	9.2M
5	134	4.6M
Orchestration	—	17.9M
Total	2,441	~56.4M

The thing that surprised me: the agent count drops by more than 80% from round 1 to round 5, because fewer articles still needed a pass each time, but tokens-per-round barely moved until the very end. Divide one by the other and the cost per agent run climbs from roughly 9K tokens in round 1 to roughly 34K in round 5. The rounds got smaller, but the runs that were left got heavier.

That ~56M headline is also a little misleading. About 14.4M of it was cache reads: the orchestrator re-reading its own context between batches, which bills at roughly a tenth of the price of fresh tokens. Strip out the cache replay and the genuinely new work is closer to 42M tokens, or roughly 50K per article across the 840-post catalog.
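If you want to check the arithmetic, here it is as a few lines of Python. The inputs are the table's figures, which are already rounded to the nearest 0.1M, so treat the outputs as approximate. It also divides each round's tokens by its agent runs, which is a per-run figure, not a per-article one.

```python
# Figures copied from the table above: (round, agent runs, tokens).
ROUNDS = [
    (1, 821, 7_400_000),
    (2, 680, 8_700_000),
    (3, 488, 8_600_000),
    (4, 318, 9_200_000),
    (5, 134, 4_600_000),
]
ORCHESTRATION_TOKENS = 17_900_000
CACHE_READ_TOKENS = 14_400_000
ARTICLES = 840

total_runs = sum(runs for _, runs, _ in ROUNDS)
total_tokens = sum(tokens for _, _, tokens in ROUNDS) + ORCHESTRATION_TOKENS
fresh_tokens = total_tokens - CACHE_READ_TOKENS

for number, runs, tokens in ROUNDS:
    print(f"round {number}: {tokens // runs:,} tokens per agent run")

print(f"total runs:   {total_runs:,}")                  # 2,441
print(f"total tokens: {total_tokens / 1e6:.1f}M")       # 56.4M
print(f"fresh tokens: {fresh_tokens / 1e6:.1f}M")       # 42.0M
print(f"per article:  {fresh_tokens // ARTICLES:,}")    # 50,000
```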

What it actually costs. I ran this on Claude Opus 4.8 — specifically because I had a pile of prepaid credits about to expire (more on that in a second). But “what would this cost at list prices?” is the more useful question, so here’s the same workload priced three ways. Most of those tokens are input — reading is far cheaper than generating — which is why the output-priced frontier model isn’t quite as brutal as the raw count suggests:

Model	Est. cost at list price
Claude Opus 4.8 (what I used)	roughly $400
GLM-5.2	~$90
DeepSeek V4	~$25

Before you rage-quit Opus and pipe everything through DeepSeek: the whole pass lives or dies on the model catching subtle drift — a renamed forward-auth endpoint, a client signature that changed in a point release, a model name that’s quietly two generations stale. Different models have different strengths here, so don’t assume the priciest one wins by default. Try a few options, eyeball the edits they hand back, and judge them on the voice, tone, and accuracy you actually want. If the results are off, tweak the prompt, swap the model, or rethink the whole approach before you turn it loose on hundreds of articles. The cheap sweep that nails it beats the expensive one that confidently ships a wrong “fix” to readers.

The “is it worth it?” math is pretty context-dependent:

For a large, aging technical library (hundreds of docs, years of accumulated drift), the cost per corrected article is well below what you’d pay a contractor for even a light review pass. The agents also don’t get bored at article 200 the way humans do.

For a small blog you update regularly and already re-read yourself: this is massive overkill. Just read your own posts.

The sweet spot is exactly where I was: a large back catalog that drifted over multiple years, where doing it by hand was theoretically possible but practically never going to happen.

What came out of it: roughly 94% of articles reviewed got at least one edit — usually small: a deprecated flag, a version bump, a renamed config key, a frontmatter field that was missing or wrong. The other 6% were already clean and got left alone.

Where It Falls Short

Agents aren’t perfect reviewers and I’d rather be upfront about where this breaks down than pretend it’s magic.

It occasionally wants to “fix” things that are intentional. Sometimes an article uses an older syntax deliberately — because the section is about how something used to work before explaining the current approach. An agent with limited context will flag that as stale. The human review queue catches most of this, but it’s a source of false positives.

Context budget blowouts on long-form pieces. A handful of very long, detailed articles — the kind with multiple complex code examples and extensive comparison tables — blew past the effective context budget and had to be handled as a separate manual pass. Not a blocker, but worth knowing: the technique works best on typical-length articles. Monster posts need special handling.
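In hindsight, a pre-screen would have routed those posts to the manual pile before any agent touched them. A rough sketch, with loud caveats: four characters per token is only a rule of thumb for English prose, not a tokenizer, and the `TOKEN_BUDGET` value is a made-up ceiling you would tune to your model and prompt overhead.

```python
from typing import Dict, List, Tuple

CHARS_PER_TOKEN = 4      # rough rule of thumb for English prose; not a tokenizer
TOKEN_BUDGET = 60_000    # hypothetical per-article ceiling; tune to your setup


def estimate_tokens(text: str) -> int:
    """Cheap size estimate; code-heavy posts will usually run higher."""
    return len(text) // CHARS_PER_TOKEN


def split_by_budget(
    articles: Dict[str, str], budget: int = TOKEN_BUDGET
) -> Tuple[List[str], List[str]]:
    """Return (fits, oversized) article paths based on the estimated size."""
    fits: List[str] = []
    oversized: List[str] = []
    for path, text in articles.items():
        (fits if estimate_tokens(text) <= budget else oversized).append(path)
    return fits, oversized
```

Only the `fits` list goes to the swarm; the `oversized` list becomes the to-do list for the separate manual pass.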

Technical judgment calls still need humans. When is a deprecated pattern deprecated but still working versus actually broken? When is an old model name worth keeping for historical context versus actively misleading? The agents flag these correctly, but answering them requires knowing the domain. The queue doesn’t eliminate human judgment — it just focuses it where it’s actually needed.

It can’t fix what it can’t know. If a self-hosted project was abandoned and the article doesn’t mention that, an agent might miss it unless it has web search enabled and specifically looks. Broader “is this project still maintained?” checks require intentional tooling.

Steal This Workflow

If you’re maintaining a large technical content library — blog, wiki, docs site, internal knowledge base — here’s the shape of what to build:

One agent per document, parallel batches. Don’t try to process everything sequentially; the wall clock time gets brutal. Batch size depends on your usage limits.
Three-task brief per agent: accuracy check, freshness/drift scan, metadata hygiene. Keep the brief tight or agents scope-creep into rewrites.
Explicit non-rewrite instructions. “Fix only what’s wrong. Preserve voice and structure. Output nothing if nothing needs fixing.” This has to be in the prompt, not assumed.
A build gate. Whatever your CI looks like, run it after every batch before committing to the next one. Non-negotiable if you’re running this against a live site.
A human review queue. Any change above a risk threshold, any edit the agent isn’t confident about — flag it, don’t auto-ship it.
Batch commits, not one per article. Makes the job resumable when you hit rate limits or usage caps.

The whole thing took a couple of days of elapsed time and a pile of AI credits I needed to burn before they expired anyway. The site came out the other end a lot cleaner, and I know where the remaining human-review flags are.

Not a silver bullet. But for a content library that’s too big to maintain by hand? Honestly, it’s one of the more efficient maintenance passes I’ve run.

Your 2 AM self dealing with a reader’s confused GitHub issue about a config that changed in v2.3 will appreciate it.