The Footnote That Deserved Its Own Post Link to heading
At the bottom of The Forge, I dropped a footnote:
> Local LLM inference can also be used to reduce API calls for the Boss and the QA roles, but frontier models are critical for workers and reviewers.
That footnote glossed over something worth unpacking. Running a team of twelve autonomous agents sounds like it should cost a fortune. It doesn’t. But getting the economics right requires a bit of design thinking about which models go where, and why.
Two Tiers of Intelligence Link to heading
Not every agent needs a frontier model. This is the single most important insight for anyone building agent systems in 2026.
A writer producing production code needs the best model available. It needs to reason about architecture, maintain consistency across a large codebase, and make decisions that a human reviewer would approve. That’s Opus 4.6 territory. No compromise.
A boss running triage on incoming issues? It’s reading a ticket, classifying it, and deciding whether to accept or reject. That’s pattern matching with some contextual awareness. A good local model handles it fine.
The same logic applies to test orchestration, status reporting, and the kind of “glue” reasoning that keeps the pipeline moving. These tasks need competent inference, not frontier intelligence.
So you split. Frontier models for the roles that demand peak capability. Local models for everything else.
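As a configuration sketch, the split can be as simple as a routing table keyed on agent role. The role names, model IDs, and endpoint values here are illustrative assumptions, not the Forge's actual config:

```python
# Hypothetical routing table: frontier roles go to the Claude API,
# everything else to the local Ollama tier. Names are illustrative.
FRONTIER_ROLES = {"writer", "reviewer"}

def route(role: str) -> dict:
    """Pick the backend tier for an agent role."""
    if role in FRONTIER_ROLES:
        return {"tier": "frontier", "model": "claude-opus", "endpoint": "anthropic-api"}
    return {"tier": "local", "model": "qwen3.5:9b-q8_0", "endpoint": "http://localhost:11434"}
```

Everything downstream stays the same; an agent asks for a completion, and the router decides which bill it lands on.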
The Local Tier Link to heading
I’ve been running Ollama on my development PC since I built it last year. The machine was designed for gaming and development, but it turns out a modern GPU with 16GB of VRAM is also a perfectly serviceable local inference server.
Until two weeks ago, I was running qwen3:14b-q8_0 as the local workhorse. It was competent but comparatively slow at 14.5 tok/s. When Alibaba released the qwen3.5 family in early March, I benchmarked the new qwen3.5:9b-q8_0 against several alternatives on the same hardware, asking each model to produce a senior architect’s analysis of event-driven vs request-response patterns for financial systems. The results:
| Model | Output | Speed | Time |
|---|---|---|---|
| deepseek-r1:14b | 1,383 tokens | 30.4 tok/s | 46s |
| qwen3.5:9b-q8_0 | 3,832 tokens | 28.0 tok/s | 142s |
| phi4-reasoning | 3,111 tokens | 25.4 tok/s | 123s |
| qwen3:14b-q8_0 | 2,234 tokens | 14.5 tok/s | 157s |
qwen3.5:9b produced the highest-quality output by a clear margin: a professional memo structure with domain-specific examples, a detailed hybrid architecture with rationale, and the most comprehensive diagram. It was also nearly twice as fast as the qwen3:14b it replaced.
deepseek-r1:14b was the fastest but shallowest; about a third of the output, reading more like a student working through concepts than an architect reasoning about trade-offs. Good for quick tasks, not deep analysis.
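Benchmarks like the table above are easy to script against Ollama's own API: the non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating them), which is all you need for a tok/s figure. A minimal sketch, assuming a default Ollama install on port 11434:

```python
import json
import urllib.request

def tok_per_s(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from Ollama's eval_count / eval_duration (ns) fields."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    # Non-streaming generate call; Ollama returns timing metadata alongside
    # the response text.
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return {
        "model": model,
        "tokens": data["eval_count"],
        "tok_per_s": round(tok_per_s(data["eval_count"], data["eval_duration"]), 1),
    }
```

Run the same prompt across each installed model and you get a comparable table for your own hardware.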
The VRAM Ceiling Link to heading
Here’s the constraint that shapes the whole system. The sweet spot model is actually qwen3.5:27b. When I tested it, it produced the most tokens (3,786) with quality comparable to the 9b variant. But at 4.7 tok/s, it took over 13 minutes to generate a single response.
The reason: 27 billion parameters at q8_0 quantisation needs more than 16GB of VRAM. The model spills into system RAM, and inference speed falls off a cliff. It’s technically running, but it’s not usable for interactive or agent workloads.
So the 9b stays. It fits entirely in VRAM, runs at 28 tok/s, and produces output that’s genuinely good for non-production-code tasks. The two-tier split isn’t a preference; it’s physics. You design your agent roles around what fits in the GPU, and push everything that needs peak intelligence to the frontier API.
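The ceiling is easy to sanity-check on paper: q8_0 stores roughly one byte per parameter, and the runtime needs headroom on top for KV cache and buffers. The ~15% overhead figure below is an assumption for illustration, not a measured number:

```python
# Back-of-envelope VRAM check: q8_0 is ~1 byte per parameter, plus
# headroom for KV cache and runtime buffers (assumed ~15% here).
def fits_in_vram(params_billion: float, vram_gb: float, overhead: float = 0.15) -> bool:
    weights_gb = params_billion  # ~1 GB per billion params at q8_0
    return weights_gb * (1 + overhead) <= vram_gb
```

By this arithmetic a 9B model needs roughly 10GB and fits in 16GB with room to spare, while a 27B model needs over 30GB and spills into system RAM, which is exactly the cliff the table shows.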
The Frontier Tier Link to heading
For the roles that need real capability (writers, reviewers, and anything touching production code), I use Claude Code with a Max subscription.
The economics here are unusual. Claude Code Max is a flat monthly subscription with usage gates rather than per-token billing. Max5 runs at US$100/month (about AU$150) and Max20 is double that for four times the capacity. Both give access to Opus 4.6 with 5-hour and weekly usage windows.
The per-token equivalent value is where it gets interesting. If you can keep the tokens burning (and this is the critical “if”), the effective cost per token drops to a fraction of what you’d pay on the API. Anthropic’s own estimates suggest a 10x to 40x multiplier on value compared to API pricing, depending on how efficiently you use the gates.
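The mechanics are just division: on a flat fee, the effective price per million tokens is the monthly cost divided by the tokens you actually burn, so utilisation is everything. The burn figures below are made-up examples, not measured usage:

```python
# Illustrative only: flat-fee economics are driven entirely by utilisation.
def effective_cost_per_mtok(monthly_price: float, mtok_burned: float) -> float:
    return monthly_price / mtok_burned

light_use = effective_cost_per_mtok(100, 20)    # 20 Mtok/month on a $100 sub
agent_farm = effective_cost_per_mtok(100, 400)  # 400 Mtok/month on the same sub
```

Same subscription, twenty times the tokens, one-twentieth the effective per-token price. That ratio is the whole argument for keeping the agents busy.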
Humans Sleep, Agents Don’t Link to heading
The usage gates are time-based. Burn through your allocation and you wait for the gate to reset. For a human developer working 8-10 hours a day, that’s a natural constraint: you run into the 5-hour gate halfway into your day (assuming you use the hello hack to start your session before 7am), and you pick it up again after lunch. Or, for those working after hours, you go to bed and it resets overnight.
But agents don’t sleep.
Before the Forge, the best we could manage was maybe an hour of overnight agent work before the task list ran dry. The Forge changes that completely. With twelve agents and a backlog of well-specified tasks, we can sustain token burn through the entire overnight cycle. The 5-hour gate resets and the agents are already queued up with more work.
We’re now consistently exhausting the 5-hour windows and hitting the weekly limits. The subscription is being used at its theoretical maximum. Human attention for the week has dropped from about five hours a day of hands-on work sitting at the keyboard to roughly two; the rest goes into architectural research, planning, and specification writing. All of which feeds back into the agents as structured input.
We’re force-multiplying our own brains. The humans write the intent, the agents write the code, and the output rate is something neither could achieve alone.
The GPU Decision Link to heading
The hardware choice feeds directly into the economics. My GPU is an AMD Radeon 9060 XT with 16GB of VRAM. Here’s what the alternatives looked like at time of purchase:
| GPU | VRAM | Price (AUD) |
|---|---|---|
| AMD Radeon 9060 XT | 16GB | ~$600 |
| AMD Radeon 9070 XT | 16GB | ~$1,200 |
| Nvidia RTX 5070 Ti | 16GB | ~$1,600 |
All three cards have 16GB of VRAM, so the same local models fit on each of them. The 9060 XT is slower for gaming, but for inference workloads the bottleneck is memory bandwidth and VRAM capacity, not raw shader count; the cheaper card just takes a bit longer per token.
The $1,000 saved over the RTX 5070 Ti is almost seven months of a Max5 subscription. The lower power draw on the AMD card saves roughly another month’s worth per year in electricity. So in year one, the GPU choice alone funds about seven to eight months of frontier API access. After that, the power savings keep compounding.
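The payback arithmetic is simple enough to check (AUD, approximate figures from the table and paragraph above):

```python
# Rough payback arithmetic: price gap between the two cards, expressed
# in months of Max5 subscription at ~AU$150/month.
gpu_saving = 1600 - 600          # RTX 5070 Ti price minus 9060 XT price
max5_monthly_aud = 150           # Max5 at roughly AU$150/month
months_funded = gpu_saving / max5_monthly_aud
```

That comes out at roughly six and two-thirds months; add the electricity savings and you land in the seven-to-eight-month range.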
Put differently: I get both tiers of the AI workshop (local inference and frontier API access) for less than the Nvidia card alone would have cost, for the first year at least. But things are moving so fast that I suspect we won’t be using gaming GPUs in development rigs this time next year. AI-specific units with high VRAM and low energy overhead are already here, and expect more to arrive as 2026 goes on, despite the RAMpocalypse.
Two Projects, Two Tiers Link to heading
This isn’t theoretical. Two projects on my homelab demonstrate the split in practice.
LLM Gateway Link to heading
The LLM Gateway is an authenticated OpenCALL endpoint sitting in front of Ollama. Same `POST /call` envelope, same contract, but the backend is local inference rather than a cloud API.
```json
{
  "op": "v1:llm.chatCompletion",
  "args": {
    "model": "qwen3.5:9b-q8_0",
    "messages": [{ "role": "user", "content": "summarise this ticket" }]
  }
}
```
Firebase-hosted apps can call back to the gateway. Local network requests bypass auth; external requests need a JWT. The gateway supports chat completion, model listing, and embeddings; everything the Forge’s Boss agent needs for triage and coordination tasks.
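An external caller then just wraps that envelope in an authenticated POST. The base URL and the use of a standard `Authorization: Bearer` header are assumptions for illustration; local-network callers would skip the header entirely:

```python
import json
import urllib.request

def build_envelope(model: str, content: str) -> dict:
    """OpenCALL chat-completion envelope, as shown in the post."""
    return {
        "op": "v1:llm.chatCompletion",
        "args": {"model": model, "messages": [{"role": "user", "content": content}]},
    }

def call_gateway(base_url: str, jwt: str, content: str) -> dict:
    # Hypothetical external client: JWT auth assumed to be a Bearer token.
    envelope = build_envelope("qwen3.5:9b-q8_0", content)
    req = urllib.request.Request(
        f"{base_url}/call",
        data=json.dumps(envelope).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {jwt}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The caller never knows, or cares, that the model on the other side is running on a gaming GPU.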
Cost per query: electricity, which works out to roughly $10/month for the whole tier. It also means the Forge can run off-box, in the cloud on someone else’s hardware, in someone else’s data centre. It might even become a SaaS product, but the SaaSpocalypse is also real, so bootstrapping that idea can be done with someone else’s money.
NAS Agent Link to heading
The NAS Agent sits at the other end of the spectrum, but not in the way you might expect. It’s built on the Claude Agent SDK, accepts natural language through Slack, an OpenCALL API, or CLI, and manages Podman containers, ZFS pools, systemd services, and PostgreSQL databases.
Three entry points, one agent:
- Slack - `@nas how much disk space is left?`
- OpenCALL API - `POST /call` for structured commands from other services
- CLI - interactive REPL or single-shot queries
But here’s the critical design constraint: the LLM never touches the shell directly. It interprets the request, selects the appropriate operation from a predefined set of OpenCALL tools (`v1:pod.ps`, `v1:zfs.status`, `v1:service.restart`, etc.), and the agent framework executes that operation through validated, parameterised handlers. The LLM is the translator, not the executor.
This matters. Letting an LLM loose with bash on a device that accepts text instructions from an uncontrolled source (Slack, in this case) is asking for something to go very wrong. The OpenCALL contract acts as a security boundary. The model can only call operations that exist in the registry, with arguments that pass validation. It can’t `rm -rf /` because that’s not an operation.
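The translator-not-executor boundary can be sketched as a registry plus a dispatcher. The operation names follow the post; the handlers and argument schemas here are illustrative stand-ins, not the agent's real implementation:

```python
# Registry of permitted operations. Anything the LLM names that isn't
# here simply doesn't exist, so nothing executes.
REGISTRY = {
    "v1:pod.ps": {"args": set(), "handler": lambda a: "container list"},
    "v1:zfs.status": {"args": {"pool"}, "handler": lambda a: f"pool {a['pool']}: ONLINE"},
    "v1:service.restart": {"args": {"unit"}, "handler": lambda a: f"restarted {a['unit']}"},
}

def dispatch(op: str, args: dict) -> str:
    """Validate an LLM-selected operation before any handler runs."""
    entry = REGISTRY.get(op)
    if entry is None:
        raise ValueError(f"unknown operation: {op}")
    if set(args) != entry["args"]:
        raise ValueError(f"invalid arguments for {op}: {sorted(args)}")
    return entry["handler"](args)
```

The LLM's output is reduced to picking a key and filling in validated parameters; the blast radius is whatever the handlers allow, and nothing more.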
The per-query budget is capped at US$0.05; cheap enough that it doesn’t matter. But it uses a frontier model via the API, not a local one. Not because the task is complex, but because intent interpretation from natural language needs to be reliable, and I also want it to work when the development PC is powered down. A local 9b model picking the wrong tool from a list of 25 operations is a worse outcome than a slightly higher cost per query, so this is a risk decision, not a raw cost decision. Besides, it only gets used a handful of times per week.
The contrast makes the economics tangible. The gateway runs the local tier and costs electricity. The NAS agent runs the API tier and costs cents. Both use the same OpenCALL envelope. The intelligence tier is an implementation detail hidden behind a consistent API contract.
The Force Multiplier Link to heading
In January, I wrote about the shift from prompts to specifications. The bottleneck moved from “can the AI implement this?” to “did we tell the AI exactly what we wanted?”
The economics post is the other side of that coin. Once you’re writing specifications instead of code, the cost of producing software shifts from developer hours to token consumption. And token consumption is something you can optimise relentlessly; right model for the right task, subscriptions over per-token billing, overnight agent runs to maximise gate utilisation.
Human attention becomes the scarcest and most valuable resource. Two hours a day of focused specification and architectural work, fed into a system that runs twenty-four hours a day, produces more output than eight hours of traditional hands-on development. The humans think, the machines type.
Beyond Software Link to heading
This pattern extends beyond code. We’ve used agents to process years of accumulated email, culling thousands of messages to reclaim storage and cancel unnecessary subscriptions. The personal agent (one that knows your preferences, your filing system, your priorities) is something every knowledge worker will need to learn how to leverage.
The challenge is that no training course exists for this, and probably never will. The tooling moves too fast for traditional knowledge dissemination. By the time a training organisation designs a curriculum, the models and patterns have moved on. The best bet right now is local networking; communities like Coding from Beach here on the Sunshine Coast where practitioners share what’s working in real time.
With Anthropic opening up their partner program, consulting on AI implementation is about to become a significant market. The organisations that figure out these economics early will have a substantial head start. The ones waiting for a textbook will be waiting a long time, but in the meantime, there will be a lot of slow movers wanting a leg up - and willing to pay for it - in the very near future.
The Moving Target Link to heading
The economics only improve from here. The next generation of inference hardware will be decoupled from discrete GPUs built for gaming and will push the VRAM ceiling higher, moving larger models into the local tier and off the API bill. Model distillation keeps producing smaller models that punch above their weight, though the sweet spot in affordability still needs 24GB or 32GB of RAM. Subscription pricing hasn’t increased despite the models getting dramatically more capable, but that might change if Anthropic or OpenAI decide they need to cover the $5,000/m of compute we are using on a single $200/m subscription. And we are not the only ones doing this, so expect the age of cheap access to be brief but productive; make the most of it while you (still) can.
Twelve months ago, running a team of autonomous agents was science fiction. Today, an entire month of it costs less than a single junior developer’s daily rate, runs around the clock, and gets better every quarter.
The real question isn’t whether you can afford to run an AI workshop. It’s whether you can afford not to, and whether or not it will still be an option when you finally make that decision.
This post is part of a series on autonomous agent systems. See also: The Forge for the agent architecture, OpenCALL In Practice for the API contract, and From Prompts to Specifications for the shift in how we work with AI.