An agent built around not calling the LLM

Most agent frameworks default to calling the model again. Need to check a webpage every hour? Run the agent every hour. Need to watch for a change? Ask the LLM whether it changed. Need to extract a field from a tool result? Another round trip. The running bill for an agent built this way is roughly the cost of the model multiplied by the number of times the agent woke up, which for any useful system is a lot.

The personal agent project this post is about was built around a different default question — do we actually need to call the model this time? — and every piece of the system is structured so the answer is usually no. The goal I set on day one was an agent that is extremely cost-effective to run, and every architectural choice downstream of that goal is what the rest of this post is about.

The overall shape

The project is an agent runtime built from four pieces that lean on each other:

  • A main loop that speaks to a pluggable model provider and drives a small registry of tools the model can call.

  • A scheduler that runs ordinary Python jobs on a timer, reviews their output with an LLM only when something changes, and sends you an email when it does.

  • A provider layer that lets the loop talk to a cheap remote endpoint or to a local model interchangeably.

  • A self-hostable setup flow — install wizard, authentication, mail and runtime configuration — so the path from “clone this” to “something is running for me” does not go through a README and an .env file.

There are three runtime profiles the user can pick between: cheap remote, local, or cheap remote with a local model used only for preprocessing. Those are three points on the cost/capability curve, and the whole reason the provider layer exists is so that picking between them is a dropdown and not a rewrite.
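
As a rough sketch of what that dropdown maps to under the hood, the profile selection might look something like the following; the profile names and provider labels are illustrative assumptions, not the project's actual configuration.

    # Illustrative mapping from runtime profile to model roles. Names and
    # values are assumptions, not the project's real config.
    from enum import Enum

    class RuntimeProfile(Enum):
        CHEAP_REMOTE = "cheap_remote"   # remote main model only
        LOCAL = "local"                 # local main model only
        HYBRID = "hybrid"               # remote main model, local preprocessor

    def build_roles(profile: RuntimeProfile) -> dict:
        if profile is RuntimeProfile.CHEAP_REMOTE:
            return {"main": "remote-cheap", "preprocessor": None}
        if profile is RuntimeProfile.LOCAL:
            return {"main": "local-small", "preprocessor": "local-small"}
        return {"main": "remote-cheap", "preprocessor": "local-small"}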

Why the cheap remote is the main model

The main model is a cheap remote one for a boring reason: at the time I built this, it was the cheapest remote option that actually held up in an agent loop. A frontier model would have done the same work more reliably and would have billed me enough per run to kill the “cost-effective agent” premise inside a week. An agent loop is not a one-shot — it is a long-running thing where every call is one of many, and cost compounds linearly in the number of calls.

The more interesting tradeoff is why not local. I did start there — the first version of the agent loaded a small local model directly, with grammar constraints forcing boolean-shaped outputs so tool calls would parse, and fell back to downloading weights from a URL if none were on disk. What I kept running into was that small local models were both too slow and not accurate enough for the hot path. Inference was slow enough to make the debug loop painful, and outputs were inconsistent enough that I was writing more prompt scaffolding than actual agent logic. The moment of honesty was a prompt I ended up deleting: a separate instruction telling the model to use only tool results as evidence and to ignore its own prior reasoning and world knowledge. That is the point at which I had already given up on the local model reasoning its way to the answer. The switch to a cheap remote was just the admission.
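
For reference, the grammar-constraint trick looks roughly like this with llama-cpp-python's GBNF support; the model path, prompt, and choice of library here are illustrative, a minimal sketch rather than the project's actual code.

    # Minimal sketch of grammar-constrained boolean output (llama-cpp-python).
    # Model path and prompt are placeholders.
    from llama_cpp import Llama, LlamaGrammar

    # A grammar that only admits the literal strings "true" or "false".
    BOOL_GRAMMAR = LlamaGrammar.from_string('root ::= "true" | "false"')

    llm = Llama(model_path="models/small-model.gguf", n_ctx=2048)

    def yes_or_no(question: str) -> bool:
        out = llm(
            f"{question}\nAnswer with true or false.\nAnswer:",
            grammar=BOOL_GRAMMAR,  # decoding can only produce "true" or "false"
            max_tokens=4,
        )
        return out["choices"][0]["text"].strip() == "true"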

I still think locally running models are the future. The compute cost of frontier LLMs is not a sustainable base to build software on top of, and small-model capability is improving fast enough that the picture a year from now is probably different from the picture today. The architecture I ended up with is built to absorb that shift when it comes — the provider layer exists precisely so that the next time a local model is viable for the hot path, swapping it in is cheap. It was not viable today.

Tool design is the part I liked most

The posture I kept coming back to on this project, and the one I would carry into almost anything else I build with an LLM in the loop, is that the tools are the design surface, not the model. Once you accept that the model is fixed — you picked your provider, you chose your size, you are not going to fine-tune it on a weekend — everything that is left to design is the set of affordances the model sees when it sits down to do a task. What can it fetch? In what shape? What can it produce? What does it look at versus what does it hand off?

That reframing turned out to be more useful than any scaffolding trick I tried first. You stop asking “how do I get the model to reason its way to X” and start asking “what would I hand a reasonably smart intern who had to produce X, and what tools would be on their desk?” When the model is struggling, the honest diagnosis is usually not “the model is bad” but “I have not given it a tool that matches the shape of the job.” Rewriting the prompt is the tempting shortcut; writing a better tool is usually the right move.

A concrete example. A generic “fetch a webpage” tool returns the full HTML of the page, which is both expensive in tokens and noisy enough that the model ends up doing DOM parsing in its head. The fix is not to tell the model “please ignore the nav and footer.” The fix is a second tool — a targeted extractor that takes a CSS selector or a semantic description and returns just the part of the page the model actually asked for. Same page, radically different shape, and the model stops drowning.
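
A minimal sketch of what such an extractor tool can look like, using requests and BeautifulSoup; the function name and parameters are illustrative, not the project's actual tool.

    # Hypothetical targeted-extractor tool: fetch a page and return only the
    # parts matching a CSS selector, instead of the full HTML document.
    import requests
    from bs4 import BeautifulSoup

    def extract_from_page(url: str, css_selector: str, max_chars: int = 4000) -> str:
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        nodes = soup.select(css_selector)
        if not nodes:
            return f"No elements matched selector {css_selector!r}."
        # Text content only, capped so a bad selector cannot flood the context.
        text = "\n\n".join(node.get_text(" ", strip=True) for node in nodes)
        return text[:max_chars]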

The other piece of tool design I did not appreciate until I was in the middle of it is that the tool description is itself part of the prompt. Every tool’s description travels into the model’s context whenever tools are offered, and that description is where you get to communicate not only what the tool does but when the model should and should not reach for it. The one I am proudest of is the escalation tool:

ONLY use this tool as a last resort when you have already reasoned carefully and are genuinely unable to answer on your own — for example, a hard coding problem, a subtle logical puzzle, or a specialised technical question beyond your knowledge. Do NOT call this for simple facts, arithmetic, general knowledge, or anything you can answer yourself. Each call has a real monetary cost; overuse is wasteful.

The tool it describes sends a single self-contained question to a more capable (and more expensive) model and returns the answer. The pattern I was aiming for is cheap model for most of the loop, smarter model only when the cheap model is genuinely stuck — an escape hatch that keeps the average call cost low while still leaving a lifeline for the rare question that actually needs more horsepower. The idea was born when the main model was a small local one — small local models lack the kind of higher-level reasoning you need for harder tasks, and giving them an escape hatch to a more capable model on demand is cheaper than either upgrading the local model permanently or routing everything through a frontier API.
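
To make the “description is part of the prompt” point concrete, here is roughly how such a tool might be declared, assuming an OpenAI-style function schema; the tool name, schema, and the expensive_client handle are placeholders, not the project's actual code.

    # Sketch of an escalation tool declaration. The description is what the
    # cheap model reads whenever tools are offered, so it carries the policy.
    ESCALATE_TOOL = {
        "type": "function",
        "function": {
            "name": "ask_stronger_model",
            # Abridged version of the description quoted above.
            "description": (
                "ONLY use this tool as a last resort when you have already "
                "reasoned carefully and are genuinely unable to answer on your "
                "own. Do NOT call this for simple facts, arithmetic, or general "
                "knowledge. Each call has a real monetary cost."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "question": {
                        "type": "string",
                        "description": "A single, fully self-contained question.",
                    }
                },
                "required": ["question"],
            },
        },
    }

    def ask_stronger_model(question: str) -> str:
        # Forward one self-contained question to the more capable, more
        # expensive model. 'expensive_client' is a placeholder provider handle.
        return expensive_client.complete(question)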

Honest caveat: I switched away from the local main model before I could confirm that a local model actually reaches for this escape hatch correctly when it should. The tool is still in place and the pattern is still sound in principle, but its intended pairing is with the local runtime profile, and that pairing is untested. With a cheap remote as the main model the escalation is less load-bearing, because the remote handles most of what would otherwise trigger it. When I reopen the project and bring a newer small local model into the hot path, seeing whether this pattern works in practice is one of the first things I want to look at.

Other tools I grew to like for the same design-surface reason: a persistent key-value memory so state is not refabricated every turn; a structured page-change checker the agent can call directly instead of re-reading a page into the loop; and a tool that lets the agent write a new scheduled job into the project and have it picked up on the next scheduler tick, which is the subject of the next section.
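
Of those, the key-value memory is the simplest to show; the sketch below is a minimal, hypothetical version backed by a JSON file on disk, not the project's actual implementation.

    # Hypothetical persistent key-value memory tools backed by a JSON file,
    # so state survives between turns instead of being refabricated.
    import json
    from pathlib import Path

    MEMORY_PATH = Path("state/memory.json")

    def _load() -> dict:
        return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}

    def memory_set(key: str, value: str) -> str:
        data = _load()
        data[key] = value
        MEMORY_PATH.parent.mkdir(parents=True, exist_ok=True)
        MEMORY_PATH.write_text(json.dumps(data, indent=2))
        return f"Stored {key!r}."

    def memory_get(key: str) -> str:
        return _load().get(key, f"No value stored for {key!r}.")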

The scheduler is the thesis made executable

The architectural choice I am most attached to on this project is the scheduled background-job system. The realisation behind it is straightforward: a lot of useful agent work — check if this page changed, notify me when this release lands, watch this feed for a new item — obviously does not need a metered LLM call every hour. An LLM is more than capable of writing the check itself, once. A cron can then run that check forever without touching the model again. The LLM only needs to come back into the loop when the check reports something worth reviewing.

That is how the system works. The agent has a tool that drops a new Python module into a jobs directory. The scheduler auto-discovers the module on the next tick, runs it on its own interval, keeps a persistent state file so it knows what “changed” means between runs, and only invokes the LLM to review a detected change before sending the user an email callback. The first job I wrote using it watches a model provider’s homepage for release-related language, because I had been checking it manually and that is exactly the kind of task where paying for an LLM every hour would be absurd.
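
To make that concrete, a job module in this shape might look something like the sketch below; the hook names (CHECK_INTERVAL_MINUTES, run) and the state-passing convention are assumptions about the job API, not the project's actual one.

    # jobs/watch_releases.py -- a hypothetical scheduler job. The scheduler is
    # assumed to discover this module, call run() on its interval, persist the
    # state dict between runs, and involve the LLM only when run() returns a
    # change report.
    import hashlib
    import requests
    from bs4 import BeautifulSoup

    CHECK_INTERVAL_MINUTES = 60                   # how often to run this job
    WATCH_URL = "https://example.com/releases"    # placeholder URL

    def run(state: dict) -> dict | None:
        """Return a change report for LLM review, or None if nothing changed."""
        html = requests.get(WATCH_URL, timeout=30).text
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        digest = hashlib.sha256(text.encode()).hexdigest()

        if digest == state.get("last_digest"):
            return None                 # nothing changed: no LLM call, no email
        state["last_digest"] = digest
        return {"url": WATCH_URL, "excerpt": text[:2000]}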

The broader pattern, though, is not “build a website watcher.” The pattern is that most repeatable automations only need the LLM at construction time, not at every tick, and the system’s job is to make that shape the default rather than the workaround. If there is one feature that separates this project from every other “here is my local LLM wrapper” weekend build, it is this one. Most agent frameworks answer “how do I run this automation every hour?” with “run the agent every hour.” This project’s answer is “have the agent write you a cron.”

Local models in a narrower role

Sitting next to the scheduler is a webpage-compression preprocessor. It runs a small local model against fetched HTML with a strict structured-output prompt producing exactly this shape:

    TITLE: ...
    SUMMARY: - ...
    SECTIONS: [Heading] - ...
    ENTITIES: - ...
    KEY FACTS: - ...
    OPEN QUESTIONS: - ... or 'None'

The web-fetch tool optionally pipes HTML through this preprocessor before it ever reaches the main model. The original reason was local-on-local: when the main loop was a small local model, it didn’t actually need to read the HTML — it needed the content of the page, and specifically the content in a shape compact enough to fit in its tight context window and structured enough for it to reason over. Raw HTML failed on both counts. A second motivation surfaced once a different small local model I had to hand turned out to be genuinely good at writing these summaries: putting it in front of the pipeline cut the token count on every downstream call that had to look at the result. For a project whose whole point is to minimise metered model calls, a token cut applied to every web fetch is not a small win.
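
As a sketch of that preprocessing step, assuming a generic local-model handle with a complete() method (the binding and the prompt wording are illustrative, not the project's actual code):

    # Sketch of the compression preprocessor: a small local model rewrites raw
    # HTML into the compact template above before the main model sees anything.
    from bs4 import BeautifulSoup

    SUMMARY_PROMPT = """Summarise the page below using EXACTLY this template:
    TITLE: ...
    SUMMARY: - ...
    SECTIONS: [Heading] - ...
    ENTITIES: - ...
    KEY FACTS: - ...
    OPEN QUESTIONS: - ... or 'None'

    PAGE:
    {page_text}
    """

    def compress_page(html: str, local_llm) -> str:
        # Strip markup first so the local model sees text, not tags.
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        return local_llm.complete(SUMMARY_PROMPT.format(page_text=text[:12000]))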

Both of those motivations still apply today, and they compose neatly with the third runtime profile: cheap remote main loop, with a local preprocessor in front of it. Same preprocessor role, different beneficiary. The remote model never sees the raw page. Its context window stays clean, its per-call token count stays bounded, and the small local model earns its keep on a role where its ceiling does not matter — it is not trying to reason, it is trying to produce a compact, predictable summary of something opaque. It fails well. It is also approximately free to run on the same box as everything else.

The generalisation beyond web scraping is what I would actually carry into any future project. Almost every tool that returns opaque, size-unbounded output — long file reads, verbose API responses, transcripts, scraped content — is a candidate for a small-local-model compression step in front of the main loop. The savings are bigger than they look because they compound across every tool call in a long conversation, not just the worst one. I did not build this system as a “local vs. remote” story, and I do not think that framing holds up. I built it as a cheap model absorbing the boring preprocessing job so the expensive model sees less and does more with it, which is a story that survives any shift in which model is expensive on which day.

The provider layer is the enabler

Everything above — the runtime profiles, the preprocessor role, the eventual local-main bet — depends on the agent loop not caring which model it is talking to. That decoupling is the piece of plumbing I am least excited to talk about and the piece that makes the rest of the architecture possible. A small interface sits between the loop and whichever client is active, the loop only speaks to that interface, and the interface does the rest. The payoff is that a future small local model that holds up in an agent loop is a one-file swap away from being usable for the hot path. No other part of the system needs to know.
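
The interface itself does not need to be more than a couple of methods. A minimal sketch, assuming a single completion entry point (the method shape is an assumption, not the project's actual interface):

    # Sketch of a provider interface the loop could depend on. Concrete
    # implementations wrap a remote API client or a local model; the loop
    # never knows which.
    from typing import Protocol

    class ModelProvider(Protocol):
        def complete(self, messages: list[dict], tools: list[dict] | None = None) -> dict:
            """Return the model's next message for the given conversation."""
            ...

    def run_turn(provider: ModelProvider, messages: list[dict], tools: list[dict]) -> dict:
        # The loop only ever holds a ModelProvider; swapping the cheap remote
        # for a local model means constructing a different implementation.
        return provider.complete(messages, tools=tools)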

The broader point past this project: every agent project I have worked on eventually wants an abstraction like this, and the sooner it exists the less painful the eventual provider swap is. There is always an eventual provider swap.

The setup flow is a first-class surface

The intended user of this project is anyone who wants to run it — not “me and maybe a collaborator.” That is the reason the web layer is as substantial as it is. The install wizard, the authentication, the mail configuration, the runtime-profile picker, the user management — all of it exists so that the path from “clone the repo” to “your background job is running and emailing you” does not route through hand-editing config files. If the value proposition is “deploy this and it stops costing you LLM tokens on routine work,” the setup experience has to actually work, or the whole proposition falls over on first contact with the user.

I would make that argument more generally for any self-hostable tool. Setup is a first-class surface worth treating like a feature, not a chore you write prose about in a README.

What I took away from building this

The more time I spent designing tools for the model to reach for rather than trying to make the model itself smarter, the more the work started to look like the shape of the bet I want to make on LLMs in general.

My read, having done this: the axis that matters going forward is not how much information a model has stuffed into its weights, but how well it can go and find the right information from a trustworthy source and form an opinion on what it just read. A model that is slightly less knowledgeable but knows how to reach for the right source, read it carefully, and tell you what it actually means is more useful than a bigger model confidently repeating something it half-remembers from pretraining. Building the first kind of model is a problem you solve with tools. Building the second kind is a problem you solve with parameter count, and parameter count is the thing that makes inference expensive.

This is not a new pattern. It is the same thing that was true of my own work long before LLMs existed. The hard part of almost every task I have ever actually been paid to do is not acting on the information — it is finding the right information to act on in the first place. Acting on poor or inaccurate information is worse than not acting, because everything downstream of it is already compromised; you end up shipping something confidently wrong, and the cost of being confidently wrong is almost always larger than the cost of being slower. If the input is bad, the output cannot be good, and no amount of polish further down the pipeline fixes the upstream problem.

The LLM version is the same pattern. Pretraining data is static and eventually stale, and the model’s honest answer to “what do you know about X” is a weighted average of what was true about X at some fixed point in the past. A tool that goes and reads the current version of X, structured for the model to consume, gives the model a shot at being right today. The design question stops being how do I pick a smarter model and becomes what does the model need to read, from where, and in what shape — which is exactly the design question the tools layer of this project was built around.

That is the bet the architecture was built for, and it is the bet I want to keep making in the next version.

Where it sits now, and what comes next

The project is currently shelved. Not dead — I intend to come back to it, and the reason I intend to come back is the reason I started it. Optimising LLM use is the way forward, not ignoring it, and compute-efficient automation where the LLM is consulted at construction time and then stays out of the loop is a better default than the “run the agent every hour” pattern that most frameworks encourage. Small locally running models are going to get good enough to sit on the hot path, and an architecture already built around that bet is a much shorter path to a working system than retrofitting a remote-first stack.

When I do reopen it, the things that survive into the next iteration are the scheduler, the preprocessor pattern, and the provider abstraction. The thing I most want to actually test is the escalation pattern in its intended pairing — a small local model as the main loop, escalating to a bigger remote only when genuinely stuck. That test never ran the first time because the local main loop never got far enough to need it. Some of the surrounding machinery — the install flow in particular — is probably over-built for where the project currently is, and is a candidate for aggressive simplification before anything new goes on top of it.

If any of this matches a problem you are also trying to avoid solving by burning tokens on it, I would be interested in comparing notes.
