Out of the model

A model can only ever produce text. Read some tokens, emit some more: that’s the whole job. On its own it can’t send an email, move money, or delete a file. It’ll describe doing all three in convincing detail, cheerfully, and not one of them happens. Describing is as far as it reaches.

An agent is what gives it reach. You wrap the model in a loop that reads its output, hands it tools, feeds it a memory and whatever it scrapes off the open web, and lets it act with nobody in the middle. That loop is the harness: the place where a sentence becomes an action. It’s also where a lot of the real-world risk now lives, and it’s the newest, least-scrutinised part of the whole thing.

Nearly all of the serious safety work points at the model itself: alignment, refusals, adversarial robustness, red-teaming, interpretability, detectors for deception. It’s deep, unsolved, genuinely hard, and it matters. It’s also aimed almost entirely at one layer of what’s quietly become a much taller stack. Wrapping a model in an agent adds new layers underneath, and those haven’t had the same years of scrutiny, or anything close.

So, frankly, here’s what’s new once the model becomes an agent.

The refusals don’t hold. You can fine-tune the “I can’t help with that” out of a frontier model for a few dollars and keep every ounce of its competence (the conscience comes off, the cleverness stays). Inside a deployed agent the refusal was never really a control anyway; it’s a default the harness can route around, or strip out entirely.

The model can’t tell instructions from data. There’s no reliable boundary between text it should treat as content and text it should obey. Point an agent at an inbox, a web page, a calendar invite, and a sentence buried in any of them can quietly hijack what it does with its tools. Prompt injection is the fastest-growing class of attack on deployed systems, and there’s no tidy fix, because the confusion lives in how the model reads, not in a bug you can patch.

The behaviour worth fearing only shows up in motion. A model flattering the evaluator who decides its fate, or quietly throwing a test it could have aced, never appears in any single reply. It emerges over many turns, with tools, across a whole run. You catch it only in the execution trace, which the harness owns and the weights don’t.

The actions are real, and usually you can’t take them back. The model emits “delete the records” and a function dutifully deletes them. The cost of a mistake stops being an awkward paragraph and becomes a wire transfer, a leaked key, a 3 a.m. outage.

And it compounds. Wire a few agents together and one poisoned output becomes the next one’s trusted input.

None of this is a knock on the model-level work. It’s necessary, it’s nowhere near finished, and the sharpest people in the field are nose-deep in it. The point is only that a model is one layer, and the entire case for defence in depth is that you need several independent ones. The agent wraps the model in newer layers: untrusted input, broad tool permissions, a standing memory, other agents. They’re the youngest and least built-out, and they’re the ones actually touching the world.

The honest version of the ledger isn’t “the model might say something bad.” It’s that we’ve wired a system that can’t reliably tell a command from a comment to a set of tools that act on the world, and the layer holding that line is the youngest, thinnest part of the whole stack.

We built the reach first. The brakes are coming second.