Context in, verification out

Two people open the same chat with the same model. Same task, roughly the same prompt. One walks away with a usable draft. The other walks away with a confident-sounding mess that takes an hour to unwind.

The model didn’t change between the sessions. The operator did. AI assistants are bounded on two sides — what you put in, and what you trust coming out. Neither side belongs to the model. Both are habits you have to build, and both survive every model upgrade.

Most arguments about “is AI useful?” are arguments about whether the speaker has built those habits. The “10x productivity” stories are usually first-half stories — got the context right, didn’t bother to verify, will find the bug in production. The “AI is useless” stories are usually second-half stories — verified everything carefully, but fed the model garbage going in.

The two halves are worth naming.

Context in

The first habit is choosing what the assistant sees, and what it doesn’t.

This sounds obvious until you watch how often it fails in both directions. Dumping everything — paste the whole repo, paste the whole conversation, paste the whole spec — buries the signal under noise. The model loses the thread, anchors on something irrelevant, and produces an answer that’s technically about your input but no longer about your question. Being too terse — “fix this” with no surrounding code, no constraints, no failing test — leaves the model to fill the gaps from its priors. The output will sound competent. It will solve a different problem than the one you have.

The discipline is treating the context window like a desk. What’s on the desk shapes what gets done. Clutter the desk and nothing useful happens. Clear it entirely and you’re working from memory you don’t have.

It shows up as small, boring choices:

Pin the relevant doc, not the whole repo. The model doesn’t need every file you have. It needs the three files that matter to the change you’re making.
Restate the constraint in the message. Don’t assume earlier turns will hold. Long sessions drift; the model’s notion of “the constraint” decays as the conversation grows. If the database has to stay backward-compatible, say so in the turn where you’re asking for the migration — not five turns ago.
Start a fresh session when context is contaminated. Once you’ve had three failed attempts at the same task, the failed attempts are part of the context. A new session with what you’ve actually learned beats trying to dig out of the hole.
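One way to make those choices mechanical is to assemble each request from scratch. What follows is a minimal sketch, not a recommended library: `build_prompt` and `pin` are hypothetical helpers, and the three-file limit simply echoes the "three files that matter" above.

```python
from pathlib import Path


def build_prompt(task: str, constraint: str, files: dict[str, str]) -> str:
    """Assemble one message: pinned files first, then the constraint and the task."""
    parts = []
    for name, text in files.items():
        parts.append(f"--- {name} ---\n{text.strip()}")
    # Restate the constraint in this turn instead of relying on earlier turns.
    parts.append(f"Constraint: {constraint}")
    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


def pin(paths: list[str], limit: int = 3) -> dict[str, str]:
    """Read only the named files; refuse to load more than `limit` (an assumed cap)."""
    if len(paths) > limit:
        raise ValueError(f"{len(paths)} files pinned; trim to {limit} or fewer")
    return {p: Path(p).read_text(encoding="utf-8") for p in paths}
```

Because the message is rebuilt every time, it is also what a fresh session would open with: the files that matter, the constraint, the task, and none of the failed attempts.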

Bigger context windows do not retire this discipline. They sharpen it. A larger window is a larger desk — more room to be cluttered, more competition for the model’s attention, more ways to bury the question you actually asked. Putting in the right things is the skill that keeps paying.

Verification out

The second habit is never trusting an output you can’t check.

The trap is fluency. Modern assistants produce confident, well-formed answers regardless of whether the answer is correct. The prose reads the same when it’s right and when it’s wrong. There is no tone-of-voice signal you can rely on. The fluent wrong answer is exactly as comfortable as the fluent right answer. Distrust the comfort.

The discipline is to make verification cheaper than trust. The specifics depend on what the model gave you, but the moves are limited:

Run the code. The “this should work” answer is not the same as the “I tested it and it works” answer. The asymmetry between the two is the entire job.
Follow the citation. When the model offers a source, open it. About half the time the source exists and says something close to the claim. The other half is interesting, in the bad sense.
Sanity-check the number. A figure produced by a confident sentence is still a figure produced by a confident sentence. If the answer is “the API costs $0.003 per call”, check the pricing page. If the answer is “this saves 30%”, check it against an actual measurement.
Ask for the disagreement. “Where would someone push back on this?” or “What’s the strongest case against this approach?” surfaces the soft spots the model wouldn’t flag on its own.
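The first move, running the code, is the easiest to automate. A rough sketch, assuming the model's answer is a Python snippet and you already have an assertion that describes what "works" means; `passes` is a hypothetical helper, and a separate process is not a sandbox, so this is for code you would be willing to run by hand.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes(generated_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run the generated code followed by your test in a separate interpreter.

    Returns True only if the process exits with status 0 within the timeout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(generated_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hang counts as a failure, not as "probably fine".
            return False
        return result.returncode == 0
```

Calling `passes(snippet, "assert add(2, 3) == 5")` turns "this should work" into a yes or a no. The test has to come from you; a test written by the same answer you are checking proves much less.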

The cost asymmetry is the whole point. A confident wrong answer costs more than a hesitant right one, because hesitation triggers the check and confidence skips it. A workflow that always triggers the check, regardless of how confident the output sounds, is a workflow that earns the productivity claim.
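A workflow that always triggers the check can be as small as a gate with no confidence input. This is an illustrative sketch only; `accept` and the `Check` type are names made up for this example.

```python
from typing import Callable, Iterable

# A check takes the model's output and answers one question: did it hold up?
Check = Callable[[str], bool]


def accept(output: str, checks: Iterable[Check]) -> bool:
    """Accept an output only if at least one check ran and all of them passed.

    There is deliberately no confidence parameter: how sure the output
    sounds has no influence on whether the checks run.
    """
    results = [check(output) for check in checks]
    return bool(results) and all(results)
```

An empty list of checks is a rejection, not a pass. That is the second habit stated as code: an output you can't check is one you don't accept.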

Why these two and not “better prompting”

There’s a category of advice — prompt engineering, prompt frameworks, prompt patterns — that points at the same territory from a different angle. Most of it is downstream of the two habits above. A “good prompt” is one that supplies the right context and makes the answer verifiable. The framing isn’t wrong; it’s just narrower than it advertises.

Calling the umbrella “prompting” hides the two things that actually matter. The prompt is a delivery mechanism for context-in. The output format is a delivery mechanism for verifiable-out. Optimize the wrapper without doing the work and you’ve moved tokens around without changing what came back.
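To make "the output format is a delivery mechanism for verifiable-out" concrete, here is a small sketch. It assumes you asked the assistant to reply with a JSON array of objects shaped like `{"claim": ..., "source": ...}`; that shape and the `unsourced_claims` helper are conventions invented for this example, not anything a model produces by default.

```python
import json


def unsourced_claims(reply: str) -> list[str]:
    """Return the claims in a JSON reply that come with no source to follow.

    Assumes the reply is a JSON array of {"claim": ..., "source": ...}
    objects; a missing, null, or blank source counts as unsourced.
    """
    items = json.loads(reply)
    return [
        item["claim"]
        for item in items
        if not str(item.get("source") or "").strip()
    ]
```

The format does not verify anything by itself. It only tells you which claims have a citation to follow and which have nothing to open; the sourced ones still need the click.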

Why they survive the model

Models change. The window will get bigger, the hallucination rate will drop, the tool integrations will improve, the agent harness around it all will keep absorbing capabilities. None of that retires the operator’s job. The opposite:

A bigger window means more competition for the model’s attention. Curation matters more, not less.
A lower hallucination rate means failures are subtler and harder to catch. Verification matters more, not less.
Better tools mean a single mistake travels further before someone catches it — the assistant that can run code can also run the wrong code. Verification matters even more.

The disciplines compound. Model upgrades happen at the margin.

This is the same pattern that shows up around every other tool with an outsized capability ceiling. The leverage is real. The leverage doesn’t replace the user.

The argument worth having

Most public AI arguments are model arguments — which one is best, which one is overrated, which one is on top this month. These get answered on a forum thread and then re-litigated next quarter when the rankings shift.

The argument worth having is private. Was your week better with the assistant than without it? If yes, what did you do that the unhappy version of you didn’t? If no, where did the loop break — going in, or coming out?

Whichever model you’re using, that’s the only question that compounds.
