Scaffolding Is the Leverage
The argument I keep watching is the wrong one.
People line up behind Karpathy and say AGI is ten years out because LLMs can’t learn on the job. People line up against him and point at benchmark curves. Both sides stare at the model. That’s the mistake. The model hasn’t been the load-bearing variable for a while now. The scaffolding around it has.
This is not a contrarian hot take. It’s the thing I keep bumping into every time I sit down at my machine.
What Actually Got Better
Claude Opus 4 versus Claude Opus 4.7 is a real delta, sure. But it is not the delta that changed my working life in the last twelve months. What changed my working life is that Claude Code grew planning, then skills, then subagents, then better context handling, then hooks, then memory. Miessler makes this point cleanly: Claude Code launched around March 2025 as roughly a 5x Opus multiplier, and ten months later it is night-and-day better — and most of that delta did not come from the base model. It came from iterative improvements in how the AI talks to itself.
That is scaffolding. Planning loops, context routing, tool access, working memory, feedback. Not parameters. Not training. The engineering around the engine.
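The shape of that loop is simple enough to sketch. Below is a minimal, deterministic toy version: a planning phase, context routing through tools, a working-memory log, and a verification check at the end. Every name here is illustrative (the "model" is a keyword-matching stub, not a real LLM call), not any product's actual API.

```python
# Minimal sketch of the scaffolding pattern: planning loop, context
# routing, tool access, working memory, feedback. The "model" is a
# deterministic stub; all names are hypothetical.

def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call. Routes on keywords, deterministically."""
    if prompt.startswith("PLAN"):
        return "search;summarize"
    if prompt.startswith("VERIFY"):
        return "ok" if "result:" in prompt.lower() else "retry"
    return "unknown"

TOOLS = {
    "search": lambda q: f"result: docs for {q}",
    "summarize": lambda text: text.upper(),
}

def run_task(task: str) -> str:
    memory = []                                    # working memory across steps
    steps = stub_model(f"PLAN {task}").split(";")  # planning phase
    context = task
    for step in steps:                             # context routing: each tool
        context = TOOLS[step](context)             # sees only what it needs
        memory.append((step, context))             # log every step taken
    verdict = stub_model(f"VERIFY {context}")      # feedback / verification
    return context if verdict == "ok" else "needs another pass"

print(run_task("auth bypass notes"))
```

None of the individual pieces are clever. The leverage is that the loop runs the pieces in order, checks its own output, and keeps a record, which is exactly what a naked completion call does not do.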
Karpathy is the sensei on naked LLM limitations. I concede that fully. But he is measuring the wrong variable for the question most of us actually care about. The question is not whether a model achieves continuous learning on its own. The question is whether a stitched system replaces the work of a knowledge worker. Trillions of dollars are hunting that bar, and they are not waiting for GPT-9. They are engineering around today’s limitations until the economics flip.
The Receipts
Three receipts from the last year that should end this debate if anyone is actually listening.
One. XBOW is the #1 HackerOne hacker in the United States. Not a benchmark. Not a press release. A fully automated AI agent is beating every human bug hunter in the US on a platform that only pays for real, first-to-find, non-duplicate, actually-exploitable vulnerabilities. If your position was “AI can’t really do this” or “it’ll take five to ten years,” I have news. It already happened. And XBOW is not winning because its underlying model is smarter than a pentester. It is winning because the system around the model is relentless, parallel, and never tires.
Two. Jason Haddix’s AI-driven recon stack found a live P1 inside fifteen minutes of being pointed at an admin login page. First move: add a parameter to the POST request. id=1. Instant bypass. Miessler and Haddix both know that move lives somewhere in their methodology. Neither would have reached for it first. The AI did. Not because it was smarter. Because scaffolding flattens the cost of trying moves humans deprioritize.
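That flattening is easy to see in miniature. A scaffolded system can enumerate every low-effort mutation, the id=1 trick among them, before spending effort on anything expensive, because trying a cheap move costs it nothing. A hedged sketch, with requests as plain dicts and no real target involved:

```python
# Sketch of cost-flattening: enumerate cheap request mutations before
# anything expensive. Parameter list and request shape are illustrative;
# requests are plain dicts, no network is touched.

CHEAP_PARAMS = [("id", "1"), ("admin", "true"), ("debug", "1"), ("role", "admin")]

def mutate(request: dict) -> list[dict]:
    """One candidate request per cheap parameter not already present."""
    out = []
    for key, value in CHEAP_PARAMS:
        if key not in request["body"]:
            out.append({**request, "body": {**request["body"], key: value}})
    return out

login = {"method": "POST", "url": "/admin/login", "body": {"user": "x", "pass": "y"}}
for candidate in mutate(login):
    print(candidate["body"])
```

A human triages this list by gut feel and skips most of it. The system just tries all four.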
Three. AI is finding real kernel-class bugs in production systems. Not hallucinating CVEs. Real ones. The XBOW work, the AIxCC results, the trail of accepted reports at Google and Microsoft — this is a system story, not a model story. Matthew Brown at Trail of Bits, who led the team that took second at AIxCC, put it flatly when Miessler asked him: model or system? System. Every time.
The naked model did not suddenly become a security researcher. A system built around the model became one.
What I See From Where I Sit
I built my digital assistant Isidore on top of DAI — Digital Assistant Infrastructure — and I have spent the last stretch watching exactly where the leverage shows up. It does not show up on the days the model gets an upgrade. It shows up on the days I add a skill, wire in a new hook, refine the routing, tighten a subagent’s context window, give the planning phase a better checklist.

When Isidore gets sharper, it is almost never because Anthropic shipped something. It is because I shipped something. A better algorithm. A cleaner context file. A subagent with its own scoped CLAUDE.md so the main thread does not get polluted by a giant DOM. A hook that fires at the right lifecycle moment and enforces a rule the model would otherwise forget.
Miessler describes the same architecture pattern for his security stack: separate subfolders, separate CLAUDE.md files, separate large context per subagent so the heavy stuff stays compartmentalized. This is not exotic. This is plumbing. And the plumbing is where the intelligence lives.
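The hook pattern in particular deserves a sketch, because it is the piece people most often skip. The idea: callbacks registered on lifecycle events, each with a veto, enforcing rules the model would otherwise forget. The event names and registry below are illustrative, not any real product's hook API.

```python
# Sketch of the hook pattern: callbacks on lifecycle events, each able
# to veto a tool call. Event names and registry are hypothetical.

HOOKS = {"pre_tool": [], "post_tool": []}

def on(event):
    """Decorator that registers a callback for a lifecycle event."""
    def register(fn):
        HOOKS[event].append(fn)
        return fn
    return register

@on("pre_tool")
def forbid_destructive(call):
    # The rule the model would otherwise forget, enforced mechanically.
    if call["tool"] == "shell" and "rm -rf" in call["args"]:
        raise PermissionError("destructive command blocked by hook")

def run_tool(call, tools):
    for hook in HOOKS["pre_tool"]:    # every pre_tool hook gets a veto
        hook(call)
    result = tools[call["tool"]](call["args"])
    for hook in HOOKS["post_tool"]:   # post hooks observe the result
        hook(call, result)
    return result

tools = {"shell": lambda args: f"ran: {args}"}
print(run_tool({"tool": "shell", "args": "ls"}, tools))
```

The rule is not a suggestion in a prompt that the model may or may not honor. It is code that runs every time, at the same lifecycle moment, no matter what the model intended.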
I built a small internal tool in an afternoon recently that would have been a two-week project a year ago. Not because I was faster. Not because the model got smarter in those twelve months. Because the scaffolding around the model — the DAI algorithm, the skills, the hooks, the context routing — did the work of a small team. The model was the engine. Everything else was the car.
Dynamic Context Is the Real Frontier
Miessler names the missing capability precisely: dynamic context. A cheap, fast system that pulls the perfect knowledge into the perfect moment for the perfect decision. Humans do this automatically — decades of experience silently folded into every small call we make. Models still need context stuffed in for them: retrieval, prompt tricks, skills files, subagent handoffs.
The gap between a model with great dynamic context and a model without it is larger than the gap between two frontier model generations.
This is the thing I want more people to internalize. You do not close that gap by waiting. You close it by building. Chunking strategies. Real-time vector stores. Tiny local retrievers. Skills as modular expertise packs. Memory files that persist across sessions. Planning loops that know when to ask for more context and when they have enough.
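A tiny local retriever, the kind of thing on that list, fits in a few lines. The sketch below scores memory chunks against the current decision and returns only the best ones to stuff into the prompt. It uses bag-of-words cosine similarity from the standard library; a real stack would swap in embeddings, but the shape is the same.

```python
# Tiny local retriever in the spirit of dynamic context: rank memory
# chunks against the current decision, keep only the top k. Bag-of-words
# cosine, stdlib only; a real system would use embeddings.

import math
from collections import Counter

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]

memory = [
    "subagents keep heavy context out of the main thread",
    "hooks enforce rules at lifecycle moments",
    "fine-tuning locks in yesterday's knowledge",
]
print(retrieve("which hook fires at a lifecycle moment", memory, k=1))
```

Thirty lines, and the prompt now carries the one chunk that matters instead of everything the system has ever seen. That is the whole dynamic-context bet, scaled down.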
This is engineering. It is not glamorous. It does not fit in a benchmark chart. It compounds anyway.
The Model Lab Cosplay
Miessler wrote a companion piece in March arguing that he has never believed in training custom small models or fine-tuning for enterprise tasks. Best SOTA model plus sharp context management, every time. I agree — not out of loyalty, but because the math is obvious. Fine-tuning locks in yesterday’s knowledge, caps broader reasoning, and recreates at great cost something the next frontier release will ship for free.
Every enterprise I have seen try to play model lab ended up with a brittle specialist that got stale within a quarter. Every builder I know who invested in context, retrieval, tools, and evaluation loops kept compounding.
Context is the product surface. Weights are not.
Why This Ends the Karpathy Argument
If you define AGI as the naked model doing continuous learning through pure gradient flow, Karpathy wins. Sure. Congratulations. That definition is true and irrelevant.
If you define AGI the way Miessler does — a system that replaces the average knowledge worker — then scaffolding already closed most of the gap, and the rest is a funding and engineering question. Trillions of dollars against the lowest bar we have ever set. Not “beat the top 10%.” Beat the median. The median is not a moving target. AI capability is. You do the math.
Karpathy is watching the engine. The rest of us are shipping cars.
The Part That Should Wake You Up
I am not writing this because I think the model doesn’t matter. It does. A better engine makes every car faster. I am writing this because I watch smart people argue about engines while the world around them is being rebuilt by people who are quietly, relentlessly improving the chassis.
If you are a builder, this is the most leveraged moment of your life. Not because models got smart. Because the gap between a bare model and a well-scaffolded model is where all the value is sitting, and almost nobody is building there yet. Skills, hooks, memory, context routing, subagent orchestration, planning loops, verification layers. Pick any one. Go deep. The compounding will surprise you.
If you are waiting for AGI to arrive so you can start building, you are watching the wrong door. It is already coming through the door behind you, carrying a pile of YAML and a skills folder.
The scaffolding is the intelligence multiplier. The scaffolding is the leverage. Build there.