
Models are getting faster and cheaper on a curve that outpaces their improvement in quality. Inference cost drops by orders of magnitude per year while output quality improves incrementally. This asymmetry is the entire story of AI-assisted coding right now, and almost everyone in the space is ignoring it.

The common explanations for why AI coding tools underperform are wrong. It is not a context window problem. Look at SWE-Bench. A large portion of the test cases do not require broad contextual understanding to solve. Longer context helps at the margins, but the failures are not "the model couldn't see enough code." It is not an information retrieval problem either. RAG is mature. If the bottleneck were retrieval, we would expect to clear 40% on SWE-Bench with current techniques. We do not.

The problem is that we ask models to code the way no human would ever agree to code.

When a human learns to program, they get an IDE, a debugger, a terminal, documentation, and the ability to run what they write and see what happens. They experiment. They execute. They read stack traces. They try again. When we ask a model to write code, we give it a text prompt and expect finished output. We ask it to write code in a Google Doc and push to prod. No human would accept those working conditions, and humans are better at coding than these models. Why would a less capable agent produce good results with less tooling?

The right approach accepts that models are worse than humans at reasoning about code but faster and cheaper at producing it. This means the viable workflow is: generate, execute, test, revise. The model writes code. It runs the code. It sees what breaks. It fixes what breaks. Humans provide the test cases and the acceptance criteria. When the tests pass, the program is done. The model compensates for weaker reasoning by iterating at the speed inference permits.
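The loop described above can be sketched in a few lines. This is a minimal illustration under assumptions, not any vendor's API: `run_candidate`, `converge`, and `stub_model` are hypothetical names, and the stub stands in for a real inference call. In practice, executing model-generated code requires a sandbox, not a bare subprocess.

```python
import subprocess
import sys
import tempfile

def run_candidate(code, test):
    """Write candidate code plus the human-supplied test to a file,
    execute it in a subprocess, and return (passed, stderr) so the
    failure output can be fed back to the model."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=10)
    return proc.returncode == 0, proc.stderr

def converge(generate, test, max_iters=200):
    """Generate-execute-test-revise: call the model, run the test,
    feed the traceback back as context, repeat until the test passes."""
    feedback = ""
    for _ in range(max_iters):
        code = generate(feedback)              # model call (stubbed below)
        passed, feedback = run_candidate(code, test)
        if passed:
            return code                        # tests pass -> program is done
    return None                                # failed to converge

# Stand-in for a model: emits buggy code until it sees a traceback,
# then emits the fix -- real models revise based on the stderr text.
BUGGY = "def add(a, b):\n    return a - b\n"
FIXED = "def add(a, b):\n    return a + b\n"
stub_model = lambda feedback: FIXED if feedback else BUGGY

result = converge(stub_model, "assert add(2, 3) == 5\n")
```

The design point is that the human contributes only the test string; the loop needs no human judgment between iterations, which is what lets it run at inference speed.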

This is where the cost curve matters. If a generation-execution-test loop takes 200 iterations to converge on a working solution, and each iteration costs a fraction of a cent and takes under a second, the total cost is negligible and the total time is minutes. The model compensates for lower reasoning quality with higher iteration volume. This only works when inference is fast and cheap, a condition that is nearly met today and will be fully met within a year.
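Spelled out with assumed per-iteration figures (the $0.002 and 0.5 s numbers are illustrative placeholders for "a fraction of a cent" and "under a second," not measured rates):

```python
iterations = 200                # iterations to converge, as above
cost_per_iter = 0.002           # dollars per iteration -- assumed
secs_per_iter = 0.5             # seconds per iteration -- assumed

total_cost = iterations * cost_per_iter          # ~$0.40 for the whole run
total_minutes = iterations * secs_per_iter / 60  # under two minutes
```

Forty cents and two minutes for 200 attempts is the regime where iteration volume substitutes for reasoning quality.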

The companies in this space are mostly solving the wrong problem. Enterprise search tools like Glean build fine-tuned RAG over internal documents. Useful for finding which HR manager handles reimbursements. Useless for writing code. Startups like Blitzy advertise "unparalleled context" and "up-to-the-minute intelligence," which are incremental improvements on the wrong architecture. "Idea to product in days" is a modest claim if the underlying model can iterate at millisecond timescales. Documentation tools like Mintlify automate the easy part — writing docs is simple, understanding them is hard.

These companies all treat the model as an oracle that should produce correct output on the first pass given sufficient context. That framing is wrong. The model is a fast, cheap, unreliable generator that converges on correct output through rapid iteration with execution feedback. The loop is the product.

One more thing. Humans think whether or not they receive input. There is a constant internal process of asking and answering questions, forming and revising mental models, that runs independent of any external task. Current models sit inert between API calls. There is no self-directed exploration, no background reasoning, no accumulation of understanding between sessions. A model that maintained a continuous internal dialogue and only oriented toward a task when given one would be a fundamentally different kind of coding agent. That does not exist yet. But the iteration loop does, and it is enough to start.

9/17/2024