o1 and “Reasoning” Models Change the Game

Archive note: this is a backdated post, written years later while rebuilding this site. It’s dated to the moment it covers, but the hindsight is real.

On September 12, OpenAI released o1-preview, and the headline mechanic sounds almost too simple: the model thinks before answering. Ask it something hard and it visibly pauses — seconds, sometimes a minute — burning “reasoning tokens” on an internal chain of thought before producing output. The results on the right problem class are startling: competition math, algorithm puzzles, multi-step debugging — benchmark jumps that two years of ordinary scaling didn’t deliver.

The shift under the hood matters more than the scores: capability is now purchasable at inference time. The scaling story since GPT-3 was “train bigger”: capability was fixed at training time, and the compute spent per answer stayed roughly constant. o1 adds a second dial: train the model (apparently via reinforcement learning) to use extended deliberation, then let it spend variable compute per question. Intelligence becomes, partly, a runtime resource: metered, like everything else in our industry, by the unit consumed.

Field notes from ten days

Calibration against the daily-pairing baseline:

On genuinely hard problems, the difference is real. A gnarly scheduling-logic bug that GPT-4o and Claude both confidently fumbled, o1 worked through correctly — you can feel the difference between pattern-matched and reasoned-through. Tricky migrations, concurrency questions, algorithmic work: new tool of choice.
On everything else, it’s worse. Slower, pricier, and oddly stilted for the 90% of work that is ordinary CRUD-and-glue. Asking o1 to scaffold a controller is hiring a chess grandmaster to alphabetize your bookshelf. The frontier is no longer one model but a portfolio — fast-cheap for volume, frontier-chat for breadth, reasoning for depth — and routing between them is suddenly an architecture decision with a budget attached.
The thinking is hidden, and that’s a choice worth noticing. You see a summary, not the actual chain of thought: competitive secrecy, openly admitted. So the auditability gap widens exactly as the answers get more authoritative: “trust me, I thought about it” is now a product tier. The verification discipline doesn’t relax; it gets harder to apply, because the work product you’d check is the part withheld.
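To make the portfolio point concrete, here is a minimal sketch of what “routing with a budget attached” could look like. Everything in it is an assumption for illustration: the tier names, the 0-to-1 difficulty score (which something upstream would have to estimate), the thresholds, and the per-call cost figures are hypothetical placeholders, not real prices or anyone’s production design.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FAST = "fast-cheap"        # volume work
    CHAT = "frontier-chat"     # breadth
    REASONING = "reasoning"    # depth


@dataclass(frozen=True)
class Task:
    description: str
    difficulty: float   # hypothetical 0..1 estimate supplied by the caller
    budget_usd: float   # the most the caller will pay for this one answer


# Hypothetical per-call cost estimates in USD. Placeholders, not real prices.
EST_COST_USD = {Tier.FAST: 0.001, Tier.CHAT: 0.01, Tier.REASONING: 0.25}

# Tiers ordered from most to least capable, used for stepping down on budget.
STEP_DOWN = [Tier.REASONING, Tier.CHAT, Tier.FAST]


def route(task: Task) -> Tier:
    """Pick the tier the difficulty calls for, then step down until affordable."""
    if task.difficulty >= 0.8:      # illustrative threshold
        wanted = Tier.REASONING
    elif task.difficulty >= 0.3:    # illustrative threshold
        wanted = Tier.CHAT
    else:
        wanted = Tier.FAST

    for tier in STEP_DOWN[STEP_DOWN.index(wanted):]:
        if EST_COST_USD[tier] <= task.budget_usd:
            return tier
    # Nothing fits the budget: fall back to the cheapest tier.
    return Tier.FAST


if __name__ == "__main__":
    print(route(Task("scaffold a controller", difficulty=0.1, budget_usd=1.00)))
    print(route(Task("scheduling-logic bug", difficulty=0.9, budget_usd=1.00)))
    print(route(Task("scheduling-logic bug", difficulty=0.9, budget_usd=0.05)))
```

The hard part is hidden in the `difficulty` field: estimating it before you have spent the money is the actual architecture decision, and this sketch simply takes it as given.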

The economics, which are the actual story

Reasoning tokens are billed. Hard questions cost visibly more than easy ones — and for the first time the cost curve points the right way: at the problem’s difficulty rather than the answer’s length. But notice what else arrives with that: a future where the best available cognition is priced per deliberation, and the gap between what rents and what runs free becomes a gap in thinking depth, not just fluency. The open-weights floor has matched fluency before; whether it can match deliberation is the question that decides how concentrated this technology’s benefits get. Early replication attempts are already underway, naturally.
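The arithmetic behind “billed by difficulty, not length” is simple enough to sketch. The function below assumes hidden reasoning tokens are charged at the same rate as visible output tokens; the rates and token counts in the example are made-up placeholders chosen for round numbers, not quoted prices or measurements.

```python
def answer_cost_usd(
    input_tokens: int,
    reasoning_tokens: int,
    visible_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    """Cost of one answer, assuming reasoning tokens bill as output tokens."""
    billed_output = reasoning_tokens + visible_tokens
    total = input_tokens * input_rate_per_m + billed_output * output_rate_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder rates in USD per million tokens (illustrative only).
    rates = dict(input_rate_per_m=10.0, output_rate_per_m=40.0)

    # Same prompt size, same visible answer length; only the deliberation differs.
    easy = answer_cost_usd(500, reasoning_tokens=200, visible_tokens=300, **rates)
    hard = answer_cost_usd(500, reasoning_tokens=8_000, visible_tokens=300, **rates)

    print(f"easy question: ${easy:.4f}")
    print(f"hard question: ${hard:.4f}")
    print(f"ratio: {hard / easy:.1f}x")
```

Both answers are the same length on the page. Under these assumptions, the whole price difference comes from tokens the reader never sees.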

Filed prediction

Banked for the December audit: the “reasoning” framing will prove half-right. The capability jump on verifiable domains (math, code, logic — anywhere an answer can be checked) is real and will compound fast, because checkable domains are where RL training feasts. The soft domains — judgment, taste, ambiguity — will improve far less than the marketing implies, and the industry will spend 2025 discovering which of its jobs were secretly checkable. Mine, uncomfortably, is more checkable than most.

The slow, expensive model that’s right about hard things. Somewhere, every senior engineer just felt a disturbance in their job description.