GPT-4 and the First Real Pair-Programming AI
Jerwin Arnado
Archive note: this is a backdated post, written years later while rebuilding this site. It’s dated to the moment it covers, but the hindsight is real.
OpenAI released GPT-4 on March 14, and after ten days of using it through ChatGPT Plus, I’m retiring the hedge from January’s post. The 3.5-era model was a brilliant autocomplete with a gambling problem. This one is something closer to a colleague.
What changed, concretely
Benchmarks (a simulated bar exam score around the 90th percentile, etc.) make headlines; the working differences make converts:
- It holds a system in its head. GPT-3.5 answered question-by-question, goldfish-style. GPT-4 tracks a multi-file Laravel feature across a long conversation — the migration, the model, the form request, the policy, the test — and keeps them consistent. Pair-programming requires shared context; this is the first model that maintains one.
- It pushes back, correctly. I deliberately fed it flawed approaches I’d seen in real code review: an N+1 disguised by a loop, a race condition in a balance check. The old model would politely implement the bug. GPT-4, more often than not, flags it — “this works, but under concurrent requests…” That’s the difference between an intern who agrees with everything and a peer.
- Hallucinations dropped from constant to occasional — which is more dangerous, not less. With 3.5 you verified everything by reflex. With GPT-4 the failure rate is low enough to build false trust, then it invents a config option with absolute fluency at the worst possible moment. New tool discipline: the more plausible the output, the more the unusual claims deserve checking.
- It reads images. Multimodal input means screenshot-of-error-message or photo-of-whiteboard-schema as a prompt. Early days for access, but the direction is clear: the interface to the machine is becoming “however humans naturally show things to each other.”
- Context windows up to 32K tokens (rolling out) — that’s “paste the entire legacy class plus its test file” territory. The unit of conversation is shifting from snippet to module.
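To make the balance-check race from the list above concrete, here is a minimal Python sketch. It is not the code from my review sessions: the `Account` class, its method names, and the in-memory balance are all hypothetical stand-ins for a database row. The unsafe version checks and then subtracts as two separate steps; the safe version does both under one lock. In a database-backed app the same idea usually shows up as a row lock inside a transaction.

```python
import threading


class Account:
    """Hypothetical in-memory account; stands in for a database row."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw_unsafe(self, amount):
        # Check-then-act as two steps: under concurrent requests, two
        # threads can both pass the check before either one subtracts,
        # which can drive the balance negative.
        if self.balance >= amount:
            self.balance -= amount
            return True
        return False

    def withdraw_safe(self, amount):
        # The check and the update happen while holding one lock,
        # so no other withdrawal can slip in between them.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False


def run_concurrent_withdrawals(account, amount, workers):
    """Fire `workers` simultaneous safe withdrawals; return each outcome."""
    results = []
    results_lock = threading.Lock()

    def worker():
        ok = account.withdraw_safe(amount)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a starting balance of 100 and ten concurrent withdrawals of 30, the safe path lets exactly three through and leaves 10 behind. The unsafe path usually looks fine in a quick manual test, which is exactly why it survives to code review.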
How my workflow actually changed
GitHub also announced Copilot X this week — chat, PR descriptions, the works — so the IDE-native version of all this is coming. Meanwhile, my honest loop: I write the contract (route, interface, test expectations), GPT-4 drafts the implementation, I review like it’s a PR from a talented stranger with no stake in our production uptime. Net effect on a typical CRUD-heavy day: significant. On genuinely novel problems: it’s a sounding board, not an oracle — the Sydney lesson about confident nonsense still applies above a certain difficulty.
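The contract-first loop described above can be sketched roughly like this, in Python rather than Laravel for brevity. Everything named here is hypothetical: `SlugGenerator`, `check_contract`, and `DraftedSlugGenerator` are illustrative only. The point is the split of ownership: the interface and the expectations are written by hand, and the implementation is the part a model might draft and a human then reviews.

```python
import re
from typing import Protocol


class SlugGenerator(Protocol):
    """The contract, written by hand before any implementation exists."""

    def slugify(self, title: str) -> str:
        ...


def check_contract(impl: SlugGenerator) -> None:
    """Test expectations, also written by hand; any draft must pass them."""
    assert impl.slugify("Hello World") == "hello-world"
    assert impl.slugify("  GPT-4, Reviewed!  ") == "gpt-4-reviewed"
    assert impl.slugify("") == ""


class DraftedSlugGenerator:
    """Stand-in for a model-drafted implementation, reviewed like a PR."""

    def slugify(self, title: str) -> str:
        # Keep lowercase letters and digits; join the runs with hyphens.
        words = re.findall(r"[a-z0-9]+", title.lower())
        return "-".join(words)
```

Calling `check_contract(DraftedSlugGenerator())` either passes silently or fails on the first broken expectation. Passing only means the draft meets the cases that were written down, so the review step still matters for everything the contract does not cover.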
The deeper shift is psychological. I’ve stopped asking “can it do this?” and started asking “is it worth delegating this?”, which is a management question. Eight years into this career, the job description is quietly being edited from “writes the code” toward “specifies, reviews, and owns the code.” Those were always the senior skills; the surprise is watching them become the whole skill.
The pin for future me
Prediction bank, March 2023 edition: within two years, “I pair with an AI daily” will be as unremarkable as “I use Git.” The interesting fights will be about codebase access (these models are an API call to someone else’s servers — supply chain rules apply) and about whether the open-weights pattern reaches this capability tier too. Watch that second one closely; my homelab is.