February 2025 · 12 min read

Three Years of LLMs: What the Benchmarks Actually Show

GPT-3.5 launched in late 2022 with a 4K context window and struggled with high-school math. Here's how far things moved, and what the numbers mean in practice.

AI Models · Benchmarks · LLMs · Deep Learning

Where Things Stood in Late 2022

ChatGPT launched in November 2022 on GPT-3.5. The context window was 4,096 tokens — roughly three pages. On MMLU, a benchmark covering 57 subjects from physics to law, it scored around 70%, which felt impressive at the time. GPT-4 followed in March 2023 at 86.4% and added the ability to read images.

The problem was multi-step reasoning. Ask a model to write a paragraph and it was great. Ask it to solve a competition math problem that required five logical steps in sequence, and it would confidently give you a wrong answer at step three and keep going. GPT-4 scored 52.9% on the MATH benchmark — a set of high-school and early-college competition problems. Nearly half of those problems were beyond the best model available. That number matters because it anchors how far things have moved.

Context Windows: The Change Nobody Talks About Enough

The spec number that changed most dramatically isn't benchmark scores — it's context length. GPT-3 launched with 2,048 tokens. Claude 2, in 2023, went to 100K. Gemini 1.5 Pro in early 2024 hit 1 million tokens — enough for five full novels, around 100 research papers, or a mid-sized codebase all in a single prompt.

Epoch AI tracked this and found roughly 30× annual growth in maximum context length since mid-2023. More importantly, effective long-context performance — the model's ability to actually retrieve and use information from deep in a long document, not just hold it in context — improved roughly 250× in nine months. That's not the same thing as a bigger context window. Early long-context models would reliably miss facts from the middle of long documents. That changed.
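The standard way labs measure effective long-context retrieval is a "needle in a haystack" test: bury one fact at a controlled depth inside filler text and ask the model to retrieve it. A minimal sketch of the probe-building half of that harness (the needle, filler, and question are placeholders; the model call itself is omitted):

```python
def build_haystack_prompt(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Bury `needle` at fractional `depth` (0.0 = start, 1.0 = end) inside
    filler text padded to roughly `total_chars` characters."""
    assert 0.0 <= depth <= 1.0
    # Repeat the filler until there is enough padding, then split at the depth point.
    padding = (filler * (total_chars // len(filler) + 1))[:total_chars]
    cut = int(len(padding) * depth)
    document = padding[:cut] + "\n" + needle + "\n" + padding[cut:]
    return document + "\n\nQuestion: what was stated above about the magic number?"

prompt = build_haystack_prompt(
    needle="The magic number is 42.",
    filler="Lorem ipsum dolor sit amet. ",
    total_chars=4_000,   # scale this up toward the context limit under test
    depth=0.5,           # middle of the document, where early models failed
)
```

Sweeping `depth` from 0.0 to 1.0 at increasing `total_chars` is what produces the familiar retrieval heatmaps; "lost in the middle" shows up as failures clustered around depth 0.5.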

From a practical standpoint this is bigger than most benchmark improvements. It's why agentic coding tools can now read your entire repository before making changes. It's why document analysis pipelines that would have required chunking and retrieval can now just... work directly.
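A quick way to check whether "the whole repo in one prompt" is feasible is a character-count heuristic: roughly four characters per token for English text and code. The ratio is tokenizer-dependent, so treat this sketch as an estimate, not a guarantee:

```python
import os

CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary by language and content

def estimate_tokens(text: str) -> int:
    """Approximate token count from character count."""
    return len(text) // CHARS_PER_TOKEN

def estimate_repo_tokens(root: str, exts=(".py", ".js", ".ts", ".go", ".md")) -> int:
    """Walk a source tree and estimate its total token count."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(exts):
                try:
                    with open(os.path.join(dirpath, name), encoding="utf-8", errors="ignore") as f:
                        total += estimate_tokens(f.read())
                except OSError:
                    continue  # skip unreadable files
    return total

# e.g. fits = estimate_repo_tokens("./my-project") < 1_000_000
```

For a real budget check, swap the heuristic for the provider's actual tokenizer; the structure stays the same.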

"Context window scaling is arguably the most practically impactful architectural change in LLMs since the transformer itself. The ability to reason over an entire codebase in one shot changes what AI agents can do."

Epoch AI Research Blog, 2024

The Benchmark Numbers, 2023–2025

By mid-2024, MMLU was effectively a solved benchmark. The flagship models from all three major labs crossed 90%. The competition shifted to evaluations that were both tougher and harder to game.

GPQA Diamond is a benchmark of PhD-level questions in biology, chemistry, and physics. The key design choice: domain experts — actual PhD students in the field — score around 65–70%. In 2025, Claude 3.7 Sonnet with Extended Thinking hit 84.8% on GPQA Diamond. Gemini 2.5 Pro reached around 91%. Models are now outperforming the people who wrote the questions on their own subject matter, which is a genuinely strange sentence to write.

For coding specifically, SWE-bench became the meaningful benchmark. It's a set of real, open GitHub issues: the model has to read the repo, understand the bug, write a patch, and pass the tests. Devin scored 13.86% in early 2024, which shocked people. Claude 3.5 Sonnet reached 49% by late 2024. By 2025, top agents were past 70%.
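A SWE-bench score is just a resolved-issue rate: an issue only counts if the generated patch applies cleanly and the issue's fail-to-pass tests go green afterward. The bookkeeping, sketched with hypothetical result records (the issue IDs below are illustrative, not real benchmark instances):

```python
from dataclasses import dataclass

@dataclass
class IssueResult:
    issue_id: str
    patch_applied: bool   # did the generated diff apply cleanly?
    tests_passed: bool    # did the issue's fail-to-pass tests pass afterward?

def resolution_rate(results: list[IssueResult]) -> float:
    """Fraction of issues where the patch both applied and passed the tests."""
    resolved = sum(1 for r in results if r.patch_applied and r.tests_passed)
    return resolved / len(results) if results else 0.0

runs = [
    IssueResult("proj-a-101", True, True),
    IssueResult("proj-b-202", True, False),   # patch applied, tests still red
    IssueResult("proj-c-303", False, False),  # diff didn't even apply
]
```

Note the two distinct failure modes: a patch that won't apply and a patch that applies but doesn't fix the bug. Both count as unresolved, which is part of why early scores were so low.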

| Model | Release | MMLU | HumanEval | MATH | GPQA Diamond | SWE-bench |
|---|---|---|---|---|---|---|
| GPT-4 | Mar 2023 | 86.4% | 67.0% | 52.9% | 35.7% | ~1.7% |
| Claude 2 | Jul 2023 | 78.5% | 71.2% | 44.6% | 32.6% | |
| GPT-4o | May 2024 | 88.7% | 90.2% | 76.6% | 53.6% | ~33% |
| Claude 3.5 Sonnet | Jun 2024 | 88.3% | 92.0% | 71.1% | 59.4% | 49% |
| Gemini 1.5 Pro | Feb 2024 | 85.9% | 84.1% | 67.7% | 46.2% | |
| o1 (OpenAI) | Sep 2024 | 92.3% | 92.4% | 94.8% | 77.3% | 48.9% |
| DeepSeek R1 | Jan 2025 | 90.8% | ~90% | 92.3% | 71.5% | 49.2% |
| Claude 3.7 Sonnet (Thinking) | Feb 2025 | | | 96.2% | 84.8% | 70.3% |
| Gemini 2.5 Pro | Mar 2025 | | | ~90% | ~91% | 63.8% |

Key benchmark scores, 2023–2025. Sources: Anthropic, OpenAI, Google DeepMind, Papers with Code, Artificial Analysis.

The Reasoning Model Shift

The architecture change that mattered most in 2024 was reasoning models — models that spend extra inference-time compute working through a problem before committing to an answer, rather than answering in a single pass. OpenAI's o1, out in September 2024, was the first one most people could use. It scored 94.8% on MATH (versus GPT-4o's 76.6%) and 77.3% on GPQA Diamond. The improvement wasn't marginal. Multi-step problems that previous models confidently fumbled, o1 would work through systematically.

o3 followed with roughly 20% fewer major errors than o1, 91.6% on AIME 2024 (the American Invitational Mathematics Exam — one of the hardest pre-college competitions in the US), and 87.7% on GPQA Diamond.

Anthropic's response was Claude 3.7 Sonnet with Extended Thinking — you can give it up to 128,000 tokens of reasoning budget before it writes its answer, and you can see the thinking. That visibility is actually useful in practice: you can read whether the model understood the problem correctly before it committed to an answer. Google released Gemini 2.0 Flash Thinking. DeepSeek R1 matched o1's reasoning performance at a fraction of the cost. Reasoning became table stakes across the whole industry in about six months.
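The reasoning budget is a per-request knob. A sketch of the request shape for extended thinking, with parameter names following Anthropic's documented Messages API (the model ID and token numbers here are illustrative; verify against the current API reference before relying on them):

```python
MAX_THINKING_BUDGET = 128_000  # per the post: up to 128K tokens of reasoning

def build_request(prompt: str, thinking_budget: int) -> dict:
    """Build a Messages API payload with an extended-thinking budget.
    `max_tokens` must exceed the thinking budget, since it covers both
    the thinking tokens and the final answer."""
    assert 0 < thinking_budget <= MAX_THINKING_BUDGET
    return {
        "model": "claude-3-7-sonnet-20250219",  # illustrative model ID
        "max_tokens": thinking_budget + 4_000,
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Prove that 2**31 - 1 is prime.", thinking_budget=32_000)
```

The visible-thinking point from the text maps to the response side: the reply interleaves thinking blocks with the final text, so you can read the reasoning before trusting the answer.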

"Reasoning models represent a qualitative shift, not just a quantitative one. They can check their own work, explore multiple solution paths, and revise conclusions — capabilities that were simply absent in 2022."

Artificial Analysis AI Benchmark Report, Q1 2025

DeepSeek Changed the Cost Narrative

Through 2022 and 2023, frontier capability was in exactly three places: OpenAI, Anthropic, and Google. Meta's LLaMA release in February 2023 started shifting that — the first open-weight model that wasn't embarrassing to use. LLaMA 2 in July 2023 was genuinely useful for commercial applications. By 2024, Llama 3 70B was well past GPT-3.5 and closing in on GPT-4-level scores.

Then DeepSeek happened. DeepSeek V3 in December 2024, DeepSeek R1 in January 2025. Both open-weight. R1 scored 92.3% on MATH — matching o1. The reported training cost was around $5.6 million. Comparable proprietary models cost hundreds of millions. Inference cost for open-source models was running around $0.83 per million tokens versus $6.03 for proprietary equivalents.

The practical effect: a Chinese lab put a frontier reasoning model in the open with full weights available, at a cost structure that made the established labs' pricing look like a rounding error. That changed what developers could afford to run in production overnight.

LMSYS Arena: What Humans Actually Prefer

Benchmarks are one signal. The LMSYS Chatbot Arena is a different one: real people compare two model outputs side-by-side without knowing which model wrote which, and vote for the better one. Millions of votes, head-to-head.

As of early 2025, the top of the arena leaderboard was tight. Google (Gemini 2.5 Pro), OpenAI (GPT-4.1, o3), Anthropic (Claude 3.7 Sonnet) — all within a narrow Elo range. What's useful about the arena is that it captures things benchmarks miss: does the response feel useful, is it the right length, does it understand what you were actually asking. The fact that arena rankings now correlate well with hard academic benchmark scores suggests those benchmarks are actually measuring something real, not just pattern-matching on eval datasets.
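Those head-to-head votes become a leaderboard via Elo-style rating updates (LMSYS has since moved to a Bradley–Terry model fit over all votes, but the classic online Elo update conveys the mechanism). A minimal sketch:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Apply one head-to-head vote: the winner gains what the loser gives up."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# An upset (the lower-rated model wins) moves ratings much more
# than an expected result does.
upset_a, upset_b = elo_update(1200.0, 1400.0, a_won=True)
```

This is why "all within a narrow Elo range" is meaningful: with millions of votes, ratings converge tightly, so even small gaps reflect a consistent human preference.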

The Cost Drop — The Number That Changes What You Can Build

GPT-4's API cost $30 per million input tokens when it launched in 2023. By early 2025, GPT-4o was at $2.50 per million — a 12× reduction in under two years. Claude 3 Haiku runs at $0.25 per million input tokens. Gemini 2.0 Flash is $0.30 per million.

That's not just cheaper API calls. It's a different set of use cases that become economically viable. Processing a large codebase on every PR, running AI-assisted test generation across a full repo, building real-time AI features in user-facing products — all of those had a cost structure that made them impractical in 2023. By 2025 they're just normal engineering decisions.
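The arithmetic behind "normal engineering decision," using the input prices quoted above (the repo size and PR volume are illustrative assumptions, and this counts input tokens only):

```python
def input_cost_usd(tokens: int, price_per_million: float) -> float:
    """Input-token cost in dollars at a given per-million-token price."""
    return tokens / 1_000_000 * price_per_million

REPO_TOKENS = 200_000    # illustrative mid-sized repo read in full
PRS_PER_MONTH = 500      # illustrative team volume

gpt4_2023 = input_cost_usd(REPO_TOKENS, 30.00) * PRS_PER_MONTH   # ~$3,000/month
gpt4o_2025 = input_cost_usd(REPO_TOKENS, 2.50) * PRS_PER_MONTH   # ~$250/month
haiku = input_cost_usd(REPO_TOKENS, 0.25) * PRS_PER_MONTH        # ~$25/month
```

At $3,000 a month the feature needs a business case; at $25 it's a line item nobody reviews. Same feature, different decision.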

"The cost of intelligence is falling faster than the cost of electricity did during electrification. That rate of change rewires what's economically viable to build."

Benedict Evans, Technology Analyst, 2024

What's Actually Left to Solve

In late 2022, the most capable available model could draft an email well and fail freshman calculus. By early 2025, reasoning models are outscoring domain experts on PhD-level subject matter questions, autonomously patching real production bugs, and working across entire codebases in context.

The remaining hard problem isn't capability on individual tasks — it's reliability. Models that ace one problem will fail unpredictably on something that looks simpler. They hallucinate citations, miss edge cases, and have no persistent sense of what they did yesterday. The current convergence point for the industry is agentic systems: models that plan, call tools, verify their own outputs, and operate over long horizons without constant human prompting. That's where the unsolved work is. Whether it gets solved in another three years or ten is genuinely unclear, but the pace of the last three years makes the optimistic timelines harder to dismiss than they used to be.