Great analysis. I’m not an AI researcher but I think there is a theoretical proof that explains exactly why the data you found is so noisy.
In their recent paper on "Learning-Curve Monotonicity," Sellke & Yin (2025) prove that learning curves are only guaranteed to be smooth and monotonic if the model is "well-specified" (i.e., the model's logic matches the data structure).
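(Rough sketch of what I take these terms to mean, glossing over the paper's exact assumptions, which I haven't checked: the learning curve is the expected test risk as a function of the training-set size $n$, and "well-specified" means the true data-generating distribution lies inside the model family.)

$$L(n) \;=\; \mathbb{E}_{S_n \sim \mathcal{D}^n}\!\left[R\big(\hat{f}_{S_n}\big)\right], \qquad R(f) \;=\; \mathbb{E}_{(x,y) \sim \mathcal{D}}\!\left[\ell\big(f(x), y\big)\right], \qquad \text{well-specified:}\;\; \mathcal{D} \in \{\, p_\theta : \theta \in \Theta \,\},$$

with "monotonic" meaning $L(n+1) \le L(n)$ for every $n$.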
If you apply their framework to your METR critique, the "schizophrenia" of current AI makes perfect sense:
Well-Specified: Math and Code are well-specified. There is a correct answer. This is why GPT-5.2 Pro was able to derive the proofs for the Sellke/Yin paper itself (as noted on page 2). In these domains, the curve is smooth and exponential.
The Agent (Mis-Specified): "Agency" (human intent, office work) is inherently mis-specified. As the paper notes (citing Viering et al. [VML19]), when a problem is mis-specified, more data doesn't guarantee better performance. You get noise, dips, and "double descent."
Your graph isn't just showing bad data collection; it’s showing the mathematical signature of a mis-specified problem.
Your comment reads to me like an honest attempt to wrestle with these questions that falls into the quagmire of LLM-aided research / brainstorming: superficially plausible takes that don't actually make sense on closer inspection. It's an easy trap!
Specific issues:
- Code/math being more "well defined" as problems makes sense, but what does that have to do with the "model's logic matching the data"? Those seem like two different questions.
- Does this mean no possible model can go beyond code/math to less well-defined domains? Then why is e.g. GPT-5 better at poetry than e.g. GPT-2 (to pick a big gap where the improvement is maximally obvious)?
- There are various intuitions about "inductive biases" in ML that are sort of a soft, vague version of "the model's structure should match the data", but this tends to be quite handwavy and not a hard-and-fast rule.
- Re the theory paper you cited: tbh I don't have the familiarity / time to parse exactly what this paper is saying, but ML theory work generally needs to make a huge number of simplifying assumptions and/or reason about toy models that are very different from, e.g., LLMs. So I'm not sure what the paper is about, but I'd be willing to bet it's not applicable here 😅
- Superficial, but you should still know: your last sentence is a very obvious pithy "it's not just X but Y" LLM-ism that made it very clear that you drafted this with an LLM ;)
I think we all still need to figure out how to best use these models to learn and brainstorm without misleading ourselves... Funnily enough, I stumbled on this post and your comment just after reading about Dwarkesh falling into a similar trap: https://open.substack.com/pub/vishalblog/p/vibethinking-as-bullshit?utm_source=share&utm_medium=android&r=d3y1h
It makes me wonder how I can avoid similar issues when learning / thinking with LLM help about new domains where I'm not an expert... It's clearly very useful a lot of the time, but it's hard to calibrate yourself :/
this was a very useful intuition check for me
Thanks for the write-up.
Very good write-up, thank you!