Great analysis. I’m not an AI researcher but I think there is a theoretical proof that explains exactly why the data you found is so noisy.
In their recent paper on "Learning-Curve Monotonicity," Sellke & Yin (2025) prove that learning curves are only guaranteed to be smooth and monotonic if the model is "well-specified" (i.e., the model's logic matches the data structure).
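(Rough sketch of what I take these terms to mean, glossing over the paper's exact assumptions, which I haven't checked: the learning curve is the expected test risk as a function of the training-set size $n$, and "well-specified" means the true data-generating distribution lies inside the model family.)

$$L(n) \;=\; \mathbb{E}_{S_n \sim \mathcal{D}^n}\!\left[R\big(\hat{f}_{S_n}\big)\right], \qquad R(f) \;=\; \mathbb{E}_{(x,y) \sim \mathcal{D}}\!\left[\ell\big(f(x), y\big)\right], \qquad \text{well-specified:}\;\; \mathcal{D} \in \{\, p_\theta : \theta \in \Theta \,\},$$

with "monotonic" meaning $L(n+1) \le L(n)$ for every $n$.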
If you apply their framework to your METR critique, the "schizophrenia" of current AI makes perfect sense:
Well-Specified: Math and Code are well-specified. There is a correct answer. This is why GPT-5.2 Pro was able to derive the proofs for the Sellke/Yin paper itself (as noted on page 2). In these domains, the curve is smooth and exponential.
The Agent (Mis-Specified): "Agency" (human intent, office work) is inherently mis-specified. As the paper notes (citing Viering et al. [VML19]), when a problem is mis-specified, more data doesn't guarantee better performance. You get noise, dips, and "double descent."
Your graph isn't just showing bad data collection; it’s showing the mathematical signature of a mis-specified problem.
Your comment reads to me like an honest attempt to wrestle with these questions that falls into the quagmire of LLM-aided research / brainstorming: superficially plausible takes that don't actually make sense on closer inspection. It's an easy trap!
Specific issues:
- Code/math being more "well defined" as problems makes sense, but what does that have to do with the "model's logic matching the data"? Those seem like two different questions.
- Does this mean no possible model can go beyond code/math to less well-defined domains? Then why is e.g. GPT-5 better at poetry than e.g. GPT-2 (to pick a big gap where the improvement is maximally obvious)?
- There are various intuitions about "inductive biases" in ML that are sort of a soft, vague version of "the model's structure should match the data", but this tends to be quite handwavy and not a hard-and-fast rule.
- Re the theory paper you cited: tbh I don't have the familiarity / time to parse exactly what this paper is saying, but ML theory work generally needs to make a huge number of simplifying assumptions and/or reason about toy models that are very different from, e.g., LLMs. So I'm not sure what the paper is about, but I'd be willing to bet it's not applicable here 😅
- Superficial, but you should still know: your last sentence is a very obvious pithy "it's not just X but Y" LLM-ism that made it very clear that you drafted this with an LLM ;)
I think we all still need to figure out how to best use these models to learn and brainstorm without misleading ourselves... Funnily enough, I stumbled on this post and your comment just after reading about Dwarkesh falling into a similar trap: https://open.substack.com/pub/vishalblog/p/vibethinking-as-bullshit?utm_source=share&utm_medium=android&r=d3y1h
It makes me wonder how I can avoid similar issues when learning / thinking with LLM help about new domains where I'm not an expert... It's clearly very useful a lot of the time, but it's hard to calibrate yourself :/
this was a very useful intuition check for me
Thanks for the write-up.
Very good write-up, thank you!