Shashwat’s Substack

Does "generalization" generalize?

Shashwat Goel — Sat, 27 Dec 2025 03:35:15 GMT

Every time people talk about “generalization” in LLMs, they mean something different. This often leads to subsequent debate. Why?

The problem with generalization is that the word doesn't generalize.

We've long been beyond the traditional statistical understanding of generalizing to new samples from the "same distribution". Today, you can come up with a completely novel phrasing, or a misspelling of the query, and LLMs will still respond correctly.

This means, for a benchmark, it's no longer enough to prevent contamination of the exact input/output pair.

In fact, with better representations, RL, and some human ingenuity in deploying capital to create targeted data, it's now enough to know a description of the distribution to optimize. We can even get pretty far with LLMs combining and transforming existing data into the target distribution.

This is what people observe as "benchmaxxing". The issue is, even the set of describable distributions explodes combinatorially. Everyone wants something slightly different, our wants keep evolving, and to this the models don't necessarily generalize. This is probably what Ilya meant when he said "models still can't generalize" on Dwarkesh’s podcast.

It is why, I think, static benchmarks, ironically even “live” ones, are dead. Unless the distribution keeps drifting over time, unpredictably, any benchmark can now be hillclimbed with relative ease.

Yet, this is not enough for AGI. Why? Because the world keeps changing. In fact, every time a model becomes capable on a new distribution, we want to use it for a new set of problems, which exposes new holes. Progress in AI capabilities will continue to extend our imagination, and this is a testament to human ingenuity.

So, what should we evaluate in 2026? For one, we need new ways to measure sample-efficient adaptation by learning from interactions, what some put under the general umbrella of "continual learning". We want models to perform well on the "novel" situations we find ourselves in, which may be off the training distribution in subtle ways.

The promising sign is, for a lot of user queries off the training distribution, such as in coding, models do generalize, especially if the queries are a combination of seen distributions. Here's the crux:

We need to stop viewing generalization, or "out of distribution" as a discrete, static concept. Otherwise the words will keep changing meaning based on context.

Generalization is a continuous spectrum: how far you can correctly extrapolate given what you've seen.

Humans do it to differing extents, depending on how much time evolution spent optimizing on the environment. We generalize extremely well to changes in our physical environment. Novel math, the environment of abstract symbols, has long been considered the pinnacle of human intelligence. Generalization is a function of optimization in an environment.

The optimization could even be at “test time”. If you are capable of self-verification, or in other words, have a good “world model” or “value function”, “thinking” is also optimization. We discover simpler, more general principles the more we think about a problem.

Similarly, for models, they generalize to different extents based on what you ask, how far it is from what the model has seen in training, and how much the model was optimized for nearby distributions. For the rest, one day, they will be able to learn fast from interactions. That would be true general intelligence, and would fill in the “jagged frontier”. They don't have to make discoveries in “quantum gravity” to get there. Neither have you.

How to game the METR plot

Shashwat Goel — Sat, 20 Dec 2025 13:06:20 GMT

TL;DR: In 2025, we were in the 1-4 hour range on the METR plot, a range in which the benchmark has only 14 samples. The topic of each sample is public, making it easy to game METR horizon length measurements for a frontier lab, sometimes inadvertently. Finally, the “horizon length” under METR’s assumptions might be adding little information beyond benchmark accuracy. None of this is to criticize METR; it’s not possible to be perfect in a first release. But I’m tired of what is being inferred from this plot, pls stop!

14 prompts ruled AI discourse in 2025

The METR horizon length plot was an excellent idea: it proposed measuring the length of tasks models can complete (in terms of estimated human hours needed) instead of accuracy. I'm glad it shifted the community toward caring about long-horizon tasks. They are a better measure of automation impacts, and economic outcomes (for example, labor laws are often based on number of hours of work).

However, I think we are overindexing on it far too much, especially the AI Safety community, which makes huge updates to timelines and research priorities based on it. I suspect (from many anecdotes, including roon’s) the METR plot has influenced significant investment decisions, but I’ve not been in any boardrooms.

2 popular AI safety researchers making massive updates based on the Claude 4.5 Opus result today, 200+ likes, within 6 hours.

Here is the problem with this. In 2025, according to this plot, frontier AI progress occurred in the regime of horizon lengths between 1 and 4 hours.

Guess how many samples have 1-4hr estimated task lengths in the METR data?

Just 14. How do we know? Kudos to the authors, the paper has this information, and they transparently provide task metadata.

Figure 14 of their paper. 14 tasks in the 1-4 hr range. Illuminati confirmed?

Hopefully, for many, this alone rings alarm bells. Under no circumstance should we be making such large inferences about AGI timelines, US vs China, Closed vs Open model progress, research priorities, individual model quality etc. based on just 14 samples. An early sign of this problem was there when the original METR paper was released in March 2025. The best performing model at the time, Claude 3.7 Sonnet, was estimated to have a horizon length of 59 mins. Now see its success rate distribution over task lengths:

Notice how the model has almost a 60 ± 15% probability of success on 1-2hr tasks. So why is the estimated 50% success horizon length 59 minutes?! Because it doesn’t get anything right in the 2-4 hr range. METR calculates the horizon length by fitting a logistic curve to individual sample outcomes, like the dark purple line above. Notice how 0% on the 2-4hr range leads to a very bad logistic fit (the curve is below the 95% confidence interval for 0.5-1hr, and 1hr-2hrs range). We’ll come to my skepticism arising from the core modelling assumption, of using a logistic curve, later. My suspicion is Claude 3.7 Sonnet has 0% success in the 2-4hr range because they only had 6 samples for that range, most of which were from cybersecurity capture the flag contests. Cyber is considered a dual-use, safety hazard capability in WMDP, which labs were careful about in early 2025. Remember, this is Anthropic.
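To get a feel for how little six or fourteen samples can pin down, here is a minimal sketch of a Wilson score interval for a success rate. The counts used (0 of 6, 7 of 14) are illustrative stand-ins, not METR's per-task results, and `wilson_interval` is a helper name made up for this example.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Illustrative counts only (assumptions, not METR's data).
low_6, high_6 = wilson_interval(0, 6)     # nothing solved out of 6 tasks
low_14, high_14 = wilson_interval(7, 14)  # half solved out of 14 tasks

print(f"0/6 solved:  [{low_6:.2f}, {high_6:.2f}]")
print(f"7/14 solved: [{low_14:.2f}, {high_14:.2f}]")
```

Under these assumptions, zero successes out of six is still compatible with a true success rate of up to roughly 39%, and 7 out of 14 spans roughly 27% to 73%.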

To improve METR horizon length, train on cybersecurity contests

I promised you there’s a way to game the horizon length on the METR eval. Here’s how. The samples in the 1 minute to 16 hour range mostly come from HCAST. It turns out HCAST transparently tells us what each of these tasks is about.

HCAST 1.5-3.5 hour Task Descriptions. The paper’s Appendix D has a description of each task, sorted by estimated time taken.

Why is this a big deal? Well, if you know what topic you want to improve performance on, it’s not that hard to do so. You can create targeted synthetic data, or just hire vendors like Scale, Mercor and Surge to upsample such tasks in your post-training mix. If you notice, most of the tasks in this range are Cybersecurity CTFs, and MLE tasks. OpenAI has been explicit about specifically targeting these capabilities for recent Codex models:

Now, I’m not saying the labs are training on MLE and Cybersecurity data to game the METR plot. They probably have other incentives to improve on them. But this is precisely why the METR plot is unlikely to generalize, it measures exactly what US labs are focusing on! If Kimi, or DeepSeek, want to shoot past, they can just collect a lot of ML-Training and Cybersecurity prompts, and finetune on them.

Note that given there are only 14 samples in the relevant task length range, getting even 1 or 2 extra samples right significantly increases horizon length! It probably increases even more if you get the longer tasks (8h+, from RE-Bench) right, by luck or overfitting, as today’s Claude 4.5 Opus result showed us. In fact, perhaps because Anthropic doesn’t want to risk training on cybersecurity, we still have low accuracy in the 2-4hr range?

HCAST Accuracy alone predicts log-linear trend in METR Horizon Lengths

Finally, let’s look at how METR estimates 50% success horizon length. They assume a logistic relation between the probability of success and the gap between the horizon length (the estimated variable) and the task length:
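Written out, one form consistent with this description is the following. Measuring the gap in log task length, and the choice of base 2, are assumptions here; t denotes a task's estimated human completion time and σ is the logistic function.

$$
p_{\text{success}}(t) = \sigma\big(\beta\,(\log_2 h - \log_2 t)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

At t = h the gap is zero, so the predicted success probability is exactly 50%.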

You infer h (the 50% horizon length) by fitting the 0/1 success evaluation data of each task. β is also a learnt parameter, governing the slope of how fast the logistic function falls from 1 to 0.

I think once you assume a logistic function, it’s almost guaranteed that if a new model solves one additional task, it’s going to continue the log-linear trend. Remember that METR also only adds new models to the plot when they think those models are likely to push the frontier. Coupled with measuring on a task distribution that model developers are actively trying to improve on, I think the log-linear trend, or X month doubling period, pops out almost tautologically from the logistic fit assumption.
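To make the sensitivity concrete, here is a minimal sketch that fits a two-parameter logistic model by plain gradient ascent and then flips a single failed task to a success. The task lengths and 0/1 outcomes are made up for illustration (they are not METR's data), `fit_horizon` is a hypothetical helper, and working in log base 2 of minutes is an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_horizon(task_minutes, outcomes, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(beta * (log2(h) - log2(t))) by gradient
    ascent on the mean log-likelihood. Returns (h_minutes, beta)."""
    xs = [math.log2(t) for t in task_minutes]
    mean_x = sum(xs) / len(xs)
    xc = [x - mean_x for x in xs]      # centering improves conditioning
    a, b = 0.0, 0.0                    # logit = a + b * (x - mean_x)
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xc, outcomes):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    beta = -b                          # slope against log task length
    log2_h = mean_x + a / beta         # where the logit crosses zero
    return 2 ** log2_h, beta

# Hypothetical task lengths (minutes) and outcomes, for illustration only.
TASK_MINUTES = [1, 2, 4, 8, 15, 30, 60, 90, 120, 180, 240, 480]
BASE =         [1, 1, 1, 1, 1,  0,  1,  1,  0,   1,   0,   0]
ONE_MORE =     [1, 1, 1, 1, 1,  0,  1,  1,  1,   1,   0,   0]  # 120-min task now solved

h_base, beta_base = fit_horizon(TASK_MINUTES, BASE)
h_more, beta_more = fit_horizon(TASK_MINUTES, ONE_MORE)
print(f"baseline:        h ~ {h_base:.0f} min (beta ~ {beta_base:.2f})")
print(f"one more solved: h ~ {h_more:.0f} min (beta ~ {beta_more:.2f})")
```

On this toy data, a single extra solved task moves the fitted 50% horizon upward; how large the jump is depends entirely on the made-up outcomes, so treat it as a demonstration of the mechanism rather than an estimate.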

For example, I tried deriving the horizon length from JUST their reported accuracy, without looking at individual sample evaluations at all. Remember how the main contribution of the METR plot was shifting focus from aggregate accuracy to horizon lengths? Well it turns out, if you use the aggregate accuracy, and the task length distribution, and fit the logistic function to estimate horizon length assuming even a constant β=0.7, you recover the log-linear trend:

This means, if you had access to just the aggregate accuracy on HCAST, you could estimate the horizon length without knowing which samples the model gets right or wrong. It could be wrong on the short ones, and right on the long ones, for all you care.
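A minimal sketch of that back-of-the-envelope procedure: fix β, then search for the h at which the model's average predicted success over the task lengths equals the reported accuracy. The task lengths below are hypothetical placeholders rather than the HCAST distribution, the log base 2 convention is an assumption, and `horizon_from_accuracy` is a name invented for this example.

```python
import math

def predicted_accuracy(log2_h, task_minutes, beta):
    """Mean success probability over tasks under the logistic model."""
    total = 0.0
    for t in task_minutes:
        total += 1.0 / (1.0 + math.exp(-beta * (log2_h - math.log2(t))))
    return total / len(task_minutes)

def horizon_from_accuracy(accuracy, task_minutes, beta=0.7):
    """Bisect on log2(h) until predicted accuracy matches the observed one."""
    if not 0.0 < accuracy < 1.0:
        raise ValueError("accuracy must be strictly between 0 and 1")
    lo, hi = -20.0, 40.0               # search range for log2(h)
    for _ in range(100):
        mid = (lo + hi) / 2
        if predicted_accuracy(mid, task_minutes, beta) < accuracy:
            lo = mid                   # predicted too low: horizon must be longer
        else:
            hi = mid
    return 2 ** ((lo + hi) / 2)

# Hypothetical task lengths in minutes (not the real benchmark).
TASK_MINUTES = [1, 2, 4, 8, 15, 30, 60, 90, 120, 180, 240, 480]

for acc in (0.4, 0.5, 0.6, 0.7):
    h = horizon_from_accuracy(acc, TASK_MINUTES)
    print(f"accuracy {acc:.0%} -> horizon ~ {h:.0f} min")
```

Nothing in this calculation looks at which tasks were solved, only how many; that is the sense in which the horizon length can be read off aggregate accuracy.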

Now presumably this logistic fit assumption arises from an earlier plot in the paper, claiming model success rates go down linearly with doubling in task length. I have a qualm with this plot too:

Notice how the log-linear fit here only looks good for the SWAA data, in the 1 sec - 1 min range. There’s something completely different going on for tasks longer than 1 minute, clearly not explained by the log-linear fit. If you tried to make a best fit line on the blue points (the length of tasks we care about after 2024), you’d get a very different, almost vertical line, with a very low R^2. I don’t know how load-bearing this is on the use of a logistic function to fit p(success) vs task length when estimating horizon lengths.

I am not a statistician, so I am uncertain about this final part of the analysis. I don’t know what it implies. I don’t know how problematic it is to assume a logistic function for this data. It’s hard to say for sure, as, for subsequent models after Claude 3.7 Sonnet, they didn’t release data to compute the success rate vs task length distribution. I invite people more experienced than me in statistics to look into this, because it seems a bit suspicious.

Overall, I wish we had more, and more robust, measurements for model horizon lengths. I think it is a much more meaningful metric than accuracy. Heck, I even wrote a paper on this topic. I applaud METR for turning my, and many others’ attention towards this. But the way people are misinterpreting, and making wild inferences from the headline horizon length numbers METR puts out every month, worries me. If we are staking investment decisions, and research priorities based on an evaluation, it needs to be really robust. And making robust long-horizon benchmarks is hard, expensive, and uncharted territory. I hope METR plot v2 rises to the challenge!

I thank Sumeet Motwani, Ameya Prabhu, Arvindh Arun, and Akshit Sinha for feedback on this post. I appreciate folks at METR recognizing the value of these critiques when I tweeted about them.

The Research Process

Shashwat Goel — Mon, 24 Nov 2025 16:21:18 GMT

I will occasionally update this post with reflections on important questions about the research process, as they come.

How to choose research bets?

The creation of this post was spurred by a surprisingly high number of responses to a question I tweeted about. The responses seem too useful to lose to the broken X search algorithm (which Google search / LLMs may not be able to index well).

The question:

How do PhD students / researchers manage the sinking feeling of having a growing bucketlist of interesting ideas/directions but not enough time to try any of them?

How do you select when multiple seem promising?

At the time of copying this here, there are 48 comments. There was obviously no single answer, which is precisely why it’s helpful to look at the whole set of comments: https://x.com/ShashwatGoel7/status/1992606250698408155?s=20

I’m grateful to everyone who took time out from pursuing their research ideas to leave advice on my question. For the sake of those who don’t want to click and scroll through 50 comments, I’ll also paste my top 5 favourites (but different people would value different advice!):

by remembering that the hypothesis space is in fact infinite and that there is no way to possibly accomplish all of it; to instead look inwards towards my own desires and goals; to look outwards at the change i want to see in the world. knowing it is hopeless to try doing it all. - Glenn Matlin

You could work with coauthors and do a little bit more that way. But you kind of need your own lab with students to scale much more. You could try to recruit smart undergrads though. I suppose if you don’t care as much about author credit you could post the ideas and any student looking for stuff to do could run with them. Shashwat’s 17th problem. - Neal Parikh

many people suggested writing down, and tips on prioritization:

Write them down in a big list of one-off ideas and cluster them over time. If the same idea keeps coming up in different forms, there’s probably something larger interesting there! - William Merrill

By elimination. For example, you can rank them by expected impact, novelty, feasibility given the time frame and resources, personal obsession (i.e., would you regret not doing it), potential for follow-up publications, and whether the background or technique learned has general applicability for other research directions. LLMs made it possible now to brain storm the ideas/criteria with them, and even use their help to rank them. But you need to use good prompts to get objective results. - Mohammad Alfiky

and finally, the most liked one :)

One option is to just roll a 25 sided die.

Sometimes people get stuck by the tyranny of choice and it can be better to just pick randomly to offload that pressure. - Phillip Isola

Scientific discovery as a training environment for Superintelligence

Shashwat Goel — Wed, 19 Nov 2025 23:50:05 GMT

I believe finding the right training data and environments to spend training compute on is the key driver of AI progress from here on.

We know from the BERT to GPT transition that scaling up training data leads to generalizable capabilities far beyond task-specialized finetuning. Instead of the ongoing trend of paying billions to curate fixed narrow environments to specialize models, what general environment could be a successor to internet-scale pretraining? In this post, I make a case for automated research, or general scientific discovery. I argue that training models to be better researchers shall incentivize many of the capabilities missing in 2025’s language models that may bridge the gap to superintelligence. Many think scientific discovery is just an impactful application of superintelligence. I have not seen anyone publicly state the causality in the other direction.

This is NOT the AI 2027 argument that a superhuman AI researcher will lead to an “intelligence explosion”. Instead, I focus on how training AI for scientific discovery sets us up to build the missing capabilities in the LLMs of today. Internet-scale pretraining provided LLMs humanity’s collective knowledge. Post-training for scientific discovery will teach them how to acquire, and create knowledge. This can start with training models as better co-scientists, and slowly build towards executing end-to-end experiments. And by “training”, I don’t necessarily mean only “weights”. I mean evolving the entire AI system, which may include the weights.

Concretely, scientific discovery demands:

Coherent long-horizon planning and execution
Continual adaptation to build on new findings
Reasoning about uncertainty
Sample-efficient learning
Curiosity and open-ended exploration

These are the key capabilities today’s models lack. Optimizing for scientific discovery would incentivize all of them. Why not any other environment? I do think that many other real-world decision making processes involve similar skills.

Yet, scientific discovery has a unique set of properties ideal for training:

Large-scale, open data
Verifiability
Truth-seeking

Capabilities Instrumental to an AI Scientist

Deep Learning, at sufficient scale and diversity, can lead to the emergence of capabilities that are necessary for optimizing any objective or environment.

Scientific discovery requires coherent long-horizon planning, execution, and reasoning about environment feedback. Carrying out scientific projects can take expert humans anywhere from months to many years. There are many experiments to plan, and while some work out as planned, others fail. Scientific ideas require constant refinement based on experimental results, and careful execution to ensure there are no confounders. They require navigating and processing large amounts of data, which needs memory beyond what current context limits (even with context management) allow.

Scientific discovery requires continual adaptation, iterating on top of new scientific discoveries. Crucially, scientific discovery is not about solving problems that are known to be solvable, which current AI benchmarks test. Instead, it pushes the frontier of what’s solvable, by leveraging recent breakthroughs. Today’s models are good at finding and combining past knowledge, but learn little “on the job” from others’ and their own experiments.

Scientific discovery requires reasoning about uncertainty, to solve open-ended problems. Today we know how to make models almost superhuman at well-specified problems (e.g. OpenAI’s system solved all competitive programming problems at ICPC 2025). But models remain underwhelming on tasks that are not well specified, and require gathering more information.

Scientific discovery requires sample efficiency: in both information and skill acquisition. Science is always budget constrained in both time and resources. Large-scale scientific experiments can sometimes take months to run. There isn’t much room for trial-and-error. Even if we put resource constraints in running scientific experiments aside, mindless trial and error would lead to most findings being false. As a result, scientists have to be extremely efficient with how they acquire information, and extrapolate from it. In comparison, training via gradient descent would be way too inefficient. While today’s models show some efficiency in-context, anecdotally, whenever I discuss research with models¹, they respond with way too many, far too inefficient experiments to test any hypothesis.

Scientific discovery incentivizes curiosity, open-ended exploration, and creativity. Breakthroughs in science often come from unexpected places. Doing obvious, incremental improvements is not always optimal. Instead, one has to sometimes ask questions no one has asked before, and yet turn out to be important in the long run. The reward functions, and conviction, are often intrinsic, based on introspection. External rewards (e.g. whether the research was impactful) only arrive after a long time, and are noisy. Self-verification, course correction, and epistemic humility become essential.

Why specifically Science?

I think there are a few key reasons that make scientific discovery uniquely promising as a scalable environment to train ASI:

Large-scale data is openly available. The corpus of scientific literature available on the internet is extremely large, spans diverse domains, and its potential is largely untapped for AI training². The scientific method is surprisingly general, precisely because it is only constrained by verifiability…
Verifiability is the foundation of science. For many real-world decisions, we cannot know how things might’ve gone if we did things differently. Science is all about designing experiments to test such counterfactuals. Moreover, in good science, there is always a generator-verifier gap. It can take months or years to arrive at a scientific result, but once something is understood, it becomes easy to explain, verify, and reproduce results for others.
Science prioritizes integrity and truth-seeking. In contrast, much of real-world decision making is often power-seeking instead (e.g. in corporations or politics). While there are other complex goals like maximizing profit that would require learning the skills mentioned above, optimizing these could get quite harmful. While science too can sometimes be unethical, or dual-use, norms around this are well established and decently enforced in the community.

Challenges

Of course, all this is easier said than done. Getting models to carry out end-to-end science, or even help human researchers, requires solving many technical challenges. While scientific breakthroughs in narrow environments like AlphaFold have been transformative, to be a successor of internet-scale pretraining, we will need environments that enable general scientific reasoning. It remains unclear how to convert our vast body of scientific literature into environments one can train language model (agents) on. Most science cannot even be performed with digital simulations, requiring human studies or wet labs. Even for what can be digitally simulated (e.g. AI research), the compute and time needed for training AI scientists would be many orders of magnitude more than what we have built today (yes, the infra investments might not be a bubble). This is because science is a long-horizon task, and verification signals based on the outcome of an experiment can take a long time to arrive. Our learning algorithms, and AI system architectures, will have to be made much more suitable for long-horizon tasks (outcome rewards won’t be enough, and memory, continual learning will be essential). Overall, I have little clarity on what the training loop even looks like. But I think once shown a north-star, between gradient and graduate student descent, Deep Learning always finds a way…

Footnotes

Thanks to Maksym Andriushchenko and Ameya Prabhu for providing feedback on a draft of this blogpost.

¹I have recently spent a lot of time looking at model generated experiment plans. LLMs propose throwing the kitchen sink at the problem, and then some more…

²I say so because training to predict the next token of a scientific paper just cannot extract its value. Papers are written in a weird way where all the important insights are spoiled in the beginning, between the Abstract and Introduction. This reduces the ingenuity required to predict subsequent tokens, and hence also the learning potential. Besides, a paper compresses away all the iteration that goes into the scientific process, so it has not that much to teach through imitation. Science is learnt by thinking and doing.

³Why do I think this post was worth writing? In part, it helps make sense of where frontier labs might be going. For example, in the last few months, OpenAI announced it is focusing on creating automated researchers by 2028, while startups like Periodic Labs, and Edison were launched to create AI Scientists. And of course, DeepMind is the OG in AI for scientific discovery.

what if things went well?

Shashwat Goel — Mon, 13 Oct 2025 23:33:27 GMT

Why is most sci-fi pessimistic?

Growing up, I never quite liked shows about technology. Black Mirror? I absolutely hated whatever parts of it I was forced to watch. This was weird. I was supposed to be a nerd. And I surrounded myself with tech-obsessed people pretty early. So why didn’t I like sci-fi? People kept recommending new tech shows, and I either kept leaving them mid-way with a bitter taste, or eventually, just ignoring them.

A decade later, in the midst of what people now like to call the “AI bubble”, I think I have finally realized. People like being pessimistic about technology. For example, we’ve made incredible progress in AI in the last few years. And yet, instead of appreciating it, forget contributing to it, it seems like the whole world wants AI to be a bubble. Heck, they might even will it into “being” one, at least temporarily.

It then makes sense why most popular “sci-fi” has a dystopian tinge. People, it seems, don’t want things to go well. So we get more pessimistic “tech” shows. This feeds into people being more skeptical about technology. And so it cycles.

On the other hand, since I was a kid, I’ve always liked technology. There are so many problems out there. How else will we solve them? We need more of it. Soon. At least, that’s what I concluded. The only show about “tech” I truly loved, and couldn’t stop re-watching, was Silicon Valley. Ironically, it felt most real, despite being a “caricature”.

There are clearly techno-optimists out there. We would pay so much for more sci-fi that, while rooted in realism, paints a positive picture. While we might be a minority, making us unworthy of Netflix’s attention, there’s surely enough of a critical mass for someone to take this up. We need to break the wheel.

What surprises me is that technology companies haven’t produced such media. Why hasn’t Google funded a movie on what 2040 would look like, if things went well? I hope OpenAI does this when they use Sora to collaborate with Hollywood. Obviously, I am acutely aware of the risks of companies “influencing public opinion”, and believe it or not, technology at large. But right now, I daresay optimism is undersubscribed. Plus, if we can get people to imagine what a good future could be, we might inspire them to build it, which only increases the chance of things going well.

I’m not a writer, so for now, I’ll just use the amazing language models they’ve created:

Imagine a realistic “science fiction” scenario, where technology changes things for the better. Discuss how people’s lives change with concrete examples. Trace everything from micro to macro impacts. Write as an easy to read, part of a blog.

Here’s GPT-5-Pro: https://chatgpt.com/share/68ed7ede-4178-800f-a27f-ddf9b82b4060

Claude 4.5 Sonnet: https://claude.ai/share/1017dd04-58ef-44cb-bb75-14ad60818334a

Unfortunately, Gemini 2.5 Pro doesn’t allow me to share a link. But here’s a snippet:

Imagine Anya, a landscape architect in 2075. Her life is subtly and profoundly improved by her “Digital Custodian” (DC), which she calls “Elmsworth.”
Morning & Health: Anya wakes up not to an alarm, but when Elmsworth detects her sleep cycle has reached its natural conclusion. The lights in her apartment have slowly brightened to mimic a sunrise. In the kitchen, the nutrient printer has prepared a breakfast shake tailored to her body’s needs, which Elmsworth identified overnight via non-invasive sensors in her mattress. It detected slightly low magnesium levels and a need for more complex carbohydrates for her planned afternoon run. There’s no “dieting”; there’s just perfect, automated nutritional balancing.
Work & Creativity: Anya’s job isn’t about tedious drafting or soil analysis anymore. She interfaces with the city’s public works DC to design a new park. She describes her vision in broad strokes: “a space that encourages quiet reflection but can also handle weekend crowds, using native, drought-resistant plants.” Elmsworth, along with the city’s AI, instantly generates dozens of viable 3D models, complete with ecological impact reports, irrigation plans, and budget forecasts. Anya’s role is now purely creative and strategic. She spends her day refining the feel and purpose of the space, while the AI handles the immense technical workload. This frees up 90% of her time, which she can use for professional development or personal projects.
Learning & Leisure: After work, Anya decides to learn a traditional Japanese woodworking technique. Elmsworth curates a personalized lesson plan, using augmented reality overlays to guide her hands and show her exactly where to make the cuts on a piece of reclaimed timber. It adapts the difficulty in real-time based on her success rate. She’s not just watching a video; she’s receiving one-on-one tutelage from the world’s best virtual carpenter.

All the models’ outputs involve reliable digital personal assistants, traffic optimization, better medical devices and more. Quite sane. And that’s the point! Even if nothing special happens, we could be headed toward a much better world. I would love swapping with the fictional characters the models imagine, living chill, comfortable lives.

And now, let’s ask GPT-5 Pro to create a “realistic” timeline. Note how the prompt is neutrally phrased. I just wanted to see what comes up by default.

Write a realistic perspective on how technology will progress. Go year by year, starting from 2026, to 2040, listing changes in a normal person’s life.

GPT-5 Pro: https://chatgpt.com/share/68ed7b91-37dc-800f-b552-86cbef466a61

Thankfully, it’s cautiously optimistic! I half-expected an aggregate of human sci-fi, highlighting what technology breaks in society. Instead, in 2040, everyday errands shrink with AI agents, driverless ride-hail becomes common, city infrastructure becomes more coordinated, misinformation reduces, and work culture optimizes outcomes over hours. Yet, I could totally imagine these outcomes as early as 2030…

So finally, a cooler prompt:

Imagine technological progress continues to be as fast as AI progress was between 2022 to 2025. How would the average person’s life change. Give a timeline from 2025 to 2040. Be creative, but realistic.

GPT-5 Pro: https://chatgpt.com/share/68ed80f5-811c-800f-8ff3-10839270430a

This one honestly feels closer to my expectations, but still too slow. What do you mean household robots do laundry only in 2033?! I wish OpenAI trained their models to be as imaginative, and ambitious as themselves… On the other hand, I like how it thinks Government services will have response times in seconds by 2032.

So finally, here’s the illustration of Technology 2040, imagined by GPT-5 Pro, and illustrated by Nanobanana🍌, in the style of a Renaissance grand mural.

Not all bits are made equal

Shashwat Goel — Wed, 08 Oct 2025 02:57:46 GMT

Some bits of information matter more than others. This observation is simple, yet has major implications for both exploration and learning. In humans, and AI.

Take the simple example of trying to estimate a large number. If it’s a number in the hundreds, it only matters if you got the hundreds place right. And the hundreds place matters millions of times less as the number nears a billion.
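The arithmetic behind that claim can be checked in a few lines. The sketch below compares the relative error of being off by one in the hundreds place for two example numbers; the specific values (500 and 900,000,000) and the helper `relative_error` are chosen here purely for illustration.

```python
def relative_error(number, place, digit_error=1):
    """Relative error from being off by `digit_error` at the 10**place digit."""
    return digit_error * 10 ** place / number

small = relative_error(500, place=2)           # a number in the hundreds
large = relative_error(900_000_000, place=2)   # a number nearing a billion

print(f"hundreds digit off by one: {small:.2f} vs {large:.1e}")
print(f"ratio: about {small / large:,.0f}x")
```

For these two examples the same slip in the hundreds place matters about 1.8 million times less for the larger number.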

And yet, we worry too much about the number of bits of information. This is too linear a view. When it comes to real-world uncertainty, the higher order bits matter exponentially more. A detailed history of Donald Trump’s favourite car make, that’s thousands of bits. Who cares? Is he the US President? Now that one bit can change the world.

Implication for Making Decisions

For some time, I’ve been trying to use this principle to guide my research. In research, there are very many interesting questions—an endless list of phenomena yet to be explained. However, some questions are more important than others. These are the higher order bits. Given you, my reader, have finite time, these are the ones you should desperately prioritize. Don’t buy it from me? Take it from Richard Hamming. You and Your Research is perhaps the most important lecture a researcher can watch.

“If you do not work on an important problem, it’s unlikely you’ll do important work.”

In a funny coincidence, the “Hamming Distance” itself measures the opposite, which interestingly brings me to… AI.

Implication for AI

Consider Imitation vs Reinforcement Learning (RL)—perhaps the biggest question in AI right now. LLM RL learns from scalar rewards, a small number of bits of information in comparison to Supervised Finetuning (SFT), which imitates full sentences. On the surface, the latter is far more “information-dense”. Indeed, John Schulman’s excellent new blogpost shows we need to change a small number of bits in an LLM to achieve the huge improvements we’ve seen from RL. Why could this be? Does this mean RL is less important for LLM learning?

No. You see, SFT, in a way, minimizes the Hamming distance to a reference answer. For SFT, each bit of difference matters equally. Potato, potahto? For SFT, it means the world. On the other hand, RL? It only cares about whether you succeeded or not, no matter how. That’s the highest order bit, the important one. The one that SFT would be willing to sacrifice to correctly imitate speling¹.

And that, my friends, is the power of RL over imitation in the limit. It supervises the highest order bit, and saves you the cost of collecting the less important ones.
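A toy sketch of the contrast, under a loud assumption: real SFT minimizes token-level cross-entropy, not a literal Hamming distance, so this is only the loose analogy used above. The token lists and the `outcome_reward` helper are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def outcome_reward(tokens, correct_answer):
    """Binary reward: 1 if the final token is the right answer, else 0."""
    return 1 if tokens[-1] == correct_answer else 0


reference = ["the", "answer", "is", "42"]
near_miss = ["the", "answer", "is", "41"]   # imitates the phrasing, wrong result
paraphrase = ["so", "we", "get", "42"]      # different phrasing, right result

for name, candidate in [("near_miss", near_miss), ("paraphrase", paraphrase)]:
    print(name,
          "| hamming:", hamming_distance(reference, candidate),
          "| reward:", outcome_reward(candidate, "42"))
```

The imitation-style metric prefers the near miss (distance 1) over the paraphrase (distance 3), while the outcome reward ranks them the other way around.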

¹ Of course, SFT is still important when you *do* care about all bits, such as when gathering “knowledge”. In that case, RL may be too inefficient. Contrary to what this post may have you believe, I am not an RL “Maximalist”.

It is time to move on from MCQ evaluations

Shashwat Goel — Sun, 06 Jul 2025 12:02:04 GMT

New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations.

TLDR: Using MCQs for AI benchmarking is problematic--you can guess the answer without even looking at the question (in multimodal MCQ datasets, without the image!). We knew this, but there didn't seem to be any alternative. We now show that language models are good enough to provide one: using small open-source models to match generative responses to a ground-truth reference answer works much better, and turns out to be cheaper than MCQ evals!

Discriminative Shortcuts in MCQ

We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer and get high accuracies. This affects popular benchmarks like MMLU-Pro, SuperGPQA etc. and even "multimodal" benchmarks like MMMU-Pro, which can be solved without even looking at the image.

Such choice-only shortcuts are hard to fix. We find prior attempts at fixing them--GoldenSwag (for HellaSwag) and TruthfulQA v2--ended up worsening the problem. MCQs are inherently a discriminative task, only requiring picking the correct choice among a few given options. Instead, we should evaluate language models for the generative capabilities they are used for. In the Appendix, we discuss how discrimination is easier than even verification, let alone generation.

Shortcuts are exacerbated by the recent trend of using LLMs to create MCQs. However, they are still significant in MMLU, which consists of human-designed exams like GRE and USMLE. These results are with a Qwen3-4B based classifier, but even DeBERTa gets high shortcut accuracy.

But how do we grade generative responses outside "verifiable domains" like code and math? So many paraphrases are valid answers...

Generative Evaluations with Answer Matching

We show a scalable alternative--Answer Matching--works surprisingly well. It's simple--get generative responses to existing benchmark questions that are specific enough to have a semantically unique answer, without showing choices. Then, use an LM to match the response against the ground-truth answer.
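The two steps can be sketched roughly as below. This is not the code we release: the prompt wording, the function names, and especially `toy_matcher` (a string-containment stand-in for a real language model call) are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical prompt wording; the actual matcher prompt may differ.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Does the response give the same answer as the reference? Reply yes or no."
)


def build_matcher_prompt(question: str, reference: str, response: str) -> str:
    return PROMPT_TEMPLATE.format(
        question=question, reference=reference, response=response
    )


def answer_match(question: str, reference: str, response: str,
                 matcher: Callable[[str], str]) -> bool:
    """Grade a free-form response by asking a matcher to compare it to the reference."""
    verdict = matcher(build_matcher_prompt(question, reference, response))
    return verdict.strip().lower().startswith("yes")


def toy_matcher(prompt: str) -> str:
    """Stand-in for an LM call: says yes if the reference string appears in the response."""
    fields = {}
    for line in prompt.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    reference = fields["Reference answer"].lower()
    response = fields["Model response"].lower()
    return "yes" if reference in response else "no"


question = "What is the capital of France?"
print(answer_match(question, "Paris", "It is Paris, of course.", toy_matcher))
print(answer_match(question, "Paris", "I believe it is Lyon.", toy_matcher))
```

In practice the `matcher` argument would wrap a small model such as Qwen3-4B, which can recognise paraphrases that simple string containment misses.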

We conduct a meta-evaluation comparing Answer Matching to LLM-as-a-Judge without reference answers, MCQ, and also some non-discriminative variants of MCQ used recently like MC-Verify (e.g. in the Virology Capabilities Test) and MC-Cloze. We first compare the evaluations in a domain where ground-truth verification is possible, MATH, using the recently released MATH-MC variant for comparisons.

Note how the non-discriminative styles of MCQ show reduced accuracy similar to generative evaluation (Left). But accuracy is not all you need from evals. They should be aligned at a sample-level with ground-truth verification, so we can study where models are right/wrong. From the alignment plot (Right), it becomes clear:

All MCQ variants are poorly aligned with ground-truth verification.
Even small matchers (4B) can achieve near-perfect alignment.
Models are much worse at judging correctness without a reference than answer matching (0.72 vs 0.98 for Deepseek V3).
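To see why aggregate accuracy and sample-level alignment are different things, consider this toy sketch. It uses raw per-sample agreement on made-up grades; the alignment numbers reported above may come from a different (for example chance-corrected) statistic, so treat this purely as an illustration.

```python
def accuracy(grades):
    """Fraction of samples graded as correct."""
    return sum(grades) / len(grades)


def agreement(grades_a, grades_b):
    """Fraction of samples on which two graders give the same verdict."""
    if len(grades_a) != len(grades_b):
        raise ValueError("graders must score the same samples")
    return sum(a == b for a, b in zip(grades_a, grades_b)) / len(grades_a)


# Hypothetical per-sample verdicts (1 = response judged correct).
ground_truth = [1, 1, 0, 0, 1, 0, 1, 0]
evaluation_a = [1, 1, 0, 0, 1, 0, 1, 0]  # same verdict on every sample
evaluation_b = [0, 0, 1, 1, 1, 0, 1, 0]  # same accuracy, different samples

for name, grades in [("A", evaluation_a), ("B", evaluation_b)]:
    print(name,
          "| accuracy:", accuracy(grades),
          "| agreement with ground truth:", agreement(grades, ground_truth))
```

Both evaluations report 50% accuracy, yet only the first one tells you which responses were actually right.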

But we don't need LMs for verifiable domains. Rather we need them for tasks with unconstrained answers prone to "paraphrases" that are semantically equivalent. So we manually grade generative responses on free-form versions of frontier reasoning benchmarks which have arbitrary textual answers: MMLU-Pro and GPQA-Diamond. For human grading, we freely use the internet, calculators and more such tools to increase the accuracy.

Answer Matching outcomes give near-perfect alignment, with even small (recent) models like Qwen3-4B. In contrast, LLM-as-a-judge, even with frontier reasoning models like o4-mini, fares much worse. This is because without the reference-answer, the model is tasked with verification, which is harder than what answer matching requires--paraphrase detection--a skill modern language models have aced.

Impacts on Benchmarking

This is not merely a theoretical concern. Switching from MCQ to generative evaluations changes model rankings. Further, accuracies drop, and datasets that seem saturated start showing room for improvement.

A common rebuttal is that LLM based evaluations are expensive. We show this is not true anymore. We don't need frontier API models; for answer matching, Qwen3-4B might be enough. Surprisingly, with CoT enabled, MCQ costs more as models give longer outputs.

So instead of creating harder MCQs, we should focus our efforts on creating questions for answer matching, much like SimpleQA, GAIA, and parts of HLE. For example, either make questions specific enough to have a single semantic answer (LLMs can handle paraphrasing), or list the multiple correct solutions that are possible.

We release our code, and annotations for subsets of MMLU-Pro and GPQA which have a unique semantic answer.

Counting Down Capabilities to AGI

Shashwat Goel — Sun, 29 Jun 2025 15:17:50 GMT

Introduction
…. Framework
…. AI 2024 - Generality of Knowledge

Part I on The Frontier: General Agents
…. Reasoning: Algorithmic vs Bayesian
…. Information Seeking
…. Tool-use
…. Towards year-long action horizons
…. …. Long-horizon Input: The Need for Memory
…. …. Long-horizon Output
…. Multi-agent systems

Part II on The Future: Generally-Intelligent Agents [TBA]

Introduction

This is a living document where I'll track my evolving thoughts on what remains on the path to building generally-intelligent agents. Why does this matter? Personally, taking a top-down view guides me on what research directions are important to pursue. More broadly, there is much debate about the pace of AI advancement—and I think this question deserves deep consideration. Generally-intelligent agents will be transformative, requiring both policymakers and society to prepare accordingly. Unfortunately, I think AI progress is NOT a smooth exponential that we can extrapolate to make predictions. Instead, the field moves by shattering one (or more) wall(s) every time a new capability gets unlocked. These breakthroughs present themselves as large increases in benchmark performance in a short period of time, but the absolute performance jump on a benchmark provides little information about when the next breakthrough will occur. This is because, for any given capability, it is hard to predict when we will know how to make a model learn it. But it’s still useful to know what capabilities are important and what kinds of breakthroughs are needed to achieve them, so we can form our own views about when to expect a capability. This is why this post is structured as a countdown of capabilities, which as we build out, will get us to “AGI” as I think about it.

Given the inherent uncertainty and complex nature of the topic, many things I write here will be opinionated, up for debate, and sometimes wrong. I will miss many important details. Any feedback and discussions are appreciated. Feel free to comment here, discuss on your favourite platform, write me an email, or drop anonymous notes.

Framework

To be able to work backwards from the end-goal, I think it’s important to use accurate nomenclature to intuitively define the end-goal. This is why I’m using the term generally-intelligent agents. I think it encapsulates the three qualities we want from “AGI”:

Generality: Be useful for as many tasks and fields as possible.
Intelligence: Learn new skills from as few experiences as possible.
Agency: Planning and performing a long chain of actions.

This post will be made in two parts. In this first part, I will discuss the frontier: capabilities needed to achieve general agents which we are already seeing progress towards. In the follow-up to be released later, I will cover the future: the remaining capabilities needed to add intelligence, which might take longer. I will skip discussions of safety, and of additional modalities (vision, audio etc.) which add further generality. While I think these are extremely important, they are beyond the scope of this post.

I used the more popular term “AGI” in the title as it's a handy, recognisable short-hand. But it’s also overloaded. Some definitions (Turing Test) of it might already be achieved. Others are not concrete enough to work backwards from. So I will avoid it for the rest of the post. I also dislike the term “ASI” (Artificial Superintelligence). It leaves me wondering, super in what way, and to what? Often people mean better than humans. But why should that be the end-goal? First of all, it is ill-defined: different humans vary widely in their capabilities. Second, computers are already superhuman in so many ways. They already store more knowledge than any single human, with modern LLMs offering superhuman knowledge retrieval to any natural language query. Computational search is also better at optimising any programmatically specifiable task (such as fitting a curve). I think we can achieve superhuman performance on any capability. There is no reason to believe humans are optimal. We are just one instance of generally intelligent agents, and there is no reason why we cannot create better ones. Besides, what’s easy for humans might not be for AI (physical motor control), and vice-versa (breadth of knowledge). This is another reason to think about AI progress as a basket of capabilities, and to measure performance on each of them.

AI 2024 - Generality of Knowledge

Let’s first start at where we were in 2024. The primary mode of progress between early transformer Language Models (LM) like BERT in 2018 and LLMs like GPT-4 has been increasing generality in the model’s knowledge. This was achieved by training on larger and broader training corpora, until eventually we used most of the publicly available text on the internet. In this period, generality of knowledge was also the primary capability measured by benchmarks for state-of-the-art performance. Between 2023 and early 2024, people particularly tracked progress on MMLU. Wait, what do I mean when I say that MMLU, which combines school to college level test questions across STEM, Law, Humanities etc., measures knowledge? Don’t these examinations measure intelligence? First, note that here I use knowledge to encompass information one needs to know to solve a task, which includes everything from world facts, to how to perform common operations like arithmetic, more advanced medical procedures, as well as what humans consider “common sense”. In this sense, I do think that most examinations for students also mainly test knowledge. It so happens that for humans, knowledge-heavy tests turn out to be a cheap proxy that can correlate with intelligence and agency. This might be because given humans’ reading, memory, time, and effort constraints, acquiring and retaining more knowledge can require intelligence and agency. However, once we remove these constraints, enabling models to read all human knowledge on the internet, many older benchmarks like MMLU only test retrieval of relevant knowledge. Most MMLU questions can be solved by anyone with access to Google search, without specialising in the domain themselves. A testament to this is how mobile phones or access to the internet are not allowed in most examinations that MMLU compiles questions from, because the ability to retrieve relevant knowledge can help a human with little domain expertise “cheat” and achieve high performance.

Note that GPT-4 was still a huge achievement. Removing the constraints humans face, and scaling up training data and compute is only possible due to decades of computing and AI research. GPT-4 demonstrated how general AI can be—encompassing all digital textual knowledge. The same is true for most language models released until 2024 (before o1), with pretraining teaching world knowledge, and instruction tuning (such as RLHF) teaching models about human preferences for a chatbot. GPT-4 like models do show “sparks” of an “intelligent agent”. They can answer questions that are novel in at least their phrasing, and sometimes about obscure facts or highly technical topics with limited training data. Retrieving relevant knowledge and composing it into a coherent output requires some, even if basic, intelligence, planning and execution. Still, much scope for progress remains.

The Frontier: General Agents

In this post, I will first talk about the capabilities needed to achieve general agents. I think we are already on track to build these capabilities, and general agents will soon achieve the “financial definition of AGI”, i.e. $100B in profits. Yet, we can achieve general agents without solving the problem of intelligence. For achieving generally-intelligent agents, I think we will need some more capabilities, which I will discuss in a followup post.

General agents will need to reason, not just in “verifiable” domains like code and math, but for true generality, also on tasks where environments and rewards are more uncertain or preference based, which I call “bayesian reasoning”. They will have to proactively seek information to reduce this uncertainty. General agents will act accurately over very long horizons spanning months or years. They will need to use tools to increase reliability and reduce costs. For example, to maintain memory, they can use tools like hierarchical databases, organizing memory into nested and interlinked pages. To perform long complex tasks beyond the context lengths they can process efficiently, they can decompose them into sub-goals solved by sub-agents, also orchestrated using “tool”-calls. Overall, the AI systems that implement these capabilities will leverage ideas from the rich history of computation, to achieve a wide variety of complex long-horizon goals. I now discuss each of these capabilities below.

Reasoning

Motivation. The big recent unlock has been "reasoning" capabilities. Like I mentioned, early GPT-4 models, or open-weight models like Llama-3, only served as immensely knowledgeable chatbots. However, they failed to solve math, coding, or puzzle challenges that reasonably smart high-schoolers could. Why, one might ask? Aren't there many such problems and their solutions available on the internet? There are many possible explanations, so I'll only state my favorite.

Consider teaching a kindergarten-going child what the capital of France is, versus how to solve math word problems. In the former, merely listening to the phrase "The capital of France is Paris" and imitating it a few times should be enough, the exact same way language models learn when trained to predict the next token on a (huge) static corpus. However, on math word problems, and more generally reasoning tasks, there is a combinatorial explosion in terms of both the ways the same task can be represented, and the possible solutions.

Defining Reasoning. Reasoning involves a sequence of decisions about what information should be considered, which logical abstractions should be applied next, or what actions to take. By logical abstraction, I mean anything from simple operators like addition and negation, to complicated procedures like solving a linear equation, that might be baked in as circuits in the model's weights. Here, mere memorisation and imitation are simply not enough, as there are a large number of slight variations on any problem that change the final answer, while many different solution paths and phrasings are equivalent, leading to the same answer. Just from this perspective, reasoning would require combinatorially (in the input length and solution size) more samples to learn by pure imitation.

Why RL. This is perhaps why early attempts at teaching models reasoning by hiring STEM PhD students to annotate solutions did not give large improvements. Each solution is long, and takes time to write down or verify for humans, limiting the breadth of high-quality data available for "Supervised Finetuning" (SFT) to the model. This changed with Reinforcement Learning (RL) on automatically verified outcomes. The burden of generating possible solution trajectories is shifted from humans to the model, and each trajectory can reliably be verified automatically. The model is taught to assign more weight to successful trajectories, and less to unsuccessful ones. This allows massively scaling up the training data and compute, leading to significant gains on problem solving benchmarks.
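The "more weight to successful trajectories" idea can be sketched with a toy policy gradient. This is a deliberately tiny stand-in, not how LLM RL is actually implemented: the policy is a softmax over three fixed candidate trajectories, the verifier's 0/1 verdicts are hard-coded, and the update uses the exact expected gradient instead of sampled rollouts.

```python
import math


def softmax(logits):
    """Convert logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=1.0):
    """One exact gradient-ascent step on expected reward for a softmax policy.

    For J = sum_i p_i * r_i, the gradient w.r.t. logit j is p_j * (r_j - J).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [theta + lr * p * (r - expected)
            for theta, p, r in zip(logits, probs, rewards)]


# Hypothetical setup: three candidate trajectories, only the second one verifies.
rewards = [0, 1, 0]
logits = [0.0, 0.0, 0.0]
probs_before = softmax(logits)

for _ in range(50):
    logits = policy_gradient_step(logits, rewards)

probs_after = softmax(logits)
print("before:", [round(p, 3) for p in probs_before])
print("after: ", [round(p, 3) for p in probs_after])
```

No human ever writes down the successful solution here; the policy only needs a verifier that says which of its own attempts worked.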

Beyond Algorithmic Reasoning. But are math, code, and puzzles all we want from reasoning? I'd argue no. These are not the types of decisions most people make every day. Instead, we think about, "Where should I go out for dinner with friends", "How to plan my tasks for the week", "How should I convince my business partner on a new deal", and so on. Each of these also requires a sequence of considerations, where we consider relevant information and alternatives, weigh "pros and cons", and then make decisions accordingly. Unfortunately, there is no simple way to automatically "reward" the final choices the model makes in these situations, and the right choice might even vary based on different people's preferences. This is why people sometimes call them “unverifiable domains”. But the truth is they are not “unverifiable”: it is possible to have a clear sense of reward here, as some decisions are better than others. I suspect what people mean by “unverifiable” is lacking the ability to produce a deterministic ground-truth reward with algorithmic execution.

Towards Bayesian Reasoning. For AI to be good personal assistants, let alone replace humans on the job, we need models to reason well and behave smartly in situations that require what I call "bayesian reasoning", to contrast with the algorithmic reasoning tasks we currently measure reasoning progress on. What is bayesian reasoning? Here, executing the same sequence of steps doesn't always lead to the same outcomes. The environment and reward both have uncertainty, arising from potentially unknown, and changing distributions. The only way to better infer the underlying distributions is via exploration--interacting with the environment, or gaining "experience".

Unfortunately, I don't know of any popular, standard benchmark for bayesian reasoning like we had MATH-500 or Codeforces for algorithmic reasoning. This is a capability we didn't explicitly evaluate for humans before. I wonder if it's because designing evaluations for bayesian reasoning is fundamentally hard, or we just need some more creativity. That said, I do think I have a concrete task I look to for measuring bayesian reasoning capabilities of language models--judgemental forecasting. Forecasting involves predicting future events along with the confidence in the forecast, such as "Who will win XYZ election in 20XX?", "ABC, with probability 60%". Over time, as the event finally happens in the future, one can measure how calibrated the forecasts were, or whether the forecasts had a systematic advantage over the crowd (colloquially called "alpha"). This task requires reasoning about different perspectives and conflicting evidence, and extrapolating from it appropriately. For example, to predict the US presidential elections, by reading about it, one can realise that counting proceeds by allocating entire states as a win to one candidate. Thus, it might make sense to start with a 50-50 prior between two candidates, and then predict outcomes state-by-state. As we analyse each state, we can continually update our beliefs, i.e. the winning probability we assign to each candidate. More generally, we apply such reasoning implicitly in our everyday lives, when weighing pieces of information to make a decision.
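One common way to score such probabilistic forecasts once the outcomes are known is the Brier score, the mean squared difference between the forecast probability and what actually happened (lower is better). The sketch below uses made-up forecasts and outcomes; it is one possible scoring rule, not the only way to measure calibration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("need one outcome per forecast")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical resolved questions: 1 = event happened, 0 = it did not.
outcomes = [1, 0, 1, 0]

always_unsure = [0.5, 0.5, 0.5, 0.5]   # never commits to anything
informed = [0.9, 0.2, 0.8, 0.1]        # confident and mostly right

print("always 50-50:", brier_score(always_unsure, outcomes))
print("informed:    ", brier_score(informed, outcomes))
```

A forecaster who always answers 50-50 scores 0.25 regardless of what happens, so beating that number requires actually weighing evidence.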

Training. How then can we train models for bayesian reasoning? Collecting human bayesian reasoning data for imitation learning is hard. Most humans are not very explicit about it, and instead make decisions based on intuitive bayesian reasoning inside their head. Even when incentivised to, it's unclear if humans will be good at writing down their decision making process. I expect RL on instances where we can collect final outcome annotations to be more useful. We could collect questions where bayesian reasoning is required, including both tasks like forecasting where we eventually get to know the ground-truth, and also tasks like choosing a restaurant where the end-goal is satisfying subjective human preferences that we collect at scale.

But is just bayesian reasoning with what one already knows enough?

Information Seeking

Motivation. Humans (ideally) don't just stick to what they already know. We (ideally) recognise when we need more information about a question or topic, and then set out to find it. Take the forecasting question of predicting election outcomes above. Once I decide to make predictions state-by-state, I will then have to look at both past election data from the state, and gauge the current sentiments. For this, I would go on the internet and read through relevant documents, slowly updating my beliefs for who would win the state. More generally, any form of research requires forming a research question, and then seeking relevant information or evidence. That's how science progresses. Even in our daily lives, we find things out by asking questions to people, or going somewhere and finding out ourselves.

Until recently, even frontier models rarely proactively sought information about what they do not know. They simply provided their best guess response with the information in their weights or provided in-context. I think the only concrete prototype we have seen of information-seeking is the DeepResearch models, which given a query, search the web for relevant documents and compile a report. It remains unclear how effective the searches they perform are. Are they really maximizing the information gained through each query, or just generating many seemingly relevant search queries and reading the top results? Besides, search queries are only one form of information-seeking. A model asking questions when uncertain can be super useful for clarifying user intent, and ensuring the model does not go off-the-rails performing actions the user does not want, perhaps because the query was a bit underspecified. Note how we are already entering agent territory here, transitioning from a mere model that responds to user queries to a proactive, initiative-taking system.

The need for benchmarks. Once again, we do not have great benchmarks that isolate information-seeking capabilities, and anecdotally, models seem quite bad at this. Why? Notice how we ask questions based on our internal beliefs of what we do not know or understand. On the internet, models see many examples of questions and answers, but it's much less common that the human asking the question wrote down their entire belief state--what they know, what they do not understand, and why they are asking that specific question. Information seeking almost seems to stem from a conscious understanding of one's knowledge, and what they need to expand it. Phrasing questions to optimally elicit important information is almost an art. Some would consider picking research questions, and then designing optimal experiments to get quick evidence, as one of the most important skills one learns in a PhD. This is often what people call research taste. I then find it ironic that there are no benchmarks reported for information-seeking when companies claim their newest model is "PhD-level". Accumulating knowledge, or even learning how to "solve" a problem is more the goal of schooling or undergraduate education, while a doctorate in science is about asking better questions. I digress.

Training. So how can we train models that ask optimal questions? The most elegant recipe I can think of is providing the model partial information across tasks and their instances. Then, the model has to gather the remaining information to get to a correct solution. It should do this over multiple turns of asking questions and getting responses. How the environment is set up to provide the responses can be a design choice. One could try everything ranging from another LM judge with privileged access to the full question information, or web queries and even research experiments. The information seeking model can be rewarded based on how efficient it was at getting to the ground-truth with its questions.
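As a toy version of this recipe, consider an environment that hides a number and only answers yes/no threshold questions, with a reward that shrinks with every question asked. Everything here (the environment, the reward formula, the two question-asking policies) is a hypothetical sketch; a real setup would use natural language questions and a much richer responder.

```python
def run_episode(secret, ask_policy, low=0, high=127, budget=128):
    """Play one information-seeking episode.

    ask_policy(low, high) returns a threshold t; the environment answers
    whether secret >= t. The reward is higher when fewer questions are needed.
    """
    questions = 0
    while low < high and questions < budget:
        t = ask_policy(low, high)
        questions += 1
        if secret >= t:
            low = t          # answer was "yes"
        else:
            high = t - 1     # answer was "no"
    solved = low == high == secret
    reward = (1 - questions / budget) if solved else 0.0
    return questions, reward


def bisect_policy(low, high):
    """Ask the question that splits the remaining candidates in half."""
    return (low + high + 1) // 2


def linear_policy(low, high):
    """Ask about one candidate at a time."""
    return low + 1


for name, policy in [("bisect", bisect_policy), ("linear", linear_policy)]:
    n_questions, reward = run_episode(secret=100, ask_policy=policy)
    print(f"{name}: {n_questions} questions, reward {reward:.3f}")
```

Both policies eventually find the answer, but only the one that maximizes the information gained per question is rewarded well.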

Tool-Use

Motivation. But how does a model make web queries or run experiments? I've been skimming over a crucial capability, again one we have started to see more emphasis on in the most recent o3/o4-mini models from OpenAI--tool-use. As I discuss later, the “tools” used could range from hierarchical databases that organise traversable and searchable knowledge, to arbitrary code and software. In the limit, even collaboration with other specialised models can be considered a tool in multi-agent settings.

Why is tool-use cool? It offloads parts of the execution to a pre-existing "tool", ranging from functions, coding libraries, to search engines, web browsers and entire computers. Humans do this all the time to make our lives easier, both in the physical and digital realm. One fundamental problem is that while neural network weights offer a lot of flexibility in learning to approximate any function, they are not the best medium to implement a lot of functionality, like factual knowledge and algorithms. In the history of computing, we have created much better mediums for these, like databases and programs. Using these technologies as tools instead of statistically approximating this functionality through parameters can massively increase reliability. An agent that knows how to browse the internet does not need to store all the knowledge on the internet in its weights, thus requiring far fewer trained parameters (aka small size). This is exactly what humans do, we use tools like search to overcome our memory constraints, and computers to overcome our computational constraints. In the limit, introducing tools right into pre-training might be the recipe for efficient learning with smaller architectures, freeing up parameters for higher-level decision making on how to orchestrate these tools.
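A minimal sketch of what "offloading to a tool" looks like mechanically: the model emits a structured call, and ordinary code executes it exactly. The call format (`{"tool": ..., "args": [...]}`), the registry, and the tiny lookup table are hypothetical; real tool-calling interfaces differ in their details.

```python
import math
from typing import Any, Callable, Dict

# Hypothetical tool registry: a program for exact computation and a tiny
# "database" for facts, instead of approximating either one in weights.
FACTS = {"capital_of_france": "Paris"}

TOOLS: Dict[str, Callable[..., Any]] = {
    "factorial": math.factorial,
    "lookup": FACTS.get,
}


def dispatch(call: Dict[str, Any]) -> Any:
    """Execute a tool call of the form {"tool": name, "args": [...]}."""
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*call.get("args", []))


# Pretend these two calls were emitted by a model mid-task.
print(dispatch({"tool": "factorial", "args": [20]}))
print(dispatch({"tool": "lookup", "args": ["capital_of_france"]}))
```

The model only has to decide which tool to call and with what arguments; the exactness comes from the program and the database.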

Training. But how do we teach models to use tools in the first place? One could of course record humans doing it, and then make the model learn via imitation learning. While this can be useful for a warm-start, such as teaching the model how to invoke each tool, I don't think imitation-learning is very useful for tool-use in the limit. For models, it might be optimal to use tools in different ways than how humans do. Humans have different constraints from models, in terms of time, working memory, knowledge capacity etc. Models fail in weird ways humans don't, and tools can then act as reliable replacements for parts of the action chain of an agent. It would be particularly interesting if the model learns to create new tools for itself. This might sound far-fetched, but is not very different from how models already know how to write functions and re-use them in other parts of the code. Maybe the optimal tools a model uses would look very different from the ones humans need.

Thus, to me, the best way to train tool-use in the limit seems once again to be reinforcement learning, where the model’s own exploration trajectories are rewarded based on task success. It can initially be taught to use basic tools like search and code-execution with some imitation learning, but ultimately should be allowed to create its own tools using code. The hope is that after training with a sufficiently large number of tools and trajectories, models can pick up new tools in-context without any further finetuning. This can be particularly useful if the model is allowed to create and use its own tools, or even for the model to use the new libraries and services supporting cool functionality that keep appearing on the internet every day.

Long-Horizon

With tool-use and information-seeking, we will have prototypes of effective agents. However, once trained, current language models only operate accurately over short time-horizons. They are limited by a maximum input (context) length they can process at a time, and they also start accumulating failures once they act over many turns.

Long Input: Towards years worth of experience via Memory

The first issue is models have a limited context size beyond which they do not have persistent memory. This becomes a problem especially as models use up more of their context when producing long reasoning chains, or sending long inputs and receiving long outputs from each tool call. The attention operation, fundamental to the working of the current transformer architecture, requires quadratic computation in the input length. While there is interesting ongoing work on making more efficient architectures for longer contexts, it remains to be seen whether these architectures will give the same performance at scale as quadratic attention.

Motivating Memory. A promising direction is training models to maintain memory. Humans can remember important experiences from years ago. How do we achieve this given a fixed memory capacity? The key word is important. We perform some form of hierarchical abstractive summarization. At the top level, one can only store a list of key events or topics. Once reminded of them, we can think a bit more, to retrieve important details about any given event or topic. If pressed further with targeted questions (perhaps self-generated), we can go deeper, and retrieve more minor details, though this starts to get unreliable. To remedy this, we maintain notes for future reference. Similarly, there’s already work on giving models access to scratchpads, and summary notes.

Towards better memory: Hierarchical, Interlinked Database as a Tool. Anyone who has used notebooks will realise they start to get clunky to search and organise over time. The cool thing is, in designing modern computer systems, we figured out a scalable and reliable storage system that can also extract the most minute details as needed: recursively organising information into nested hyperlinked files. I think it’s only a matter of time before we give models read-write-search access to a relational database as a tool, where they can organise information as they desire. The fundamental unit could be free-form pages that allow both nesting and inter-linking, just like we do on Wikipedia, or the Web. The model should also be allowed to make search queries to locate information it stored. The training tasks can remain the same, with the database just being a tool models learn to use in order to achieve high success rates across long-horizon tasks.
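Here is a rough sketch of what such a memory tool's interface could look like. To keep it self-contained it uses an in-memory dictionary instead of an actual relational database, and the `Page` and `MemoryStore` names, fields, and naive substring search are all my own illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Page:
    """A free-form page that can be nested under a parent and link to other pages."""
    title: str
    body: str = ""
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)


class MemoryStore:
    """In-memory stand-in for a read-write-search memory database."""

    def __init__(self) -> None:
        self.pages: Dict[str, Page] = {}

    def write(self, title: str, body: str, parent: Optional[str] = None,
              links: Optional[List[str]] = None) -> Page:
        """Create or overwrite a page; new pages are nested under `parent`."""
        if parent is not None and parent not in self.pages:
            raise KeyError(f"unknown parent page: {parent}")
        page = self.pages.get(title)
        if page is None:
            page = Page(title=title, parent=parent)
            self.pages[title] = page
            if parent is not None:
                self.pages[parent].children.append(title)
        page.body = body
        page.links = list(links or [])
        return page

    def read(self, title: str) -> Page:
        return self.pages[title]

    def search(self, query: str) -> List[str]:
        """Return titles of pages whose title or body contains the query."""
        q = query.lower()
        return [t for t, p in self.pages.items()
                if q in t.lower() or q in p.body.lower()]


store = MemoryStore()
store.write("Projects", "Index of ongoing projects.")
store.write("Project A", "Deadline moved to March; see contacts.",
            parent="Projects", links=["Contacts"])
store.write("Contacts", "Alice handles Project A.")

print(store.read("Projects").children)
print(store.search("deadline"))
```

The top-level page stays a short index, while details live in nested pages that can be reached by following children and links or by searching.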

The need for training tasks and benchmarks. A bigger question is, what long-horizon tasks do we evaluate and train models on? As the input length increases, there is limited training data available that teaches models to reason over such longer contexts, while the number of possible inputs and questions shoots up combinatorially. There are only so many large books and codebases available, and only a small fraction of the queries that truly require long-context reasoning over them are already on the internet. Even if we made long context lengths efficiently computable in theory, less training data can still lead to lower accuracy. The simplest form of training here would be tasking the model with organising large knowledge stores it has not seen in pre-training, such as recent news or Wikipedia articles. The model is then rewarded on how well it can answer queries about these new knowledge stores using its interlinked database. I think we will start seeing a lot of progress in this space.

Long Output: Towards years worth of actions

Motivation. Most value-producing tasks humans perform require thinking and acting for extended periods of time, such as completing a project. One could plausibly compile these into environment suites, and new startups are already looking into this. However, errors at each step along the way can compound. Say a model is 99% accurate at each step. If it's used to power an agent that takes just 100 actions, something humans can do within an hour, it already has a roughly 63% chance (1 - 0.99^100) of making at least one mistake. Yann LeCun has used this to argue that autoregressive LLMs are doomed. To some extent, this is definitely a problem, and a big reason to squeeze out the last nines of reliability. Long-horizon execution is also a big reason to continue scaling even if we need exponentially more compute for additional performance gains. On long-horizon tasks, small reductions in error compound to give large improvements, increasing the effective horizon the model can act over while staying above a desired accuracy threshold.
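
The compounding arithmetic is easy to check directly. The sketch below assumes errors are independent across steps and that per-step accuracy stays constant, which is a simplification; the function names are mine, chosen for illustration.

```python
import math


def p_at_least_one_error(step_accuracy: float, n_steps: int) -> float:
    """Chance of at least one mistake in n_steps independent steps."""
    return 1.0 - step_accuracy ** n_steps


def effective_horizon(step_accuracy: float, target_success: float) -> int:
    """Largest number of steps n such that step_accuracy**n >= target_success."""
    return math.floor(math.log(target_success) / math.log(step_accuracy))


if __name__ == "__main__":
    print(round(p_at_least_one_error(0.99, 100), 3))  # about 0.634
    # Going from two nines to three nines stretches the horizon roughly tenfold.
    print(effective_horizon(0.99, 0.5), effective_horizon(0.999, 0.5))
```

Under these assumptions, a 99%-accurate model stays above a 50% success rate for only 68 steps, while a 99.9%-accurate one lasts 692 steps, which is the sense in which small reductions in error buy large increases in horizon.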

Compounding error is not the only problem though. The model's chance of making an error at each step does not necessarily remain constant (say at 1%) across the action horizon. Models are likely to err more on later steps as the horizon gets longer because they're trained on much less such data, similar to when they have long inputs. While there is some hope for length generalization, where a model keeps its single-step accuracy as the number of steps increases, it does not always work perfectly. Further, a great amount of optimization during model training goes towards predicting the most likely next token. For an input context that already contains some mistakes, one would expect the most likely completion to contain more mistakes. Once a model sees mistakes in its context, it might condition on them, infer that it is supposed to make mistakes, and thus make more of them.

What can we do? One easy way to get around long-horizon errors is for models to reflect on their previous actions, realise they made a mistake, and roll back. This behaviour can be explicitly trained for, and can also emerge from reinforcement learning on long-horizon task success. Rolling back is easy in some domains: when browsing the internet, for example, you can mostly go back to the previous page. However, in other domains, like when a model makes a financial transaction, things can be harder to reverse. Second, one could optimize for more goal-directed objectives in post-training that lower the effect of the most-likely-next-token-predictor behaviour trained into the model in pretraining. Ultimately, the bottleneck would be scaling the number of tasks and instances where we can train models for long-horizon task success.

One key issue in optimising for long-horizon task success is that even if you get a large number of steps right, mistakes in a few parts can still lead to task failure. Rewards based purely on task outcomes can thus be a noisy signal for training intermediate actions, and hard to optimise. Think of it as working on a project for months and only getting a single score for it at the end, without any intermediate or qualitative feedback. It's very hard to learn what you should have done differently from that. One of the most important research challenges ahead of us is designing intermediate rewards for tasks that can be scaled both in horizon and in the number of instances. A simple way to achieve this could be intrinsic rewards, where the model creates explicit subgoals, and rewards itself based on whether it could execute on them and whether its actions got it closer to the end-goal.
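
One possible reading of this intrinsic-reward idea is sketched below. It assumes the model can produce a self-estimated progress score after every step and can flag which steps completed a subgoal; both signals, the `subgoal_bonus` value, and the function name are hypothetical choices for illustration, not an established recipe.

```python
from typing import Sequence


def shaped_rewards(progress: Sequence[float], subgoal_done: Sequence[bool],
                   outcome: float, subgoal_bonus: float = 0.1) -> list[float]:
    """Turn a single end-of-task outcome into per-step rewards.

    progress[t] is the (self-estimated) closeness to the end goal before step t,
    so it has one more entry than there are steps. subgoal_done[t] says whether
    step t completed one of the model's own subgoals.
    """
    if len(progress) != len(subgoal_done) + 1:
        raise ValueError("progress needs exactly one more entry than subgoal_done")
    rewards = []
    for t, done in enumerate(subgoal_done):
        # Reward movement towards the goal; moving away is penalised.
        r = progress[t + 1] - progress[t]
        if done:
            r += subgoal_bonus
        rewards.append(r)
    if rewards:
        # The real task outcome still arrives only at the final step.
        rewards[-1] += outcome
    return rewards
```

The progress differences telescope, so their sum depends only on the first and last estimates. The subgoal bonuses do not, which leaves room for the model to game its own subgoals; that is one reason designing such rewards is hard.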

Multi-agent

Why limit ourselves to one LLM when we can have multiple? After all, that’s why societies have specialisation, with many “agents” (humans) working together to make progress. Well, I don’t trust analogies between humans and AI, because, as I mentioned earlier, they work under very different constraints and have different capabilities. For a long time, I struggled to find a first-principles explanation for why we would need multi-agent systems with specialization. A big lesson we learnt when shifting from the BERT to the GPT era was that instead of task-specific training, we can train a single model on data across tasks. It would then benefit from skills learnt on one task generalizing to help performance on others. After all, at a fundamental level, many of the “heuristics” we learn at any job (for example, the importance of decomposing problems) generalize across jobs and domains. So why then do we need multi-agent LLM systems?

Motivation: I think context management and parallel computation are the two main motivations for multi-agent systems. In the previous section, I discussed how long-horizon input, output, and execution can be challenging for current LLMs. If an agent requires 10^6 steps or pieces of information in total to solve a task, achieving this with a single agent in the same context would require quadratic, i.e. (10^6)^2 = 10^12, operations. If instead we can split the task into 10^4 parts each requiring 10^2 * c steps, where c is a multiplier representing the overhead from coordination and communication between the agents, then that reduces the operations needed to 10^4 * (10^2 * c)^2 = 10^8 * c^2. If the coordination overhead is small, that can be a large reduction in the number of operations needed. Besides, if the parts can be executed in parallel by different agents, then the time taken can be much lower. Further, it is possible that the sub-tasks are easy enough to be solved by weaker, cheaper models or agents, further saving costs.
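
The same back-of-the-envelope calculation can be written out to see how much coordination overhead the split can tolerate. This is a sketch under the simplification used above, where cost is just the square of the context length per agent; the function names are illustrative.

```python
def single_agent_ops(total_steps: int) -> int:
    """Quadratic attention cost with everything in one context."""
    return total_steps ** 2


def multi_agent_ops(n_parts: int, steps_per_part: int, overhead: float) -> float:
    """Total cost when each part runs in its own context, inflated by overhead c."""
    return n_parts * (steps_per_part * overhead) ** 2


def breakeven_overhead(total_steps: int, n_parts: int, steps_per_part: int) -> float:
    """Overhead c at which splitting costs as much as the single-agent run."""
    return total_steps / (n_parts ** 0.5 * steps_per_part)


if __name__ == "__main__":
    print(single_agent_ops(10**6))                  # 10^12
    print(multi_agent_ops(10**4, 10**2, 1.0))       # 10^8 with no overhead
    print(multi_agent_ops(10**4, 10**2, 10.0))      # 10^10 with c = 10
    print(breakeven_overhead(10**6, 10**4, 10**2))  # c = 100
```

With these numbers, splitting only stops paying off once coordination inflates each sub-task a hundredfold, and a tenfold overhead still leaves a hundredfold saving in operations.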

Training: Initial attempts at multi-agent systems decompose tasks in ways that humans find natural, explicitly prompting different model instances to solve these subproblems. I think in the long run, following the bitter lesson, we should let orchestrator models figure out how to decompose the task and prompt models on their own by scaling optimisation over sufficiently complex environments. Which model to call, what information to provide in its context and how, and how it should provide the output will all be decided by the orchestrator model through what will look like just another tool call to another agent. The reward can be multifaceted, interpolating between task success, execution time, and cost, based on our preferences. The key research question here is how we can scale up the number of instances and types of these complex tasks where multi-agent systems are truly needed. For example, Anthropic trained multi-agent systems for deep-research tasks requiring browsing hundreds of websites to answer complex queries. The next step in this direction could be multi-agent systems for tasks requiring general computer use. In general, I think the more open-ended and longer a task gets, the more we will see the need for, and emergence of, multi-agent tool calls when training agents.
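
As one illustration of such a multifaceted reward, the sketch below takes a weighted average of task success, a time score, and a cost score. The weights, the budgets, and the linear scoring are all assumptions made up for this example; they stand in for whatever preferences one actually has.

```python
def orchestrator_reward(success: float, seconds: float, dollars: float, *,
                        w_success: float = 0.6, w_time: float = 0.2,
                        w_cost: float = 0.2, time_budget: float = 3600.0,
                        cost_budget: float = 10.0) -> float:
    """Interpolate between task success, execution time, and cost.

    success is in [0, 1]. Time and cost are scored linearly against a budget:
    1.0 for free and instant, 0.0 at or beyond the budget.
    """
    total = w_success + w_time + w_cost
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    time_score = max(0.0, 1.0 - seconds / time_budget)
    cost_score = max(0.0, 1.0 - dollars / cost_budget)
    # Normalising by the weight sum keeps the reward in [0, 1].
    return (w_success * success + w_time * time_score + w_cost * cost_score) / total
```

Shifting weight from `w_success` towards `w_time` or `w_cost` would push the orchestrator to parallelise more aggressively or to delegate to cheaper models.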

Part I Conclusion: Looking to the Future

That concludes part one of this two-part series on the capabilities we need to get to generally intelligent agents. I think once we solve how to train the capabilities discussed in this part into AI systems, we should get general agents that can execute a wide variety of projects, not just shorter tasks. Each capability unlock along the way will bring new applications of AI systems, creating immense value. I think general agents will be very powerful and transformative, without even being particularly intelligent. Recall that intelligence was defined as sample-efficiency at learning new skills. For achieving general intelligence, our AI systems need to be able to learn continually, and self-improve. This requires creativity, the ability to reliably verify novel ideas, and the ability to iterate on new creations where by definition fewer samples will be available.

I would really appreciate any feedback on this post, as it will help me improve the follow-up post and, more importantly, both my writing and understanding. Most of this post is speculative, and speculation is hard. I would love to hear disagreements, which could be with how I characterised the capabilities, or capabilities I missed entirely that you think are necessary for general agents, or with the general framework. Feel free to comment here, discuss on your favourite platform, write me an email, or drop anonymous notes.

Coming soon

Shashwat Goel — Wed, 28 May 2025 22:06:25 GMT

This is Shashwat’s Substack.
