Scientific discovery as a training environment for Superintelligence
Why I think automated research is the means, not just the end, for ASI
I believe finding the right training data and environments in which to spend training compute is the key driver of AI progress from here on.
We know from the BERT-to-GPT transition that scaling up training data leads to generalizable capabilities far beyond task-specialized finetuning. Instead of continuing the current trend of paying billions to curate fixed, narrow environments that specialize models, what general environment could be a successor to internet-scale pretraining? In this post, I make a case for automated research, or general scientific discovery. I argue that training models to be better researchers would incentivize many of the capabilities missing in 2025’s language models, capabilities that may bridge the gap to superintelligence. Many think scientific discovery is just an impactful application of superintelligence. I have not seen anyone publicly argue the causality in the other direction.
This is NOT the AI 2027 argument that a superhuman AI researcher will lead to an “intelligence explosion”. Instead, I focus on how training AI for scientific discovery sets us up to build the capabilities missing from today’s LLMs. Internet-scale pretraining gave LLMs humanity’s collective knowledge. Post-training for scientific discovery will teach them how to acquire and create knowledge. This can start with training models as better co-scientists, and slowly build towards executing end-to-end experiments. And by “training”, I don’t necessarily mean only updating weights; I mean evolving the entire AI system, which may include the weights.
Concretely, scientific discovery demands:
Coherent long-horizon planning and execution
Continual adaptation to build on new findings
Reasoning about uncertainty
Sample-efficient learning
Curiosity and open-ended exploration
These are the key capabilities today’s models lack. Optimizing for scientific discovery would incentivize all of them. Why not some other environment? I do think that many other real-world decision-making processes involve similar skills.
Yet, scientific discovery has a unique set of properties ideal for training:
Large-scale, open data
Verifiability
Truth-seeking
Capabilities Instrumental to an AI Scientist
Deep Learning, at sufficient scale and diversity, can give rise to whatever capabilities are necessary for optimizing the objective or environment it is trained on.
Scientific discovery requires coherent long-horizon planning, execution, and reasoning about environment feedback. Carrying out scientific projects can take expert humans anywhere from months to many years. There are many experiments to plan, and while some work out as planned, others fail. Scientific ideas require constant refinement based on experimental results, and careful execution to ensure there are no confounders. They require navigating and processing large amounts of data, which demands memory beyond what current context limits (even with context management) allow.
Scientific discovery requires continual adaptation, iterating on top of new scientific findings. Crucially, scientific discovery is not about solving problems that are known to be solvable, which is what current AI benchmarks test. Instead, it pushes the frontier of what’s solvable by leveraging recent breakthroughs. Today’s models are good at finding and combining past knowledge, but learn little “on the job” from their own and others’ experiments.
Scientific discovery requires reasoning about uncertainty to solve open-ended problems. Today we know how to make models almost superhuman at well-specified problems (e.g. OpenAI’s system solved all competitive programming problems at ICPC 2025). But models remain underwhelming on tasks that are not well specified and require gathering more information.
Scientific discovery requires sample efficiency, in both information and skill acquisition. Science is always budget-constrained in both time and resources. Large-scale scientific experiments can sometimes take months to run. There isn’t much room for trial and error. Even setting aside the resource constraints of running scientific experiments, mindless trial and error would lead to most findings being false. As a result, scientists have to be extremely efficient in how they acquire information and extrapolate from it. In comparison, training via gradient descent would be far too inefficient. While today’s models show some efficiency in-context, anecdotally, whenever I discuss research with models1, they propose far too many, far too inefficient experiments to test any hypothesis.
Scientific discovery incentivizes curiosity, open-ended exploration, and creativity. Breakthroughs in science often come from unexpected places. Making obvious, incremental improvements is not always optimal. Instead, one sometimes has to ask questions no one has asked before, questions that only turn out to be important in the long run. The reward functions, and the conviction, are often intrinsic, based on introspection. External rewards (e.g. whether the research was impactful) only arrive after a long time, and are noisy. Self-verification, course correction, and epistemic humility become essential.
Why specifically Science?
I think there are a few key reasons that make scientific discovery uniquely promising as a scalable environment to train ASI:
Large-scale data is openly available. The corpus of scientific literature available on the internet is extremely large, spans diverse domains, and its potential for AI training is largely untapped2. The scientific method is surprisingly general, precisely because it is constrained only by verifiability…
Verifiability is the foundation of science. For many real-world decisions, we cannot know how things would have gone had we acted differently. Science is all about designing experiments to test such counterfactuals. Moreover, in good science, there is always a generator-verifier gap: it can take months or years to arrive at a scientific result, but once something is understood, it becomes easy for others to explain, verify, and reproduce.
Science prioritizes integrity and truth-seeking. In contrast, much of real-world decision making is power-seeking (e.g. in corporations or politics). While there are other complex goals, like maximizing profit, that would require learning the skills mentioned above, optimizing for them could get quite harmful. While science too can sometimes be unethical, or dual-use, norms around this are well established and decently enforced by the community.
Challenges
Of course, all this is easier said than done. Getting models to carry out end-to-end science, or even to help human researchers, requires solving many technical challenges. While breakthroughs from narrow systems like AlphaFold have been transformative, to be a successor to internet-scale pretraining we will need environments that enable general scientific reasoning. It remains unclear how to convert our vast body of scientific literature into environments one can train language model (agents) on. Most science cannot even be performed in digital simulation, requiring human studies or wet labs. Even for what can be digitally simulated (e.g. AI research), the compute and time needed for training AI scientists would be many orders of magnitude more than what we have built today (yes, the infra investments might not be a bubble). This is because science is a long-horizon task, and verification signals based on the outcome of an experiment can take a long time to arrive. Our learning algorithms and AI system architectures will have to be made much more suitable for long-horizon tasks: outcome rewards won’t be enough, and memory and continual learning will be essential. Overall, I have little clarity on what the training loop even looks like. But I think once shown a north star, between gradient and graduate student descent, Deep Learning always finds a way…
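To make the shape of the problem concrete, here is a minimal, purely illustrative sketch of what an outer loop over a scientific-discovery environment might look like. Every name in it (ResearchAgent, run_experiment, the budget parameter) is a hypothetical placeholder, not a reference to any existing system; the point is only that such a loop couples persistent memory with a scarce experiment budget and delayed, noisy outcome signals.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchAgent:
    """Hypothetical agent: proposes experiments and keeps persistent memory."""
    memory: list = field(default_factory=list)

    def propose_experiment(self, open_problem: str) -> str:
        # In a real system this would be an LLM call conditioned on `memory`.
        return f"experiment for: {open_problem}"

    def update(self, experiment: str, outcome: float, notes: str) -> None:
        # Continual-learning step: here we only append to memory; a real
        # system might also update weights, tools, or scaffolding.
        self.memory.append((experiment, outcome, notes))


def run_experiment(experiment: str) -> tuple[float, str]:
    """Stand-in for a slow, expensive simulated or wet-lab experiment."""
    return 0.0, "placeholder finding"  # outcome reward arrives late and is noisy


def training_loop(agent: ResearchAgent, open_problems: list[str], budget: int) -> None:
    """Each problem grants only `budget` experiments, so the agent is
    implicitly rewarded for sample efficiency, not trial and error."""
    for problem in open_problems:
        for _ in range(budget):
            experiment = agent.propose_experiment(problem)
            outcome, notes = run_experiment(experiment)
            agent.update(experiment, outcome, notes)


if __name__ == "__main__":
    agent = ResearchAgent()
    training_loop(agent, ["why do LLMs forget long contexts?"], budget=3)
    print(f"memory after training: {len(agent.memory)} entries")
```

Everything hard lives inside the stubs: how propose_experiment actually uses memory, how update changes the agent beyond appending to a list, and how a real run_experiment returns a trustworthy signal after weeks rather than microseconds.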
Footnotes
Thanks to Maksym Andriuschenko and Ameya Prabhu for providing feedback on a draft of this blog post.
1I have recently spent a lot of time looking at model-generated experiment plans. LLMs propose throwing the kitchen sink at the problem, and then some more…
2I say so because training to predict the next token of a scientific paper just cannot extract its value. Papers are written in a peculiar way: all the important insights are spoiled up front, in the Abstract and Introduction. This reduces the ingenuity required to predict subsequent tokens, and hence the learning potential. Besides, a paper compresses away all the iteration that goes into the scientific process, so it does not have much to teach through imitation. Science is learnt by thinking and doing.
3Why do I think this post was worth writing? In part, it helps make sense of where frontier labs might be going. For example, in the last few months, OpenAI announced it is focusing on creating automated researchers by 2028, while startups like Periodic Labs and Edison were launched to create AI Scientists. And of course, DeepMind is the OG in AI for scientific discovery.