Counting Down Capabilities to AGI
What remains on the path to building generally-intelligent agents?
Introduction
    Framework
    AI 2024 - Generality of Knowledge
Part I on The Frontier: General Agents
    Reasoning: Algorithmic vs Bayesian
    Information Seeking
    Tool-use
    Towards year-long action horizons
        Long-horizon Input: The Need for Memory
        Long-horizon Output
    Multi-agent systems
Part II on The Future: Generally-Intelligent Agents [TBA]
Introduction
This is a living document where I'll track my evolving thoughts on what remains on the path to building generally-intelligent agents. Why does this matter? Three compelling reasons:
Top-down view: AI research papers (and product releases) move bottom-up, starting from what we have right now and incrementally improving, in the hope we eventually converge to the end-goal. This is good—that’s how concrete progress happens. At the same time, to direct our efforts, it is important to have a top-down view of what we have achieved and what the remaining bottlenecks are on the way to the end-goal. Besides, known unknowns are better than unknown unknowns.
Research prioritisation: I want this post to serve as a personal compass, reminding me which capabilities I believe are most critical for achieving generally intelligent agents—capabilities we haven't yet figured out. I suspect companies have internal roadmaps for this, but it’s good to also discuss this in the open.
Forecasting AI Progress: Recently, there has been much debate about the pace of AI advancement, and for good reason—this question deserves deep consideration. Generally-intelligent agents will be transformative, requiring both policymakers and society to prepare accordingly. Unfortunately, I think AI progress is NOT a smooth exponential that we can extrapolate to make predictions. Instead, the field moves by shattering one (or more) wall(s) every time a new capability gets unlocked. These breakthroughs present themselves as large increases in benchmark performance in a short period of time, but the absolute performance jump on a benchmark provides little information about when the next breakthrough will occur. This is because, for any given capability, it is hard to predict when we will know how to make a model learn it. But it is still useful to know what capabilities are important and what kinds of breakthroughs are needed to achieve them, so we can form our own views about when to expect a capability. This is why this post is structured as a countdown of capabilities which, as we build them out, will get us to “AGI” as I think about it.
Given the inherent uncertainty and complex nature of the topic, many things I write here will be opinionated, up for debate, and sometimes wrong. I will miss many important details. Any feedback and discussions are appreciated. Feel free to comment here, discuss on your favourite platform, write me an email, or drop anonymous notes.
Framework
To be able to work backwards from the end-goal, I think it’s important to use accurate nomenclature to intuitively define the end-goal. This is why I’m using the term generally-intelligent agents. I think it encapsulates the three qualities we want from “AGI”:
Generality: Be useful for as many tasks and fields as possible.
Intelligence: Learn new skills from as few experiences as possible.
Agency: Plan and execute long chains of actions.
This post comes in two parts. In this first part, I will discuss the frontier—capabilities needed to achieve general agents, which we are already seeing progress towards. In the follow-up to be released later, I will cover the future—the remaining capabilities needed to add intelligence, which might take longer. I will skip discussions of additional modalities (vision, audio etc.) and safety, which I think are extremely important, but beyond the scope of this post.
I used the more popular term “AGI” in the title as it’s a handy, recognisable short-hand for these ideas. But it’s also overloaded. Some definitions of it might already be achieved. Others are not concrete enough to work backwards from. So I will avoid it for the rest of the post. I also dislike the term “ASI” (Artificial Superintelligence). It leaves me wondering: super in what way, and compared to what? Often people mean better than humans. But why should that be the end-goal? First, it is ill-defined—different humans vary widely in their capabilities. Second, computers are already superhuman in so many ways. They already store more knowledge than any single human, with modern LLMs offering superhuman knowledge retrieval for any natural language query. Computational search is also better at optimising any programmatically specifiable task (such as fitting a curve). I think we can achieve superhuman performance on any capability. There is no reason to believe humans are optimal. We are just one instance of generally intelligent agents, and there is no reason why we cannot create better ones. Besides, what’s easy for humans might not be for AI (physical motor control), and vice-versa (breadth of knowledge). This is another reason to think about AI progress as a basket of capabilities, and to measure performance on each of them.
AI 2024 - Generality of Knowledge
Let’s first start with where we were in 2024. The primary mode of progress between early transformer Language Models (LMs) like BERT in 2018 and LLMs like GPT-4 has been increasing generality in the model’s knowledge. This was achieved by training on larger and broader training corpora, until eventually we used most of the publicly available text on the internet. In this period, generality of knowledge was also the primary capability measured by benchmarks for state-of-the-art performance. Between 2023 and early 2024, people particularly tracked progress on MMLU. Wait, what do I mean that MMLU, which combines school- to college-level test questions across STEM, Law, Humanities etc., measures knowledge? Don’t these examinations measure intelligence? First, note that here I use knowledge to encompass the information one needs to know to solve a task—which includes everything from world facts, to how to perform common operations like arithmetic, to more advanced medical procedures, as well as what humans consider “common sense”. In this sense, I do think that most examinations for students also mainly test knowledge. It so happens that for humans, knowledge-heavy tests turn out to be a cheap proxy that can correlate with intelligence and agency. This might be because, given humans’ reading, memory, time, and effort constraints, acquiring and retaining more knowledge can require intelligence and agency. However, once we remove these constraints, enabling models to read all human knowledge on the internet, many older benchmarks like MMLU only test retrieval of relevant knowledge. Most MMLU questions can be solved by anyone with access to Google search, without specialising in the domain themselves. A testament to this is that mobile phones and internet access are not allowed in most examinations that MMLU compiles questions from, because the ability to retrieve relevant knowledge can help a human with little domain expertise “cheat” and achieve high performance.
Note that GPT-4 was still a huge achievement. Removing the constraints humans face, and scaling up training data and compute, is only possible due to decades of computing and AI research. GPT-4 demonstrated how general AI can be—encompassing all digital textual knowledge. The same is true for most language models released until 2024 (before o1), with pretraining teaching world knowledge, and instruction tuning (such as RLHF) teaching models about human preferences for a chatbot. GPT-4-like models do show “sparks” of an “intelligent agent”. They can answer questions that are novel in at least their phrasing, and sometimes about obscure facts or highly technical topics with limited training data. Retrieving relevant knowledge and composing it into a coherent output requires some, even if basic, intelligence, planning and execution. Still, much scope for progress remains.

The Frontier: General Agents
In this post, I will first talk about the capabilities needed to achieve general agents. I think we are already on track to build these capabilities, and general agents will soon achieve the “financial definition of AGI”, i.e. $100B in profits. Yet, we can achieve general agents without solving the problem of intelligence. To achieve generally-intelligent agents, I think we will need some more capabilities, which I will discuss in a follow-up post.
General agents will need to reason, not just in “verifiable” domains like code and math, but, for true generality, also on tasks where environments and rewards are more uncertain or preference-based, which I call “bayesian reasoning”. They will have to proactively seek information to reduce this uncertainty. General agents will act accurately over very long horizons spanning months or years. They will need to use tools to increase reliability and reduce costs. For example, to maintain memory, they can use tools like hierarchical databases, organizing memory into nested and interlinked pages. To perform long, complex tasks beyond the context lengths they can process efficiently, they can decompose them into sub-goals solved by sub-agents, also orchestrated using “tool” calls. Overall, the AI systems that implement these capabilities will leverage ideas from the rich history of computation to achieve a wide variety of complex long-horizon goals. I now discuss each of these capabilities below.
Reasoning
Motivation. The big recent unlock has been "reasoning" capabilities. Like I mentioned, early GPT-4 models, or open-weight models like Llama-3, only served as immensely knowledgeable chatbots. However, they failed to solve math, coding, or puzzle challenges that reasonably smart high-schoolers could. Why, one might ask? Aren't there many such problems and their solutions available on the internet? There are many possible explanations, so I'll only state my favorite.
Consider teaching a kindergarten-going child what the capital of France is, versus how to solve math word problems. In the former, merely listening to the phrase "The capital of France is Paris" and imitating it a few times should be enough—the exact same way language models learn when trained to predict the next token on a (huge) static corpus. However, on math word problems, and more generally on reasoning tasks, there is a combinatorial explosion in terms of both the ways the same task can be represented, and the possible solutions.
Defining Reasoning. Reasoning involves a sequence of decisions about what information should be considered, which logical abstractions should be applied next, or what actions to take. By logical abstraction, I mean anything from simple operators like addition and negation, to complicated procedures like solving a linear equation, that might be baked in as circuits in the model's weights. Here, mere memorisation and imitation is simply not enough, as there are a large number of slight variations on any problem that change the final answer, while many different solution paths and phrasings are equivalent, leading to the same answer. Just from this perspective, reasoning would require combinatorially (in the input length and solution size) more samples to learn by pure imitation.
Why RL. This is perhaps why early attempts at teaching models reasoning by hiring STEM PhD students to annotate solutions did not give large improvements. Each solution is long, and takes time to write down or verify for humans, limiting the breadth of high-quality data available for "Supervised Finetuning" (SFT) to the model. This changed with Reinforcement Learning (RL) on automatically verified outcomes. The burden of generating possible solution trajectories is shifted from humans to the model, and each trajectory can reliably be verified automatically. The model is taught to assign more weight to successful trajectories, and less to unsuccessful ones. This allows massively scaling up the training data and compute, leading to significant gains on problem solving benchmarks.
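To make this concrete, here is a minimal sketch of such an outcome-rewarded RL loop. This is an illustrative simplification, not any lab's actual recipe: `model.sample` and `model.update` are hypothetical interfaces, the verifier is a toy answer-checker, and real systems use more involved algorithms (e.g. PPO- or GRPO-style updates).

```python
import re

def verify(solution_text: str, ground_truth: str) -> float:
    """Toy verifier: extract the final number in the solution and compare to the known answer."""
    numbers = re.findall(r"-?\d+\.?\d*", solution_text)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

def rl_step(model, problems, num_samples=8):
    """One outcome-rewarded update: sample solutions, verify them automatically,
    and upweight trajectories that reached the right answer (REINFORCE-style)."""
    batch = []
    for problem, ground_truth in problems:
        trajectories = [model.sample(problem) for _ in range(num_samples)]  # model explores
        rewards = [verify(t, ground_truth) for t in trajectories]           # automatic verification
        baseline = sum(rewards) / len(rewards)                              # simple variance-reduction baseline
        for trajectory, reward in zip(trajectories, rewards):
            batch.append((problem, trajectory, reward - baseline))          # advantage-weighted trajectory
    model.update(batch)  # hypothetical: raise log-prob of positive-advantage trajectories
```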
Beyond Algorithmic Reasoning. But is math, code, and puzzles all we want from reasoning? I'd argue no. These are not the types of decisions most people make every day. Instead, we think about "Where should I go out for dinner with friends?", "How do I plan my tasks for the week?", "How should I convince my business partner on a new deal?", and so on. Each of these also requires a sequence of considerations, where we consider relevant information and alternatives, weigh "pros and cons", and then make decisions accordingly. Unfortunately, there is no simple way to automatically "reward" the final choices the model makes in these situations; they might even vary based on different people's preferences. This is why people sometimes call them "unverifiable domains", but the truth is they are not "unverifiable"—it is possible to have a clear sense of reward here, as some decisions are better than others. I suspect what people actually mean by "unverifiable" is that one cannot produce a deterministic ground-truth reward with algorithmic execution.
Towards Bayesian Reasoning. For AI to be a good personal assistant, let alone replace humans on the job, we need models to reason well and behave smartly in situations that require what I call "bayesian reasoning", to contrast with the algorithmic reasoning tasks we currently measure reasoning progress on. What is bayesian reasoning? Here, executing the same sequence of steps doesn't always lead to the same outcomes. The environment and reward both have uncertainty, arising from potentially unknown, and changing, distributions. The only way to better infer the underlying distributions is via exploration--interacting with the environment, or gaining "experience".
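To make "the same steps don't always give the same outcome" concrete, here is a toy example (my own, not from any benchmark) of the simplest bayesian-reasoning setting: a two-option bandit where the reward distributions are unknown, and the only way to learn them is to interact and update beliefs.

```python
import random

# Two restaurants with unknown chances of a good dinner; the agent can only learn
# them by going (exploration) and updating a Beta belief over each one.
true_probs = [0.6, 0.8]       # hidden from the agent
beliefs = [[1, 1], [1, 1]]    # Beta(alpha, beta) priors: no idea yet

for night in range(200):
    # Thompson sampling: draw a plausible success rate from each belief, pick the best draw.
    samples = [random.betavariate(a, b) for a, b in beliefs]
    choice = samples.index(max(samples))
    good_dinner = random.random() < true_probs[choice]
    # Bayesian update of the chosen option's belief.
    beliefs[choice][0] += int(good_dinner)
    beliefs[choice][1] += int(not good_dinner)

print("posterior means:", [round(a / (a + b), 2) for a, b in beliefs])
```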
Unfortunately, I don't know of any popular, standard benchmark for bayesian reasoning like we had MATH-500 or Codeforces for algorithmic reasoning. This is a capability we didn't explicitly evaluate for humans before. I wonder if it's because designing evaluations for bayesian reasoning is fundamentally hard, or we just need some more creativity. That said, I do have a concrete task I look to for measuring the bayesian reasoning capabilities of language models--judgemental forecasting. Forecasting involves predicting future events along with the confidence in the forecast, such as "Who will win XYZ election in 20XX?", "ABC, with probability 60%". Over time, as the event finally happens in the future, one can measure how calibrated the forecasts were, or whether the forecasts had a systematic advantage over the crowd (colloquially called "alpha"). This task requires reasoning about different perspectives and conflicting evidence, and extrapolating from it appropriately. For example, to predict the US presidential election, one can read about it and realise that counting proceeds by allocating each state entirely as a win to one candidate. Thus, it might make sense to start with a 50-50 prior between the two candidates, and then predict outcomes state-by-state. As we analyse each state, we can continually update our beliefs, i.e. the winning probability we assign to each candidate. More generally, we apply such reasoning implicitly in our everyday lives when weighing pieces of information to make a decision.
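For completeness, here is roughly how such forecasts can be scored once events resolve, using the Brier score (a standard proper scoring rule) plus a crude calibration check; the numbers below are made up.

```python
# Each entry: (predicted probability that the event happens, whether it actually happened).
forecasts = [(0.60, True), (0.80, True), (0.30, False), (0.90, False), (0.55, True)]

# Brier score: mean squared error between probability and outcome (lower is better).
brier = sum((p - float(happened)) ** 2 for p, happened in forecasts) / len(forecasts)
print(f"Brier score: {brier:.3f}")

# Crude calibration check: among forecasts made at ~60-70%, did ~60-70% actually resolve yes?
bucket = [happened for p, happened in forecasts if 0.6 <= p < 0.7]
if bucket:
    print(f"Empirical frequency in the 60-70% bucket: {sum(bucket) / len(bucket):.2f}")
```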
Training. How then can we train models for bayesian reasoning? Collecting human bayesian reasoning data for imitation learning is hard. Most humans are not very explicit about it, and instead make decisions based on intuitive bayesian reasoning inside their head. Even when incentivised to, it's unclear if humans will be good at writing down their decision-making process. I expect RL on instances where we can collect final outcome annotations to be more useful. We could collect questions where bayesian reasoning is required, including both tasks like forecasting, where we eventually get to know the ground truth, and tasks like choosing a restaurant, where the end-goal is satisfying subjective human preferences that we collect at scale.
But is bayesian reasoning with only what one already knows enough?
Information Seeking
Motivation. Humans (ideally) don't just stick to what they already know. We (ideally) recognise when we need more information about a question or topic, and then set out to find it. Take the forecasting question of predicting election outcomes above. Once I decide to make predictions state-by-state, I will then have to look at both past election data from the state, and gauge the current sentiments. For this, I would go on the internet and read through relevant documents, slowly updating my beliefs for who would win the state. More generally, any form of research requires forming a research question, and then seeking relevant information or evidence. That's how science progresses. Even in our daily lives, we find things out by asking questions to people, or going somewhere and finding out ourselves.
Until recently, even frontier models hardly ever proactively sought information about what they do not know. They simply provided their best-guess response with the information in their weights or provided in-context. I think the only concrete prototype we have seen of information-seeking is the DeepResearch models, which, given a query, search the web for relevant documents and compile a report. It remains unclear how effective the searches they perform are. Are they really maximizing the information gained through each query, or just generating many seemingly relevant search queries and reading the top results? Besides, search queries are only one form of information-seeking. A model asking questions when uncertain can be super useful for clarifying user intent, and ensuring the model does not go off the rails performing actions the user does not want, perhaps because the query was a bit underspecified. Note how we are already entering agent territory here, transitioning from a mere model that responds to user queries to a proactive, initiative-taking system.
The need for benchmarks. Once again, we do not have great benchmarks that isolate information-seeking capabilities, and anecdotally, models seem quite bad at this. Why? Notice how we ask questions based on our internal beliefs about what we do not know or understand. On the internet, models see many examples of questions and answers, but it's much less common that the human asking the question wrote down their entire belief state--what they know, what they do not understand, and why they are asking that specific question. Information-seeking almost seems to stem from a conscious understanding of one's own knowledge, and what one needs to expand it. Phrasing questions to optimally elicit important information is almost an art. Some would consider picking research questions, and then designing the best experiments to get quick evidence, one of the most important skills one learns in a PhD. This is often what people call research taste. I then find it ironic that there are no benchmarks reported for information-seeking when companies claim their newest model is "PhD-level". Accumulating knowledge, or even learning how to "solve" a problem, is more the goal of schooling or undergraduate education, while a doctorate in science is about asking better questions. I digress.
Training. So how can we train models that ask optimal questions? The most elegant recipe I can think of is providing the model partial information across tasks and their instances. Then, the model has to gather the remaining information to get to a correct solution. It should do this over multiple turns of asking questions and getting responses. How the environment is set up to provide the responses can be a design choice. One could try everything from another LM judge with privileged access to the full question information, to web queries, and even research experiments. The information-seeking model can be rewarded based on how efficiently it got to the ground truth with its questions.
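A minimal sketch of what such an environment loop might look like, under the LM-judge-with-privileged-information design choice. The `agent` and `judge` interfaces, the task format, and the per-question cost are all assumptions for illustration.

```python
def run_episode(agent, judge, task, max_turns=10, question_cost=0.05):
    """The agent sees only task['partial']; the judge has privileged access to task['full'].
    Reward = correctness minus a small cost per question, so efficient questioning pays off."""
    context = [task["partial"]]
    num_questions = 0
    for _ in range(max_turns):
        question = agent.ask(context)                   # agent decides what it still needs to know
        if question is None:                            # agent believes it has enough information
            break
        answer = judge.respond(task["full"], question)  # judge answers from the full information
        context.extend([question, answer])
        num_questions += 1
    final_answer = agent.solve(context)
    correct = judge.grade(task["full"], final_answer)   # 1.0 if the solution matches ground truth
    return correct - question_cost * num_questions, context
```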
Tool-Use
Motivation. But how does a model make web queries or run experiments? I've been skimming over a crucial capability, again one we have started to see more emphasis on in the most recent o3/o4-mini models from OpenAI--tool-use. As I discuss later, the “tools” used could range from hierarchical databases that organise traversable and searchable knowledge, to arbitrary code and software. In the limit, even collaboration with other specialised models can be considered a tool in multi-agent settings.
Why is tool-use cool? It offloads parts of the execution to a pre-existing "tool", ranging from functions and coding libraries, to search engines, web browsers, and entire computers. Humans do this all the time to make our lives easier, both in the physical and digital realm. One fundamental problem is that while neural network weights offer a lot of flexibility in learning to approximate any function, they are not the best medium to implement a lot of functionality, like factual knowledge and algorithms. In the history of computing, we have created much better mediums for these, like databases and programs. Using these technologies as tools, instead of statistically approximating their functionality through parameters, can massively increase reliability. An agent that knows how to browse the internet does not need to store all the knowledge on the internet in its weights, thus requiring far fewer trained parameters (i.e. a smaller model). This is exactly what humans do: we use tools like search to overcome our memory constraints, and computers to overcome our computational constraints. In the limit, introducing tools right into pre-training might be the recipe for efficient learning with smaller architectures, freeing up parameters for higher-level decision making on how to orchestrate these tools.
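As a rough illustration of what this looks like in practice, here is a minimal agent loop where the model emits tool calls and reads their results back into context. The `model.next_action` interface and the stub tools are hypothetical; a real agent would plug in actual search APIs and sandboxed execution.

```python
def search(query: str) -> str:
    """Stand-in for a real search tool; a production agent would call a web-search API here."""
    return f"[top results for: {query}]"

def calculator(expression: str) -> str:
    """Offload arithmetic to Python instead of trusting the model's weights."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"search": search, "calculator": calculator}

def agent_loop(model, user_query, max_steps=10):
    """The model alternates between emitting tool calls and reading their results back,
    until it decides it can answer. `model.next_action` is a hypothetical interface that
    returns either ("final", answer) or ("tool", tool_name, tool_input)."""
    context = [user_query]
    for _ in range(max_steps):
        action = model.next_action(context)
        if action[0] == "final":
            return action[1]
        _, tool_name, tool_input = action
        result = TOOLS[tool_name](tool_input)        # execute the tool outside the model
        context.append((tool_name, tool_input, result))
    return model.best_guess(context)                 # hypothetical fallback if the budget runs out
```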
Training. But how do we teach models to use tools in the first place? One could of course record humans doing it, and then make the model learn via imitation learning. While this can be useful for a warm-start, such as teaching the model how to invoke each tool, I don't think imitation-learning is very useful for tool-use in the limit. For models, it might be optimal to use tools in different ways than how humans do. Humans have different constraints from models, in terms of time, working memory, knowledge capacity etc. Models fail in weird ways humans don't, and tools can then act as reliable replacements for parts of the action chain of an agent. It would be particularly interesting if the model learns to create new tools for itself. This might sound far-fetched, but is not very different from how models already know how to write functions and re-use them in other parts of the code. Maybe the optimal tools a model uses would look very different from the ones humans need.
Thus, to me, the best way to train tool-use in the limit once again seems to be reinforcement learning, where the model’s own exploration trajectories are rewarded based on task success. It can initially be taught to use basic tools like search and code-execution with some imitation learning, but ultimately it should be allowed to create its own tools using code. The hope is that after training with a sufficiently large number of tools and trajectories, models can pick up new tools in-context without any further finetuning. This can be particularly useful if the model is allowed to create and use its own tools, or for using the new libraries and services supporting cool functionality that keep appearing on the internet every day.
Long-Horizon
With tool-use and information-seeking, we will have prototypes of effective agents. However, once trained, current language models only operate accurately over short time-horizons. They are limited by a maximum input (context) length they can process at a time, and they also start accumulating failures once they act over many turns.
Long Input: Towards years' worth of experience via Memory
The first issue is that models have a limited context size beyond which they do not have persistent memory. This becomes a problem especially as models use up more of their context when producing long reasoning chains, or sending long inputs to and receiving long outputs from each tool call. The attention operation, fundamental to the working of the current transformer architecture, requires computation quadratic in the input length. While there is interesting ongoing work on more efficient architectures for longer contexts, it remains to be seen whether these architectures will give the same performance at scale as quadratic attention.
Motivating Memory. A promising direction is training models to maintain memory. Humans can remember important experiences from years ago. How do we achieve this given a fixed memory capacity? The key word is important. We perform some form of hierarchical abstractive summarization. At the top level, one can only store a list of key events or topics. Once reminded of them, we can think a bit more, to retrieve important details about any given event or topic. If pressed further with targeted questions (perhaps self-generated), we can go deeper, and retrieve more minor details, though this starts to get unreliable. To remedy this, we maintain notes for future reference. Similarly, there’s already work on giving models access to scratchpads, and summary notes.
Towards better memory—Hierarchical, Interlinked Database as a Tool. Anyone who has used notebooks will realise they start to get clunky to search and organise over time. The cool thing is, in designing modern computer systems, we figured out a scalable and reliable storage system that can also surface the most minute details as needed—recursively organising information into nested, hyperlinked files. I think it’s only a matter of time before we give models read-write-search access to a relational database as a tool, where they can organise information as they desire. The fundamental unit could be free-form pages that allow both nesting and inter-linking, just like we do on Wikipedia, or the Web. The model should also be allowed to make search queries to locate information it has stored. The training tasks can remain the same, with the database just being a tool the model learns to use in order to achieve high success rates across long-horizon tasks.
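A minimal sketch of what such a memory tool could look like: free-form pages that can nest and cross-link, exposed to the agent as write/read/search calls. This is my own toy illustration; a real system would likely back it with a proper database and embedding-based search.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    title: str
    content: str = ""
    children: list = field(default_factory=list)   # nested sub-pages
    links: list = field(default_factory=list)      # cross-links to related page titles

class MemoryStore:
    """Read-write-search memory exposed to the agent as three tool calls."""
    def __init__(self):
        self.pages = {}

    def write(self, title, content, parent=None, links=()):
        page = self.pages.setdefault(title, Page(title))
        page.content = content
        page.links = list(links)
        if parent:
            self.pages.setdefault(parent, Page(parent)).children.append(title)

    def read(self, title):
        page = self.pages.get(title)
        return None if page is None else (page.content, page.children, page.links)

    def search(self, query):
        # Crude keyword match; an embedding index would be the natural upgrade.
        q = query.lower()
        return [t for t, p in self.pages.items() if q in t.lower() or q in p.content.lower()]

# Usage: the agent summarises an episode, nests details under it, and links related topics.
memory = MemoryStore()
memory.write("Project Atlas", "Q3 research project on forecasting agents.")
memory.write("Atlas: eval results", "Brier 0.18 on held-out questions.",
             parent="Project Atlas", links=["Project Atlas"])
print(memory.search("forecasting"))
```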
The need for training tasks and benchmarks. A bigger question is: what long-horizon tasks do we evaluate and train models on? As the input length increases, there is limited training data available that teaches models to reason over such longer contexts, while the number of possible inputs and questions shoots up combinatorially. There are only so many large books and codebases available, and only a small fraction of the queries that truly require long-context reasoning over them are already on the internet. Even if we made long context lengths efficiently computable in theory, less training data can still lead to lower accuracy. The simplest form of training here would be tasking the model with organising large knowledge stores it has not seen in pre-training, such as fresh news or Wikipedia articles. It is then rewarded on how well it can answer queries about these new knowledge stores using its interlinked database. I think we will start seeing a lot of progress in this space.
Long Output: Towards years' worth of actions
Motivation. Most value-producing tasks humans perform require thinking and acting for extended periods of time, such as completing a project. One could plausibly compile these into environment suites, and new startups are already looking into this. However, errors at each step along the way can compound. Say a model is 99% accurate per step. If it's used to power an agent that takes just 100 actions, something humans can do within an hour, it already has a roughly 63% chance of making at least one mistake (1 - 0.99^100 ≈ 0.63). Yann LeCun has used this to argue that autoregressive LLMs are doomed. To some extent, this is definitely a problem, and a big reason to squeeze out the last nines of reliability. Long-horizon execution is also a big reason to continue scaling even if we need exponentially more compute for additional performance gains. On long-horizon tasks, small reductions in error compound to give large improvements, increasing the effective horizon the model can act over while staying above a desired accuracy threshold.
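The arithmetic is worth playing with directly; a few lines computing the chance of a mistake-free run for some illustrative per-step accuracies and horizons:

```python
# Probability of completing an n-step task with no mistakes, for a given per-step accuracy.
for per_step_accuracy in (0.99, 0.999, 0.9999):
    for horizon in (100, 1_000, 10_000):
        p_no_mistake = per_step_accuracy ** horizon
        print(f"accuracy={per_step_accuracy}, steps={horizon}: P(no mistake) = {p_no_mistake:.3f}")
```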
Compounding error is not the only problem though. The model's chance of making an error at each step does not necessarily remain constant (say, at 1%) across the action horizon. Models are likely to err more on later steps as the horizon gets longer, because they're trained on much less such data, similar to when they have long inputs. While there is some hope for length generalization, where a model maintains its single-step accuracy as the number of steps increases, it does not always work perfectly. Further, a great amount of optimization during model training goes towards predicting the most likely next token. One would expect more mistakes in the most likely completion of an input context that already contains some mistakes. Once a model sees mistakes in its context, it might condition on them, thinking it has to make mistakes, and thus make more mistakes.
What can we do?: One easy way to get around long-horizon errors is if models could reflect on their previous actions, realise they made a mistake, and roll back. This is a behaviour that models can both be explicitly trained for, and that can also emerge from reinforcement learning on long-horizon task success. This is easy in some domains: for example, when browsing the internet, you can mostly go back to the previous page. However, in other domains, like when a model makes a financial transaction, things can be harder to reverse. Second, one could optimize for more goal-directed objectives in post-training that lower the effect of the most-likely-next-token-predictor behaviour trained into the model in pretraining. Ultimately, the bottleneck will be scaling the number of tasks and instances where we can train models for long-horizon task success.
One key issue in optimising for long-horizon task success is that even if you get a large number of steps right but make mistakes in some parts, it could lead to task failure. Rewards based purely on task outcomes can thus be a noisy signal for training intermediate actions, and hard to optimise. Think of it as working on a project for months and only getting a single score for it at the end, without any intermediate or qualitative feedback. It's very hard to learn what you should have done differently from that. One of the most important research challenges ahead of us is designing intermediate rewards for tasks that can be scaled both in horizon and in the number of instances. A simple way to achieve this could be intrinsic rewards, where the model creates explicit subgoals, and rewards itself based on whether it could execute on them and whether its actions got it closer to the end-goal.
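A minimal sketch of that intrinsic-reward idea, with hypothetical `model` and `environment` interfaces: the model proposes its own subgoals up front and gets dense credit for each one it completes, on top of the sparse end-of-task reward.

```python
def long_horizon_episode(model, environment, subgoal_bonus=0.1):
    """Dense intrinsic reward from self-generated subgoals, plus a sparse final outcome reward."""
    subgoals = model.propose_subgoals(environment.task_description())  # model sets its own milestones
    completed, total_reward = set(), 0.0
    state = environment.reset()
    while not environment.done():
        action = model.act(state, subgoals)
        state = environment.step(action)
        for i, subgoal in enumerate(subgoals):
            if i not in completed and model.check_subgoal(state, subgoal):
                completed.add(i)
                total_reward += subgoal_bonus        # intermediate, self-assessed progress
    total_reward += environment.final_outcome()      # sparse end-of-task success signal
    return total_reward
```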
Multi-agent
Why just limit to one LLM when you can have multiple? After all, that’s why societies have specialisation, with many “agents” (humans) working together to make progress. Well, I don’t trust analogies between humans and AI, because like I mentioned earlier, they work under very different constraints and have different capabilities. For a long time, I struggled to find a first-principles explanation for why we would need multi-agent systems with specialization. A big lesson we learnt when shifting from the BERT to the GPT era was that instead of task-specific training, we can train a single model on data across tasks. It would then benefit from skills learnt on one task generalizing to help performance on others. After all, at a fundamental level, many of the “heuristics” we learn at any job (for example, the importance of decomposing problems) generalize across jobs and domains. So why then do we need multi-agent LLM systems?
Motivation: I think context management and parallel computation are the two main motivations for multi-agent systems. In the previous section, I discussed how long-horizon input, output, and execution can be challenging for current LLMs. If an agent requires 10^6 steps or pieces of information in total to solve a task, achieving this with a single agent in the same context would require quadratic, i.e. (10^6)^2 = 10^12, operations. If instead we can split the task into 10^4 parts, each requiring 10^2 * c steps, where c is a multiplier representing the overhead from coordination and communication between the agents, then that reduces the operations needed to 10^4 * (10^2 * c)^2 = 10^8 * c^2. If the coordination overhead is small, that is a large reduction in the number of operations needed. Besides, if the parts can be executed in parallel by different agents, then the time taken can be much, much lower. Further, it is possible that the sub-tasks are easy enough to be solved by weaker, cheaper models/agents, further saving costs.
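The same back-of-the-envelope calculation in code, to make it easy to vary the split factor and the coordination overhead c (all numbers illustrative):

```python
def attention_ops(total_steps, num_agents=1, overhead=1.0):
    """Rough quadratic-attention cost: each agent pays (its own context length) squared."""
    per_agent_steps = (total_steps / num_agents) * overhead
    return num_agents * per_agent_steps ** 2

total = 10**6
print(f"single agent:          {attention_ops(total):.1e} ops")             # ~1e12
print(f"10^4 sub-agents, c=1:  {attention_ops(total, 10**4, 1.0):.1e} ops")  # ~1e8
print(f"10^4 sub-agents, c=3:  {attention_ops(total, 10**4, 3.0):.1e} ops")  # ~9e8
```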
Training: Initial attempts at multi-agent systems decompose tasks in ways which humans find natural, explicitly prompting different model instances to solve these subproblems. I think in the long run, following the bitter lesson, we should let orchestrator models figure out how to decompose the task and prompt models on their own, by scaling optimisation over sufficiently complex environments. Which model to call, what information to provide in its context and how, and how it should provide the output will all be decided by the orchestrator model, through what will look like just another tool call to another agent. The reward can be multifaceted, interpolating between task success, execution time, and cost, based on our preferences. The key research question here is how we can scale up the number of instances and types of these complex tasks where multi-agent systems are truly needed. For example, Anthropic trained multi-agent systems for deep-research tasks requiring browsing hundreds of websites to answer complex queries. The next step in this direction could be multi-agent systems for tasks requiring general computer use. In general, I think the more open-ended and longer a task gets, the more we will see the need for, and emergence of, multi-agent tool calls when training agents.
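A minimal sketch of that orchestrator pattern, with hypothetical `orchestrator.plan` and `call_agent` interfaces: delegating a sub-task to another (possibly cheaper, specialised) agent looks, to the orchestrator, like just another tool call.

```python
def call_agent(agent_name, sub_task, context_snippets):
    """Hypothetical dispatch to a sub-agent; a real system would route to another model."""
    return f"[{agent_name}'s answer to: {sub_task}]"

def orchestrate(orchestrator, task, max_calls=50):
    """The orchestrator decides how to decompose the task, which agent to call,
    and what context each one sees; results flow back just like tool outputs."""
    scratchpad = [task]
    for _ in range(max_calls):
        decision = orchestrator.plan(scratchpad)           # hypothetical planning interface
        if decision["type"] == "final":
            return decision["answer"]
        result = call_agent(decision["agent"], decision["sub_task"], decision["context"])
        scratchpad.append((decision["sub_task"], result))  # sub-agent output re-enters context
    return orchestrator.best_effort(scratchpad)
```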
Part I Conclusion: Looking to the Future
That concludes part one of this two-part series on the capabilities we need to get to generally intelligent agents. I think once we solve how to train the capabilities discussed in this part into AI systems, we should get general agents that can execute a wide variety of projects, not just shorter tasks. Each capability unlock along the way will bring new applications of AI systems, creating immense value. I think general agents will be very powerful and transformative, without even being particularly intelligent. Recall that intelligence was defined as sample-efficiency at learning new skills. For achieving general intelligence, our AI systems need to be able to learn continually, and self-improve. This requires creativity, the ability to reliably verify novel ideas, and the ability to iterate on new creations where by definition fewer samples will be available.
I would really appreciate any feedback on this post, as this will help me improve the follow up post, but more importantly both my writing and understanding. Most of this post is speculative, and speculation is hard. I would love to hear disagreements, which could be with how I characterised the capabilities, or capabilities I missed entirely that you think are necessary for general agents, or with the general framework. Feel free to comment here, discuss on your favourite platform, write me an email, or drop anonymous notes.