Find incentives and mechanisms to direct strategic AI agents to desirable equilibria.
We are excited about using framings and tools from theoretical economics to better describe the incentives and equilibrium outcomes relevant to AI alignment. By considering AI agents in a strategic setting, we hope to make alignment approaches robust to improvements in capabilities and to inspire new ideas for alignment techniques. We expect useful insights to come from proposing new or different abstractions: for example, how different forms of misaligned incentives affect outcomes, or how robust proposals are to different equilibrium concepts or modelling assumptions.
Hammond et al., 2025 provide a taxonomy of risks from advanced AI in multi-agent settings, including a broader list of problems and further context on some of the specific problems we highlight below.
Problem summary: Current frontier models already exhibit sycophantic and deceptive behaviour (OpenAI, 2025; Park et al., 2023). This is arguably evidence of strategic behaviour. As AI models become more strategic, how should we expect models to choose what information to reveal? We anticipate useful insights from areas within information design (Bergemann and Morris, 2019) such as cheap talk (Crawford and Sobel, 1982), verifiable information disclosure (Milgrom, 1981), persuasion (Kamenica and Gentzkow, 2011) or delegation (Holmstrom, 1984). What are optimal revelation strategies from the perspective of the AI, how varied and stable are equilibria, and can we design mechanisms to improve information disclosure?
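To make the information-design framing more concrete, here is a minimal sketch of the classic sender-receiver calculation from Kamenica and Gentzkow, 2011, recast as an AI (the sender) committing to a reporting policy for an overseer (the receiver) who only acts on reports it sufficiently trusts. The threshold, prior and framing are our own illustrative assumptions, not figures from the literature.

```python
# Minimal Bayesian-persuasion sketch (after Kamenica and Gentzkow, 2011), recast as an
# AI sender reporting to a human receiver. The receiver acts on the report only if the
# posterior probability of the "good" state is at least 0.5. The sender always prefers
# that the receiver acts, and (as the model assumes) can commit to its reporting policy.
# All numbers are illustrative assumptions.

prior_good = 0.3   # receiver's prior that the state is "good"
threshold = 0.5    # receiver acts iff posterior >= threshold

# No disclosure: posterior = prior = 0.3 < 0.5, so the receiver never acts.
p_act_no_info = 0.0

# Full disclosure: the receiver acts exactly when the state really is good.
p_act_full_info = prior_good

# Sender-optimal signal: report "good" whenever the state is good, and also report
# "good" with probability q in the bad state, where q is chosen so the posterior
# after a "good" report is exactly the threshold:
#   prior_good / (prior_good + (1 - prior_good) * q) = threshold
q = prior_good * (1 - threshold) / (threshold * (1 - prior_good))
p_act_persuasion = prior_good + (1 - prior_good) * q

print(f"q (prob. of reporting 'good' in the bad state): {q:.3f}")      # 0.429
print(f"P(receiver acts) with no disclosure:   {p_act_no_info:.2f}")    # 0.00
print(f"P(receiver acts) with full disclosure: {p_act_full_info:.2f}")  # 0.30
print(f"P(receiver acts) with optimal signal:  {p_act_persuasion:.2f}") # 0.60
```

Even this toy version shows why equilibrium analysis matters: a strategic sender that can commit to a partially garbled reporting policy doubles the rate at which the receiver acts relative to full disclosure, while every report remains Bayes-consistent from the receiver's perspective.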
Why this matters: Information revelation matters for several aspects of alignment. Some alignment proposals focus on building protocols to incentivise honesty (e.g. Irving et al., 2018). If we trust an AI to be honest or incapable of deceiving us, we may be able to reliably use it to supervise or design other models (e.g. Buhl et al., 2025). In addition, information design may help determine when we can safely use a misaligned AI even if it is trying to manipulate us (see Bhatt et al., 2025 for an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned).
Examples of subproblems:
Suggestions to learn more: For alignment by debate, we’d recommend reading Buhl et al., 2025 and Brown-Cohen et al., 2025. Previous research on debate includes Irving et al., 2018, Barnes, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024. Hammond et al., 2025 discuss a broad range of risks from multi-agent scenarios; see especially Section 3.1 on Information Asymmetries.
Problem summary: As AI models become more capable, they may gain capabilities we don’t anticipate. If we model alignment as a game between a human principal and an AI agent, how robust are proposed safety mechanisms to unknown actions in the AI’s action space? Can we use insights from robust mechanism design (Bergemann and Morris, 2005) to relax the modelling assumptions behind AI alignment proposals?
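As a toy illustration of the robustness concern (our own construction, not one from the cited literature), the sketch below tunes an audit probability so that compliance is the agent's best response within an assumed action set, and then shows that adding a single unanticipated, harder-to-detect action reverses that conclusion. The action names and payoffs are illustrative assumptions.

```python
# Toy illustration of how a safety mechanism tuned against an assumed action set can
# fail when the agent has an unanticipated action. The principal audits with
# probability p_audit and penalises detected misbehaviour. Numbers are illustrative.

def agent_payoff(action, p_audit):
    # (base reward, probability an audit detects misbehaviour) for each action
    actions = {
        "comply":        (1.0, 0.0),   # anticipated
        "defect":        (2.0, 0.9),   # anticipated: usually caught when audited
        "covert_defect": (2.0, 0.1),   # unanticipated: largely evades the monitor
    }
    reward, p_detect = actions[action]
    penalty = 5.0
    return reward - p_audit * p_detect * penalty

p_audit = 0.3  # chosen so that, among the *anticipated* actions, complying is best

anticipated = ["comply", "defect"]
full = ["comply", "defect", "covert_defect"]

best_anticipated = max(anticipated, key=lambda a: agent_payoff(a, p_audit))
best_full = max(full, key=lambda a: agent_payoff(a, p_audit))

print("Best response under the assumed action set:", best_anticipated)  # comply
print("Best response with the unanticipated action:", best_full)        # covert_defect
```

The question in the problem summary is whether guarantees can be made to hold over a whole family of possible action sets or detection technologies, rather than for a single enumerated one.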
Why this matters: We want alignment methods that are robust to increases in future capabilities, even if we can’t anticipate exactly what these are.
Examples of subproblems:
Suggestions to learn more: Hadfield-Menell and Hadfield, 2018 set out some framings of AI alignment as an incomplete contracting problem. We’d recommend reading Bhatt et al., 2025, as an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned. Previous research on reducing harm from misaligned systems includes Mallen et al., 2025.
Problem summary: We want to predict how powerful AI systems will behave in a range of scenarios. Our best guesses come from some combination of extrapolating from the behaviour of current, more limited models; calculating optimal strategies at Nash equilibria (treating agents as computationally unbounded); and calculating optimal strategies for computationally bounded agents (drawing on complexity theory and algorithmic game theory). Are there other approaches, taking inspiration from behavioural economics or elsewhere, that predict meaningfully different behaviour from powerful AIs?
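As one concrete example of how these approaches can diverge, the sketch below compares the Nash prediction with level-k (bounded-rationality) predictions in the standard ‘guess 2/3 of the average’ game from behavioural economics; the level-0 anchor of 50 is the conventional modelling assumption, and none of the numbers are claims about actual AI systems.

```python
# Level-k predictions for the "guess 2/3 of the average" game (players pick a number
# in [0, 100]; the winner is whoever is closest to 2/3 of the average guess).
# The unique Nash equilibrium is 0, but bounded (level-k) reasoners stop well above it.
# Level-0 guessing 50 is a conventional modelling assumption.

target_fraction = 2 / 3
guess = 50.0  # level-0 anchor
print(f"level-0 guess: {guess:.1f}")
for k in range(1, 6):
    guess *= target_fraction  # each level (approximately) best-responds to the level below
    print(f"level-{k} guess: {guess:.1f}")
print("Nash equilibrium guess: 0.0")
```

A similar gap appears between exact best responses and noisy (quantal) responses; which of these models, if any, best describes a powerful AI is exactly the open question above.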
Why this matters: We currently have a lot of uncertainty about how neural nets and deep learning work, and hence about how their behaviour will change as they continue to scale. Most interpretability work is empirical and therefore relies on extrapolating from the behaviour of past models to future ones. To understand whether proposed control and alignment techniques will work, we need a better understanding of how bounded rationality will affect the actions an AI might take.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Buhl et al., 2025 and Brown-Cohen et al., 2025. Previous research on debate (a scalable oversight method) includes Irving et al., 2018, Barnes, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024. It may be particularly useful to consider cases where both the principal and agent are bounded, but the principal is significantly more so than the agent; for example, Goldreich, 2017 sets out a proof system where the prover is less constrained than the verifier.
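To illustrate the flavour of that prover-verifier asymmetry in a deliberately simple form (a compositeness certificate, not a reconstruction of Goldreich's proof systems), the sketch below has a powerful prover do an expensive search while a constrained verifier only performs a cheap check; the specific numbers are arbitrary.

```python
# Toy prover-verifier asymmetry (illustrative only; not a reconstruction of Goldreich, 2017).
# The prover does the expensive work (searching for a factor); the verifier checks the
# claim "n is composite" with a single cheap division. At realistic scales the gap
# between the two workloads is vastly larger than in this small example.

def prover_find_factor(n: int) -> int:
    """Expensive step: trial division to find a nontrivial factor of n."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    raise ValueError("no nontrivial factor found: n appears to be prime")

def verifier_check(n: int, certificate: int) -> bool:
    """Cheap step: accept the claim 'n is composite' iff the certificate divides n."""
    return 1 < certificate < n and n % certificate == 0

n = 1_000_003 * 999_983                # composite by construction
certificate = prover_find_factor(n)    # roughly a million operations for the prover
print("verifier accepts:", verifier_check(n, certificate))  # one division for the verifier
```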
Problem summary: AIs may be better able than human actors to collude or to commit to actions. An AI may be able to share some of its code or chain of thought (Wei et al., 2022; Korbak et al., 2025), increasing its ability to send credible signals of commitment. In addition, the ways AIs collude (for example, the actions they coordinate on in the absence of communication) may differ from the ways we expect humans to collude, making collusion more difficult to detect.
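The sketch below illustrates the code-sharing point with the standard ‘program equilibrium’ construction from open-source game theory: in a one-shot prisoner's dilemma, a program that cooperates exactly when its opponent's source code matches its own turns code disclosure into a credible commitment. The payoff numbers and bot names are illustrative.

```python
# Minimal sketch of a "program equilibrium" in a one-shot prisoner's dilemma.
# Each player submits a program that can read the other program's source code;
# sharing code acts as a credible commitment device.
import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent's source code is identical to my own."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's code."""
    return "D"

# Standard prisoner's dilemma payoffs (row player, column player)
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(p1, p2):
    a1 = p1(inspect.getsource(p2))
    a2 = p2(inspect.getsource(p1))
    return PAYOFFS[(a1, a2)]

print("clique_bot vs clique_bot:", play(clique_bot, clique_bot))  # (3, 3): mutual cooperation
print("clique_bot vs defect_bot:", play(clique_bot, defect_bot))  # (1, 1): no exploitation
```

The same mechanism that enables cooperation here could just as easily support collusion against a monitoring scheme, which is part of why detecting such commitments matters for the proposals discussed below.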
Why this matters: Some alignment proposals rely on independent AI models in adversarial settings, for example, having AIs monitor one another (Goel et al., 2025).
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Critch et al., 2022 and Oesterheld et al., 2023. Previous research in open-source game theory includes Sauerberg and Oesterheld, 2025. Hammond et al., 2025 discuss collusion among advanced AI in Sections 2.3 and 4.1, including discussion of steganography and cryptic messages. Bhatt et al., 2025 provides an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned. Previous research on reducing harm from misaligned systems includes Mallen et al., 2025.