Find incentives and mechanisms to direct strategic AI agents to desirable equilibria.
We are excited about using framings and tools from theoretical economics to better describe the incentives and equilibrium outcomes relevant to AI alignment. By considering AI agents in a strategic setting, we hope to make alignment approaches robust to improvements in capabilities and to inspire new ideas for alignment techniques. We expect useful insights to come from proposing new or different abstractions: for example, how different forms of misaligned incentives affect outcomes, or how robust proposals are to different equilibrium concepts or modelling assumptions.
Hammond et al., 2025 provide a taxonomy of risks from advanced AI in multi-agent settings, including a broader list of problems and further context on some of the specific problems we highlight below.
Problem summary: Current frontier models already exhibit sycophantic and deceptive behaviour (OpenAI, 2025; Park et al., 2023). This is arguably evidence of strategic behaviour. As AI models become more strategic, how should we expect models to choose what information to reveal? We anticipate useful insights from areas within information design (Bergemann and Morris, 2019) such as cheap talk (Crawford and Sobel, 1982), verifiable information disclosure (Milgrom, 1981), persuasion (Kamenica and Gentzkow, 2011) or delegation (Holmstrom, 1984). What are optimal revelation strategies from the perspective of the AI, how varied and stable are equilibria, and can we design mechanisms to improve information disclosure?
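To make the information-design framing more concrete, here is a minimal sketch of the classic sender-receiver calculation from Kamenica and Gentzkow, 2011, recast as an AI (the sender) committing to a reporting policy for an overseer (the receiver) who only acts on reports it sufficiently trusts. The threshold, prior and framing are our own illustrative assumptions, not figures from the literature.

```python
# Minimal Bayesian-persuasion sketch (after Kamenica and Gentzkow, 2011), recast as an
# AI sender reporting to a human receiver. The receiver acts on the report only if the
# posterior probability of the "good" state is at least 0.5. The sender always prefers
# that the receiver acts, and (as the model assumes) can commit to its reporting policy.
# All numbers are illustrative assumptions.

prior_good = 0.3   # receiver's prior that the state is "good"
threshold = 0.5    # receiver acts iff posterior >= threshold

# No disclosure: posterior = prior = 0.3 < 0.5, so the receiver never acts.
p_act_no_info = 0.0

# Full disclosure: the receiver acts exactly when the state really is good.
p_act_full_info = prior_good

# Sender-optimal signal: report "good" whenever the state is good, and also report
# "good" with probability q in the bad state, where q is chosen so the posterior
# after a "good" report is exactly the threshold:
#   prior_good / (prior_good + (1 - prior_good) * q) = threshold
q = prior_good * (1 - threshold) / (threshold * (1 - prior_good))
p_act_persuasion = prior_good + (1 - prior_good) * q

print(f"q (prob. of reporting 'good' in the bad state): {q:.3f}")      # 0.429
print(f"P(receiver acts) with no disclosure:   {p_act_no_info:.2f}")    # 0.00
print(f"P(receiver acts) with full disclosure: {p_act_full_info:.2f}")  # 0.30
print(f"P(receiver acts) with optimal signal:  {p_act_persuasion:.2f}") # 0.60
```

Even this toy version shows why equilibrium analysis matters: a strategic sender that can commit to a partially garbled reporting policy doubles the rate at which the receiver acts relative to full disclosure, while every report remains Bayes-consistent from the receiver's perspective.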
Why this matters: Information revelation matters for several aspects of alignment. Some alignment proposals focus on building protocols to incentivise honesty (e.g. Irving et al., 2018). If we trust an AI to be honest or incapable of deceiving us, we may be able to reliably use it to supervise or design other models (e.g. Buhl et al., 2025). In addition, information design may help determine when we can safely use a misaligned AI even if it is trying to manipulate us (see Bhatt et al., 2025 for an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned).
Examples of subproblems:
Suggestions to learn more: For alignment by debate, we’d recommend reading Buhl et al., 2025 and Brown-Cohen et al., 2025. Previous research on debate includes Irving et al., 2018, Barnes, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024. Hammond et al., 2025 discuss a broad range of risks from multi-agent scenarios; see especially Section 3.1 on Information Asymmetries.
Problem summary: As AI models become more capable, they may gain capabilities we don’t anticipate. If we model alignment as a game between a human principal and an AI agent, how robust are proposed safety mechanisms to unknown actions in the AI’s action space? Can we use insights from robust mechanism design (Bergemann and Morris, 2005) to relax the modelling assumptions behind AI alignment proposals?
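As a toy illustration of the robustness concern (our own construction, not one from the cited literature), the sketch below tunes an audit probability so that compliance is the agent's best response within an assumed action set, and then shows that adding a single unanticipated, harder-to-detect action reverses that conclusion. The action names and payoffs are illustrative assumptions.

```python
# Toy illustration of how a safety mechanism tuned against an assumed action set can
# fail when the agent has an unanticipated action. The principal audits with
# probability p_audit and penalises detected misbehaviour. Numbers are illustrative.

def agent_payoff(action, p_audit):
    # (base reward, probability an audit detects misbehaviour) for each action
    actions = {
        "comply":        (1.0, 0.0),   # anticipated
        "defect":        (2.0, 0.9),   # anticipated: usually caught when audited
        "covert_defect": (2.0, 0.1),   # unanticipated: largely evades the monitor
    }
    reward, p_detect = actions[action]
    penalty = 5.0
    return reward - p_audit * p_detect * penalty

p_audit = 0.3  # chosen so that, among the *anticipated* actions, complying is best

anticipated = ["comply", "defect"]
full = ["comply", "defect", "covert_defect"]

best_anticipated = max(anticipated, key=lambda a: agent_payoff(a, p_audit))
best_full = max(full, key=lambda a: agent_payoff(a, p_audit))

print("Best response under the assumed action set:", best_anticipated)  # comply
print("Best response with the unanticipated action:", best_full)        # covert_defect
```

The question in the problem summary is whether guarantees can be made to hold over a whole family of possible action sets or detection technologies, rather than for a single enumerated one.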
Why this matters: We want alignment methods that are robust to increases in future capabilities, even if we can’t anticipate exactly what these are.
Examples of subproblems:
Suggestions to learn more: Hadfield-Menell and Hadfield, 2018 set out some framings of AI alignment as an incomplete contracting problem. We’d recommend reading Bhatt et al., 2025, as an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned. Previous research on reducing harm from misaligned systems includes Mallen et al., 2025.
Problem summary: We want to predict how powerful AI systems will behave in a range of scenarios. Our best guesses come from some combination of extrapolating from the behaviour of current, more limited models; calculating optimal strategies at Nash equilibria (treating agents as computationally unbounded); and calculating optimal strategies for computationally bounded agents (drawing on complexity theory and algorithmic game theory). Are there other approaches, taking inspiration from behavioural economics or elsewhere, that predict meaningfully different behaviour from powerful AIs?
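As one concrete example of how these approaches can diverge, the sketch below compares the Nash prediction with level-k (bounded-rationality) predictions in the standard ‘guess 2/3 of the average’ game from behavioural economics; the level-0 anchor of 50 is the conventional modelling assumption, and none of the numbers are claims about actual AI systems.

```python
# Level-k predictions for the "guess 2/3 of the average" game (players pick a number
# in [0, 100]; the winner is whoever is closest to 2/3 of the average guess).
# The unique Nash equilibrium is 0, but bounded (level-k) reasoners stop well above it.
# Level-0 guessing 50 is a conventional modelling assumption.

target_fraction = 2 / 3
guess = 50.0  # level-0 anchor
print(f"level-0 guess: {guess:.1f}")
for k in range(1, 6):
    guess *= target_fraction  # each level (approximately) best-responds to the level below
    print(f"level-{k} guess: {guess:.1f}")
print("Nash equilibrium guess: 0.0")
```

A similar gap appears between exact best responses and noisy (quantal) responses; which of these models, if any, best describes a powerful AI is exactly the open question above.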
Why this matters: We currently have a lot of uncertainty about how neural nets and deep learning work, and hence about how their behaviour will change as they continue to scale. Most interpretability work is empirical and therefore relies on extrapolating from the behaviour of past models to future ones. To understand whether proposed control and alignment techniques will work, we need a better understanding of how bounded rationality will affect the actions an AI might take.
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Buhl et al., 2025 and Brown-Cohen et al., 2025. Previous research on debate (a scalable oversight method) includes Irving et al., 2018, Barnes, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024. It may be particularly useful to consider cases where both the principal and agent are bounded, but the principal is significantly more so than the agent; for example, Goldreich, 2017 sets out a proof system where the prover is less constrained than the verifier.
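To illustrate the flavour of that prover-verifier asymmetry in a deliberately simple form (a compositeness certificate, not a reconstruction of Goldreich's proof systems), the sketch below has a powerful prover do an expensive search while a constrained verifier only performs a cheap check; the specific numbers are arbitrary.

```python
# Toy prover-verifier asymmetry (illustrative only; not a reconstruction of Goldreich, 2017).
# The prover does the expensive work (searching for a factor); the verifier checks the
# claim "n is composite" with a single cheap division. At realistic scales the gap
# between the two workloads is vastly larger than in this small example.

def prover_find_factor(n: int) -> int:
    """Expensive step: trial division to find a nontrivial factor of n."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    raise ValueError("no nontrivial factor found: n appears to be prime")

def verifier_check(n: int, certificate: int) -> bool:
    """Cheap step: accept the claim 'n is composite' iff the certificate divides n."""
    return 1 < certificate < n and n % certificate == 0

n = 1_000_003 * 999_983                # composite by construction
certificate = prover_find_factor(n)    # roughly a million operations for the prover
print("verifier accepts:", verifier_check(n, certificate))  # one division for the verifier
```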
Problem summary: AIs may be better able than human actors to collude or to commit to actions. An AI may be able to share some of its code or chain of thought (Wei et al., 2022; Korbak et al., 2025), increasing its ability to send credible signals of commitment. In addition, the ways AIs collude (for example, the actions they coordinate on in the absence of communication) may differ from the ways we expect humans to collude, making collusion more difficult to detect.
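The sketch below illustrates the code-sharing point with the standard ‘program equilibrium’ construction from open-source game theory: in a one-shot prisoner's dilemma, a program that cooperates exactly when its opponent's source code matches its own turns code disclosure into a credible commitment. The payoff numbers and bot names are illustrative.

```python
# Minimal sketch of a "program equilibrium" in a one-shot prisoner's dilemma.
# Each player submits a program that can read the other program's source code;
# sharing code acts as a credible commitment device.
import inspect

def clique_bot(opponent_source: str) -> str:
    """Cooperate iff the opponent's source code is identical to my own."""
    my_source = inspect.getsource(clique_bot)
    return "C" if opponent_source == my_source else "D"

def defect_bot(opponent_source: str) -> str:
    """Always defect, regardless of the opponent's code."""
    return "D"

# Standard prisoner's dilemma payoffs (row player, column player)
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(p1, p2):
    a1 = p1(inspect.getsource(p2))
    a2 = p2(inspect.getsource(p1))
    return PAYOFFS[(a1, a2)]

print("clique_bot vs clique_bot:", play(clique_bot, clique_bot))  # (3, 3): mutual cooperation
print("clique_bot vs defect_bot:", play(clique_bot, defect_bot))  # (1, 1): no exploitation
```

The same mechanism that enables cooperation here could just as easily support collusion against a monitoring scheme, which is part of why detecting such commitments matters for the proposals discussed below.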
Why this matters: Some alignment proposals rely on independent AI models in adversarial settings, for example, having AIs monitor one another (Goel et al., 2025).
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Critch et al., 2022 and Oesterheld et al., 2023. Previous research in open-source game theory includes Sauerberg and Oesterheld, 2025. Hammond et al., 2025 discuss collusion among advanced AI in Sections 2.3 and 4.1, including discussion of steganography and cryptic messages. Bhatt et al., 2025 provides an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned. Previous research on reducing harm from misaligned systems includes Mallen et al., 2025.