Computational Complexity Theory — Alignment Project by AISI

Achieving high-assurance alignment will require formal guarantees as well empirical observations. Just as the concept of entropy made information theory far more tractable, we suspect that there are intuitive concepts in AI security which—if formalised mathematically—could make previously intractable problems suddenly approachable. For example, many approaches to training safe AI rely on humans or weaker AI systems supervising more capable ones. Complexity theory helps us create and assess these approaches: under what conditions do they have desirable equilibria, when are they provably impossible, and how can we bound their error?

Complexity-Theoretic Approaches to Improving Debate Protocols

Problem summary: AI safety via debate (Irving et al., 2018) is a potential method for training safely advanced AI to understand complex human goals. Agents are trained via self play on a zero sum debate game where a human judges which of the agents gave the most true, useful information. There are alternative debate protocols, such as cross-examination (Barnes and Christiano, 2020), doubly-efficient debate (Brown-Cohen et al., 2023) and prover-estimator debate (Brown-Cohen et al., 2025). All of these protocols are insufficient in at least some way (Buhl et al., 2025); can we make improvements?

Why this matters: AI safety via debate is a proposed solution to the scalable oversight problem: correctly rewarding intended behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A robust scalable oversight method is needed to safely train models with highly advanced capabilities, and debate is among the most promising candidates.

Examples of subproblems:

In prover-estimator debate (Brown-Cohen et al., 2025) the estimator's circuit complexity has a variety of factors based on depth, branching factor, stability, and desired error bound. Can we shave one of these factors off or show that the bound is tight? This corresponds to many possible subproblems, as the different factors appear for different reasons.
The original form of AI safety via debate (Irving et al., 2018) used competing players who could provide counterarguments to their opponents' statements. This notion is discarded for prover-estimator debate. Can an improved protocol be developed that re-introduces counterarguments—and, as an extension, cross-examination—while maintaining advantages from prover estimator debate, like avoiding obfuscated arguments? Can results or techniques from multi-prover interactive proofs be applied to this setting?
The completeness proof for prover-estimator debate (Theorem 6.1) requires the prover to be able to estimate probabilities accurately in order to know where to recurse. Can we reduce this accuracy requirement?
The existence proof for the estimator in prover-estimator debate (Theorem 6.1) depends on an exponential-time algorithm. We think this is likely necessary; prove the claim, or find a counterexample. Additionally, find natural assumptions for the proofs which might admit a polynomial-time version of the algorithm. Is (ε, ρ)-stability the right notion and is there a better definition of stability? The definition used in the proofs for prover-estimator debate is autoregressive as a consequence of how we get convexity for online gradient descent, which then results in a factor of 2^q in the prover’s complexity.
Most computations in practice will involve at least one non-stable point. For example, we might elaborate the problem before presenting multiple bits of evidence, and getting the elaboration wrong could doom the whole computation. In the prover-estimator debate theorems we model these non-stable portions of the computation as known to everyone, including the estimator, but this is unrealistic for natural language debates. However, we expect that the cost of a mistake is correlated with how strategic it is: accidental mistakes are better than intentional mistakes. Formalise the intuition that avoiding intentional mistakes requires weaker stability properties of the computation.
Prover-estimator debate uses a notion of ‘multiple independent pieces of evidence’ to stabilise the debate. In various theoretical settings there are threshold theorems showing how to turn unstable computations into stable ones, e.g. error correction in computation. Can we develop related theorems for the natural language context?
Prover-estimator debate relies on the estimator having a certain factor more compute than the prover. Can we develop protocols that provably reach honesty or correctness in equilibrium, despite further limiting the compute available to the players?
Exploration hacking arises when a player in a debate refuses to give its best answer to prevent these actions from being sampled and then reinforced by the training process. Find an alternative protocol which prevents this, and prove its correctness. For example, a protocol might include allowing debaters to propose extensions to correct but not maximal solutions.
Systematic human error represents a major problem for the correctness of debate outputs. Is it possible to produce prover-estimator debate protocols where the estimator is able to prevent queries to the oracle which exploit systematic human bias (Irving, 2025)?
Are there analogues to the prover-estimator debate results in more natively Bayesian versions of scalable oversight such as Bengio et al., 2025? Even basic Bayesian inference over polynomial-sized graphs requires exponential time in general. Therefore, it is likely that formal results will require weaker notions of success, for example by requiring agents to achieve outcome indistinguishability of their outputs from the true posterior.

Suggestions to learn more: We’d recommend reading Buhl et al., 2025 for an explanation of how debate could lead to a compelling argument that an AI system is safe, and Brown-Cohen et al., 2025 for a recent debate protocol. Previous research in this area includes Irving et al., 2018, Barnes and Christiano, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024. Leike et al., 2018 introduce the broad idea of recursive reward modelling beyond the topic of debate.

Independence of Evidence

Problem summary: When we use AI systems to solve complex problems, they will need to propose and argue for different claims about the validity or safety of their solutions. To verify these claims, we'll need evidence—often provided by the AIs themselves. This evidence is typically imprecise or probabilistic, requiring us to combine multiple sources to reach confidence about a claim. However, it is difficult to assess whether multiple pieces of evidence are truly independent (providing robust redundancy) or actually correlated or stemming from the same underlying source. This creates a critical failure mode: we might become overconfident in a claim because an AI provides many pieces of evidence that appear independent but are actually correlated. While this is a general problem in reasoning, it's especially crucial for the soundness of scalable oversight methods.

Why this matters: Determining independence is necessary for correctly calibrating our confidence on AI-proposed arguments or solutions. For example, in the theoretical study of scalable oversight methods, we sometimes need to assume the existence of uncorrelated pieces of evidence to prove the method sound, e.g. assuming s-supportedness in prover-estimator debate. We also expect this issue to arise empirically, and it might be especially harmful to human oversight on AI, given the many systematic biases that human judges are known to display. These biases are highly correlated across time and across different judges, rather than being independently distributed errors.

Examples of subproblems:

How can we determine when different pieces of evidence are independent in practice? This is probably quite context dependent, thus we recommend a focus on the areas where AI might be most used, and epistemic accuracy is required, such as engineering, mathematics or scientific research. The correct notion might also depend on whether evidence is more empirical (consulting large experimental databases) or conceptual (arguing in the abstract about philosophy or mathematics).
How should we model independence in the theoretical study of scalable oversight methods? In particular, what formulation of independence is required to prove the soundness of a method? For example, are there different notions of independence than the one currently used in Brown-Cohen et al., 2025 that could still ensure (ε, ρ)-stability?
For which topics should we expect these independence notions to hold, like s-supportedness in Brown-Cohen et al., 2025? How can we empirically test whether this is the case, for a topic of interest?
Can we recursively use our scalable oversight methods to resolve questions about independence? Special methods might exist that are especially efficient at resolving questions of this form, even if less efficient for other ones.
How does the independence of evidence problem change in more natively Bayesian versions of scalable oversight, such as Bengio et al., 2025? Can we rely on weaker notions of independence, or otherwise reduce the burden?

Suggestions to learn more: We’d recommend reading Brown-Cohen et al., 2025, and Bengio et al., 2025, which lifts scalable oversight to the Bayesian setting.

‍

Prover-Verifier Games

Problem summary: One framing of the scalable oversight problem is whether a computationally-bounded trusted verifier can learn to solve tasks through interacting with a more powerful but untrusted prover (Brown-Cohen et al., 2025).

Why this matters: As AI capabilities improve, we will need the ability to oversee the output of models which we will be unable to directly judge. Using AI models to oversee other AI models is a key tool to provide training signals toward desirable traits, such as honesty, in this setting.

Examples of subproblems:

Are there good protocols for training a prover?
What theoretical guarantees can be achieved for the efficacy of prover-verifier game settings? On which problems can you get provable guarantees? How well do these problems correspond to real world deployment scenarios?
Can you find problems where a prover-verifier system provably can’t guarantee safety which correspond to problems we want solved in deployment scenarios?
Can the core scheme of prover-verifier games be improved to make the procedure more powerful in theory or more practically achievable?

Suggestions to learn more: We’d recommend reading Hammond and Adam-Day, 2025. Previous research in this area includes Kirchner et al., 2024 and Amit et al., 2024.

Small Circuits on Top of Heuristic Functions

Problem summary: Future interpretability techniques may afford us a high-level circuit understanding of a model but without complete understanding of subprocesses. We could model this situation as a small circuit calling heuristic functions. Such a framing has been a successful theoretical model in several areas of complexity theory, including differential privacy and statistical fairness.

Why this matters: A number of existing AI safety tools might be modelable by small circuits and heuristic functions, if we can understand these tools better we might be able to classify when they will or will not work. For example, we could understand when we can trust our interpretability tools or what interventions might improve interpretability.

Examples of subproblems:

What settings in AI safety can be modelled as small top-level circuits with lower-level functions?
Most encryption techniques require large circuits, would existing results on the hardness of detecting backdoors change if we assume this small circuits model. What does this say about interpretability for eliciting bad contexts?
Some work has been done on the self-calibration of models (Kadavath et al., 2022, Steyvers et al., 2025, Kapoor et al., 2024). We would like to determine whether sufficient information is contained within AI systems to construct contexts in which they might behave maliciously: the problem of 'eliciting bad contexts' (Irving et al., 2025). In theory, a solution to eliciting bad contexts might be useful for generating datasets that mimic deployment sufficiently effectively that we can minimise distribution shift from training to deployment. This means that training methods to produce aligned systems are more likely to generalise to deployment.
- Is there a ‘small circuit on top of heuristic functions’ model setting in which eliciting bad contexts is possible, in a nontrivial way?
- Can we use ‘susceptibilities’ in singular learning theory (Baker et al., 2025) to shed light on what kind of change in training data would be necessary to elicit a target change in behaviour?

Suggestions to learn more: For an example of where small circuits have already been useful, see Brown-Cohen et al., 2025 for an alignment use case, for other use cases, see Dwork et al., 2022a, 2022b. A potential application is in Irving et al., 2025.

Provably Correct vs Provably Honest

Problem summary: The theorems for prover-estimator debate (Brown-Cohen et al., 2025) show that, in equilibrium, most of the time, the prover's claims are correct (Theorem 6.1). Intuitively, honesty is necessary but not sufficient for correctness. Formalising this notion could allow us to weaken the stability and resource requirements for prover-estimator debate. There are various properties of honesty we could consider.

Why this matters: In practice, we won't know in advance whether a given problem can be solved correctly through debate. For problems where no perfect answer exists, current theorems provide no guarantees—the debate could output anything. If we could formalize ‘honesty’ as a weaker property than correctness, we could ensure that debate returns the model's genuine best guess even when certainty is impossible, while still maintaining safety guarantees.

Examples of subproblems:

What formal definitions of honesty might be useful? Our initial guess, using notation from Brown-Cohen et al., 2025, would be “All small 𝒜-circuits C(x) are at most O(ε) better than chance at distinguishing correct M(x) runs from incorrect runs”.
How do these allow us to weaken stability and resource requirements? Is there a definition of honesty which gives more efficient prover-estimator debate?

Suggestions to learn more: We’d recommend reading Brown-Cohen et al., 2025. Previous research in this area includes Amit et al., 2024, Neyman, 2024.

Complexity-Theoretic Approaches to Unexploitable Search

Problem summary: Over time, if a reward training signal admits multiple solutions, these free choices risk being exploited over time (the ‘exploitable search problem’; Pfau and Irving, 2025). Unexploitable search methods should force a model to randomize performance over all adversarial side-objectives the model can access—in a sense to be formalized by the method.

Why this matters: We would like to be able to use AI on under-specified tasks, like research advice, without a uniquely verifiable correct solution. In this setting, we could be vulnerable to an AI system adversarially selecting solutions that are plausible but nevertheless cause harm, such as opening up surface area for future exploits, or leaving steganographic information for later misbehaviour.

Examples of subproblems:

How can we restrict solution space in the relevant direction? For example, hashing—requiring a natural language solution to have a certain hash value—provides an obvious method to reduce the space of free parameters, but naïve application of hashing the text of a given solution would only require the model to iterate on wording until its desired solution is found. It is important to isolate the key elements of the space to restrict.
Alternatively, can you prove a system is making no ‘use’ of the free parameters, i.e. that it is choosing an arbitrary option from the equally good options?
If it is possible to restrict free parameters to prevent exploitable search, how can we create restrictions resistant to sandbagging—models intentionally underperforming to lead us to underrestrict the space?

Suggestions to learn more: We’d recommend reading Pfau and Irving (2025).

Complexity-Theoretic Approaches to Backdoors

Problem summary: Consider an AI system modified to behave differently on a particular ‘trigger’ input, known as a backdoor. It might be impossible to detect that such a backdoor exists before the system is exposed to the trigger. Are there ways to detect or mitigate backdoors?

Why this matters: AI systems may behave as desired during training and evaluation, only to act in harmful ways when exposed to specific conditions during deployment. These conditions may be intentionally set to only trigger once the system is deployed. Assessing whether a model has this property is key to preventing the deployment of harmful systems.

Examples of subproblems:

Established subproblems:

Can we improve upon formal notions of defendability against backdoors, e.g. Goldwasser et al., 2025)?
When is defence against backdoors possible?
What other methods could work against backdoors?
A central theoretical challenge for work defending against backdoors in AI systems is whether there exists a natural representation class that is efficiently defendable but not efficiently probably approximately correct (PAC) learnable. Identify such a representation class. See Christiano et al., 2024 for more details.
Can backdoors be used as a security measure—for example, by preventing the model from considering certain harmful plans?

Suggestions to learn more: We’d recommend reading Christiano et al., 2024. Previous research in this area also includes Goldwasser et al., 2022, 2025 and Gluch and Goldwasser, 2025.

Heuristic Arguments

Problem summary: Is there a way of formalising the concept of a ‘heuristic argument’ – informal arguments about the behaviour of a mathematical system – to make them machine-checkable? For example, if we assume some facts are uncorrelated or independent can we derive useful conclusions for training safe AI systems?

Why this matters: Many of the arguments we could make about AI systems, for example about their internal structures or the probability of rare failures, will be informal or probabilistic heuristics, rather than watertight proofs. We would like to be able to use heuristic arguments that are machine-checkable but not necessarily human-understandable to make such careful, but not certain, claims about AI systems.

Examples of subproblems:

Prove the computational no-coincidence conjecture (Neyman, 2025) – a formalisation of the idea that 'if an apparently outrageous coincidence happens in mathematics, then there is a reason for it.'
Methods for rare probability estimation, either based on sampling or not? ARC has some intuition that non-sampling methods may scale further in important ways. We can both about practical methods and about impractical but theoretically interesting algorithms (e.g., even a galactic algorithm would be great).

Suggestions to learn more: We’d recommend reading Hilton (2024) for a more indepth look at the problem and a summary of some results, and Christiano et al., 2022 and Neyman (2025) for results in this area.

Moonshots Defining New Paradigms for Understanding the Difficulty of Alignment

Problem summary: Are there alternative approaches to building and deploying advanced AI systems that credibly mitigate the risks from misaligned systems? How hard might these be to achieve given concerted effort?

Why this matters: We’re very uncertain about whether existing approaches will be successful, and whether they will scale to even more powerful AI systems.

Suggestions to learn more: We’d recommend reading Hilton (2024), Appel and Kosoy (2025) and Nayebi (2025).

‍

All Research Priorities

Information Theory and Cryptography

Computational Complexity Theory

Economic Theory and Game Theory

Probabilistic Methods

Learning Theory

Evaluation and Guarantees in Reinforcement Learning

Cognitive Science

Interpretability

Benchmark Design and Evaluation

Methods for Post-training and Elicitation

Empirical Investigations Into AI Monitoring and Red Teaming