Bayesian and rare-event techniques for tail-risk estimation, Scientist AI, and formal reasoning under uncertainty.
Many alignment challenges require reasoning about uncertainty: What do AI systems actually believe? How can we detect risks too rare to observe directly? Can we make informal reasoning more verifiable? Probabilistic methods provide mathematical frameworks to address these questions. The problems in this section span three approaches: estimating the likelihood of rare but dangerous behaviours through low-probability estimation, building AI systems that explicitly track and communicate their uncertainty using Bayesian frameworks like GFlowNets, and developing logical systems that combine formal verification with probabilistic reasoning.
Problem summary: Wu and Hilton, 2025 introduce the problem of low probability estimation for alignment: “Given a machine learning model and a formally specified input distribution, how can we estimate the probability of a binary property of the model’s output, even when that probability is too small to estimate by random sampling?”
Why this matters: We would like methods to detect and estimate the probability of tail risks of worst-case behaviour from AI systems. An AI system intent on subverting human supervision may act harmfully only in the very rare cases where that supervision will predictably fail to catch it. Such misbehaviour is a low-probability event that cannot be measured directly through standard, small-sample benchmarking.
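To make the sampling difficulty concrete, here is a minimal, self-contained sketch (not the method of Wu and Hilton, 2025): a hypothetical scalar "model" with a Gaussian input distribution, where the binary property is a threshold exceedance rare enough that naive Monte Carlo almost always returns zero, while a simple importance-sampling estimator with a hand-picked shifted proposal recovers the right order of magnitude.

```python
# Toy sketch: estimating P[f(x) > t] for x ~ N(0, 1) when the event is rare.
# The "model" f, the threshold t, and the shifted proposal are illustrative
# stand-ins, not the approach of Wu and Hilton (2025).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def model_output(x):
    # Stand-in for a trained model's scalar output; the identity keeps the
    # true tail probability analytically checkable.
    return x

t = 5.0                      # binary property: model_output(x) > t
n = 100_000

# Naive Monte Carlo: almost always returns exactly 0 for events this rare.
x = rng.standard_normal(n)
naive_est = np.mean(model_output(x) > t)

# Importance sampling: draw from a proposal shifted towards the tail,
# reweight each sample by the density ratio p(x) / q(x).
proposal_mean = t
x_q = rng.normal(proposal_mean, 1.0, size=n)
weights = stats.norm.pdf(x_q) / stats.norm.pdf(x_q, loc=proposal_mean)
is_est = np.mean((model_output(x_q) > t) * weights)

print(f"naive MC:            {naive_est:.2e}")
print(f"importance sampling: {is_est:.2e}")
print(f"true tail prob:      {stats.norm.sf(t):.2e}")   # ~2.9e-07
```

Real low probability estimation must work with high-dimensional inputs and models whose densities and tail behaviour are not analytically tractable, which is what makes the problem open.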
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Wu and Hilton, 2025. Earlier research in this area includes Webb et al., 2018, who study low probability estimation in the computer vision setting.
Problem summary: Bengio et al., 2025 describe a potential path to safe superintelligence via explicit tracking of uncertainty and a strategically limited scope of deployment for advanced AI. A Bayesian Scientist AI would model uncertainty over natural-language latent variables, making it possible to query the AI about its beliefs rather than just about what a human would say. A central piece of this agenda is to replace mode-seeking reinforcement learning with reward-proportional sampling, as is done in GFlowNets; this prevents mode collapse (Deleu et al., 2023).
Why this matters: For under-specified or mis-specified tasks, an advanced agentic AI will be incentivised to exploit errors in human supervision, leading to misaligned AI. The Scientist AI agenda initiated by Bengio approaches this problem by proposing a new latent-variable modelling framework, with a focus on the trustworthiness of the natural-language statements naming those latents.
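As a toy, hypothetical illustration of the difference between the two training targets (it mimics the distributions a converged policy would sample from; it is not a GFlowNet implementation), the sketch below compares mode-seeking selection with reward-proportional sampling over a small discrete hypothesis space with made-up rewards:

```python
# Toy sketch: mode-seeking selection vs. reward-proportional sampling over a
# small discrete space. GFlowNets are trained so that, at convergence, samples
# are drawn with probability proportional to reward; this sketch only mimics
# that target distribution.
import numpy as np

rng = np.random.default_rng(0)

objects = ["hypothesis_A", "hypothesis_B", "hypothesis_C", "hypothesis_D"]
rewards = np.array([10.0, 9.0, 1.0, 0.5])   # illustrative reward values

# Mode-seeking (greedy, argmax-style): always returns the single best mode.
mode_seeking_samples = [objects[np.argmax(rewards)] for _ in range(10_000)]

# Reward-proportional sampling: P(x) = R(x) / sum over x' of R(x').
p = rewards / rewards.sum()
proportional_samples = rng.choice(objects, size=10_000, p=p)

for name, samples in [("mode-seeking", mode_seeking_samples),
                      ("reward-proportional", proportional_samples)]:
    freqs = {o: float(np.mean(np.array(samples) == o)) for o in objects}
    print(name, {k: round(v, 3) for k, v in freqs.items()})
```

In the mode-seeking case the near-tied second hypothesis is never sampled at all; preserving such alternatives in proportion to their reward is exactly the behaviour that reward-proportional sampling is meant to provide.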
Examples of subproblems:
Established subproblems:
Suggestions to learn more: We’d recommend reading Hu et al., 2023 for an application, and Bengio et al., 2025 for an overview of the direction.
Problem summary: State-of-the-art models reason in natural language, but it might be possible to develop models that use formal language to track and update their uncertainty over hypotheses. For example, starting from natural language, we might enhance chain-of-thought reasoning with numbered claims or typed arguments, formalise key terms within these claims through definitions or symbolic representations, or link argument chains into directed acyclic graphs (DAGs).
Why this matters: Given their inherent ambiguity, statements in natural language are prone to inconsistency and can be deliberately used to obscure information or mislead audiences. In contrast, arguments in formal languages lend themselves more readily to oversight and scrutiny, whether by humans, other overseers, or formal verifiers. Embedding formal or semi-formal structure into AI reasoning or chain-of-thought can make claims more transparent and easier to evaluate, even before full logical interpretation is possible. This would in turn bolster other efforts, such as scalable oversight, adversarial testing through debate, or systems like Scientist AI. At the same time, improving our ability to move soundly between natural and formal languages could expand the applicability of approaches that currently rely on full formalisation, such as Safeguarded AI, a paradigm that integrates AI, mathematical modelling, and formal methods into a scalable workflow for producing AI systems with quantitative safety guarantees in their context of use.
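To give a flavour of what such semi-formal structure might look like in practice, here is a rough sketch in which numbered, typed claims are linked into a DAG; the claim types, fields, and checks are illustrative assumptions, not a proposed standard from the works cited below:

```python
# Rough sketch: numbered, typed claims linked into a DAG. The claim types and
# checks are illustrative, not a proposed formal standard.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+

@dataclass
class Claim:
    cid: int
    kind: str                      # e.g. "definition", "premise", "derived"
    text: str
    supported_by: list[int] = field(default_factory=list)

claims = [
    Claim(1, "definition", "Let 'rare event' mean P(event) < 1e-6."),
    Claim(2, "premise", "Behaviour X was never observed in 1e5 evaluation samples."),
    Claim(3, "derived", "Observation alone cannot rule out X being a rare event.",
          supported_by=[1, 2]),
]

by_id = {c.cid: c for c in claims}

# Structural checks an overseer (human or automated) could run:
# 1. every cited support exists;
# 2. derived claims cite at least one support;
# 3. the support relation is acyclic (TopologicalSorter raises CycleError otherwise).
for c in claims:
    assert all(s in by_id for s in c.supported_by), f"claim {c.cid}: missing support"
    assert c.kind != "derived" or c.supported_by, f"claim {c.cid}: unsupported derivation"

order = list(TopologicalSorter({c.cid: set(c.supported_by) for c in claims}).static_order())
print("evaluation order:", order)   # supports come before the claims they back
```

Even these shallow structural checks (every derived claim cites support, and the support relation is acyclic) are the kind of property an overseer or verifier could enforce before any deeper logical interpretation of the claims themselves.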
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Dalrymple, 2024. Other related work includes Kim et al., 2025, Ethan, 2024, Wang et al., 2025, Liell-Cock and Staton, 2024 and Pruiksma et al., 2018.