Bayesian and rare-event techniques for tail-risk estimation, Scientist AI, and formal reasoning under uncertainty.
Many alignment challenges require reasoning about uncertainty: What do AI systems actually believe? How can we detect risks too rare to observe directly? Can we make informal reasoning more verifiable? Probabilistic methods provide mathematical frameworks to address these questions. The problems in this section span three approaches: estimating the likelihood of rare but dangerous behaviours through low-probability estimation, building AI systems that explicitly track and communicate their uncertainty using Bayesian frameworks like GFlowNets, and developing logical systems that combine formal verification with probabilistic reasoning.
Problem summary: Wu and Hilton, 2025 introduce the problem of low probability estimation for alignment: “Given a machine learning model and a formally specified input distribution, how can we estimate the probability of a binary property of the model’s output, even when that probability is too small to estimate by random sampling?”
Why this matters: We would like methods to detect and estimate the probability of tail risks of worst-case behaviour from AI systems. An AI system intent on subverting human supervision may act harmfully only in the very rare cases where that supervision will predictably fail to catch it. Such misbehaviour is a low-probability event that cannot be measured directly through standard, small-sample benchmarking.
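To make the sampling difficulty concrete, here is a minimal, self-contained sketch (not the method of Wu and Hilton, 2025): a hypothetical scalar "model" with a Gaussian input distribution, where the binary property is a threshold exceedance rare enough that naive Monte Carlo almost always returns zero, while a simple importance-sampling estimator with a hand-picked shifted proposal recovers the right order of magnitude.

```python
# Toy sketch: estimating P[f(x) > t] for x ~ N(0, 1) when the event is rare.
# The "model" f, the threshold t, and the shifted proposal are illustrative
# stand-ins, not the approach of Wu and Hilton (2025).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def model_output(x):
    # Stand-in for a trained model's scalar output; the identity keeps the
    # true tail probability analytically checkable.
    return x

t = 5.0                      # binary property: model_output(x) > t
n = 100_000

# Naive Monte Carlo: almost always returns exactly 0 for events this rare.
x = rng.standard_normal(n)
naive_est = np.mean(model_output(x) > t)

# Importance sampling: draw from a proposal shifted towards the tail,
# reweight each sample by the density ratio p(x) / q(x).
proposal_mean = t
x_q = rng.normal(proposal_mean, 1.0, size=n)
weights = stats.norm.pdf(x_q) / stats.norm.pdf(x_q, loc=proposal_mean)
is_est = np.mean((model_output(x_q) > t) * weights)

print(f"naive MC:            {naive_est:.2e}")
print(f"importance sampling: {is_est:.2e}")
print(f"true tail prob:      {stats.norm.sf(t):.2e}")   # ~2.9e-07
```

Real low probability estimation must work with high-dimensional inputs and models whose densities and tail behaviour are not analytically tractable, which is what makes the problem open.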
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Wu and Hilton, 2025. Earlier research in this area includes Webb et al., 2018, who study low probability estimation in the computer vision setting.
Problem summary: Bengio et al., 2025 describe a potential path to safe superintelligence via explicit tracking of uncertainty and a strategically limited scope of deployment for advanced AI. A Bayesian Scientist AI would model uncertainty over natural-language latent variables, making it possible to query the AI about its beliefs rather than just about what a human would say. A central piece of this agenda is to replace mode-seeking reinforcement learning with reward-proportional sampling, as is done in GFlowNets; this prevents mode collapse (Deleu et al., 2023).
Why this matters: For under-specified or mis-specified tasks, an advanced agentic AI will be incentivised to exploit errors in human supervision, leading to misaligned AI. The Scientist AI agenda initiated by Bengio approaches this problem by proposing a new latent-variable modelling framework, with a focus on the trustworthiness of the natural-language statements naming those latents.
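As a toy, hypothetical illustration of the difference between the two training targets (it mimics the distributions a converged policy would sample from; it is not a GFlowNet implementation), the sketch below compares mode-seeking selection with reward-proportional sampling over a small discrete hypothesis space with made-up rewards:

```python
# Toy sketch: mode-seeking selection vs. reward-proportional sampling over a
# small discrete space. GFlowNets are trained so that, at convergence, samples
# are drawn with probability proportional to reward; this sketch only mimics
# that target distribution.
import numpy as np

rng = np.random.default_rng(0)

objects = ["hypothesis_A", "hypothesis_B", "hypothesis_C", "hypothesis_D"]
rewards = np.array([10.0, 9.0, 1.0, 0.5])   # illustrative reward values

# Mode-seeking (greedy, argmax-style): always returns the single best mode.
mode_seeking_samples = [objects[np.argmax(rewards)] for _ in range(10_000)]

# Reward-proportional sampling: P(x) = R(x) / sum over x' of R(x').
p = rewards / rewards.sum()
proportional_samples = rng.choice(objects, size=10_000, p=p)

for name, samples in [("mode-seeking", mode_seeking_samples),
                      ("reward-proportional", proportional_samples)]:
    freqs = {o: float(np.mean(np.array(samples) == o)) for o in objects}
    print(name, {k: round(v, 3) for k, v in freqs.items()})
```

In the mode-seeking case the near-tied second hypothesis is never sampled at all; preserving such alternatives in proportion to their reward is exactly the behaviour that reward-proportional sampling is meant to provide.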
Examples of subproblems:
Established subproblems:
Suggestions to learn more: We’d recommend reading Hu et al., 2023 for an application, and Bengio et al., 2025 for an overview of the direction.
Problem summary: State-of-the-art models reason in natural language, but it might be possible to develop models that use formal language to track and update their uncertainty over hypotheses. For example, starting from natural language, we might enhance chain-of-thought reasoning with numbered claims or typed arguments, formalise key terms within these claims through definitions or symbolic representations, or link argument chains into directed acyclic graphs (DAGs).
Why this matters: Given their inherent ambiguity, statements in natural language are prone to inconsistency and can be deliberately used to obscure information or mislead audiences. In contrast, arguments in formal languages lend themselves more readily to oversight and scrutiny, whether by humans, other overseers, or formal verifiers. Embedding formal or semi-formal structure into AI reasoning or chain-of-thought can make claims more transparent and easier to evaluate, even before full logical interpretation is possible. This would in turn bolster other efforts, such as scalable oversight, adversarial testing through debate, or systems like Scientist AI. At the same time, improving our ability to move soundly between natural and formal languages could expand the applicability of approaches that currently rely on full formalisation, such as Safeguarded AI, a paradigm that integrates AI, mathematical modelling, and formal methods into a scalable workflow for producing AI systems with quantitative safety guarantees in their context of use.
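To give a flavour of what such semi-formal structure might look like in practice, here is a rough sketch in which numbered, typed claims are linked into a DAG; the claim types, fields, and checks are illustrative assumptions, not a proposed standard from the works cited below:

```python
# Rough sketch: numbered, typed claims linked into a DAG. The claim types and
# checks are illustrative, not a proposed formal standard.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+

@dataclass
class Claim:
    cid: int
    kind: str                      # e.g. "definition", "premise", "derived"
    text: str
    supported_by: list[int] = field(default_factory=list)

claims = [
    Claim(1, "definition", "Let 'rare event' mean P(event) < 1e-6."),
    Claim(2, "premise", "Behaviour X was never observed in 1e5 evaluation samples."),
    Claim(3, "derived", "Observation alone cannot rule out X being a rare event.",
          supported_by=[1, 2]),
]

by_id = {c.cid: c for c in claims}

# Structural checks an overseer (human or automated) could run:
# 1. every cited support exists;
# 2. derived claims cite at least one support;
# 3. the support relation is acyclic (TopologicalSorter raises CycleError otherwise).
for c in claims:
    assert all(s in by_id for s in c.supported_by), f"claim {c.cid}: missing support"
    assert c.kind != "derived" or c.supported_by, f"claim {c.cid}: unsupported derivation"

order = list(TopologicalSorter({c.cid: set(c.supported_by) for c in claims}).static_order())
print("evaluation order:", order)   # supports come before the claims they back
```

Even these shallow structural checks (every derived claim cites support, and the support relation is acyclic) are the kind of property an overseer or verifier could enforce before any deeper logical interpretation of the claims themselves.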
Examples of subproblems:
Suggestions to learn more: We’d recommend reading Dalrymple, 2024. Other related work includes Kim et al., 2025, Ethan, 2024, Wang et al., 2025, Liell-Cock and Staton, 2024 and Pruiksma et al., 2018.