Translate alignment's conceptual challenges into concrete, measurable tasks.
Benchmarks translate alignment's abstract conceptual challenges into measurable tasks that researchers can systematically study and use to evaluate solutions. For alignment, this means creating tests that capture the difficulties of ensuring AI systems behave as intended—from evaluating behaviours we can't efficiently oversee, to testing against models that might actively try to deceive us. This section develops evaluation infrastructure in two areas:
Problem summary: As highly capable AI systems are deployed on high-stakes tasks humans will be unable to evaluate them, and new methods and evaluations are required to surface emergent safety failures. We are particularly concerned with alignment of AI systems which are used by humans to execute AI research tasks. We need benchmarks which quantify performance on the broad scope of tasks models will be used for, including tasks without uniquely verifiable solutions, like research advice. We also need benchmarks and datasets quantifying known issues.
Why this matters: Scalable oversight is the problem of correctly rewarding intended behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A robust scalable oversight method is needed to safely train models with highly advanced capabilities, and debate is among the most promising candidates. More generally, we believe AI systems themselves will likely be used heavily in AI research, but that this may not happen safely by default. For iterating on scalable oversight methods, benchmarks and datasets make it easier to identify issues with existing methods, iterate and improve.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: Bowman 2022 provides background on benchmark design for scalable oversight. Previous research in this area includes Tan et al., 2024, Lang et al., 2024, and Farquhar et al., 2025).
Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over misaligned AI models which are deliberately aiming to subvert human control. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).
Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.
Key subproblems:
Building high-stakes control settings. Settings are empirical research environments, for example containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage and the effectiveness of control protocols in preventing it. They contain main tasks, like legitimate software engineering tasks for a coding agent, as well as side tasks that a misaligned model might attempt to conduct, like reducing security restrictions that protect model weights. See ControlArena for example settings. A diverse range of settings meeting the following desiderata would substantially advance control research (Shlegeris, 2025):
A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows studying of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt, 2025 for more context. It’s important to run control evaluations in these realistic settings to validate their usefulness for control research.
High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al., 2023 and Bhatt et al., 2025. Valuable projects may include:
Understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:
We would like to advance our understanding of this threat model, and assess possible countermeasures.Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts—dataset, logs and model behaviour on normal prompts—look benign.
There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:
Problem summary: There are a diverse range of research areas, for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.
Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.
Key subproblems:
Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (Shlegeris, 2024) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).
Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible however that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.
Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment.
Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).
Capability evaluations of latent reasoning.
Studying chain of thought (CoT). Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures: