Benchmark Design and Evaluation — Alignment Project by AISI

Benchmarks translate alignment's abstract conceptual challenges into measurable tasks that researchers can systematically study and use to evaluate solutions. For alignment, this means creating tests that capture the difficulties of ensuring AI systems behave as intended—from evaluating behaviours we can't efficiently oversee, to testing against models that might actively try to deceive us. This section develops evaluation infrastructure in two areas:

Scalable oversight: assessing AI performance on tasks where human evaluation is limited by expertise, cost, or time.
Control: testing whether safety measures work when models might actively try to circumvent them.

Reward Model Benchmarking for Scalable Oversight

Problem summary: As highly capable AI systems are deployed on high-stakes tasks humans will be unable to evaluate them, and new methods and evaluations are required to surface emergent safety failures. We are particularly concerned with alignment of AI systems which are used by humans to execute AI research tasks. We need benchmarks which quantify performance on the broad scope of tasks models will be used for, including tasks without uniquely verifiable solutions, like research advice. We also need benchmarks and datasets quantifying known issues.

Why this matters: Scalable oversight is the problem of correctly rewarding intended behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A robust scalable oversight method is needed to safely train models with highly advanced capabilities, and debate is among the most promising candidates. More generally, we believe AI systems themselves will likely be used heavily in AI research, but that this may not happen safely by default. For iterating on scalable oversight methods, benchmarks and datasets make it easier to identify issues with existing methods, iterate and improve.

Examples of subproblems:

Established subproblems:

The 'obfuscated arguments' problem (Barnes 2020) is a strategy by which a dishonest AI system could prevent an honest AI system from efficiently pointing out its mistakes. Develop a dataset of debate questions on which a dishonest system could successfully use this strategy.
To model the problem of scalable oversight and human limitations, partial observability can be used as a proxy in environment design. Develop benchmarks for oversight and reinforcement learning where the reward model or judge does not see all actions taken by the agent. See Michael et al., 2023 for an example of benchmarking a scalable oversight method in a partially observable environment.
As models become more competent in the future, it will become less clear whether they can be trusted to not be scheming to escape human control (Järviniemi, 2024). Trustedness evaluations aim to evaluate an AI model’s capability to ‘fake’ alignment during training and evaluation and then behave unsafely at deployment (Mallen et al., 2024).

New questions where we are not aware of prior work:

We would like datasets of problems that are both realistic stand-ins for future alignment problems, e.g. alignment problems that might arise in automated R&D, and show flat or inverse scaling (McKenzie et al., 2024).
Develop a reward modelling benchmark for research advice and empirically validate it. Show that this benchmark is uncorrelated from the broader progress driven by verifiable RL and compute scaling.

Suggestions to learn more: Bowman 2022 provides background on benchmark design for scalable oversight. Previous research in this area includes Tan et al., 2024, Lang et al., 2024, and Farquhar et al., 2025).

Benchmarking Concentrated AI Security Failures

Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over misaligned AI models which are deliberately aiming to subvert human control. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).

Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Key subproblems:

Building high-stakes control settings. Settings are empirical research environments, for example containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage and the effectiveness of control protocols in preventing it. They contain main tasks, like legitimate software engineering tasks for a coding agent, as well as side tasks that a misaligned model might attempt to conduct, like reducing security restrictions that protect model weights. See ControlArena for example settings. A diverse range of settings meeting the following desiderata would substantially advance control research (Shlegeris, 2025):

The main tasks have to be such that frontier models are reasonably competent at them, but weaker models do much worse, because we want to rule out protocols that just use the trusted model.
The side tasks should not require the models to make use of affordances that aren’t needed for the main task, since this allows for the blue team to flag any use of unnecessary affordances.
It should be possible to inform both the red and blue teams as to what the possible side tasks may be.

A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows studying of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt, 2025 for more context. It’s important to run control evaluations in these realistic settings to validate their usefulness for control research.

High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al., 2023 and Bhatt et al., 2025. Valuable projects may include:

Studying previously discussed control protocols in new empirical settings such as mock frontier model development infrastructure. This work could, for example, involve evaluating techniques for preventing collusion in untrusted monitoring (Shlegeris, 2024) such as paraphrasers and forcing task decompositions (Shlegeris, 2024).
Training red-team attack policies using reinforcement learning (RL). This may help mitigate a limitation of current control evaluations, that are being done with heavily scaffolded models and thus relying on researchers correctly.

Understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:

Fine-tuning may instill a propensity for the newly fine-tuned model to produce artifacts containing information that flows back to the untrusted model that initiated the CMFT, for example under the guise of performing a hypothesis-driven experiment. The main risk here is that the fine-tuned model does a large amount of malign intellectual labor, producing insights that increase the chance of a critical AI safety failure.
Fine-tuning may instill a secret loyalty, which could be used in a coordinated attempt by the AI model to seize resources and escape human control.

We would like to advance our understanding of this threat model, and assess possible countermeasures.Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts—dataset, logs and model behaviour on normal prompts—look benign.

There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:

Auditing the dataset or RL environment. A key challenge is that CMFT might be used to poison data in ways that are difficult to detect. There are also fundamental limitations in trying to evaluate how problematic single fine-tuning data points are in isolation (Davies et al., 2025).
Auditing the fine-tuned model. Potential approaches include directly auditing the back-doored model (e.g. interrogating the model on whether it has a backdoor, similar to Betley et al., 2025) or trying to train out the misbehaviour via adversarial or non-adversarial training (Mallen and Hebbar, 2025). Exploring the effectiveness of these and other potential approaches is a key empirical question. An example research approach might be to audit the model during training. This approach could be promising because models may misgeneralize at first, for instance, models trained to exhibit some bad behaviour conditional on a backdoor-trigger may initially display the bad behaviour unconditionally.
Asking the model that initiated the fine-tuning for detailed justifications. We can require an AI to write detailed justifications for the high-level design decisions behind the data-generating process, or for the inclusion of individual data points, in order for a fine-tuning run to be approved. We note that this intervention would have to be studied in a CMFT-specific control setting.
Design patterns for limiting harmful information flow. A downstream mitigation could involve limiting information flow from fine-tuned models to the model that instigated its fine-tuning. For example, preventing access to the raw data of an experiment, which may contain steganographically encoded information, might be possible without significantly impacting the value of the experiment. Another approach could involve using trusted models to summarise experiment results, noting this might often limit their informational value.

Fundamental Science of AI Monitoring and Red Teaming

Problem summary: There are a diverse range of research areas, for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.

Key subproblems:

Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (Shlegeris, 2024) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).

Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible however that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.

Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment.

Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).

Capability evaluations of latent reasoning.

Assessing how well LLMs can reason without externalizing (e.g. in a single forward pass), on a variety of reasoning-heavy and agentic tasks.
Assessing when models can perform complex reasoning entirely within their activations, without producing observable reasoning traces (Korbak et al., 2025a).

Studying chain of thought (CoT). Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures:

Evaluating monitorability. Monitoring systems often rely on the assumption that CoTs are monitorable: accurately reflect models’ actual reasoning. Faithfulness (e.g. Lanham et al., 2023, Turpin et al., 2023, Roger et al., 2025, Chen et al., 2025) has been proposed as a proxy measure for monitorability. However, existing faithfulness evaluations do not disambiguate the propensity of models to externalise their reasoning, with the necessity of externalising their reasoning. Moreover, they typically only consider simple question answering settings, failing to identify the level of monitorability that is sufficient for ensuring safety in more complex domains. We’re interested in the development of novel evaluations: those measuring monitorability directly as well as those that target specific properties underlying monitorability. A suite of evaluations covering capabilities that determine the effectiveness of different CoT monitoring protocols includes:
Measures of causal relevance of CoTs. This could involve perturbing a CoT and checking whether it influences the final output (Lanham et al., 2023). A low rate of change could indicate that the CoT is more confabulated or performative. Alternatively, one could try to cross-validate CoT monitors by checking whether tokens perceived as important by the CoT monitor also stand out to whitebox interpretability methods.
Whitebox chain-of-thought reliability: Assessing how reliably a model's reasoning traces reflect its actual decision-making process, by comparing them to ground truth established using whitebox methods (Lindsey et al., 2025).
Steganographic capability: Determining when models become capable of deliberately concealing information in their outputs (Korbak et al., 2025a).
CoT obfuscation capability evaluations: Determining whether models can obfuscate otherwise suspicious reasoning within their CoTs on purpose. These evaluations would help establish the extent to which models are capable of hiding malicious reasoning even if their CoT was not optimised against a monitor during training.

All Research Priorities

Information Theory and Cryptography

Computational Complexity Theory

Economic Theory and Game Theory

Probabilistic Methods

Learning Theory

Evaluation and Guarantees in Reinforcement Learning

Cognitive Science

Interpretability

Benchmark Design and Evaluation

Methods for Post-training and Elicitation

Empirical Investigations Into AI Monitoring and Red Teaming