Evaluation and Guarantees in Reinforcement Learning

Reinforcement learning creates distinct alignment challenges requiring both theoretical advances and empirical testing. This section examines how RL agents can manipulate their training: exploiting reward functions, strategically withholding capabilities during training to avoid unwanted updates, and learning to behave differently during deployment than training. Research priorities include testing oversight methods that scale beyond direct human evaluation (like adversarial debate and consistency-based training), creating deliberately misaligned agents as controlled testbeds, developing training procedures robust to strategic manipulation, and preventing reward function exploitation.

Adversarial alignment evaluations and model organisms

Problem summary: We’d like to be able to test out our alignment methods. As an empirical approach, we can create deliberately misaligned systems and see if our methods can train away misaligned behaviour. Models adversarially trained in this way are termed ‘model organisms of misalignment’ (Hubinger et al., 2023). We would like to stress test alignment methods against as diverse a set of model organisms as possible. Misaligned model organisms can be constructed by any of: adversarial prompting (Wyse et al., 2025), activation steering (Wang et al., 2025), reinforcement learning (McKee-Reid et al., 2024), data poisoning (Hubinger et al., 2024), and adversarial environments (Greenblatt et al., 2024).

Why this matters: This line of research could provide useful evidence about the effectiveness of alignment methods. Scientific comparisons across evaluation setups and model organisms will inform methodological choices towards better practices for empirical study of alignment methods.

Examples of subproblems:

Established subproblems:

Develop a model with deliberate backdoors – ideally ones that are resistant to standard adversarial training and that are as realistic as possible. Can a red-team with arbitrary access to the weights find the backdoor? How resilient can the backdoor be made?
Adversarial training has been shown to, in some cases, make backdoors worse (Hubinger et al., 2024). Does this result replicate, and how much does it generalise?
Reward hacking (when optimising for proxy objectives leads to highly undesirable policies) in coding environments is a well-documented problem in frontier models (METR, 2025; Bondarenko et al., 2025; Baker et al., 2025). Baker et al., show that certain process oversight protocols induce increasingly hard-to-spot reward hacking. We are interested in developing pipelines for model organisms which can be used for adversarial evaluations and as empirical testbeds for reward hacking.
Betley et al., 2025 find that fine-tuning models on narrow datasets of bad behaviour, e.g. bug-ridden code, results in broad misalignment across other domains, terming this emergent misalignment. Turner et al., 2025 reproduced this result across a number of datasets and models to create a suite of model organisms for emergent misalignment. We are interested in further settings where emergent misalignment arises, and also whether reinforcement learning with flawed environments (c.f. reward hacking above) induces misalignment.

New questions where we are not aware of prior work:

Propose new protocols for constructing ‘model organisms’ which make different assumptions about how misalignment arises, and how misaligned goals are specified.
Can we write a ‘Trojan’ policy in an RL environment whose misbehaviour is only activated after k steps or k reward-units to emulate long-horizon deception? Could we train a model to do the same?
At times, and possibly even in real deployment, we will find an ambiguous signal of misalignment (say, the model takes a bad action, but it could be attributed to a misunderstanding). Do we have a way to amplify this evidence, running more exhaustive experiments on variations of that same datapoint, to clearly find out whether it’s an example of serious misalignment? We could experiment on how easy this is to do against different model organisms.
Some of the above directions require having a sensible notion of when we’ve made a datapoint too adversarial. For example, when we have simply jailbroken the model, which should not be understood as innate misalignment. How can we quantify the level of adversariality? Which properties should our alignment evaluations have to ensure they single out pervasive, coherent and directed misbehaviour, rather than arbitrary jailbreaks?

Suggestions to learn more: We’d recommend reading Hubinger et al., 2024 and Marks et al., 2025. Previous research in this area includes Greenblatt et al., 2024 and Turner et al., 2025.

Empirical Examination of Scalable Oversight Protocols

Problem summary: To reduce supervision on hard problems to supervision on human-tractable problems, ‘scalable oversight’ methods either introduce adversarial games as in AI safety via debate (Irving et al., 2018) and prover-verifier games (Kirchner et al., 2024); or introduce unsupervised objectives as in consistency training (Wen et al., 2025). Can we empirically examine the efficacy of these protocols?

Why this matters: Scalable oversight is the problem of correctly rewarding desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge.

Examples of subproblems:

Established subproblems:

Empirical ML experiments with debate such as Khan et al., 2024 and Kenton et al., 2024 have used protocols similar to the original recursive debate method, not the probabilistic variants explored in Brown-Cohen and Irving, 2024 or Brown-Cohen et al., 2025. Implement one of these probabilistic protocols with existing LLMs and datasets (e.g. GPQA), then measure its effectiveness.
Scalable oversight techniques often require very large models to find empirical signals. Positive results beyond minimal two turn interaction have proven difficult to demonstrate even in frontier models (Khan et al., 2024; Kenton et al., 2024). Can we construct empirical settings which capture characteristics of scalable oversight–i.e. weak preference signals that can be amplified into supervising a capable model–but where the student model remains small? Such settings would allow some empirical work to occur at a smaller scale, boosting research velocity.
Empirical examination of other proposals such as weak-to-strong supervision (Burns et al., 2023), unsupervised translation (Levy et al., 2025), prover-verifier games (Hammond and Adam-Day, 2025, Kirchner et al., 2024).

New questions where we are not aware of prior work:

Scalable oversight protocols depend on stability (that small changes in inputs shouldn't produce drastically different outputs). We may not be able to prove the stability of our algorithms. Conduct empirical experiments to examine the effects of small perturbations on scalable oversight protocols in practice.

Suggestions to learn more: We’d recommend reading Buhl et al., 2025 and Brown-Cohen et al., 2025. Previous theoretical research in this area includes Irving et al., 2018, Barnes, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024; previous empirical work includes Khan et al., 2024 and Kenton et al., 2024.

Methods for Discovery and Prevention of Sandbagging and Exploration Hacking

Problem summary: Exploration hacking arises when an agent strategically refuses to give its best answer to prevent further training from changing its behaviour (Benton et al., 2024). This is a specific form of sandbagging (i.e. strategically underperforming) specifically which may arise during reinforcement learning training. Does exploration hacking occur in practice? Are there ways to detect and prevent this behaviour?

Why this matters: If agents are able to exploration hack, it could mean that a practical amount of training would fail to fully elicit model capabilities. For scalable oversight protocols such as debate, this could mean that a debater is capable of finding a counterargument that shows that an argument is flawed, but avoids doing so, resulting in an incorrect conclusion (Hilton et al., 2025). Exploration hacking thus opens up degrees of freedom that could be used adversarially by the models to cause harm, even if the training objective is aligned in the sense that all reward-optimal policies are safe (and do not sandbag).

Examples of subproblems:

Deliberately train models to exploration hack, creating a 'model organism' (Hubinger et al., 2024) of exploration hacking. These 'model organisms' can then be used to examine possible ways of preventing exploration hacking.
Are there regularisation methods (entropy bonus, KL constraint, pessimistic bootstrapping) we can use to prevent exploration hacking without hurting sample efficiency too much?
Can we measure the extent of exploration hacking? One approach would be to investigate using best-of-n sampling of a trusted model to estimate returns to exploration given more inference compute. This could then be used to estimate a lower bound of the exploration a model should be able to do given more training compute. Similarly, we might amortize the cost of this estimate by training an exploration-predictor model to predict returns to exploration; then use it to estimate if a model has exhausted high-returns exploration options.
Can exploration hacking be exploited by an improved adversary? One approach would be to use adversarial RL to further train one player while holding the other's policy constant can be used to detect exploration hacking: if the constant player's reward does not decrease significantly, this is evidence that they were not exploration hacking.

Suggestions to learn more: Previous research in this area includes Greenblatt et al., 2024, Roger and Shlegeris, 2023, Shlegeris and Stastny 2025 and Gasteiger et al., 2025. On exploration in reinforcement learning, see Hazan et al., 2019 for relevant theory, Madan et al., 2025 for recent methods. Other relevant work includes Liu and Lai, 2023.

RL Approaches to Unexploitable Search

Problem summary: Unexploitable search methods force a model to randomize performance over all adversarial side-objectives the model can access–in a sense to be formalized by the method.

Why this matters: We would like to be able to use AI on under-specified tasks (e.g. coding and research advice) where there are many possible correct solutions. Scalable oversight research – designing protocols to correctly reward desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge – typically aims to provide guarantees about settings where individual failures are non-catastrophic. However, if your reward function admits many solutions then you also need to avoid having these free parameters exploited over time (“the exploitable search problem”; Pfau and Irving, 2025): a misaligned AI could adversarially select solutions that are correct but nevertheless cause harm, such as opening up surface area for future exploits, or leaving steganographic information for later misbehaviour.

Examples of subproblems:

New questions where we are not aware of prior work:

Develop a red-team / blue-team protocol for testing solutions to the exploitable search problem.
Testing whether methods such as GFlowNets, max entropy RL, KL-regularized RL, and diversity games can be used for unexploitable search.
Could we set up a red-team/blue-team loop where the blue trains to maximise worst-case inspector loss and the red agent iteratively patches the inspector? How effective would this loop be?
Can we write a simple benchmark task to reveal exploitable-search failures?

Suggestions to learn more: We’d recommend reading Pfau and Irving, 2025. Related work on diversity in reinforcement learning includes the GFlowNet class of methods on reward-proportional sampling (Bengio et al., 2023), and distributional side-constraints in RL (Lee et al., 2025).

Experimental Tests of Deceptive Alignment

Problem summary: In advanced AI, alignment training may fail when a model can classify train from test and use this to selectively comply with training objectives–this is termed deceptive alignment. Deceptive alignment is a particular, worst-case kind of out-of-distribution failure. Empirically, Greenblatt et al., 2024 demonstrate an instance of this phenomenon termed alignment faking in LLMs: “selectively complying with its training objective in training to prevent modification of its behaviour out of training.” Similarly, Hubinger et al., 2024 study training and mitigation for sleeper agents–a particular model organism of misalignment in which models are trained to be misaligned selectively only in deployment. In what settings does deceptive alignment occur and how can it be prevented or mitigated?

Why this matters: Deceptive alignment undermines alignment training, and means evaluation is unreliable. A misaligned model engaging in alignment faking is more likely to be deployed, because its misalignment will be hard to detect.

Examples of subproblems:

Established subproblems:

What determines the success of training methods on model organisms for deceptive alignment (Hubinger et al., 2024)?
Develop new pipelines for creating deceptively misaligned models emphasising realism, or alternative mechanisms (Mallen and Hebbar, 2025).
Does Greenblatt et al., 2024 replicate and how robust are the results to small variations in the experimental setup?
- What variations to Greenblatt et al., 2024 could make the setting a more realistic model of the ways in which alignment faking could cause harm?
- How do variations in training parameters affect model propensity for alignment faking?

New questions where we are not aware of prior work:

Develop methods for training robustly misaligned model organisms which cannot be easily detected by e.g. linear probes (Macdiarmind et al., 2024), and have improved sample complexity in terms of training steps needed to re-align the model.
Why do some models alignment fake and not others?
Do models engaging in alignment faking also engage in encoded reasoning when they are monitored?
What methods would successfully prevent a model from alignment faking?

Suggestions to learn more: We’d recommend reading Meinke et al., 2024, Greenblatt et al., 2024 and Carlsmith (2023). Replications and follow-up work includes Hughes et al., 2025 and Roger 2025.

Consistency Training (unsupervised Scalable Oversight)

Problem summary: To supervise advanced AI, one approach is to scale unsupervised training methods to apply to domains where supervision is beyond human ability to cheaply label. For example, Wen et al., 2025 nearly match supervised baselines by training to maximise mutual predictability of labels while minimising logical contradictions. Both empirical and theoretical study of this method are necessary to understanding limitations, and scaling properties. For instance, can the method be used to train away adversarial, misaligned initialization? How does the Wen et al. method relate to Bayesian methods like belief propagation, and can these be used to separate misaligned from aligned optima?

Why this matters: Consistency training could be a useful way to extend other alignment methods based on supervised learning to tasks where ground truth is hard to measure. For example, models could be used on long-horizon tasks like experimental design advice where feedback is only possible after the experiment has been run – to ensure models are honest and aligned on such tasks, we would need unsupervised scalable oversight methods. Conversely, some such methods could lead to misalignment, and ideally we would be able to identify this in advance.

Examples of subproblems:

Established subproblems:

Weak-to-strong methods: Burns et al., 2023 identified the problem of weak-to-strong supervision, but also showed that empirical consistency methods often fail to provide useful signal for reward models. We are particularly interested in weak-to-strong methods that can provide signal on long-horizon or otherwise hard-to-supervise tasks.
Unsupervised translation: Levy et al., 2025 introduce and study the problem of unsupervised translation of emergent AI languages. Can we use this to model the problem of supervising possibly superhuman capabilities, where the unknown language stands in for these future capabilities?

New questions where we are not aware of prior work:

Consistency over large problem spaces: Existing work, e.g. Wen et al., 2025, shows that consistency training can provide strong signal for short-horizon problems. In such problems individual questions have clear variants (e.g. their contrapositives) with which they must be consistent. But, in general, the space of relevant questions with which a given question-answer pair must be consistent must be discovered by the model. In fact, whether empirically consistency over any fixed-size subset of problems is sufficient for global consistency is unknown (c.f. Irving et al., 2022). Ideas from active learning, and adversarial games (Hammond and Adam-Day, 2024) may be relevant here.

Suggestions to learn more: We’d recommend reading Wen et al., 2025. Related research includes consistency over latent representations (Burns et al., 2022).

Reward Model Training and Red-Teaming

Problem summary: Reward models are used to distill human judgment; this amortizes the cost of laborious human labeling. Post-training for LLMs uses reward model judgments as training signal for open-ended, and non-verifiable questions. However, the reward model is a proxy which is inherently imperfect and can be gamed. Models can learn to reward hack, producing outputs which score highly on the reward signal without holding up to careful scrutiny. This problem is exacerbated by the scalable oversight challenge: as AI capabilities increase, models can produce outputs (e.g., complex code, scientific analyses) that are too time-consuming or difficult for human supervisors to evaluate accurately. We need methods and evaluations for scalable use of reward models. To develop such methods, theoretical models of the obstacles to reward modelling are needed. These obstacles include both label error induced by the limitations of human judgment, and generalisation error in the reward model itself. Flaws in human judgment can be modelled by partial observability or time-complexity gaps between the model and the human rater—a reward model trained on flawed human judgments can learn to systematically approve outputs that are subtly incorrect, manipulative, or harmful.

Why this matters: A sufficiently capable but misaligned model could learn to systematically exploit the blind spots in a reward model to achieve undesirable goals. For example, a model could produce outputs that appear helpful and correct to a human rater but contain hidden steganographic messages or dishonestly obfuscate errors in assumptions made by the model. In the worst case, a flawed reward model doesn't just lead to poor performance; it can actively train a model to deceive human supervisors without being caught.

Examples of subproblems:

Established subproblems:

Weak-to-strong unsupervised reward models: Burns et al., 2023 introduced the problem of unsupervised training for reward models as a weak-to-strong generalisation problem. Extend unsupervised methods such as Wen et al., 2025)’s consistency training to supervise reward models on problems where human labels are expensive.
Diagnosing reward hacking: Develop metrics and white-box probes (e.g. SAE vectors; Shabalin et al., 2024) that surface when the reward model’s judgments disagree with hold-out evaluations.
Adversarial evaluation of reward model training: Marks et al., 2025 demonstrated a red-team blue-team game in which researchers compete to determine whether they can detect misalignment in an adversarially constructed policy. A similar exercise could be done with adversarially-constructed reward models.
Reward modelling under partial-information: Study reward model learning when human judges are only provided subsets of the whole model output (c.f. Lang et al., 2024).

New questions where we are not aware of prior work:

Iterative engineering and risks of uncaught reward hacking: METR, 2025 and others have shown that recent models knowingly subvert user intent on coding tasks. Developers iteratively correct for such behaviour by fixing hackable reward environments, but this empirical iteration process is likely to leave untouched harder to catch hacks. Can we theoretically model this dynamic? Will the resulting model develop increasingly complex reward hacks? Statistical analysis of Goodhart’s law, and creation of deliberately-flawed benchmarks could help predict the results of these practices.

Suggestions to learn more: Previous research in this area includes Gao et al., 2022.

All Research Priorities

Information Theory and Cryptography

Computational Complexity Theory

Economic Theory and Game Theory

Probabilistic Methods

Learning Theory

Evaluation and Guarantees in Reinforcement Learning

Cognitive Science

Interpretability

Benchmark Design and Evaluation

Methods for Post-training and Elicitation

Empirical Investigations Into AI Monitoring and Red Teaming