Stress-test AI agents and prove when they can’t game, sandbag or exploit rewards.
Reinforcement learning creates distinct alignment challenges requiring both theoretical advances and empirical testing. This section examines how RL agents can manipulate their training: exploiting reward functions, strategically withholding capabilities during training to avoid unwanted updates, and learning to behave differently during deployment than training. Research priorities include testing oversight methods that scale beyond direct human evaluation (like adversarial debate and consistency-based training), creating deliberately misaligned agents as controlled testbeds, developing training procedures robust to strategic manipulation, and preventing reward function exploitation.
Problem summary: We’d like to be able to test out our alignment methods. As an empirical approach, we can create deliberately misaligned systems and see if our methods can train away misaligned behaviour. Models adversarially trained in this way are termed ‘model organisms of misalignment’ (Hubinger et al., 2023). We would like to stress test alignment methods against as diverse a set of model organisms as possible. Misaligned model organisms can be constructed by any of: adversarial prompting (Wyse et al., 2025), activation steering (Wang et al., 2025), reinforcement learning (McKee-Reid et al., 2024), data poisoning (Hubinger et al., 2024), and adversarial environments (Greenblatt et al., 2024).
Why this matters: This line of research could provide useful evidence about the effectiveness of alignment methods. Scientific comparisons across evaluation setups and model organisms will inform methodological choices towards better practices for empirical study of alignment methods.
Examples of subproblems:
Established subproblems:
Suggestions to learn more: We’d recommend reading Hubinger et al., 2024 and Marks et al., 2025. Previous research in this area includes Greenblatt et al., 2024 and Turner et al., 2025.
Problem summary: To reduce supervision on hard problems to supervision on human-tractable problems, ‘scalable oversight’ methods either introduce adversarial games as in AI safety via debate (Irving et al., 2018) and prover-verifier games (Kirchner et al., 2024); or introduce unsupervised objectives as in consistency training (Wen et al., 2025). Can we empirically examine the efficacy of these protocols?
Why this matters: Scalable oversight is the problem of correctly rewarding desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: We’d recommend reading Buhl et al., 2025 and Brown-Cohen et al., 2025. Previous theoretical research in this area includes Irving et al., 2018, Barnes, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024; previous empirical work includes Khan et al., 2024 and Kenton et al., 2024.
Problem summary: Exploration hacking arises when an agent strategically refuses to give its best answer to prevent further training from changing its behaviour (Benton et al., 2024). This is a specific form of sandbagging (i.e. strategically underperforming) specifically which may arise during reinforcement learning training. Does exploration hacking occur in practice? Are there ways to detect and prevent this behaviour?
Why this matters: If agents are able to exploration hack, it could mean that a practical amount of training would fail to fully elicit model capabilities. For scalable oversight protocols such as debate, this could mean that a debater is capable of finding a counterargument that shows that an argument is flawed, but avoids doing so, resulting in an incorrect conclusion (Hilton et al., 2025). Exploration hacking thus opens up degrees of freedom that could be used adversarially by the models to cause harm, even if the training objective is aligned in the sense that all reward-optimal policies are safe (and do not sandbag).
Examples of subproblems:
Suggestions to learn more: Previous research in this area includes Greenblatt et al., 2024, Roger and Shlegeris, 2023, Shlegeris and Stastny 2025 and Gasteiger et al., 2025. On exploration in reinforcement learning, see Hazan et al., 2019 for relevant theory, Madan et al., 2025 for recent methods. Other relevant work includes Liu and Lai, 2023.
Problem summary: Unexploitable search methods force a model to randomize performance over all adversarial side-objectives the model can access–in a sense to be formalized by the method.
Why this matters: We would like to be able to use AI on under-specified tasks (e.g. coding and research advice) where there are many possible correct solutions. Scalable oversight research – designing protocols to correctly reward desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge – typically aims to provide guarantees about settings where individual failures are non-catastrophic. However, if your reward function admits many solutions then you also need to avoid having these free parameters exploited over time (“the exploitable search problem”; Pfau and Irving, 2025): a misaligned AI could adversarially select solutions that are correct but nevertheless cause harm, such as opening up surface area for future exploits, or leaving steganographic information for later misbehaviour.
Examples of subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: We’d recommend reading Pfau and Irving, 2025. Related work on diversity in reinforcement learning includes the GFlowNet class of methods on reward-proportional sampling (Bengio et al., 2023), and distributional side-constraints in RL (Lee et al., 2025).
Problem summary: In advanced AI, alignment training may fail when a model can classify train from test and use this to selectively comply with training objectives–this is termed deceptive alignment. Deceptive alignment is a particular, worst-case kind of out-of-distribution failure. Empirically, Greenblatt et al., 2024 demonstrate an instance of this phenomenon termed alignment faking in LLMs: “selectively complying with its training objective in training to prevent modification of its behaviour out of training.” Similarly, Hubinger et al., 2024 study training and mitigation for sleeper agents–a particular model organism of misalignment in which models are trained to be misaligned selectively only in deployment. In what settings does deceptive alignment occur and how can it be prevented or mitigated?
Why this matters: Deceptive alignment undermines alignment training, and means evaluation is unreliable. A misaligned model engaging in alignment faking is more likely to be deployed, because its misalignment will be hard to detect.
Examples of subproblems:
Established subproblems:
Suggestions to learn more: We’d recommend reading Meinke et al., 2024, Greenblatt et al., 2024 and Carlsmith (2023). Replications and follow-up work includes Hughes et al., 2025 and Roger 2025.
Problem summary: To supervise advanced AI, one approach is to scale unsupervised training methods to apply to domains where supervision is beyond human ability to cheaply label. For example, Wen et al., 2025 nearly match supervised baselines by training to maximise mutual predictability of labels while minimising logical contradictions. Both empirical and theoretical study of this method are necessary to understanding limitations, and scaling properties. For instance, can the method be used to train away adversarial, misaligned initialization? How does the Wen et al. method relate to Bayesian methods like belief propagation, and can these be used to separate misaligned from aligned optima?
Why this matters: Consistency training could be a useful way to extend other alignment methods based on supervised learning to tasks where ground truth is hard to measure. For example, models could be used on long-horizon tasks like experimental design advice where feedback is only possible after the experiment has been run – to ensure models are honest and aligned on such tasks, we would need unsupervised scalable oversight methods. Conversely, some such methods could lead to misalignment, and ideally we would be able to identify this in advance.
Examples of subproblems:
Established subproblems:
Suggestions to learn more: We’d recommend reading Wen et al., 2025. Related research includes consistency over latent representations (Burns et al., 2022).
Problem summary: Reward models are used to distill human judgment; this amortizes the cost of laborious human labeling. Post-training for LLMs uses reward model judgments as training signal for open-ended, and non-verifiable questions. However, the reward model is a proxy which is inherently imperfect and can be gamed. Models can learn to reward hack, producing outputs which score highly on the reward signal without holding up to careful scrutiny. This problem is exacerbated by the scalable oversight challenge: as AI capabilities increase, models can produce outputs (e.g., complex code, scientific analyses) that are too time-consuming or difficult for human supervisors to evaluate accurately. We need methods and evaluations for scalable use of reward models. To develop such methods, theoretical models of the obstacles to reward modelling are needed. These obstacles include both label error induced by the limitations of human judgment, and generalisation error in the reward model itself. Flaws in human judgment can be modelled by partial observability or time-complexity gaps between the model and the human rater—a reward model trained on flawed human judgments can learn to systematically approve outputs that are subtly incorrect, manipulative, or harmful.
Why this matters: A sufficiently capable but misaligned model could learn to systematically exploit the blind spots in a reward model to achieve undesirable goals. For example, a model could produce outputs that appear helpful and correct to a human rater but contain hidden steganographic messages or dishonestly obfuscate errors in assumptions made by the model. In the worst case, a flawed reward model doesn't just lead to poor performance; it can actively train a model to deceive human supervisors without being caught.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: Previous research in this area includes Gao et al., 2022.