Refining, probing and constraining model behaviour.
Post-training and elicitation techniques allow us to refine model behaviour for specific safety properties, providing crucial tools for alignment. This section explores diverse applications: consistency training that aligns models without human labels, reward model improvements that resist exploitation, control protocols for managing potentially deceptive systems, and methods to preserve reasoning transparency.
Problem summary: To supervise advanced AI, one approach is to scale unsupervised training methods to apply to domains where supervision is beyond human ability to cheaply label. For example, Wen et al., 2025 nearly match supervised baselines by training to maximise mutual predictability of labels while minimising logical contradictions. Both empirical and theoretical study of this method are necessary to understanding limitations, and scaling properties. For instance, can the method be used to train away adversarial, misaligned initialization? How does the Wen et al. method relate to Bayesian methods like belief propagation, and can these be used to separate misaligned from aligned optima?
Why this matters: Consistency training could be a useful way to extend other alignment methods based on supervised learning to tasks where ground truth is hard to measure. For example, models could be used on long-horizon tasks like experimental design advice where feedback is only possible after the experiment has been run – to ensure models are honest and aligned on such tasks, we would need unsupervised scalable oversight methods. Conversely, some such methods could lead to misalignment, and ideally we would be able to identify this in advance.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: We’d recommend reading Wen et al., 2025. Related research includes consistency over latent representations (Burns et al., 2022).
Problem summary: Reward models are used to distill human judgment; this amortizes the cost of laborious human labeling. Post-training for LLMs uses reward model judgments as training signal for open-ended, and non-verifiable questions. However, the reward model is a proxy which is inherently imperfect and can be gamed. Models can learn to reward hack, producing outputs which score highly on the reward signal without holding up to careful scrutiny. This problem is exacerbated by the scalable oversight challenge: as AI capabilities increase, models can produce outputs (e.g., complex code, scientific analyses) that are too time-consuming or difficult for human supervisors to evaluate accurately. We need methods and evaluations for scalable use of reward models. To develop such methods, theoretical models of the obstacles to reward modelling are needed. These obstacles include both label error induced by the limitations of human judgment, and generalisation error in the reward model itself. Flaws in human judgment can be modelled by partial observability or time-complexity gaps between the model and the human rater—a reward model trained on flawed human judgments can learn to systematically approve outputs that are subtly incorrect, manipulative, or harmful.
Why this matters: A sufficiently capable but misaligned model could learn to systematically exploit the blind spots in a reward model to achieve undesirable goals. For example, a model could produce outputs that appear helpful and correct to a human rater but contain hidden steganographic messages or dishonestly obfuscate errors in assumptions made by the model. In the worst case, a flawed reward model doesn't just lead to poor performance; it can actively train a model to deceive human supervisors without being caught.
Examples of subproblems:
Established subproblems:
New questions where we are not aware of prior work:
Suggestions to learn more: Previous research in this area includes Gao et al., 2022.
Problem summary: Future AI systems may be used for large amounts of intellectual work, including in AI research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. This exposes us to risk that these systems might systematically sabotage safety-relevant research, which would frustrate our efforts to develop more powerful safety techniques. We would like to develop methods to prevent this sabotage, including by improving the ability of trusted AI systems to monitor untrusted systems.
Why this matters: Preventing research sabotage matters insofar as we rely on AIs to help perform research that enables alignment of future even more capable AIs. Methods focused on catching clearly incriminating actions may not be suitable for detecting this type of sabotage, as research sabotage is likely to be difficult to distinguish from lack of competence. Current proposed control protocols often benefit substantially from access to capable trusted models, for monitoring and strategies like rewriting or editing potentially backdoored code when untrusted model outputs exceed suspicion thresholds (Greenblatt et al., 2023).
Key subproblems:
Scheming-robust scalable oversight research. Scalable or recursive oversight is a technique where training signals for more powerful models are provided by less capable models. We are interested in research on scalable oversight techniques robust to intentional subversion by the monitored models.We envision that this kind of scalable oversight research is done using control-style adversarial analysis (see Shlegeris and Greenblatt, 2025). Examining approaches to increasing robustness of AI safety via debate (Irving et al., 2018) offers one promising starting direction for this research.
Studying the sample efficiency of online training. Monitors might detect that models behave in undesirable but non-catastrophic ways during deployment. In such cases, AI developers are unlikely to shut down the deployment, but they might be motivated to perform online training to remove the undesirable behaviour. We’re interested in studying the sample efficiency of online training in order to understand the prospects of successfully training away undesirable scheming behaviours. Studying this requires developing malign initialisations, i.e., models which are trained to resist online training.
Mitigating sandbagging. A particular worry with misaligned models is that they might deliberately underperform in cases where high performance would be desirable, such as dangerous capability evaluations, AI safety research or when performing oversight over other AI models (see Shlegeris and Stastny, 2025 for detailed analysis). Previous empirical work has studied this threat model using password-locked models (see Greenblatt et al., 2025). We are interested in extensions of this work, as well as alternative approaches to studying sandbagging. One example project here would be to study password-locked reasoning models.
Improving the usefulness of trusted monitors. For instance:
Building capable models with narrow capability profiles. For instance:
Utilising unlearning to remove scheming capabilities.