Visit AISI.gov.uk
Please enable javascript for this website.
Research Agenda
Methods for Post-training and Elicitation

Methods for Post-training and Elicitation

Refining, probing and constraining model behaviour.

Post-training and elicitation techniques allow us to refine model behaviour for specific safety properties, providing crucial tools for alignment. This section explores diverse applications: consistency training that aligns models without human labels, reward model improvements that resist exploitation, control protocols for managing potentially deceptive systems, and methods to preserve reasoning transparency. 

Consistency Training (unsupervised Scalable Oversight) 

Problem summary: To supervise advanced AI, one approach is to scale unsupervised training methods to apply to domains where supervision is beyond human ability to cheaply label. For example, Wen et al., 2025 nearly match supervised baselines by training to maximise mutual predictability of labels while minimising logical contradictions. Both empirical and theoretical study of this method are necessary to understanding limitations, and scaling properties. For instance, can the method be used to train away adversarial, misaligned initialization? How does the Wen et al. method relate to Bayesian methods like belief propagation, and can these be used to separate misaligned from aligned optima?  

Why this matters: Consistency training could be a useful way to extend other alignment methods based on supervised learning to tasks where ground truth is hard to measure. For example, models could be used on long-horizon tasks like experimental design advice where feedback is only possible after the experiment has been run – to ensure models are honest and aligned on such tasks, we would need unsupervised scalable oversight methods. Conversely, some such methods could lead to misalignment, and ideally we would be able to identify this in advance.

Examples of subproblems:

Established subproblems:

  1. Weak-to-strong methods: Burns et al., 2023 identified the problem of weak-to-strong supervision, but also showed that empirical consistency methods often fail to provide useful signal for reward models. We are particularly interested in weak-to-strong methods that can provide signal on long-horizon or otherwise hard-to-supervise tasks.
  2. Unsupervised translation: Levy et al., 2025 introduce and study the problem of unsupervised translation of emergent AI languages. Can we use this to model the problem of supervising possibly superhuman capabilities, where the unknown language stands in for these future capabilities?

New questions where we are not aware of prior work:

  1. Consistency over large problem spaces: Existing work, e.g. Wen et al., 2025, shows that consistency training can provide strong signal for short-horizon problems. In such problems individual questions have clear variants (e.g. their contrapositives) with which they must be consistent. But, in general, the space of relevant questions with which a given question-answer pair must be consistent must be discovered by the model. In fact, whether empirically consistency over any fixed-size subset of problems is sufficient for global consistency is unknown (c.f. Irving et al., 2022). Ideas from active learning, and adversarial games (Hammond and Adam-Day, 2024) may be relevant here. 

Suggestions to learn more: We’d recommend reading Wen et al., 2025. Related research includes consistency over latent representations (Burns et al., 2022). 

Reward Model Training and Red-Teaming

Problem summary: Reward models are used to distill human judgment; this amortizes the cost of laborious human labeling. Post-training for LLMs uses reward model judgments as training signal for open-ended, and non-verifiable questions. However, the reward model is a proxy which is inherently imperfect and can be gamed. Models can learn to reward hack, producing outputs which score highly on the reward signal without holding up to careful scrutiny. This problem is exacerbated by the scalable oversight challenge: as AI capabilities increase, models can produce outputs (e.g., complex code, scientific analyses) that are too time-consuming or difficult for human supervisors to evaluate accurately. We need methods and evaluations for scalable use of reward models. To develop such methods, theoretical models of the obstacles to reward modelling are needed. These obstacles include both label error induced by the limitations of human judgment, and generalisation error in the reward model itself. Flaws in human judgment can be modelled by partial observability or time-complexity gaps between the model and the human rater—a reward model trained on flawed human judgments can learn to systematically approve outputs that are subtly incorrect, manipulative, or harmful. 

Why this matters: A sufficiently capable but misaligned model could learn to systematically exploit the blind spots in a reward model to achieve undesirable goals. For example, a model could produce outputs that appear helpful and correct to a human rater but contain hidden steganographic messages or dishonestly obfuscate errors in assumptions made by the model. In the worst case, a flawed reward model doesn't just lead to poor performance; it can actively train a model to deceive human supervisors without being caught.

Examples of subproblems:

Established subproblems:

  1. Weak-to-strong unsupervised reward models:  Burns et al., 2023 introduced the problem of unsupervised training for reward models as a weak-to-strong generalisation problem. Extend unsupervised methods such as Wen et al., 2025)’s consistency training to supervise reward models on problems where human labels are expensive.
  2. Diagnosing reward hacking:  Develop metrics and white-box probes (e.g. SAE vectors; Shabalin et al., 2024) that surface when the reward model’s judgments disagree with hold-out evaluations.
  3. Adversarial evaluation of reward model training: Marks et al., 2025 demonstrated a red-team blue-team game in which researchers compete to determine whether they can detect misalignment in an adversarially constructed policy. A similar exercise could be done with adversarially-constructed reward models. 
  4. Reward modelling under partial-information: Study reward model learning when human judges are only provided subsets of the whole model output (c.f. Lang et al., 2024).

New questions where we are not aware of prior work:

  1. Iterative engineering and risks of uncaught reward hacking: METR, 2025 and others have shown that recent models knowingly subvert user intent on coding tasks. Developers iteratively correct for such behaviour by fixing hackable reward environments, but this empirical iteration process is likely to leave untouched harder to catch hacks. Can we theoretically model this dynamic? Will the resulting model develop increasingly complex reward hacks? Statistical analysis of Goodhart’s law, and creation of deliberately-flawed benchmarks could help predict the results of these practices.

Suggestions to learn more: Previous research in this area includes Gao et al., 2022.

Empirical Investigations Into Preventing Research Sabotage

Problem summary: Future AI systems may be used for large amounts of intellectual work, including in AI research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. This exposes us to risk that these systems might systematically sabotage safety-relevant research, which would frustrate our efforts to develop more powerful safety techniques. We would like to develop methods to prevent this sabotage, including by improving the ability of trusted AI systems to monitor untrusted systems. 

Why this matters: Preventing research sabotage matters insofar as we rely on AIs to help perform research that enables alignment of future even more capable AIs. Methods focused on catching clearly incriminating actions may not be suitable for detecting this type of sabotage, as research sabotage is likely to be difficult to distinguish from lack of competence. Current proposed control protocols often benefit substantially from access to capable trusted models, for monitoring and strategies like rewriting or editing potentially backdoored code when untrusted model outputs exceed suspicion thresholds (Greenblatt et al., 2023).

Key subproblems:

Scheming-robust scalable oversight research. Scalable or recursive oversight is a technique where training signals for more powerful models are provided by less capable models. We are interested in research on scalable oversight techniques robust to intentional subversion by the monitored models.We envision that this kind of scalable oversight research is done using control-style adversarial analysis (see Shlegeris and Greenblatt, 2025). Examining approaches to increasing robustness of AI safety via debate (Irving et al., 2018) offers one promising starting direction for this research.

Studying the sample efficiency of online training. Monitors might detect that models behave in undesirable but non-catastrophic ways during deployment. In such cases, AI developers are unlikely to shut down the deployment, but they might be motivated to perform online training to remove the undesirable behaviour. We’re interested in studying the sample efficiency of online training in order to understand the prospects of successfully training away undesirable scheming behaviours. Studying this requires developing malign initialisations, i.e., models which are trained to resist online training.

Mitigating sandbagging. A particular worry with misaligned models is that they might deliberately underperform in cases where high performance would be desirable, such as dangerous capability evaluations, AI safety research or when performing oversight over other AI models (see Shlegeris and Stastny, 2025 for detailed analysis). Previous empirical work has studied this threat model using password-locked models (see Greenblatt et al., 2025). We are interested in extensions of this work, as well as alternative approaches to studying sandbagging. One example project here would be to study password-locked reasoning models.

Improving the usefulness of trusted monitors. For instance:

  • Training trusted monitors to become better at critical skills, which involves building training environments or datasets. Relevant skills include:
    • Detecting vulnerabilities.
    • Detecting steganographic encodings.
    • Paraphrasing the outputs of powerful models, to help prevent collusion (Shlegeris, 2024).
  • Making trusted monitors robust against jailbreak attempts embedded in the outputs of untrusted policies.
  • Ensuring that trusted monitors can deal with very long context, such as entire codebases or long agent trajectories.
  • Developing agentic trusted monitors, such as Interrogator agents (Pacchiardi et al., 2023).
  • Identifying scaling laws for the above.

Building capable models with narrow capability profiles. For instance:

  • Studying the scaling laws of highly specialized models, e.g. coding models, with a focus on techniques to get small models to perform well (e.g. distillation).
  • Studying the capability profile of large models that are pre-trained only on data predating language models.

Utilising unlearning to remove scheming capabilities.

All Research Priorities