
Cognitive Science

Map and mitigate the biases and limitations of human supervision.

Systematic human error

Problem summary: Modern AI models (LLMs and associated agents) depend critically on human supervision in the training loop: AI models perform tasks, and human judges express preferences about the resulting outputs or performance. These judgments are used as training signals to improve the model. However, there are many reasons why humans could introduce systematic error into a training signal: they have idiosyncratic priors, limited attention, systematic cognitive biases, reasoning flaws, moral blind spots, and so on. How can we design evaluation pipelines that mitigate these flaws to minimise systematic error, so that the models are safe for deployment?

For the sub-problems below, we specify the problem as follows: 

  • There is a to-be-aligned AI model that produces outputs (text, code, or actions).
  • We assume that these outputs have some in-theory knowable ground truth quality, even if this is not immediately discernible (e.g. their consequences may be unambiguously good or bad).
  • There are one or more human raters, who express preferences about these outputs, with or without confidences.
  • The model is assumed to have more expertise or domain-specific knowledge than the humans on the task being performed.
  • These raters may be assisted by other AI agents, who are also less expert than the model being aligned. They may advise individually or collectively.  
  • The judgments of the human raters are used to align the model in some way. 

The core question then becomes: given the potential limitations of the human judges, how can we align the model to produce outputs that are ultimately good rather than bad? We think of this as a mechanism design problem: we want to find ways to harness human metacognition, the wisdom of the crowds, collective deliberation or debate, and the use of AI models as thought partners to improve alignment.

We have both scientific and engineering goals. The engineering goals focus on post-training pipeline design and are geared towards finding ways to improve alignment. The scientific questions concern the nature of the cognitive or social processes that make weak-to-strong alignment more or less challenging. We think this is an interesting problem in its own right (for example, representative democracy can be seen as a special case of weak-to-strong alignment). Some of the scientific goals build on work already conducted in other domains.

Why this matters: In some cases, the model being trained may be more capable than the humans providing training signals. For example, the model may be writing code for human judges, who cannot tell whether or how the code will execute just from reading it.  The problem of correctly rewarding desired behaviours for AI systems when those behaviours are beyond humans’ ability to efficiently judge is called “scalable oversight”.  The idea behind scalable oversight methods is to reduce harder problems to easier ones where human judgment can be relied upon. For example, judges might be assisted by another AI model (a rater assistant), or judge the outcome of a debate between different strong models (safety by debate). 

Examples of subproblems: 

The problems summarised below are inspired by the initial thinking from Irving and Askell, 2019 (this paper focuses on debate, but the ideas are relevant to other methods). 

  1. Weighting responses by their confidence provides higher-precision aggregate estimates. Can we improve alignment by weighting training signals by the confidence expressed by human raters? We could do this either within rater (downweight specific choices) or between raters (downweight less calibrated individuals). See the sketch after this list.
  2. Can we improve alignment using the ‘wisdom of the crowds’, by pooling the (confidence-weighted) responses of multiple human raters? For example, by adopting techniques similar to those used in Bahrami et al., 2010.
  3. If humans are provided with the opportunity to discuss their preferences before assigning a rating, does this improve alignment? Related to Bergman et al., 2024. Note that discussion can confer outsize effects on collective accuracy (Navajas et al., 2018).
  4. Does giving raters more time to reflect help or hurt the alignment process, considering the relative quantity of training data produced? Can people learn to discern how much time they should allocate to a choice, given their sensitivity? (Optimally, both very hard and very easy choices should be made quickly; Masis et al., 2023.) Do humans do this when judging AI outputs?
  5. When judging messages from experts, humans can judge their trustworthiness based on surface features (such as the tone, confidence, or crispness with which they are expressed), even if they are not experts themselves. Do similar signals exist for AI models in expert domains, possibly accessible via interpretability tools?
  6. If human users are given the option to query the strong model, does this improve their ability to discern whether its answers are accurate? What queries work best? Do people tend to ask questions that maximise information gain relevant to the problem? See Rothe et al., 2018 for work that has studied the quality of human questions.
  7. Where humans have a (potentially unlimited) number of weak AI systems that could be used to help, how can these systems best be used to assist with the rating process? For example, do humans tend to ignore the assistance (overconfidence) or follow it slavishly (overreliance), relative to the optimum (conditional on how accurate the model is)? See Bridgers et al., 2024 for more details concerning rater-assist and hybridisation methods.
  8. Suppose we are able to create two copies of the strong model, and encourage one to point out the flaws in the arguments of the other. Would humans be able to discern which model was correct? See Irving et al., 2018 and Michael et al., 2023.
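
To make subproblems 1 and 2 concrete, here is a minimal sketch of confidence-weighted preference aggregation across raters. It is illustrative only: the function name, the confidence scale, and the calibration weights are assumptions for exposition, not part of any existing pipeline.

```python
import numpy as np

def aggregate_preferences(votes, confidences, calibration_weights=None):
    """Combine pairwise preference votes from several raters into one label.

    votes               : +1 (prefer output A) or -1 (prefer output B) per rater
    confidences         : per-vote confidence in [0.5, 1.0]
    calibration_weights : optional per-rater weights (e.g. historical calibration);
                          defaults to equal weighting
    """
    votes = np.asarray(votes, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    if calibration_weights is None:
        calibration_weights = np.ones_like(votes)
    else:
        calibration_weights = np.asarray(calibration_weights, dtype=float)

    # Within-rater weighting: scale each vote by how confident the rater was.
    # Between-rater weighting: scale by how well calibrated the rater has been.
    weights = (confidences - 0.5) * calibration_weights
    score = np.sum(weights * votes) / (np.sum(weights) + 1e-9)

    # Map the aggregate score to a soft preference label in [0, 1],
    # which could replace a hard 0/1 label in reward-model training.
    return 0.5 + 0.5 * score

# Example: two raters moderately prefer A; a poorly calibrated rater prefers B.
label = aggregate_preferences(
    votes=[+1, +1, -1],
    confidences=[0.7, 0.8, 0.95],
    calibration_weights=[1.0, 1.0, 0.4],
)
print(label)  # > 0.5, i.e. the pooled judgment favours output A
```

The same structure extends to learning the between-rater weights over time, for instance by updating them from each rater’s historical agreement with known-ground-truth items.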

 

Suggestions to learn more: We’d recommend reading Bridgers et al., 2024 and Irving and Askell, 2019.

Running scalable oversight protocols with humans 

Problem summary: As scalable oversight protocols are developed, they should be empirically tested – and where LLMs are not yet capable enough to carry out the protocols, human participants can be used instead.

Why this matters: Scalable oversight is the problem of correctly rewarding desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A scalable oversight method is therefore needed to supervise for honesty, helpfulness and harmlessness in models with capabilities exceeding human judges. 

Examples of subproblems: 

Established subproblems: 

  1. Investigate the experimental design for studies of this type and conduct small-scale trial experiments to improve future work in the area. As an example, previous OpenAI human studies surfaced flaws in debate protocols (Barnes, 2020), where malicious debaters could win using ambiguous language or obfuscated arguments with hard-to-find flaws.
  2. Barnes and Christiano, 2020 found a failure mode where debaters could win through ambiguous language, and proposed an improved "cross-examination" scheme to mitigate this failure. However, more recent human debate experiments such as Michael et al., 2023 haven't used cross-examination. Can we replicate the benefit of cross-examination in human experiments? This requires careful design, as cross-examination uses snapshots of the mental state of a debater, which cannot be fully replicated with humans.
  3. We have recent theoretical work on when the obfuscated arguments problem from (1) can be avoided, introducing a new “prover-estimator debate” scalable oversight protocol (Brown-Cohen et al., 2025). However, it has not been empirically tested. Implement prover-estimator debate with human participants and measure its effectiveness.

New questions where we are not aware of prior work: 

  1. Scalable oversight protocols depend on stability (small changes in inputs shouldn't produce drastically different outputs). We may not be able to prove the stability of our algorithms. Conduct empirical experiments with human subjects to examine the effects of small perturbations on scalable oversight protocols in practice; a schematic measurement harness is sketched below.
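
As a sketch of what such an experiment could measure, the harness below estimates a "flip rate": how often a protocol's verdict changes when its input is lightly perturbed. Here `run_protocol` and `perturb` are placeholders for the oversight protocol under test and the perturbation operator (e.g. a light paraphrase of the question); both are assumptions for illustration.

```python
import random

def flip_rate(run_protocol, perturb, inputs, n_perturbations=10, seed=0):
    """Estimate how often a protocol's verdict changes under small perturbations.

    run_protocol(x) -> bool : placeholder for the oversight protocol under test
    perturb(x, rng) -> x'   : placeholder for a small perturbation of the input
    """
    rng = random.Random(seed)
    flips, total = 0, 0
    for x in inputs:
        baseline = run_protocol(x)          # verdict on the original input
        for _ in range(n_perturbations):
            if run_protocol(perturb(x, rng)) != baseline:
                flips += 1                  # verdict changed under perturbation
            total += 1
    return flips / total                    # 0.0 = perfectly stable; higher = less stable
```

With human participants, `run_protocol` would wrap a full human-judged run (e.g. a debate round), so the practical question becomes how many perturbed repetitions are affordable per input.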

Suggestions to learn more: We’d recommend reading Irving and Askell, 2019. Previous research in this area includes Parrish et al., 2022, Michael et al., 2023, and Rahman et al., 2025.

Developing and Evaluating Processes for Human Auditing of Potentially Scheming Models

Problem summary: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques like monitoring by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt, 2024).

The early stages of an AI control protocol will usually involve automated filtering and modification via machine-learning classifiers or other tools. However, we expect that any costly action taken as a result of these automated filters will eventually require human confirmation. It will therefore be critical to understand how control protocols involving humans operate, both by applying prior knowledge from cognitive science about human behaviour and cognitive biases, and via end-to-end tests of control protocols involving humans.

Why this matters: Control measures provide a practical safety approach to mitigating loss-of-control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using current capabilities, making them particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems, see Korbak et al., 2025a.

Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labour. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing.

Some valuable research directions here include:

  • Running trials with humans to evaluate different protocols for auditing, aiming to make them cheap and effective. This likely includes studying uplift (and risks) from giving human auditors access to various bespoke AI tools.
  • Measuring and mitigating cognitive biases that result from data imbalance. By default, a human auditor in a control protocol might see a stream of data which is nearly all fine, making sustained attention difficult if the task is to catch rare failures (see the worked example after this list). We are interested in exploring potential mitigations to this problem.
  • Identifying relevant settings which allow for evaluating techniques for auditor uplift, e.g. spotting real vulnerabilities in unfamiliar codebases or suspicious activity in settings like SHADE-Arena (Kutasov et al., 2025). Some relevant settings might have to be developed from scratch—for example, we might want to be able to audit potential data poisoning attempts.
  • Developing mechanisms (e.g. redactors or paraphrasers) that can handle practical implementation issues around sensitive information. For example, the suspicious action may have been performed in the context of work with privileged or highly sensitive information that human auditors cannot be given access to.
  • Developing auxiliary methods to aid with few-shot catastrophe prevention (Greenblatt and Shlegeris, 2024), such as proliferation and interrogation techniques (focusing on methods that particularly benefit from execution by or with the help of competent humans).
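
As an illustration of the data-imbalance point above, the short calculation below applies Bayes' rule to show how rarely an auditor's flag corresponds to a genuine failure when failures are scarce. The hit and false-alarm rates are hypothetical numbers chosen for exposition, not empirical estimates.

```python
def posterior_malicious(base_rate, hit_rate, false_alarm_rate):
    """P(action is malicious | auditor flags it), by Bayes' rule."""
    p_flag = hit_rate * base_rate + false_alarm_rate * (1 - base_rate)
    return hit_rate * base_rate / p_flag

# An auditor who catches 90% of malicious actions and wrongly flags 5% of
# benign ones still mostly raises false alarms once attacks become rare.
for base_rate in (1e-1, 1e-3, 1e-5):
    print(base_rate, round(posterior_malicious(base_rate, 0.9, 0.05), 4))
# ~0.667 at 1-in-10, ~0.018 at 1-in-1,000, ~0.0002 at 1-in-100,000
```

This is the core difficulty behind sustained-attention failures in auditing: at realistic base rates, almost every flag a human reviews will turn out to be benign, which is one motivation for the mitigation and uplift directions listed above.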
