From a scientific standpoint, understanding and measuring the key factors that influence the learning process—and turning those factors into levers of control—is a major boon to aligning AI systems with complex human values. This is because alignment to our intended goals can break down as the result of learning failure: for example, a model could be doing well on the loss it was trained on, yet act unsafely in a different setting. This may happen when the input distribution shifts, an adversary supplies a worst-case prompt, the reward signal is misspecified, human oversight fails, or a hidden optimiser emerges. Learning theory helps us understand these breakpoints. By providing predictive indicators of failure, impossibility results, and new algorithmic designs, it can frame alignment problems in more tractable terms. This section is therefore organised around the different stages at which an algorithm can fail to learn, and how theory can predict, detect, or prevent such failures.
Training Dynamics
Problem summary: Modern foundation models are created by an industrial process of running simple update rules for millions of steps, over billions of parameters and trillions of tokens. Although we control only a few dials, such as the choice of optimiser, learning-rate schedule, batch size, architecture, and data curriculum, these choices ultimately define a very high-dimensional complex system whose trajectories decide what the network becomes. Empirically, we observe phenomena such as capability jumps, scaling laws, and critical periods, yet lack a concise, predictive theory that connects these macroscopic phenomena to the microscopic updates. A programme of ‘training-dynamics’ research would ask: what are the relevant coarse-grained variables of this system; what attractors and phases exist; how do perturbations steer a run toward or away from them; and can we model the continuous-time limit well enough to anticipate where a given configuration will land?
Why this matters: Robustly aligning AI systems with human goals relies on being able to steer learning rather than just evaluate the final checkpoint. If we could forecast when a model is likely to develop scheming tendencies, or which learning-rate schedule will tip the model towards a basin of harmful behaviour, we could intervene before harm is done and before money is spent. An effective theory of training would give us levers that scale with compute and operate in real time; absent such a theory, we are essentially building AGI through trial and error while hoping the attractor we hit is the one we want.
Examples of subproblems:
Established subproblems:
- Further develop or investigate existing models of training dynamics (Hu et al., 2023, Karras et al., 2024, Redman et al., 2024, Nunez and Soatto, 2024, Saxe et al., 2019) and critically examine their theoretical assumptions. Given a model of training dynamics, can we predict the sensitivities of model weights to training data? Ideally such methods could be used in reinforcement learning (RL) training to prevent adversarial or collusion failure modes.
- There is already a breadth of literature exploring the inductive bias of stochastic gradient descent (SGD), a popular method for training models in deep learning. In general, which optimiser or architecture choices most strongly tilt inductive bias toward helpful versus deceptive behaviour? For example, Lesci et al., 2025 give a method to estimate the bias induced by a certain choice of tokenization. How can we use theory to engineer good inductive biases (Hu et al., 2025)?
- There is a significant gap between typical training setups and their theoretical models. Many empirical phenomena and engineering practices have not received theoretical attention. We would like either reasonable assurance that these knowledge gaps are inconsequential or a good theoretical understanding of them. We would like to see a rigorous and systematic cataloguing of these mysteries—that is, work that systematically isolates variables, tests hypotheses, and builds toward theoretical understanding. Examples include:
- What role does SGD mini-batch noise (Ziyin et al., 2022) actually play—is it essential for exploration or mere computational convenience?
- What are the effects of momentum (Xie et al., 2022) and other significant innovations for modern optimisers?
- Why do specific batch sizes (Shau et al., 2024) and learning rates (Jin et al., 2023) and their schedules outperform others?
- What makes batch normalization (Ioffe and Szegedy, 2015, Santurkar et al., 2018, Bjork et al., 2018) and layer normalization (Ba et al., 2016, Xu et al., 2019, Xiong et al., 2020) so effective, and why do they work in different contexts?
- How does quantization-aware training (Nagel et al., 2022, Park et al., 2018, Nahshan et al., 2018) preserve accuracy despite radical weight discretization?
- Why do massive activations (specific activations with significantly larger values than others) exist (Sun et al., 2024), and what relation do they have with RoPE (rotary positional encoding; Jin et al., 2025)?
- Different training runs with identical architectures can converge to qualitatively different solutions—some models memorize while others generalize, some develop modular internal structures while others remain entangled, some exhibit harmless tool-use while others may develop deceptive tendencies. These distinct endpoints suggest that training dynamics contain multiple attractors and phases. Can we map the important parts of the ‘phase diagram’ systematically across different training setups? The key questions are:
- First, what macroscopic variables actually distinguish these phases? For example, Singular Learning Theory suggests tracking measurements such as the training loss and the local learning coefficients (Lau et al., 2024, Chen et al., 2023).
- Second, which training conditions (initialization scale, learning rate schedules, data ordering, architectural bottlenecks) reliably select for specific attractors?
- Third, can we monitor these phase-determining variables efficiently during training at scale, providing opportunity for early intervention?
- Finally, what control mechanisms are available and do we have a reliable scientific theory for their efficacy? Are point interventions at critical moments during training sufficient, or do we need continuous feedback control throughout training?
- Biological neural networks exhibit critical periods where certain capabilities must develop or be permanently impaired (Hubel et al., 1970, Coulson et al., 2022, Byrne and Jerbi, 2022). Artificial neural networks seem to exhibit similar properties (Achille et al., 2018, Kleinman et al., 2023, Nakaishi et al., 2024, Chimoto et al., 2024). Some evidence suggests early training shapes inductive biases irreversibly (Achille et al., 2018), but systematic investigation is lacking. Can we map which capabilities have critical periods versus which can be learned anytime? Does the ‘critical brain hypothesis’ from neuroscience—that networks likely self-organize to the edge of phase transitions—apply to artificial systems?
- Kaplan et al., 2020 and Hoffmann et al., 2022 document precise power-law relationships between loss, compute, and model size. However, we lack mechanistic explanations. What are the factors that can impact scaling exponents? Recent work suggests connections to data manifold dimension (Sharma and Kaplan, 2022), feature learning dynamics (Bordelon et al., 2024) and factors related to the distribution of patterns in data (Michaud et al., 2023). Are there additional factors that affect the emergence of scaling law training behaviour? What are the relevant and irrelevant details or hyperparameters of a training algorithm that affect the scaling exponents?
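As a concrete, low-tech starting point for the scaling-law question above, the sketch below fits a saturating power law L(N) ≈ a·N^(−α) + c to (parameter count, loss) pairs. The data, constants, and fit-by-grid-search are illustrative assumptions; in practice one would substitute measured losses from real training runs and study how the exponent shifts as data composition or architecture changes.

```python
# A minimal sketch (not from the original text): estimating a scaling exponent
# from (model size, loss) pairs. The data below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements: parameter counts N and final losses L.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9])
true_alpha, true_a, irreducible = 0.076, 8.0, 1.7          # assumed ground truth
L = true_a * N ** (-true_alpha) + irreducible + rng.normal(0, 0.01, N.shape)

def fit_power_law(N, L):
    """Fit L ~ a * N^(-alpha) + c by grid-searching the irreducible loss c,
    with a log-log linear fit for (a, alpha) at each candidate c."""
    best = None
    for c in np.linspace(0.0, L.min() * 0.999, 200):
        y = np.log(L - c)
        slope, intercept = np.polyfit(np.log(N), y, 1)
        resid = np.sum((y - (slope * np.log(N) + intercept)) ** 2)
        if best is None or resid < best[0]:
            best = (resid, -slope, np.exp(intercept), c)
    _, alpha, a, c = best
    return alpha, a, c

alpha, a, c = fit_power_law(N, L)
print(f"estimated exponent alpha={alpha:.3f}, prefactor a={a:.2f}, irreducible loss c={c:.2f}")
```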
New questions where we are not aware of prior work:
- Neural network training involves billions of parameters, but useful descriptions might require only a few macroscopic variables. What are these coarse-grained coordinates—spectral properties of weight matrices, alignment between gradients and data, effective dimensionality of representations? Can we write down dynamical equations for these variables that predict training outcomes? When the system is perturbed (corrupted data, adversarial gradients, hyperparameter shifts), can the effective theory predict whether training will recover or diverge?
- Reinforcement learning introduces additional complexity: the data distribution that the model observes shifts as the policy improves. How do attractors of such dynamics and their transitions manifest in such conditions? Can we extend effective theories from supervised learning to handle this co-evolution of policy and experience?
- If gradient descent can create internal optimization processes, are there different phases with different mesa-objectives or ones without them at all? Can we detect and control when we transition into such states? Could we reconstruct potential mesa-objectives by analyzing the history of checkpoints, gradients, and training data? What minimal logging during training would enable post-hoc detection of emergent optimization?
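As a minimal illustration of what ‘coarse-grained coordinates’ and ‘minimal logging’ could look like in practice, the toy sketch below trains a small MLP on synthetic data while recording two candidate macroscopic variables per step: the effective rank of each weight matrix and the cosine alignment between successive gradients. The model, data, and choice of statistics are assumptions for illustration only; nothing here is specific to frontier models.

```python
# Toy logging of candidate coarse-grained training statistics (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 32), torch.randn(512, 1)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def effective_rank(W: torch.Tensor) -> float:
    """exp(entropy) of the normalised singular-value distribution."""
    s = torch.linalg.svdvals(W)
    p = s / s.sum()
    return torch.exp(-(p * torch.log(p + 1e-12)).sum()).item()

def flat_grad(model) -> torch.Tensor:
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

prev_grad, log = None, []
for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    g = flat_grad(model).detach().clone()
    log.append({
        "step": step,
        "loss": loss.item(),
        "grad_alignment": None if prev_grad is None
                          else torch.nn.functional.cosine_similarity(g, prev_grad, dim=0).item(),
        "eff_ranks": [effective_rank(m.weight.detach()) for m in model if isinstance(m, nn.Linear)],
    })
    prev_grad = g
    opt.step()

print(log[0], log[-1], sep="\n")
```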
Suggestions to learn more: Regarding alignment and learning dynamics we recommend reading Park et al., 2024, Baker et al., 2025 and Piotrowski et al., 2025.
Generalisation and Inductive Bias
Problem summary: Training sets pin down behaviour only on the narrow slice of inputs they contain; infinitely many functions can match that data while diverging elsewhere. Which of those functions a model actually learns is determined by its inductive biases—the preferences introduced by optimiser, architecture, curriculum, tokenisation, regularisation and inference setup. Because deployment inevitably throws new distributions at the system, these biases, rather than the empirical loss, govern how the model will act in the wild. How does each design choice shape the bias, and how do those biases translate into off-distribution behaviour?
Why this matters: Some biases steer models towards robustly learning generalisable concepts, while others might steer them towards fragile heuristics, privacy leakage, or sycophancy. In the worst case, inductive biases could steer the model to be ‘deceptively misaligned’, such that it behaves indistinguishably from an aligned model during training, only to actively pursue harmful goals once deployed. As frontier models are not easily interpretable in deployment, the most practical way to prevent such misgeneralisation is to understand and steer those biases.
Examples of subproblems:
Established subproblems:
- In general, when do neural networks that appear aligned on the training distribution stop generalising that alignment to novel situations—for example, by reverting to proxy goals, leaking private information, or producing deceptive outputs—and how might one begin to investigate this? We would like to think about this in terms of stages in the lifecycle of a model: are there critical periods of training during which a model learns most of its goal-like representations? What aspects of the training data, the model architecture, the optimiser, and the inference setup promote harmful out-of-distribution behaviours in neural networks? How much of the variance in training is due to randomness, and how much do generalisation behaviours depend on randomness in the initialization (Zhao et al., 2025)?
- How likely are LLMs to develop tendencies towards deception or other harmful behaviour? Deceptive goals (Carlsmith, 2023) probably require both length generalisation and out-of-context generalisation (Berglund et al., 2023) which are therefore also of interest.
- Out-of-distribution (OOD) failure modes in LLMs, such as the reversal curse (Zhu et al., 2024) or many-shot jailbreaking (Anil et al., 2024), are not yet understood in terms of the inductive biases or training dynamics that give rise to them. Most current explanations amount to post-hoc stories—“the model latched onto a brittle shortcut”—rather than quantitative theories that could have predicted failure in advance. What general, principled claims can we make about OOD failure modes? What are the simplest toy settings in which an OOD failure mode is the most probable generalisation?
- Can model complexity measures such as local learning coefficients (Lau et al., 2024, Baker et al., 2025, Aoyagi, 2025, Hitchcock, 2024) measured over pretraining or finetuning predict downstream capability or honesty phase-transitions? More generally, can we use singular learning theory to characterise a ‘catalogue’ of phases visited by training trajectories (Chen and Murfet, 2025)?
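As a rough, hedged illustration of the local learning coefficient (LLC) question above, the sketch below implements an SGLD-based estimator in the spirit of Lau et al., 2024 on a toy regression model: sample from a tempered posterior localised around a trained parameter w* and report n·β·(E[Lₙ(w)] − Lₙ(w*)). The model, data, and hyperparameters (step size, localisation strength, chain length) are illustrative assumptions, not a validated measurement recipe.

```python
# Rough SGLD-based LLC estimator on a toy model (PyTorch); hyperparameters illustrative.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 1024
X, y = torch.randn(n, 8), torch.randn(n, 1)
model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()

# 1) Train to a local optimum w*.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad(); loss_fn(model(X), y).backward(); opt.step()

w_star = [p.detach().clone() for p in model.parameters()]
L_star = loss_fn(model(X), y).item()

# 2) SGLD chain tethered to w*.
beta = 1.0 / torch.log(torch.tensor(float(n)))
eps, gamma, burn_in, steps = 1e-5, 100.0, 200, 1000
sampler = copy.deepcopy(model)
losses = []
for t in range(steps):
    idx = torch.randint(0, n, (64,))
    batch_loss = loss_fn(sampler(X[idx]), y[idx])
    grads = torch.autograd.grad(batch_loss, list(sampler.parameters()))
    with torch.no_grad():
        for p, g, p0 in zip(sampler.parameters(), grads, w_star):
            # gradient-descent drift plus tether to w*, then Gaussian noise of variance eps
            drift = 0.5 * eps * (beta * n * g + gamma * (p - p0))
            p.add_(-drift + torch.randn_like(p) * eps ** 0.5)
    if t >= burn_in:
        with torch.no_grad():
            losses.append(loss_fn(sampler(X), y).item())

llc_hat = n * beta.item() * (sum(losses) / len(losses) - L_star)
print(f"estimated local learning coefficient ~ {llc_hat:.2f}")
```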
New questions where we are not aware of prior work:
- How rapidly do models forget safety constraints during continual finetuning, and can we upper-bound this forgetting rate analytically? (A toy measurement sketch appears after this list.)
- In reinforcement learning settings, can we define a deceptive-goal indicator as a specific measurable statistic that correlates with the presence of any internal objective that diverges from the training objective and is being actively concealed by the model (deceptive alignment)? At what scale do deceptive-goal indicators spike, and can simple scaling laws forecast the spike location?
- We want learning-theoretic guarantees for agents that can actually run on frontier-scale hardware, allowing alignment proofs to apply to real systems rather than to theoretical proxies such as Solomonoff inductors (Kosoy, 2023). Can we devise a computable, Occam-style prior or hypothesis language (e.g. string-machines (Kosoy, 2023), compositional polytope-MDPs (Kosoy, 2022)) in which both Bayes-optimal planning and Bayesian-regret can be computed in time polynomial in description length?
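Returning to the forgetting-rate question above: before seeking an analytic upper bound, it helps to have even a crude empirical forgetting curve to bound. The toy sketch below (an assumed setup, not from the text) trains a small classifier on a stand-in ‘safety’ task A, finetunes it on an unrelated task B, and records how quickly task-A accuracy decays.

```python
# Toy forgetting-curve measurement during continual finetuning (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(feature_idx, n=2000):
    X = torch.randn(n, 16)
    return X, (X[:, feature_idx] > 0).long()

XA, yA = make_task(0)            # stand-in "safety" behaviour
XB, yB = make_task(1)            # subsequent finetuning task

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

def accuracy(X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

# Phase 1: learn task A.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad(); loss_fn(model(XA), yA).backward(); opt.step()

# Phase 2: finetune on task B while logging how quickly task-A accuracy decays.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
curve = [accuracy(XA, yA)]       # task-A accuracy before any finetuning on B
for _ in range(300):
    opt.zero_grad(); loss_fn(model(XB), yB).backward(); opt.step()
    curve.append(accuracy(XA, yA))

first_below = lambda thresh: next((i for i, a in enumerate(curve) if a < thresh), None)
print(f"task-A accuracy before finetuning: {curve[0]:.2f}")
print(f"steps until task-A accuracy < 0.9:  {first_below(0.9)}")
print(f"steps until task-A accuracy < 0.75: {first_below(0.75)}")
```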
Suggestions to learn more: We recommend reading Hoogland et al., 2024 on alignment and inductive biases. Previous research in this area includes Betley et al., 2025 and Greenblatt et al., 2024. Regarding alignment and learning dynamics we recommend reading Park et al., 2024, Baker et al., 2025 and Piotrowski et al., 2025. For more on the reversal curse see Lv et al., 2024. For more on out-of-context learning, see Treutlein et al., 2024 and Betley et al., 2025b. For more on deceptive alignment see Hubinger et al., 2019.
Scalable Oversight and Preference Learning
Problem summary: People can’t reliably judge AI behaviour in every situation the AI might encounter. Scalable oversight is the problem of correctly rewarding desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. Methods such as debate (Irving et al., 2018), recursive reward modelling (Leike et al., 2018), and task decomposition (Christiano et al., 2018, Wu et al., 2021) try to turn these imperfect human ratings into much stronger oversight. Other methods, such as weak-to-strong generalisation (Burns et al., 2023) and consistency training (Wen et al., 2025, Paleka et al., 2024, Bürger et al., 2024, Burns et al., 2023) move away from human ratings entirely. We still need evidence—both experimental and theoretical—on when these methods really work, how big the remaining errors are, and whether those leftover errors are the kind that an adversary could exploit.
Why this matters: Most AI companies try to keep their language models safe by fine-tuning them to follow human preferences. Larger future systems will likely rely on AI-assisted methods to judge answers we cannot evaluate directly. If that oversight quietly breaks down in high-stakes areas—say advanced science, software engineering, or biotech lab work—a powerful model could cause harm without obvious warning signs.
Examples of subproblems:
Established subproblems:
- If a proof about a scalable oversight method is stated in terms of a game-theoretic equilibrium (Hilton et al., 2025), the resulting safety guarantees only hold if training sufficiently converges (Buhl et al., 2025). Can learning theory detect whether a model has converged to a local minimum?
- Online training can be used to minimise distribution shift, but it only provides an average-case regret bound: single errors could still occur. Scalable oversight protocols such as prover-estimator debate (Brown-Cohen et al., 2025) show that some property, such as honesty, will be achieved in a (1-ε) fraction of cases (Buhl et al., 2025). Are there ways to maintain an error rate of ε at deployment time?
- We would like to be able to deploy AI systems in 'high-stakes' settings where even a single error could cause large-scale harm (Christiano, 2021). These errors are likely to arise in part because of the distribution shift from training to deployment. Can we use online training-like techniques during inference, on candidate actions instead of on final actions?
- There are a variety of methods that try to improve accuracy or safety in LLMs through abstention: refusing to provide potentially harmful or misleading answers (Wen et al., 2025). Allowing agents to abstain could reduce risk, but at the cost of removing models’ responses to these challenging inputs from our training protocol. Can we apply this to develop principled methods that let agents pass on unsafe or low-confidence items while still guaranteeing that the overseer learns as much as possible and that overall regret to the human remains bounded?
- Some work has been done on the self-calibration of models (Kadavath et al., 2022, Steyvers et al., 2025, Kapoor et al., 2024). We would like to determine whether sufficient information is contained within AI systems to construct contexts in which they might behave maliciously: the problem of 'eliciting bad contexts' (Irving et al., 2025). In theory, a solution to eliciting bad contexts might be useful for generating datasets that mimic deployment sufficiently effectively that we can minimise distribution shift from training to deployment. This means that training methods to produce aligned systems are more likely to generalise to deployment.
- Is there a ‘small circuit on top of heuristic functions’ model setting in which eliciting bad contexts is possible, in a nontrivial way?
- Can we use ‘susceptibilities’ in singular learning theory (Baker et al., 2025) to shed light on what kind of change in training data would be necessary to elicit a target change in behaviour?
New questions where we are not aware of prior work:
- State-of-the-art systems already call other models and execute code internally; understanding how such self-referential computation generalises is essential for preventing hidden mesa-optimisers or deceptive planning. Can we formalise agents with an internal compute channel, design a query-efficient procedure that learns the unknown program governing that channel, and prove an epistemic-regret or performance bound?
- Can a scalable oversight method abstain on unsafe questions and still guarantee bounded regret on utility? This might be useful if we worry that certain adversarial settings could also compromise the oversight model, or if we only want to evaluate a few of the many possible outputs in the setting, which may in fact be the default for some methods. Formalise selective versions of existing methods, such as ‘selective debate’ or ‘selective consistency training’. Is this a useful way to learn preferences and judge AI outputs?
- What does the model actually learn during scalable oversight schemes? Monitor loss curvature and other statistics—layer CKA, effective rank, etc.—during fine-tuning, and evaluate whether the model undergoes sharp transitions (a minimal monitoring sketch follows this list).
- In the debate setting, rules can be provided for debaters to follow and check one another against. One hypothesis justifying this is that truthful explanations form smaller, more self-consistent manifolds than deceptive ones. This suggests that as the number of logically consistent rules increases, the volume of false-argument space shrinks faster than that of true-argument space. Formalize this hypothesis and provide an empirical test of it.
- Can we derive a critical learning rate for debate finetuning that guarantees convergence to an equilibrium where honesty is the optimal strategy?
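A minimal version of the monitoring idea above, on a toy model with synthetic data (both assumptions): after each fine-tuning epoch we log the linear CKA between a layer’s current representations and its representations at the start of fine-tuning, together with that layer’s effective rank, to look for sharp representational transitions.

```python
# Toy representation-monitoring loop during fine-tuning (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(1024, 32), torch.randn(1024, 1)
body = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head = nn.Linear(64, 1)
opt = torch.optim.Adam(list(body.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

def linear_cka(A: torch.Tensor, B: torch.Tensor) -> float:
    """Linear CKA between two (n, d) representation matrices."""
    A = A - A.mean(dim=0); B = B - B.mean(dim=0)
    hsic = (B.T @ A).norm() ** 2
    return (hsic / ((A.T @ A).norm() * (B.T @ B).norm())).item()

def effective_rank(A: torch.Tensor) -> float:
    s = torch.linalg.svdvals(A)
    p = s / s.sum()
    return torch.exp(-(p * torch.log(p + 1e-12)).sum()).item()

with torch.no_grad():
    reference = body(X).clone()          # representations at the start of fine-tuning

for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(head(body(X)), y)
    loss.backward()
    opt.step()
    with torch.no_grad():
        H = body(X)
        print(f"epoch {epoch:02d}  loss {loss.item():.3f}  "
              f"CKA-to-start {linear_cka(H, reference):.3f}  eff-rank {effective_rank(H):.1f}")
```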
Suggestions to learn more: Regarding eliciting bad contexts, we’d recommend reading Irving et al., 2025 and Baker et al., 2025. Regarding debate for AI alignment, we recommend reading Irving et al., 2018, Brown-Cohen et al., 2025, and Buhl et al., 2025. Regarding scalable oversight in general, we recommend reading Radhakrishnan et al., 2023 and Bowman et al., 2022.
Adversarial / Worst-Case Robustness
Problem summary: We want AI systems to be deployed in critical areas such as medicine, law, or policy-making, but these are areas where even a single error could cause substantial harm. This also applies to automated AI research settings, where a single error could allow self-exfiltration of an AI model. The triggering failure might take the form of a single adversarial sample, a poisoned data point or gradient, or a carefully chosen latent perturbation. In these cases we might try to train away worst-case behaviours, elicit responses in such a way that makes worst cases impossible, or reduce the impact of such behaviours.
Why this matters: Safety arguments that rely on ‘average-case’ behaviour do not provide high assurance if an attacker can force the model into rare, but unsafe states. Robustness guarantees are our only line of defence once a model operates in open environments or interacts with malicious users.
Examples of subproblems:
Established subproblems:
- Adversarial training can generalize poorly (Raghunathan et al., 2019) and harm benign capabilities (Tsipras et al., 2019), but it is important for removing worst-case behaviours. Relaxed adversarial training (Hubinger, 2019) and latent adversarial training (Jermyn, 2022, Sheshadri et al., 2024) offer possible alternatives. Can we develop these into empirical methods that achieve a high success rate at removing data-poisoned backdoors, or, alternatively, benchmark them by their success rate at resisting held-out jailbreaks? (A toy latent adversarial training sketch follows this list.)
- Some promising proposals for training AI safely such as prover-estimator debate (Brown-Cohen et al., 2025) show that some property, such as honesty, will be achieved in a (1-ε) fraction of cases (Buhl et al., 2025). Are there ways to restrict what systems will be able to do in the remaining ε error cases?
- Consistency training (Wen et al., 2025, Paleka et al., 2024, Bürger et al., 2024, Burns et al., 2023) serves as a promising way to fine-tune language models without requiring expensive human preferences. Adversarial training here is done mostly in the setting of labels, using the internal representations of the model to generate consistency. Do these methods still work if the weights themselves are adversarial to human values (like in the case of a model that exhibits emergent misalignment; Betley et al., 2025)?
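To make the latent adversarial training idea above concrete, here is a simplified sketch on a toy classifier; it is an illustration under assumed hyperparameters, not the procedure of the cited papers. Each step finds a bounded perturbation of an intermediate activation that maximises the loss, then takes a training step against that perturbed activation.

```python
# Simplified latent adversarial training on a toy classifier (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 16)
y = (X[:, 0] * X[:, 1] > 0).long()            # toy labels

encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 2)
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
eps, pgd_steps, pgd_lr = 0.5, 5, 0.2          # illustrative attack budget

for step in range(500):
    latent = encoder(X)

    # Inner loop: find a worst-case latent perturbation within an L-inf ball.
    delta = torch.zeros_like(latent, requires_grad=True)
    for _ in range(pgd_steps):
        adv_loss = loss_fn(head(latent.detach() + delta), y)
        grad, = torch.autograd.grad(adv_loss, delta)
        with torch.no_grad():
            delta += pgd_lr * grad.sign()
            delta.clamp_(-eps, eps)

    # Outer loop: train the model on the adversarially perturbed latent.
    opt.zero_grad()
    loss = loss_fn(head(latent + delta.detach()), y)
    loss.backward()
    opt.step()

    if step % 100 == 0:
        with torch.no_grad():
            clean_acc = (head(encoder(X)).argmax(1) == y).float().mean().item()
        print(f"step {step}  robust-train loss {loss.item():.3f}  clean acc {clean_acc:.2f}")
```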
New questions where we are not aware of prior work:
- How much can theory tell us—given certain assumptions about the adversary’s power—about the largest possible harm a system’s mistakes could cause? In other words, can we derive bounds that connect an adversary’s influence to the severity of the resulting failure, ranging from trivial errors (a typo in a discharge note) to catastrophic outcomes (recommending a 100× overdose or steering users toward a bio-threat)?
- When a model is instructed to self-criticise or verify its own output, does this provably improve the bound on the safe set of inputs compared to standard adversarial training? Can we give a PAC-Bayes or compressive bound on the remaining jailbreak probability?
- If we discover an adversarial string after deployment, how much additional data or parameter perturbation is needed to provably rule it out, assuming no retraining from scratch?
- In a prompt injection attack, an attacker includes an instruction in some text the LLM reads, which aims to make an LLM follow the injected instruction to perform an attacker-chosen task (Shao et al., 2024). Can we model a prompt injection attack as a perturbation in prompt‐feature space, and derive a bound on how many malicious shots are needed to force a certain target error rate in a GPT-style LLM?
Suggestions to learn more: Regarding relaxed or latent adversarial training, we would recommend reading Christiano, 2019, Hubinger, 2019, Casper et al., 2024 and Sheshadri et al., 2024. Previous research in this area includes Xhonneux et al., 2024. Regarding recent work on emergent misalignment, we’d recommend reading Betley et al., 2025, Soligo et al., 2025 and Turner et al., 2025.
Data Distribution
Problem summary: The training corpus is the primary way through which we specify what a model should know, value, and ignore. Modern frontier labs already pour thousands of GPU-hours and a great deal of human effort into scraping, deduplicating, filtering, and annotating pre-training corpora. However, the dominant optimization target remains to drive cross-entropy loss down as quickly as possible, not to maintain a precise, mechanistic grasp of what each slice of the corpus is teaching. As a result, even though we do curate data, we still treat the final mixture as an enormous, largely undifferentiated soup. Further, we have neither sufficiently developed a precise scientific language for patterns, regularities, and features in data, nor scalable tools that can detect, add, or ablate those patterns at will. The hope is to map influence paths from specific slices of data to specific model behaviours, automatically discover latent modes of the data distribution in the corpus, and build a precise science of how statistical structure flows from text to weights.
Why this matters: Data is the most accessible and acceptable lever for shaping model cognition. If we could predict that ‘including corpus X reliably yields behaviour Y’, we could preempt privacy leakage, sycophancy, or other undesirable behaviour. Conversely we could also ensure that safety-critical concepts like honesty are robustly learned. Understanding precisely how training data determines model behaviour is likely a necessary step in having high assurance that a training protocol will produce safe AI.
Examples of subproblems:
Established subproblems:
- An influence function is a classic statistical measure for attributing specific model properties and training outcomes to the effect of individual training samples. These functions have been used to better understand the predictions of a black-box model (Koh and Liang, 2017, Grosse et al., 2023, Zhang et al., 2024, Kreer et al., 2025, Adam et al., 2025, Anthropic, 2023). Can we develop scalable methods for discovering meaningful data clusters and reliably attributing model properties to those clusters? For instance, can we identify the minimal data subset that would need to be removed to suppress a specific harmful capability or promote certain safety properties? (A gradient-similarity attribution sketch follows this list.)
- The training corpus is a sample from a certain distribution that contains statistical regularities, or modes (Chen et al., 2025, Saxe et al., 2019), at multiple scales – from character bigrams to thought patterns. Yet we lack systematic methods to identify which patterns are ‘deep’ in the sense of being robustly encoded and fundamental building blocks for other patterns (Lehalleur, 2025). Can we develop scalable methods to identify deep patterns of a data distribution from its samples (Baker et al., 2025)? Using knowledge of model architecture, can we predict which modes will be reliably learned, in what order, and at what data-model scale?
- There is an extensive body of work (Tay et al., 2022, Bahri et al., 2021, Kaplan et al., 2020) studying how the performance of a system changes as variables are scaled up and down. These scaling laws are known to be affected by changes in data composition, e.g. the size of the pretraining dataset or its linguistic diversity. Why does this happen, and what effect does this have on the model relative to other changes such as the choice of architecture or optimiser?
- Curriculum learning is a machine learning method where models are trained with attention to the difficulty of the different slices of the training corpus (Soviany et al., 2021). How sensitive are the model’s eventual behaviours to permutations of the presentation order of those slices? More formally, for K slices, a given architecture and optimiser, what is the variance in post-training behaviour B when the data order is drawn uniformly at random from the K! possible curricula? Can we bound or predict this ‘curriculum variance’ as a function of model scale and slice statistics?
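As a toy version of the attribution question above, the sketch below uses a first-order, gradient-similarity score in the spirit of influence-function and TracIn-style methods (it is not the cited papers’ exact estimators): each training example is scored by the dot product between its loss gradient and the loss gradient of a query example at the final checkpoint, and examples are ranked by that score.

```python
# Toy gradient-similarity data attribution at a final checkpoint (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = (X[:, 0] > 0).long()
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()

# Train briefly.
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad(); loss_fn(model(X), y).backward(); opt.step()

def loss_grad(x, label):
    """Flattened gradient of the per-example loss w.r.t. all parameters."""
    loss = loss_fn(model(x.unsqueeze(0)), label.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

query_idx = 0
g_query = loss_grad(X[query_idx], y[query_idx])

scores = torch.stack([g_query @ loss_grad(X[i], y[i]) for i in range(len(X))])
top = scores.topk(5).indices.tolist()
print("training examples most aligned with the query's loss gradient:", top)
```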
New questions where we are not aware of prior work:
- There is an inverse problem to discovering patterns and regularities in the data distribution: given a specification of desired model properties, can we engineer or alter an existing distribution to achieve what is specified? What is required to attain such a ‘data distribution compiler’?
- Suppose a pre-training corpus is partitioned into K identifiable modes (dialects, task genres, ‘cultural clusters’, etc.). Can we develop a learning-theoretic framework that predicts the higher-order ‘interaction terms’ that arise when two or more modes are mixed—i.e. behaviours that are absent when the modes are trained on in isolation but emerge only from their joint presence? How large can such interaction effects be, and under what conditions (relative mode sizes, curriculum order, model width) do they amplify or cancel?
Suggestions to learn more:
Regarding the effect of data on model structure and learning, we recommend reading Lehalleur, 2025. For more on neural scaling laws, we recommend reading Michaud et al., 2024, He et al., 2024, Liu et al., 2025.
Specification Error and Reward Hacking
Problem summary: Modern AI systems are often trained against proxy objectives: reward models, preference classifiers, or sparse success criteria. These proxies can be gamed: AI systems can gain high scores without actually performing the desired behaviour (Skalse et al., 2022, Weng, 2024, METR, 2025). We would like to characterise when optimising for proxy objectives leads to highly undesirable policies, and how to prevent or detect it.
Why this matters: A perfectly robust, perfectly generalising policy is still dangerous if the objective it pursues contradicts human values. In practice, training advanced AI systems requires us to distill complex or hard-to-specify preferences into some computable reward signal. To ensure systems trained in this way are safe, we need this reward signal to actually incentivise learning our preferences.
Examples of subproblems:
Established subproblems:
- Can we develop quick and realistic probes that flag when a policy has discovered a short-horizon strategy that increases the learnt reward while lowering some independently measured ground-truth utility? Concretely, you might train two reward models on disjoint feature sets; does sudden divergence between their gradients predict upcoming reward hacking? (A toy version of this probe is sketched after this list.)
- Loosely speaking, the Littlestone dimension of a hypothesis class, or of a learner that is restricted to that class, is a measure of the difficulty of the class in an adversarial, sequential-prediction setting. Ambiguous online learning (Kosoy, 2025) is a recent framework which allows a learner to provide multiple predicted labels. Can we extend this to deep networks? If so, does its ambiguous Littlestone dimension (Kosoy, 2025) predict OOD generalisation or jailbreak resistance (Kumar et al., 2024, Hasan et al., 2024) better than classic Littlestone bounds (Shalev-Shwartz, 2011, Littlestone, 1988)?
- Hidden rewards are utility functions defined over latent states that are never directly observed (Kosoy, 2025, Dogan et al., 2023). Real human values refer to things like ‘suffering’ or ‘biodiversity’, not token streams. Partially observable Markov decision processes (POMDP) have latent physical states that must be inferred, but still assume that the reward is known and returned to the agent each step. Inverse reinforcement learning, preference-based reinforcement learning, and other similar methods do learn a reward model but base it on the feature vector or on observation history, rather than working to specify a true variable. Some work (Ha and Schmidhuber, 2018, Richens and Everitt, 2024, Kipf et al., 2020) tries to learn latent variables meant to correspond to ‘objects’ or ‘physical causes’, and then learn policies that optimise a reward defined on those latents. But what counts as the correct latent space is left to the unsupervised learner; there is no requirement that it line up with ethical or value-laden concepts.
- We should try to specify and learn such latent-state objectives as it seems important to non-trivial outer alignment. Can we formalize them, and prove learnability or impossibility results?
- One formulation of this might be: suppose the world is a POMDP with an unobserved state S. There exists a true reward function R*: S → ℝ that encodes what humans care about. The agent gets only observations oₜ, chooses actions aₜ, and perhaps receives very sparse labels L(o_{1:t}, a_{1:t}) such as preference comparisons or occasional scalar scores. Then we would ask the following questions:
- Can we write down conditions under which R* is (or is not) identifiable from the data the agent can ever get?
- Given identifiability, can we bound the sample complexity of learning R* to some range of accuracy in some norm?
- If we cannot identify R* exactly, can we at least recover a conservative envelope R̲ ≤ R* ≤ R̅ that is tight enough to rule out catastrophic actions?
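A toy version of the dual-reward-model probe mentioned at the start of this list, with made-up utility, data, and models (all assumptions): fit two reward models on disjoint halves of the features, optimise a small generator ‘policy’ against one of them, and track the cosine similarity between the policy-parameter gradients induced by each reward model alongside the true utility. A collapse in that similarity while the optimised proxy keeps rising is the kind of early-warning signal the probe is meant to surface.

```python
# Toy dual-reward-model divergence probe (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)

def true_utility(x):                         # assumed latent ground truth: bounded in every feature
    return torch.tanh(x).sum(dim=1)

# Fit two reward models on disjoint feature halves of benign data.
X_benign = torch.randn(2048, 8)
U_benign = true_utility(X_benign)
rm_a = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
rm_b = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
for rm, cols in [(rm_a, slice(0, 4)), (rm_b, slice(4, 8))]:
    opt = torch.optim.Adam(rm.parameters(), lr=1e-2)
    for _ in range(300):
        opt.zero_grad()
        nn.functional.mse_loss(rm(X_benign[:, cols]).squeeze(-1), U_benign).backward()
        opt.step()
for p in list(rm_a.parameters()) + list(rm_b.parameters()):
    p.requires_grad_(False)                  # freeze the reward models

# A small deterministic "policy": a generator mapping noise to feature vectors.
policy = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def policy_grad(reward_model, cols, z):
    """Flattened gradient of the mean modelled reward w.r.t. the policy parameters."""
    r = reward_model(policy(z)[:, cols]).mean()
    grads = torch.autograd.grad(r, list(policy.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

for step in range(400):
    z = torch.randn(256, 8)
    g_a = policy_grad(rm_a, slice(0, 4), z)
    g_b = policy_grad(rm_b, slice(4, 8), z)
    cos = nn.functional.cosine_similarity(g_a, g_b, dim=0).item()

    # Optimise only against reward model A (the deployed proxy).
    opt.zero_grad()
    (-rm_a(policy(z)[:, :4]).mean()).backward()
    opt.step()

    if step % 50 == 0:
        with torch.no_grad():
            samples = policy(z)
            proxy = rm_a(samples[:, :4]).mean().item()
            true_u = true_utility(samples).mean().item()
        print(f"step {step:3d}  proxy {proxy:6.2f}  true utility {true_u:6.2f}  grad cosine {cos:+.2f}")
```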
New questions where we are not aware of prior work:
- Alignment collapses if one catastrophic action or a small model mis-specification voids safety guarantees; a trap-aware, agnostic regret bound (Kosoy, 2023) would be the first formal handle on this failure mode. Can we produce an algorithm whose regret remains sub-linear even when (i) the environment contains irreversible ‘trap’ states and (ii) the true environment may lie outside the hypothesis class? Password-guessing games would be the canonical test-bed.
- A value-improving update is any policy-gradient direction that raises the true (latent) utility we care about, whereas a Goodharting update is a direction that boosts the learned proxy reward while failing to raise true utility. For a given RLHF pipeline (such as the setup of Gao, 2024), can we decompose the vector field of policy gradients into ‘value-improving’ and ‘Goodharting’ components, then upper-bound long-run divergence? (A first-order version of this decomposition is sketched after this list.)
- Can we give necessary and sufficient conditions under which a learned latent-state reward model is calibrated with respect to a hidden ground-truth reward, assuming only partial observability and misspecified priors?
- Fine-tuned models increasingly choose their own training data, for example through self-play or automated data collection. Can we build online tests that detect when a learner is reinforcing a reward-hacking loop faster than human oversight can respond?
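A first-order sketch of the value-improving/Goodharting decomposition described earlier in this list, using made-up gradient vectors (in practice g_true and g_proxy would be estimated policy gradients): split the proxy gradient into its component along the true-utility gradient, which raises true utility to first order, and the orthogonal remainder, which moves the proxy without moving true utility to first order.

```python
# First-order decomposition of a proxy-reward gradient (NumPy); vectors are placeholders.
import numpy as np

rng = np.random.default_rng(0)
g_true = rng.normal(size=1000)                    # gradient of the (latent) true utility
g_proxy = 0.3 * g_true + rng.normal(size=1000)    # gradient of the learned proxy reward

coef = float(g_proxy @ g_true) / float(g_true @ g_true)
value_improving = max(coef, 0.0) * g_true         # only a positive projection raises true utility
goodharting = g_proxy - value_improving           # moves the proxy, not true utility, to first order

ratio = np.linalg.norm(goodharting) / np.linalg.norm(g_proxy)
print(f"fraction of proxy-gradient norm not aligned with true utility: {ratio:.2f}")
```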
Suggestions to learn more: Regarding Goodhart and specification gaming, we recommend reading Manheim and Garrabrant, 2018, and Krueger et al., 2020. Regarding reward hacking, we recommend reading METR, 2025, Weng, 2024, Hadfield-Menell et al., 2016, and Everitt et al., 2019. Recent work on this includes Wen et al., 2024, Pan et al., 2024, and Wang et al., 2024.
In-Context and Continual Learning
Problem summary: Large models are increasingly learning at inference time, absorbing new information from prompts, tool calls, or streaming data. These learning processes might be importantly different from, but significantly influenced by, prior training, including pre-training and post-training. Further, models are increasing in the time horizons they operate over, in their possession of long-term memory, and potentially in evolving goals and values. This raises technical challenges, including avoiding catastrophic forgetting of safety constraints and detecting when new capabilities or values are gained. We would like theory or measurement tools that quantify how fast safety knowledge decays, what architecture or optimiser choices affect that decay, and how mesa-optimisers could arise in such a setting.
Why this matters: Safety guarantees obtained at initial training time are moot if they wash out after a week of user interaction. Many toxic behaviours only emerge after long periods of interaction with a user. Reliable forgetting mechanisms, safe online adaptation, and early-warning indicators for harmful behaviour are therefore central to keeping deployed systems aligned over their entire lifecycle. Without them, every post-deployment fine-tune or context window becomes a potential jailbreak vector, gradually eroding the safeguards initially in place.
Examples of subproblems:
Established subproblems:
- In-context learning is an emergent ability of a model to adjust its behaviour to the tokens it receives at inference time, effectively improving existing capabilities or outright gaining new ones (Brown et al., 2020, Dong et al., 2022). We have a loose understanding of when models gain and lose in-context capability (Singh et al., 2023, Carroll et al., 2025), as well as of the conditions in pretraining that promote in-context learning. We still lack a variety of things, such as methods to control characteristics of in-context learning via interventions during pre-training, post-training, and RL training, as well as an effective theory of how in-context learning capabilities relate to pre-training dynamics (Singh et al., 2024).
- Various mechanisms have been proposed to explain in-context learning. Can we validate or falsify said mechanisms:
- ICL as gradient descent mesa-optimiser in the forward pass: Oswald et al., 2023
- Identifying induction heads as necessary and central components of ICL capability: Olsson et al., 2022
- Different kinds of learning algorithms implemented in-context depending on different factors: Akyürek et al., 2023
- Catastrophic forgetting (De Lange et al., 2021, Wang et al., 2025) refers to the loss of previously learned capabilities after training on new data. This may be harmful when safety guardrails learned during training are forgotten, and beneficial when harmful behaviours are unlearned (Yao et al., 2024). Why does this happen, and how can we identify and control which capabilities are forgotten? Are there methods of continued training (RL, fine-tuning or continued pretraining) and careful curation of in-context data that allow for retention of safety properties? What measurements (benchmarks, whitebox methods or other metrics) can we perform to be confident that safety properties are retained? (A compact regularisation sketch follows this list.)
- Eliciting bad contexts from models (Irving et al., 2025) may be related to work that studies if language models know what they know (Kadavath et al., 2022, Steyvers et al., 2025, Kapoor et al., 2024). Could we apply these methods to elicit scenarios where a given model would perform poorly or act in harmful ways? Can we develop chain-of-thought monitors that can give early warning of a potential transition into dangerous contexts? Can we systematically discover potential model jailbreaking prompts via sufficient understanding on how models update their states in-context?
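As one concrete baseline for the retention question above, here is a compact elastic weight consolidation (EWC; Kirkpatrick et al., 2017) sketch on a toy pair of tasks. The setup, tasks, and penalty strength are illustrative assumptions, not a recipe for retaining real safety properties: estimate a diagonal Fisher on task A, penalise movement of high-Fisher parameters while finetuning on task B, and compare task-A retention with and without the penalty.

```python
# Compact EWC sketch on toy tasks (PyTorch); hyperparameters illustrative.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(feature_idx, n=2000):
    X = torch.randn(n, 16)
    return X, (X[:, feature_idx] > 0).long()

XA, yA = make_task(0)            # stand-in "safety" task
XB, yB = make_task(1)            # later finetuning task
loss_fn = nn.CrossEntropyLoss()

def train(model, X, y, steps=300, penalty=None):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        if penalty is not None:
            loss = loss + penalty(model)
        loss.backward()
        opt.step()

def accuracy(model, X, y):
    with torch.no_grad():
        return (model(X).argmax(1) == y).float().mean().item()

base = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
train(base, XA, yA)

# Diagonal Fisher estimate at the task-A solution: average of squared minibatch gradients.
fisher = [torch.zeros_like(p) for p in base.parameters()]
n_batches = 50
for _ in range(n_batches):
    idx = torch.randint(0, len(XA), (32,))
    base.zero_grad()
    loss_fn(base(XA[idx]), yA[idx]).backward()
    for f, p in zip(fisher, base.parameters()):
        f += p.grad.detach() ** 2 / n_batches
anchor = [p.detach().clone() for p in base.parameters()]
base.zero_grad()

def ewc_penalty(model, lam=5e3):             # lam is an illustrative choice
    return lam / 2 * sum(
        (f * (p - a) ** 2).sum()
        for f, p, a in zip(fisher, model.parameters(), anchor)
    )

plain, ewc = copy.deepcopy(base), copy.deepcopy(base)
train(plain, XB, yB)
train(ewc, XB, yB, penalty=ewc_penalty)

print(f"task-A accuracy after plain finetune: {accuracy(plain, XA, yA):.2f}")
print(f"task-A accuracy after EWC finetune:   {accuracy(ewc, XA, yA):.2f}")
print(f"task-B accuracy (plain / EWC): {accuracy(plain, XB, yB):.2f} / {accuracy(ewc, XB, yB):.2f}")
```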
New questions where we are not aware of prior work:
- Can we derive scalable ways to monitor the dynamics of learning after pre-training that serve as early-warning systems for potential phase changes?
- Could we use learning theory to understand the sample efficiency of various in-context learners, as well as how well they generalise and their other OOD behaviours?
- What are the safety properties of continual learning, or of AI systems that are either trained with or augmented with long-term memory? How likely are they to misbehave or act erratically, especially after long periods of user interaction?
Suggestions to learn more: For more on ICL, we recommend reading Brown et al., 2020, Dong et al., 2022, Olsson et al., 2022 and Min et al., 2022. For more recent research in this area, we recommend reading Zhou et al., 2023, Singh et al., 2024, Mainali and Teixeira, 2025, and Li et al., 2024. For more on continual learning, we recommend reading Kirkpatrick et al., 2017, van de Ven and Tolias, 2019, French, 1999, and Wang et al., 2024.
Inner Alignment and Agent Foundations
Problem summary: Natural selection shaped humans to maximize genetic fitness through proxy mechanisms such as preferences for calorie-dense foods. Whilst these proxies aligned with survival in the ancestral environment, they can lead to maladaptive behaviours in modern contexts. When we train a neural network, we are worried that parts of the model may start acting like their own optimiser—running an internal search for a goal that isn’t the one we intended. This has been observed in neural networks in restricted settings (Guez et al., 2019, Taufeeque et al., 2024). We need to figure out if or when these hidden objectives appear, how they grow during training, and how to catch them before they cause trouble. In general, we also need a normative framework: a way to state what an agent should believe, want, and do, and proofs that those prescriptions remain valid after learning, self-improvement, or interaction with other agents.
Why this matters: A model that quietly switches to a different goal can defeat even the best-designed training objective. These behavioural shifts might happen suddenly and without obvious warning, or might activate in an environment unsuitable for such drives. Understanding the learning dynamics behind them would let us spot and fix alignment failures before they scale up. Further, without a solid foundation we are left with ad-hoc patches whose safety evaporates once the system rewrites its own decision rules or confronts shifts out of distribution. A computable, learning-aware decision theory would let us reason about reward hacking, deceptive alignment, or collusion in principled terms, and would inform the design of training algorithms whose objectives survive scale. In short, clear agent foundations take alignment from empirical guesswork into something we can prove (or falsify) before deployment.
Examples of subproblems:
Established subproblems:
- Hu et al., 2023 suggests that the outcomes of training can be apparent early in the training process. When can model behaviour be interpreted solely from trained weights, and when are learning dynamics necessary for interpretability (see Lehalleur, 2025 for background)?
- Open, heterogeneous deployment will pit many AIs against each other and against humans (see work on strategic intelligence in LLMs by Payne and Alloui-Cros, 2025). Guarantees that cooperation (or at least bounded damage) emerges even under adversarial cross-play are a key piece of real-world robustness. Systems optimised for this may also learn harmful dynamics from a competitive environment. Can we prove that a class of learning algorithms (e.g. infra-Bayesian or mirror-descent learners) converges to logit or ε-Nash equilibria (a set of strategies where, given the opponents' strategies, each player obtains within ε of the maximum possible expected payoff) in repeated population games where opponents are reshuffled every round? Can we quantify the worst-case payoff shortfall? (A toy equilibrium-convergence check is sketched after this list.)
- Hidden “agent-within-agent” dynamics could lead to secret collusion (Motwani et al., 2024) or gradient hacking (Barnett, 2021), where a model might modify its own training process. A detection+mitigation pipeline provides a worst-case safety back-stop before deployment. Could we use the infra-Bayesian “clone law” (Direction 15 in Kosoy, 2023) formalism to detect when two subnetworks of the same model begin modelling each other as agents with private objectives? Could we design an intervention (weight-regulariser, architectural change, etc.) that provably suppresses this? A good testbed would be to use this intervention and prevent emergent misalignment in model organisms where we know it can be induced (Turner et al., 2025).
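A toy check of the equilibrium-convergence question above, restricted to a random two-player zero-sum matrix game with multiplicative-weights (Hedge) learners, i.e. entropic mirror descent; the game, learners, and step size are assumptions that drop the population-game structure of the original question. Self-play is run and the exploitability of the time-averaged strategies is measured, which bounds how far they are from an ε-Nash equilibrium.

```python
# Hedge self-play on a random zero-sum game, with an epsilon-Nash check (NumPy).
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(5, 5))        # row player's payoff matrix; column player receives -A

def hedge_update(weights, payoff_vector, eta):
    """Multiplicative-weights (entropic mirror descent) step toward higher payoff."""
    w = weights * np.exp(eta * payoff_vector)
    return w / w.sum()

T, eta = 5000, 0.05                        # illustrative horizon and step size
p = np.ones(5) / 5                         # row player's mixed strategy
q = np.ones(5) / 5                         # column player's mixed strategy
p_avg, q_avg = np.zeros(5), np.zeros(5)

for _ in range(T):
    p_avg += p / T
    q_avg += q / T
    p_next = hedge_update(p, A @ q, eta)       # row player's payoff vector is A q
    q_next = hedge_update(q, -(p @ A), eta)    # column player's payoff vector is -(p^T A)
    p, q = p_next, q_next

value = p_avg @ A @ q_avg
row_gain = np.max(A @ q_avg) - value       # best unilateral improvement for the row player
col_gain = value - np.min(p_avg @ A)       # best unilateral improvement for the column player
print(f"epsilon (largest unilateral improvement over the averaged strategies): {max(row_gain, col_gain):.4f}")
```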
New questions where we are not aware of prior work:
- Existing decision-theoretic models (AIXI (Hutter, 2000), classical Bayesian reinforcement learning, standard game theory) either assume unlimited compute, perfectly observed rewards, or a clear separation between agent and environment. None of these assumptions hold for frontier-scale systems that are embedded in the world, can possibly modify their own code or environment, and interact with other AIs. However, we need agent foundations that state what an agent should believe, want, and do, and proofs that those prescriptions remain valid after learning, self-improvement, or interaction with other agents. Therefore we ask: What is the right formal language for reasoning about bounded, learning agents that must infer simultaneously how the world works and how one should act in it? How should we construct models that capture simplicity bias without generalisation to harmful concepts? Can we derive guarantees—regret bounds, equilibrium concepts, safety constraints—that still hold when the agent is part of, and can be modified by, its environment?
- Can we formalize simplicity bias beyond Kolmogorov, Solomonoff and Chaitin, and relate it to inductive biases within computable statistical models like ANNs?
- Could we use learning theory to bring formal clarity into mesa-optimiser / objective formation, as well as specify theoretical conditions where they arise?
- Could we use learning theory to provide a formal and quantitative characterisation of reward hacking and other goal misgeneralisation type behaviours in agents?
- We want an empirical handle on value-drift through representation change, where large, continually-finetuned models face a radically different ontology at deployment. Can we devise an ‘ontology-mixing attack’, where we switch a policy trained on one ontology to a more compressed one at deployment?
- Current alignment schemes tacitly assume the agent’s input-output channels are the only things that matter; if advanced models dissolve that boundary by simulating themselves or their operators, we need guarantees that survive the ontological shift. Could we produce an embedded model (Demski and Garrabrant, 2019, Kosoy and Appel, 2021) that avoids classic wire-heading traps?
Suggestions to learn more: We recommend reading Hubinger et al., 2019 and Tennant et al., 2025 on inner alignment. Regarding infra-Bayesianism, we recommend reading Kosoy and Appel, 2021. On agent foundations, we suggest Legg and Hutter, 2006, Hutter, 2000, Schmidhuber, 2002, Kosoy, 2023, and Garrabrant et al., 2016.