Information Theory and Cryptography

Information theory offers the opportunity to develop mathematical tools to understand the fundamental limits of monitoring and verifying AI systems. These tools could help us determine what can and cannot be detected about AI behaviour—whether models can hide information in their outputs, when backdoors are discoverable, and what level oversight is theoretically possible. By proving both positive results (protocols that work) and impossibility results (fundamental barriers), information theory can help focus alignment efforts on tractable approaches. The problems in this section apply tools like entropy, channel capacity, and cryptographic hardness to core alignment challenges: detecting steganography in model outputs, identifying backdoors in trained systems, and understanding when monitoring can fail. As AI systems become more capable, these theoretical bounds become crucial for designing oversight methods that cannot be circumvented.

Information-Theoretic Approaches to Steganography and Encoded Reasoning

Problem summary: It is theoretically possible for AI systems to encode information in their outputs, both cryptographically and steganographically. Can we theoretically bound this possibility and identify possible approaches to preventing it?

Why this matters: One measure to prevent AI systems from causing harm is to monitor all information and reasoning they output. This includes both communication with humans or other AI systems and the use of chain-of-thought reasoning to improve their own outputs. Hiding content in these messages through steganography or cryptography would undermine the effectiveness of our monitoring.

Examples of subproblems:

Are there relevant theoretical bounds to what can be steganographically or cryptographically encoded in natural language?
What limits might there be on models’ ability to do this encoding in practice? Methods for cryptographically-strong steganography usually require circuits much deeper than the latent space evaluation of an LLM, so it is unclear both whether a network could learn such circuits and whether gradient descent would incentivise it.
What mitigations could work? Can suitable theoretical bounds be developed for existing mitigations, such as paraphrasing or monitoring?
If a weak signal for steganography is detected, how can it be expanded into stronger evidence? For example, it may be possible to elicit a language model to occasionally translate steganographic chain-of-thought into plain language, but with a high fraction of false positives: claims that the model is doing a bad thing when in fact it is not. Are there reliable ways to filter out false positives to determine whether steganography is in fact occurring?

Suggestions to learn more: For early work introducing the problem, we’d recommend reading Lanham et al., 2023, Roger and Greenblatt (2023), Turpin et al., 2023. Previous research in this area includes de Witt et al., 2023, Motwani et al., 2024. While the goals are different, detecting or evidencing steganography is related to watermarking methods such as SynthID.

Recent empirical work includes Skaf et al., 2025) and Karpov et al., 2025), which show models can spontaneously develop ciphers. For theoretical foundations of steganography, see Cachin (2004) and Dai & Cai (2019).

Information-Theoretic Approaches to Backdoors

Problem summary: Consider an AI system modified to behave differently on a particular ‘trigger’ input, known as a backdoor. It might be impossible to detect that such a backdoor exists before the system is exposed to the trigger. Are there ways to detect or mitigate backdoors?

Why this matters: AI systems may behave as desired during training and evaluation, only to act in harmful ways when exposed to specific conditions during deployment. These conditions may be intentionally set to only trigger once the system is deployed. Assessing whether a model has this property is key to preventing the deployment of harmful systems.

Examples of subproblems:

Can we improve upon formal notions of defendability against backdoors?
When is defence against backdoors possible?
Are backdoors removable with further training?
What other methods could work against backdoors?
A central theoretical challenge for work defending against backdoors in AI systems is whether there exists a natural representation class that is efficiently defendable but not efficiently probably approximately correct (PAC) learnable. Identify such a representation class. See Christiano et al., 2024 for more details.
Goldwasser et al., 2022 insert undetectable backdoors using hardness of learning with errors, but this requires a network with special algebraic structure: the network uses cosine for periodicity. Is there a more general undetectable backdoor-insertion method with less imposed structure?

Suggestions to learn more: Goldwasser et al., 2022 provides seminal results and a general overview. Further developments were provided by Christiano et al., 2024, Velegkas et al., 2024, Goldwasser et al., 2025) and Gluch and Goldwasser (2025).

For foundational empirical work, see Gu et al., 2017 on trigger-insertion attacks and Wang et al., 2019.

All Research Priorities

Information Theory and Cryptography

Computational Complexity Theory

Economic Theory and Game Theory

Probabilistic Methods

Learning Theory

Evaluation and Guarantees in Reinforcement Learning

Cognitive Science

Interpretability

Benchmark Design and Evaluation

Methods for Post-training and Elicitation

Empirical Investigations Into AI Monitoring and Red Teaming