Prove theoretical limits on what AI systems can hide, reveal or prove about their behaviour.
Information theory offers the opportunity to develop mathematical tools to understand the fundamental limits of monitoring and verifying AI systems. These tools could help us determine what can and cannot be detected about AI behaviour—whether models can hide information in their outputs, when backdoors are discoverable, and what level oversight is theoretically possible. By proving both positive results (protocols that work) and impossibility results (fundamental barriers), information theory can help focus alignment efforts on tractable approaches. The problems in this section apply tools like entropy, channel capacity, and cryptographic hardness to core alignment challenges: detecting steganography in model outputs, identifying backdoors in trained systems, and understanding when monitoring can fail. As AI systems become more capable, these theoretical bounds become crucial for designing oversight methods that cannot be circumvented.
Problem summary: It is theoretically possible for AI systems to encode information in their outputs, both cryptographically and steganographically. Can we theoretically bound this possibility and identify possible approaches to preventing it?
Why this matters: One measure to prevent AI systems from causing harm is to monitor all information and reasoning they output. This includes both communication with humans or other AI systems and the use of chain-of-thought reasoning to improve their own outputs. Hiding content in these messages through steganography or cryptography would undermine the effectiveness of our monitoring.
Examples of subproblems:
Suggestions to learn more: For early work introducing the problem, we’d recommend reading Lanham et al., 2023, Roger and Greenblatt (2023), Turpin et al., 2023. Previous research in this area includes de Witt et al., 2023, Motwani et al., 2024. While the goals are different, detecting or evidencing steganography is related to watermarking methods such as SynthID.
Recent empirical work includes Skaf et al., 2025) and Karpov et al., 2025), which show models can spontaneously develop ciphers. For theoretical foundations of steganography, see Cachin (2004) and Dai & Cai (2019).
Problem summary: Consider an AI system modified to behave differently on a particular ‘trigger’ input, known as a backdoor. It might be impossible to detect that such a backdoor exists before the system is exposed to the trigger. Are there ways to detect or mitigate backdoors?
Why this matters: AI systems may behave as desired during training and evaluation, only to act in harmful ways when exposed to specific conditions during deployment. These conditions may be intentionally set to only trigger once the system is deployed. Assessing whether a model has this property is key to preventing the deployment of harmful systems.
Examples of subproblems:
Suggestions to learn more: Goldwasser et al., 2022 provides seminal results and a general overview. Further developments were provided by Christiano et al., 2024, Velegkas et al., 2024, Goldwasser et al., 2025) and Gluch and Goldwasser (2025).