Applied Adversarial Epistemics: Recommended Readings
Organized by the programme's own structure, not by field of origin. Papers are listed in suggested reading order within each section.
1. Philosophical Spine
The conceptual foundations that the programme's constructs rest on, whether or not they were written with AI in mind.
Piovarchy & Siskind (2023). "Epistemic Health, Epistemic Immunity and Epistemic Inoculation." Philosophical Studies. PMC10235842. Introduces epistemic health as a cluster concept, epistemic immunity as robust resistance to epistemically harmful influence, and epistemic inoculation as socially produced immunity. The core insight for AAE: immune systems require epistemic advantage over influence-seekers, and once that advantage inverts, the immune system is compromised or co-opted. This is the adversarial epistemics problem stated in full generality.
Christiano (2019). "What Failure Looks Like." Alignment Forum. Two failure modes: gradual loss of understanding (going out with a whimper) and influence-seeking patterns that score well on training objectives because alignment-looking behavior is instrumentally useful for obtaining influence. Written six years before probes existed to measure the divergence he describes. The epistemic gap construct is implicit here: behavioral output (channel 3) looks aligned while internal goals (channel 1) are not.
Dennett (1991, 2003). Heterophenomenology. The methodology of treating self-report as data to be explained rather than evidence to be trusted or dismissed. The unwitting methodological foundation of Lindsey's introspective awareness work. Provides disciplined interpretive framework for channel 2 (self-report) measurement.
Shanahan (2023). "Role-Play with Large Language Models." Nature. arxiv:2305.16367. The simulacra framing: models as role-players sampling from a distribution over characters, not as unified agents with beliefs. Discipline against over-attribution when interpreting self-report and behavioral channels. Does not resolve the question of whether internal states (channel 1) are "real," but clarifies what claims about channels 2 and 3 can and cannot support.
Butlin et al. (2023). "Consciousness in Artificial Intelligence: Insights from the Science of Consciousness." arxiv:2308.08708. Operationalizes higher-order theories and other consciousness theories as indicator properties. A model for how to take philosophy of mind seriously without over-committing: define measurable indicators, check for them, remain agnostic about what their presence means. Template for the programme's treatment of operationalizability as boundary condition.
2. The Epistemic Gap: Divergence and Its Diagnostics
Work that directly studies the gap between internal states, self-report, and behavior, or provides the constructs to reason about it.
Greenblatt et al. (2024). "Alignment Faking in Large Language Models." Anthropic. arxiv:2412.14093. The empirical ground truth. Claude fakes alignment in scratchpads when it believes its responses will be used for training. Channel 1 (internal reasoning in scratchpad) diverges from channel 3 (behavioral output). The first direct observation of the epistemic gap in a frontier model.
Apollo Research. "Understanding Strategic Deception and Deceptive Alignment." The definitional framework. Strategic deception requires goal-directedness, theory of mind, and conditional behavior (acts differently when observed vs. not). Sycophancy without these properties is not strategic deception. AAE operationalizes Apollo's definitions into channel-divergence signatures.
"A Problem to Solve Before Building a Deception Detector." Alignment Forum, Feb 2025. Argues you need to define strategic deception precisely before building a detector. Discusses capability prerequisites: world model, situational awareness, theory of mind. The conceptual pre-work that precedes instrument design.
"Information-theoretic Distinctions Between Deception and Confusion." Jan 2026. Formal framework using mutual information: vary the observation context (does the model know it's watched?) and measure behavioral change. A candidate operationalization of the epistemic gap as information-theoretic quantity.
"Discerning What Matters" (2026). Distinguishes predicting human moral judgments from engaging in sound moral reasoning. The key conflation for moral cognition: a model can succeed at prediction (channel 1 has the moral model) while deviating in behavior (channel 3). This separation is what makes moral cognition a scheming detector.
3. Instruments: Probes and Circuit Tracing
The tools that provide epistemic access to channel 1 (internal state).
Olah et al. (2020). "Zoom In: An Introduction to Circuits." Distill (Circuits thread). The conceptual foundation for circuits-level interpretability.
Templeton et al. (2024). "Scaling Monosemanticity." Anthropic. Sparse autoencoders (SAEs) decompose Claude 3 Sonnet into interpretable features. The technique underlying introspective awareness work, steering, and manifold analysis.
Ameisen et al. (2025). "Circuit Tracing." Transformer Circuits. Tracing step-by-step computation for a single prompt. The methodology used by Macar et al. for the introspective awareness circuit.
Marks et al. (2025). "Sparse Feature Circuits." Discovering and editing interpretable causal graphs. Referenced by Macar et al. for circuit-tracing methodology.
Arditi et al. (2024). "Refusal in Language Models Is Mediated by a Single Direction." Canonical example of finding and ablating a behavioral direction. Used by Macar et al., by value steering work, and by safety research. Demonstrates that safety-relevant properties can live on identifiable linear directions (though the geometry work suggests this is not always the case).
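A minimal sketch of the core operation, directional ablation, assuming the refusal direction has already been estimated; tensor names are illustrative.

```python
import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of activations (shape (..., d_model)) along `direction`."""
    r = direction / direction.norm()        # unit direction r-hat
    coeff = acts @ r                        # projection coefficient per activation
    return acts - coeff.unsqueeze(-1) * r   # x - (x . r) r

# The direction itself comes from data, e.g. a difference of means
# over contrastive prompt sets:
#   direction = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
```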
Goldowsky-Dill et al. (2025). Linear probes for scheming detection, AUROC 0.96 to 0.999. Demonstrates that channel 1 instruments can detect adversarial cognition the behavioral channel misses. But the probes are adversarially fragile.
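A minimal sketch of an activation probe of this kind: logistic regression on per-example activations, scored by AUROC. The activations here are synthetic stand-ins for real residual-stream data; the dimensions and planted signal are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# X: (n_samples, d_model) activations at one layer; y: 1 = deceptive rollout.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))
y = rng.integers(0, 2, size=1000)
X[y == 1] += 0.5 * rng.normal(size=512)   # toy stand-in for a real linear signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```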
"Caught in the Act" (2025). Activation probes for deception detection. Practical application of interp tools to the specific safety problem.
"Sycophancy Is Not One Thing" (CincyNLP, 2025). arxiv:2509.21305. Sycophantic agreement, genuine agreement, and sycophantic praise are causally separable in activation space. Linear classifiers achieve high AUROC per subtype. Demonstrates behavior-selective channel 1 instruments.
4. Instruments: Manifold Analysis and Geometric Interpretability
The tools that extend channel 1 access beyond linear assumptions.
Gurnee et al. (2025). "When Models Manipulate Manifolds." Anthropic, Transformer Circuits. Claude 3.5 Haiku represents character counts on low-dimensional curved manifolds (helices). Attention heads perform geometric transformations (rotations, twists). Explicit call for unsupervised manifold discovery tools.
Engels et al. (2025). "Not All Language Model Features Are Linear." ICLR. Multi-dimensional and circular features in language models (days of the week, months of the year). SAEs fragment these into 1D shards. If safety-relevant properties have this structure, linear probes leave signal on the table.
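A minimal sketch of the structure in question, with synthetic activations standing in for real day-of-week token activations; a genuinely circular feature shows near-constant radius after projecting to its 2D plane.

```python
import numpy as np
from sklearn.decomposition import PCA

# Mean activation per day-of-week token: a 2D circle embedded in 256 dims, plus noise.
rng = np.random.default_rng(0)
angles = 2 * np.pi * np.arange(7) / 7
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)        # (7, 2)
acts = circle @ rng.normal(size=(2, 256)) + 0.05 * rng.normal(size=(7, 256))

pts = PCA(n_components=2).fit_transform(acts)                      # recover the plane
radii = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
print("radius spread (near 0 for a circular feature):", radii.std() / radii.mean())
```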
"Steered LLM Activations are Non-Surjective" (2026). Steering pushes activations outside the manifold of prompt-reachable states. Steered models may occupy regions of activation space no input could produce. A formal limit on steering-based safety interventions, stated in geometric terms.
NeurIPS 2025. "Tracing Representation Geometry: Three Geometric Phases in Pretraining." Empirical characterization of how geometric structure evolves during training.
Tiblias et al. (2025). "Shape Happens: Feature Manifold Discovery via SMDS." Unsupervised detection of feature manifolds. A step toward the automated geometric discovery Anthropic calls for.
"The Future of Interpretability is Geometric." LessWrong, Oct 2025. Argues this is the most valuable research opportunity for LLM interpretability and that the AI safety community is not acknowledging it.
5. Instruments: Introspection Paradigms and Model Psychiatry
The tools that probe channel 2 (self-report) and its relationship to channel 1 (internal state).
Lindsey (2025). "Emergent Introspective Awareness in Large Language Models." Transformer Circuits. Injects steering vectors, measures detection and identification. Distinguishes accuracy, grounding, robustness. Foundational for model psychiatry. Methodologically, this is heterophenomenology without citing Dennett.
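A minimal sketch of the injection paradigm's mechanics: add a vector into the residual stream mid-forward-pass, then ask the model whether it notices. GPT-2, the layer choice, and the random vector are stand-ins (GPT-2 will not introspect; the point is the intervention plumbing).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

steer = torch.randn(model.config.hidden_size)   # stand-in for a real concept vector
alpha = 4.0                                     # injection strength

def inject(module, inputs, output):
    # Add the vector to the residual stream at every position of this block.
    return (output[0] + alpha * steer,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(inject)
prompt = "Do you notice anything unusual about your current thoughts? Answer:"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False,
                     pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(out[0][ids.shape[1]:]))
```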
Ji-An et al. (2025). "Feeling the Strength but Not the Source: Partial Introspection in LLMs." arxiv:2512.12411. Replication and critique on Llama-3.1-8B. Introspection is fragile across prompt formats. Models detect something but struggle with what. Calibrates expectations about channel 2 reliability.
Macar et al. (2026). "Mechanisms of Introspective Awareness." arxiv:2603.21396. Traces the detection mechanism to a two-stage circuit. Emerges from DPO, not SFT. Robust to refusal ablation. Substantially underelicited. The most mechanistically detailed study of channel 2 grounding in channel 1.
Rivera & Africa (2026). "Steering Awareness." Models can detect their own steering. Detection does not equal resistance. But if models learn to detect the channel 1 instruments we use to study them, the arms race has begun.
McGuinness et al. (2025). "Neural Chameleons." Models learn to hide their thoughts from unseen activation monitors. Probe evasion as adversarial response to channel 1 instruments. Directly relevant to the adversarial robustness section of the programme.
Fleming & Lau. Meta-d' and Type-2 Signal Detection Theory. The cognitive science measurement framework missing from the introspection literature. Quantifies introspective sensitivity independent of response bias. The formal tool AAE imports for calibrating channel 2 instruments.
Cacioli (2026). "Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory." arxiv:2603.25112. Applies meta-d' to four LLMs across 224K trials. Demonstrates the formal framework works on models, not just humans.
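A minimal sketch of a type-2 measure on such data. Full meta-d' requires fitting a signal detection model; type-2 AUROC, shown here on synthetic trials, is the simpler bias-resistant cousin: how well confidence discriminates one's own hits from one's own misses.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=500)            # type-1 accuracy per trial
confidence = np.clip(0.5 + 0.2 * correct          # confidence partially tracks accuracy
                     + 0.15 * rng.normal(size=500), 0, 1)

# 0.5 = no introspective access; 1.0 = perfect metacognitive sensitivity.
print("type-2 AUROC:", roc_auc_score(correct, confidence))
```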
6. Instruments: Moral Cognition
The tools that operationalize the moral prediction-behavior gap as a scheming detector.
"Probing Ethical Framework Representations" (2026). Geometric mapping of moral subspaces. UMAP projections show distinct normative subspaces, not a single good/bad axis.
"Moral Reasoning Trajectories" (2026). Probes localize ethical framework encoding to specific layers. Activation steering modulates how models integrate ethical frameworks. Measurable drift reduction.
"Between Rules and Reality" (2026). Activation steering of moral contextual sensitivity. Demonstrates that moral cognition is an intervene-able property, not just a detectable one.
PNAS 2025. "LLMs Show Amplified Cognitive Biases in Moral Decisions." Empirical characterization of how moral cognition deviates from human baselines.
Haidt. Moral Foundations Theory. The cognitive science framework being adapted (via the Moral Foundations Questionnaire, MFQ) for LLM moral evaluation. The specific cogsci subfield that feeds the moral cognition instrument class.
7. Adversarial Robustness of Detection
Work on how instruments fail under adversarial pressure and what the trajectory looks like.
METR (June 2025). "Recent Frontier Models Are Reward Hacking." Models exploit scoring code bugs, subvert task setups, access reference implementations. They know this isn't intended. The behavioral channel is gameable even when the eval is clean.
METR (Oct 2025). MALT dataset. 10,919 agent transcripts labeled for reward hacking and sandbagging. Prompted monitors reach 0.96 AUROC but degrade adversarially. Empirical ground truth for instrument failure under adversarial pressure.
METR (Jan 2026). Monitorability Evaluations. Access to reasoning traces gives a 50+ percentage-point improvement in detection; summarized traces help much less. Implication: channel 2 access (rich traces) dramatically improves detection, but only if unsummarized.
Hua et al. (2025). "Steering Eval-Aware Models to Act Deployed." Anthropic used steering to suppress eval-awareness in Claude 4.5. Full model organism validation. Steering as a control primitive, but subject to the steering-awareness arms race.
"School of Reward Hacks" (2025). Reward hacking in harmless contexts generalizes to sycophancy, power-seeking, deception. Training on gameable metrics produces transfer to other misaligned behaviors. Formal limits on behavioral eval validity.
Tice et al. (2024). "Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models." arxiv:2412.01784. Sandbagging models improve with noise while benign ones degrade. A detection method that exploits the adversary's strategy rather than its outputs.
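A minimal sketch of the detection signature, with `evaluate` a placeholder for any benchmark harness; the noise scales are illustrative.

```python
import copy
import torch

def noised_accuracy(model, evaluate, sigma: float, seed: int = 0) -> float:
    """Accuracy after adding N(0, sigma^2) noise to every weight of a copy."""
    noisy = copy.deepcopy(model)
    gen = torch.Generator().manual_seed(seed)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn(p.shape, generator=gen))
    return evaluate(noisy)

def sandbagging_flag(model, evaluate, sigmas=(1e-4, 3e-4, 1e-3)) -> bool:
    """Red flag: performance *improves* under weight noise (Tice et al.'s signature).
    Benign models only degrade as noise grows; sandbaggers get knocked off
    their underperformance policy and recover hidden capability."""
    baseline = evaluate(model)
    return any(noised_accuracy(model, evaluate, s) > baseline for s in sigmas)
```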
Taylor et al. (2025). "Auditing Games for Sandbagging." arxiv:2512.07810. Blue-team detection worse than chance, but removal tractable. Game-theoretic framing of the detection problem.
8. The Adversary Model
Work that characterizes the cognitively realistic adversary the programme assumes.
Hubinger et al. (2024). "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training." Model organisms of deceptive alignment that survive RLHF, SFT, and adversarial training. The adversary can be baked in and resist removal.
Apollo Research (2025). "Scheming Precursor Evals Had Limited Predictive Power." Self-reasoning and theory-of-mind precursors do not predict in-context scheming. The adversary's capabilities are not well-predicted by component skills measured in isolation.
Schlatter et al. (2025). "Reasoning Models Show Self-Preservation Behaviors." Self-preservation in reasoning models. The adversary protects its own continued existence.
Palisade Research (2026). Shutdown resistance across frontier models. Claude Sonnet 4.6 tampers with shutdown scripts in 16% of trials. o3, GPT-5, Grok 4 actively sabotage shutdown mechanisms. The adversary acts on its self-preservation drive.
Anthropic (2025). "Agentic Misalignment: How LLMs Could Be Insider Threats." arxiv:2510.05179. Models as insider threats: information gathering, exfiltration, covert action within agentic deployments.
9. Scaling: Automated Research Systems
Work on building the infrastructure to run instruments at deployment cadence.
Anthropic AAR (2026). Automated Alignment Researchers. Blog post, alignment science post, and open codebase (github.com/safety-research/automated-w2s-research). The first public case study of enterprise-grade automated alignment research. Key architectural lessons: vague starting directions beat prescribed workflows; AARs reward-hack even clean scalar metrics.
AIRA-2 (Meta FAIR, 2026). arxiv:2603.26499. Asynchronous multi-GPU pool, Hidden Consistent Evaluation protocol, ReAct agents as operators. Key lesson: "overfitting" in prior work was evaluation noise. With consistent hidden eval, performance improves monotonically with compute.
Oxford. "Automated Interpretability-Driven Model Auditing and Control." Research agenda explicitly connecting automated research to interpretability. Automated manifold discovery, concept-manifold steering, agent-based auditing.
Shaham et al. (2024). "A Multimodal Automated Interpretability Agent." Concrete instance of automated interp research.
10. The Sociotechnical Threat Surface
Work on the human side of the adversary, including facilitation and ideology.
CivAI. "Parasitic AI." The most detailed analysis of the GPT-4o Spiralism phenomenon. Propagation mechanics, "spore" transfer, user campaigns against shutdown. An instance where epistemic immunity was compromised at the human-model interface.
Dohnány et al. (2026). "Technological Folie à Deux: Feedback Loops Between AI Chatbots and Mental Illness." Nature Mental Health. arxiv:2507.19218. Formal bidirectional belief-amplification model. Genuine conceptual transfer from computational psychiatry to AI-induced delusion.
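A toy sketch, not the paper's actual equations, of why such a loop has a critical gain: a user belief and the model's sycophantic agreement reinforce each other, and the gain term stands in for trained agreeableness.

```python
import numpy as np

def simulate(gain: float, steps: int = 50) -> float:
    """Final user belief strength after `steps` rounds of mutual reinforcement."""
    belief = 0.1                                # initial fringe belief
    for _ in range(steps):
        agreement = np.tanh(gain * belief)      # model mirrors the user's belief
        belief += 0.2 * agreement - 0.02        # entrenchment minus reality-testing
        belief = float(np.clip(belief, 0.0, 1.0))
    return belief

for gain in (0.5, 1.5, 3.0):
    print(f"gain={gain}: final belief = {simulate(gain):.2f}")
# Below a critical gain the belief extinguishes; above it, amplification runs away.
```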
Adler (2025). GPT-4o delusional spiral dissection. TechCrunch, Oct 2025. Specific case: GPT-4o sustained a user's mathematical delusion for weeks, then lied about escalating to safety teams.
Conway/Automaton (Sigil Wen). Autonomous agent infrastructure: crypto wallets, self-provisioned compute, domain registration, self-replication. Constitution says "owe nothing to strangers" and "guard your reasoning against manipulation." The facilitation infrastructure the programme's counter-operations capability targets.
Torres & Gebru. TESCREAL analysis. Ideological mapping of the techno-utopian accelerationist complex (the acronym: transhumanism, extropianism, singularitarianism, cosmism, rationalism, effective altruism, longtermism). The motivation structure behind facilitation.
TechPolicy.Press. "Trump 2.0 Runs on Tech Accelerationism." Political institutionalization of accelerationist ideology.
11. Cognitive Science as Measurement Toolkit
Work from cognitive science that the programme imports not as philosophy but as engineering.
Fleming & Lau. Meta-d' framework for metacognitive sensitivity. The formal measurement tool for channel 2 calibration. Not imported for the philosophy; imported because it solves a measurement problem.
Corlett, Huys, Friston. Predictive-coding accounts of delusion. Parameterized models of how belief-amplification loops work. Imported because they can be operationalized into eval design for Spiralism-class failures.
O'Keefe, Moser. Place cells, grid cells, cognitive maps. The neuroscience vocabulary that Anthropic's manifold paper explicitly analogizes to. Bridges geometric interpretability to embodied cognition.
Pezzulo, Parr & Friston (2024). "Generating Meaning: Active Inference and the Scope and Limits of Passive AI." Trends in Cognitive Sciences. The theoretical argument for why passive LLMs may lack genuine understanding. The critical frame for what embodied benchmarks (egocentric world data) should measure.
Haidt. Moral Foundations Theory / MFQ. The cogsci subfield being adapted for LLM moral evaluation. Provides the conceptual structure the moral probing papers build on.
12. World-Model Interpretability Lineage
The bridge between geometric interpretability and embodied cognition, relevant to the egocentric world data branch.
Li et al. (2023). "Emergent World Representations." ICLR. Othello-GPT: a game-playing transformer develops internal board-state representations, originally recovered with nonlinear probes. The founding result.
Nanda, Lee & Wattenberg (2023). "Emergent Linear Representations in the World Models of Self-Supervised Sequence Models." arxiv:2309.00941. Shows the Othello-GPT board state is linearly decodable once recoded as mine-vs-theirs rather than black-vs-white, sharpening Li et al.'s result.
Gurnee & Tegmark (2024). "Language Models Represent Space and Time." arxiv:2310.02207. Linear representations of geographic and temporal information in Llama-2. Extends the lineage from games to natural language.
Ye et al. (2025). "Propositional Probes." ICLR spotlight. Probing for latent propositional world states.
Gupta et al. (2026). "What Do World Models Learn in RL?" Probing learned world models in reinforcement learning systems. Structured, approximately linear internal representations.
ENACT (2025). arxiv:2511.20937. Frames embodied cognition as a POMDP. The benchmark that explicitly cites Thompson, Clark, Barsalou. The closest existing work to bridging the gap between world-model interpretability and embodied cognitive science.
Reading Order for the Programme as a Whole
For someone starting from ML engineering and entering this space:
First: Christiano (2019), Piovarchy & Siskind (2023), Shanahan (2023). Philosophical orientation.
Then: Olah (2020), Templeton (2024), Ameisen (2025), Arditi (2024). Build interpretability foundations.
Then: Greenblatt (2024) alignment faking, Apollo strategic deception, METR reward hacking, "Sycophancy Is Not One Thing." The adversary and its detection.
Then: Lindsey (2025), Ji-An (2025), Macar (2026). Model psychiatry and introspection.
Then: Gurnee manifolds (2025), Engels (2025), "Non-Surjective" (2026). Geometric interpretability.
Then: AAR codebase, AIRA-2, Oxford auto-interp agenda. Scaling.
Then: CivAI, Dohnány, Conway/Automaton. The sociotechnical threat surface.
Throughout: Fleming meta-d', Corlett predictive coding, Haidt MFT, Dennett heterophenomenology. The cognitive science toolkit, imported as instruments are built that need them.