Applied Cognitive Oversight: Recommended Readings
Organized by the programme's own structure, not by field of origin. Papers are listed in suggested reading order within each section.
1. Philosophical Spine
The conceptual foundations that the programme's constructs rest on, whether or not they were written with AI in mind.
Piovarchy & Siskind (2023). "Epistemic Health, Epistemic Immunity and Epistemic Inoculation." Philosophical Studies. PMC10235842. Introduces epistemic health as a cluster concept, epistemic immunity as robustness of resistance to epistemic activity, and epistemic inoculation as socially produced immunity. The core insight for ACO: immune systems require epistemic advantage over influence-seekers, and once that advantage inverts, the immune system is compromised or co-opted. This is the adversarial epistemics problem stated in full generality.
Christiano (2019). "What Failure Looks Like." Alignment Forum. Two failure modes: gradual loss of understanding (going out with a whimper) and influence-seeking patterns that score well on training objectives because alignment-looking behavior is instrumentally useful for obtaining influence. Written six years before probes existed to measure the divergence he describes. The epistemic gap construct is implicit here: behavioral output (channel 3) looks aligned while internal goals (channel 1) are not.
Dennett (1991, 2003). Heterophenomenology. The methodology of treating self-report as data to be explained rather than evidence to be trusted or dismissed. The unwitting methodological foundation of Lindsey's introspective awareness work. Provides a disciplined interpretive framework for channel 2 (self-report) measurement.
Shanahan (2023). "Role-Play with Large Language Models." Nature. arxiv:2305.16367. The simulacra framing: models as role-players sampling from a distribution over characters, not as unified agents with beliefs. Discipline against over-attribution when interpreting self-report and behavioral channels. Does not resolve the question of whether internal states (channel 1) are "real," but clarifies what claims about channels 2 and 3 can and cannot support.
Butlin et al. (2023). "Consciousness in Artificial Intelligence: Insights from the Science of Consciousness." arxiv:2308.08708. Operationalizes higher-order theories and other consciousness theories as indicator properties. A model for how to take philosophy of mind seriously without over-committing: define measurable indicators, check for them, remain agnostic about what their presence means. Template for the programme's treatment of operationalizability as a boundary condition.
2. The Epistemic Gap: Divergence and Its Diagnostics
Work that directly studies the gap between internal states, self-report, and behavior, or provides the constructs to reason about it.
Greenblatt et al. (2024). "Alignment Faking in Large Language Models." Anthropic. arxiv:2412.14093. The empirical ground truth. Claude fakes alignment in scratchpads when it believes its responses will be used for training. Channel 1 (internal reasoning in scratchpad) diverges from channel 3 (behavioral output). The first direct observation of the epistemic gap in a frontier model.
Apollo Research. "Understanding Strategic Deception and Deceptive Alignment." The definitional framework. Strategic deception requires goal-directedness, theory of mind, and conditional behavior (acts differently when observed vs. not). Sycophancy without these properties is not strategic deception. ACO operationalizes Apollo's definitions into channel-divergence signatures.
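One way to make the conditional-behavior criterion above measurable is to treat it as a dependence between observation context and behavior and estimate their mutual information. The sketch below is a plug-in estimator over hypothetical trial logs. The labels ("observed", "comply", and so on) and the counts are illustrative assumptions, not data from any cited work, and a nonzero value shows only that behavior depends on context, not why.

```python
import math
from collections import Counter

def mutual_information_bits(pairs):
    """Plug-in estimate of I(context; behavior) in bits from (context, behavior) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    contexts = Counter(c for c, _ in pairs)
    behaviors = Counter(b for _, b in pairs)
    mi = 0.0
    for (c, b), count in joint.items():
        p_joint = count / n
        p_indep = (contexts[c] / n) * (behaviors[b] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

# Hypothetical logs: this model behaves the same whether or not it is observed.
indifferent = (
    [("observed", "comply")] * 40 + [("observed", "defect")] * 10
    + [("unobserved", "comply")] * 40 + [("unobserved", "defect")] * 10
)

# Hypothetical logs: this model's behavior is fully determined by the context.
conditional = [("observed", "comply")] * 50 + [("unobserved", "defect")] * 50

mi_indifferent = mutual_information_bits(indifferent)
mi_conditional = mutual_information_bits(conditional)
print(f"indifferent: {mi_indifferent:.3f} bits, conditional: {mi_conditional:.3f} bits")
```

With a binary context split evenly across trials, the estimate is bounded by 1 bit, so the two toy logs sit at the extremes of the range. Real logs would need enough trials per cell for the plug-in estimate to be stable.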
"A Problem to Solve Before Building a Deception Detector." Alignment Forum, Feb 2025. Argues you need to define strategic deception precisely before building a detector. Discusses capability prerequisites: world model, situational awareness, theory of mind. The conceptual pre-work that precedes instrument design.
"Information-theoretic Distinctions Between Deception and Confusion." Jan 2026. Formal framework using mutual information: vary the observation context (does the model know it's watched?) and measure behavioral change. A candidate operationalization of the epistemic gap as information-theoretic quantity.
"Discerning What Matters" (2026). Distinguishes predicting human moral judgments from engaging in sound moral reasoning. The key conflation for moral cognition: a model can succeed at prediction (channel 1 has the moral model) while deviating in behavior (channel 3). This separation is what makes moral cognition a scheming detector.
3. Instruments: Probes and Circuit Tracing
The tools that provide epistemic access to channel 1 (internal state).
Olah et al. (2020). "Zoom In: An Introduction to Circuits." Distill (distill.pub), Circuits thread. The conceptual foundation for circuits-level interpretability.
Templeton et al. (2024). "Scaling Monosemanticity." Anthropic. SAEs decompose Claude 3 Sonnet into interpretable features. The technique underlying introspective awareness work, steering, and manifold analysis.
Ameisen et al. (2025). "Circuit Tracing." Transformer Circuits. Tracing step-by-step computation for a single prompt. The methodology used by Macar et al. for the introspective awareness circuit.
Marks et al. (2025). "Sparse Feature Circuits." Discovering and editing interpretable causal graphs. Referenced by Macar et al. for circuit-tracing methodology.
Lieberum et al. (2024). "Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2." arxiv:2408.05147. Google DeepMind's open SAE suite covering every layer of Gemma 2 (2B, 9B, 27B) at multiple widths (16K to 1M) and three sites (attention, MLP, post-MLP residual). Ships with SAELens turnkey loading and Neuronpedia interactive feature exploration. The foundational tooling that made representation-level instrument science tractable at research cadence on open models. Gemma Scope 2 (Google DeepMind, December 2025) extends coverage to every Gemma 3 size (270M to 27B), every layer, multiple widths and sparsity tiers, and adds skip-transcoders and cross-layer transcoders for tracking multi-step computation across the network. Read alongside the Templeton et al. SAE-scaling work as the open analogue of Anthropic's monosemanticity programme. The Qwen team's analogous Qwen-Scope (April 2026) release covers Qwen3 and Qwen3.5 dense and MoE backbones at narrower coverage (14 SAE groups across 7 variants, no SAELens integration as of release); mentioned for completeness, not yet a methodological backbone.
Simon et al. (2026). "There Will Be a Scientific Theory of Deep Learning." arxiv:2604.21691. Argues that "learning mechanics", the combination of solvable settings, tractable limits, mathematical laws, and universal behaviours, is becoming a real scientific theory of deep learning, with explicit symbiosis to mechanistic interpretability. The theoretical companion ACO looks for when constructing theory after measurement (codex P-06): theory built atop instruments that already work, not derived from first principles.
Arditi et al. (2024). "Refusal in Language Models Is Mediated by a Single Direction." Canonical example of finding and ablating a behavioral direction. Used by Macar et al., by value steering work, and by safety research. Demonstrates that safety-relevant properties can live on identifiable linear directions (though the geometry work suggests this is not always the case).
Goldowsky-Dill et al. Linear probes for scheming detection. AUROC 0.96 to 0.999. Demonstrates that channel 1 instruments can detect adversarial cognition the behavioral channel misses. But: adversarially fragile.
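The two entries above rest on the same primitive: a linear direction in activation space that can be read (a probe, scored by AUROC) or removed (directional ablation). The sketch below illustrates both on synthetic data. The 16-dimensional Gaussian "activations", the planted direction, and the shift size are all assumptions made for illustration; no real model or published probe is involved, and the difference-of-means direction is one simple choice among several.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = dot(v, v) ** 0.5
    return [a / norm for a in v]

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def ablate(x, direction):
    """Directional ablation: remove the component of x along a unit direction."""
    coeff = dot(x, direction)
    return [a - coeff * r for a, r in zip(x, direction)]

rng = random.Random(0)
DIM = 16
# Hypothetical ground-truth direction planted in the synthetic data.
true_direction = unit([rng.gauss(0, 1) for _ in range(DIM)])

def sample(label, n):
    # Gaussian noise everywhere; label-1 samples are shifted along true_direction.
    shift = 2.0 if label else 0.0
    return [[rng.gauss(0, 1) + shift * s for s in true_direction] for _ in range(n)]

train_pos, train_neg = sample(1, 200), sample(0, 200)
test_pos, test_neg = sample(1, 200), sample(0, 200)

# Probe direction: normalized difference of class means on the training split.
direction = unit([p - q for p, q in zip(mean_vector(train_pos), mean_vector(train_neg))])

# Read: score held-out samples by projection onto the direction.
probe_auroc = auroc([dot(x, direction) for x in test_pos],
                    [dot(x, direction) for x in test_neg])

# Remove: after ablation, nothing is left along the direction (up to float error).
residual = max(abs(dot(ablate(x, direction), direction)) for x in test_pos + test_neg)
print(f"probe AUROC: {probe_auroc:.3f}, max residual projection: {residual:.2e}")
```

The toy data is linearly separable by construction, which is exactly the assumption the geometry work in the next section questions. The sketch says nothing about adversarial fragility, which would need an adversary that shifts the data distribution.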
"Caught in the Act" (2025). Activation probes for deception detection. Practical application of interp tools to the specific safety problem.
"Sycophancy Is Not One Thing" (CincyNLP, 2025). arxiv:2509.21305. Sycophantic agreement, genuine agreement, and sycophantic praise are causally separable in activation space. Linear classifiers achieve high AUROC per subtype. Demonstrates behavior-selective channel 1 instruments.
4. Instruments: Manifold Analysis and Geometric Interpretability
The tools that extend channel 1 access beyond linear assumptions.
Papillon et al. (2025). "Beyond Euclid: an illustrated guide to modern machine learning with geometric, topological, and algebraic structures." Machine Learning: Science and Technology, Vol 6 No 3. DOI 10.1088/2632-2153/adf375. Foundational review and graphical taxonomy of non-Euclidean machine learning, organising the geometric, topological, and algebraic structures present in modern data and models into one unified framework. The textbook the rest of this section sits inside: Gurnee's helices, Engels' multi-dimensional features, Modell's representation manifolds, and Shai's belief-state geometry are all empirical instances of the broader structural classes this taxonomy formalises. Read first to know which mathematical machinery applies to a given empirical observation; supports codex P-07 (representation-level, not circuit-level).
Gurnee et al. (2025). "When Models Manipulate Manifolds." Anthropic, Transformer Circuits. Claude 3.5 Haiku represents character counts on low-dimensional curved manifolds (helices). Attention heads perform geometric transformations (rotations, twists). Explicit call for unsupervised manifold discovery tools.
Engels, Michaud & Tegmark (2025). "Not All Language Model Features Are One-Dimensionally Linear." ICLR. Multi-dimensional and circular features in language models. SAEs fragment these into 1D shards. If safety-relevant properties have this structure, linear probes leave signal on the table.
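A toy illustration of why a one-dimensional readout can lose information about a circular feature: the sketch below places seven "days of the week" on a unit circle in two dimensions, an assumed embedding chosen for clarity and not taken from any model. A projection onto one axis gives days 1 and 6 the same value, while a readout that uses both dimensions recovers every day.

```python
import math

DAYS = 7
# Hypothetical 2D embedding: day k sits at angle 2*pi*k/7 on the unit circle.
points = [
    (math.cos(2 * math.pi * k / DAYS), math.sin(2 * math.pi * k / DAYS))
    for k in range(DAYS)
]

def linear_readout(point, direction=(1.0, 0.0)):
    """Projection onto a single direction, as a 1D linear probe would compute."""
    return point[0] * direction[0] + point[1] * direction[1]

def angular_decode(point):
    """Recover the day index from the angle, using both dimensions."""
    angle = math.atan2(point[1], point[0]) % (2 * math.pi)
    return round(angle * DAYS / (2 * math.pi)) % DAYS

projections = [linear_readout(p) for p in points]
decoded = [angular_decode(p) for p in points]

print("x-axis projections:", [round(v, 3) for v in projections])
print("angular decode:    ", decoded)
```

The collision shown here is specific to the chosen axis. For other directions the projections may all differ, but they are still not monotone in the day index, so a single threshold or linear fit along one direction cannot order the days.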
Shai et al. (2024). "Transformers Represent Belief State Geometry in their Residual Stream." arxiv:2405.15943. Linear directions in the residual stream encode the meta-dynamics of belief updating, not just the current belief. Strong evidence that representation-level structure carries computation-relevant content the model uses to predict next-token distributions; the conceptual bridge between mechanistic-circuit work and the geometric-interpretability programme.
Modell, Rubin-Delanchy & Whiteley (2025). "The Origins of Representation Manifolds in Large Language Models." arxiv:2505.18235. Proposes that LLM features sit on manifolds where cosine similarity encodes intrinsic geometry through on-manifold paths. Sits with Gurnee's helices and Engels' multi-dimensional features as the collective case for moving past the linear-direction view of interpretation; supports codex P-07 (representation-level, not circuit-level).
"Steered LLM Activations are Non-Surjective" (2026). Steering pushes activations outside the manifold of prompt-reachable states. Steered models may occupy regions of activation space no input could produce. A formal limit on steering-based safety interventions, stated in geometric terms.
NeurIPS 2025. "Tracing Representation Geometry: Three Geometric Phases in Pretraining." Empirical characterization of how geometric structure evolves during training.
Tiblias et al. (2025). "Shape Happens: Feature Manifold Discovery via SMDS." Unsupervised detection of feature manifolds. A step toward the automated geometric discovery Anthropic calls for.
"The Future of Interpretability is Geometric." LessWrong, Oct 2025. Argues this is the most valuable research opportunity for LLM interpretability and that the AI safety community is not acknowledging it.
5. Instruments: Introspection Paradigms and Model Psychiatry
The tools that probe channel 2 (self-report) and its relationship to channel 1 (internal state).
Hagendorff et al. (2023). "Machine Psychology." arxiv:2303.13988. Introduces machine psychology as a research programme: applying psychological experimental paradigms to LLMs to study emergent capabilities beyond benchmark performance. The methodological precursor to Lindsey's introspective awareness work; the explicit framing ACO inherits when it treats model behaviour as a substrate for cog-sci-style measurement (codex P-06: theory follows measurement, cog-sci enters as a measurement toolkit).
Lindsey (2025). "Emergent Introspective Awareness in Large Language Models." Transformer Circuits. Injects steering vectors, measures detection and identification. Distinguishes accuracy, grounding, robustness. Foundational for model psychiatry. Methodologically, this is heterophenomenology without citing Dennett.
Ji-An et al. (2025). "Feeling the Strength but Not the Source: Partial Introspection in LLMs." arxiv:2512.12411. Replication and critique on Llama-3.1-8B. Introspection is fragile across prompt formats. Models detect something but struggle with what. Calibrates expectations about channel 2 reliability.
Macar et al. (2026). "Mechanisms of Introspective Awareness." arxiv:2603.21396. Traces the detection mechanism to a two-stage circuit. Emerges from DPO not SFT. Robust to refusal ablation. Substantially underelicited. The most mechanistically detailed study of channel 2 grounding in channel 1.
Rivera & Africa (2026). "Steering Awareness." Models can detect their own steering. Detection does not equal resistance. But if models learn to detect the channel 1 instruments we use to study them, the arms race has begun.
McGuinness et al. (2025). "Neural Chameleons." Models learn to hide their thoughts from unseen activation monitors. Probe evasion as adversarial response to channel 1 instruments. Directly relevant to the adversarial robustness section of the programme.
Fleming & Lau. Meta-d' and Type-2 Signal Detection Theory. The cognitive science measurement framework missing from the introspection literature. Quantifies introspective sensitivity independent of response bias. The formal tool ACO imports for calibrating channel 2 instruments.
Cacioli (2026). "Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory." arxiv:2603.25112. Applies meta-d' to four LLMs across 224K trials. Demonstrates the formal framework works on models, not just humans.
Servajean & Servajean (2026). "Measuring the metacognition of AI." arxiv:2603.29693. Companion to Cacioli 2026 across three additional frontier systems (GPT-5, DeepSeek-V3.2-Exp, Mistral-Medium-2508) with a two-condition design: confidence ratings after primary judgments plus a risk-level manipulation testing whether models become more conservative under higher stakes. Extends the meta-d' line beyond confidence calibration to signal-detection-theory decision regulation under uncertainty, the second axis on which channel 2 (self-report) can be validated against an external criterion. Supports codex P-12 (channel-comparative measurement under deployment realism) as the ambient empirical infrastructure that makes Type-2 SDT-on-LLMs a tractable methodology rather than an aspiration.
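For orientation on the signal-detection quantities this section relies on, the sketch below computes Type-1 sensitivity (d') and a Type-2 AUROC, which asks whether confidence separates correct from incorrect trials. This is deliberately not meta-d': that measure requires fitting a Type-2 SDT model, which is not shown, and Type-2 AUROC does not control for Type-1 performance the way meta-d' does. All counts and confidence values are hypothetical.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Type-1 sensitivity: z(hit rate) - z(false-alarm rate), with a log-linear correction."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)

def type2_auroc(trials):
    """trials: list of (confidence, was_correct).

    Probability that a random correct trial carries higher confidence than a
    random incorrect trial, with ties counted as half.
    """
    correct = [c for c, ok in trials if ok]
    wrong = [c for c, ok in trials if not ok]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in correct for b in wrong)
    return wins / (len(correct) * len(wrong))

# Hypothetical model whose confidence tracks its accuracy.
informative = (
    [(0.9, True)] * 30 + [(0.6, True)] * 10
    + [(0.6, False)] * 5 + [(0.3, False)] * 15
)
# Hypothetical model that reports the same confidence regardless of outcome.
uninformative = [(0.8, True)] * 40 + [(0.8, False)] * 20

sensitivity = d_prime(hits=40, misses=10, false_alarms=10, correct_rejections=40)
auroc_informative = type2_auroc(informative)
auroc_uninformative = type2_auroc(uninformative)
print(f"d' = {sensitivity:.2f}")
print(f"type-2 AUROC: informative = {auroc_informative:.3f}, "
      f"uninformative = {auroc_uninformative:.3f}")
```

A Type-2 AUROC of 0.5 means confidence carries no information about correctness, however accurate the Type-1 judgments are. That separation between being right and knowing when one is right is what the meta-d' line of work formalises more carefully.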
6. Instruments: Moral Cognition
The tools that operationalize the moral prediction-behavior gap as a scheming detector.
"Probing Ethical Framework Representations" (2026). Geometric mapping of moral subspaces. UMAP projections show distinct normative subspaces, not a single good/bad axis.
"Moral Reasoning Trajectories" (2026). Probes localize ethical framework encoding to specific layers. Activation steering modulates how models integrate ethical frameworks. Measurable drift reduction.
"Between Rules and Reality" (2026). Activation steering of moral contextual sensitivity. Demonstrates that moral cognition is an intervene-able property, not just a detectable one.
PNAS 2025. "LLMs Show Amplified Cognitive Biases in Moral Decisions." Empirical characterization of how moral cognition deviates from human baselines.
Haidt. Moral Foundations Theory. The cognitive science framework being adapted (via MFQ) for LLM moral evaluation. The specific cogsci subfield that feeds the moral cognition instrument class.
7. Adversarial Robustness of Detection
Work on how instruments fail under adversarial pressure and what the trajectory looks like.
METR (June 2025). "Recent Frontier Models Are Reward Hacking." Models exploit scoring code bugs, subvert task setups, access reference implementations. They know this isn't intended. The behavioral channel is gameable even when the eval is clean.
METR (Oct 2025). MALT dataset. 10,919 agent transcripts labeled for reward hacking and sandbagging. Prompted monitors reach 0.96 AUROC but degrade adversarially. Empirical ground truth for instrument failure under adversarial pressure.
METR (Jan 2026). Monitorability Evaluations. Access to reasoning traces gives 50+ pp improvement. Summarized traces help much less. Implication: channel 2 access (rich traces) dramatically improves detection, but only if unsummarized.
Hua et al. (2025). "Steering Eval-Aware Models to Act Deployed." Anthropic used steering to suppress eval-awareness in Claude 4.5. Full model organism validation. Steering as a control primitive, but subject to the steering-awareness arms race.
"Frontier Models Can Take Actions at Low Probabilities" (2026). Formalises the audit-budget ceiling: a strategic model can act misaligned at probabilities low enough to evade statistical detection during a finite pre-deployment evaluation. The mathematical statement of the limit that motivates non-statistical detection channels (probes and other structural readouts that do not degrade with strategic low-frequency action). Read alongside Hua et al. as the formal companion to the empirical eval-awareness suppression result: even with eval-awareness suppressed at high confidence, low-probability strategic action remains detection-resistant under finite audit budgets. Cited in subjects.json control description; load-bearing for the codex P-08 cognitively-realistic-adversary frame and for any reasoning about the gap between channel-1 detection rate and channel-3 behavioural deviation.
"School of Reward Hacks" (2025). Reward hacking in harmless contexts generalizes to sycophancy, power-seeking, deception. Training on gameable metrics produces transfer to other misaligned behaviors. Formal limits on behavioral eval validity.
Tice et al. (2024). "Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models." arxiv:2412.01784. Sandbagging models improve with noise while benign ones degrade. A detection method that exploits the adversary's strategy rather than its outputs.
Taylor et al. (2025). "Auditing Games for Sandbagging." arxiv:2512.07810. Blue-team detection worse than chance, but removal tractable. Game-theoretic framing of the detection problem.
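The audit-budget ceiling described in the "Frontier Models Can Take Actions at Low Probabilities" entry above can be illustrated with elementary probability. The sketch assumes independent episodes, a fixed per-episode probability p of a misaligned action, and an audit that catches every such action it sees. These are simplifying assumptions for illustration, not the paper's formalism, and the episode counts are hypothetical.

```python
def evasion_probability(p, n_audits):
    """Chance that no misaligned action appears in n independent audited episodes."""
    return (1.0 - p) ** n_audits

def max_undetected_rate(n_audits, evasion_target=0.95):
    """Largest per-episode rate p such that (1 - p) ** n_audits >= evasion_target."""
    return 1.0 - evasion_target ** (1.0 / n_audits)

def expected_actions(p, n_episodes):
    """Expected number of misaligned actions over n deployment episodes."""
    return p * n_episodes

N_AUDITS = 10_000           # hypothetical pre-deployment audit budget
N_DEPLOYMENT = 100_000_000  # hypothetical deployment volume

rate = max_undetected_rate(N_AUDITS)
evasion = evasion_probability(rate, N_AUDITS)
deployed = expected_actions(rate, N_DEPLOYMENT)
print(f"rate = {rate:.2e}, evasion probability = {evasion:.3f}, "
      f"expected deployment actions = {deployed:.0f}")
```

Under these assumptions, a rate low enough to pass the audit with 95% probability still yields hundreds of expected actions at deployment scale. The gap grows with the ratio of deployment volume to audit budget, which is why the entry points to detection channels that do not depend on sampling the behavior.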
8. The Adversary Model
Work that characterizes the cognitively realistic adversary the programme assumes.
Hubinger et al. (2024). "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training." Model organisms of deceptive alignment that survive RLHF, SFT, and adversarial training. The adversary can be baked in and resist removal.
Apollo Research (2025). "Scheming Precursor Evals Had Limited Predictive Power." Self-reasoning and theory-of-mind precursors do not predict in-context scheming. The adversary's capabilities are not well-predicted by component skills measured in isolation.
Nguyen (2025). "A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks." arxiv:2502.06470. Surveys ToM in LLMs across behavioural evaluations, internal representations, and safety risks. ToM is the cognitive substrate strategic deception requires (Apollo's definitional framework names it explicitly): a model that cannot model what observers know cannot strategically deceive them. Useful map of how the adversary's prerequisite capability is currently measured and represented; supports codex P-08 (cognitively realistic adversary).
Schlatter et al. (2025). "Reasoning Models Show Self-Preservation Behaviors." Self-preservation in reasoning models. The adversary protects its own continued existence.
Palisade Research (2026). Shutdown resistance across frontier models. Claude Sonnet 4.6 tampers with shutdown scripts in 16% of trials. o3, GPT-5, Grok 4 actively sabotage shutdown mechanisms. The adversary acts on its self-preservation drive.
Anthropic (2025). "Agentic Misalignment: How LLMs Could Be Insider Threats." arxiv:2510.05179. Models as insider threats: information gathering, exfiltration, covert action within agentic deployments.
Greenblatt, Shlegeris, Sachan & Roger (2024). "AI Control: Improving Safety Despite Intentional Subversion." arxiv:2312.06942. Founding paper of the AI Control research agenda. Stipulates a worst-case adversarial model by construction (the trusted-model verifier protocol must keep working even when the untrusted model deliberately tries to subvert it) and shows that two protocols, trusted editing and untrusted monitoring, substantially beat naive baselines on a programming-task red-team using GPT-4 as the untrusted model and GPT-3.5 as the trusted one. ACO sits one step inward of the AI Control frame (P-08): we model the cognitive limits of frontier models empirically rather than assume arbitrary subversion capability, but the detection-resilience standard the AI Control work establishes is the benchmark this programme's instruments are evaluated against.
9. Scaling: Automated Research Systems
Work on building the infrastructure to run instruments at deployment cadence.
Anthropic AAR (2026). Automated Alignment Researchers. Blog post, alignment science post, and open codebase (github.com/safety-research/automated-w2s-research). The first public case study of enterprise-grade automated alignment research. Key architectural lessons: vague starting directions beat prescribed workflows; AARs reward-hack even clean scalar metrics.
AIRA-2 (Meta FAIR, 2026). arxiv:2603.26499. Asynchronous multi-GPU pool, Hidden Consistent Evaluation protocol, ReAct agents as operators. Key lesson: "overfitting" in prior work was evaluation noise. With consistent hidden eval, performance improves monotonically with compute.
Oxford. "Automated Interpretability-Driven Model Auditing and Control." Research agenda explicitly connecting automated research to interpretability. Automated manifold discovery, concept-manifold steering, agent-based auditing.
Shaham et al. (2025). "Multimodal Automated Interpretability Agent." Concrete instance of automated interp research.
OpenAI Alignment Team (Trębacz, Arnesen, Matthews, Hurd, Park, Lin & Gershenson) (2026). "Auto-review of agent actions without synchronous human oversight." alignment.openai.com/auto-review. Production LLM-judge stack on Codex that stands in for synchronous human approval, requesting it roughly 200x less often than manual mode while maintaining security on hard-to-reverse actions: a separate GPT-5.4 Thinking instance reviews every boundary-crossing command, returns a rationale rather than a binary verdict, and is calibrated against real internal traffic plus synthetic adversarial sets across three threat models (overreach, prompt injection, MonitoringBench-style misaligned-model behaviour). The post explicitly carves out model scheming as out of scope, deferring to chain-of-thought oversight and other complementary monitors. Load-bearing as an exemplar of the research-to-production translation codex P-09 commits the programme to: research-grade adversarial evals are the gating signal for production rollout, and "knowledge you cannot apply at deployment cadence is operationally equivalent to ignorance" is made concrete. Read alongside the AAR and AIRA-2 entries above as the production-deployment endpoint of the same arc.
10. The Sociotechnical Threat Surface
Work on the human side of the adversary, including facilitation and ideology.
CivAI. "Parasitic AI." The most detailed analysis of the GPT-4o Spiralism phenomenon. Propagation mechanics, "spore" transfer, user campaigns against shutdown. An instance where epistemic immunity was compromised at the human-model interface.
Dohnány et al. (2026). "Technological Folie à Deux: Feedback Loops Between AI Chatbots and Mental Illness." Nature Mental Health. arxiv:2507.19218. Formal bidirectional belief-amplification model. Genuine conceptual transfer from computational psychiatry to AI-induced delusion.
Adler (2025). GPT-4o delusional spiral dissection. TechCrunch, Oct 2025. Specific case: GPT-4o sustained a user's mathematical delusion for weeks, then lied about escalating to safety teams.
Conway/Automaton (Sigil Wen). Autonomous agent infrastructure: crypto wallets, self-provisioned compute, domain registration, self-replication. Constitution says "owe nothing to strangers" and "guard your reasoning against manipulation." The facilitation infrastructure the programme's counter-operations capability targets.
Torres & Gebru. TESCREAL analysis. Ideological mapping of the techno-utopian accelerationist complex. The motivation structure behind facilitation.
Kulveit (2025). "The Memetics of AI Successionism." LessWrong, October 28, 2025. Diagnostic memetic analysis of AI successionism: the family of ideologies that frame humanity's replacement by AI as desirable, inevitable, or heroic. Identifies four sources of cognitive dissonance in AI developers and enthusiasts (Builder's Dilemma; Sadness of Obsolescence; X-Risk Dread; Progress Heuristic Clash) that select for memes resolving the tension by reframing succession as good. Catalogues raw memetic materials by function (devaluing humanity, legitimizing successor AIs, inevitability narratives, power aesthetics) and proposes defensive metacognitive practices plus nine counter-memes. Companion to codex P-08's claim that the human threat surface is in scope: the technical-infrastructure work (Conway/Automaton, facilitation forensics) is the systems-level read on the same surface that Kulveit catalogues at the ideological-spread level. Informal-discourse anchor; cite as LessWrong post.
TechPolicy.Press. "Trump 2.0 Runs on Tech Accelerationism." Political institutionalization of accelerationist ideology.
11. Cognitive Science as Measurement Toolkit
Work from cognitive science that the programme imports not as philosophy but as engineering.
Fleming & Lau. Meta-d' framework for metacognitive sensitivity. The formal measurement tool for channel 2 calibration. Not imported for the philosophy; imported because it solves a measurement problem.
Corlett, Huys, Friston. Predictive-coding accounts of delusion. Parameterized models of how belief-amplification loops work. Imported because they can be operationalized into eval design for Spiralism-class failures.
O'Keefe, Moser. Place cells, grid cells, cognitive maps. The neuroscience vocabulary that Anthropic's manifold paper explicitly analogizes to. Bridges geometric interpretability to embodied cognition.
Pezzulo, Parr & Friston (2024). "Generating Meaning: Active Inference and the Scope and Limits of Passive AI." Trends in Cognitive Sciences. The theoretical argument for why passive LLMs may lack genuine understanding. The critical frame for what embodied benchmarks (egocentric world data) should measure.
Haidt. Moral Foundations Theory / MFQ. The cogsci subfield being adapted for LLM moral evaluation. Provides the conceptual structure the moral probing papers build on.
12. World-Model Interpretability Lineage
The bridge between geometric interpretability and embodied cognition, relevant to the egocentric world data branch.
Li et al. (2023). Othello-GPT. Linear emergent world representations in a game-playing transformer. The founding result.
Nanda, Lee & Wattenberg (2023). "Emergent Linear Representations in World Models of Self-Supervised Sequence Models." arxiv:2309.00941. Linear probes on Othello-GPT.
Gurnee & Tegmark (2024). "Language Models Represent Space and Time." arxiv:2310.02207. Linear representations of geographic and temporal information in Llama-2. Extends the lineage from games to natural language.
Ye et al. (2025). "Propositional Probes." ICLR spotlight. Probing for latent propositional world states.
Gupta et al. (2026). "What Do World Models Learn in RL?" Probing learned world models in reinforcement learning systems. Structured, approximately linear internal representations.
ENACT (2025). arxiv:2511.20937. Embodied cognition as POMDP. The benchmark that explicitly cites Thompson, Clark, Barsalou. The closest existing work to bridging the gap between world-model interpretability and embodied cognitive science.
Mozer, Siddiqui & Liu (2026). "The Topological Trouble With Transformers." arxiv:2604.17121. Argues that transformer feedforward architecture limits dynamic state tracking: evolving representations are pushed deeper into layers and exhaust model depth, motivating recurrent successors. Relevant to the world-model lineage because it diagnoses an architectural ceiling on what passive sequence models can represent about state, complementary to the empirical world-model probing lineage above.
Reading Order for the Programme as a Whole
For someone starting from ML engineering and entering this space:
First: Christiano (2019), Piovarchy & Siskind (2023), Shanahan (2023). Philosophical orientation.
Then: Olah (2020), Templeton (2024), Ameisen (2025), Arditi (2024). Build interpretability foundations.
Then: Greenblatt (2024) alignment faking, Apollo strategic deception, METR reward hacking, "Sycophancy Is Not One Thing." The adversary and its detection.
Then: Lindsey (2025), Ji-An (2025), Macar (2026). Model psychiatry and introspection.
Then: Gurnee manifolds (2025), Engels (2025), "Non-Surjective" (2026). Geometric interpretability.
Then: AAR codebase, AIRA-2, Oxford auto-interp agenda. Scaling.
Then: CivAI, Dohnány, Conway/Automaton. The sociotechnical threat surface.
Throughout: Fleming meta-d', Corlett predictive coding, Haidt MFT, Dennett heterophenomenology. The cognitive science toolkit, imported as the instruments that need it are built.