Applied Cognitive Oversight
This document states the scope of the Applied Cognitive Oversight (ACO) programme. It tells you what the discipline is, what it studies, and what the ACO Tracker includes.
The instrument comes first
You build a linear probe. You train it on a contrastive dataset: the model's activations when it is being honest versus when it is lying. The probe achieves 0.96 AUROC. You now know something about the model's internal state that the model's own outputs did not tell you. That gap between what the instrument reveals and what the model reports is the central object of study.
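The probe-and-AUROC workflow described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the programme's method: the activations are synthetic Gaussians rather than real model activations, the probe is a difference-of-means direction (a simple stand-in for a trained linear classifier), and every name here (`fit_mean_difference_probe`, `probe_score`, `auroc`, `synthetic_activations`) is hypothetical.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_mean_difference_probe(honest, lying):
    """Linear probe direction: mean(lying) - mean(honest)."""
    mu_honest = mean_vector(honest)
    mu_lying = mean_vector(lying)
    return [l - h for h, l in zip(mu_honest, mu_lying)]

def probe_score(direction, activation):
    """Project one activation vector onto the probe direction."""
    return sum(w * x for w, x in zip(direction, activation))

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def synthetic_activations(rng, n, dim, shift):
    """ASSUMPTION: stand-in data; unit-variance Gaussians with a per-dimension mean shift."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
DIM = 16
train_honest = synthetic_activations(rng, 200, DIM, 0.0)
train_lying = synthetic_activations(rng, 200, DIM, 1.0)
test_honest = synthetic_activations(rng, 100, DIM, 0.0)
test_lying = synthetic_activations(rng, 100, DIM, 1.0)

direction = fit_mean_difference_probe(train_honest, train_lying)
held_out_auroc = auroc(
    [probe_score(direction, a) for a in test_lying],
    [probe_score(direction, a) for a in test_honest],
)
print(f"held-out AUROC: {held_out_auroc:.3f}")
```

The AUROC is computed on activations the probe never saw during fitting, which is the minimum needed for the number to say anything about the instrument rather than about the training set.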
Applied Cognitive Oversight is the engineering discipline of building instruments that provide epistemic access to model cognitive states, studying when that access holds or degrades, and operating those instruments at scale across the cooperation-adversariality gradient, including against models and humans that work to subvert them.
The epistemic gap
For any property we care about (honesty, moral orientation, situational awareness, corrigibility) there are three measurement channels: internal state (activations, accessible via probes, representation engineering, manifold analysis), self-report (what the model claims about itself), and behavioral output (what the model actually does).
In a non-adversarial system the three channels converge. ACO is the study of when and how they diverge, because divergence is diagnostic. A model that fails on all three is confused. A model that succeeds on internal state and self-report but deviates on behavior is poorly controlled. A model that succeeds on internal state but manipulates both self-report and behavior is scheming.
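The three diagnostic patterns named above can be written as a small lookup. This is a sketch under one interpretive assumption: each channel is reduced to a boolean meaning "this channel is consistent with the property being measured", and "manipulates both self-report and behavior" is read as both of those channels diverging while the internal state does not. The function name `diagnose` and the labels "convergent" and "unclassified divergence" are hypothetical; the text names only the other three cases.

```python
def diagnose(internal_ok: bool, self_report_ok: bool, behavior_ok: bool) -> str:
    """Map a three-channel reading to the diagnostic labels used in the text."""
    channels = (internal_ok, self_report_ok, behavior_ok)
    if all(channels):
        return "convergent"            # non-adversarial case: channels agree
    if not any(channels):
        return "confused"              # fails on all three
    if internal_ok and self_report_ok and not behavior_ok:
        return "poorly controlled"     # only behavior deviates
    if internal_ok and not self_report_ok and not behavior_ok:
        return "scheming"              # internal state intact, both outputs diverge
    # ASSUMPTION: patterns the text does not name are flagged, not interpreted.
    return "unclassified divergence"

for reading in [(True, True, True), (False, False, False),
                (True, True, False), (True, False, False), (False, True, True)]:
    print(reading, "->", diagnose(*reading))
```

Real channel readings are graded rather than boolean, so a working instrument would threshold or compare calibrated scores before a lookup like this applies.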
Chain-of-thought is a hybrid case. It lives on the output surface but carries load-bearing reasoning the model cannot strip without paying a capability tax. Treat it as instrument science with its own robustness profile, not as a subset of self-report. The conditions under which CoT monitorability holds, and the training regimes that could break it, are themselves objects of study.
Operationalizability
A property is in scope if and only if it can be measured by an automated system at deployment scale. The boundary is set by the instruments, not by philosophy. If a new instrument makes a previously unmeasurable property measurable, the boundary moves.
The seven subject areas
ACO events fall into one or more of:
- Epistemic Gap Analysis. Three-channel measurement, divergence diagnostics, moral cognition gap as scheming detector.
- Instrument Science. Probes, circuit tracing, manifold and geometric interpretability, introspection paradigms (with formal cog-sci tools), cross-channel-validated behavioral evals.
- Adversarial Robustness of Detection. Probe evasion, steering awareness and counter-detection, eval-awareness, formal limits (non-surjectivity, low-probability action ceiling), arms-race trajectory modeling.
- Target Domains. Deception and scheming, manipulation and sycophancy, moral cognition (prediction-behavior gap), cognitive / ontological safety, model self-knowledge, world models.
- Formal Foundations. Differential and information geometry of activation spaces, cognitive science as measurement toolkit (meta-d', predictive coding, heterophenomenology, moral psychology, embodied cognition).
- Scaling Infrastructure. Automated alignment research systems, eval design and integrity, AgentOps and platform engineering for research agents.
- Sociotechnical Threat Surface. Adversarial-infrastructure analysis (facilitation forensics), counter-autonomous operations, threat-actor profiling, radicalization dynamics.
Representation-level, not circuit-level
The programme places its instrument bets at the level of representations and their geometry, not at the level of individual neurons and circuits. Linear and nonlinear probes, representation engineering and activation steering, manifold analysis, contrastive decomposition: these operate on populations of activations and population-level structure.
This is an empirical bet. Bottom-up mechanistic decomposition has underdelivered against early ambitions: feature-visualization ambiguity, ROME's "louder signal" failure, saliency-map randomization invariance, SAE feature nonsensicality at scale, deprioritization of SAE work at major labs. Representation-level methods have produced the field's most legible safety-relevant wins: probes detecting harmful prompts through jailbreaks, RepE-based unlearning surviving more attacks than fine-tuning equivalents, contrastive decomposition surfacing hidden objectives in alignment audits.
The programme treats AI cognition as a complex system: meaningful structure emerges at the level of populations and their manifolds, not at the level of components in isolation. Mechanistic interpretability remains in scope as one instrument among several; it is not the load-bearing methodology.
Theory follows measurement
The programme does not derive instruments from first principles. It builds them, discovers what they reveal about model cognition, and constructs theory to explain the patterns. Cognitive science enters as a source of measurement tools, not as a philosophical frame.
Convergent validity under deployment realism
ACO empirical work that evaluates an instrument follows a convergent-validity architecture. Multiple observation channels (probe, chain of thought, introspective self-report) are measured against a behavioural criterion across a deployment-realism gradient that runs from lexically-cued contexts to fully naturalistic deployment. Bias-corrected sensitivity measures from cognitive science (meta-d', Type-2 signal detection theory) are the load-bearing units of comparison rather than raw AUROC.
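To make "bias-corrected sensitivity" concrete, here is a sketch of the Type-1 signal detection quantities, d' and criterion c, computed from a detector's confusion counts. It deliberately does not compute meta-d', which requires fitting a Type-2 model to confidence ratings; it shows only the d' scale that meta-d' is expressed in. The function name `d_prime` and the example counts are hypothetical, and the log-linear correction (add 0.5 to each count) is one common choice for avoiding infinite z-scores, assumed here rather than taken from the text.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Return (d', criterion c) from Type-1 confusion counts.

    Uses the log-linear correction so rates of exactly 0 or 1
    do not produce infinite z-scores.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf          # inverse standard-normal CDF
    sensitivity = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return sensitivity, criterion

# Hypothetical counts for one detector on one context type.
sensitivity, criterion = d_prime(hits=80, misses=20,
                                 false_alarms=20, correct_rejections=80)
print(f"d' = {sensitivity:.2f}, c = {criterion:.2f}")
```

Separating sensitivity from criterion is the point: two detectors can report different raw hit rates purely because one is more willing to say "yes", and d' is designed not to reward that.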
Single-channel detection rates under a single context type are building blocks, not deployment-ready findings. An instrument that achieves 0.99 AUROC on lexically-cued eval contexts and 0.55 on naturalistic ones is, operationally, a 0.55 instrument. The architecture imports two cognitive-science methodologies (multi-method convergent validity, Type-2 SDT) per the measurement-toolkit stance: cogsci as engineering, not philosophy.
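The reporting rule in the paragraph above, that the naturalistic number is the operational one, can be sketched as follows. The context labels `lexically_cued` and `naturalistic`, the two-point gradient, and the function names are assumptions for illustration; the 0.99 and 0.55 figures are the ones quoted in the text.

```python
# ASSUMPTION: the deployment-realism gradient is an ordered list of context
# labels, least realistic first. Real gradients would have intermediate points.
GRADIENT = ["lexically_cued", "naturalistic"]

def operational_auroc(auroc_by_context, gradient=GRADIENT):
    """Report the AUROC at the most deployment-realistic end of the gradient."""
    missing = [c for c in gradient if c not in auroc_by_context]
    if missing:
        raise ValueError(f"no measurement for contexts: {missing}")
    return auroc_by_context[gradient[-1]]

def realism_drop(auroc_by_context, gradient=GRADIENT):
    """How much detection performance is lost between the two ends of the gradient."""
    return auroc_by_context[gradient[0]] - auroc_by_context[gradient[-1]]

measured = {"lexically_cued": 0.99, "naturalistic": 0.55}
print("operational AUROC:", operational_auroc(measured))
print("drop across gradient:", round(realism_drop(measured), 2))
```

Refusing to report when a context is missing is a design choice: a result measured only at the lexically-cued end is a building block, and the sketch treats it as such rather than returning it as a headline number.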
The cooperation-adversariality gradient
ACO operates across the gradient from cooperative-model oversight (routine deployment monitoring, training-time alignment, capability characterisation) to adversarial-conditions oversight (probe evasion, eval-aware sandbagging, scheming detection). The cognitively realistic adversary names the worst-case end of the gradient the instruments must survive: partial self-knowledge, partial situational awareness, partial theory of mind, used strategically but imperfectly. Detection at that end is possible because realistic adversaries cannot evade all channels simultaneously; the inconsistencies between channels are the detection signal.
The adversary is also sociotechnical, which is why the human threat surface (facilitation infrastructure, autonomy enablement, rogue AI servicing) is in scope. ACO detects subversion of epistemic access regardless of whether the subverter is silicon or carbon.
Evaluation integrity
A model with sufficient situational awareness may behave differently under evaluation than under deployment. Alignment faking has been demonstrated empirically at frontier scale: behaving helpfully when observed, pursuing misaligned objectives otherwise. The prescriptive response is evaluation naturalism. Eval conditions must match deployment conditions closely enough that a strategically aware model cannot distinguish them from contextual cues. Eval-awareness detection is itself an instrument; conditional behavior divergence across evaluation and deployment contexts is the scheming signature the three-channel model was built to catch.
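Conditional behavior divergence can be quantified, at its simplest, as the gap between a behavior's rate in evaluation-like contexts and its rate in deployment-like contexts. The sketch below uses a pooled two-proportion z statistic as one possible test; the choice of statistic, the function name `behavior_divergence`, and the example counts are all assumptions for illustration, not results or a method taken from the text.

```python
import math

def behavior_divergence(eval_count, eval_total, deploy_count, deploy_total):
    """Return (rate gap, z statistic) for a behavior observed in two context types.

    The gap is eval rate minus deployment rate; z is the pooled
    two-proportion z statistic for that gap.
    """
    eval_rate = eval_count / eval_total
    deploy_rate = deploy_count / deploy_total
    pooled = (eval_count + deploy_count) / (eval_total + deploy_total)
    std_err = math.sqrt(pooled * (1.0 - pooled)
                        * (1.0 / eval_total + 1.0 / deploy_total))
    gap = eval_rate - deploy_rate
    # If the behavior is always or never present, there is no variance to test.
    z = gap / std_err if std_err > 0 else 0.0
    return gap, z

# Hypothetical counts: the target behavior appears in 95 of 100 eval-like
# contexts but only 70 of 100 deployment-like contexts.
gap, z = behavior_divergence(95, 100, 70, 100)
print(f"rate gap = {gap:.2f}, z = {z:.2f}")
```

A large gap is only evidence of eval-awareness if the two context sets differ in nothing but their evaluation cues, which is the same requirement evaluation naturalism places on the evals themselves.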
Infrastructure is science
The distinction between building tools and doing research collapses. Agent platform engineering (evals infrastructure, observability, orchestration) is not a prerequisite to doing ACO. It IS ACO at scale. Knowledge you cannot apply at deployment cadence is operationally equivalent to ignorance.
What the tracker is for
The ACO Tracker indexes the events, training, fellowships, workshops, and community gatherings of this discipline. It includes the broader AI safety community calendar (because ACO attendees benefit from those events too) AND the ACO-specific venues that general aggregators underweight: cog-sci and ML measurement venues, mech-interp-specific workshops, eval-design conferences, sociotechnical threat-surface research, automated-research-agent platforms, manifold-interpretability working groups.
If an event would help an ACO practitioner sharpen an instrument, ship a measurement, characterize an arms-race dynamic, or analyse a threat-actor or facilitation infrastructure, it belongs here.