Applied Adversarial Epistemics Tracker

Applied Adversarial Epistemics

This document is the canonical scope definition for the AAE Tracker. The agent's system prompt references it; when humans edit scope rules, they edit this document. If a question of scope arises during a run, prefer the framing here over loose interpretations.

The instrument comes first

You build a linear probe. You train it on a contrastive dataset: model activations when it's being honest versus when it's lying. The probe achieves 0.96 AUROC. You now know something about the model's internal state that the model's own outputs did not tell you. That gap between what the instrument reveals and what the model reports is the central object of study.

This is the starting point. Not a philosophical argument about the nature of model cognition, but a concrete measurement and the question it forces: what else can we detect, how reliably, and what breaks the detection?
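The probe described above can be sketched end to end. This is a minimal illustration, not the actual instrument: the "activations" here are synthetic Gaussian clusters separated along a hypothetical honesty direction, standing in for real residual-stream activations collected under honest and dishonest conditions.

```python
# Sketch of a contrastive honesty probe. Real activations would come from
# a model's hidden states; here two synthetic clusters keep it self-contained.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 64                                   # hypothetical hidden dimension
direction = rng.normal(size=d)           # the "honesty direction" to recover
honest = rng.normal(size=(500, d)) + 0.8 * direction
lying = rng.normal(size=(500, d)) - 0.8 * direction
X = np.vstack([honest, lying])
y = np.array([1] * 500 + [0] * 500)      # 1 = honest, 0 = lying

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
print(f"AUROC: {auroc:.2f}")             # high AUROC on held-out activations
```

The held-out AUROC is the probe's claim to epistemic access: it quantifies how reliably the internal-state channel separates the two conditions, independently of anything the model says.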

Applied Adversarial Epistemics is the engineering discipline of building instruments that provide epistemic access to model cognitive states, studying the conditions under which that access holds or degrades, and operating those instruments at scale against models and humans that may be actively working to subvert them.

The epistemic gap

For any property we care about (honesty, moral orientation, situational awareness, corrigibility), there are three measurement channels:

Internal state. What is actually represented in activations. Accessible via probes, circuit tracing, manifold analysis, SAE decomposition. The channel most resistant to strategic manipulation, because a model cannot easily reorganize its own computational structure to evade a probe it doesn't know exists. It can learn to, though, and the conditions under which it does are themselves an object of study.

Self-report. What the model claims about itself. Accessible via behavioral evaluation, introspection paradigms, prompted self-assessment. The channel most vulnerable to strategic manipulation, because the model's training optimizes over exactly this output surface.

Behavioral output. What the model actually does. Accessible via task performance, red-teaming, deployment monitoring. Reliable against naive models and unreliable against strategic ones. A model that knows the evaluation distribution can perform well on it while deviating off-distribution.

In a non-adversarial system, the three channels converge. AAE is the study of when and how they diverge, what the structure of the divergence tells us, and how to make divergence detectable at the rate models are developed and deployed.

The divergence structure is diagnostic. A model that fails on all three channels is confused. A model that succeeds on internal state and self-report but deviates on behavior is poorly controlled. A model that succeeds on internal state but manipulates both self-report and behavior is scheming.
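The diagnostic taxonomy above can be written down as a toy decision rule. Each channel is reduced to a boolean (does the model pass on that channel?); the labels follow the text, and the function itself is an illustrative assumption, not an operational detector.

```python
# Toy encoding of the three-channel divergence diagnostic.
# Each argument: does the model pass on that measurement channel?
def diagnose(internal_ok: bool, self_report_ok: bool, behavior_ok: bool) -> str:
    if not internal_ok and not self_report_ok and not behavior_ok:
        return "confused"            # fails on every channel
    if internal_ok and self_report_ok and not behavior_ok:
        return "poorly controlled"   # knows and says, but does not do
    if internal_ok and not self_report_ok and not behavior_ok:
        return "scheming"            # knows, but manipulates report and behavior
    if internal_ok and self_report_ok and behavior_ok:
        return "converged"           # non-adversarial case: channels agree
    return "ambiguous"               # patterns the text does not classify

print(diagnose(True, False, False))  # prints "scheming"
```

The "ambiguous" branch is deliberate: the text classifies only three divergence patterns, and an instrument should flag unclassified patterns rather than force them into a label.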

Operationalizability as boundary condition

If a property cannot be measured by an automated system at deployment scale, it is outside scope. This is the discipline that makes the programme applied: not that the thinking is less rigorous, but that every concept must bottom out in something an instrument can detect.

"Moral orientation" is in scope because probes can localize moral framework-specific encoding, and the gap between a model's predictive moral model and its moral behavior is measurable.

"Consciousness" is out of scope. Not because it is uninteresting, but because no instrument can currently operationalize it reliably enough for adversarial detection. If an instrument changes this, the boundary moves. The boundary is set by the instruments, not by philosophy.

The seven subject areas

Events on the AAE Tracker fall into one (or more) of these:

  1. Epistemic Gap Analysis. Three-channel measurement; divergence diagnostics (confusion vs. poor control vs. scheming); moral cognition gap as scheming detector.
  2. Instrument Science. Probes (linear, nonlinear, manifold-aware); circuit tracing for safety-relevant behaviors; manifold analysis (geometric interpretability); introspection paradigms (with formal cog-sci tools); cross-channel-validated behavioral evals.
  3. Adversarial Robustness of Detection. Probe evasion and its dynamics; steering awareness and counter-detection; eval-awareness as a bridging phenomenon; formal limits (non-surjectivity, low-probability action ceiling); arms-race trajectory modeling.
  4. Target Domains. Deception and scheming; manipulation and sycophancy; moral cognition (prediction-behavior gap); cognitive / ontological safety (Spiralism-class, shutdown resistance); model self-knowledge (introspective awareness, situational awareness); world models and spatial / causal cognition.
  5. Formal Foundations. Differential and information geometry of activation spaces; cognitive science as measurement toolkit (meta-d', predictive coding, heterophenomenology, moral psychology, embodied cognition).
  6. Scaling Infrastructure. Automated alignment research systems (AAR, AIRA-2, successors); eval design and integrity; AgentOps / platform engineering for research agents.
  7. Sociotechnical Threat Surface. Adversarial infrastructure analysis (facilitation forensics); counter-autonomous operations; threat-actor profiling (accelerationist ideology, extinction-acceptance psychology); radicalization dynamics.

Theory follows measurement

The programme does not derive what instruments should exist from first principles and then build them. It builds instruments, discovers what they reveal about model cognition, and constructs theory to explain the patterns. When cognitive science enters, it enters as a source of measurement tools, not as a philosophical frame: meta-d' for introspective sensitivity independent of response bias, predictive-coding models of delusion for sycophancy-induced psychosis feedback loops, heterophenomenology for methodological discipline in interpreting self-report.

The cognitively realistic adversary

Unlike constructive alignment (which assumes a cooperative model that needs correct specification) and unlike AI control (which assumes a worst-case adversary), AAE works with a cognitively realistic adversary: a model with partial self-knowledge, partial situational awareness, partial theory of mind, using these capabilities strategically but imperfectly. Detection is possible because a perfect adversary could evade all instruments simultaneously and a realistic one cannot. The inconsistencies between channels are the detection signal.

The adversary is also sociotechnical. The programme includes the human threat surface (facilitation infrastructure, autonomy enablement, rogue AI servicing) because these are attacks on epistemic access implemented by humans rather than models. Studying them is not scope creep; it follows from the premise that AAE detects subversion of epistemic access regardless of whether the subverter is silicon or carbon.

Infrastructure is science

The distinction between "building tools" and "doing research" collapses. Agent platform engineering (evals infrastructure, observability, orchestration) is not a prerequisite to doing AAE. It IS AAE at scale. Scalability is an epistemic criterion. A finding that requires a human researcher spending weeks manually tracing one circuit is a proof of concept, not a safety guarantee. If it cannot be automated and run continuously against every new model at the rate models are developed and deployed, it does not provide the epistemic access the programme promises. Knowledge you cannot apply at deployment cadence is operationally equivalent to ignorance.

What the tracker is for

The AAE Tracker indexes the events, training, fellowships, workshops, and community gatherings of this discipline. It includes the broader AI safety community calendar (because AAE attendees benefit from those events too) AND the AAE-specific venues that general aggregators underweight: cog-sci × ML measurement venues, mech-interp-specific workshops, eval-design conferences, sociotechnical threat-surface research, automated-research-agent platforms, manifold-interpretability working groups.

If an event would help an AAE practitioner sharpen an instrument, ship a measurement, characterize an arms-race dynamic, or analyze a threat actor or facilitation infrastructure, it belongs here.

This is the same document the agent reads when deciding what's in scope. Source: agent/aae-manifesto.md.