pcd / ao

January 24, 2026

Predictive concept decoders and activation oracles are both (IMO) examples of what Jacob Steinhardt calls "scalable end-to-end interpretability" approaches. In more detail:

1. Identify an end-to-end task such that, in order to do well at the task, an agent would have to have learned something important about a neural network's internal structure. Examples include predicting the results of interventions, or identifying neurons to ablate that turn a specified behavior on or off.

2. Use this task as a training objective to train an AI assistant on a large amount of data. Such an assistant is likely superhuman at the specialized task of understanding a given AI model.

3. Make the information the assistant learned extractable by humans, either by introducing communication bottlenecks in the architecture or by making explainability part of the training objective.
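As a toy illustration of the first step, here is a minimal sketch of an intervention-prediction task. The network, weights, and shapes below are made-up placeholders: the ground-truth label is the change in output caused by ablating one hidden unit, and an assistant trained on many such (input, unit, effect) examples would learn to predict this effect without actually running the intervention.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny fixed MLP standing in for the network under study (shapes are arbitrary).
W1 = rng.normal(size=(4, 8))   # input (8) -> hidden (4)
W2 = rng.normal(size=(1, 4))   # hidden (4) -> output (1)

def forward(x, ablate=None):
    """Run the toy network, optionally zeroing out one hidden unit."""
    h = np.maximum(W1 @ x, 0.0)          # ReLU hidden layer
    if ablate is not None:
        h = h.copy()
        h[ablate] = 0.0                  # the intervention: ablate unit `ablate`
    return (W2 @ h).item()

# The end-to-end task: given an input, predict the effect of ablating hidden
# unit i. Here we just compute the ground-truth labels the assistant would be
# trained against; a trained assistant would answer without the intervention.
x = rng.normal(size=8)
labels = [forward(x) - forward(x, ablate=i) for i in range(4)]
```

Doing well on this task across many inputs and units requires, in effect, knowing which hidden units carry which information, which is exactly the kind of internal structure the recipe is after.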

I'm excited because it seems that (1) AO/PCD performance scales with self-supervised data scale, (2) AOs/PCDs generalize relatively well out of distribution, and (3) these methods are more architecture-agnostic than circuit-based analyses.

That said, the two approaches differ in important respects:

In particular, AOs require labeled activation data, while PCDs rely solely on a self-supervised signal from text. The "concept bottleneck" of a PCD is meant, among other things, to improve auditability of the mechanisms producing AI behavior: humans can inspect the bottleneck and infer what's going on.
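A minimal sketch of the bottleneck idea, with the concept names, dimensions, and random weights below all being illustrative assumptions rather than anything from a real PCD: every piece of information used for the downstream prediction is forced through a small layer of named concept scores that a human can read off.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical human-readable names for the bottleneck dimensions; in a real
# PCD these would be learned or assigned, not hard-coded.
CONCEPTS = ["formal_tone", "refusal", "math_content"]

D_ACT, D_OUT = 16, 5
W_enc = rng.normal(size=(len(CONCEPTS), D_ACT))  # activations -> concept scores
W_dec = rng.normal(size=(D_OUT, len(CONCEPTS)))  # concept scores -> prediction

def predict_with_bottleneck(activations):
    """Force all information through a small, inspectable concept layer."""
    concepts = np.tanh(W_enc @ activations)   # bounded, human-readable scores
    prediction = W_dec @ concepts             # e.g. behavior/next-token logits
    return dict(zip(CONCEPTS, concepts)), prediction

scores, pred = predict_with_bottleneck(rng.normal(size=D_ACT))
```

Because the prediction depends on the activations only through the three named scores, an auditor who reads `scores` sees everything the decoder used, which is the auditability property the bottleneck is meant to buy.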

Steinhardt proposes extensions to PCDs: