pcd / ao
January 24, 2026
Predictive concept decoders and activation oracles are both (IMO) examples of what Jacob Steinhardt calls "scalable end-to-end interpretability" approaches. In more detail:
1. Identify an end-to-end task such that, in order to do well at it, an agent would have to learn something important about a neural network's internal structure. Examples: predicting the results of interventions, or identifying neurons to ablate that turn a specified behavior on or off (a toy version of this is sketched after the list).
2. Use this task as a training objective to train an AI assistant on a large amount of data. This assistant is likely superhuman at the specialized task of understanding a given AI model.
3. Make the information that the assistant learned extractable to humans, either by introducing communication bottlenecks in the architecture or by making explainability part of the training objective.
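For concreteness, here's a toy, entirely hypothetical version of step 1–2 in PyTorch: the "subject" network is a small MLP, the task is "does zeroing hidden unit i flip the subject's prediction on x?", and an assistant is trained on labels generated by actually running that intervention. All sizes, the assistant architecture, and the behavior being probed are illustrative assumptions, not anything from the AO/PCD work.

```python
import torch
import torch.nn as nn

# frozen "subject" model whose internals we want the assistant to understand
subject = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
subject.requires_grad_(False)

def ablation_label(x, neuron):
    """1.0 iff zeroing hidden unit neuron[b] flips the subject's prediction on x[b]."""
    with torch.no_grad():
        h = torch.relu(subject[0](x))                  # (B, 32) hidden activations
        base = subject[2](h).argmax(-1)                # prediction without intervention
        h_abl = h.clone()
        h_abl[torch.arange(x.size(0)), neuron] = 0.0   # per-example ablation
        ablated = subject[2](h_abl).argmax(-1)
    return (base != ablated).float()

# assistant sees (input, which-neuron) and predicts the intervention's effect
assistant = nn.Sequential(nn.Linear(16 + 32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(assistant.parameters(), lr=1e-3)

for step in range(1_000):                              # "large amount of data" in miniature
    x = torch.randn(64, 16)
    neuron = torch.randint(0, 32, (64,))
    feats = torch.cat([x, nn.functional.one_hot(neuron, 32).float()], dim=-1)
    loss = nn.functional.binary_cross_entropy_with_logits(
        assistant(feats).squeeze(-1), ablation_label(x, neuron))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the toy is just the shape of the objective: the labels are generated automatically by running interventions, so the task scales with compute and data rather than with human labeling.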
I'm excited because it seems that (1) AO/PCD performance scales with self-supervised data scale, (2) AOs/PCDs generalize relatively well OOD, and (3) these methods are more architecture-agnostic than circuit-based analyses.
The two are set up quite differently:
- activation oracles take a base LLM and finetune it to take LLM activations as input and answer natural-language queries about them. Training data is a mix of SPQA, binary classification datasets, and a self-supervised objective where the AO has to predict the tokens before and after an activation injection (a toy sketch of the injection mechanism follows this list).
- predictive concept decoders are trained as end-to-end encoder-decoder models with a topK bottleneck acting as a "concept bottleneck." They're pretrained on FineWeb + activations and finetuned (with a frozen encoder) on a QA dataset, SynthSys.
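My rough mental model of the activation-oracle mechanics, as a toy sketch: project the subject model's activation into the oracle LM's embedding space and splice it in as a "soft token" ahead of the query, then train with an ordinary language-modeling loss on the answer. The tiny transformer, dimensions, and random data below are stand-ins I've made up; the real setup finetunes an actual base LLM on the SPQA / classification / self-supervised mix described above.

```python
import torch
import torch.nn as nn

VOCAB, D_ORACLE, D_SUBJECT = 1000, 128, 512           # toy sizes (assumptions)

embed = nn.Embedding(VOCAB, D_ORACLE)
adapter = nn.Linear(D_SUBJECT, D_ORACLE)              # subject activation -> oracle embedding space
enc_layer = nn.TransformerEncoderLayer(D_ORACLE, nhead=4, batch_first=True)
oracle_lm = nn.TransformerEncoder(enc_layer, num_layers=2)   # stand-in for a finetuned base LLM
lm_head = nn.Linear(D_ORACLE, VOCAB)

def oracle_logits(query_ids, subject_act):
    """query_ids: (B, T) tokens of a question about the activation; subject_act: (B, D_SUBJECT)."""
    soft_token = adapter(subject_act).unsqueeze(1)    # (B, 1, D_ORACLE): the injected activation
    seq = torch.cat([soft_token, embed(query_ids)], dim=1)
    causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
    return lm_head(oracle_lm(seq, mask=causal))       # (B, T+1, VOCAB)

# one toy finetuning step on a (query, activation, answer-token) triple
params = [*embed.parameters(), *adapter.parameters(),
          *oracle_lm.parameters(), *lm_head.parameters()]
opt = torch.optim.Adam(params, lr=1e-4)
query = torch.randint(0, VOCAB, (8, 12))
activation = torch.randn(8, D_SUBJECT)                # would come from the subject LLM in practice
answer = torch.randint(0, VOCAB, (8,))
loss = nn.functional.cross_entropy(oracle_logits(query, activation)[:, -1], answer)
opt.zero_grad(); loss.backward(); opt.step()
```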
In particular, AOs require labeled activation data, while PCDs rely solely on the self-supervised signal from text. The "concept bottleneck" of a PCD is meant to, e.g., improve the auditability of the mechanisms producing AI behavior (humans can inspect the bottleneck and infer what's going on).
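A minimal sketch of that bottleneck, assuming a linear encoder/decoder and a topK sparsity constraint (the real PCD is an end-to-end model pretrained on FineWeb + activations; the dimensions and k below are made up): the decoder only sees the k surviving concepts, so an auditor can read off which concepts drove a given prediction.

```python
import torch
import torch.nn as nn

D_ACT, N_CONCEPTS, K, D_OUT = 512, 4096, 16, 512      # illustrative sizes

class ConceptBottleneck(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(D_ACT, N_CONCEPTS)    # subject activations -> candidate concepts
        self.decoder = nn.Linear(N_CONCEPTS, D_OUT)    # downstream prediction (text logits in the real setup)

    def forward(self, act):
        scores = torch.relu(self.encoder(act))         # candidate concept activations
        topv, topi = scores.topk(K, dim=-1)            # keep only the K strongest concepts
        sparse = torch.zeros_like(scores).scatter(-1, topi, topv)
        return self.decoder(sparse), topi              # prediction + auditable concept ids

model = ConceptBottleneck()
pred, concept_ids = model(torch.randn(4, D_ACT))
print(concept_ids[0])   # the handful of concepts "explaining" example 0
```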
Steinhardt proposes extensions to PCDs:
- making the concept bottleneck consist of more structured objects (propositional formulas), which would take advantage of the inherent compositionality of neural-network cognition (a speculative illustration follows this list);
- replacing the encoder with a transformer (or similar) to get richer representations.
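I find the propositional-formula idea easiest to picture as the bottleneck emitting small formulas over named concept atoms instead of a flat feature vector. The illustration below is my own speculative rendering (the Clause structure and the concept names are invented), not Steinhardt's actual proposal.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Clause:
    """One bottleneck unit as a conjunction over named concept atoms."""
    positive: List[str]            # atoms that must be active
    negated: List[str]             # atoms that must be inactive

    def fires(self, active: Set[str]) -> bool:
        return all(a in active for a in self.positive) and \
               not any(a in active for a in self.negated)

# hypothetical unit: "first-person past-tense narration, outside quoted speech"
unit = Clause(positive=["first_person_pronoun", "past_tense"], negated=["inside_quotation"])
print(unit.fires({"first_person_pronoun", "past_tense"}))                        # True
print(unit.fires({"first_person_pronoun", "past_tense", "inside_quotation"}))    # False
```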