Notes

odes to oranges

Pablo Neruda, "Oda a la naranja" ("Ode to the Orange")

In your likeness,
in your image,
orange,
the world was made:
the sun round, ringed
by rinds of fire:
the night constellated with orange blossoms
its course and its vessel.
So it was, and so we were,
O earth,
discovering you,
orange planet.
We are the spokes of a single wheel
divided
like ingots of gold
and reaching with trains and with rivers
the extraordinary unity of the orange.
Homeland
of mine,
yellow
head of hair,
sword of autumn,
when
to your light
I return,
to the deserted
zone
of the lunar saltpeter,
to the tearing
edges
of Andean metal,
when
I enter
your contour, your waters,
praise
your women,
watch how the forests
sway
birds and sacred leaves,
the wheat spills into the granaries
and the ships sail
through dark estuaries,
I understand that you are,
planet,
an orange,
a fruit of fire.

《橘颂》/ "Ode to the Orange [Tree]"

Fairest tree of Heaven and Earth, the orange came to settle here.
Ordained by nature not to move, it grows in the southern land.
Deep-rooted, hard to transplant, single-minded in its purpose.
Green leaves and white blossoms, profuse and delightful.
Layered branches and sharp thorns, its round fruits clustering.
Green and yellow intermingled, a pattern shining bright.
Pure of color without, white within, like one fit to be entrusted.
Lush and finely formed, beautiful without blemish.

Ah, your youthful resolve sets you apart from the rest.
Standing alone, never moving: is that not admirable?
Deep-rooted, hard to transplant, open-hearted and seeking nothing.
Awake to the world, you stand alone, holding firm against the current.
Guarding your heart with care, you never fall into fault.
Holding to virtue without self-interest, you stand with Heaven and Earth.
I wish, as the years pass away, to be your friend always.
Wholesome and beautiful, never wanton, firm with an inner grain.
Though young in years, you could serve as teacher and elder.
In conduct the peer of Bo Yi, I take you as my model.


[corpusculence]

[notably more schizophrenic]

  • a plurality of decision-theoretic confusions are IMO downstream of a poor understanding of canonical Sacs: (agent, overseer) frames degrade with capability and scale; (agent, agent) dynamics rely on shared Cartesian assumptions
  • Sacs are lossy boundaries. Sacs can grow and be prodded, but notably Sacs can climb and descend ontological hierarchies
  • instantaneously, Sacs are enforced by a superstructure. (bkg assumption is that cooperation is well-defined wrt a superstructure; handwave: morality / customs / law seem to be broadly similar (perhaps functionally equivalent), and these seem to be properties of norms. the central question is what enforces closure; Sacs are the result. plausibly this can be described solely structurally)
  • "superstructure" always exists in agent-agent transactions. "overseer." (overseer of mediated PD equilibria & Lobian cooperation are the same, both EDT-ify bad naive CDT)
  • Sacs can live inside themselves & Sacs that they own (memetic instantiation). critically important is that these Sacs reproduce world-models faithfully (or at least in compatible ways)
  • embedded agency is a useful taxonomy
  • Sac-ification

pcd / ao

Predictive concept decoders and activation oracles are both (IMO) examples of what Jacob Steinhardt calls "scalable end-to-end interpretability" approaches. In more detail:

1. Identify an end-to-end task such that, in order to do well at the task, an agent would have to have learned something important about a neural network's internal structure. Examples would be predicting the results of interventions, or identifying neurons to ablate that turn a specified behavior on or off.

2. Use this task as a training objective to train an AI assistant on a large amount of data. This assistant is likely superhuman at the specialized task of understanding a given AI model.

3. Make the information that the assistant learned extractable to humans, either by introducing communication bottlenecks in the architecture or by making explainability part of the training objective.

I'm excited because it seems that (1) AO/PCD performance scales with self-supervised data scale, (2) AOs/PCDs generalize relatively well OOD, and (3) these methods are more architecture-agnostic than circuit-based analyses.

The two approaches are quite different:

  • activation oracles take a base LLM and finetune it to take LLM activations as input and answer natural language queries about them. Training data is a mix of SPQA, binary classification datasets, and a self-supervised objective where the AO has to predict tokens before and after activation injection

  • predictive concept decoders are trained as end-to-end encoder-decoder models with a topK bottleneck acting as a "concept bottleneck." they're pretrained on FineWeb + activations and finetuned (with frozen encoder) on a QA dataset, SynthSys.

In particular, AOs require labeled activation data, while PCDs rely solely on self-supervised signal from text. The "concept bottleneck" of a PCD is meant, among other things, to improve the auditability of the mechanisms producing AI behavior (humans can look at the bottleneck and infer what's going on); see the sketch below.
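
To make the PCD shape concrete, here is a minimal sketch in PyTorch. The class name, the linear encoder/decoder, the dimensions, and the way top-k sparsity is applied are all my illustrative assumptions, not the published architecture (in particular, the real model is trained end-to-end on text prediction rather than pure reconstruction):

```python
import torch
import torch.nn as nn

class PredictiveConceptDecoder(nn.Module):
    """Illustrative PCD-style module: map LLM activations into a sparse
    top-k "concept bottleneck," then decode. All sizes are placeholders."""

    def __init__(self, d_act: int = 4096, n_concepts: int = 16384, k: int = 64):
        super().__init__()
        self.encoder = nn.Linear(d_act, n_concepts)  # activations -> concept scores
        self.decoder = nn.Linear(n_concepts, d_act)  # sparse concepts -> output
        self.k = k

    def forward(self, acts: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        scores = self.encoder(acts)
        # Top-k bottleneck: zero all but the k largest concept scores, so any
        # downstream prediction must flow through a few inspectable concepts.
        topk = torch.topk(scores, self.k, dim=-1)
        concepts = torch.zeros_like(scores).scatter(-1, topk.indices, topk.values)
        return self.decoder(concepts), concepts

# Example: a batch of 8 residual-stream activation vectors.
pcd = PredictiveConceptDecoder()
out, concepts = pcd(torch.randn(8, 4096))
print(out.shape, concepts.shape)  # torch.Size([8, 4096]) torch.Size([8, 16384])
```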

Steinhardt proposes extensions to PCDs:

  • making the concept bottleneck consist of more structured objects (propositional formulas), which take advantage of inherent compositionality in neural network cognition;
  • replacing the encoder with a transformer (or similar) to have richer representations.

why Occam?

[1] "Ideal Bayesian reasoners" rely on the simplicity prior. This is false as stated: Solomonoff induction convergence is not dependent on the exact choice of the $2^{-|K|}$ simplicity prior; convergence over any computable distribution holds if the prior is any universal semimeasure.1

[2] The generalization bias described in the Bayesian free-energy functional is a bias towards low-description-length programs. I don't know enough about this to provide a full, concise, precise accounting of the argument, but I buy it? See my earlier post on MDL and SLT. (Hopefully work on the inductive bias of SGD will shed light on a similar result in the training of neural networks.)
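
For concreteness, the result I have in mind (Watanabe's free-energy asymptotics from singular learning theory, stated loosely) is

$$ F_n \;=\; n L_n(w_0) \;+\; \lambda \log n \;+\; O(\log \log n), $$

where $F_n$ is the Bayesian free energy (negative log marginal likelihood), $L_n(w_0)$ is the empirical loss at an optimal parameter, and $\lambda$ is the real log canonical threshold. The $\lambda \log n$ term acts as an MDL-style description-length penalty, so posterior mass concentrates on low-complexity solutions.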

[3] "Simple hypotheses" are adaptive because world phenomena are naturally generated by "simple" processes. Empirically, phenomena have parsimonious explanations. We do not live in the most simple world,2 but physical theories are decomposable.

[4] "Simple hypotheses" are adaptive because learning systems learn simple explanations more effectively. Singular learning theory predicts this for Bayesian reasoners; deep learning generalizes because the parameter function map is biased towards simple functions; cf. inductive bias considerations; "special snowflake" hypotheses where certain kinds of learning are only possible in environments with nice properties, one of which is likely simplicity.

[5] Simple explanations are more memetically fit. Directionally correct: minimizing the free parameters in your model means there is less information that must be communicated. But the pressures shaping acceptance of one theory over another do not rely on simplicity as a primary proxy, and in any case accuracy should be prioritized.

[6] Simplicity is elegant. Deutsch argues for objectivity in aesthetics, such that "aesthetic truths are linked to factual ones by explanations." Schmidhuber defines beauty through simplicity. Surely there's a convergence here; however, attributing causality requires care.

Clearly an Occam-like hypothesis is adaptive. I find empirical justifications for simplicity biases the most compelling, yet precise formulations elude me. I'm excited about fleshing out a correct [1] and a precise [2]; [3] is a philosophical goldmine; [4] is (to me) obviously correct; [5] requires specification; [6] deserves a steelman.

¹ There are subtleties with regard to convergence in distribution versus "pointwise" convergence, and their appropriate characterizations in the Solomonoff induction setting.

² We also probably do not live in the simplest world conditional on our existence, but the arguments here are more nuanced.


a bird's-eye perspective on program equilibria

The concept of "program equilibrium" was first introduced by Tenneholtz in 2004, where he notices that the program $$ \begin{align*} &\text{IF } P_1 = P_2, \text{ THEN COOPERATE} \\ & \text{ELSE DEFECT} \end{align*} $$

cooperates with itself in the Prisoner's Dilemma; moreover, once both players have adopted this program, neither has an incentive to deviate.

Unfortunately, this result as stated holds only on the basis of syntactic analysis: essentially, given two .txt files each representing a player's program, check line by line that they are identical, and if so cooperate. (More sophisticated versions of syntactic analysis exist, but all suffer from needing to coordinate on a shared language, transmission mechanism, etc.) In an ideal world, you would replace $\text{IF } P_1 = P_2$ with a semantic guarantee: verifying effective equivalence via the behavior of programs rather than their apparent structure. (A toy illustration of the fragility follows below.)
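
As a toy illustration (my own, not from Tennenholtz's paper) of how brittle the syntactic check is, note that any semantically irrelevant rewrite of the program text breaks cooperation:

```python
def clique_bot(my_source: str, opponent_source: str) -> str:
    # Tennenholtz-style syntactic comparison: cooperate only when the
    # opponent's program text is byte-for-byte identical to our own.
    return "C" if opponent_source == my_source else "D"

src = 'return "C" if other == self else "D"'
print(clique_bot(src, src))        # "C" -- identical text
print(clique_bot(src, src + " "))  # "D" -- a trailing space defeats it
```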

I'm aware of two ways of doing this:

(1) Simulation. Perhaps the most straightforward way to verify the behavior of a program is to run it. The most naive implementation of the simulation-based approach, running $P_2$ with $P_1$ as input, fails when applied symmetrically: the mutual simulation recurses forever and never halts. However, getting around this is possible at relatively low cost (see the $\epsilon$-GroundedFairBot proposed by Oesterheld and variations thereof).
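
A minimal sketch of the $\epsilon$-grounding trick (my own rendering; Oesterheld's formulation differs in its details): with probability $\epsilon$, cooperate unconditionally; otherwise simulate the opponent against yourself and copy its move. Each level of the mutual recursion halts with probability $\epsilon$, so the whole computation terminates with probability 1:

```python
import random

EPSILON = 0.05  # grounding probability; the value here is illustrative

def grounded_fairbot(opponent):
    # With probability EPSILON, cooperate outright (the "grounding" step
    # that breaks the infinite regress of mutual simulation).
    if random.random() < EPSILON:
        return "C"
    # Otherwise, run the opponent with our own program as input and copy it.
    return opponent(grounded_fairbot)

# Two copies playing each other: the recursion bottoms out at a grounded
# branch after a geometrically distributed number of levels, and every
# terminating path returns "C".
print(grounded_fairbot(grounded_fairbot))  # -> "C"
```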

(2) Proof search. "If I can prove that you cooperate with me, then I cooperate. Otherwise I defect." Typically you parametrize programs as logical sentences taking the outputs of the other programs in the game as parameters, and use some Löbian theorem to prove cooperation; a sketch follows below. (Interestingly, Critch proved that this holds even with access only to bounded provers.)
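
A sketch of the standard Löbian argument (notation mine): write $C_A$ and $C_B$ for the sentences "A cooperates" and "B cooperates," and define each bot so that

$$ C_A \leftrightarrow \Box C_B, \qquad C_B \leftrightarrow \Box C_A. $$

PA then proves $\Box(C_A \wedge C_B) \rightarrow (\Box C_A \wedge \Box C_B) \rightarrow (C_B \wedge C_A)$, so Löb's theorem applied to $C_A \wedge C_B$ gives $\mathrm{PA} \vdash C_A \wedge C_B$: both proof searches succeed and both programs cooperate.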

These are not the only methods for robustly achieving $(C,C)$ in a Prisoner's Dilemma: a more classical approach is to introduce some sort of information revelation and make $(C,C)$ a robust equilibrium in this "correlated" setting. However, taking this "open-source" approach to negotiation has the conceptual benefit (at least to me) of highlighting what sorts of programs form robust coalitions given broad instantiation. It is interesting to ponder from the perspective of "what sorts of policy structures should I update myself towards?"

Recommended: Oesterheld's annotated bibliography.