Notes

MDL meets SLT

[paper highlight: Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory]

two major contributions of the paper:

  • theoretically links minimum description length to singular learning theory, in that they prove that for all ground-truth distributions q and n i.i.d samples drawn from q, there exists a two-part code with asymptotic redundancy $R_n = \lambda \log n - (m-1)\log\log n + O_p(1)$, where $\lambda$ is the LLC
  • experimental results showing how the LLC varies with model quantization (where quantization is roughly a stand-in for compression and the LLC measures complexity, so one can study empirical correlations between the two)

what is a two-part code? admittedly I'm still slightly bamboozled by the MDL formalism they choose, so this will be a mix of hand-wavy intuition and opaque jargon.

Let $q^{(n)} \in \Delta(\mathcal{X}^n)$ be a data-generating distribution over $n$-sequences drawn from the sample space (token vocabulary) $\mathcal{X}$. Any distribution $p^{(n)}$ over $\mathcal{X}^n$ induces a code for any sample $x^{(n)} \in \mathcal{X}^n$, where a code is just an associated bitstring for the sample. The bitstring has length $-\log p^{(n)}(x^{(n)})$ (the surprisal), and the minimum description length principle is essentially that good encodings should minimize the expected length of encoded samples. Given i.i.d sampling, the long-run optimal encoding distribution is the ground-truth distribution $q^{(n)}$, and $\mathrm{KL}(q \| p)$ has a clean interpretation in this context: the expected excess length per symbol incurred by encoding with $p$ instead of $q$.
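
A minimal numerical check of that last interpretation (my own toy example, not from the paper): for a Bernoulli source $q$ encoded with a mismatched model $p$, the expected code length under $p$ exceeds the entropy of $q$ by exactly $\mathrm{KL}(q\|p)$ bits per symbol.

```python
import numpy as np

# Toy example: a Bernoulli(0.8) source encoded with a mismatched Bernoulli(0.5) model.
q = np.array([0.2, 0.8])   # ground-truth distribution over {0, 1}
p = np.array([0.5, 0.5])   # encoding distribution

entropy_q      = -(q * np.log2(q)).sum()      # optimal bits per symbol (coding with q itself)
expected_len_p = (q * -np.log2(p)).sum()      # bits per symbol actually paid when coding with p
kl_qp          = (q * np.log2(q / p)).sum()   # KL(q || p) in bits

print(entropy_q, expected_len_p, kl_qp)
# expected_len_p == entropy_q + kl_qp: the KL term is exactly the excess length per symbol.
```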

a two-part code is an encoding with two parts: one specifying the encoding distribution ("model") with respect to a model class, and the other specifying the message ("sample") given the model. intuition for this setup: imagine a sender and receiver who already share the knowledge that they will communicate in some language with a certain grammatical structure, but the structure doesn't pin down a full language and vocabulary; however, they both have a dictionary mapping bitstrings to complete languages that they can coordinate on before communicating. (there are much better ways of explaining this).

anyway, you want some way of measuring the performance of your encoding in the two-part setting. there's a quantity called redundancy that measures performance with regards to the underlying data distribution, roughly given by $R_n = \mathrm{len}([\![p]\!]) + \mathrm{KL}(q \| p)$, in the average case, where $[\![p]\!]$ is your bitstring encoding of your model w.r.t. your model class. a natural way of optimizing this is choosing a $p$ which accurately models $q$ and eating the specification cost. However! you might have a model class $\mathcal{M}$ uniquely unsuited to encoding $q$, in which case your optimization problem is more interesting.
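
To make the two-part bookkeeping concrete, here's a toy sketch (my own construction, not the paper's): the model class is a grid of Bernoulli parameters, the first part of the code spends $\log_2|\mathcal{M}|$ bits naming a grid point, and the second part pays $n$ times the per-symbol KL penalty for the mismatch between the named model and $q$.

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL(Bern(q) || Bern(p)) in bits."""
    return q * np.log2(q / p) + (1 - q) * np.log2((1 - q) / (1 - p))

# Model class: Bernoulli parameters on a uniform grid, so naming a model costs log2(|M|) bits.
grid = np.linspace(0.05, 0.95, 19)
model_cost_bits = np.log2(len(grid))

q_true, n = 0.37, 10_000   # ground-truth parameter and number of i.i.d. samples

# Two-part redundancy of each candidate: bits to name the model + expected excess bits for the data.
redundancy = model_cost_bits + n * np.array([kl_bernoulli(q_true, p) for p in grid])

best = grid[np.argmin(redundancy)]
print(best, redundancy.min())
# With a fine grid the best model sits near q_true and the model-naming cost dominates;
# with a coarse grid the KL (data) term takes over.
```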

restating the central theoretical result: for any realizable1 data-generating distribution $q \in \mathcal{M}$ and dataset $x^{(n)}$ sampled i.i.d from $q$, there exists a two-part code whose asymptotic redundancy is $R_n = \lambda \log n - (m-1)\log\log n + O_p(1)$, where $\lambda$ is the LLC of $q$ and $m$ is the "multiplicity."1

it is late and my brain is not quite working, but i don't see optimality guarantees for this result? the construction is of the flavor "choose codes such that

$$\mathrm{len}([\![p]\!]) = \log \frac{\mathrm{Vol}(W)}{V^n_p(\epsilon)},$$

and then this has the $R_n$ given above." (where the $p$ are the model encodings at the centers of $\epsilon$-balls covering model regions with sufficiently small KL divergence; the discretization is necessary to reduce the model class to a set small enough to fully specify with finite codes). like, this is implicitly sane because the $\epsilon$-ball partition assigns each model probability proportional to the share of its $\epsilon$-ball in the volume of the entire space, but IDK why this is the optimal encoding or agreement algorithm? mumbles in Jeffreys prior
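
Here's roughly how I read that code-length choice, as a toy 1-D sketch (my own illustration, not the paper's construction): if you partition $W$ into cells of volume $V$, naming one cell costs $\log_2(\mathrm{Vol}(W)/V)$ bits, so coarser cells are cheaper to specify and finer cells buy accuracy at the price of description length.

```python
import numpy as np

def cell_naming_cost_bits(vol_W, vol_cell):
    """Bits needed to name one cell out of a partition of W into equal-volume cells."""
    return np.log2(vol_W / vol_cell)

vol_W = 1.0  # toy 1-D parameter space W = [0, 1]
for vol_cell in [0.1, 0.01, 0.001]:
    print(vol_cell, cell_naming_cost_bits(vol_W, vol_cell))
# Shrinking the cells buys a more accurate model (smaller KL term in the redundancy)
# at the price of more bits in the first part of the code -- the tradeoff the
# lambda * log n asymptotics is balancing.
```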

a maybe helpful image:

[figure: pythia-llc]

I suppose this is why the empirical results are needed. But the empirical results are like "linear relationships between LLC estimates and critical compression thresholds for models up to 7B parameters," where the critical compression thresholds $n_q$ are literally "how many times do I need to quantize the model before the difference in loss exceeds some threshold." which is cool! but a bit confusing

[figure: pythia-llc]
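
As far as I can tell, the protocol is morally something like the sketch below (my own paraphrase, with a hypothetical loss_fn and naive uniform per-tensor quantization; the paper's actual setup may differ): keep coarsening the weights until the loss degrades by more than a tolerance, and record how coarse you could go.

```python
import numpy as np

def quantize_uniform(w, n_bits):
    """Uniformly quantize a weight array to 2**n_bits levels over its own range."""
    lo, hi = w.min(), w.max()
    levels = 2 ** n_bits - 1
    return lo + np.round((w - lo) / (hi - lo) * levels) / levels * (hi - lo)

def critical_bits(weights, loss_fn, base_loss, tol):
    """Smallest bit-width whose quantized loss stays within tol of the full-precision loss.
    loss_fn is a hypothetical stand-in for evaluating the quantized model on held-out data."""
    for n_bits in range(16, 0, -1):
        if loss_fn(quantize_uniform(weights, n_bits)) - base_loss > tol:
            return n_bits + 1   # the previous (finer) width was the last acceptable one
    return 1
```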

still don't quite understand the theory behind LLC estimation. MDL and SLT connections are cool though, it would be nice to get some naturality results bc the experimental results are not that convincing by themselves (many patterns replicate this, LLC estimation is an art not a science, and quantizing models arbitrarily and doing inference on them seems like it naturally leads to buggy implementations)

1

see technical conditions in the paper


Prediction is not Generation

[a long overdue response to Aidan :)]

In ML, generation and prediction are practically synonymous. Your model learns an appropriate, performant compression of your dataset, and somehow such artifacts generate "completions" (broadly construed) with high accuracy.

It's tempting to then make the leap to "man, if I just managed to tokenize the entire past and future of the universe and train a transformer (with infinite compute) to predict the next universe state every Planck time1 from the universe history up until that point, then it'll be guaranteed to faithfully represent the true laws of physics somewhere in its weights!"

I claim this is unclear! Even if the laws of physics were describable by some finite state automaton, the optimal predictive representation of a process does not necessarily correspond to the optimal generative representation!

Here's a toy case. Consider the space of all stationary Markov processes generating the symbol set $\{0,1\}$. Clearly the best way to predict a process like this (given Markovity) is to assign some probability $p$ to a 1 being generated after a 0, and some probability $q$ to a 0 being generated after a 1. There are two "belief states" of this policy (let's call them A and B—each corresponding to the "belief" that a 0 or a 1, respectively, will be generated2) that the reasoner will occupy with probabilities

$$P(A) = \frac{1-q}{2-p-q}, \qquad P(B) = \frac{1-p}{2-p-q},$$

respectively. The entropy of this two-state system is just the entropy of the stationary distribution (given above), which turns out to be

$$C_\mu = -P(A)\log_2 P(A) - P(B)\log_2 P(B) = \frac{q-1}{2-p-q}\log_2\!\left(\frac{1-q}{2-p-q}\right) + \frac{p-1}{2-p-q}\log_2\!\left(\frac{1-p}{2-p-q}\right).$$

The key point to remember here is that we're using the entropy of the stationary state distribution as a measure of "optimality," in the sense that lower entropy means higher simplicity and as a result it is "more optimal." It stands to reason that if generation and prediction are "the same," then it should be impossible to construct a generative process with lower entropy than Cμ for some values p,q. Right?

Well. Consider p=q=0.4, and consider the generating process below.

[figure: the Lohr generating process]

You can check for yourself that this process outputs a 1 after a 0 with probability $p$, and a 0 after a 1 with probability $q$, for $0 \le p = q \le 1/2$. This process has a stationary distribution

$$\pi = \left[\tfrac{1}{2}-p,\; 2p,\; \tfrac{1}{2}-p\right],$$

and its entropy H[π] for p=q=0.4 is approximately 0.922, less than Cμ=1.
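
The numbers check out, e.g. via this quick script (it only verifies the entropies from the stationary distributions above; the three-state chain itself lives in the figure):

```python
import numpy as np

def entropy_bits(dist):
    """Shannon entropy in bits of a discrete distribution."""
    dist = np.asarray(dist)
    return float(-(dist * np.log2(dist)).sum())

p = q = 0.4

# Predictive (causal-state) description: two belief states A and B.
P_A = (1 - q) / (2 - p - q)
P_B = (1 - p) / (2 - p - q)
C_mu = entropy_bits([P_A, P_B])     # = 1.0 bit for p = q = 0.4

# Lohr-style generative description: three internal states with the stationary distribution above.
pi = [0.5 - p, 2 * p, 0.5 - p]
H_pi = entropy_bits(pi)             # ~0.922 bits

print(C_mu, H_pi)   # the generator's state entropy is strictly below C_mu
```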

Have we been hoodwinked? Maybe one should never trust a sentence beginning with "Clearly, ..." in a mathematical text. Maybe there's a secret predictive policy that magically has lower entropy for $p \in [0.38, 0.5]$3 that we're just missing.

I argue against this. In particular, there is an interpretation of "prediction" at work here that I claim is simultaneously natural and correct.

Consider an infinite string of tokens $\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$. The Markov property states that $P(X_0 \mid X_{:0}) = P(X_0 \mid X_{-1})$: that my causal state is fully determined by timestep $T-1$, and thus the last token output contains all the information I could use to predict the next token output. As an optimal predictor, I want to adhere to the optimal causal policy, which is the minimal-entropy policy over belief states that can be causally differentiated. In this case, it is the two-state policy $\mu$ with entropy $C_\mu$ above.

Observe that the introduction of causality meaningfully differentiates solely generative policies from predictive ones! We have constructed a lower-entropy generative process by relaxing the assumption that we only rely on meaningfully causally differentiated belief states given the token history. There's a sense in which this is the fundamental difference between prediction and generation. It remains to be seen how widely this holds, but the two concepts are canonically differentiated.

Examples taken from James et al.

1

Ignoring that the universe doesn't have a sense of absolute time.

2

In general, belief states are not by default interpretable.

3

The ranges in which the Lohr model has lower entropy than the predictive model.


Miscellaneous Poetry Drafts

I.

a blade of grass hides
minuscule migratory men
Lilliputian fiends

II.

mighty merry rascals
fickle, high off foglefreude
die English skippers

III.

I once beheld Seneca's estate,
Credulously inspecting his wicker tomb—
Yet Margate Mennons and liced defendants
Both swore by its awkward loom.

My heart gasped and lips shuddered
When, to my utmost surprise
The elderly Roman statesman lay
As mummified nitride.

Tick-tock, goes the clock
Garrulous gyrations too
Gizzardly Gentry, nice surprise
Confiding in a martyr's womb.

Foiled! the cuckoo's dead—
Not I, not I, not I!
Lambastation! Aberration!
To defy Nero's evil eye.

IV.

weed stands stout
thistles, burr
gadolinium kraut
hummus and herb

olden spires bristle, copper-waxed
kelpish tides awash blades of glass
Betty stout, orange'd brass
vermillion mounts, dugong grass

betwixed, witched, yonder
your shivers roll down my spine
I gave you a bouquet of thorned tulips
at sunrise, on Cocoa Beach

the rocket's red glare, the bombs
bursting over a lackadaisical mare
I wish Martians dreamt of the stars
alas


Scaling Laws for Transfer Learning

Chinchilla scaling posits that

$$L(N,D) = L_\infty + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},$$

where $N$ is the number of parameters in your language model, $D$ is the number of training tokens seen, $A, B, \alpha, \beta, L_\infty$ are constants, and $L(N,D)$ is the test loss.
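
For concreteness, here's a minimal sketch of the fitted form in code (the constants are roughly the values reported in the Chinchilla paper, but treat them as illustrative defaults rather than exact):

```python
def chinchilla_loss(N, D, A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Chinchilla parametric loss: irreducible term plus parameter- and data-limited terms.
    Constants are approximately the published fits; swap in your own."""
    return L_inf + A / N**alpha + B / D**beta

# e.g. a 70B-parameter model trained on 1.4T tokens (roughly the "Chinchilla-optimal" point):
print(chinchilla_loss(N=70e9, D=1.4e12))
```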

Of course, this analysis is limited:

Still, it is remarkable that test loss performance is so clearly a function of parameters and dataset size, with few assumptions made about the distributional structure or architectural specifics (this holds across varying depth/width ratios, for instance). I find two questions natural:

  • does scaling hold outside of the cross-entropy pretraining regime?
  • can we derive scaling relationships for downstream task performance? in particular, how predictable is transfer learning?

In 2021, OpenAI studied the question of "how much more quickly does a pretrained LM achieve low loss on a finetuning dataset than an LM initialized from scratch on the dataset?" (Note pretraining and finetuning in this case are "the same operation"; the objective is still cross-entropy.) They find that the "effective data transferred"1 $D_T$ is described by $D_T = k\,(D_F)^{\alpha}(N)^{\beta}$, where $D_F$ is the size of the finetuning dataset (in tokens) and $N$ is the number of non-embedding parameters of the model.2 This is great! Strong evidence of the generality of the abstractions the model learns in pretraining (especially given the independence of $\beta$ from the source distribution). However, it doesn't explicitly tell us about downstream task performance given an external metric.
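
In code, the fitted form is just a product of power laws; a sketch with made-up constants (the k, alpha, beta values below are placeholders of my choosing, not the paper's fits):

```python
def effective_data_transferred(D_F, N, k=1e3, alpha=0.3, beta=0.4):
    """Transfer law of the Hernandez et al. form: target-distribution tokens "saved"
    by pretraining, as a function of finetuning tokens D_F and non-embedding params N.
    k, alpha, beta here are illustrative placeholders, not the paper's fitted values."""
    return k * D_F**alpha * N**beta

# A pretrained model finetuned on 10M tokens behaves as if trained on D_F + D_T target tokens:
D_F, N = 10e6, 1e9
print(effective_data_transferred(D_F, N))
```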

Brandfonbrener et. al. do somewhat better. They derive scaling relationships for train-train, train-test, and test-test loss transfer for models trained on different datasets, which can be expressed as

$$L_i\!\left(\hat f_j^{\,N,D}\right) \approx K\left(L_k\!\left(\hat f_l^{\,N,D}\right) - E_{k|l}\right)^{\kappa} + E_{i|j},$$

where you have models $\hat f_j, \hat f_l$ trained on distributions $j, l$ and evaluated on distributions $i, k$, and you're fitting the constants $K, \kappa$.3 As an example, the train-train case would be where $(i,j)=(0,0)$ and $(k,l)=(1,1)$. We pair models by $(N,D)$ for coherence. Notably, these laws hold across diverse datasets, but only hold well in low-loss regimes and when the $E_{m|n}$ terms can be well estimated. Still no breaking of the pretraining regime, and no explicit predictions for downstream metric performance!
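
Fitting such a law reduces to a two-parameter curve fit once models are paired by $(N,D)$; a sketch on synthetic numbers (the loss arrays and irreducible-loss estimates below are invented purely for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

# Paired losses for models matched on (N, D): source-distribution loss vs. target-distribution loss.
# These numbers are synthetic, purely to show the shape of the fit.
L_source = np.array([3.2, 2.9, 2.6, 2.4, 2.2, 2.1])
L_target = np.array([3.6, 3.2, 2.85, 2.65, 2.45, 2.35])
E_source, E_target = 1.8, 2.0   # estimated irreducible losses (E_{k|l}, E_{i|j})

def shifted_power_law(L_src, K, kappa):
    """Loss-to-loss law: shift by the irreducible terms and fit a power law between them."""
    return K * (L_src - E_source) ** kappa + E_target

(K, kappa), _ = curve_fit(shifted_power_law, L_source, L_target, p0=[1.0, 1.0])
print(K, kappa)
```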

There's a meta-analysis out this year that claims that scaling laws are unreliable for downstream task performance prediction. Seems correct. Metrics are noisy and don't have nice algorithmic properties like cross-entropy loss might. Perhaps intriguing is their observation that irregular scaling is (1) common and (2) can occur for cross-entropy on normal tasks and normal LM datasets. This paper claims that larger models & models trained for longer have better downstream task performance even when holding loss constant. Which is an argument for certain training setups & architectures having better inductive biases?

Honestly, I am kind of sad that the extant literature here seems to be tainted by publication bias? I wouldn't really trust these papers (or the ten others I read writing this), and I want to run the experiments myself. The Tinker API seems good for doing this quickly. I'll go do that.

(Titular question pending.)

1

In essence, the number of tokens that you "save" seeing in finetuning by pretraining.

2

Notably β only depends on architecture and TARGET distribution (not SOURCE), while α is a rough "distributional proximity" proxy that can be easily estimated.

3

$E_{m|n}$ is the irreducible loss of a model trained with infinite compute on distribution $n$ and evaluated on distribution $m$.


On Non-Isolated Calls for Structure

Safety cases are arguments that AI deployments are safe in some specified context. The context can include restrictions on deployment environments as well as training or deployment protocols. For instance, the debate safety case only applies to low-stakes deployment environments, requires exploration guarantees on the model, and relies on a debate protocol which avoids obfuscated arguments. Given these assumptions, Buhl et al. argue for “asymptotic guarantees”—that high performance on alignment objectives during training translates to approximate alignment during deployment. The control safety case is structurally similar, instead focusing directly on an explicit threat model and concretizing assumptions accordingly.

A naive way of constructing an “alignment portfolio” is simply to make safety cases which adequately cover all deployment environments with the appropriate degree of risk-tolerance. Formal verification for high-stakes SWE deployment, white-box interpretability for monitoring automated alignment researchers, some adapted debate protocol for use in executive decision-making. If the individual arguments are all sound, this works!

What if we introduce some error into the soundness judgements? If every safety case has some epsilon probability of failure, then straightforwardly you should make more safety cases for the scenarios in which alignment properties matter more. But if all your safety cases for non-deceptive automated alignment researchers rely on “white-box interpretability mostly working,” then if this isn’t true you’re still doomed no matter how many safety cases you write!

Anthropic’s ASL-4 safety case sketches are not quite this correlated, but only just. [1] relies on white-box methods successfully monitoring deception, [3] relies on guarantees that the pretrained model is not coherently deceptive (likely requiring successful white-box or black-box methods), and [2] still depends on linear activation probes adequately showing that the model cannot distinguish between certain classes of train and test deployments, as well as black-box evaluations providing sufficiently robust guarantees on behavior. These are similar assumptions! These assumptions are all only true in worlds where “models are sufficiently parsimonious such that present-day interpretability techniques and evals can provide rigorous guarantees on good behavior.”

In general, insufficient diversity in the world structure assumed across an alignment portfolio makes the portfolio fragile and non-robust.1
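
A toy calculation of the point (my own framing, not from any of the cited sketches): three safety cases that each fail with probability 0.1 give wildly different portfolio-level guarantees depending on whether their load-bearing assumptions fail independently or together.

```python
import numpy as np

rng = np.random.default_rng(0)
n_worlds, eps = 1_000_000, 0.1

# Independent assumptions: each safety case fails on its own epsilon-slice of sampled worlds.
fails_indep = rng.random((n_worlds, 3)) < eps
p_all_fail_indep = fails_indep.all(axis=1).mean()    # ~ eps**3 = 0.001

# Shared assumption ("interpretability mostly works"): all cases fail in the same worlds.
shared_assumption_broken = rng.random(n_worlds) < eps
p_all_fail_shared = shared_assumption_broken.mean()  # ~ eps = 0.1

print(p_all_fail_indep, p_all_fail_shared)
# Adding more safety cases only helps if they fail in *different* worlds.
```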

It is always necessary to make assumptions about world structure when predicting world behavior. A bounded reasoner simulates the world with a local, low-fidelity model based on the reasoner’s accumulated evidence about the world. Some assumptions on world structure are better than others—gravity following an inverse-square law vs. homeopathic remedies curing cancer, for instance.

Considering the structure of one’s structural assumptions is critically important in domains where the relevant world behavior has not yet been exhibited and where getting it right matters. Note:

  • The largest scientific breakthroughs are accompanied by structural assumptions about the world breaking. See the atomic bomb, CRISPR, heavier-than-air flight. Fundamentally, these “expand the domain of the possible.” Sometimes, the world structure is discovered first (as in nuclear theory leading to the first controlled chain reaction). Other times, a prototype uncovers the structure (see: penicillin). In both cases, the non-specialist intelligent reasoner understands a different possibility domain before and after.
  • Top-down searches for structural guarantees must be incredibly judicious in their assumptions, because the vast majority of hypotheses are incorrect. Ex post, the structure is obvious, but ex ante it is not. Consider Newton devoting as much energy to alchemy as the study of gravitation.
  • If we take the perspective that alignment is an infinite problem, there is no good reason to expect that the world structure we can reasonably assume is simple. It might be that it is infinitely complex and is only limited by our current understanding, and that we will recover finer and finer approximations of it as our understanding improves. At each stage of this process we will have to repeat our assumption examination from a roughly equivalent epistemic vantage point of staring into the abyss.
  • Much of the existential risk from AI development comes from tail risks and black swan events. Mitigating these requires a portfolio of solutions which each rely on decorrelated or independent world models (note this is not a guarantee).

Natural corollaries of this observation:

  • we should be explicit about which world models are going into constructing safety cases,
  • we should be developing independent safety cases for high-stakes deployment situations,
  • we should emphasize diversity in theoretical agendas to buttress our ability to make such safety cases reliant on disjoint sets of assumptions.
1

This is a specific instance of the general case of “Swiss cheese models only work when the holes don’t line up in the same worlds,” which is probably not sufficiently justified in this post but is something I believe to be true.