Notes

Astronomical Waste Given Acausal Considerations

[speculative draft]

Bostrom's original astronomical waste argument is as follows:

  1. Consider all stars in the Virgo supercluster.
  2. Now consider the total number of digital humans simulatable with the energy stored in these stars, given that the energy is harvested with technologies currently assessed to be feasible.
  3. This provides a lower bound on the potential value lost per unit time, assuming an ethical stance at least somewhat similar to that of an aggregative total utilitarian.
  4. This is a lot of value per unit time.
  5. Correspondingly, existential risk poses a threat so large it dominates all other considerations, as it eliminates the possibility of human colonization.

This model is static. In particular, it does not consider dynamism in the size of the actualizable universe. By restricting to the local supercluster, one ignores the potential resources of the stars beyond, including those rendered inaccessible by cosmological expansion. For the purposes of establishing a conservative bound on the potential value left on the table, these nitpicks are minor. However, when assessing the tradeoffs between safety-focused and progress-focused policies under various ethical viewpoints, they matter.

Obviously, the natural extension is to introduce models of spacefaring civilizational expansion and develop more quantitative estimates of "median"-case spatial diffusion under reasonable hypotheses. This analysis would be informative and useful. I will not be pursuing it further in this post.

Rather, I am interested in a more esoteric setting. Acausal bargaining strategies give agents the ability to influence universes beyond their traditionally considered scope (e.g. the lightcone) by independently considering and instantiating the values of agents who would, given their value instantiation in our world, take actions in partial accordance with our values in theirs. A "coalition" then forms between agents who engage in this reciprocal trade.

What properties might this coalition have?

  • It is composed of agents who care about their values being instantiated across the space of universes the agents in the coalition occupy. These agents then likely have values which are universal.
  • It is composed of agents who are sufficiently cognitively advanced to reason in thought-patterns similar to the ones described. Given that human understanding is at a nascent stage, this puts a rough lower bound on capability.
  • The coalition is necessarily composed of agents which are willing to trade with us.
  • [other properties that can likely be gleaned by a mind with more intelligence than mine]

These are local properties, in that they are agent properties which then place some constraints on the environments the agents find themselves in. They are not global properties (constraints on the laws of physics of the relevant universes). Our reasoning about the space of acausally-influencable universes via acausal bargaining is thus necessarily agent-centric.

Under these considerations, the actualizable universe of a member of a given coalition is the union of the causally-actualizable universes of the members of the coalition. Astronomical waste concerns would then occur whenever considering actions or inactions leading to a decrease in the size of the actualizable universe.

What actions or inactions would affect the size of the actualizable universe? It's difficult to come up with a natural conception of global temporality in this setting: local constraints on agent environments tell us little about what stage of the cosmological lifetime the agent exists in. Interpreting temporality within a member's lightcone is easier: waiting to implement acausal bargaining strategies leads to a loss of value-bargaining-power, given that the size of the member's causally-actualizable universe decreases in accordance with classical astronomical waste arguments.

It is tempting to say that this loss is massive, much more massive than the temporal loss associated with one's own causally-actualizable universe. I refrain from claiming this strongly because I do not have a good understanding of what bargaining strategies within an acausal coalition look like. Finnveden's Asymmetric ECL and Treutlein's Modeling evidential cooperation in large worlds are good places to look to start thinking about this. (Dai makes the argument that it seems like our universe is pretty small, so it stands to reason there's much more to be gained via acausal trade.)

[even more speculative]

It is also possible that a multiplicity of coalitions is induced by agents throughout the multiverse having conflicting values. Given the aggregative tendency to ensure value lock-in for successor agents, it is potentially the case that coalitional lock-in at the civilizational level occurs shortly after knowledge of basic acausal bargaining strategies. It is not insane to assume heterogeneity in the sizes of coalitional actualizable universes, implying that large sources of astronomical waste may come from choosing incorrectly, or from joining smaller coalitions.

[I note that it is possible the notion of an "acausal coalition" is flawed and in fact acausal trades are not closed in this manner---A can trade with B and B can trade with C while C might not be able to trade with A.]

[TODO: introduce bargaining models, quantify classical astronomical waste in the spatial setting, quantify universe "smallness" under variety of cosmological models]


Perfect SMLD is TC^0

This paper, Perfect diffusion is TC^0 -- Bad diffusion is Turing-complete, has been stuck in my head. Here's an exposition of 1/2 of the result.

(I claim no novelty in either exposition or content for what is below)

what is SMLD?

Given a distribution $\rho_{\text{data}}$ over $\mathbb{R}^d$, SMLD aims to learn a score-matching function $f_\theta$ which lets us sample from $\rho_{\text{data}}$ well. In particular, for $x \sim \rho_0$, where we define $\rho_0 = \rho_{\text{data}}$, we aim for $f_\theta(x, t)$ to approximate $\nabla_x \log \rho_t$. Why are we introducing time as a parameter? The point of diffusion models is to learn a function which can denoise a noising process and thus sample the underlying distribution well. SMLD does this by training a model to learn the score function in the reverse Langevin process for a given noise schedule. In more detail:

  • Take some noise schedule $\beta(t): [0, \infty) \to [0, \infty)$ such that $\int_0^\infty \beta(t)\,dt = \infty$ (this is to ensure the distribution gets fully noised).
  • Evolve my data sample $x_t \sim \rho_t$ according to $dx_t = -\frac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,dW_t$, which is just Langevin dynamics (and $dW_t$ is the Brownian motion contributor). This has a corresponding evolution over the distributions given by $\partial_t \rho_t = \frac{1}{2}\beta(t)\left(\nabla \cdot (x \rho_t) + \Delta \rho_t\right)$ (straightforward from Fokker--Planck).
  • Fix a time $T > 0$. The reverse time-evolution process can be exactly characterized by $d\hat{x}_t = \frac{1}{2}\beta(T-t)\,\hat{x}_t\,dt + \beta(T-t)\,\nabla_{\hat{x}_t} \log \rho_{T-t}(\hat{x}_t)\,dt + \sqrt{\beta(T-t)}\,dW_t$, and $\nabla_{\hat{x}_t} \log \rho_{T-t}(\hat{x}_t)$ is exactly what we're trying to approximate with $f_\theta(\hat{x}, T-t)$.
  • So, if I sample $\hat{x}_0 \sim \mathcal{N}(0, I_d)$ (pure noise!), by solving the reversed SDE above and substituting $f_\theta$ for the score function, I can approximately sample from my underlying data distribution.
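A sanity check on the recipe above, as a minimal numpy sketch. I assume (my simplification, not from the original exposition) a constant schedule $\beta(t) = \beta$ and 1-D Gaussian data $\rho_0 = \mathcal{N}(0, \sigma_0^2)$, for which the score is available in closed form, $\nabla_x \log \rho_t(x) = -x/\mathrm{Var}_t$ with $\mathrm{Var}_t = \sigma_0^2 e^{-\beta t} + 1 - e^{-\beta t}$, and stands in for $f_\theta$:

```python
import numpy as np

# Euler--Maruyama integration of the reverse SDE, with the exact score
# of a 1-D Gaussian standing in for f_theta. Constant beta(t) = beta is
# an assumption made here for tractability.
rng = np.random.default_rng(0)
beta, sigma0, T, dt = 1.0, 2.0, 5.0, 1e-2
n_steps, n_samples = int(T / dt), 20_000

def score(x, t):
    # grad_x log rho_t for rho_0 = N(0, sigma0^2) under this forward SDE
    var_t = sigma0**2 * np.exp(-beta * t) + (1.0 - np.exp(-beta * t))
    return -x / var_t

x = rng.standard_normal(n_samples)  # start from (approximately) pure noise
for k in range(n_steps):
    t = T - k * dt  # forward time corresponding to reverse step k
    drift = 0.5 * beta * x + beta * score(x, t)
    x = x + drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n_samples)

print(np.var(x))  # should land near sigma0^2 = 4
```

The recovered sample variance should approach $\sigma_0^2 = 4$, up to discretization and Monte Carlo error.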

TC^0 circuit families are (essentially) neural networks

Circuit complexity classes describe "families of boolean circuits." A "family of boolean circuits" is just the set of all possible boolean circuits satisfying some properties; a boolean circuit is (abstractly) some DAG of "gates" (boolean functions) that takes a series of Boolean inputs to a Boolean output. $\mathsf{TC}^0$ in particular satisfies:

  • polynomial width: the number of gates at each depth is upper-bounded by a polynomial in the input size $n$,
  • constant depth: the maximum number of gates from input to output is upper-bounded by some constant,
  • unbounded fan-in: each gate can receive inputs from arbitrarily many other gates,
  • each gate is either AND, OR, NOT, or MAJ (for "majority"; returns True when half or more arguments are True and False otherwise).

You can use MAJ gates to simulate threshold gates. As such, this last bullet point is equivalent to:

  • each gate is a threshold gate: $\mathrm{step}\left(\sum_i w_i x_i + t\right)$, where the $w_i, t$ are real numbers and $\mathrm{step}$ returns $1$ if the value is greater than $1/2$ and $0$ otherwise.

This is a feed-forward neural network, where each gate is an activation function acting on Boolean inputs. $\mathsf{TC}^0$ is the class of neural networks with width $\leq p(n)$ and constant depth $D$. Circuit complexity classes in general are somewhat pathological, and can solve sets of problems that one might not expect to be easily grouped together. But this one is quite naturally suited to ML purposes.
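To make the MAJ-to-threshold-gate direction concrete, here is a minimal sketch (the helper names are mine, not from the paper): a single step neuron with unit weights and bias $1 - \lceil n/2 \rceil$ computes MAJ on $n$ Boolean inputs.

```python
import numpy as np
from itertools import product

def threshold_gate(x, w, t):
    # step(sum_i w_i x_i + t): fires iff the value exceeds 1/2
    return int(np.dot(w, x) + t > 0.5)

def MAJ(x):
    # majority as a single threshold gate: unit weights, bias 1 - ceil(n/2),
    # so the gate fires iff sum(x) >= ceil(n/2), i.e. half or more inputs are 1
    n = len(x)
    m = (n + 1) // 2  # ceil(n/2)
    return threshold_gate(x, np.ones(n), 1 - m)

# brute-force check against the "half or more" definition of MAJ
for n in range(1, 7):
    for bits in product([0, 1], repeat=n):
        assert MAJ(np.array(bits)) == int(sum(bits) >= n / 2)
```

AND, OR, and NOT are special cases of the same form (e.g. AND is unit weights with bias $1 - n$), which is why the two bullet-point formulations coincide.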

proof of the result

Theorem. Take a $\mathsf{TC}^0$ family of score-networks $f_{\theta,0}, f_{\theta,1}, \ldots$ such that for each $n$ and each $x_1, \ldots, x_n$ the function $f_{\theta,n}(x, t \mid x_1, \ldots, x_n)$ exactly computes the score function of some initial distribution $\rho_{0,n}$ with bounded first moment. If this family solves a prefix language modeling problem in the SMLD infinite-time limit with a constant probability bound, then the problem is in $\mathsf{TC}^0$.

What is a "prefix language modeling problem"? Next-token prediction: given n previous tokens in an alphabet, predict token n+1. This is solved by a circuit complexity class if a family of circuits Ci for every input size satisfying the complexity class solves the problem.

The proof relies on a result from the literature given in O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions, which states that there exists some universal constant $c > 0$ such that $TV(p_X, p_Y) \leq c\frac{d \log^3 T}{T} + c\,\epsilon_{\text{score}}\sqrt{\log T}$. Here $TV$ is the total variation distance, $X$ is the data distribution, and $Y$ is a distribution generated by DDPM denoising (essentially just a discretization of SMLD). One can use this to upper bound the total variation distance between the DDPM sampler and our denoising process by some constant $\epsilon'$, and setting our constant probability bound of finding a solution to $\epsilon$, we find that $\epsilon' = \epsilon/2$ is enough for our denoising process to solve the problem. This is then derandomized by a construction given in Threshold circuits of bounded depth, which reconstructs a $\mathsf{TC}^0$ class.

One may wonder where we require "perfect" score matching! Well, the $c\,\epsilon_{\text{score}}\sqrt{\log T}$ term increases with $T$, so for this proof to work completely cleanly one requires $\epsilon_{\text{score}}$ to be set to $0$. Practical diffusion networks are not like this --- there will always be some error.


Mid-October Links

i am tired so here is a linkpost. i'll try not to do more than two of these a month

  1. Misha Gromov mathematizes biology (in an IHES lecture series). See also his manuscript on ergosystems.
  2. Gauge/string dualities as special cases of Schur-Weyl duality.
  3. Tao on when eigenvalues are stable under (small) perturbations, what gauges are, and orders of infinity.
  4. Constructed languages are processed by the same brain mechanisms as natural languages.
  5. Negarestani on the alien will, toy aesthetics, and toy philosophy. He also has a complexity sciences reading list which is surprisingly reasonable.
  6. A 3000pg algebraic topology reference with pictures.
  7. Tsvi on gemini modeling and counting down vs. counting up coherence.
  8. Schulman on when low-rank LoRA underperforms and when it matches full FT. In particular, rank-1 LoRA is sufficient for RL tasks.
  9. Associative memory in an optical spin glass made of rubidium Bose-Einstein condensates. Ganguli is a co-author.
  10. Lada Nuzhna on where are all the trillion dollar biotechs?
  11. Francis Bach with some more "classical" settings for scaling laws. See accompanying blog posts.
  12. Andrew Critch has an interesting blog post on Newcombian implications on self-trust. Christiano also has a blog post on integrity for consequentialists.
  13. Homotopy is not concrete.
  14. Nick Bostrom profile in the New Yorker.
  15. Dean Ball on what it's like to work in the White House. He ~wrote the AI Action Plan.
  16. Vitalik on memory access actually taking $O(N^{1/3})$ time, low-risk defi as an Ethereum business model, copy-left vs. permissive licenses, and musings on ideologies. His posts are great. Highly recommend.
  17. Ben Kuhn on how taste can be the leading contributor to impact. See also Chris Olah's exercises for developing research taste.
  18. Exposition of homotopy type theory with $\infty$-topos semantics by Emily Riehl. I really like these! They're cogent and clear.
  19. The Ohio State University is hosting an International Conference on Ancient Magic this weekend.
  20. Aging as a loss of goal-directedness, from the Levin lab.
  21. Eliot's The Hollow Men and Blake's The Proverbs of Hell.
  22. An optimistic case for protein foundation model companies.
  23. Grokking as a first order phase transition in neural networks. Good example of mean field theory as a thermodynamic theory of learning.
  24. If you like the items on this list, or especially if you wish the items on this list were better, email me!

MDL meets SLT

[paper highlight: Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory]

two major contributions of the paper:

  • theoretically links minimum description length to singular learning theory: they prove that for any ground-truth distribution $q$ and $n$ i.i.d. samples drawn from $q$, there exists a two-part code with asymptotic redundancy $R_n = \lambda \log n - (m-1)\log\log n + O_p(1)$, where $\lambda$ is the LLC
  • experimental results showing how the LLC varies with model quantization (where quantization is ~roughly a stand-in for compression and the LLC measures complexity, so one can study empirical correlations)

what is a two-part code? admittedly I'm still slightly bamboozled by the MDL formalism they choose, so this will be a mix of hand-wavy intuition and opaque jargon.

Let $q^{(n)} \in \Delta(\mathcal{X}^n)$ be a data-generating distribution over $n$-sequences drawn from the sample space (token vocabulary) $\mathcal{X}$. Any distribution $p^{(n)}$ over $\mathcal{X}^n$ induces a code for any sample $x^{(n)} \in \mathcal{X}^n$, where a code is just an associated bitstring for the sample. The bitstring has length $-\log p^{(n)}(x^{(n)})$ (whose expectation is the entropy when $p = q$), and the minimum description length principle is essentially that good encodings should minimize the average length of samples. Given i.i.d. sampling, the long-run optimal encoding distribution is the ground-truth distribution $q^{(n)}$, and $KL(q \| p)$ has a clean interpretation in this context: the expected excess length per symbol given by encoding with distribution $p$ vs. $q$.
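The "expected excess length per symbol" reading of $KL(q \| p)$ can be checked directly; a small numpy sketch with made-up distributions:

```python
import numpy as np

# Coding samples from q with idealized code lengths -log2 p(x) costs
# H(q) + KL(q || p) bits per symbol on average. The distributions here
# are made up for illustration.
q = np.array([0.7, 0.2, 0.1])    # ground-truth distribution
p = np.array([0.5, 0.25, 0.25])  # mismatched encoding distribution

H_q = -np.sum(q * np.log2(q))        # optimal bits/symbol (entropy of q)
avg_len_p = -np.sum(q * np.log2(p))  # average bits/symbol when coding via p
KL = np.sum(q * np.log2(q / p))      # expected excess length

print(avg_len_p - H_q, KL)  # equal by definition
```

The excess over the entropy is exactly the KL divergence, which is why a good two-part code wants a $p$ close to $q$ in KL.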

a two-part code is an encoding with two parts: one specifying the encoding distribution ("model") with respect to a model class, and the other specifying the message ("sample") given the model. intuition for this setup: imagine a sender and receiver who both know they will communicate via some language with some grammatical structure, but the structure insufficiently specifies a full language and vocabulary; however, they both have a dictionary mapping bitstrings to complete languages that they can coordinate on before communicating. (there are much better ways of explaining this).

anyway, you want some way of measuring the performance of your encoding in the two-part setting. there's a quantity called redundancy that measures performance with respect to the underlying data distribution, roughly given (in the average case) by $R_n = \mathrm{len}([[p]]) + KL(q \| p)$, where $[[p]]$ is the bitstring encoding of your model w.r.t. your model class. a natural way of optimizing this is choosing a $p$ which accurately models $q$ and eating the specification cost. However! you might have a model class $\mathcal{M}$ uniquely unsuited to encoding $q$, in which case your optimization problem is more interesting.

restating the central theoretical result: for any realizable1 data-generating distribution $q \in \mathcal{M}$ and dataset $x^{(n)}$ sampled i.i.d. from $q$, there exists a two-part code whose asymptotic redundancy is $R_n = \lambda \log n - (m-1)\log\log n + O_p(1)$, where $\lambda$ is the LLC of $q$ and $m$ is the "multiplicity."1
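to see what the asymptotics buy you, here's a toy sketch plugging hypothetical numbers into the redundancy formula and comparing against the classical regular-model two-part-code rate of $(d/2)\log n$ (the $\lambda$, $m$, $d$ below are made up for illustration; for singular models $\lambda$ can be much smaller than $d/2$):

```python
import numpy as np

# Hypothetical LLC, multiplicity, and parameter count -- made up,
# chosen so that lambda < d/2, as happens for singular models.
lam, m, d = 1.5, 2, 10

for n in [10**2, 10**4, 10**6]:
    R_singular = lam * np.log(n) - (m - 1) * np.log(np.log(n))
    R_regular = (d / 2) * np.log(n)
    print(n, round(R_singular, 2), round(R_regular, 2))
```

the gap grows with $n$: a singular model class pays a description-length price governed by the LLC rather than the raw parameter count.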

it is late and my brain is not quite working, but i don't see optimality guarantees for this result? the construction is of the flavor "choose codes such that

$$\mathrm{len}([[p]]) = \log \frac{\mathrm{Vol}(W)}{\mathrm{Vol}(V_p^n(\epsilon))},$$

and then this has the $R_n$ given above." (where the $p$ are the model encodings at the centers of $\epsilon$-balls covering model regions with sufficiently small KL divergence; the discretization is necessary to reduce to a set small enough to fully specify with codes). like, this is implicitly sane because the KL $\epsilon$-ball partition assigns each model probability proportional to its $\epsilon$-ball's share of the volume of the entire space, but IDK why this is the optimal encoding or agreement algorithm? mumbles in Jeffreys prior

a maybe helpful image:

[figure: pythia-llc]

I suppose this is why the empirical results are needed. But the empirical results are like "linear relationships between LLC estimates and critical compression thresholds for models up to 7B parameters", where the critical compression threshold $n_q$ is literally "how many times do I need to quantize the model before the difference in loss exceeds some threshold." which is cool! but a bit confusing

[figure: pythia-llc]

still don't quite understand the theory behind LLC estimation. the MDL and SLT connections are cool though; it would be nice to get some naturality results, bc the experimental results are not that convincing by themselves (many patterns replicate this, LLC estimation is an art not a science, and quantizing models arbitrarily and doing inference on them seems like it naturally leads to buggy implementations)

1

see technical conditions in the paper


Prediction is not Generation

[a long overdue response to Aidan :)]

In ML, generation and prediction are practically synonymous. Your model learns an appropriate, performant compression of your dataset, and somehow such artifacts generate "completions" (broadly construed) with high accuracy.

It's tempting to then make the leap to man, if I just managed to tokenize the entire past and future of the universe and train a transformer (with infinite compute) to predict the next universe state every Planck time1 from the universe history up until that point, then it'll be guaranteed to faithfully represent the true laws of physics somewhere in its weights!

I claim this is unclear! Even if the laws of physics were describable by some finite state automaton, the optimal predictive representation of a process does not necessarily correspond to the optimal generative representation!

Here's a toy case. Consider the space of all stationary Markov processes generating the symbol set $\{0, 1\}$. Clearly the best way to predict a process like this (given Markovity) is to assign some probability $p$ to a $1$ being generated after a $0$, and some probability $q$ to a $0$ being generated after a $1$. There are two "belief states" of this policy (let's call them $A$ and $B$, each corresponding to the "belief" that a $0$ or a $1$, respectively, will be generated2) that the reasoner will occupy with probabilities

$$P(A) = \frac{1-q}{2-p-q}, \qquad P(B) = \frac{1-p}{2-p-q}$$ respectively. The entropy of this two-state system is just the entropy of the stationary distribution (given above), which turns out to be

$$C_\mu = -P(A)\log_2 P(A) - P(B)\log_2 P(B) = \frac{q-1}{2-p-q}\log_2\left(\frac{1-q}{2-p-q}\right) + \frac{p-1}{2-p-q}\log_2\left(\frac{1-p}{2-p-q}\right).$$

The key point to remember here is that we're using the entropy of the stationary state distribution as a measure of "optimality," in the sense that lower entropy means higher simplicity and as a result it is "more optimal." It stands to reason that if generation and prediction are "the same," then it should be impossible to construct a generative process with lower entropy than Cμ for some values p,q. Right?

Well. Consider p=q=0.4, and consider the generating process below.

[figure: the Lohr generating process]

You can check for yourself that this process outputs a $1$ after a $0$ with probability $p$, and a $0$ after a $1$ with probability $q$, for $0 \le p = q \le 1/2$. This process has a stationary distribution

$$\pi = \left[\frac{1}{2} - p,\; 2p,\; \frac{1}{2} - p\right],$$

and its entropy $H[\pi]$ for $p = q = 0.4$ is approximately $0.922$, less than $C_\mu = 1$.
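The two entropies can be checked numerically; a minimal numpy sketch at $p = q = 0.4$, using the stationary distributions as given above:

```python
import numpy as np

# Entropy of the two-state predictive (causal-state) model vs. the
# three-state generator, at p = q = 0.4.
p = q = 0.4

# belief-state stationary probabilities P(A), P(B) from above
PA = (1 - q) / (2 - p - q)
PB = (1 - p) / (2 - p - q)
C_mu = -PA * np.log2(PA) - PB * np.log2(PB)

# generator's stationary distribution pi = [1/2 - p, 2p, 1/2 - p]
pi = np.array([0.5 - p, 2 * p, 0.5 - p])
H_pi = -np.sum(pi * np.log2(pi))

print(C_mu, H_pi)  # 1.0 vs ~0.922: the generator is strictly simpler
```

At $p = q$ the belief states are equiprobable, so $C_\mu = 1$ exactly, while the generator's stationary entropy dips below it.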

Have we been hoodwinked? Maybe one should never trust a sentence beginning with "Clearly, . . ." in a mathematical text. Maybe there's a secret predictive policy that magically has lower entropy for $p \in [0.38, 0.5]$3 that we're just missing.

I argue against this. In particular, there is an interpretation of "prediction" in use here that I claim is simultaneously natural and correct.

Consider an infinite string of tokens $\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$. The Markov property states that $P(X_0 \mid X_{<0}) = P(X_0 \mid X_{-1})$: my causal state is fully determined by the previous timestep, and thus the last token output contains all the information I could use to predict the next token output. As an optimal predictor, I want to adhere to the optimal causal policy, which is the minimal-entropy policy over belief states that can be causally differentiated. In this case, it is the two-state policy $\mu$ with entropy $C_\mu$ above.

Observe that the introduction of causality meaningfully differentiates solely generative policies from predictive ones! We have constructed a lower-entropy generative process by relaxing the assumption that we only rely on meaningfully causally differentiated belief states given the token history. There's a sense in which this is the fundamental difference between prediction and generation. It remains to be seen how widely this holds, but the two concepts are canonically differentiated.

Examples taken from James et al.

1

Ignoring that the universe doesn't have a sense of absolute time.

2

In general, belief states are not by default interpretable.

3

The ranges in which the Lohr model has lower entropy than the predictive model.