Notes

meta-learning redux

1ea1a35f687fd461da1181bad3965dfb82c82128362cb8d0ff077f9e94b88d04


cardinal and ordinal utilities

A common mistake is to take one's formalism as metaphysics. This is especially true in domains tangentially related to the study of human behavior: "just because you can be described by a coherence theorem does not mean you are a coherence theorem."

I note that the difference between cardinal and ordinal utilities is not as deep as it may seem. Cardinalists use a utility function u : X → ℝ to describe preferences, while ordinalists restrict themselves to defining only an ordering ⪰ over X.

Under natural conditions, an ordering over X can be represented by a utility function over X: if the ordering ⪰ is complete, transitive, continuous¹, and admits an order-dense subset², then there exists a continuous function u such that u(x) ≥ u(y) if and only if x ⪰ y.

As an example: if X is any convex subset of ℝⁿ and ⪰ is continuous, then these conditions hold and there exists a corresponding continuous utility function.
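In the finite case no topology is needed and the representation is immediate: rank each element by how many elements it weakly dominates. A minimal sketch (the `weakly_prefers` relation and the example set are hypothetical, purely for illustration):

```python
# Finite-case sketch: a complete, transitive ordering on a finite set X
# can be represented by a utility function u that counts how many elements
# each x weakly dominates. (The continuity / order-density conditions only
# matter for infinite X.)

def utility_from_order(X, weakly_prefers):
    """weakly_prefers(x, y) encodes x ⪰ y; assumed complete and transitive."""
    return {x: sum(1 for y in X if weakly_prefers(x, y)) for x in X}

# Hypothetical preferences: bundles ordered by size.
size = {"small": 1, "medium": 2, "large": 3}
X = list(size)
u = utility_from_order(X, lambda x, y: size[x] >= size[y])

# u represents the ordering: u(x) >= u(y) iff x ⪰ y.
assert all((u[x] >= u[y]) == (size[x] >= size[y]) for x in X for y in X)
```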

1

Give X topological structure. If the upper contour set {y ∈ X : y ⪰ x} and lower contour set {y ∈ X : x ⪰ y} are closed in X for every x ∈ X, then ⪰ is continuous.

2

There exists Z ⊆ X such that for every pair x, y ∈ X with x ≻ y, there exists z ∈ Z such that x ⪰ z ⪰ y.


thoughts on VeLO

[paper] [code]

[1] hypernetwork architecture. there's a tensor-level LSTM that takes in bulk loss statistics and outputs a weighting vector c_hyper used to interpolate between a bank of pre-trained per-parameter MLPs, which then get passed the per-parameter loss statistics to compute parameters d, m such that Δp = 10⁻³ · d · exp(10⁻³ · c_lr) · ‖p‖₂ (where c_lr is also produced by the tensor-level LSTM). not immediately obvious to me where these pre-trained MLPs come from
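A minimal numerical sketch of that per-parameter update rule, Δp = 10⁻³ · d · exp(10⁻³ · c_lr) · ‖p‖₂, as I've transcribed it (all names and shapes below are my guesses, not VeLO's actual code):

```python
import numpy as np

# Sketch of the per-parameter update as transcribed above:
#   delta_p = 1e-3 * d * exp(1e-3 * c_lr) * ||p||_2
# d comes from the selected per-parameter MLP, c_lr from the tensor-level
# LSTM. Names and shapes are guesses for illustration only.

def velo_style_update(p, d, c_lr):
    """p: parameter tensor; d: per-parameter direction; c_lr: scalar."""
    return 1e-3 * d * np.exp(1e-3 * c_lr) * np.linalg.norm(p)

p = np.ones(4)                             # ||p||_2 = 2.0
d = np.full(4, 0.5)
delta = velo_style_update(p, d, c_lr=0.0)  # exp(0) = 1
```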

[2] meta-training distribution is surprisingly diverse. image classification / image generation / language modeling, across MLP / CNN / vision transformer / autoencoder / VAE / transformer architectures, with different initializations. + task augmentation! (weight reparameterization / gradient renormalization / etc.). 4k TPU-months of compute. these tasks all dealt with relatively small models: VeLO's generalization to a 100M-param LLM was not good (compared to a tuned Adafactor baseline), but "within distribution" VeLO seems to outperform tuned Adam / Shampoo / other classical optimizers.

[3] ES as a gradient update method. meta-objective was to minimize end-of-training loss (vs. average per-step loss), and the classical antithetic method was used to estimate gradients: perturb the parameters with sampled Gaussian noise and use the estimator (1/(2mσ)) Σ_{i=1}^m (L(θ + σεᵢ) − L(θ − σεᵢ)) εᵢ, where each εᵢ is drawn from a standard normal distribution. I wonder how much there is to be gained by iterating on this approach? the ES literature hasn't quite stagnated
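The antithetic estimator g ≈ (1/(2mσ)) Σᵢ (L(θ + σεᵢ) − L(θ − σεᵢ)) εᵢ is easy to sketch generically (this is not VeLO's implementation, just the textbook estimator):

```python
import numpy as np

# Antithetic ES gradient estimator:
#   g_hat = 1/(2 m sigma) * sum_i (L(theta + sigma*eps_i) - L(theta - sigma*eps_i)) * eps_i
# with eps_i ~ N(0, I).

def es_gradient(L, theta, m=1000, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    g = np.zeros_like(theta)
    for _ in range(m):
        eps = rng.standard_normal(theta.shape)
        g += (L(theta + sigma * eps) - L(theta - sigma * eps)) * eps
    return g / (2 * m * sigma)

# Sanity check on L(theta) = ||theta||^2, whose true gradient is 2*theta:
theta = np.array([1.0, -2.0])
g = es_gradient(lambda t: float(np.sum(t**2)), theta, m=5000)
```

For a quadratic the antithetic difference is exactly 4σ(θ·ε), so the estimator is unbiased for 2θ and converges quickly as m grows.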

[4] Adam is really quite good? VeLO outperforms by (eyeballing) ~2-5x with intense resource input, and even then isn't Pareto-dominant compared to tuned Adam OOD. (VeLO is also much more expensive, & cannot generalize to RL or GNNs (as far as they tested).)

[5] convinced me that truly there's a lot of performance one could eke out of meta-training a learned optimizer, provided one doesn't care about resource constraints. many cool directions to extend this to


morley rank as dimension

[notes]

Let L be a first-order language and let T be a complete L-theory. Given

  • M ⊨ T a model of T,
  • x a variable context,
  • ϕ ∈ L_x(M) a formula with parameters from M and free variables ranging in x, and
  • α an ordinal

we define the Morley rank RM(ϕ) of a formula ϕ by transfinite recursion on the condition RM(ϕ) ≥ α. The recursion satisfies:

  • RM(ϕ) ≥ 0 if and only if M ⊨ ∃x ϕ.
  • RM(ϕ) ≥ α+1 if and only if there is some N ⪰ M (an elementary extension) and a sequence of formulas (ψᵢ)_{i∈ω} ∈ L_x(N)^ω such that for each i, N ⊨ ψᵢ → ϕ and RM(ψᵢ) ≥ α both hold, and for i ≠ j we have N ⊨ ψᵢ → ¬ψⱼ.
  • RM(ϕ) ≥ λ for a limit ordinal λ if and only if RM(ϕ) ≥ α for all α < λ.

We define RM(ϕ) := α if RM(ϕ) ≥ α but not RM(ϕ) ≥ α+1. Every formula in a totally transcendental theory has ordinal-valued Morley rank. Inconsistent formulas are given Morley rank −1.

When taking T = ACF_p (the theory of algebraically closed fields of characteristic p), Morley rank and Krull dimension agree (on constructible sets). Examples:

  • Take finite X ⊆ Kⁿ. This has dimension 0 and can be specified by some formula ϕ. One can verify that RM(ψ) = 0 whenever the solution set of ψ in M is finite and nonempty, so RM(ϕ) = 0.
  • The affine line A¹(K) = K has dimension 1, and RM(K) ≥ 1. However, RM(K) cannot reach 2: every definable subset of K is either finite or cofinite, any two cofinite subsets of an infinite field intersect, and so there is no infinite family of pairwise-disjoint definable subsets of rank ≥ 1. Hence RM(K) = 1. (Similar arguments hold for Aᵏ.)
  • . . .
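The A¹ picture can be made concrete in a toy encoding (the encoding is mine, purely for illustration): represent a definable subset of K by its finite "exceptional" part plus a cofiniteness flag, and read off the rank.

```python
# Toy illustration of the A^1 example: in ACF every definable subset of the
# affine line K is finite or cofinite. Encode a definable set as
# (exceptions, cofinite): cofinite=False means the set IS `exceptions`;
# cofinite=True means it is the complement of `exceptions`.
# Rank: -1 for the empty set, 0 for nonempty finite sets, 1 for cofinite
# sets -- rank never reaches 2 because two cofinite subsets of an infinite
# field always intersect, so no infinite pairwise-disjoint family of
# rank->=1 definable sets exists.

def morley_rank(exceptions, cofinite):
    if cofinite:
        return 1
    return 0 if exceptions else -1

assert morley_rank(set(), cofinite=True) == 1     # K itself: rank 1
assert morley_rank({0, 1}, cofinite=False) == 0   # finite solution set
assert morley_rank(set(), cofinite=False) == -1   # inconsistent formula
```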

on ferrante

A few points I'm not sure Ferrante tried to make but I understood regardless:

[1] "Volitional strength" does not increase with age so much as "volitional complexity" does.

[2] Reminds me of my mom's stories about las colonias. From the basics (community units are families led by patriarchs) to particulars (the Lina / Melina friendship, fancy car as status symbol, loan-shark as resident, the names).

[3] The "respect for the interiority of self" that Lina possesses --- is this an authorial artifact? Faithful developmental description? Does it look like "respect" from the inside as well as the outside?

[4] Characters' relational generators flummox me.

[5] First fifty pages were the most intense reading experience of my life?