Notes

meta-learning redux

1ea1a35f687fd461da1181bad3965dfb82c82128362cb8d0ff077f9e94b88d04


cardinal and ordinal utilities

A common mistake is to take one's formalism as metaphysics. This is especially true in domains tangentially related to the study of human behavior: "just because you can be described by a coherence theorem does not mean you are a coherence theorem."

I note that the difference between cardinal and ordinal utilities is not as deep as it may seem. Cardinalists use a utility function u : X → ℝ to describe preferences, while ordinalists restrict themselves to defining only an ordering ⪰ over X.

Under natural conditions, an ordering over X can be represented by a utility function over X: if the ordering ⪰ is complete, transitive, continuous¹, and admits an order-dense subset², then there exists a continuous function u such that u(x) ≥ u(y) if and only if x ⪰ y.

As an example: if X is any convex subset of ℝⁿ and ⪰ is continuous, then these conditions hold and there exists a corresponding continuous utility function.
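In the finite case no topology is needed and the representation is immediate: rank each element by how many elements it weakly dominates. A minimal sketch (the `weakly_prefers` relation and the example set are hypothetical, purely for illustration):

```python
# Finite-case sketch: a complete, transitive ordering on a finite set X
# can be represented by a utility function u that counts how many elements
# each x weakly dominates. (The continuity / order-density conditions only
# matter for infinite X.)

def utility_from_order(X, weakly_prefers):
    """weakly_prefers(x, y) encodes x ⪰ y; assumed complete and transitive."""
    return {x: sum(1 for y in X if weakly_prefers(x, y)) for x in X}

# Hypothetical preferences: bundles ordered by size.
size = {"small": 1, "medium": 2, "large": 3}
X = list(size)
u = utility_from_order(X, lambda x, y: size[x] >= size[y])

# u represents the ordering: u(x) >= u(y) iff x ⪰ y.
assert all((u[x] >= u[y]) == (size[x] >= size[y]) for x in X for y in X)
```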

1

Give X topological structure. If the upper contour set {y ∈ X : y ⪰ x} and lower contour set {y ∈ X : x ⪰ y} are closed in X for every x ∈ X, then ⪰ is continuous.

2

There exists Z ⊆ X such that for every pair x, y ∈ X with x ≻ y, there exists z ∈ Z such that x ⪰ z ⪰ y.


thoughts on VeLO

[paper] [code]

[1] hypernetwork architecture. there's a tensor-level LSTM that takes in bulk loss statistics and outputs a weighting vector c_hyper used to interpolate between a bank of pre-trained per-parameter MLPs, which then get passed the per-parameter loss statistics to compute parameters d, m such that Δp = 10⁻³ · d · exp(10⁻³ · c_lr) · ‖p‖₂ (where c_lr is also produced by the tensor-level LSTM). not immediately obvious to me where these pre-trained MLPs come from
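A minimal numerical sketch of that per-parameter update rule, Δp = 10⁻³ · d · exp(10⁻³ · c_lr) · ‖p‖₂, as I've transcribed it (all names and shapes below are my guesses, not VeLO's actual code):

```python
import numpy as np

# Sketch of the per-parameter update as transcribed above:
#   delta_p = 1e-3 * d * exp(1e-3 * c_lr) * ||p||_2
# d comes from the selected per-parameter MLP, c_lr from the tensor-level
# LSTM. Names and shapes are guesses for illustration only.

def velo_style_update(p, d, c_lr):
    """p: parameter tensor; d: per-parameter direction; c_lr: scalar."""
    return 1e-3 * d * np.exp(1e-3 * c_lr) * np.linalg.norm(p)

p = np.ones(4)                             # ||p||_2 = 2.0
d = np.full(4, 0.5)
delta = velo_style_update(p, d, c_lr=0.0)  # exp(0) = 1
```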

[2] meta-training distribution is surprisingly diverse. image classification / image generation / language modeling, across MLP / CNN / vision transformer / autoencoder / VAE / transformer architectures, with different initializations. + task augmentation! (weight reparameterization / gradient renormalization / etc.). 4k TPU-months of compute. these tasks all dealt with relatively small models: VeLO's generalization to a 100M-param LLM was not good (compared to a tuned Adafactor baseline), but "within distribution" VeLO seems to outperform tuned Adam / Shampoo / other classical optimizers.

[3] ES as a gradient update method. meta-objective was to minimize end-of-training loss (vs. average per-step loss), and the classical antithetic method was used to estimate gradients: perturb the parameters with sampled Gaussian noise and use the estimator (1/(2mσ)) Σ_{i=1}^m (L(θ + σεᵢ) − L(θ − σεᵢ)) εᵢ, where each εᵢ is drawn from a standard normal distribution. I wonder how much there is to be gained by iterating on this approach? the ES literature hasn't quite stagnated
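The antithetic estimator g ≈ (1/(2mσ)) Σᵢ (L(θ + σεᵢ) − L(θ − σεᵢ)) εᵢ is easy to sketch generically (this is not VeLO's implementation, just the textbook estimator):

```python
import numpy as np

# Antithetic ES gradient estimator:
#   g_hat = 1/(2 m sigma) * sum_i (L(theta + sigma*eps_i) - L(theta - sigma*eps_i)) * eps_i
# with eps_i ~ N(0, I).

def es_gradient(L, theta, m=1000, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    g = np.zeros_like(theta)
    for _ in range(m):
        eps = rng.standard_normal(theta.shape)
        g += (L(theta + sigma * eps) - L(theta - sigma * eps)) * eps
    return g / (2 * m * sigma)

# Sanity check on L(theta) = ||theta||^2, whose true gradient is 2*theta:
theta = np.array([1.0, -2.0])
g = es_gradient(lambda t: float(np.sum(t**2)), theta, m=5000)
```

For a quadratic the antithetic difference is exactly 4σ(θ·ε), so the estimator is unbiased for 2θ and converges quickly as m grows.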

[4] Adam is really quite good? VeLO outperforms by (eyeballing) ~2-5x with intense resource input, and even then isn't Pareto-dominant compared to tuned Adam OOD. (VeLO is also much more expensive, & cannot generalize to RL or GNNs (as far as they tested).)

[5] convinced me that truly there's a lot of performance one could eke out of meta-training a learned optimizer, provided one doesn't care about resource constraints. many cool directions to extend this to


morley rank as dimension

[notes]

Let L be a first-order language and let T be a complete L-theory. Given

  • M ⊨ T a model of T,
  • x a variable context,
  • ϕ ∈ L_x(M) a formula with parameters from M and free variables ranging in x, and
  • α an ordinal

we define the Morley rank RM(ϕ) of a formula ϕ by transfinite recursion on the condition RM(ϕ) ≥ α. The recursion satisfies:

  • RM(ϕ) ≥ 0 if and only if M ⊨ ∃x ϕ.
  • RM(ϕ) ≥ α+1 if and only if there is some N ⪰ M (an elementary extension) and a sequence of formulas (ψᵢ)_{i∈ω} ∈ L_x(N)^ω such that for each i, N ⊨ ψᵢ → ϕ and RM(ψᵢ) ≥ α both hold, and for i ≠ j we have N ⊨ ψᵢ → ¬ψⱼ.
  • RM(ϕ) ≥ λ for a limit ordinal λ if and only if RM(ϕ) ≥ α for all α < λ.

We define RM(ϕ) := α if RM(ϕ) ≥ α but not RM(ϕ) ≥ α+1. Every formula in a totally transcendental theory has ordinal-valued Morley rank. Inconsistent formulas are given Morley rank −1.

When taking T = ACF_p (the theory of algebraically closed fields of characteristic p), Morley rank and Krull dimension agree (on constructible sets). Examples:

  • Take finite X ⊆ Kⁿ. This has dimension 0 and can be specified by some formula ϕ. One can verify that RM(ψ) = 0 whenever the solution set of ψ in M is finite and nonempty, so RM(ϕ) = 0.
  • The affine line A¹(K) = K has dimension 1, and RM(K) ≥ 1. However, RM(K) cannot reach 2: every definable subset of K is either finite or cofinite, any two cofinite subsets of an infinite field intersect, and so there is no infinite family of pairwise-disjoint definable subsets of rank ≥ 1. Hence RM(K) = 1. (Similar arguments hold for Aᵏ.)
  • . . .
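The A¹ picture can be made concrete in a toy encoding (the encoding is mine, purely for illustration): represent a definable subset of K by its finite "exceptional" part plus a cofiniteness flag, and read off the rank.

```python
# Toy illustration of the A^1 example: in ACF every definable subset of the
# affine line K is finite or cofinite. Encode a definable set as
# (exceptions, cofinite): cofinite=False means the set IS `exceptions`;
# cofinite=True means it is the complement of `exceptions`.
# Rank: -1 for the empty set, 0 for nonempty finite sets, 1 for cofinite
# sets -- rank never reaches 2 because two cofinite subsets of an infinite
# field always intersect, so no infinite pairwise-disjoint family of
# rank->=1 definable sets exists.

def morley_rank(exceptions, cofinite):
    if cofinite:
        return 1
    return 0 if exceptions else -1

assert morley_rank(set(), cofinite=True) == 1     # K itself: rank 1
assert morley_rank({0, 1}, cofinite=False) == 0   # finite solution set
assert morley_rank(set(), cofinite=False) == -1   # inconsistent formula
```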

on ferrante

A few points I'm not sure Ferrante tried to make but I understood regardless:

[1] "Volitional strength" does not increase with age so much as "volitional complexity" does.

[2] Reminds me of my mom's stories about las colonias. From the basics (community units are families led by patriarchs) to particulars (the Lina / Melina friendship, fancy car as status symbol, loan-shark as resident, the names).

[3] The "respect for the interiority of self" that Lina possesses --- is this an authorial artifact? Faithful developmental description? Does it look like "respect" from the inside as well as the outside?

[4] Characters' relational generators flummox me.

[5] First fifty pages were the most intense reading experience of my life?