meta-learning redux
1ea1a35f687fd461da1181bad3965dfb82c82128362cb8d0ff077f9e94b88d04
A common mistake is to take one's formalism as metaphysics. This is especially true in domains tangentially related to the study of human behavior: "just because you can be described by a coherence theorem does not mean you are a coherence theorem."
I note that the difference between cardinal and ordinal utilities is not as deep as it may seem. Cardinalists use a utility function
Under natural conditions, orderings over
As an example: if
Give
There exists
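The fragments above break off, but the standard point they gesture at — that an ordinal utility is only determined up to strictly increasing transformation, so the ordering is the invariant content — can be checked in a few lines. A minimal sketch; the utility values are made up for illustration:

```python
import numpy as np

# four outcomes with made-up utilities; any strictly increasing
# reparameterization of u is "the same" ordinal utility
u = np.array([0.3, 2.5, 1.1, 4.0])

for transform in (np.exp, np.log1p, lambda x: 3 * x - 7):
    v = transform(u)  # a different cardinal scale
    # the induced preference ordering over outcomes is unchanged
    assert (np.argsort(v) == np.argsort(u)).all()
```

Cardinal structure (unique up to positive affine transformation) is strictly stronger: it additionally pins down ratios of utility differences, which the `np.exp` and `np.log1p` transforms above do not preserve.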
[1] hypernetwork architecture. there's a tensor-level LSTM that takes in bulk loss statistics and outputs a weighting vector
[2] meta-training distribution is surprisingly diverse. image classification / image generation / language modeling, across MLP / CNN / vision transformer / autoencoder / VAE / transformer architectures, with different initializations. + task augmentation! (weight reparameterization / gradient renormalization / etc.). 4k TPU-months of compute. these tasks all dealt with relatively small models: VeLO generalization to 100M param LLM was not good (compared to a tuned Adafactor baseline), but "within distribution" VeLO seems to outperform tuned Adam / Shampoo / other classical optimizers.
[3] ES as a gradient update method. meta-objective was to minimize end-of-training loss (vs. average loss per-step), and the classical method was used to estimate gradients: perturb the parameters with sampled Gaussian noise eps ~ N(0, I) and use the estimator g ≈ (1/(n·sigma)) Σ_i f(theta + sigma·eps_i)·eps_i.
[4] Adam is really quite good? VeLO outperforms by (eyeballing) ~2-5x with intense resource input, and even then isn't Pareto-dominant compared to tuned Adam OOD. (VeLO is also much more expensive, & cannot generalize to RL or GNNs, as far as they tested.)
[5] convinced me that truly there's a lot of performance one could eke out of meta-training a learned optimizer, provided one doesn't care about resource constraints. many cool directions to extend this to
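The ES gradient estimation in [3] can be sketched in a few lines. `es_gradient` and the quadratic sanity check are my own naming/illustration; the antithetic (mirrored) sampling is a common variance-reduction choice, not necessarily what the paper used:

```python
import numpy as np

def es_gradient(f, theta, sigma=0.05, n_samples=4000, rng=None):
    """Classical evolution-strategies gradient estimator:
        g ~= (1/(n*sigma)) * sum_i f(theta + sigma*eps_i) * eps_i,
    eps_i ~ N(0, I), here with antithetic pairs (+eps, -eps) for variance
    reduction. Needs only black-box evaluations of f, no backprop."""
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.standard_normal((n_samples, theta.size))
    f_pos = np.array([f(theta + sigma * e) for e in eps])
    f_neg = np.array([f(theta - sigma * e) for e in eps])
    return ((f_pos - f_neg)[:, None] * eps).sum(axis=0) / (2 * n_samples * sigma)

# sanity check on f(x) = x.x, whose true gradient is 2*theta
theta = np.array([1.0, -2.0, 3.0])
g = es_gradient(lambda x: float(x @ x), theta)
```

For meta-training an optimizer, `theta` would be the learned optimizer's parameters and `f` a full inner training run returning end-of-training loss — which is exactly why ES is attractive there: it sidesteps backpropagating through thousands of unrolled inner steps.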
[notes]
Let
we define the Morley rank
We define
When taking
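The definitions in these notes are truncated. For reference, one standard inductive formulation of Morley rank (this is my reconstruction of the textbook definition, not necessarily the notes' original presentation; formulas are taken with parameters in a suitably saturated model):

```latex
\[
\begin{aligned}
&\mathrm{RM}(\varphi) \ge 0 &&\iff \varphi \text{ is consistent},\\
&\mathrm{RM}(\varphi) \ge \alpha + 1 &&\iff \text{there exist pairwise inconsistent } (\psi_i)_{i<\omega}
  \text{ with } \psi_i \vdash \varphi \text{ and } \mathrm{RM}(\psi_i) \ge \alpha,\\
&\mathrm{RM}(\varphi) \ge \lambda &&\iff \mathrm{RM}(\varphi) \ge \alpha \text{ for all } \alpha < \lambda
  \qquad (\lambda \text{ a limit ordinal}).
\end{aligned}
\]
```

$\mathrm{RM}(\varphi)$ is then the greatest $\alpha$ with $\mathrm{RM}(\varphi) \ge \alpha$, with $\mathrm{RM}(\varphi) = \infty$ if no such greatest ordinal exists and $-\infty$ if $\varphi$ is inconsistent.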
A few points I'm not sure Ferrante tried to make but I understood regardless:
[1] "Volitional strength" does not increase with age so much as "volitional complexity" does.
[2] Reminds me of my mom's stories about las colonias. From the basics (community units are families led by patriarchs) to particulars (the Lina / Melina friendship, fancy car as status symbol, loan-shark as resident, the names).
[3] The "respect for the interiority of self" that Lina possesses --- is this an authorial artifact? Faithful developmental description? Does it look like "respect" from the inside as well as the outside?
[4] Characters' relational generators flummox me.
[5] First fifty pages were the most intense reading experience of my life?