Notes

Miscellaneous Poetry Drafts

I.

a blade of grass hides
minuscule migratory men
Lilliputian fiends

II.

mighty merry rascals
fickle, high off foglefreude
die English skippers

III.

I once beheld Seneca's estate,
Credulously inspecting his wicker tomb—
Yet Margate Mennons and liced defendants
Both swore by its awkward loom.

My heart gasped and lips shuddered
When, to my utmost surprise
The elderly Roman statesman lay
As mummified nitride.

Tick-tock, goes the clock
Garrulous gyrations too
Gizzardly Gentry, nice surprise
Confiding in a martyr's womb.

Foiled! the cuckoo's dead—
Not I, not I, not I!
Lambastation! Aberration!
To defy Nero's evil eye.

IV.

weed stands stout
thistles, burr
gadolinum kraut
hummus and herb

olden spires bristle, copper-waxed
kelpish tides awash blades of glass
Betty stout, orange'd brass
vermillion mounts, dugong grass

betwixed, witched, yonder
your shivers roll down my spine
I gave you a bouquet of thorned tulips
at sunrise, on Cocoa Beach

the rocket's red glare, the bombs
bursting over a lackadaisical mare
I wish Martians dreamt of the stars
alas


Scaling Laws for Transfer Learning

Chinchilla scaling posits that

$$ L(N,D) = L_{\infty} + AN^{-\alpha} + BD^{-\beta}, $$

where $N$ is the number of parameters in your language model, $D$ is the number of training tokens seen, $A,B,\alpha,\beta, L_{\infty}$ are constants, and $L(N,D)$ is the test loss.
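As a toy illustration, here is a sketch (all constants illustrative, roughly in the ballpark of published fits, not authoritative) of plugging this form into a grid search for the compute-optimal parameter/token split under the standard $C \approx 6ND$ approximation:

```python
import numpy as np

# Chinchilla-form loss with illustrative constants (not the fitted values).
L_inf, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    """Test loss as a function of parameter count N and training tokens D."""
    return L_inf + A * N**-alpha + B * D**-beta

# For a fixed compute budget C ~ 6*N*D, grid-search the loss-minimizing split.
C = 1e21
Ns = np.logspace(7, 11, 400)   # candidate parameter counts
Ds = C / (6 * Ns)              # tokens implied by the budget
i = int(np.argmin(loss(Ns, Ds)))
N_opt, D_opt = Ns[i], Ds[i]
```

Sweeping the budget $C$ and re-running the search traces out the familiar compute-optimal frontier.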

Of course, this analysis is limited.

Still, it is remarkable that test loss performance is so clearly a function of parameter count and dataset size, with few assumptions made about the distributional structure or architectural specifics (this holds across varying depth/width ratios, for instance). I find two questions natural:

  • does scaling hold outside of the cross-entropy pretraining regime?
  • can we derive scaling relationships for downstream task performance? in particular, how predictable is transfer learning?

In 2021, OpenAI studied the question of "how much more quickly does a pretrained LM achieve low loss on a finetuning dataset than an LM initialized from scratch on the dataset?" (Note that pretraining and finetuning in this case are "the same operation"; the objective is still cross-entropy.) They find that the "effective data transferred"1 $D_T$ is described by $$ D_T = k(D_F)^\alpha (N)^\beta, $$ where $D_F$ is the size of the finetuning dataset (in tokens) and $N$ is the number of non-embedding parameters of the model.2 This is great! Strong evidence of the generality of the abstractions the model learns in pretraining (especially given the independence of $\beta$ from the source distribution). However, it doesn't explicitly tell us about downstream task performance given an external metric.
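To make the units concrete, here's a minimal sketch of evaluating this law; the constants $k, \alpha, \beta$ below are placeholders I made up, not the paper's fitted values:

```python
# Effective data transferred: D_T = k * D_F**alpha * N**beta.
# k, alpha, beta are placeholders, NOT the fitted values from the paper.
k, alpha, beta = 1.9e3, 0.18, 0.38

def effective_data_transferred(D_F, N):
    """Tokens of from-scratch training 'saved' by pretraining first."""
    return k * D_F**alpha * N**beta

# A 1B-parameter model finetuned on 10M tokens:
D_T = effective_data_transferred(1e7, 1e9)
```

Both exponents being positive means larger models and larger finetuning sets both transfer more.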

Brandfonbrener et al. do somewhat better. They derive scaling relationships for train-train, train-test, and test-test loss transfer for models trained on different datasets, which can be expressed as

$$L_i(\hat{f}_j^{N,D}) \approx K \cdot \left( L_k(f_l^{N,D}) - E_{k | l}\right)^\kappa + E_{i|j},$$

where you have models $f_j, f_l$ trained on distributions $j,l$ evaluated on distributions $i, k$ and you're fitting the constants $K, \kappa.$3 As an example, the case of train-train would be where $(i,j) = (0,0)$ and $(k, l) = (1,1).$ We pair models by $(N,D)$ for coherence. Notably, these laws hold for diverse datasets, but only in low-loss regimes and when the $E_{m|n}$ terms can be estimated well. Still no breaking of the pretraining regime, and no explicit predictions for downstream metric performance!
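Since the $E_{m|n}$ terms must be estimable anyway, fitting $K, \kappa$ reduces to linear regression in log-log space. A sketch on synthetic data (all constants made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth constants for a synthetic transfer law (all made up).
K_true, kappa_true = 1.3, 0.9
E_kl, E_ij = 1.5, 1.8   # irreducible losses E_{k|l} and E_{i|j}

# Paired losses for models sharing (N, D): source loss L_k, target loss L_i.
L_k = E_kl + np.logspace(-2, 0.5, 30)
L_i = K_true * (L_k - E_kl) ** kappa_true + E_ij + rng.normal(0, 1e-3, 30)

# With the E terms known, the law is linear in log-log space:
#   log(L_i - E_ij) = log K + kappa * log(L_k - E_kl).
kappa_hat, logK_hat = np.polyfit(np.log(L_k - E_kl), np.log(L_i - E_ij), 1)
```

The fit degrades quickly if the irreducible-loss estimates are off, which is exactly the caveat above.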

There's a meta-analysis out this year that claims that scaling laws are unreliable for downstream task performance prediction. Seems correct. Metrics are noisy and don't have nice algorithmic properties like cross-entropy loss might. Perhaps intriguing is their observation that irregular scaling is (1) common and (2) can occur for cross-entropy on normal tasks and normal LM datasets. This paper claims that larger models & models trained for longer have better downstream task performance even when holding loss constant. Is this an argument for certain training setups & architectures having better inductive biases?

Honestly, I am kind of sad that the extant literature here seems to be tainted by publication bias? I wouldn't really trust these papers (or the ten others I read writing this), and I want to run the experiments myself. The Tinker API seems good for doing this quickly. I'll go do that.

(Titular question pending.)

1

In essence, the number of tokens that you "save" seeing in finetuning by pretraining.

2

Notably $\beta$ only depends on architecture and TARGET distribution (not SOURCE), while $\alpha$ is a rough "distributional proximity" proxy that can be easily estimated.

3

$E_{m|n}$ is the irreducible loss of a model trained with infinite compute on distribution $n$ and evaluated on distribution $m.$


On Non-Isolated Calls for Structure

Safety cases are arguments that AI deployments are safe in some specified context. The context can include restrictions on deployment environments as well as training or deployment protocols. For instance, the debate safety case only applies to low-stakes deployment environments, requires exploration guarantees on the model, and relies on a debate protocol which avoids obfuscated arguments. Given these assumptions, Buhl et al. argue for “asymptotic guarantees”—that high performance on alignment objectives during training translates to approximate alignment during deployment. The control safety case is structurally similar, instead focusing directly on an explicit threat model and concretizing assumptions accordingly.

A naive way of constructing an “alignment portfolio” is simply to make safety cases which adequately cover all deployment environments with the appropriate degree of risk-tolerance. Formal verification for high-stakes SWE deployment, white-box interpretability for monitoring automated alignment researchers, some adapted debate protocol for use in executive decision-making. If the individual arguments are all sound, this works!

What if we introduce some error into the soundness judgements? If every safety case has some epsilon probability of failure, then straightforwardly you should make more safety cases for the scenarios in which alignment properties matter more. But if all your safety cases for non-deceptive automated alignment researchers rely on “white-box interpretability mostly working,” then if this isn’t true you’re still doomed no matter how many safety cases you write!
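A toy probability calculation makes the fragility concrete (all numbers made up for illustration):

```python
# Toy portfolio: three safety cases, each failing with probability eps.
# All numbers are made up for illustration.
eps = 0.2   # per-case failure probability
p = 0.1     # probability that a shared load-bearing assumption is false

# Independent failures: the portfolio fails only if every case fails.
independent = eps**3

# Shared assumption: if it's false, all cases fail together; otherwise
# the residual failure modes are independent.
correlated = p + (1 - p) * eps**3
```

Adding more independent safety cases shrinks the joint failure probability geometrically, but no number of cases sharing the load-bearing assumption pushes the portfolio below $p$.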

Anthropic’s ASL-4 safety case sketches are not quite this correlated, but only just. [1] relies on white-box methods successfully monitoring deception, [3] relies on guarantees that the pretrained model is not coherently deceptive (likely requiring successful white-box or black-box methods), and [2] still depends on linear activation probes adequately showing that the model cannot distinguish between certain classes of train and test deployments, as well as black-box evaluations providing sufficiently robust guarantees on behavior. These are similar assumptions! These assumptions are all only true in worlds where “models are sufficiently parsimonious such that present-day interpretability techniques and evals can provide rigorous guarantees on good behavior.”

In general, insufficient diversity over the world structure assumed in an alignment portfolio makes the portfolio fragile and non-robust.1

It is always necessary to make assumptions about world structure when predicting world behavior. A bounded reasoner simulates the world with a local, low-fidelity model based on the reasoner’s accumulated evidence about the world. Some assumptions on world structure are better than others—gravity following an inverse-square law vs. homeopathic remedies curing cancer, for instance.

Considering the structure of one’s structural assumptions is critically important in domains where the relevant world behavior has not yet been exhibited and the stakes are high. Note:

  • The largest scientific breakthroughs are accompanied by structural assumptions about the world breaking. See the atomic bomb, CRISPR, heavier-than-air flight. Fundamentally, these “expand the domain of the possible.” Sometimes, the world structure is discovered first (as in nuclear theory leading to the first controlled chain reaction). Other times, a prototype uncovers the structure (see: penicillin). In both cases, the non-specialist intelligent reasoner understands a different possibility domain before and after.
  • Top-down searches for structural guarantees must be incredibly judicious in their assumptions, because the vast majority of hypotheses are incorrect. Ex post, the structure is obvious, but ex ante it is not. Consider Newton devoting as much energy to alchemy as the study of gravitation.
  • If we take the perspective that alignment is an infinite problem, there is no good reason to expect that the world structure we can reasonably assume is simple. It might be that it is infinitely complex and is only limited by our current understanding, and that we will recover finer and finer approximations of it as our understanding improves. At each stage of this process we will have to repeat our assumption examination from a roughly equivalent epistemic vantage point of staring into the abyss.
  • Much of the existential risk from AI development comes from tail risks and black swan events. Mitigating these requires a portfolio of solutions which each rely on decorrelated or independent world models (note this is not a guarantee).

Natural corollaries of this observation:

  • we should be explicit about which world models are going into constructing safety cases,
  • we should be developing independent safety cases for high-stakes deployment situations,
  • we should emphasize diversity in theoretical agendas to buttress our ability to make such safety cases reliant on disjoint sets of assumptions.
1

This is a specific instance of the general case of “Swiss cheese models only work when the holes don’t line up in the same worlds,” which is probably not sufficiently justified in this post but is something I believe to be true.


Diffusion Roundup

[1] Diffusion models seem to outperform traditional autoregressive models in the large data limit on token-prediction tasks.1 Autoregressive models are still superior in the low-data/compute-limited regime, and the threshold at which diffusion models become optimal follows a power-law in the dataset size (typically exceeding the Chinchilla threshold by a large margin).2 Diffusion models also see performance gains under “trivial” data augmentation methods for far longer than autoregressive models (e.g. reordering tokens), and this is plausibly because the generation method is fundamentally non-causal? (Much of the performance gap can be recovered by implementing similar data augmentation methods in the AR case, but it’s unclear if this scales to tasks that require “cognition” in the human sense of the word). Not entirely clear how this translates to better performance on real-world tasks in the data-limited regime; it could be that the compute scaling necessary is simply prohibitive, and it could also be that the implicit curriculum afforded by the de-noising process is simply insufficient at providing reasonable enough signal on difficult tasks.

[2] Diffusion in practice probably has the circuit complexity depth constraints of an attention-based transformer. In the last few years, we’ve seen literature essentially claiming that attention in practice is limited to modeling circuits in the class $\mathsf{TC}^0$ (polynomial-width, constant-depth Boolean circuit families).3 Adding chain-of-thought roughly increases this to $\mathsf{NC}^1$ (although there are some subtleties involving the lack of robustness to input-ordering).4 There are reasons to expect difficult problems, especially the sorts encountered in long-horizon RL, to require architectures that can internally simulate deep computation. These architectures have been recurrent thus far. However, recurrent architectures fail to adequately leverage the compute parallelism offered by GPUs and have many, many issues with unstable training dynamics, so scaling transformers is a better option. It’s probably not the case that diffusion models can prove an adequate replacement here, but it’s interesting that a diffusion process with no constraints imposed by a score function can theoretically simulate any Turing-complete process, yet when perfectly matching a score function it still has the limitations of a $\mathsf{TC}^0$ representation. Results in the approximate regime pending.5

[3] Diffusion is (kind of) spectral autoregression.6 There are two brilliant blog posts on the subject, cumulatively arguing that DDPM has an inductive bias toward generating low-frequency features before high-frequency features (in Fourier space; hence the name), but this is not necessarily true of all possible diffusion models (changing the model’s noising schedule to be frequency-agnostic doesn’t degrade performance on CIFAR10 and similar datasets, but not all noising schedules achieve the same performance!). How much does this matter for text-data domains? Audio? Video? Are there correspondences we can make between distributional structure and optimal noising schedules? In algorithmic cases, what does this mean?
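The low-frequency-first intuition is easy to check numerically: for a signal with a $1/f$ amplitude spectrum (a common stand-in for natural images), white Gaussian noise drowns the high-frequency bands first, so a reverse process recovers coarse structure before fine detail. A sketch of the SNR bookkeeping (not a diffusion model; parameters arbitrary):

```python
import numpy as np

n = 1024
freqs = np.fft.rfftfreq(n)[1:]    # positive-frequency bins, DC dropped
signal_power = 1.0 / freqs**2     # 1/f amplitude spectrum -> 1/f^2 power

def band_snrs(noise_std, k=4):
    """Mean signal-to-noise ratio in k frequency bands (low to high)
    after adding white Gaussian noise with standard deviation noise_std."""
    noise_power = noise_std**2 * n / 2   # flat expected power per rfft bin
    return [band.mean() / noise_power
            for band in np.array_split(signal_power, k)]

clean = band_snrs(0.001)   # light noise: every band above the noise floor
noisy = band_snrs(0.3)     # heavy noise: only low-frequency bands survive
```

At the heavy noise level only the lowest band stays above SNR 1, which is the "spectral autoregression" ordering in miniature.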

I primarily find diffusion models interesting from a theoretical perspective, given that the corresponding SDE literature is rich and there are (potentially) deep connections to be made to modern ML. In particular, I expect we can better understand what feature orderings are optimal, which properties of distributions make them learnable, how much an inductive bias is a property of the model architecture vs. optimization algorithm or other factors, and to what extent recurrence can be represented with parallel architectures. This post should not be taken as definitive; it has not been edited and I welcome feedback.

1

This section is summarizing the paper Diffusion Beats Autoregression in Data-Constrained Settings.

2

The metric the authors use is “number of unique tokens”—which is quite strange, given that the vocab size of a model is typically quite limited, and they mention training a 2.3B parameter diffusion model on a 500M unique token dataset. Perhaps they mean just the token size of a dataset with no repeated entries?

4

See Chain of Thought Empowers Transformers to Solve Inherently Serial Problems. CoT/neuralese introducing “effective recurrence” into modern models seems to be important for timeline modeling.

5

Reach out if you have thoughts!

6

See A Fourier Space Perspective on Diffusion Models.


Linear Contracts Are Optimally Robust

nb: attempting a daily posting cadence. adjust quality priors accordingly

Consider the following game:

  1. Alice offers a contract $w: \mathcal{Y} \to \mathbb{R}^+$ to Bob.
  2. Bob, knowing his compact action space over lotteries $\mathcal{A} \subseteq \Delta (\mathcal{Y}) \times \mathbb{R}^+,$ chooses action $(F,c) \in \mathcal{A}.$
  3. The output $y \sim F$ is realized (sampling from the lottery chosen).
  4. Alice receives $y - w(y)$ payoff; Bob receives $w(y) - c$ payoff.

Importantly, Alice receives limited information about Bob's action space (interchangeable with "technology"). What is the optimal contract Alice should give Bob, if she wants to maximize her worst-case outcome? (We assume Alice and Bob are rational actors; their dynamics will be given shortly.) This is equivalent to studying the structure of the optimal $w$ in this scenario.

[Car15]1 proves that the optimal $w$ is linear, subject to the following assumptions:

  • $\mathcal{Y} \subset \mathbb{R}^+$ and $\mathcal{A}$ is compact;
  • Alice (the "principal") knows a subset $\mathcal{A}_0 \subseteq \mathcal{A}$ of possible actions for Bob (the "agent") such that there exists $(F,c) \in \mathcal{A}_0$ with $\mathbb{E}_F[y]-c > 0$ (the principal should have some reason for hiring the agent);
  • $w$ is continuous.

Bob's behavior is quite simple, given that Bob has all information. The set of actions $(F,c) \in \mathcal{A}$ Bob will consider are those which maximize his expected payoff $\mathbb{E}_F[w(y)] - c.$2 We denote this set by $\mathcal{A}^* (w | \mathcal{A}),$ and we denote by $V_A(w|\mathcal{A})$ the expected payoff of Bob given rational behavior. Alice's expected payoff is then

$$ V_P(w|\mathcal{A}) = \max_{(F,c) \in \mathcal{A}^*(w|\mathcal{A})} \mathbb{E}_F[y-w(y)], $$

and Alice searches over expected payoffs as

$$ V_P(w) = \inf_{\mathcal{A} \supseteq \mathcal{A}_0} V_P(w|\mathcal{A}). $$ We are interested in $w$ such that $V_P(w)$ is maximized.

optimality in the zero-shot game

Motivating example: $w(y) = \alpha y$ always guarantees the principal Alice a positive worst-case payoff, for $\alpha \in (0,1)$. This analysis holds independently of the possible technology $\mathcal{A},$ due to the nontriviality assumption we impose on $\mathcal{A}_0.$

Proof: Rewrite $y - w(y)$ as $w(y)/\alpha - w(y) = \frac{1-\alpha}{\alpha}w(y).$ Lower bound $\mathbb{E}_F[w(y)]$ (the expected payment to Bob the agent) with $\mathbb{E}_F[w(y)] \geq \mathbb{E}_F[w(y)] - c = V_A(w|\mathcal{A}),$ which is $\geq V_A(w|\mathcal{A}_0)$ (because adding more actions to the agent can't decrease the agent's optimal payoff given $\mathcal{A}_0$). Combining the two gives

$$\mathbb{E}_F[y-w(y)] \geq \frac{1-\alpha}{\alpha}\mathbb{E}_F[w(y)] \geq \frac{1-\alpha}{\alpha}V_A(w|\mathcal{A}_0),$$

which gives $V_P(w) \geq \frac{1-\alpha}{\alpha}V_A(w|\mathcal{A}_0).$ Given the nontriviality assumption $V_A(w|\mathcal{A}_0) > 0,$ this gives Alice a positive lower bound on the worst-case outcome independent of the choice of technology $\mathcal{A}!$
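Here's a quick numerical check of this bound (my own sketch, not from the paper). Under a linear contract both parties' payoffs depend on a lottery only through its mean, so each action can be summarized as a pair $(\mathbb{E}_F[y], c)$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.5                      # linear contract w(y) = alpha * y

# Known technology A0, actions as (mean output, cost); note 0.5*10 - 3 > 0.
A0 = [(10.0, 3.0), (4.0, 1.0)]
V_A0 = max(alpha * m - c for m, c in A0)    # agent's payoff on A0 alone
bound = (1 - alpha) / alpha * V_A0          # guaranteed principal payoff

worst = float("inf")
for _ in range(1000):
    # Random supersets A containing A0, standing in for adversarial technology.
    extra = [(rng.uniform(0, 20), rng.uniform(0, 10)) for _ in range(5)]
    m, c = max(A0 + extra, key=lambda a: alpha * a[0] - a[1])  # Bob's choice
    worst = min(worst, (1 - alpha) * m)     # Alice's payoff E_F[y - w(y)]
```

The worst observed principal payoff never drops below the proof's bound, whatever extra actions get thrown in.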

Is there a sense in which linear contracts are "the best you can do"? Carroll shows that any contract $w(y)$ can be improved to a linear contract, even if $w(y)$ is pathological. The gist of the argument is as follows: consider the convex hull of the curve $w(y)$ that lies above the value $V_A(w|\mathcal{A}_0).$ Consider the point $Q = (y',w(y'))$ which minimizes $\mathbb{E}_F[y] - \mathbb{E}_F[w(y)].$ This is the worst case for Alice, the principal. $Q$ will typically be where $V_A(w|\mathcal{A}_0)$ intersects the left side of the convex hull. Note that the line parametrizing the intersecting boundary of this convex hull is itself a contract $w'(y)$ which dominates $w(y)!$ Repeating this process and considering some technical details gives you the optimality result.


The full technical details can be found in the paper, and I will not discuss them now. However, I would like to discuss the generalization of this lemma to include more observables for the principal.

Let $z = (z_1, \ldots, z_k)$ range over a compact set $\mathcal{Z} \subseteq \mathbb{R}^k.$ Define a cost function $b: \mathbb{R}^k \to \mathbb{R}^+$ such that actions depend on $b:$ an action is then $(F,c)$ such that $F \in \Delta(\mathcal{Z})$ and $c \geq b(\mathbb{E}_F(z)).$ A contract is now a function $w: \mathcal{Z} \to \mathbb{R}^+.$ After changing the definitions of $\mathcal{A}_0, V_A,$ etc. appropriately, it turns out the optimally robust contract is linear in the observables available to the principal. Precisely, the optimal contract is of the form

$$w(z) = \alpha_1z_1 + \cdots + \alpha_kz_k + \beta$$

for real numbers $\alpha_i, \beta.$

learnability

It seems like the linear optimality result for robust contracts is pretty general and not too sensitive to assumptions: load-bearing here is that $\mathcal{Y}$ has a minimum that we normalize to be zero, some relaxed version of the nontriviality assumption, and that the risk associated with any particular action is quantifiable in a shared manner by both the principal and the agent.3

One obvious consideration: consider the possibility of unbounded risk to the principal (as suggested in UK AISI's Economics and Game Theory research agenda). It is difficult to construct contracts that then robustly protect against these scenarios, even with partial information. What are the minimum viable assumptions necessary to get guarantees in this regime?

Another consideration that I am interested in: this feels very similar to classical bandit problems in RL. Heck, the agent is engaging in the optimal bandit policy given a set of lotteries! Unifying the two literatures (perhaps [KZ25] is of interest) might tell us something interesting about certain classes of decision problems.

Also, learnable agents would (I think) perform better than static ones on a kind of mixed-objective performance metric! Perhaps one which mixes expected average and expected worst-case reward, with some weighting. Generally, intuitive restrictions & extensions on this result are things I'm excited about.

1

The majority of this post is a distillation of this paper. All credit to Gabriel Carroll.

2

$\mathcal{A}^* (w | \mathcal{A})$ is guaranteed to be nonempty by continuity and compactness.

3

This one in particular has some philosophical implications.