Scaling Laws for Transfer Learning

October 11, 2025

Chinchilla scaling posits that

$$ L(N,D) = L_{\infty} + AN^{-\alpha} + BD^{-\beta}, $$

where $N$ is the number of parameters in your language model, $D$ is the number of training tokens seen, $A,B,\alpha,\beta, L_{\infty}$ are constants, and $L(N,D)$ is the test loss.
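As a quick sanity check of the functional form, here is a minimal sketch in Python. The default constants are the ones reported in the Chinchilla paper (Hoffmann et al., 2022); treat them as illustrative rather than exact.

```python
import numpy as np

def chinchilla_loss(N, D, L_inf=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric Chinchilla loss: L(N, D) = L_inf + A*N^-alpha + B*D^-beta.

    Defaults are the constants reported in Hoffmann et al. (2022);
    they are included here only for illustration.
    """
    return L_inf + A * N ** (-alpha) + B * D ** (-beta)

# Scaling up both parameters and data monotonically lowers predicted loss,
# and the loss is bounded below by the irreducible term L_inf.
small = chinchilla_loss(N=70e9, D=1.4e12)
large = chinchilla_loss(N=140e9, D=2.8e12)
```

Note that as $N, D \to \infty$ the two power-law terms vanish and the loss approaches $L_\infty$, the irreducible entropy of the data.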

Of course, this analysis is limited.

Still, it is remarkable that test loss is so clearly a function of parameter count and dataset size, with few assumptions made about the distributional structure or architectural specifics (this holds across varying depth/width ratios, for instance). I find two questions natural.

In 2021, OpenAI studied the question of "how much more quickly does a pretrained LM achieve low loss on a finetuning dataset than an LM initialized from scratch on that dataset?" (Note that pretraining and finetuning in this case are "the same operation": the objective is still cross-entropy.) They find that the "effective data transferred"1 $D_T$ is described by $$ D_T = k(D_F)^\alpha (N)^\beta, $$ where $D_F$ is the size of the finetuning dataset (in tokens) and $N$ is the number of non-embedding parameters of the model.2 This is great! Strong evidence of the generality of the abstractions the model learns in pretraining (especially given the independence of $\beta$ from the source distribution). However, it doesn't explicitly tell us about downstream task performance given an external metric.
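Because this law is a power law in both $D_F$ and $N$, it is linear in log-space, so fitting $k, \alpha, \beta$ reduces to ordinary least squares. A minimal sketch on synthetic data (the "true" constants below are fabricated for illustration, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated ground truth: D_T = k * D_F^alpha * N^beta, with lognormal noise.
k_true, alpha_true, beta_true = 1.9e4, 0.18, 0.38  # hypothetical values
D_F = 10 ** rng.uniform(6, 9, size=200)   # finetuning tokens (log-uniform)
N = 10 ** rng.uniform(7, 10, size=200)    # non-embedding parameters
D_T = k_true * D_F**alpha_true * N**beta_true * np.exp(rng.normal(0, 0.05, 200))

# log D_T = log k + alpha * log D_F + beta * log N  ->  least squares.
X = np.column_stack([np.ones_like(D_F), np.log(D_F), np.log(N)])
coef, *_ = np.linalg.lstsq(X, np.log(D_T), rcond=None)
log_k_hat, alpha_hat, beta_hat = coef
```

With clean synthetic data the recovered exponents land close to the true ones; with real measurements the interesting part is how stable $\beta$ is across source distributions.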

Brandfonbrener et al. do somewhat better. They derive scaling relationships for train-train, train-test, and test-test loss transfer for models trained on different datasets, which can be expressed as

$$L_i(\hat{f}_j^{N,D}) \approx K \cdot \left( L_k(f_l^{N,D}) - E_{k | l}\right)^\kappa + E_{i|j},$$

where you have models $f_j, f_l$ trained on distributions $j, l$ evaluated on distributions $i, k$, and you fit the constants $K, \kappa.$3 As an example, the train-train case is $(i,j) = (0,0)$ and $(k,l) = (1,1).$ We pair models by $(N,D)$ for coherence. Notably, these laws hold across diverse datasets, but only in low-loss regimes and only when the $E_{m|n}$ terms can be estimated well. Still no breaking of the pretraining regime, and no explicit predictions for downstream metric performance!
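The mapping itself is a one-liner once the constants are fitted. A sketch, with all constants hypothetical:

```python
def transfer_loss(L_k, K, kappa, E_k_given_l, E_i_given_j):
    """Predict L_i(f_j) from the observed L_k(f_l), following the
    Brandfonbrener et al. functional form; models are paired by (N, D).

    All constants here (K, kappa, and the irreducible-loss terms E) are
    hypothetical stand-ins, not fitted values from the paper.
    """
    return K * (L_k - E_k_given_l) ** kappa + E_i_given_j

# Example: with K = 1, kappa = 1, and zero irreducible losses,
# the map is the identity; nonzero E terms shift and rescale it.
identity_case = transfer_loss(2.0, K=1.0, kappa=1.0,
                              E_k_given_l=0.0, E_i_given_j=0.0)
shifted_case = transfer_loss(2.0, K=0.8, kappa=1.2,
                             E_k_given_l=1.5, E_i_given_j=1.7)
```

The structural point is that the excess loss over the irreducible floor, not the raw loss, is what transfers as a power law.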

There's a meta-analysis out this year claiming that scaling laws are unreliable for downstream task performance prediction. Seems correct. Metrics are noisy and don't have nice algorithmic properties the way cross-entropy loss does. Perhaps most intriguing is their observation that irregular scaling is (1) common and (2) can occur even for cross-entropy on normal tasks and normal LM datasets. The paper also claims that larger models, and models trained for longer, have better downstream task performance even when holding loss constant, which is an argument for certain training setups and architectures having better inductive biases?

Honestly, I am kind of sad that the extant literature here seems to be tainted by publication bias? I wouldn't really trust these papers (or the ten others I read writing this), and I want to run the experiments myself. The Tinker API seems good for doing this quickly. I'll go do that.

(Titular question pending.)

1. In essence, the number of tokens that you "save" seeing in finetuning by pretraining.

2. Notably, $\beta$ only depends on architecture and TARGET distribution (not SOURCE), while $\alpha$ is a rough "distributional proximity" proxy that can be easily estimated.

3. $E_{m|n}$ is the irreducible loss of a model trained with infinite compute on distribution $n$ and evaluated on distribution $m.$