MDL meets SLT
[paper highlight: Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory]
two major contributions of the paper:
- theoretically links minimum description length to singular learning theory: they prove that for any realizable ground-truth distribution $q$ and i.i.d. samples $X^n \sim q$, there exists a two-part code with asymptotic redundancy $\lambda \log n + O(\log \log n)$, where $\lambda$ is the LLC (local learning coefficient)
- experimental results showing how the LLC varies with model quantization (where quantization is roughly a stand-in for compression and the LLC measures complexity, so one can study empirical correlations)
what is a two-part code? admittedly I'm still slightly bamboozled by the MDL formalism they choose, so this will be a mix of hand-wavy intuition and opaque jargon.
Let $q$ be the data-generating distribution and $\mathcal{M} = \{\, p(\cdot \mid w) : w \in W \,\}$ a parametric model class.
a two-part code is an encoding with two parts: one specifying the encoding distribution ("model") with respect to a model class, and the other specifying the message ("sample") given the model. intuition for this setup: imagine a sender and receiver who share the knowledge that they will communicate in some language with a given grammatical structure, but the structure doesn't fully pin down the language and vocabulary; they do, however, both hold a dictionary mapping bitstrings to complete languages, so they can first coordinate on a language and then send the message. (there are much better ways of explaining this.)
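to make the two parts concrete, here's a toy sketch of my own (not the paper's construction): a Bernoulli model class where part one names a quantized parameter from a finite grid and part two encodes the sample with an ideal code under that parameter.

```python
import math

def two_part_codelength(xs, grid_size=32):
    """Toy two-part code for a binary sequence under a Bernoulli model class.

    Part 1: the index of a quantized parameter theta in a grid of
            `grid_size` values, costing log2(grid_size) bits.
    Part 2: the sequence itself under an ideal code for Bernoulli(theta),
            costing -log2 p(xs | theta) bits.
    Returns the best (minimum) total over the grid.
    """
    n, k = len(xs), sum(xs)
    part1 = math.log2(grid_size)
    best = float("inf")
    for i in range(grid_size):
        theta = (i + 0.5) / grid_size  # offset grid, avoids theta in {0, 1}
        part2 = -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))
        best = min(best, part1 + part2)
    return best

print(two_part_codelength([1, 0, 1, 1, 0, 1, 1, 1]))  # total bits for the sample
```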
anyway, you want some way of measuring the performance of your encoding in the two-part setting. there's a quantity called redundancy that measures performance with respect to the underlying data distribution $q$, roughly given by $R_n(C) = \mathbb{E}_{X^n \sim q}[\ell_C(X^n)] - nH(q)$: the expected codelength minus the entropy lower bound, i.e. the bits you waste by not knowing $q$ exactly.
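for the toy Bernoulli code above, this expectation is exactly computable (the codelength depends on the sample only through the number of ones), e.g. for a Bernoulli(0.7) source:

```python
from math import comb, log2

def expected_redundancy(n, q=0.7, grid_size=32):
    """E[codelength] - n * H(q), in bits, for the toy two-part code above."""
    H = -(q * log2(q) + (1 - q) * log2(1 - q))  # source entropy per symbol
    exp_len = sum(
        comb(n, k) * q**k * (1 - q) ** (n - k)
        * two_part_codelength([1] * k + [0] * (n - k), grid_size)
        for k in range(n + 1)
    )
    return exp_len - n * H

for n in (8, 32, 128):
    print(n, round(expected_redundancy(n), 3))
```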
restating the central theoretical result: for any realizable[^1] data-generating distribution $q$, there exists a two-part code $C$ with redundancy $R_n(C) = \lambda \log n + O(\log \log n)$.
it is late and my brain is not quite working, but i don't see optimality guarantees for this result? the construction is of the flavor "choose a particular code, and then compute that this has redundancy $\lambda \log n + O(\log \log n)$". that's an achievability result; nothing rules out a code that does better.
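for intuition about where the $\log n$ comes from, here's the classic regular-model version of that flavor of argument (again my toy example, not the paper's proof): refine the grid at resolution $\sim 1/\sqrt{n}$, so naming the parameter costs $\sim \tfrac{1}{2}\log_2 n$ bits, and the measured redundancy tracks $\tfrac{1}{2}\log_2 n$, matching $\lambda = d/2 = 1/2$ for a regular 1-parameter model. the paper's point is that for singular models the exponent is the LLC rather than $d/2$.

```python
# Regular-model intuition: refine the grid with n at spacing ~ 1/sqrt(n).
# Part 1 then costs ~ (1/2) log2 n bits, and the measured redundancy grows
# like (1/2) log2 n, matching lambda = d/2 = 1/2 for this 1-parameter model.
for n in (16, 64, 256):
    grid = max(2, round(n ** 0.5))
    print(n, round(expected_redundancy(n, grid_size=grid), 3), 0.5 * log2(n))
```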
a maybe helpful image: [figure from the paper, not reproduced here]
I suppose this is why the empirical results are needed. But the empirical results are like "linear relationships between LLC estimates and critical compression thresholds for models up to 7B parameters", where the critical compression threshold is (as far as i can tell) roughly the coarsest quantization a model tolerates before its loss degrades past some tolerance.
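a hypothetical sketch of that kind of measurement (the model, tolerance, and names here are all my inventions, not the paper's protocol): quantize the weights at decreasing bit-widths and record where the loss blows up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-in for a trained model: a linear map W on data (X, y).
X = rng.normal(size=(512, 32))
W = rng.normal(size=(32, 1))
y = X @ W + 0.01 * rng.normal(size=(512, 1))

def quantize(w, bits):
    """Uniform quantization of weights to 2**bits levels over their range."""
    lo, hi = w.min(), w.max()
    levels = 2**bits - 1
    return lo + np.round((w - lo) / (hi - lo) * levels) / levels * (hi - lo)

def loss(w):
    return float(np.mean((X @ w - y) ** 2))

base = loss(W)
# One loose operationalization of the critical threshold: the fewest bits at
# which the quantized loss stays within an (arbitrary) tolerance of baseline.
for bits in range(12, 0, -1):
    l = loss(quantize(W, bits))
    print(bits, l)
    if l > 2 * base:
        print("critical compression threshold ~", bits + 1, "bits")
        break
```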
still don't quite understand the theory behind LLC estimation. the MDL and SLT connections are cool though; it would be nice to get some naturality results, because the experimental results are not that convincing by themselves (many measures would replicate these patterns, LLC estimation is an art not a science, and arbitrarily quantizing models and running inference on them seems like it naturally leads to buggy implementations).
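for reference, the estimator i believe is standard here (from Lau, Murfet, and Wei's work on estimating the learning coefficient) is $\hat\lambda = n\beta\,(\mathbb{E}[L_n(w)] - L_n(w^*))$ with $\beta = 1/\log n$, where the expectation is over an SGLD-sampled posterior localized at $w^*$. a toy sketch on a quadratic loss, where the model is regular and the true $\lambda$ is $d/2$:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 4, 10_000           # parameter dimension, nominal sample size
beta = 1.0 / np.log(n)     # inverse temperature prescribed by the estimator
gamma = 1.0                # localization strength around w_star
eps = 1e-4                 # SGLD step size
w_star = np.zeros(d)       # the (known, toy) loss minimizer

def L(w):
    # Quadratic population loss: a regular model, so the true LLC is d/2 = 2.
    return 0.5 * float(np.sum(w**2))

def grad_L(w):
    return w

# SGLD chain targeting exp(-n*beta*L(w) - (gamma/2)*||w - w_star||^2).
w = w_star.copy()
samples = []
for t in range(20_000):
    drift = -n * beta * grad_L(w) - gamma * (w - w_star)
    w = w + 0.5 * eps * drift + np.sqrt(eps) * rng.normal(size=d)
    if t >= 2_000:  # discard burn-in
        samples.append(L(w))

llc_hat = n * beta * (np.mean(samples) - L(w_star))
print(llc_hat)  # lands near d/2 = 2, up to localization/discretization bias
```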
[^1]: see the technical conditions in the paper