Singular Learning Theory

Jay Gupta

Bayesian Setup

We are given a set of input and output pairs drawn i.i.d. from an unknown distribution \(q(x, y)\).

\[ D_n = \{ (X_i, Y_i) \}_{i=1}^{n} \quad \text{for} \quad X_i \in \R^{N} \quad \text{and} \quad Y_i \in \R^{M} \]

Our goal is to find parameters \(w\) s.t. a model \(p(y \mid x \, ; w)\) approximates the truth. There are two flavors of parameter estimation, MLE and MAP; these directly correspond to the Frequentist and Bayesian perspectives.

MLE chooses \(w\) s.t. the likelihood of observing the data is maximized. Often, it is easier to work with the (averaged) negative log-likelihood, whose minimizer coincides with the likelihood's maximizer by monotonicity of the logarithm.

\[ L_n(w) = \prod_{i = 1}^{n} p(y_i \mid x_i \, ; w) \quad \text{and} \quad \ell_{n}(w) = -\frac{1}{n} \sum_{i = 1}^{n} \log p(y_i \mid x_i ; w) \]
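
To make this concrete, here is a minimal numerical sketch (not from the text): it evaluates \(\ell_n(w)\) on a grid for an assumed toy one-parameter model \(p(y \mid x \, ; w) = \mathcal{N}(y \, ; wx, 1)\); the dataset and function names are my own choices.

```python
import numpy as np

def avg_nll(w, X, Y):
    """Averaged negative log-likelihood ell_n(w) for the toy model
    p(y | x; w) = Normal(y; w * x, 1)."""
    resid = Y - w * X
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * resid**2)

# Toy dataset D_n drawn from an assumed truth q with w_true = 0.7.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Y = 0.7 * X + rng.normal(size=n)

# The MLE minimizes ell_n; here we simply search a coarse grid of candidates.
grid = np.linspace(-2.0, 2.0, 401)
w_hat = grid[np.argmin([avg_nll(w, X, Y) for w in grid])]
print(w_hat)  # close to 0.7
```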

MAP chooses the most likely \(w\) given the data. We apply Bayes' Rule and rename terms to study the posterior.

\[ p(w \mid D_n) = \frac{p(D_n \mid w) p(w)}{p(D_n)} = \frac{e^{-n \ell_n(w)} \psi(w)}{Z_n} \]

We obtain the evidence \(Z_n\) by applying the Law of Total Probability, and we define the free energy \(F_n\) of the system alongside it.

\[ Z_n = \int_{W} e^{-n \ell_n(w)} \psi(w) \, \mathrm{d}w \quad \text{and} \quad F_n = - \log(Z_n) \]

Intuitively, \(Z_n\) measures the posterior mass in low-loss regions: the integral is dominated by minimizers of \(\ell_n(w)\) due to the \(\exp\) term. This makes it a tool for principled model selection, since it evaluates how well the model class explains the data. Performing asymptotic analysis on the free energy is a central goal of SLT.
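
For intuition, the evidence and free energy can be computed directly in low dimensions. The sketch below (the same assumed toy one-parameter model as above, a standard normal prior \(\psi\), and a plain Riemann sum; all of these choices are mine) evaluates \(Z_n\) and \(F_n\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Y = 0.7 * X + rng.normal(size=n)

def avg_nll(w):
    # ell_n(w) for p(y | x; w) = Normal(y; w * x, 1)
    resid = Y - w * X
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * resid**2)

# Assumed prior psi(w): standard normal, truncated to a grid wide enough
# that the tails contribute nothing numerically.
ws = np.linspace(-5.0, 5.0, 2001)
dw = ws[1] - ws[0]
psi = np.exp(-0.5 * ws**2) / np.sqrt(2 * np.pi)

ell = np.array([avg_nll(w) for w in ws])
Z_n = np.sum(np.exp(-n * ell) * psi) * dw   # Z_n = integral of exp(-n ell_n) psi dw
F_n = -np.log(Z_n)
print(F_n)
# For much larger n, exp(-n * ell) underflows; work in log space (log-sum-exp).
```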

Information Criterion

We apply Laplace's Method to approximate the evidence. Laplace's Method evaluates integrals of the form \[ \int e^{M f(x)} \, \mathrm{d}x \] for large \(M\) under mild constraints, e.g., \(f\) analytic with a non-degenerate maximizer. The basic strategy is to compute a 2nd order expansion and draw an analogy to the Gaussian Integral.

\[ \ell_n(w) \approx \ell_n(\hat{w}) + \frac{1}{2} (w - \hat{w})^{\top} \mathrm{H} (w - \hat{w}) \quad \text{for} \quad \mathrm{H} = \nabla^{2} \ell_n(\hat{w}) \]

Since \(\hat{w}\) minimizes \(\ell_n\), the linear term vanishes; substituting the expansion into \(Z_n\) reduces it to a Gaussian integral, whose closed form gives the approximation directly.

\[ Z_n \approx \psi(\hat{w}) e^{-n \ell_n(\hat{w})} \cdot \left ( \frac{(2 \pi)^d}{n^d \det \mathrm{H}} \right )^{1/2} \]

We arrive at a second-order expression for the free energy.

\begin{align} F_n &= -\log Z_n \\ &= \underbrace{n\ell_n(\hat{w}) + \frac{d}{2} \log n}_{\texttt{BIC}} \; \underbrace{- \log \psi(\hat{w}) - \frac{d}{2} \log 2\pi + \frac{1}{2} \log \det \mathrm{H} (\hat{w})}_{\texttt{O(1)}} \end{align}

Dropping the lower-order, i.e., \(O(1)\), terms, we obtain the Bayesian Information Criterion (BIC). We interpret it as an accuracy-complexity tradeoff: the first term rewards fit to the data while the second penalizes the parameter count \(d\). This observation will motivate the RLCT.
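
As a rough check (under the same assumed toy setup as the earlier sketches), one can compare the BIC and the full Laplace expression against a numerically integrated free energy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=n)
Y = 0.7 * X + rng.normal(size=n)

def avg_nll(w):
    resid = Y - w * X
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * resid**2)

# "Exact" free energy by Riemann sum (d = 1, standard normal prior psi).
ws = np.linspace(-5.0, 5.0, 4001)
dw = ws[1] - ws[0]
psi = np.exp(-0.5 * ws**2) / np.sqrt(2 * np.pi)
ell = np.array([avg_nll(w) for w in ws])
F_exact = -np.log(np.sum(np.exp(-n * ell) * psi) * dw)

# Laplace / BIC approximation around the minimizer of ell_n.
d = 1
w_hat = ws[np.argmin(ell)]
H = np.mean(X**2)                      # Hessian of ell_n for this model
psi_hat = np.exp(-0.5 * w_hat**2) / np.sqrt(2 * np.pi)
bic = n * avg_nll(w_hat) + 0.5 * d * np.log(n)
laplace = bic - np.log(psi_hat) - 0.5 * d * np.log(2 * np.pi) + 0.5 * np.log(H)

print(F_exact, laplace, bic)   # laplace tracks F_exact; bic differs by O(1)
```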

Singular Models

To understand the difference between Singular and Regular models, we draw our attention to the Kullback-Leibler Divergence (KL) between the truth and the model. Intuitively, this is the expectation of the log-likelihood ratio over all data, i.e., the average difficulty of the test \(H_0 : P\) vs. \(H_{A} : Q\) under the alternative. A low KL tells us the test is difficult, i.e., it is hard to discern \(P\) from \(Q\), and a large KL tells us it is easy.

\[ K_n(w) = \frac{1}{n} \sum_{i = 1}^{n} \log \left ( \frac{q(y_i \mid x_i)}{p(y_i \mid x_i \, ; w)} \right ) = \ell_n(w) - S_n \quad \text{for} \quad S_n = -\frac{1}{n} \sum_{i = 1}^{n} \log q(y_i \mid x_i) \]

Informally, this is the loss landscape shifted by the constant empirical entropy \(S_n\). We define the set of true parameters as the choice(s) of \(w\) where the population KL divergence \(K(w) = \E[K_n(w)]\) vanishes.

\[ W_{0} = \{ w \in W \mid K(w) = 0 \} \]

The Fisher Information Matrix encodes the local geometry of the model and has several useful properties. The remainder of this section proves the key results.

\[ \mathcal{I}_{j,k}(w) = \E \left [ \left( \frac{\partial}{\partial w_j} \log p(y \mid x, w) \right) \left( \frac{\partial}{\partial w_k} \log p(y \mid x, w) \right) \right ] \]

Proposition 1. The Fisher Information Matrix is positive semi-definite and symmetric.

Proof. Let \(s(w) = \nabla_w \log p(y \mid x; w)\) be the score function. Notice \(\mathcal{I}(w)\) is the expectation of the score's outer product with itself.

\[ \mathcal{I}(w) = \E \left ( s(w) s(w)^\top \right ) \]

Symmetry is immediate, since each outer product \(s(w) s(w)^\top\) is symmetric and expectation preserves symmetry. Moreover, for any \(v \in \mathbb{R}^d\) the quadratic form is non-negative, so \(\mathcal{I}(w)\) is positive semi-definite.

\[ v^\top \mathcal{I}(w) v = \mathbb{E} \left[ (v^\top s(w))^2 \right] \geq 0 \]
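
The outer-product form suggests a direct estimator: average \(s(w) s(w)^\top\) over samples. A minimal sketch for an assumed two-parameter regression model \(p(y \mid x \, ; w) = \mathcal{N}(y \, ; x^\top w, 1)\), whose score is \((y - x^\top w)\,x\):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
w_true = np.array([0.5, -1.0])
X = rng.normal(size=(n, 2))
Y = X @ w_true + rng.normal(size=n)

# Per-sample score s(w) = (y - x.w) * x, evaluated at the true parameter.
S = (Y - X @ w_true)[:, None] * X                 # shape (n, 2)

# Empirical Fisher: average of the outer products s s^T.
I_hat = (S[:, :, None] * S[:, None, :]).mean(axis=0)
print(I_hat)   # close to E[x x^T], the identity for standard normal inputs
```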

Proposition 2. The entries of the Fisher Information Matrix are the covariances between the partial derivatives of the log-likelihood, i.e., between the components of the score. This is consistent with the symmetry above: Clairaut's Theorem gives equality of mixed partials for nice \(f\), \[ \frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} \]

Proof. Under regularity conditions, for any fixed component \(w'\) of \(w\), the expectation of the corresponding component of the score vanishes.

\begin{align} \E \left [ \frac{\partial}{\partial w'} \log p(x \mid w) \right ] &= \int \frac{\partial \log p(x \mid w)}{\partial w'} \, p(x \mid w)\, \mathrm{d}x \\ &= \int \frac{1}{p(x \mid w)} \frac{\partial p(x \mid w)}{\partial w'} \, p(x \mid w)\, \mathrm{d}x \\ &= \frac{\partial}{\partial w'} \int p(x \mid w) \, \mathrm{d}x \\ &= 0 \end{align}

We expand the covariance of the \(i\)-th and \(j\)-th components of the score, use the fact that \(\E[S_i] = \E[S_j] = 0\), and recover the corresponding entry.

\[ \mathrm{Cov}(S_i, S_j) = \E \left [ S_i S_j \right ] - \E[S_i] \E[S_j] = \E \left [ S_i S_j \right ] = \mathcal{I}_{i,j} \]

In other words, the Fisher is the covariance matrix of the score, \(\mathrm{Cov}(s(w))\), and thus measures the sensitivity of the log-likelihood to the parameters.

Proposition 3. The Fisher Information is the expected Hessian of the negative log-likelihood.

Proof. Recall that for fixed \(w_i\), the expectation of the corresponding score component vanishes identically in \(w\). The strategy is to take the partial w.r.t. \(w_j\) on both sides and massage the result.

\begin{align} 0 &= \frac{\partial}{\partial w_j} \E \left [ \frac{\partial \log p(x \mid w)}{\partial w_i} \right ] \\ &= \int \frac{\partial}{\partial w_j} \left ( \frac{\partial \log p(x \mid w)}{\partial w_i} \, p(x \mid w) \right ) \mathrm{d}x \\ &= \int \frac{\partial^2 \log p(x \mid w)}{\partial w_j \, \partial w_i} \, p(x \mid w) \, \mathrm{d}x + \int \frac{\partial \log p(x \mid w)}{\partial w_i} \frac{\partial p(x \mid w)}{\partial w_j} \, \mathrm{d}x \\ &= \E \left [ \frac{\partial^2 \log p(x \mid w)}{\partial w_j \, \partial w_i} \right ] + \E \left [ \frac{\partial \log p(x \mid w)}{\partial w_i} \frac{\partial \log p(x \mid w)}{\partial w_j} \right ] \end{align}

Re-arranging, we obtain the desired result: \[ \mathcal{I}_{i,j}(w) = -\E \left [ \frac{\partial^2 \log p(x \mid w)}{\partial w_j \, \partial w_i} \right ] \]
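
Proposition 3 is easy to verify numerically: for the same assumed regression model as before, a finite-difference Hessian of \(\ell_n\) agrees with the outer-product estimate of the Fisher up to Monte Carlo error. This is only a sketch under my assumed model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
w0 = np.array([0.5, -1.0])
X = rng.normal(size=(n, 2))
Y = X @ w0 + rng.normal(size=n)

def ell(w):
    # averaged NLL for p(y | x; w) = Normal(y; x.w, 1), additive constant dropped
    resid = Y - X @ w
    return 0.5 * np.mean(resid**2)

# Finite-difference Hessian of ell_n at w0.
d, h = 2, 1e-4
H = np.zeros((d, d))
eye = np.eye(d)
for i in range(d):
    for j in range(d):
        H[i, j] = (ell(w0 + h * eye[i] + h * eye[j]) - ell(w0 + h * eye[i])
                   - ell(w0 + h * eye[j]) + ell(w0)) / h**2

# Fisher via score outer products at the same point.
S = (Y - X @ w0)[:, None] * X
I_hat = (S[:, :, None] * S[:, None, :]).mean(axis=0)

print(np.abs(H - I_hat).max())   # small; both estimate E[x x^T]
```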

This lends itself to a natural interpretation: the Fisher captures the curvature of the NLL around its minimum and thereby the sensitivity of the estimate. Intuitively, if the likelihood is sharply peaked around its maximizer, you are fairly certain about the estimate; otherwise, the estimate may only be in the vicinity of the true optimal parameter. If the Fisher's determinant vanishes at a true parameter, the model is singular. This problematizes the BIC, since the Hessian in the Laplace expansion is asymptotically the Fisher.

\[ \mathrm{H}(\hat{w}) = \left. \frac{\partial^2 \ell_n(w)}{\partial w \, \partial w^{\top}} \right|_{w = \hat{w}} = \left. \frac{\partial^2 \left( K_n(w) + S_n \right)}{\partial w \, \partial w^{\top}} \right|_{w = \hat{w}} \approx \mathcal{I}(\hat{w}) \]
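
As an illustration of degeneracy (a toy construction of my own, not from the text), consider a one-hidden-unit model \(p(y \mid x \, ; a, b) = \mathcal{N}(y \, ; a \tanh(bx), 1)\) with truth \(y = 0 \cdot x + \text{noise}\). Every parameter with \(a = 0\) or \(b = 0\) is a true parameter, and the empirical Fisher determinant vanishes there.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X = rng.normal(size=n)
Y = rng.normal(size=n)   # truth: y is pure noise, realized whenever a*tanh(b*x) is identically 0

def empirical_fisher(a, b):
    """Empirical Fisher for p(y | x; a, b) = Normal(y; a * tanh(b * x), 1)."""
    resid = Y - a * np.tanh(b * X)
    s = np.stack([resid * np.tanh(b * X),                # d/da log p
                  resid * a * X / np.cosh(b * X) ** 2],  # d/db log p
                 axis=1)
    return (s[:, :, None] * s[:, None, :]).mean(axis=0)

for a, b in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.8, 0.5)]:
    print((a, b), np.linalg.det(empirical_fisher(a, b)))
# det is (numerically) 0 on the true set {a = 0} union {b = 0}; positive away from it.
```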

Key Lesson: when a model class is singular, the complexity of a parameter \(w\) in parameter space \(W \subseteq \R^{d}\) needs a new interpretation.

RLCT

The dimensionality term \(d\) in the BIC fails for singular models. To fix this, we reframe dimensionality as the scaling exponent of the volume of "nearly" true parameters as the tolerance \(\varepsilon\) shrinks.

\[ W_\varepsilon = \left\{ w \in W \;\middle|\; K(w) < \varepsilon \right\} \quad \text{and} \quad V(\varepsilon) = \int_{W_\varepsilon} \psi(w) \, \mathrm{d}w \]

In general, under mild analyticity conditions on the model, the volume integral resolves to the following form as \(\varepsilon \to 0\).

\[ V(\varepsilon) = c \varepsilon^\lambda + o(\varepsilon^\lambda) \]

The exponent \(\lambda\) is the real log canonical threshold (RLCT); it generalizes the dimensionality of a singularity. For regular models \(\lambda = d/2\), recovering the BIC penalty \(\frac{d}{2} \log n\), while for singular models it can be strictly smaller.
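
The exponent can be read off numerically as the slope of \(\log V(\varepsilon)\) versus \(\log \varepsilon\). The sketch below (a uniform prior on \([-1,1]^2\) and two hand-picked toy losses, both assumptions of mine) recovers \(\lambda = d/2 = 1\) for a regular quadratic and \(\lambda = 1/2\) when one direction is flat.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.uniform(-1.0, 1.0, size=(1_000_000, 2))   # samples from a uniform prior on [-1, 1]^2

eps = np.logspace(-3, -1, 10)

# Toy "KL" functions: a regular quadratic bowl vs. one with a flat direction.
K_regular  = W[:, 0] ** 2 + W[:, 1] ** 2   # lambda = d/2 = 1
K_singular = W[:, 0] ** 2                  # lambda = 1/2 (volume ~ sqrt(eps))

for name, K in [("regular", K_regular), ("singular", K_singular)]:
    V = np.array([(K < e).mean() for e in eps])        # Monte Carlo estimate of V(eps)
    lam = np.polyfit(np.log(eps), np.log(V), 1)[0]     # log-log slope approximates lambda
    print(name, round(lam, 2))
```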