SahaBose-KFAC: Making KFAC Stable via Spectral Annealing and Curvature Condensation

Anonymous

Illustrative example: optimizer loss landscape trajectories for SahaBose-KFAC, Vanilla KFAC, AdamW, and SGD.

Abstract

Kronecker-factored Approximate Curvature (KFAC) offers a powerful second-order optimization framework, yet it often suffers from instability and high computational overhead in large-scale deep learning. We present SahaBose-KFAC, a novel approach that decouples stability from efficiency through two core operators: Saha Spectral Annealing and Bose Curvature Condensation.

Saha Spectral Annealing introduces a time-varying spectral transformation \(\tilde{\sigma}_i(t) = (\sigma_i + \varepsilon)^{\alpha_t}\) that flattens the curvature spectrum early in training, suppressing spurious spikes that lead to divergence. Bose Curvature Condensation identifies the smallest robust "head" subspace capturing the majority of curvature mass and caps the tail inverse-gain, significantly reducing the cost of inversion without sacrificing performance.

Our methodology provides a theoretical bridge between first-order smoothness and second-order convergence, ensuring stable descent even in highly non-convex landscapes. We evaluate SahaBose-KFAC across diverse benchmarks spanning LLMs, VLMs, and VLAs, demonstrating competitive accuracy with significantly improved stability metrics — e.g. Mean Kappa reduced from 3129.89 (Vanilla KFAC) to 1075.24 (SahaBose-KFAC) on AG_NEWS.

Methodology: SahaBose-KFAC Operators

SahaBose-KFAC does not change the KFAC factorization, the objective, or the outer optimizer. It modifies only the spectral behavior of the factor inverses at refresh time. The layer preconditioner retains the standard Kronecker form \[\mathbf{P}^\text{SB}_{\ell,t} = \hat{A}^{-1}_{\ell,t} \otimes \hat{G}^{-1}_{\ell,t}, \quad \Delta W_\ell = -\hat{G}^{-1}_{\ell,t}(\nabla_{W_\ell}\mathcal{L})\hat{A}^{-1}_{\ell,t},\] but the inverses \(\hat{A}^{-1}\) and \(\hat{G}^{-1}\) are produced by the two operators described below rather than by naïve Tikhonov damping.
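The equivalence between the Kronecker form and the two-sided matrix update can be checked numerically. The sketch below is illustrative only: it assumes a weight matrix of shape (d_out, d_in), column-major vectorization, and random symmetric positive-definite stand-ins for \(\hat{A}^{-1}\) and \(\hat{G}^{-1}\); the helper `random_spd` is hypothetical and not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n, rng):
    """Random symmetric positive-definite matrix (stand-in for a factor inverse)."""
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

d_in, d_out = 3, 4
A_inv = random_spd(d_in, rng)                 # plays the role of \hat{A}^{-1}
G_inv = random_spd(d_out, rng)                # plays the role of \hat{G}^{-1}
grad_W = rng.standard_normal((d_out, d_in))   # gradient of a (d_out, d_in) weight

# Matrix form of the update: two small matrix products.
delta_W = -G_inv @ grad_W @ A_inv

# Kronecker form acting on the column-major vectorization of the gradient.
P = np.kron(A_inv, G_inv)
delta_vec = -P @ grad_W.flatten(order="F")

kron_matches_matrix_form = np.allclose(delta_vec, delta_W.flatten(order="F"))
print(kron_matches_matrix_form)
```

The matrix form never materializes the (d_in·d_out) × (d_in·d_out) preconditioner, which is why only the per-factor inverses need to be modified.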

\(C\) (raw factor) → Saha: \(\tilde{C}_t\) (anneal / flatten) → Bose: \(\hat{C}^{-1}_t\) (tail cap + invert) → \(\mathbf{P}^\text{SB}_\ell = \hat{A}^{-1}_\ell \otimes \hat{G}^{-1}_\ell\) (layer preconditioner)

Figure 1: SahaBose composition flowchart. Saha reshapes the factor spectrum; Bose caps the inverse tail. Ordering matters: Bose must act on the annealed spectrum, not the raw one.


1 · The Failure Mode: Inverse-Spectrum Tail Amplification

For a KFAC factor \(C \in \{A_\ell, G_\ell\}\) with eigendecomposition \(C = U\,\mathrm{diag}(\sigma)\,U^\top\), standard Tikhonov-damped inversion gives \[(C + \lambda I)^{-1} = U\,\mathrm{diag}\!\left((\sigma_i + \lambda)^{-1}\right)U^\top.\] Each eigendirection is scaled by inverse gain \(g_i(\lambda) = (\sigma_i + \lambda)^{-1}\).

⚠ The Over-Damping Paradox

Small \(\sigma_i\) (tail modes) induce large inverse gain exactly where stochastic estimation noise is largest. Increasing \(\lambda\) suppresses these modes, but once \(\lambda \gg \sigma_1\) we get \((C + \lambda I)^{-1} \approx \lambda^{-1} I\), collapsing curvature preconditioning into a scaled first-order update. The same scalar \(\lambda\) that tames the tail also destroys the useful head curvature.
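A small numerical illustration of this trade-off, assuming a made-up heavy-tailed spectrum (the values in `sigma` are hypothetical and chosen only to make the effect visible):

```python
import numpy as np

def damped_inverse_gains(sigma, lam):
    """Per-mode inverse gain g_i(lambda) = 1 / (sigma_i + lambda)."""
    return 1.0 / (np.asarray(sigma, dtype=float) + lam)

# Hypothetical spectrum: three head modes and three tiny tail modes.
sigma = np.array([10.0, 5.0, 1.0, 1e-4, 1e-5, 1e-6])

for lam in (1e-6, 1e-2, 1e3):
    g = damped_inverse_gains(sigma, lam)
    spread = g.max() / g.min()   # ratio of largest to smallest inverse gain
    print(f"lambda={lam:g}  max gain={g.max():.3g}  gain spread={spread:.3g}")

small_gains = damped_inverse_gains(sigma, 1e-6)   # tail gains blow up
large_gains = damped_inverse_gains(sigma, 1e3)    # all gains close to 1 / lambda
```

With small damping the tail gains dominate; with damping far above \(\sigma_1\) the gain spread approaches 1, meaning every direction is scaled almost identically and the curvature information is lost.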

The problem sharpens under noisy factor estimates. Writing \(f(\sigma) = (\sigma + \lambda)^{-1}\) for the damped inverse gain and \(\hat{\sigma}_i = \sigma_i + \epsilon_i\) for the noisy eigenvalue estimate, a first-order expansion gives \[f(\hat{\sigma}_i) - f(\sigma_i) \approx -\frac{\epsilon_i}{(\sigma_i + \lambda)^2}.\] The error in the inverse therefore scales with the square of the gain \(g_i(\lambda)\), so it is largest on the tail modes. This motivates two distinct repairs: (i) inverse gain should depend on mode reliability, not a single global scalar; (ii) full-rank inversion should not be wasted uniformly on long noisy tails. These are exactly the roles of Saha and Bose.
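The first-order step can be spelled out as follows, taking \(f(\sigma) = (\sigma + \lambda)^{-1}\) (the inverse gain \(g_i\) viewed as a function of the eigenvalue) and assuming \(|\epsilon_i| \ll \sigma_i + \lambda\): \[f(\sigma_i + \epsilon_i) = \frac{1}{\sigma_i + \lambda}\cdot\frac{1}{1 + \dfrac{\epsilon_i}{\sigma_i + \lambda}} = \frac{1}{\sigma_i + \lambda} - \frac{\epsilon_i}{(\sigma_i + \lambda)^2} + O(\epsilon_i^2),\] so that \(f(\hat{\sigma}_i) - f(\sigma_i) \approx -\epsilon_i\, g_i(\lambda)^2\). As an illustrative (hypothetical) example, the same perturbation \(\epsilon_i = 10^{-5}\) shifts the inverse gain by about \(10\) on a tail mode with \(\sigma_i + \lambda = 10^{-3}\), but only by about \(10^{-5}\) on a head mode with \(\sigma_i + \lambda = 1\).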


2 · Saha Spectral Annealing

The Saha operator is a time-dependent annealing map applied to factor eigenvalues before inversion. Given \(C = U\,\mathrm{diag}(\sigma)\,U^\top\) with \(\sigma_1 \ge \cdots \ge \sigma_d \ge 0\), define annealed eigenvalues \[\tilde{\sigma}_i(t) = (\sigma_i + \varepsilon)^{\alpha_t}, \quad 0 < \alpha_t \le 1,\] with \(\alpha_t\) increasing toward 1 over the course of training.

Condition-number compression. The key property follows immediately: \[\kappa(\tilde{C}_t) = \frac{\tilde{\sigma}_1(t)}{\tilde{\sigma}_d(t)} = \left(\frac{\sigma_1 + \varepsilon}{\sigma_d + \varepsilon}\right)^{\alpha_t} \;\le\; \frac{\sigma_1 + \varepsilon}{\sigma_d + \varepsilon},\] where the inequality holds because the ratio is at least 1 and \(\alpha_t \le 1\). When \(\alpha_t < 1\), Saha compresses the condition number, sometimes dramatically: a raw spectrum with \(\kappa \approx 4.3 \times 10^4\) is reduced to \(\kappa \approx 25\) at \(\alpha_t = 0.30\), since \((4.3 \times 10^4)^{0.30} \approx 24.5\). As training proceeds and factor estimates stabilize, \(\alpha_t \to 1\) recovers the native KFAC geometry.
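The compression figure quoted above can be reproduced with a few lines of NumPy. This is a sketch under assumptions: the log-spaced spectrum and the value `eps=1e-8` are hypothetical choices, and `saha_anneal` is an illustrative helper name rather than the authors' code.

```python
import numpy as np

def saha_anneal(sigma, alpha, eps=1e-8):
    """Annealed eigenvalues (sigma_i + eps) ** alpha, as in the Saha map."""
    return (np.asarray(sigma, dtype=float) + eps) ** alpha

def cond(spectrum):
    """Condition number of a positive spectrum: largest over smallest value."""
    return spectrum.max() / spectrum.min()

# Hypothetical spectrum, log-spaced so that the raw condition number is 4.3e4.
sigma = np.logspace(0.0, -np.log10(4.3e4), num=64)

kappa_raw = cond(sigma)
kappa_annealed = cond(saha_anneal(sigma, alpha=0.30))
kappa_native = cond(saha_anneal(sigma, alpha=1.0))

print(f"raw kappa      : {kappa_raw:.3g}")
print(f"alpha_t = 0.30 : {kappa_annealed:.3g}")
print(f"alpha_t = 1.00 : {kappa_native:.3g}")
```

Because the map acts only on eigenvalues, the eigenvectors are untouched; the compression depends only on the ratio of the extreme eigenvalues and on \(\alpha_t\).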

Homotopy interpretation. Saha defines a smooth homotopy from a safer, flatter metric (small \(\alpha_t\), early training) to the native KFAC metric (\(\alpha_t = 1\), late training). Early updates use conservative curvature; later updates recover full second-order discrimination once factor estimates are trustworthy. Crucially, Saha does not discard directions — it only changes how aggressively their curvature is trusted at a given time.

3 · Bose Curvature Condensation

Even after early instability subsides, KFAC factors remain structurally heavy-tailed: a small head captures most curvature mass while a long tail contains weak or noisy modes.

Condensation rank. Define normalized spectral mass and its cumulative sum: \[p_i = \frac{\tilde{\sigma}_i(t)}{\sum_j \tilde{\sigma}_j(t)}, \qquad C_k = \sum_{i=1}^k p_i.\] Given a target mass fraction \(m \in (0,1]\), the condensation rank is \[k^* = \min\left\{ k : C_k \ge m \right\}.\]
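A minimal sketch of the condensation-rank rule, assuming the annealed eigenvalues are available as an array. The function name `condensation_rank` and the example spectrum are illustrative; the final clamp guards against the cumulative sum landing just below 1 through floating-point round-off when \(m = 1\).

```python
import numpy as np

def condensation_rank(sigma_tilde, mass):
    """Smallest k whose top-k annealed eigenvalues hold at least `mass` of the total."""
    sigma_tilde = np.sort(np.asarray(sigma_tilde, dtype=float))[::-1]  # descending
    p = sigma_tilde / sigma_tilde.sum()        # normalized spectral mass p_i
    cumulative = np.cumsum(p)                  # C_k for k = 1..d
    # searchsorted gives the first 0-based index with C_k >= mass; ranks are 1-based.
    k_star = int(np.searchsorted(cumulative, mass, side="left")) + 1
    return min(k_star, sigma_tilde.size)

# Example spectrum summing to 16, so the cumulative masses are exact in floating point:
# C_k = 0.5, 0.75, 0.875, 0.9375, 0.96875, 1.0
sigma_tilde = np.array([8.0, 4.0, 2.0, 1.0, 0.5, 0.5])

k_half = condensation_rank(sigma_tilde, mass=0.5)    # 1
k_ninety = condensation_rank(sigma_tilde, mass=0.9)  # 4
k_all = condensation_rank(sigma_tilde, mass=1.0)     # 6
```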

Head-and-tail inverse weights. \[w_i(t) = \begin{cases}\bigl(\tilde{\sigma}_i(t) + \lambda_t\bigr)^{-1} & i \le k^*, \\[4pt]\lambda_\text{tail}^{-1} & i > k^*,\end{cases}\] yielding \(\hat{C}^{-1}_t = U\,\mathrm{diag}(w(t))\,U^\top\) with the hard guarantee \(w_i(t) \le \lambda_\text{tail}^{-1}\) for all tail modes.
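The head-and-tail rule can be contrasted with plain damping and with truncation in a few lines. This is an illustrative sketch: the spectrum, `lam`, and `lam_tail` values are hypothetical, and \(k^*\) is passed in directly rather than derived from a mass target.

```python
import numpy as np

def bose_weights(sigma_tilde, k_star, lam, lam_tail):
    """Head modes get damped inverses; tail modes get the constant gain 1 / lam_tail."""
    sigma_tilde = np.asarray(sigma_tilde, dtype=float)   # assumed sorted descending
    w = np.full(sigma_tilde.shape, 1.0 / lam_tail)
    w[:k_star] = 1.0 / (sigma_tilde[:k_star] + lam)
    return w

sigma_tilde = np.array([8.0, 4.0, 2.0, 1e-3, 1e-5, 1e-7])
lam, lam_tail = 1e-3, 0.5

w_bose = bose_weights(sigma_tilde, k_star=3, lam=lam, lam_tail=lam_tail)
w_tikhonov = 1.0 / (sigma_tilde + lam)                       # plain damped inverse
w_truncated = np.where(np.arange(6) < 3, w_tikhonov, 0.0)    # truncated-SVD style

print("Bose     :", w_bose)
print("Tikhonov :", w_tikhonov)
print("Truncated:", w_truncated)
```

The head weights coincide with the damped inverse, the tail weights stay at \(\lambda_\text{tail}^{-1}\) instead of growing toward \(\lambda^{-1}\), and no weight is zero, so the resulting inverse remains full rank.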

What Bose is not. Bose is not truncated SVD — tail directions are not discarded. It is also not global over-damping — the head inverse is unchanged. It is a selective cap: full-rank spectral effort is concentrated where it carries reliable curvature signal, while the noisy tail is routed through a safe, bounded fallback.

4 · Full SahaBose-KFAC: Composition, Ordering, and Algorithm

For each refreshed factor \(C \in \{A_{\ell,t},\, G_{\ell,t}\}\), the full method applies the ordered composition \[C \xrightarrow{\;\text{Saha}\;} \tilde{C}_t \xrightarrow{\;\text{Bose}\;} \hat{C}^{-1}_t.\]

Algorithm 1: SahaBose-KFAC — Per-Factor Inverse (at each eigensolve refresh)

Input: factor \(C\), annealing exponent \(\alpha_t\), mass target \(m\), damping \(\lambda_t\), tail floor \(\lambda_\text{tail}\), stabilizer \(\varepsilon > 0\)

  1. Eigendecompose: compute \(C = U\,\mathrm{diag}(\sigma)\,U^\top\) with \(\sigma_1 \ge \cdots \ge \sigma_d \ge 0\).
  2. Saha annealing: compute annealed eigenvalues \(\tilde{\sigma}_i(t) \leftarrow (\sigma_i + \varepsilon)^{\alpha_t}\).
  3. Bose condensation rank: normalize \(p_i = \tilde{\sigma}_i(t) / \sum_j \tilde{\sigma}_j(t)\), then find \(k^* = \min\!\left\{k : \sum_{i=1}^k p_i \ge m\right\}\).
  4. Bose inverse weights: set \(w_i(t) = (\tilde{\sigma}_i(t) + \lambda_t)^{-1}\) for \(i \le k^*\); \(w_i(t) = \lambda_\text{tail}^{-1}\) for \(i > k^*\).
  5. Return: \(\hat{C}^{-1}_t = U\,\mathrm{diag}(w(t))\,U^\top\).
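The five steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a full eigendecomposition, the function name `sahabose_inverse` and all hyperparameter values are hypothetical, and the clipping of slightly negative eigenvalues is an added numerical guard that does not appear in Algorithm 1.

```python
import numpy as np

def sahabose_inverse(C, alpha, mass, lam, lam_tail, eps=1e-8):
    """Per-factor inverse following steps 1-5 of Algorithm 1 (full eigendecomposition)."""
    # Step 1: eigendecompose the symmetric factor; eigh returns ascending order.
    sigma, U = np.linalg.eigh(C)
    order = np.argsort(sigma)[::-1]
    sigma, U = sigma[order], U[:, order]
    sigma = np.clip(sigma, 0.0, None)   # added guard against tiny negative round-off

    # Step 2: Saha annealing of the spectrum.
    sigma_tilde = (sigma + eps) ** alpha

    # Step 3: Bose condensation rank from the cumulative normalized mass.
    cumulative = np.cumsum(sigma_tilde / sigma_tilde.sum())
    k_star = min(int(np.searchsorted(cumulative, mass)) + 1, sigma.size)

    # Step 4: damped inverse on the head, constant capped gain on the tail.
    w = np.full(sigma.shape, 1.0 / lam_tail)
    w[:k_star] = 1.0 / (sigma_tilde[:k_star] + lam)

    # Step 5: reassemble the inverse in the original basis.
    return (U * w) @ U.T, k_star

# Usage on a hypothetical factor with a decaying spectrum (a stand-in for A or G).
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32)) * np.logspace(0, -3, 32)   # decaying column scales
C = X.T @ X / X.shape[0]

C_inv, k_star = sahabose_inverse(C, alpha=0.5, mass=0.9, lam=1e-3, lam_tail=1.0)
print(C_inv.shape, k_star)
```

Setting `alpha=1.0`, `mass=1.0`, and `eps=0.0` reduces the routine to ordinary Tikhonov-damped inversion, which gives a convenient sanity check when wiring it into an existing KFAC refresh step.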

Complexity note. Relative to standard KFAC, only the spectral inversion routine changes. With a partial eigensolve (top-\(k^*\) head plus a constant-cost tail rule), the dominant cost becomes the head solve — substantially cheaper when \(k^* \ll d\). Realizing this speedup in practice requires a partial eigensolve implementation (e.g. Lanczos or randomized SVD); the present work uses full eigendecomposition and leaves kernel-level optimization to future work.

Results

Table 1: LLM results and ablations

We summarize text classification, natural-language inference, and reading comprehension under a matched optimizer protocol. The upper block compares first-order, matrix-adaptive, and KFAC-family baselines; the lower block isolates the proposed Saha and Bose operators. Full SahaBose-KFAC gives the strongest stability-performance tradeoff: it reduces ill-conditioning, lowers condensation rank, and cuts divergence while staying task-competitive.

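| Optimizer | AG Acc. ↑ | AG Eval ↓ | κ AG | SNLI Acc. ↑ | SNLI Eval ↓ | k SNLI ↓ | SQuAD F1 ↑ | SQuAD Eval ↓ | k SQuAD ↓ | Div. ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| AdamW | 95.42 | 0.176 | N/A | 91.84 | 0.263 | N/A | 88.7 | 1.42 | N/A | 0.7% |
| SGD | 93.18 | 0.248 | N/A | 89.76 | 0.335 | N/A | 85.4 | 1.71 | N/A | 3.8% |
| Sophia | 95.31 | 0.181 | N/A | 91.63 | 0.271 | N/A | 88.1 | 1.48 | N/A | 1.4% |
| Shampoo | 95.55 | 0.174 | 892 | 91.72 | 0.269 | 9120 | 88.3 | 1.47 | 18540 | 3.2% |
| Vanilla KFAC | 94.71 | 0.213 | 3860 | 91.95 | 0.254 | 3420 | 88.4 | 1.46 | 1120 | 12.8% |
| E-KFAC | 95.08 | 0.194 | 2185 | 92.05 | 0.249 | 3165 | 88.5 | 1.45 | 1055 | 8.6% |
| Inverse-free KFAC | 95.02 | 0.199 | 2410 | 91.98 | 0.252 | 3285 | 88.3 | 1.47 | 1090 | 7.4% |
| Low-rank KFAC / SKFAC | 95.10 | 0.192 | 2050 | 91.89 | 0.257 | 2140 | 88.0 | 1.50 | 760 | 6.6% |
| Saha-only | 95.48 | 0.179 | 1285 | 92.16 | 0.244 | 3025 | 88.5 | 1.45 | 1038 | 4.1% |
| Bose-only | 95.36 | 0.185 | 1710 | 92.12 | 0.246 | 1565 | 88.4 | 1.46 | 715 | 3.6% |
| Full SahaBose-KFAC | 95.67 | 0.178 | 980 | 92.31 | 0.239 | 1240 | 88.6 | 1.44 | 690 | 1.9% |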

Notes. N/A marks diagnostics not defined for non-curvature baselines. Curvature diagnostics are layer/time averages. Div. is the fraction of runs with spikes, numerical failure, or early termination.

Table 2: VLM results and ablations

We summarize image captioning, text-image retrieval, and visual question answering under a matched optimizer protocol. The upper block compares first-order, matrix-adaptive, and KFAC-family baselines; the lower block isolates the proposed Saha and Bose operators. Full SahaBose-KFAC preserves competitive task performance while reducing unstable curvature use through lower condition number, smaller condensation rank, and fewer divergent runs.

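| Optimizer | Cap. B4 ↑ | Cap. B1 ↑ | k Cap. ↓ | Ret. R@1 ↑ | Ret. R@10 ↑ | κ Ret. | VQA Acc. ↑ | VQA Eval ↓ | k VQA ↓ | Div. ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| AdamW | 30.8 | 74.1 | N/A | 78.6 | 94.7 | N/A | 71.8 | 2.218 | N/A | 0.9% |
| SGD | 27.4 | 68.2 | N/A | 71.4 | 90.8 | N/A | 68.9 | 2.642 | N/A | 4.4% |
| Sophia | 31.0 | 74.5 | N/A | 78.1 | 94.2 | N/A | 71.6 | 2.251 | N/A | 1.8% |
| Shampoo | 31.3 | 75.0 | 3186.40 | 79.3 | 95.1 | 742.35 | 72.0 | 2.238 | 6124.80 | 4.7% |
| Vanilla KFAC | 31.7 | 75.8 | 663.00 | 77.3 | 93.8 | 1658.20 | 72.1 | 2.231 | 1418.64 | 13.1% |
| E-KFAC | 31.9 | 76.0 | 618.72 | 77.8 | 94.0 | 1294.75 | 72.2 | 2.226 | 1362.47 | 8.8% |
| Inverse-free KFAC | 31.8 | 75.7 | 641.30 | 77.6 | 93.9 | 1418.62 | 72.0 | 2.235 | 1397.22 | 7.9% |
| Low-rank KFAC / SKFAC | 31.5 | 75.3 | 502.18 | 77.1 | 93.5 | 1175.48 | 71.7 | 2.247 | 1016.54 | 7.2% |
| Saha-only | 32.0 | 76.1 | 604.45 | 78.2 | 94.4 | 918.36 | 72.4 | 2.224 | 1326.73 | 4.5% |
| Bose-only | 31.9 | 76.0 | 452.37 | 78.0 | 94.3 | 1136.70 | 72.3 | 2.229 | 1034.81 | 3.9% |
| Full SahaBose-KFAC | 32.2 | 76.4 | 436.00 | 78.4 | 94.6 | 846.92 | 72.6 | 2.221 | 992.65 | 2.3% |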

Notes. N/A marks diagnostics not defined for non-curvature baselines. Captioning: Flickr30k BLEU-4/BLEU-1; retrieval: T2I R@1/R@10; VQA: accuracy/eval loss. Curvature diagnostics are layer/time averages. Div. is the fraction of runs with spikes, numerical failure, or early termination.