A Unified Framework for the Complexity of Annealing

Sampling & Normalizing-Constant Estimation

Wei Guo  ·  joint work with Molei Tao & Yongxin Chen

Georgia Institute of Technology

Two tasks

Given an unnormalized probability density function  $\pi \propto \hat\pi := \mathrm{e}^{-V}$  on $\mathbb{R}^{d}$:

Sampling

Draw  $X \sim \pi$.

Normalizing constant estimation

Estimate  $Z = \int_{\mathbb{R}^d} \mathrm{e}^{-V(x)}\,\mathrm{d}x$,   or equivalently, the free energy $F = -\log Z$.

two tasks

Applications: Bayesian posteriors & evidence; the partition function / free energy in statistical mechanics; energy-based & latent-variable models; volume of convex bodies.

They look like two problems — I'll argue they're closely related, and the same number tells you how hard both are.

Why we care

Where these show up

The same $\pi\propto\mathrm{e}^{-V}$ shows up across domains:

Statistical mechanics / molecular dynamics

Boltzmann distribution and molecular energies

(Figure: Guan-Horng Liu's talk on ASBS.)

Bayesian inference

prior, likelihood and posterior densities
$$p(\theta\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\theta)\,p(\theta)}{p(\mathcal{D})}, \quad p(\mathcal{D}) = \int p(\mathcal{D}\mid\theta)\,p(\theta)\,\mathrm{d}\theta.$$

unnormalized = likelihood $\times$ prior;   $Z=p(\mathcal{D})$ = evidence (marginal likelihood).

Sampling $\leftrightarrow$ configurations / posterior samples;   normalizing constant $\leftrightarrow$ free energy / model evidence.

Background

The classical recipes

Each task has a classical approach:

Sampling — MCMC

Build a Markov chain whose stationary law is $\pi$ (Metropolis–Hastings, Gibbs, Langevin, …) and run the chain till convergence.

a Markov chain exploring a multimodal density
Normalizing constant — Importance sampling

With a tractable proposal $\pi_0$:   $Z = \mathbb{E}_{\pi_0}\!\left[\frac{\hat\pi}{\pi_0}\right] \approx \frac1n\sum_{i=1}^n \frac{\hat\pi(X_i)}{\pi_0(X_i)}$,   $X_i\sim\pi_0$.

importance sampling: proposal versus target with weighted samples

Clean and provably efficient when $\pi$ is well-conditioned, or close to $\pi_0$.

Why it's hard in practice

Sampling

Plain Langevin / MCMC gets trapped in one mode $\Rightarrow$ mixing time exponential in mode distance / dimension.

Normalizing constant

Naive importance sampling has high variance from proposal–target mismatch.

High dimension + multimodality is the shared difficulty.

multimodal landscape, trapped sampler

Annealing: a smooth interpolation between distributions

Bridge $\pi_0 \to \pi$ with a curve of distributions

$$\Big(\pi_\theta = \frac{1}{Z_\theta}\mathrm{e}^{-V_\theta}\Big)_{\theta\in[0,1]}, \quad \pi_0 \text{ easy}, \;\; \pi_1 = \pi \text{ target}.$$

Two common choices:

  • Geometric:   $\pi_\theta \propto \exp\!\big(-V - \frac{\lambda(\theta)}{2}\|\cdot\|^2\big)$.
  • Gaussian convolution:   $\pi_\theta = \pi * \mathcal{N}\!\big(0,\sigma(\theta)^2 I\big)$.
two annealing curves

Replace one hard jump with many easy steps — break the hard problem down along the curve.

Prerequisite

Langevin diffusion (LD) and Langevin Monte Carlo (LMC)

Langevin diffusion (LD):   $\mathrm{d}X_t = -\nabla V(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t$,   stationary distribution $\pi\propto\mathrm{e}^{-V}$.
Langevin Monte Carlo (LMC):   $X_{(k+1)h} = X_{kh} - h\,\nabla V(X_{kh}) + \sqrt{2h}\,\mathcal{N}(0,I)$   (Euler–Maruyama).
  • Fast under log-concavity / isoperimetry (e.g., log-Sobolev or Poincaré); but slow when $\pi$ is multimodal.

Vempala & Wibisono (2019). Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices. NeurIPS.

Algorithm · left rail (sampling)

Annealed Langevin diffusion (ALD)

Run Langevin with a moving target — Annealed Langevin Diffusion (ALD):

$$\mathrm{d}X_t = \nabla\log\tilde\pi_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, \qquad \tilde\pi_t := \pi_{t/T},\qquad t\in[0,T].$$
  • Write its path measure, i.e., the law of the whole trajectory, as $\mathbb{P}^{\rightarrow}$.
  • Discretize (one step per intermediate distribution) $\Rightarrow$ Annealed Langevin Monte Carlo (ALMC).

Algorithm · right rail (normalizing constant)

Jarzynski equality (JE)

A non-equilibrium identity from statistical mechanics. Along ALD, define the work functional

$$W(X) = \frac1T\int_0^T \partial_\theta V_\theta\big|_{\theta=t/T}(X_t)\,\mathrm{d}t.$$
$$\mathbb{E}_{\mathbb{P}^{\rightarrow}}\,\mathrm{e}^{-W} = \mathrm{e}^{-\Delta F}, \qquad \Delta F = -\log\frac{Z_1}{Z_0}.$$
  • Unbiased estimator $\hat Z = Z_0\,\mathrm{e}^{-W(X)}$ of $Z = Z_0\,\mathrm{e}^{-\Delta F}$.
  • Corollary: Second law of thermodynamics: $\mathbb{E}_{\mathbb{P}^{\rightarrow}}W \ge \Delta F$.

Jarzynski (1997). Nonequilibrium equality for free energy differences. PRL.

Algorithm · right rail (normalizing constant)

Annealed Importance Sampling (AIS): discretized JE

  • Annealed Importance Sampling (Neal ’01): distributions $\pi_\ell = \frac{1}{Z_\ell}f_\ell$, $\ell\in\{0,\dots,M\}$, and kernels $F_\ell$ that leave $\pi_\ell$ invariant. Forward path measure:
$$\mathbb{P}^{\rightarrow}(x_{0:M}) = \pi_0(x_0)\prod_{\ell=1}^{M} F_\ell(x_{\ell-1}, x_\ell).$$
$$\mathbb{E}_{\mathbb{P}^{\rightarrow}}\,\mathrm{e}^{-W} = \mathrm{e}^{-\Delta F}, \quad W(x_{0:M}) = \log\prod_{\ell=0}^{M-1}\frac{f_\ell(x_\ell)}{f_{\ell+1}(x_\ell)}, \quad \Delta F = -\log\frac{Z_M}{Z_0}.$$

ALD : ALMC  ::  JE : AIS

continuous $\leftrightarrow$ discrete, mirrored on each task

Neal (2001). Annealed importance sampling. Statistics and Computing.

The question

The gap we close

Question: how many oracle calls to $(V,\nabla V)$, with weak assumption on $\pi$?

Gelman & Meng (1998), Simulating normalizing constants: from importance sampling to bridge sampling to path sampling; Brosse, Durmus & Moulines (2018), Normalizing constants of log-concave densities; Ge, Lee & Lu (2020), Estimating Normalizing Constants for Log-Concave Distributions: Algorithms and Lower Bounds.

The toolkit

Optimal transport

  • Curve “speed” = Wasserstein-2 metric derivative   $|\dot\rho|_t := \lim_{\delta\to0}\frac{W_2(\rho_{t+\delta},\rho_t)}{|\delta|}$.
  • Continuity equation: a vector field $(v_t)$ generates $(\rho_t)$ iff $\partial_t\rho_t + \nabla\!\cdot(\rho_t v_t)=0$. Equivalently, if $\dot X_t = v_t(X_t)$ and $X_0 \sim \rho_0$, then $X_t \sim \rho_t$.
Key lemma
For an absolutely continuous curve $(\rho_t)$, any vector field $(v_t)$ that generates it satisfies $|\dot\rho|_t \le \|v_t\|_{L^2(\rho_t)}$ for every $t$. Moreover, there is a unique generating field $(v^*_t)$ that reaches equality for every $t$, given by $v^*_t = \lim_{\delta\to0}\frac{T_{\rho_t\to\rho_{t+\delta}} - \mathrm{id}}{\delta}$, where $T_{\mu\to\nu}$ is the optimal transport map from $\mu$ to $\nu$.
an optimal transport plan coupling a source and a target distribution

Optimal transport. (Figure: Wikipedia)

Ambrosio, Gigli & Savaré (2008), Gradient Flows: In Metric Spaces and in the Space of Probability Measures.

The core quantity

The action

$$\mathcal{A} := \int_0^1 |\dot\pi|_\theta^2\,\mathrm{d}\theta$$

the “kinetic energy” of the curve

Example (heat-flow curve $\rho_t=\rho_0*\mathcal{N}(0,2St\,I)$). For any Gaussian mixture $\rho_0=\sum_{i=1}^N w_i\,\mathcal{N}(\mu_i,\sigma^2 I)$:

$$\mathcal{A} \le \frac{Sd}{2}\,\log\!\Big(1+\frac{2S}{\sigma^2}\Big)$$

Independent of the number of modes $N$, weights $w_i$, and means $\mu_i$ — while LSI / PI constants blow up exponentially in $\max_{i,j}\|\mu_i-\mu_j\|$.

The action sees through multimodality that isoperimetry cannot.

The toolkit

Girsanov on path measures

Take two SDEs driven by Brownian motion, $t\in[0,T]$:

$$\mathrm{d}X_t = a_t(X_t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t,\;\; X_0\sim\mu; \qquad \mathrm{d}Y_t = b_t(Y_t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t,\;\; Y_0\sim\nu.$$

The KL between their path measures $\mathbb{P}^X, \mathbb{P}^Y$ is

$$\mathrm{KL}(\mathbb{P}^X\,\|\,\mathbb{P}^Y) = \mathrm{KL}(\mu\,\|\,\nu) + \frac{1}{2\sigma^2}\int_0^T \mathbb{E}_{\mathbb{P}^X}\big\|a_t(X_t) - b_t(X_t)\big\|^2\,\mathrm{d}t.$$

The big picture

The proof idea: three path measures

analysis framework: forward, backward, reference paths

Forward $\mathbb{P}^{\rightarrow}$ (the algorithm), backward $\mathbb{P}^{\leftarrow}$, and the compensated reference $\mathbb{P}$ threading the intermediates $\pi_{\theta_\ell}$.

Continuous analysis · left rail (sampling)

Error analysis of annealed Langevin diffusion

Consider two SDEs started at $X_0\sim\tilde\pi_0$,   and let $\tilde\pi_t := \pi_{t/T}$:

$$\begin{aligned} \mathbb{P}^{\rightarrow}\ (\text{ALD}):\quad & \mathrm{d}X_t = \nabla\log\tilde\pi_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t,\\[2pt] \mathbb{P}\ (\text{compensated}):\quad & \mathrm{d}X_t = \big(\nabla\log\tilde\pi_t(X_t) + v_t(X_t)\big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t. \end{aligned}$$

Pick $v_t$ s.t. $X_t\sim\tilde\pi_t$ for all $t$; by Fokker–Planck, this means $v_t$ generates $\tilde\pi_t$:

$$\partial_t\tilde\pi_t + \nabla\!\cdot(\tilde\pi_t\, v_t) = 0.$$
$$\mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}^{\rightarrow}) = \frac14\int_0^T \|v_t\|_{L^2(\tilde\pi_t)}^2\,\mathrm{d}t \;\xrightarrow{\ \text{optimal } v^*\ }\; \frac{1}{4T}\int_0^1 |\dot\pi|_\theta^2\,\mathrm{d}\theta = \frac{\mathcal{A}}{4T}.$$
$$\mathrm{KL}(\pi\,\|\,\nu^{\mathrm{ALD}}) = \mathrm{KL}(\mathbb{P}_T\,\|\,\mathbb{P}^{\rightarrow}_T) \le \mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}^{\rightarrow}) = \frac{\mathcal{A}}{4T}\quad(\text{data-processing}).$$
Theorem
$T = \frac{\mathcal{A}}{4\varepsilon^2} \;\Rightarrow\; \mathrm{KL}(\pi\,\|\,\nu^{\mathrm{ALD}})\le\varepsilon^2$,   where $\nu^{\mathrm{ALD}}$ is the law of ALD at time $T$.

Continuous analysis · right rail (normalizing constant)

Error analysis of Jarzynski equality

A new player: the backward SDE $\mathbb{P}^{\leftarrow}$ — ALD run backward in time with time-reversed Brownian motion $B^{\leftarrow}$:

$$\mathrm{d}X_t = -\nabla\log\tilde\pi_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B^{\leftarrow}_t, \quad t\in[0,T], \quad X_T\sim\tilde\pi_T.$$

Crooks fluctuation theorem:

$$\log\frac{\mathrm{d}\mathbb{P}^{\rightarrow}}{\mathrm{d}\mathbb{P}^{\leftarrow}}(X) = -\int_0^T (\partial_t\log\tilde\pi_t)(X_t)\,\mathrm{d}t = W(X) - \Delta F \;\Longrightarrow\; \mathbb{E}_{\mathbb{P}^{\rightarrow}}\,\mathrm{e}^{-W} = \mathrm{e}^{-\Delta F}.$$

Estimator $\hat Z = Z_0\,\mathrm{e}^{-W(X)}$, $X\sim\mathbb{P}^{\rightarrow}$.

Theorem
$T = \dfrac{32\,\mathcal{A}}{\varepsilon^2} \;\Longrightarrow\; \mathrm{Pr}\!\left(\left|\dfrac{\hat Z}{Z} - 1\right| \le \varepsilon\right) \ge \dfrac{3}{4}.$   (median trick boosts $\frac34$ to $1-\zeta$.)

Key step: the error is the fluctuation of the RN derivative, bounded through the reference $\mathbb{P}$:

$$\mathrm{Pr}\!\left(\Big|\tfrac{\hat Z}{Z}-1\Big|\ge\varepsilon\right) = \mathbb{P}^{\rightarrow}\!\left(\Big|\tfrac{\mathrm{d}\mathbb{P}^{\leftarrow}}{\mathrm{d}\mathbb{P}^{\rightarrow}}-1\Big|\ge\varepsilon\right) \le \frac{\sqrt{2}}{\varepsilon}\left(\sqrt{\mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}^{\rightarrow})} + \sqrt{\mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}^{\leftarrow})}\right).$$

Crooks (1998), Nonequilibrium Measurements of Free Energy Differences for Microscopically Reversible Markovian Systems; Crooks (1999), Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences; Vargas, Padhy, Blessing & Nüsken (2024), Transport meets Variational Inference: Controlled Monte Carlo Diffusions.

The reveal

One engine, two tasks

$$T \asymp \frac{\mathcal{A}}{\varepsilon^2} \quad \text{for both sampling and normalizing constant estimation}$$

Key quantity: the action $\mathcal{A}=\int_0^1 |\dot\pi|_\theta^2\,\mathrm{d}\theta$.

Jerrum, Valiant & Vazirani (1986), Random generation of combinatorial structures from a uniform distribution, Theoretical Computer Science.

Discrete time

From continuous diffusion to implementable algorithm: final complexities

Sampling (ALMC)
$$\widetilde{O}\!\left(\frac{d\,\beta^2\,\mathcal{A}^2}{\varepsilon^6}\right)$$

to reach $\mathrm{KL}(\pi\|\cdot)\le\varepsilon^2$.

Normalizing constant (AIS)
$$\widetilde{O}\!\left(\frac{d^{4/3}}{\varepsilon^2} \vee \frac{m\beta\mathcal{A}^\frac12}{\varepsilon^2} \vee \frac{d\beta^2\mathcal{A}^2}{\varepsilon^4}\right)$$

to get $\hat Z$ s.t. relative error $\le\varepsilon$ w.h.p.

Theoretical payoff

Mixture of Gaussians: polynomial vs. exponential

Target: a Gaussian mixture with equal-norm means $\|y_i\|=r$;  $V=-\log\pi$ is $B$-smooth with $B=\beta(4r^2\beta+1)$:

$$\pi = \sum_{i=1}^N p_i\,\mathcal{N}\!\left(y_i,\ \beta^{-1}I\right), \qquad \textstyle\sum_i p_i = 1,\ \ p_i>0.$$

Geometric annealing curve and its action:

$$\pi_\theta \propto \exp\!\left(-V - \tfrac{\lambda(\theta)}{2}\|\cdot\|^2\right),\ \ \lambda(\theta)=dB(1-\theta)^\gamma; \qquad \mathcal{A} = O\!\left(d(r^2\beta+1)\Big(r^2 + \tfrac{d}{\beta}\Big)\right).$$
Ours (via the action)

$\mathcal{A}$ is polynomial $\Rightarrow$ $\mathrm{poly}\!\left(d,\beta,r,\tfrac{1}{\varepsilon}\right)$ oracle complexity.

LMC under LSI

LSI constant $C_{\mathrm{LSI}} = \Omega\!\left(\mathrm{e}^{\Theta(\beta r^2)}\right)$ $\Rightarrow$ exponential complexity.

Beyond geometric interpolation

The catch with geometric interpolation

  • The whole story hinges on $\mathcal{A}$ being small, but it need not be.
  • 1-dimensional example: $\pi = \frac12\mathcal{N}(0,1)+\frac12\mathcal{N}(m,1)$,   $\pi_\theta(x)\propto\pi(x)\,\mathrm{e}^{-\lambda(\theta)x^2/2}$,   $\lambda(\theta)=m^2(1-\theta)^r$.
$$\mathcal{A}_r \gtrsim m^4\,\mathrm{e}^{m^2/40}.$$

exponential in the mode separation

Mass teleportation / Mode switching: weights of the modes change dramatically, so mass must teleport across a large distance.

mode switching: the far mode appears only as theta approaches 1

Figure: Chemseddine, Wald, Duong & Steidl (2025), Neural Sampling from Boltzmann Densities: Fisher–Rao Curves in the Wasserstein Geometry, ICLR.

Beyond geometric interpolation

A better curve: reverse the OU process

  • Borrow the diffusion-model curve: run the OU process from $Y_0\sim\pi$,   $\mathrm{d}Y_t = -Y_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t,\quad Y_t\sim\bar\pi_t$,   $\bar\pi_t\to\mathcal{N}(0,I)$ as $t\to\infty$.
  • Its time-reverse $(\bar\pi_{T-t})_{t\in[0,T]}$ is the annealing curve we use.
Proposition
Let $\pi\propto\mathrm{e}^{-V}$ with $V$ $\beta$-smooth, and finite second moment $m^2 := \mathbb{E}_\pi\|\cdot\|^2 < \infty$. Then the action of the reverse-OU curve satisfies $$\int_0^\infty |\dot{\bar\pi}|_t^2\,\mathrm{d}t \;\le\; d\beta + m^2.$$
  • By our JE theorem, this curve should estimate $Z$ far more efficiently.
  • This also partially explains why diffusion models work well: their trajectory is smooth in the Wasserstein geometry.

Beyond geometric interpolation

Reverse diffusion samplers (RDS) for normalizing constant estimation

RDS (RDMC / RSDMC / ZODMC / SNDMC) simulate the time-reversal of OU with a learning-free score estimator $s_t \approx \nabla\log\bar\pi_t$:

$$\begin{aligned} \widehat{\mathbb{Q}}\ (\text{run}):\quad & \mathrm{d}X_t = \big(X_t + 2\,s_{T-t_-}(X_{t_-})\big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, & X_0\sim\phi=\mathcal{N}(0,I),\\[2pt] \mathbb{Q}\ (\text{ideal}):\quad & \mathrm{d}X_t = \big(X_t + 2\,\nabla\log\bar\pi_{T-t}(X_t)\big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, & X_0\sim\bar\pi_T. \end{aligned}$$

Estimator $\hat Z = \mathrm{e}^{-W(X)}$, $X\sim\widehat{\mathbb{Q}}$, with the work functional

$$W(X) = \log\phi(X_0) + V(X_{T-\delta}) - (T-\delta)d + \int_0^{T-\delta}\!\Big(\|s_{T-t_-}(X_{t_-})\|^2\,\mathrm{d}t + \sqrt{2}\,\big\langle s_{T-t_-}(X_{t_-}),\,\mathrm{d}B_t\big\rangle\Big).$$
Theorem
If $\mathrm{KL}(\mathbb{Q}\,\|\,\widehat{\mathbb{Q}})\lesssim\varepsilon^2$ and $\mathrm{TV}(\pi,\bar\pi_\delta)\lesssim\varepsilon$, then $\mathrm{Pr}\!\left(\left|\dfrac{\hat Z}{Z}-1\right|\le\varepsilon\right)\ge\dfrac{3}{4}$.

RDMC: Huang, Dong, Hao, Ma & Zhang (ICLR 2024), Reverse Diffusion Monte Carlo.  RSDMC: Huang, Zou, Dong, Ma & Zhang (COLT 2024), Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo.  ZODMC: He, Rojas & Tao (NeurIPS 2024), Zeroth-Order Sampling Methods for Non-Log-Concave Distributions: Alleviating Metastability by Denoising Diffusion.  SNDMC: Vacher, Chehab & Korba (NeurIPS 2025), Sampling from multi-modal distributions with polynomial query complexity in fixed dimension via reverse diffusion.

Beyond geometric interpolation · the closing idea

The fundamental trade-off

Tractable curve (geometric)

Closed-form drift $\nabla\log\pi_\theta$, but possibly exponential action $\mathcal{A}$.

OU curve

Polynomial action $\mathcal{A}$, but the drift $\nabla\log\bar\pi_t$ has no closed form $\Rightarrow$ must be estimated.

You pay either in the curve or in the drift.

Wrap up

Summary

Looking ahead

Future directions

Looking ahead

Closely related & follow-up work

Thank you!

Joint work with Molei Tao & Yongxin Chen.

Sampling (ICLR’25):  arxiv.org/abs/2407.16936

Normalizing constant (ICLR’26):  arxiv.org/abs/2502.04575

Created with Opus 4.8 using the Codeck skill (https://github.com/hiyeshu/codeck)

Zoomed