A Unified Framework for the Complexity of Annealing

Sampling & Normalizing-Constant Estimation

Wei Guo · joint work with Molei Tao & Yongxin Chen

Georgia Institute of Technology

Two tasks

Given an unnormalized probability density function $\pi \propto \hat\pi := \mathrm{e}^{-V}$ on $\mathbb{R}^{d}$:

Sampling

Draw $X \sim \pi$.

Normalizing constant estimation

Estimate $Z = \int_{\mathbb{R}^d} \mathrm{e}^{-V(x)}\,\mathrm{d}x$, or equivalently, the free energy $F = -\log Z$.

Applications: Bayesian posteriors & evidence; the partition function / free energy in statistical mechanics; energy-based & latent-variable models; volume of convex bodies.

They look like two problems — I'll argue they're closely related, and the same number tells you how hard both are.

Why we care

Where these show up

The same $\pi\propto\mathrm{e}^{-V}$ shows up across domains:

Statistical mechanics / molecular dynamics

Boltzmann distribution and molecular energies

(Figure: Guan-Horng Liu's talk on ASBS.)

Bayesian inference

prior, likelihood and posterior densities

p(\theta\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\theta)\,p(\theta)}{p(\mathcal{D})}, \quad p(\mathcal{D}) = \int p(\mathcal{D}\mid\theta)\,p(\theta)\,\mathrm{d}\theta.

unnormalized = likelihood $\times$ prior; $Z=p(\mathcal{D})$ = evidence (marginal likelihood).

Sampling $\leftrightarrow$ configurations / posterior samples; normalizing constant $\leftrightarrow$ free energy / model evidence.

Background

The classical recipes

Each task has a classical approach:

Sampling — MCMC

Build a Markov chain whose stationary law is $\pi$ (Metropolis–Hastings, Gibbs, Langevin, …) and run the chain till convergence.

a Markov chain exploring a multimodal density

Normalizing constant — Importance sampling

With a tractable proposal $\pi_0$: $Z = \mathbb{E}_{\pi_0}\!\left[\frac{\hat\pi}{\pi_0}\right] \approx \frac1n\sum_{i=1}^n \frac{\hat\pi(X_i)}{\pi_0(X_i)}$, $X_i\sim\pi_0$.

importance sampling: proposal versus target with weighted samples

Clean and provably efficient when $\pi$ is well-conditioned, or close to $\pi_0$.

Why it's hard in practice

Sampling

Plain Langevin / MCMC gets trapped in one mode $\Rightarrow$ mixing time exponential in mode distance / dimension.

Normalizing constant

Naive importance sampling has high variance from proposal–target mismatch.

High dimension + multimodality is the shared difficulty.

Annealing: a smooth interpolation between distributions

Bridge $\pi_0 \to \pi$ with a curve of distributions

\Big(\pi_\theta = \frac{1}{Z_\theta}\mathrm{e}^{-V_\theta}\Big)_{\theta\in[0,1]}, \quad \pi_0 \text{ easy}, \;\; \pi_1 = \pi \text{ target}.

Two common choices:

Geometric: $\pi_\theta \propto \exp\!\big(-V - \frac{\lambda(\theta)}{2}\|\cdot\|^2\big)$.
Gaussian convolution: $\pi_\theta = \pi * \mathcal{N}\!\big(0,\sigma(\theta)^2 I\big)$.

Replace one hard jump with many easy steps — break the hard problem down along the curve.

Prerequisite

Langevin diffusion (LD) and Langevin Monte Carlo (LMC)

Langevin diffusion (LD):   $\mathrm{d}X_t = -\nabla V(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t$,   stationary distribution $\pi\propto\mathrm{e}^{-V}$.

Langevin Monte Carlo (LMC):   $X_{(k+1)h} = X_{kh} - h\,\nabla V(X_{kh}) + \sqrt{2h}\,\mathcal{N}(0,I)$   (Euler–Maruyama).

Fast under log-concavity / isoperimetry (e.g., log-Sobolev or Poincaré); but slow when $\pi$ is multimodal.

Vempala & Wibisono (2019). Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices. NeurIPS.

Algorithm · left rail (sampling)

Annealed Langevin diffusion (ALD)

Run Langevin with a moving target — Annealed Langevin Diffusion (ALD):

\mathrm{d}X_t = \nabla\log\tilde\pi_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, \qquad \tilde\pi_t := \pi_{t/T},\qquad t\in[0,T].

Write its path measure, i.e., the law of the whole trajectory, as $\mathbb{P}^{\rightarrow}$.
Discretize (one step per intermediate distribution) $\Rightarrow$ Annealed Langevin Monte Carlo (ALMC).

Algorithm · right rail (normalizing constant)

Jarzynski equality (JE)

A non-equilibrium identity from statistical mechanics. Along ALD, define the work functional

W(X) = \frac1T\int_0^T \partial_\theta V_\theta\big|_{\theta=t/T}(X_t)\,\mathrm{d}t.

\mathbb{E}_{\mathbb{P}^{\rightarrow}}\,\mathrm{e}^{-W} = \mathrm{e}^{-\Delta F}, \qquad \Delta F = -\log\frac{Z_1}{Z_0}.

Unbiased estimator $\hat Z = Z_0\,\mathrm{e}^{-W(X)}$ of $Z = Z_0\,\mathrm{e}^{-\Delta F}$.

Corollary: Second law of thermodynamics: $\mathbb{E}_{\mathbb{P}^{\rightarrow}}W \ge \Delta F$.

Jarzynski (1997). Nonequilibrium equality for free energy differences. PRL.

Algorithm · right rail (normalizing constant)

Annealed Importance Sampling (AIS): discretized JE

Annealed Importance Sampling (Neal ’01): distributions $\pi_\ell = \frac{1}{Z_\ell}f_\ell$, $\ell\in\{0,\dots,M\}$, and kernels $F_\ell$ that leave $\pi_\ell$ invariant. Forward path measure:

\mathbb{P}^{\rightarrow}(x_{0:M}) = \pi_0(x_0)\prod_{\ell=1}^{M} F_\ell(x_{\ell-1}, x_\ell).

\mathbb{E}_{\mathbb{P}^{\rightarrow}}\,\mathrm{e}^{-W} = \mathrm{e}^{-\Delta F}, \quad W(x_{0:M}) = \log\prod_{\ell=0}^{M-1}\frac{f_\ell(x_\ell)}{f_{\ell+1}(x_\ell)}, \quad \Delta F = -\log\frac{Z_M}{Z_0}.

ALD : ALMC :: JE : AIS

continuous $\leftrightarrow$ discrete, mirrored on each task

Neal (2001). Annealed importance sampling. Statistics and Computing.

The question

The gap we close

These methods are very commonly used across many domains…
…but very few non-asymptotic complexity guarantees. Existing analyses are either:

asymptotic in the number of particles;
restricted to a simple (e.g. Gaussian) work distribution; or
rely on isoperimetry / log-concavity of the target.

Question: how many oracle calls to $(V,\nabla V)$, with weak assumption on $\pi$?

Gelman & Meng (1998), Simulating normalizing constants: from importance sampling to bridge sampling to path sampling; Brosse, Durmus & Moulines (2018), Normalizing constants of log-concave densities; Ge, Lee & Lu (2020), Estimating Normalizing Constants for Log-Concave Distributions: Algorithms and Lower Bounds.

The toolkit

Optimal transport

Curve “speed” = Wasserstein-2 metric derivative $|\dot\rho|_t := \lim_{\delta\to0}\frac{W_2(\rho_{t+\delta},\rho_t)}{|\delta|}$.
Continuity equation: a vector field $(v_t)$ generates $(\rho_t)$ iff $\partial_t\rho_t + \nabla\!\cdot(\rho_t v_t)=0$. Equivalently, if $\dot X_t = v_t(X_t)$ and $X_0 \sim \rho_0$, then $X_t \sim \rho_t$.

Key lemma

For an absolutely continuous curve $(\rho_t)$, any vector field $(v_t)$ that generates it satisfies $|\dot\rho|_t \le \|v_t\|_{L^2(\rho_t)}$ for every $t$. Moreover, there is a unique generating field $(v^*_t)$ that reaches equality for every $t$, given by $v^*_t = \lim_{\delta\to0}\frac{T_{\rho_t\to\rho_{t+\delta}} - \mathrm{id}}{\delta}$, where $T_{\mu\to\nu}$ is the optimal transport map from $\mu$ to $\nu$.

an optimal transport plan coupling a source and a target distribution

Optimal transport. (Figure: Wikipedia)

Ambrosio, Gigli & Savaré (2008), Gradient Flows: In Metric Spaces and in the Space of Probability Measures.

The core quantity

The action

\mathcal{A} := \int_0^1 |\dot\pi|_\theta^2\,\mathrm{d}\theta

the “kinetic energy” of the curve

$\mathcal{A} \ge W_2^2(\pi_0,\pi_1)$, with equality on the constant-speed geodesic.
Weaker than isoperimetry: if $\rho_t$ is $C_{\mathrm{PI}}(\rho_t)$-PI then $\mathcal{A}\le\int_0^1 2\,C_{\mathrm{PI}}(\rho_t)\,\|\partial_t\log\rho_t\|_{L^2(\rho_t)}^2\,\mathrm{d}t$.

Example (heat-flow curve $\rho_t=\rho_0*\mathcal{N}(0,2St\,I)$). For any Gaussian mixture $\rho_0=\sum_{i=1}^N w_i\,\mathcal{N}(\mu_i,\sigma^2 I)$:

\mathcal{A} \le \frac{Sd}{2}\,\log\!\Big(1+\frac{2S}{\sigma^2}\Big)

Independent of the number of modes $N$, weights $w_i$, and means $\mu_i$ — while LSI / PI constants blow up exponentially in $\max_{i,j}\|\mu_i-\mu_j\|$.

The action sees through multimodality that isoperimetry cannot.

The toolkit

Girsanov on path measures

Take two SDEs driven by Brownian motion, $t\in[0,T]$:

\mathrm{d}X_t = a_t(X_t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t,\;\; X_0\sim\mu; \qquad \mathrm{d}Y_t = b_t(Y_t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t,\;\; Y_0\sim\nu.

The KL between their path measures $\mathbb{P}^X, \mathbb{P}^Y$ is

\mathrm{KL}(\mathbb{P}^X\,\|\,\mathbb{P}^Y) = \mathrm{KL}(\mu\,\|\,\nu) + \frac{1}{2\sigma^2}\int_0^T \mathbb{E}_{\mathbb{P}^X}\big\|a_t(X_t) - b_t(X_t)\big\|^2\,\mathrm{d}t.

The big picture

The proof idea: three path measures

analysis framework: forward, backward, reference paths

Forward $\mathbb{P}^{\rightarrow}$ (the algorithm), backward $\mathbb{P}^{\leftarrow}$, and the compensated reference $\mathbb{P}$ threading the intermediates $\pi_{\theta_\ell}$.

Continuous analysis · left rail (sampling)

Error analysis of annealed Langevin diffusion

Consider two SDEs started at $X_0\sim\tilde\pi_0$, and let $\tilde\pi_t := \pi_{t/T}$:

\begin{aligned} \mathbb{P}^{\rightarrow}\ (\text{ALD}):\quad & \mathrm{d}X_t = \nabla\log\tilde\pi_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t,\\[2pt] \mathbb{P}\ (\text{compensated}):\quad & \mathrm{d}X_t = \big(\nabla\log\tilde\pi_t(X_t) + v_t(X_t)\big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t. \end{aligned}

Pick $v_t$ s.t. $X_t\sim\tilde\pi_t$ for all $t$; by Fokker–Planck, this means $v_t$ generates $\tilde\pi_t$:

\partial_t\tilde\pi_t + \nabla\!\cdot(\tilde\pi_t\, v_t) = 0.

\mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}^{\rightarrow}) = \frac14\int_0^T \|v_t\|_{L^2(\tilde\pi_t)}^2\,\mathrm{d}t \;\xrightarrow{\ \text{optimal } v^*\ }\; \frac{1}{4T}\int_0^1 |\dot\pi|_\theta^2\,\mathrm{d}\theta = \frac{\mathcal{A}}{4T}.

\mathrm{KL}(\pi\,\|\,\nu^{\mathrm{ALD}}) = \mathrm{KL}(\mathbb{P}_T\,\|\,\mathbb{P}^{\rightarrow}_T) \le \mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}^{\rightarrow}) = \frac{\mathcal{A}}{4T}\quad(\text{data-processing}).

Theorem

$T = \frac{\mathcal{A}}{4\varepsilon^2} \;\Rightarrow\; \mathrm{KL}(\pi\,\|\,\nu^{\mathrm{ALD}})\le\varepsilon^2$, where $\nu^{\mathrm{ALD}}$ is the law of ALD at time $T$.

Continuous analysis · right rail (normalizing constant)

Error analysis of Jarzynski equality

A new player: the backward SDE $\mathbb{P}^{\leftarrow}$ — ALD run backward in time with time-reversed Brownian motion $B^{\leftarrow}$:

\mathrm{d}X_t = -\nabla\log\tilde\pi_t(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B^{\leftarrow}_t, \quad t\in[0,T], \quad X_T\sim\tilde\pi_T.

Crooks fluctuation theorem:

\log\frac{\mathrm{d}\mathbb{P}^{\rightarrow}}{\mathrm{d}\mathbb{P}^{\leftarrow}}(X) = -\int_0^T (\partial_t\log\tilde\pi_t)(X_t)\,\mathrm{d}t = W(X) - \Delta F \;\Longrightarrow\; \mathbb{E}_{\mathbb{P}^{\rightarrow}}\,\mathrm{e}^{-W} = \mathrm{e}^{-\Delta F}.

Estimator $\hat Z = Z_0\,\mathrm{e}^{-W(X)}$, $X\sim\mathbb{P}^{\rightarrow}$.

Theorem

$T = \dfrac{32\,\mathcal{A}}{\varepsilon^2} \;\Longrightarrow\; \mathrm{Pr}\!\left(\left|\dfrac{\hat Z}{Z} - 1\right| \le \varepsilon\right) \ge \dfrac{3}{4}.$ (median trick boosts $\frac34$ to $1-\zeta$.)

Key step: the error is the fluctuation of the RN derivative, bounded through the reference $\mathbb{P}$:

\mathrm{Pr}\!\left(\Big|\tfrac{\hat Z}{Z}-1\Big|\ge\varepsilon\right) = \mathbb{P}^{\rightarrow}\!\left(\Big|\tfrac{\mathrm{d}\mathbb{P}^{\leftarrow}}{\mathrm{d}\mathbb{P}^{\rightarrow}}-1\Big|\ge\varepsilon\right) \le \frac{\sqrt{2}}{\varepsilon}\left(\sqrt{\mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}^{\rightarrow})} + \sqrt{\mathrm{KL}(\mathbb{P}\,\|\,\mathbb{P}^{\leftarrow})}\right).

Crooks (1998), Nonequilibrium Measurements of Free Energy Differences for Microscopically Reversible Markovian Systems; Crooks (1999), Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences; Vargas, Padhy, Blessing & Nüsken (2024), Transport meets Variational Inference: Controlled Monte Carlo Diffusions.

The reveal

One engine, two tasks

T \asymp \frac{\mathcal{A}}{\varepsilon^2} \quad \text{for both sampling and normalizing constant estimation}

Key quantity: the action $\mathcal{A}=\int_0^1 |\dot\pi|_\theta^2\,\mathrm{d}\theta$.

A quantitative sampling $\rightleftarrows$ counting bridge in the continuous, non-log-concave setting — a continuous analogue of Jerrum, Valiant & Vazirani.
Isoperimetry does not appear explicitly.

Jerrum, Valiant & Vazirani (1986), Random generation of combinatorial structures from a uniform distribution, Theoretical Computer Science.

Discrete time

From continuous diffusion to implementable algorithm: final complexities

Discretize ALD with the exponential integrator; one update step per intermediate distribution.
A second error term (discretization error) appears on top of the continuous error $\frac{\mathcal{A}}{T}$, yielding the non-asymptotic complexities:

Sampling (ALMC)

\widetilde{O}\!\left(\frac{d\,\beta^2\,\mathcal{A}^2}{\varepsilon^6}\right)

to reach $\mathrm{KL}(\pi\|\cdot)\le\varepsilon^2$.

Normalizing constant (AIS)

\widetilde{O}\!\left(\frac{d^{4/3}}{\varepsilon^2} \vee \frac{m\beta\mathcal{A}^\frac12}{\varepsilon^2} \vee \frac{d\beta^2\mathcal{A}^2}{\varepsilon^4}\right)

to get $\hat Z$ s.t. relative error $\le\varepsilon$ w.h.p.

Both rates are for the geometric interpolation $\pi_\theta\propto\exp\!\big(-V-\frac{\lambda(\theta)}{2}\|\cdot\|^2\big)$, with $\mathcal{A}$ its action.
$\beta$ is the smoothness of potential and $m=\sqrt{\mathbb{E}_\pi\|\cdot\|^2}$.
First isoperimetry-free guarantee for normalizing constant estimation.

Theoretical payoff

Mixture of Gaussians: polynomial vs. exponential

Target: a Gaussian mixture with equal-norm means $\|y_i\|=r$; $V=-\log\pi$ is $B$-smooth with $B=\beta(4r^2\beta+1)$:

\pi = \sum_{i=1}^N p_i\,\mathcal{N}\!\left(y_i,\ \beta^{-1}I\right), \qquad \textstyle\sum_i p_i = 1,\ \ p_i>0.

Geometric annealing curve and its action:

\pi_\theta \propto \exp\!\left(-V - \tfrac{\lambda(\theta)}{2}\|\cdot\|^2\right),\ \ \lambda(\theta)=dB(1-\theta)^\gamma; \qquad \mathcal{A} = O\!\left(d(r^2\beta+1)\Big(r^2 + \tfrac{d}{\beta}\Big)\right).

Ours (via the action)

$\mathcal{A}$ is polynomial $\Rightarrow$ $\mathrm{poly}\!\left(d,\beta,r,\tfrac{1}{\varepsilon}\right)$ oracle complexity.

LMC under LSI

LSI constant $C_{\mathrm{LSI}} = \Omega\!\left(\mathrm{e}^{\Theta(\beta r^2)}\right)$ $\Rightarrow$ exponential complexity.

Beyond geometric interpolation

The catch with geometric interpolation

The whole story hinges on $\mathcal{A}$ being small, but it need not be.
1-dimensional example: $\pi = \frac12\mathcal{N}(0,1)+\frac12\mathcal{N}(m,1)$, $\pi_\theta(x)\propto\pi(x)\,\mathrm{e}^{-\lambda(\theta)x^2/2}$, $\lambda(\theta)=m^2(1-\theta)^r$.

\mathcal{A}_r \gtrsim m^4\,\mathrm{e}^{m^2/40}.

exponential in the mode separation

Mass teleportation / Mode switching: weights of the modes change dramatically, so mass must teleport across a large distance.

mode switching: the far mode appears only as theta approaches 1

Figure: Chemseddine, Wald, Duong & Steidl (2025), Neural Sampling from Boltzmann Densities: Fisher–Rao Curves in the Wasserstein Geometry, ICLR.

Beyond geometric interpolation

A better curve: reverse the OU process

Borrow the diffusion-model curve: run the OU process from $Y_0\sim\pi$, $\mathrm{d}Y_t = -Y_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t,\quad Y_t\sim\bar\pi_t$, $\bar\pi_t\to\mathcal{N}(0,I)$ as $t\to\infty$.
Its time-reverse $(\bar\pi_{T-t})_{t\in[0,T]}$ is the annealing curve we use.

Proposition

Let $\pi\propto\mathrm{e}^{-V}$ with $V$ $\beta$-smooth, and finite second moment $m^2 := \mathbb{E}_\pi\|\cdot\|^2 < \infty$. Then the action of the reverse-OU curve satisfies $$\int_0^\infty |\dot{\bar\pi}|_t^2\,\mathrm{d}t \;\le\; d\beta + m^2.$$

By our JE theorem, this curve should estimate $Z$ far more efficiently.

This also partially explains why diffusion models work well: their trajectory is smooth in the Wasserstein geometry.

Beyond geometric interpolation

Reverse diffusion samplers (RDS) for normalizing constant estimation

RDS (RDMC / RSDMC / ZODMC / SNDMC) simulate the time-reversal of OU with a learning-free score estimator $s_t \approx \nabla\log\bar\pi_t$:

\begin{aligned} \widehat{\mathbb{Q}}\ (\text{run}):\quad & \mathrm{d}X_t = \big(X_t + 2\,s_{T-t_-}(X_{t_-})\big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, & X_0\sim\phi=\mathcal{N}(0,I),\\[2pt] \mathbb{Q}\ (\text{ideal}):\quad & \mathrm{d}X_t = \big(X_t + 2\,\nabla\log\bar\pi_{T-t}(X_t)\big)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}B_t, & X_0\sim\bar\pi_T. \end{aligned}

Estimator $\hat Z = \mathrm{e}^{-W(X)}$, $X\sim\widehat{\mathbb{Q}}$, with the work functional

W(X) = \log\phi(X_0) + V(X_{T-\delta}) - (T-\delta)d + \int_0^{T-\delta}\!\Big(\|s_{T-t_-}(X_{t_-})\|^2\,\mathrm{d}t + \sqrt{2}\,\big\langle s_{T-t_-}(X_{t_-}),\,\mathrm{d}B_t\big\rangle\Big).

Theorem

If $\mathrm{KL}(\mathbb{Q}\,\|\,\widehat{\mathbb{Q}})\lesssim\varepsilon^2$ and $\mathrm{TV}(\pi,\bar\pi_\delta)\lesssim\varepsilon$, then $\mathrm{Pr}\!\left(\left|\dfrac{\hat Z}{Z}-1\right|\le\varepsilon\right)\ge\dfrac{3}{4}$.

RDMC: Huang, Dong, Hao, Ma & Zhang (ICLR 2024), Reverse Diffusion Monte Carlo. RSDMC: Huang, Zou, Dong, Ma & Zhang (COLT 2024), Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo. ZODMC: He, Rojas & Tao (NeurIPS 2024), Zeroth-Order Sampling Methods for Non-Log-Concave Distributions: Alleviating Metastability by Denoising Diffusion. SNDMC: Vacher, Chehab & Korba (NeurIPS 2025), Sampling from multi-modal distributions with polynomial query complexity in fixed dimension via reverse diffusion.

Beyond geometric interpolation · the closing idea

The fundamental trade-off

Insight: RDS = optimally compensated ALD on the OU trajectory.

Tractable curve (geometric)

Closed-form drift $\nabla\log\pi_\theta$, but possibly exponential action $\mathcal{A}$.

OU curve

Polynomial action $\mathcal{A}$, but the drift $\nabla\log\bar\pi_t$ has no closed form $\Rightarrow$ must be estimated.

You pay either in the curve or in the drift.

Wrap up

Summary

One framework (Girsanov + optimal transport) $\to$ both sampling and $Z$-estimation.
Action $\mathcal{A}$ is the governing quantity; continuous time complexity $T \asymp \frac{\mathcal{A}}{\varepsilon^2}$ for both.
Isoperimetry-free; a continuous sampling $\rightleftarrows$ counting bridge.
Curve-vs-drift trade-off explains when annealing wins and motivates RDS.

Looking ahead

Future directions

More general diffusion processes beyond overdamped Langevin.
Task-specific design of the annealing curve / schedule.

Looking ahead

Closely related & follow-up work

Provable Convergence and Limitations of Geometric Tempering for Langevin Dynamics. Chehab, Korba, Stromme & Vacher (ICLR 2025) — geometric interpolation may slow down Langevin dynamics due to mode switching.
Diffusion Annealed Langevin Dynamics: a Theoretical Study. Cattiaux, Cordero-Encinar & Guillin (arXiv:2511.10406) — rigorous study of annealed Langevin diffusions and the action of interpolation curves for general targets.
Discrete Optimal Transport: Rapid Convergence of Simulated Annealing Algorithms. He, Jiang, Wang & Zhang (arXiv:2605.05877) — extends the action-based annealing analysis to discrete distributions and CTMCs.

Thank you!

Joint work with Molei Tao & Yongxin Chen.

Sampling (ICLR’25): arxiv.org/abs/2407.16936

Normalizing constant (ICLR’26): arxiv.org/abs/2502.04575

Created with Opus 4.8 using the Codeck skill (https://github.com/hiyeshu/codeck)