The score function of a probability distribution with density \(p(x)\) is the gradient of the log-density: \[ \nabla_x \log p(x) \]
Working with this quantity has several advantages: in particular, for an energy-based model \(p_\theta(x) = e^{-f_\theta(x)} / Z_\theta\), the score \(\nabla_x \log p_\theta(x) = -\nabla_x f_\theta(x)\) does not involve the intractable normalization constant \(Z_\theta\).
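As a sanity check, here is a minimal PyTorch sketch (the Gaussian example and the variable names are ours) that recovers the score of a Gaussian by autograd on an unnormalized log-density, illustrating that the normalization constant plays no role:

```python
import torch

# Score of an isotropic Gaussian N(mu, sigma^2 I):
# log p(x) = -||x - mu||^2 / (2 sigma^2) + const, so the score is
# grad_x log p(x) = -(x - mu) / sigma^2, independent of the constant.
mu = torch.tensor([1.0, -2.0])
sigma = 0.5

x = torch.randn(2, requires_grad=True)
log_p = -((x - mu) ** 2).sum() / (2 * sigma**2)  # unnormalized log-density
(score,) = torch.autograd.grad(log_p, x)

closed_form = -(x.detach() - mu) / sigma**2
assert torch.allclose(score, closed_form)
```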
Initial proposal: score matching (Hyvärinen and Dayan 2005)
Given \(\{x_1, x_2, \dots, x_T\} \sim p_\text{data}(x)\), the objective is to minimize \[ \mathbb{E}_{p(x)}\left[\frac{1}{2}\left\lVert \nabla_x \log p_{\theta}(x)\right\rVert^2 + \operatorname{Tr}\left(\nabla_x^2 \log p_\theta(x)\right)\right] \]
The term \(\operatorname{Tr}\left(\nabla_x^2 \log p_\theta(x)\right)\) is expensive to compute
The estimated score is inaccurate in low-density regions
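For concreteness, a PyTorch sketch of this objective follows; `score_fn` is a hypothetical batched score network of signature `(batch, dim) -> (batch, dim)`, not something from the source. The explicit loop over input dimensions shows why the trace term is the bottleneck:

```python
import torch

def score_matching_loss(score_fn, x):
    """Hyvarinen score matching: E[0.5 * ||s(x)||^2 + tr(Jacobian of s at x)].

    Computing the trace needs one extra backward pass per input
    dimension, which is exactly why this objective scales poorly.
    """
    x = x.detach().requires_grad_(True)
    s = score_fn(x)                                  # shape (batch, dim)
    norm_term = 0.5 * (s ** 2).sum(dim=1)
    trace = torch.zeros(x.shape[0])
    for i in range(x.shape[1]):                      # one grad per dimension
        grad_i = torch.autograd.grad(s[:, i].sum(), x, create_graph=True)[0]
        trace = trace + grad_i[:, i]
    return (norm_term + trace).mean()
```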
Learning the score of a noisy distribution (Vincent 2011)
Caveat: we obtain the score of the noise-perturbed distribution, not of the noise-free one
Loss: \(\mathbb{E}\left[\frac{1}{2}\left\lVert s_\theta(\tilde{x}) + \frac{\tilde{x}-x}{\sigma^2}\right\rVert^2\right]\), where \(\tilde{x} = x + \sigma\varepsilon\) with \(\varepsilon \sim \mathcal{N}(0, I)\), and \(-\frac{\tilde{x}-x}{\sigma^2} = \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x)\) is a tractable target
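A minimal PyTorch sketch of this loss, with a hypothetical score network `score_net` evaluated on the noisy input (names are ours):

```python
import torch

def dsm_loss(score_net, x, sigma):
    # Perturb the data: x_tilde = x + sigma * eps, eps ~ N(0, I).
    x_tilde = x + sigma * torch.randn_like(x)
    # Tractable target: the conditional score
    # grad_{x_tilde} log q_sigma(x_tilde | x) = -(x_tilde - x) / sigma^2.
    target = -(x_tilde - x) / sigma**2
    return 0.5 * ((score_net(x_tilde) - target) ** 2).sum(dim=1).mean()
```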
Denoising diffusion models (Sohl-Dickstein et al. 2015), annealed Langevin dynamics (Y. Song and Ermon 2019)
Noise-conditional score model, with objective: \[ \frac{1}{L} \sum_{i=1}^L \lambda(\sigma_i)\, \mathbb{E}\left[\left\lVert s_\theta(\tilde{x}_i, \sigma_i) + \frac{\tilde{x}_i - x_i}{\sigma_i^2}\right\rVert^2\right] \]
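A sketch of the noise-conditional objective under the same hypothetical names; the network now also receives the noise level, and \(\lambda(\sigma_i) = \sigma_i^2\) is the weighting used by Y. Song and Ermon (2019) to keep all noise levels on a comparable scale:

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    # Average the denoising loss over the L noise levels,
    # weighting each term by lambda(sigma) = sigma^2.
    total = 0.0
    for sigma in sigmas:
        x_tilde = x + sigma * torch.randn_like(x)
        target = -(x_tilde - x) / sigma**2
        residual = score_net(x_tilde, sigma) - target
        total = total + sigma**2 * 0.5 * (residual ** 2).sum(dim=1).mean()
    return total / len(sigmas)
```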
DDPM beats GANs (Ho, Jain, and Abbeel 2020)!
Proposed solution: score modeling using Stochastic Differential Equations (Y. Song et al. 2021)!
Ordinary differential equations (ODEs) are equations on functions, of the form \(\frac{dx}{dt} = f(x, t)\) (first order).
Stochastic differential equations (SDEs) are equations on time-dependent stochastic processes, denoted \(X_t\).
They are of the form
\(dX_t = \underbrace{f(X_t, t)\,dt}_{\text{“drift” term}} + \underbrace{g(t)\, dW_t}_{\text{“diffusion” term}}\),
where \(W_t\) is a “standard Wiener process”, or Brownian motion.
They are used in many domains (finance, physics, biology, and even shape analysis)
Given an initial condition, an SDE now has multiple possible realizations (unlike an ODE, whose solution is unique)!
The initial condition is now always \(x_0\), and time only goes forward
Solving an SDE means looking for the density of the trajectories at each time, \(p_t(x)\)
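To make this concrete, here is a minimal NumPy sketch of the Euler–Maruyama scheme; the Ornstein–Uhlenbeck drift \(f(x, t) = -x\), the constant diffusion \(g(t) = 0.5\), and all sizes are illustrative assumptions, not from the source. Every path starts from the same \(x_0\), time runs forward, and the histogram of the final values approximates \(p_t(x)\):

```python
import numpy as np

# Euler-Maruyama discretization of dX_t = f(X_t, t) dt + g(t) dW_t:
# X_{t+dt} = X_t + f(X_t, t) dt + g(t) * sqrt(dt) * z,  z ~ N(0, 1).
def euler_maruyama(f, g, x0, t_grid, n_paths, rng):
    x = np.full(n_paths, x0, dtype=float)
    out = [x.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        dt = t1 - t0
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Wiener increments
        x = x + f(x, t0) * dt + g(t0) * dw
        out.append(x.copy())
    return np.stack(out)  # shape (time, n_paths): one column per realization

# Illustration only: an Ornstein-Uhlenbeck process, f(x, t) = -x, g(t) = 0.5.
rng = np.random.default_rng(0)
t_grid = np.linspace(0.0, 1.0, 501)
paths = euler_maruyama(lambda x, t: -x, lambda t: 0.5, 1.0, t_grid, 100, rng)
# A histogram of paths[-1] approximates the density p_t(x) at t = 1.
```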
A stochastic process \(W_t\) is a Wiener process, or Brownian motion, if:
\(W_0 = 0\);
it has independent increments: \(W_t - W_s\) is independent of \((W_u)_{u \le s}\) for \(s < t\);
its increments are Gaussian: \(W_t - W_s \sim \mathcal{N}(0, t - s)\) for \(s < t\);
its trajectories are almost surely continuous.
If we sample 100 trajectories, we obtain:
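A minimal NumPy sketch (sizes and seed are arbitrary) that draws such a sample directly from the defining properties above:

```python
import numpy as np

# 100 Brownian trajectories on [0, 1], sampled exactly from the definition:
# W_0 = 0 and independent Gaussian increments W_{t+dt} - W_t ~ N(0, dt).
rng = np.random.default_rng(0)
n_paths, n_steps = 100, 1000
dt = 1.0 / n_steps
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
paths = np.concatenate(
    [np.zeros((n_paths, 1)), np.cumsum(increments, axis=1)], axis=1
)  # shape (100, n_steps + 1), one row per trajectory
```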