Mining information using diffusion models

Outline

  • Diffusion models
  • Overview of distillation techniques
  • Diff-mining paper
  • Current project/idea

Diffusion models

  • A forward process gradually noising an image \(x\): \[ x_t^\epsilon = \sqrt{\alpha_t}\, x + \sqrt{1-\alpha_t}\,\epsilon \]
  • A denoiser \(\epsilon_\theta(x_t^\epsilon, t)\)
  • Trained by minimizing \[ L_t(x, \epsilon) = ||\epsilon_\theta(x_t^\epsilon, t) - \epsilon||^2 \]
  • Optionally, the denoiser \(\epsilon_\theta(x_t^\epsilon, t, c)\) is conditioned on an input \(c\) (e.g. text); a minimal training sketch follows
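
A minimal PyTorch sketch of one training step under this objective. The `denoiser` callable, the `alphas_cumprod` tensor of cumulative \(\alpha_t\) values, and the optional conditioning `cond` are illustrative assumptions, not tied to a specific codebase:

```python
import torch

def diffusion_training_step(denoiser, optimizer, x, alphas_cumprod, cond=None):
    # Sample a timestep and noise, build the noised image x_t (forward process),
    # and regress the injected noise (the L_t objective above).
    t = torch.randint(0, len(alphas_cumprod), (x.shape[0],), device=x.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)          # cumulative alpha per sample
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps
    eps_pred = denoiser(x_t, t, cond) if cond is not None else denoiser(x_t, t)
    loss = (eps_pred - eps).pow(2).mean()            # ||eps_theta - eps||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```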

Diffusion models

Diffusion models are generative models that gradually denoise a noisy input

 

Diffusion models: score function

The score of a probability density \(p(x)\) is the gradient of the log-density \(\nabla_x \log p(x)\) (a vector field)

Following this vector field is equivalent to sampling new examples

Denoising process ~ following the score at different noise levels
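
For reference, under the forward process above, the trained noise predictor can be read as an estimate of this score at noise level \(t\) (a standard identity from the score-matching view):

\[ \nabla_{x_t} \log p_t(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \alpha_t}} \]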

 

Main application: text-to-image (memes and jokes)

“A Ghibli-style photo of the group”

Diffusion models carry information

“A man”

“A happy man”

“A sad man”

Diffusion models carry information

“A living room”

“A French living room”

“A Japanese living room”

There are many ways to distill or mine this information!

Score distillation sampling (2D-to-3D)

Score distillation sampling 1 is a technique for transferring knowledge from a source domain (2D) to a target domain (3D) with:

  • A trained denoiser on the source domain
  • A differentiable representation \(\psi\) on the target domain
  • A differentiable function \(x = g(\psi)\) mapping the target to the source domain

By “following” the score on the source domain, we can obtain a realistic/faithful representation on the target domain (sketched below)
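
A minimal PyTorch sketch of one SDS update, assuming a generic noise predictor `denoiser(x_t, t, c)`, a differentiable renderer `g`, and an `alphas_cumprod` schedule (all hypothetical interfaces, not the original implementation):

```python
import torch

def sds_update(denoiser, g, psi, optimizer, alphas_cumprod, cond=None):
    # Render the target-domain representation psi into the source (2D) domain,
    # noise the render, and nudge psi along the model's score estimate.
    x = g(psi)                                        # target -> source domain
    t = torch.randint(0, len(alphas_cumprod), (1,), device=x.device)
    a = alphas_cumprod[t]
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps         # forward (noising) process
    with torch.no_grad():
        eps_pred = denoiser(x_t, t, cond)             # score estimate from the 2D model
    # SDS gradient: (eps_pred - eps) is back-propagated through the renderer only,
    # skipping the denoiser's Jacobian.
    loss = ((eps_pred - eps) * x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```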

Score distillation sampling (2D-to-3D)

 

Score distillation sampling (2D-to-3D)

 

Score distillation (shape matching)

An efficient way to compute matchings between shapes is via deep functional maps 1, which approximate a matching via a functional map matrix \(C\).

Deep functional maps pipeline

Assuming we have a set of ground-truth functional maps, we can learn a functional map diffusion model and leverage SDS for zero-shot shape matching!
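
As a reminder of how \(C\) encodes a matching, here is a rough PyTorch sketch of recovering a point-to-point map from a functional map (conventions vary between papers; the orientation of \(C\) assumed below is illustrative):

```python
import torch

def fmap_to_pointwise(C, evecs_x, evecs_y):
    # Hypothetical sketch: convert a functional map C into a point-to-point map.
    # C: (k, k) is assumed to align the truncated Laplace-Beltrami eigenbasis of
    # shape X (evecs_x: n_x x k) with that of shape Y (evecs_y: n_y x k).
    emb_x = evecs_x @ C.T                 # vertices of X expressed in Y's spectral frame
    dists = torch.cdist(emb_x, evecs_y)   # (n_x, n_y) distances in spectral space
    return dists.argmin(dim=1)            # for each vertex of X, its match on Y
```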

Score distillation (shape matching)

 

Dataset generation/augmentation

DiffusionDB 1 contains 14 million curated images generated using Stable Diffusion

 

Application: Fine-tuning an Image-to-3D diffusion model

Hi-3DGen 1 fine-tunes TRELLIS 2, an image-to-3D model, using synthetic, detailed image-3D pairs

Application: Fine-tuning an Image-to-3D diffusion model

Hi-3DGen is able to capture finer details from images compared to the initial TRELLIS model

TRELLIS vs Hi-3DGen

Application: Discovering Human-Object Interaction

CHORUS 1 learns Human-Object interaction from synthetic images.

 

Image correspondence

Diffusion models implicitly learn correspondences. The features of a trained denoiser can be used for zero-shot image correspondences (no fine-tuning) 1.
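
A rough sketch of the idea (DIFT-style), assuming a generic `denoiser` with a sub-module `feature_layer` whose activations are tapped via a forward hook; names and shapes are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def diffusion_feature_match(denoiser, feature_layer, img_a, img_b, t, alphas_cumprod):
    # Noise both images to the same timestep, run the denoiser once each,
    # grab an intermediate feature map, and match locations by cosine similarity.
    feats = []
    handle = feature_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    a = alphas_cumprod[t]
    for img in (img_a, img_b):                             # each (1, 3, H, W)
        x_t = a.sqrt() * img + (1 - a).sqrt() * torch.randn_like(img)
        with torch.no_grad():
            denoiser(x_t, t)
    handle.remove()
    f_a, f_b = (f.flatten(2).squeeze(0) for f in feats)    # (c, h*w) feature maps
    sim = F.normalize(f_a, dim=0).T @ F.normalize(f_b, dim=0)
    return sim.argmax(dim=1)                               # best match in B per location in A
```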

 

Diffusion models as data mining tools

Can you guess where the photo comes from?

 

Diffusion models as data mining tools

Answer: USA

 

Mining typical visual elements

Typical visual elements of a location/style/date are:

  • discriminative: they distinguish one location from another
  • frequent: they appear repeatedly across multiple images of the same location.
  • localized: they correspond to image patches rather than whole images

Diff-mining 1 proposes a diffusion-model-based solution.

 

Diffusion models

  • A denoiser \(\epsilon_\theta(x_t^\epsilon, t)\)
  • Trained by minimizing \[ L_t(x, \epsilon) = ||\epsilon_\theta(x_t^\epsilon, t) - \epsilon||^2 \]
  • Optionally, the denoiser \(\epsilon_\theta(x_t^\epsilon, t, c)\) is conditioned on an input \(c\) (e.g. text)

Diffusion based typicality

“We design our measure of typicality based on the following intuition: a visual element is typical of a conditioning class label (e.g., country name or date) if the diffusion model is better at denoising the input image in the presence of the label than in its absence.”

Typicality is defined as \[ T(x \mid c) = \mathbb{E}_{\epsilon, t} \left[ L_t(x, \epsilon, \varnothing) - L_t(x, \epsilon, c) \right] \]
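
A minimal sketch of estimating this quantity by Monte Carlo, assuming a conditional `denoiser(x_t, t, cond)` and embeddings `c_emb` / `null_emb` for the label and the empty prompt (illustrative interface, not the paper's code):

```python
import torch

def typicality(denoiser, x, c_emb, null_emb, alphas_cumprod, n_samples=32):
    # Average, over random (eps, t), the drop in denoising loss obtained when
    # the label embedding is provided instead of the unconditional embedding.
    scores = []
    for _ in range(n_samples):
        t = torch.randint(0, len(alphas_cumprod), (1,), device=x.device)
        a = alphas_cumprod[t]
        eps = torch.randn_like(x)
        x_t = a.sqrt() * x + (1 - a).sqrt() * eps
        with torch.no_grad():
            loss_null = (denoiser(x_t, t, null_emb) - eps).pow(2).mean()
            loss_cond = (denoiser(x_t, t, c_emb) - eps).pow(2).mean()
        scores.append(loss_null - loss_cond)      # positive when the label helps denoising
    return torch.stack(scores).mean()
```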

Typicality

 

Mining visual elements of a dataset

  • Fine-tune a diffusion model (Stable Diffusion) on the dataset with class conditioning (“An image of {label}”)
  • Split images into patches and keep the 1000 most typical patches in the dataset
  • K-means clustering (using DIFT features) to obtain 32 clusters of patches; see the sketch below
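
A rough sketch of the last two steps, assuming per-patch typicality scores and DIFT features are already computed (variable names and shapes are illustrative, not the paper's code):

```python
import torch
from sklearn.cluster import KMeans

def mine_typical_clusters(patches, typicality_scores, dift_features, keep=1000, k=32):
    # Keep the `keep` most typical patches, then cluster them on their DIFT features.
    # patches: list of N patch crops, typicality_scores: (N,), dift_features: (N, d).
    top = torch.topk(typicality_scores, keep).indices
    feats = dift_features[top].cpu().numpy()
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(feats)
    clusters = {i: [] for i in range(k)}
    for idx, lab in zip(top.tolist(), labels):
        clusters[int(lab)].append(patches[idx])
    return clusters
```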

Results

 

Results

 

Can we lift this idea to 3D?

“Co-Locating Style-Defining Elements on 3D Shapes” 1

 

Can we lift this idea to 3D?

Main difficulties/questions

  • What would be the patch representation?
  • Diff-mining is designed for datasets of > 10K images
  • No similar dataset for shapes
  • Can we generate the dataset?

Stable Diffusion prompt: “A Victorian-era chair crafted from natural wood, featuring carved spindles and a high, elegant back.”