Diffusion models are generative models that gradually denoise a noisy input
The score of a probability density \(p(x)\) is the gradient of the log-density, \(\nabla_x \log p(x)\), which defines a vector field
Following this vector field, with a small amount of injected noise (Langevin dynamics), amounts to sampling new examples
Denoising process ~ following the score at different noise levels
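To make "following the score" concrete, here is a minimal sketch of unadjusted Langevin dynamics on a 1D Gaussian whose score is known in closed form. The target \(\mathcal{N}(2, 0.5^2)\), the step size, and the function names are illustrative assumptions. A diffusion model would replace `score` with a learned network evaluated at decreasing noise levels.

```python
import numpy as np

def score(x, mu=2.0, sigma=0.5):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def langevin_sample(n_samples=5000, n_steps=500, step=0.01, seed=0):
    """Run independent Langevin chains: x <- x + step * score(x) + sqrt(2 * step) * noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_samples)  # initialise every chain from N(0, 1)
    for _ in range(n_steps):
        noise = rng.normal(size=n_samples)
        x = x + step * score(x) + np.sqrt(2 * step) * noise
    return x

samples = langevin_sample()
print(samples.mean(), samples.std())  # should land close to 2.0 and 0.5
```

Because the step size is finite, the chains only approximate the target: the empirical standard deviation comes out slightly above 0.5.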
“A ghibli style photo of the group”
There are many ways to distill or mine this information!
Score distillation sampling (SDS) 1 is a technique for transferring knowledge from a source domain (2D) to a target domain (3D) with
By “following” the score in the source domain, we can obtain a realistic and faithful representation in the target domain
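The SDS update can be sketched on a toy problem. The usual form of the gradient is \(\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[w(t)\,(\hat\epsilon(x_t, t) - \epsilon)\,\partial x / \partial \theta\right]\), where \(x = g(\theta)\) is the rendering of the target-domain parameters. Everything in this sketch is an assumption made for illustration: the "image" distribution is a 2D Gaussian with an exact noise predictor, the "renderer" is a fixed orthographic projection `P` from 3D to 2D, \(w(t) = 1\), and the noise schedule is a simple variance-preserving one. A real pipeline would use a pretrained text-to-image denoiser and a differentiable renderer instead.

```python
import numpy as np

rng = np.random.default_rng(0)
MU = np.array([1.0, -2.0])        # mean of the toy 2D source distribution N(MU, S^2 I)
S = 0.3                           # its standard deviation
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])   # toy "renderer": orthographic projection 3D -> 2D

def eps_hat(x_t, alpha, sigma):
    """Exact noise prediction for Gaussian data (equals -sigma * score of the noised density)."""
    return sigma * (x_t - alpha * MU) / (alpha**2 * S**2 + sigma**2)

theta = np.zeros(3)               # target-domain (3D) parameters being optimised
lr = 0.05
for _ in range(2000):
    t = rng.uniform(0.02, 0.98)                 # random noise level
    alpha, sigma = np.sqrt(1 - t), np.sqrt(t)   # variance-preserving schedule
    eps = rng.normal(size=2)
    x = P @ theta                               # "render" the 3D parameters to 2D
    x_t = alpha * x + sigma * eps               # noise the rendering
    grad_x = eps_hat(x_t, alpha, sigma) - eps   # SDS residual with w(t) = 1
    theta -= lr * (P.T @ grad_x)                # chain rule through the renderer only

print(P @ theta)  # the rendering drifts towards MU
```

The depth coordinate `theta[2]` never moves because the projection discards it. This mirrors a general property of SDS: the source-domain prior only constrains what the renderer exposes.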
An efficient way to compute a matching between shapes is via deep functional maps 1, which approximate the matching with a functional map matrix \(C\).
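As a rough illustration of what the matrix \(C\) encodes, the sketch below builds a functional map from a known point-to-point correspondence and then recovers that correspondence by nearest-neighbour search in the spectral domain. The setup is an assumption made for the example: random orthonormal bases stand in for the Laplace-Beltrami eigenfunctions normally used, and the second shape is a re-indexed copy of the first with a rotated basis. The convention \(C = \Phi_Y^{+} \Pi \Phi_X\) is one common choice; other references use the transpose.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 10   # number of vertices, size of the truncated basis

# Stand-in basis on shape X with orthonormal columns.
phi_x, _ = np.linalg.qr(rng.normal(size=(n, k)))

# Shape Y: same vertices in a different order, with a rotated basis.
perm = rng.permutation(n)                 # vertex i of Y corresponds to vertex perm[i] of X
rot, _ = np.linalg.qr(rng.normal(size=(k, k)))
phi_y = phi_x[perm] @ rot

# Functional map from the known correspondence: C = pinv(Phi_Y) @ Pi @ Phi_X,
# where (Pi @ Phi_X)[i] = Phi_X[perm[i]].
C = np.linalg.pinv(phi_y) @ phi_x[perm]

# Convert C back to a point-to-point map: match rows of Phi_Y @ C to rows of Phi_X.
aligned = phi_y @ C
dists = np.linalg.norm(aligned[:, None, :] - phi_x[None, :, :], axis=-1)
recovered = dists.argmin(axis=1)

print(np.array_equal(recovered, perm))
```

The map is a small \(k \times k\) matrix regardless of the number of vertices, which is what makes it a convenient object to learn or, as proposed here, to model with a diffusion prior.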
Assuming we have a set of ground-truth functional maps, we can learn a diffusion model over functional maps and leverage SDS for zero-shot shape matching!
DiffusionDB 1 contains 14 million curated images generated using Stable Diffusion
Hi-3DGen 1 fine-tunes TRELLIS 2, an image-to-3D model, using synthetic, detailed image-3D pairs
Hi-3DGen is able to capture finer details from images compared to the initial TRELLIS model
TRELLIS vs Hi-3DGen
CHORUS 1 learns human-object interactions from synthetic images.
Diffusion models implicitly learn correspondences. The features of a trained denoiser can be used for zero-shot image correspondence (no fine-tuning) 1.
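Once per-location features are available, the matching step itself is simple. The sketch below assigns each location in image A to its most cosine-similar location in image B. The feature arrays are random stand-ins (an assumption for the example); in practice they would be intermediate activations of the denoiser, extracted at a chosen layer and noise level. The helper name `match_features` is hypothetical.

```python
import numpy as np

def match_features(feats_a, feats_b):
    """For each row (location) of feats_a, return the index of the most cosine-similar row of feats_b."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return (a @ b.T).argmax(axis=1)

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(64, 32))   # stand-in: 8x8 grid of locations, 32 channels each

# Image B: same content at shuffled locations, with slightly perturbed features.
shuffle = rng.permutation(64)
feats_b = feats_a[shuffle] + 0.05 * rng.normal(size=(64, 32))

matches = match_features(feats_a, feats_b)
print(np.array_equal(matches, np.argsort(shuffle)))  # argsort gives the inverse permutation
```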
Can you guess where the photo comes from?
Answer: USA
Typical visual elements of a location/style/date are:
Diff-mining 1 proposes a solution based on diffusion models.
“What makes Paris look like Paris?” 1
“We design our measure of typicality based on the following intuition: a visual element is typical of a conditioning class label (e.g., country name or date) if the diffusion model is better at denoising the input image in the presence of the label than in its absence.”
Typicality is defined as \[ T(x \mid c) = \mathbb{E}_{\epsilon, t} \left[ L_t(x, \epsilon, \varnothing) - L_t(x, \epsilon, c) \right] \]
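The typicality score can be estimated by Monte Carlo: draw a noise level and a noise sample, compute the denoising loss with and without the condition on the same noised input, and average the difference. The sketch below takes \(L_t\) to be the squared error of the noise prediction and uses closed-form Gaussian noise predictors as stand-ins for a trained conditional diffusion model. The means, standard deviations, noise schedule, and function names are all assumptions made for the example.

```python
import numpy as np

def gaussian_eps_model(x_t, alpha, sigma, mean, std):
    """Exact noise predictor when clean data follow N(mean, std^2 I)."""
    return sigma * (x_t - alpha * mean) / (alpha**2 * std**2 + sigma**2)

def typicality(x, cond, uncond, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_{eps,t}[L_t(x, eps, null) - L_t(x, eps, c)]."""
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_samples):
        t = rng.uniform(0.02, 0.98)
        alpha, sigma = np.sqrt(1 - t), np.sqrt(t)   # variance-preserving schedule
        eps = rng.normal(size=x.shape)
        x_t = alpha * x + sigma * eps               # same noised input for both losses
        loss_uncond = np.sum((eps - gaussian_eps_model(x_t, alpha, sigma, *uncond)) ** 2)
        loss_cond = np.sum((eps - gaussian_eps_model(x_t, alpha, sigma, *cond)) ** 2)
        diffs.append(loss_uncond - loss_cond)
    return float(np.mean(diffs))

cond = (np.array([3.0, 0.0]), 0.3)     # class-conditional model: tight cluster around (3, 0)
uncond = (np.array([0.0, 0.0]), 2.0)   # unconditional model: broad distribution

t_typical = typicality(np.array([3.0, 0.0]), cond, uncond)
t_atypical = typicality(np.array([-3.0, 0.0]), cond, uncond)
print(t_typical, t_atypical)
```

A point at the centre of the class gets a positive score because the label helps denoising, while a point far from the class gets a negative score because the label misleads the denoiser.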
“Co-Locating Style-Defining Elements on 3D Shapes” 1
Main difficulties/questions
Stable Diffusion prompt: “A Victorian-era chair crafted from natural wood, featuring carved spindles and a high, elegant back.”