Flow Matching and Diffusion Models

These are my notes transcribing MIT 6.S184: Flow Matching and Diffusion Models – Lecture – Generative AI with SDEs. I try to include TikZ diagrams too.

tl;dr

  1. Ordinary Differential Equations have to be studied separately.
  2. Stochastic Differential Equations is a separate subject.
  3. I believe that I need to solve the exercises by coding to understand everything reasonably well.

Lecture 1

Conditional generation means sampling the conditional data distribution.

{ P_{data} (\bullet \mid y )}

Generative models generate samples from the data distribution.

Initial Distribution : {P_{init}  }

Default is {P_{init} \sim \mathcal{N}(0,{I}_d) }

Flow Model

Trajectory { X : \overbrace{[0,1]}^\text{Time component} \rightarrow \mathbb{R}^D} , { t \mapsto X_t }. So for each time t, we get a vector out.

Vector Field. { u : \mathbb{R}^D \times [0,1] \rightarrow \mathbb{R}^D} (There is a space component and a time component)

Flow

{ \psi : \mathbb{R}^D \times [0,1] \rightarrow \mathbb{R}^D}

{ (X_0, t) \mapsto {\psi}_t(X_0)} means that for every initial condition X_0, the map { t \mapsto {\psi}_t(X_0) } should be a solution to my ODE.

{ {\psi}_0(X_0) = X_0 } which is the initial condition

The time derivative { \frac{d}{dt} {\psi}_t(X_0) } is { u_t ( {\psi}_t(X_0) ) }

Neural Network. The vector field is parameterized by a network { {u_t}^{\theta} : \mathbb{R}^D \times [0,1] \rightarrow \mathbb{R}^D}

Random Initialization { X_0 \sim P_{init}}

Ordinary Differential Equation { \frac{d}{dt} X_t = {u_t}^{\theta} (X_t) } (Time Derivative)

Goal: simulate the ODE to get { X_1 \sim P_{data}}

{ \fbox{ TRAJECTORY } \rightarrow \fbox{ ODE } \overbrace{\leftarrow}^\text{defined by} \fbox{ VECTOR FIELD } }  
This means that Flow is a collection of Trajectories that conform to the ODE

Diffusion Model

Stochastic Process

{ X_t, 0 \leq t \leq 1} . { X_t } is a random variable

{ X : [0,1] \rightarrow \mathbb{R}^D , t \rightarrow X_t}

Vector Field. { u : \mathbb{R}^D \times [0,1] \rightarrow \mathbb{R}^D} + Diffusion Coefficient { \sigma_t }

Stochastic Differential Equation

{X_0 = x_0 \left(\text {Initial} \\ \text {Condition} \right)}

The following means that the change of X_t over a small time step is given by the direction of the vector field {u_t (X_t) {dt}} plus a random noise term.

{{d}{X_t} = \underbrace{u_t (X_t) {dt}}_\text{ODE} + \underbrace{\sigma_t {d}{W_t}}_\text{Stochastic Noise} }

{{W_t} {\text{ is known as Brownian Motion}}}

Brownian Motion

Stochastic Process {W = (W_t)_{t \geq 0}} and in this case the time can be infinite. We don’t have to stop at {t = 1}

  1. {{W_0} = 0 }
  2. It has Gaussian increments. What does it mean ?

{{W_t} - {W_s} \thicksim \mathcal{N}(0,\,(t - s){I}_d) }

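Here s and t are two arbitrary time points with s before t, { 0 \leq s < t }, and the variance of the Gaussian distribution grows linearly with the elapsed time t - s.

3. Independent increments. This means that for any times { 0 \leq t_0 < t_1 < \ldots < t_n } the increments { W_{t_1} - W_{t_0}, \ldots, W_{t_n} - W_{t_{n-1}} } are independent random variables.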
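The properties listed above can be checked numerically. This is only a minimal sketch in NumPy: the function name `simulate_brownian_motion`, the step count and the seed are my own choices for illustration, not something from the lecture.

```python
import numpy as np

def simulate_brownian_motion(n_paths, n_steps, t_end=1.0, dim=1, seed=0):
    """Return paths of shape (n_paths, n_steps + 1, dim) with W_0 = 0."""
    rng = np.random.default_rng(seed)
    h = t_end / n_steps
    # Independent Gaussian increments, each distributed as N(0, h * I_d).
    increments = np.sqrt(h) * rng.standard_normal((n_paths, n_steps, dim))
    paths = np.zeros((n_paths, n_steps + 1, dim))
    paths[:, 1:, :] = np.cumsum(increments, axis=1)
    return paths

paths = simulate_brownian_motion(n_paths=20_000, n_steps=100)
# Increment between s = 0.25 and t = 0.75: its variance should be close to t - s = 0.5.
increment = paths[:, 75, 0] - paths[:, 25, 0]
print(increment.var())
```

The cumulative sum is what turns independent increments into a path, which is why the variance of W_t - W_s depends only on t - s.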

So at this stage, in order to understand the following, I need a book or another course in ODE’s

\frac{d}{dt} X_t = u_t (X_t) \Leftrightarrow X_{t+h} = X_t + h u_t (X_t) + h (R_t(h)) \left( \displaystyle\lim_{h \to 0} \underbrace{R_t(h)}_\text{Error Term} = 0 \right)

This means that following the ODE with vector field { u_t (X_t)} is equivalent to taking the current point { X_t}, adding h times the direction of the vector field { u_t (X_t)}, and adding a remainder term h R_t(h) that becomes negligible as h goes to 0.
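Dropping the remainder term gives the Euler method for simulating an ODE. Below is a small sketch of that idea, assuming a hypothetical vector field u_t(x) = -x that I picked only because its exact flow x_0 e^{-t} is known.

```python
import numpy as np

def euler_ode(u, x0, n_steps=100, t0=0.0, t1=1.0):
    """Simulate dX_t/dt = u(x, t) from t0 to t1 with the Euler method."""
    h = (t1 - t0) / n_steps
    x = np.asarray(x0, dtype=float)
    t = t0
    for _ in range(n_steps):
        # X_{t+h} = X_t + h u_t(X_t), with the remainder h R_t(h) dropped.
        x = x + h * u(x, t)
        t += h
    return x

# Hypothetical linear vector field u_t(x) = -x, exact flow psi_t(x0) = x0 * exp(-t).
x1 = euler_ode(lambda x, t: -x, x0=[1.0, 2.0], n_steps=1000)
print(x1, np.array([1.0, 2.0]) * np.exp(-1.0))
```

Using more steps makes h smaller, so the dropped remainder matters less and the result moves closer to the exact flow.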

How are derivatives defined ?

This is the basic definition that I have to understand by learning Calculus.

Derivative of a trajectory \frac{d}{dt} X_t = u_t (X_t)

\frac{d}{dt} X_t = u_t (X_t) \Leftrightarrow  \left( \displaystyle\lim_{h \to 0} \frac{X_{t+h} - X_t}{h} = u_t (X_t)  \right)

\Leftrightarrow  \left(  \frac{(X_{t+h} - X_t)}{h} = u_t (X_t) + R_t(h) \right)

Multiplying both sides by h and adding X_t gives the update form of the ODE shown above (this section).

Ordinary Differential Equation to Stochastic Differential Equation

{d}X_t = u_t (X_t){dt} + \sigma_t {d} W_t

\Leftrightarrow    X_{t+h}  = X_t + h u_t (X_t) + \sigma_t \underbrace{(W_{t + h} - W_t)}_\text{Brownian Motion} + h R_t(h)  \left( \displaystyle\lim_{h \to 0} \mathbb{E}[\sqrt {\lVert {R_t(h)} \rVert_2^2}] =0 \right)

This is the recap. We found a term that doesn’t depend on the derivatives that can be specified with an error term. \sigma_t  is the diffusion coefficient used to scale the Brownian Motion. If \sigma_t  is zero, it is equivalent to the original ODE.
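The update rule above, without the remainder, is usually called the Euler-Maruyama method. Here is a rough sketch of it; the drift and diffusion functions passed in are hypothetical toy choices of mine, used only to check the two limiting cases.

```python
import numpy as np

def euler_maruyama(u, sigma, x0, n_steps=200, seed=0):
    """Simulate dX_t = u(x, t) dt + sigma(t) dW_t on [0, 1]."""
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * h
        # W_{t+h} - W_t ~ N(0, h I_d), sampled as sqrt(h) times standard normal noise.
        dw = np.sqrt(h) * rng.standard_normal(x.shape)
        x = x + h * u(x, t) + sigma(t) * dw
    return x

# With sigma = 0 the scheme reduces to the Euler method for the ODE dX/dt = -X.
x_ode = euler_maruyama(lambda x, t: -x, lambda t: 0.0, x0=np.ones(3))
# With zero drift and sigma = 1, X_1 is Brownian motion at time 1, i.e. N(0, 1) per entry.
x_bm = euler_maruyama(lambda x, t: 0.0 * x, lambda t: 1.0, x0=np.zeros(50_000))
print(x_ode, x_bm.mean(), x_bm.var())
```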

Why do we need Brownian Motion ?

I didn’t really follow this. But the answer given was this: Brownian Motion plays the same universal role among stochastic processes that the Gaussian Distribution plays among probability distributions.

Lecture 2

Reminder of what was covered in Lecture 1

\fbox{\begin{minipage}[t][.5cm]{0,4\textwidth} {\textbf{Flow Model}} \end{minipage}} \qquad {P_{init}} {\textbf{(Gaussian)}} \qquad {d}X_t = \overbrace{{u_t}^{\theta} (X_t) {dt}}^\text{{\textbf{ODE}}}

\fbox{\begin{minipage}[t][.8cm]{0,7\textwidth} \textbf{Diffusion Model} \end{minipage}} \qquad {P_{init}} {\textbf{(Gaussian)}} \qquad {d}X_t = \overbrace{ {u_t}^{\theta} (X_t){dt} + \underbrace{{\sigma}_t {d} W_t}_\text{{\textbf{Diffusion Co-efficient}}} }^\text{{\textbf{SDE}}}

Deriving a Training Target

Typically, we train the model by minimizing a mean-squared error

L_{\theta} = {\lVert {{u_t}^{\theta}(x) - {u_t}^{\mathrm{target}}(x)} \rVert_2^2}

In regression or classification, the training target is the label. Here we have no label. We have to derive a training target.

The professor states that you don’t have to understand all the derivations. I anticipate some mathematics I haven’t studied earlier.

We have to make sure we understand the formulas for these.

\begin{minipage}[t][.6cm]{0.9\textwidth} {\textbf{Conditional \\ Probability Path}} \end{minipage}  \qquad \begin{minipage}[t][.6cm]{0.9\textwidth} {\textbf{Conditional \\ Vector field}} \end{minipage} \qquad \begin{minipage}[t][.6cm]{0.9\textwidth} {\textbf{Conditional \\ Score Function}} \end{minipage}

\begin{minipage}[t][.6cm]{0.9\textwidth} {\textbf{Marginal \\ Probability Path}} \end{minipage}  \qquad \begin{minipage}[t][.6cm]{0.9\textwidth} {\textbf{Marginal \\ Vector field}} \end{minipage} \qquad \begin{minipage}[t][.6cm]{0.9\textwidth} {\textbf{Marginal \\ Score Function}} \end{minipage}

The key terms to remember are the following.

\fbox{\begin{minipage}[t]{1,7\textwidth}  \textbf{Conditional = Per Single data point \\ Marginal = Across distribution of data points } \end{minipage}}

Conditional and Marginal Probability Path

\textbf{ Dirac distribution : } z \in \underbrace{\mathbb{R}^d}_\text{Data we want to generate}, {\delta^z}

X \sim {\delta^z} \Rightarrow X = z, \text{ where } z \text{ is a data point}

I don’t fully understand the Dirac distribution yet. But sampling from it always returns the same \textbf{z}

\textbf{ Conditional Probability Path : } \underbrace{P_t( \bullet \mid z)}_\text{Conditioned on the data point z}

\textbf{ 1) } {P_t( \bullet \mid z)} \text{is a distribution over} {\mathbb{R}^d}

\textbf{ 2) } {P_0( \bullet \mid z) =  P_{init} } \text{and} {P_1( \bullet \mid z) = {\delta^z}}

Example : Gaussian Probability Path

{P_t( \bullet \mid z) =  \mathcal{N}({\alpha_t} z,{\beta_t}^2{I}_d) }
{\alpha_t} \text{ and } {\beta_t} \text{ are termed noise schedulers }

{\beta_{\underbrace{1}_\text{Time 1}} =0} \\ {\beta_{\underbrace{0}_\text{Time 0}} =1} \\ {\alpha_{\underbrace{1}_\text{Time 1}} =1} \\ {\alpha_{\underbrace{0}_\text{Time 0}} =0}

The diagram is small. But the idea is that when Time is 0, mean is 0 and variance is 1 which is {P_{init} \sim \mathcal{N}(0,{I}_d) } and when Time is 1, mean is \textbf{z} and variance is 0.

The distribution with variance 0 is the Dirac distribution, {\delta^z}
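To see the Gaussian probability path in action, here is a minimal sketch that samples from it. I am assuming the simple schedule alpha_t = t and beta_t = 1 - t, which satisfies the boundary conditions above; the helper name `sample_conditional_path` and the data point are hypothetical.

```python
import numpy as np

def alpha(t):
    return t          # assumed schedule: alpha_0 = 0, alpha_1 = 1

def beta(t):
    return 1.0 - t    # assumed schedule: beta_0 = 1, beta_1 = 0

def sample_conditional_path(z, t, n_samples, rng):
    """Draw x ~ P_t(. | z) = N(alpha_t z, beta_t^2 I_d)."""
    z = np.asarray(z, dtype=float)
    eps = rng.standard_normal((n_samples, z.shape[0]))
    return alpha(t) * z + beta(t) * eps

rng = np.random.default_rng(0)
z = np.array([2.0, -1.0])
x_start = sample_conditional_path(z, 0.0, 50_000, rng)  # pure noise, N(0, I_d)
x_mid = sample_conditional_path(z, 0.5, 50_000, rng)    # mean z / 2, standard deviation 1 / 2
x_end = sample_conditional_path(z, 1.0, 5, rng)         # every sample collapses onto z
print(x_mid.mean(axis=0), x_mid.std(axis=0), x_end[0])
```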

Example : Marginal Probability Path {P_t}

Well. This is not clear at this stage. But we take one data point (sampling) from {z \sim {P_{data} }}, then sample from the conditional path, and the marginal probability path means that we forget z. {z \sim P_{data}}, {\textbf{X} \sim {P_t( \bullet \mid z) }} \Rightarrow {\textbf{X} \sim P_t} . As far as I understand, the data distribution and the conditional distribution together lead to the marginal path.

\textbf{Formula for the density is  1) } {\underbrace{{P_t( X )}}_\text{Likelihood at a point X}} = {\int {P_t( X \mid z)}} {P_{data} ( z ) } dz

The density formula is not clear at this stage.

\textbf{ 2) } {P_0 = P_{init}}, {P_1 = P_{data}}
All of this seems to describe that we move from noise to our distribution of the data we are dealing with.
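A short sketch of what "forgetting z" means in practice: sample a data point, sample from its conditional path, and keep only x. The two-point dataset and the schedule alpha_t = t, beta_t = 1 - t are assumptions of mine for illustration.

```python
import numpy as np

def sample_marginal_path(data, t, n_samples, rng):
    """Draw x ~ P_t by sampling z ~ P_data, then x ~ P_t(. | z), and discarding z."""
    idx = rng.integers(0, len(data), size=n_samples)
    z = data[idx]                              # z ~ P_data (empirical distribution)
    eps = rng.standard_normal(z.shape)
    return t * z + (1.0 - t) * eps             # assumed schedule alpha_t = t, beta_t = 1 - t

rng = np.random.default_rng(0)
data = np.array([[-2.0], [2.0]])               # hypothetical toy dataset with two points
x0 = sample_marginal_path(data, 0.0, 50_000, rng)  # P_0 = P_init = N(0, 1)
x1 = sample_marginal_path(data, 1.0, 50_000, rng)  # P_1 = P_data
print(x0.std(), np.unique(x1))
```

At t = 0 the samples look like plain noise, and at t = 1 only the two data points remain, which matches the boundary conditions of the marginal path.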

Conditional and Marginal Vector Field

Conditional Vector field {{u_t}^{target} (X \mid Z)} \\  {0 \leq t \leq 1 } \\ {X,Z \in \mathbb{R}^d}

We want the conditional vector field to be such that, starting from the initial distribution

{\mathrm{P_0 = P_{Init}}} \textbf{ if we follow the vector field }  {\frac{d}{dt} {X_t}} ={u_t}^{target}(X_t \mid z )

then the distribution of X_t at every time point is given by this probability path \mathrm{ \Rightarrow {X_t \sim P_t(\bullet \mid z)}}, \mathrm{0 \leq t \leq 1 }

We simulate the ODE like this.

Example : Conditional Gaussian Vector field

\dot{\alpha_t} is the time derivative of \alpha_t; the dot notation comes from Physics

{u_t}^{target} (X \mid Z) =  \left( \dot{\alpha_t} -  {\frac{\dot{\beta_t}}  {\beta_t}}  \alpha_t  \right) Z + {\frac{\dot{\beta_t}}  {\beta_t}} X
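One way to convince myself of this formula is to simulate the ODE and compare with the path alpha_t z + beta_t epsilon. This is a sketch under the assumed schedule alpha_t = t, beta_t = 1 - t, where the field simplifies to (z - x) / (1 - t); I stop at t = 0.9 because beta_t is 0 at t = 1.

```python
import numpy as np

def u_target_conditional(x, z, t):
    """Conditional Gaussian vector field for alpha_t = t, beta_t = 1 - t (assumed schedule)."""
    alpha, beta = t, 1.0 - t
    alpha_dot, beta_dot = 1.0, -1.0
    return (alpha_dot - beta_dot / beta * alpha) * z + beta_dot / beta * x

rng = np.random.default_rng(0)
z = np.array([2.0, -1.0])
eps = rng.standard_normal(2)          # X_0 ~ P_init

# Euler simulation of dX_t/dt = u_t(X_t | z), stopping at t = 0.9 to stay away from beta_1 = 0.
n_steps, t_end = 90, 0.9
h = t_end / n_steps
x = eps.copy()
for k in range(n_steps):
    x = x + h * u_target_conditional(x, z, k * h)

expected = t_end * z + (1.0 - t_end) * eps   # alpha_t z + beta_t eps at t = 0.9
print(x, expected)
```

For this schedule the velocity along the trajectory is the constant z - epsilon, so the simulated point and the closed-form point should agree up to rounding.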

Marginalization trick

The marginal vector field is {u_t}^{target} (X) = \int {u_t}^{target} (X \mid Z) {\frac{{P_t (X \mid Z)}  {P_{data} ( Z )}} {P_t (X) } } dZ

Not very clear at this point, but the following factor is an application of Bayes’ rule: it is the posterior distribution over data points. What could have been the data point Z that gave rise to X ?

{\frac{{P_t (X \mid Z)}  {P_{data} ( Z )}} {P_t (X) } }

{\mathrm{P_0 = P_{Init}}} \large \textbf{ if we follow the vector field } {\frac{d}{dt} {X_t}} ={u_t}^{target}(X_t)

then the distribution of X_t at every time point is given by this marginal path \mathrm{ \Rightarrow {X_t \sim P_t}}, \mathrm{0 \leq t \leq 1 }

Proof of Marginalization Trick

We start with the left side of the \large \textbf{Continuity Equation}. This equation should be
in the notes.

\frac{d}{dt} \underbrace{P_t(x)}_\text{Time derivative} = \underbrace{{\frac{d}{dt}} \int P_t(x \mid z) P_{data}(z) dz}_\text{Density formula as shown above }

We can swap integrals and derivatives under certain conditions. Which conditions ?

\frac{d}{dt} P_t(x) = \int \frac{d}{dt} P_t(x \mid z) P_{data}(z) dz
This can be represented as the Continuity equation as applied to the Conditional Probability path.
\int -div \left( P_t( \bullet \mid z) {U_t}^{target}( \bullet \mid z)\right) (x) P_{data}(z) dz

What is the divergence operator ?
It seems to have applications in Physics. So it is not entirely clear.
But it is \mathrm{div}(V_t)(x) = \Sigma_i \frac{\partial }{\partial x_i} {V_t}^i (x)

Since the divergence is a sum of derivatives in x, we can move it outside the integral over z.

-\mathrm{div} \left( \int P_t(x \mid z)\,{U_t}^{target}(x \mid z) P_{data}(z) dz \right)
Multiplying and dividing by the same quantity, P_t(x), we get

-\mathrm{div} \left( P_t(x) \int {U_t}^{target}(x \mid z) \frac{P_t(x \mid z) P_{data}(z)}{P_t(x)} dz \right)

I didn’t follow this at first, but the integral is exactly the marginal vector field, so the marginal path satisfies the continuity equation.

-\mathrm{div} \left( P_t {U_t}^{target} \right) (x)

Flow Matching

This is about learning the Marginal Vector field.
So we start with a neural net {U_t}^{\theta} (x) \approx  {U_t}^{target} (x)

\mathrm{L}_{FM}(\theta) = \mathrm{E} \lVert{U_t}^{\theta} (x) - {U_t}^{target} (x)\rVert_2^2 . t \sim uniform, \underbrace{z \sim P_{Data}}_\text{Sample from the data}, \underbrace{x \sim P_t(\bullet \mid z)}_\text{Sample from the Conditional Probability Path(add noise)}

Since the marginal vector field in the loss above is intractable, we consider the following.

\mathrm{L}_{CFM}(\theta) = \mathrm{E}_{t,x,z} \lVert{U_t}^{\theta} (x) - \underbrace{{U_t}^{target} (x \mid z)}_\text{Conditional Vector field}\rVert_2^2

This loss is tractable, because the conditional vector field has a closed form. Look at the notes above.

Theorem

\mathrm{L}_{FM}(\theta) = \mathrm{L}_{CFM}(\theta) + \underbrace{ \mathrm{C}}_\text{Constant}
The following image roughly shows the separation by a constant.

There are two implications

1. \nabla_{\theta} \mathrm{L}_{FM} = \nabla_{\theta} \mathrm{L}_{CFM}

2. {\theta}^{\ast} \qquad \text{Minimizer of FM} = {\theta}^{\ast} \qquad \text{Minimizer of CFM}

So minimizing the CFM loss makes the neural network approximate the Marginal Vector Field.

I have to conclude this part before Lecture 3 as many loose threads have to be tied up. That is pending.

Lecture 3

Reminder of what was covered in Lecture 2

Conditional Prob. Path, Vector field and Score.

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Conditional Probability Path}\end{minipage}} \qquad P_t(\bullet \mid z) \qquad \text{Interpolates } P_{init} \text{ and a data point } \qquad \mathcal{N}({\alpha}_t z, {{\beta}_t}^2 {I}_d)

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Conditional Vector field}\end{minipage}} \qquad {U_t}^{target}(x \mid z) \qquad \text{ODE follows conditional path } \qquad \left({\dot{\alpha}}_t - \frac{{\dot{\beta}}_t}{\beta_t}{\alpha}_t\right) z + \left(\frac{{\dot{\beta}}_t}{\beta_t}\right) x

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Conditional Score function}\end{minipage}} \qquad \nabla \log P_t(x \mid z) \qquad \text{Gradient of log likelihood } \qquad -\left(\frac{x - {\alpha}_t z}{{\beta}_t^2}\right)

Marginal Prob. Path, Vector field and Score.

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Marginal probability path }\end{minipage}} \qquad P_t \qquad \text{Interpolates } P_{init} \text{ and } P_{data} \qquad \int P_t(x \mid z) P_{data} (z ) dz

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Marginal vector field }\end{minipage}} \qquad {U_t}^{target}(x) \qquad \text{ODE follows marginal path } \qquad \int {U_t}^{target}(x \mid z) \left( \frac{P_t(x \mid z ) P_{data}(z)}{P_t(x)} \right) dz

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Marginal Score function}\end{minipage}} \qquad \nabla \log P_t(x) \qquad \text{Can be used to convert ODE target to SDE target } \qquad \int \nabla \log P_t(x \mid z) \left(\frac{P_t(x \mid z ) P_{data}(z)}{P_t (x)}\right) dz

Flow Matching

\mathrm{L}_{FM} \text{ for Gaussian Conditional Path}

Recall {P_t( \bullet \mid z) = \mathcal{N}({\alpha_t} z,{\beta_t}^2{I}_d) }

And {u_t}^{target} (X \mid Z) = \left( \dot{\alpha_t} - {\frac{\dot{\beta_t}} {\beta_t}} \alpha_t \right) Z + {\frac{\dot{\beta_t}} {\beta_t}} X

Now we can sample noise from a standard Gaussian and combine it with the data point.

Noise follows a standard Gaussian, like this: \underbrace{\epsilon}_\text{Noise} \sim \mathcal{N}(0,{I}_d) \Rightarrow {\alpha_t} z + {\beta_t} \epsilon \stackrel{\text{def}}{=} x

I have to determine exactly how the formula shown above is derived.

L_{CFM}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif},\ z \sim P_{Data},\ x \sim \mathcal{N}({\alpha_t} z,{\beta_t}^2{I}_d)} \left[ \lVert{U_t}^{\theta} (x) - \left( \dot{\alpha_t} - {\frac{\dot{\beta_t}} {\beta_t}} \alpha_t \right) z - {\frac{\dot{\beta_t}} {\beta_t}} x \rVert_2^2 \right]

These formulas are too long for the renderer. The negative sign in front of the last term puzzled me at first. It appears because the whole target {U_t}^{target} (x \mid z) is subtracted from the network output.

Since x = {\alpha}_t z + {\beta}_t \epsilon

L_{CFM}(\theta) = \mathbb{E}_{t \sim \mathrm{Unif},\ z \sim P_{Data},\ \epsilon \sim \mathcal{N}(0,{I}_d)} \left[ \lVert{U_t}^{\theta} \left({\alpha}_t z + {\beta}_t \epsilon \right) - \left( \dot{\alpha_t} z + \dot{\beta}_t \epsilon \right) \rVert_2^2 \right]

The following looks like the simplest formula one could explain during an interview.

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Flow Matching Training for CondOT Path}\end{minipage}}

\textbf{ Require : } \text{ A dataset of samples } z \sim P_{Data} \text{ : neural network }{U_t}^{\theta}

\text{ for each minibatch of data do }

\text{ Sample a data example z from the dataset }

\text{ Sample a random time } t \sim \mathrm{Unif}[0,1]

\text{ Sample noise } \epsilon \sim \mathcal{N}(0, {I}_d)

\text{ Set x } = tz + (1 - t) \epsilon

\text{ Compute loss}

\mathcal{L}(\theta) = \lVert{U_t}^{\theta} (x) - \left( z -  \epsilon \right)\rVert_2^2

\text{ Update the model parameters } \theta \text{ via gradient descent on } \mathcal{L}(\theta)

\text{ end }
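The loss computation in the procedure above can be sketched in NumPy as follows. This is not the course's implementation: `cfm_loss`, the `t_max` cut-off and the "oracle" field are hypothetical names I introduce for illustration, and the gradient step is left out because it needs an autodiff framework.

```python
import numpy as np

def cfm_loss(u_theta, z_batch, rng, t_max=1.0):
    """Monte Carlo estimate of the CondOT flow matching loss on one minibatch."""
    n = z_batch.shape[0]
    t = rng.uniform(0.0, t_max, size=(n, 1))     # t ~ Unif[0, t_max)
    eps = rng.standard_normal(z_batch.shape)     # eps ~ N(0, I_d)
    x = t * z_batch + (1.0 - t) * eps            # x = t z + (1 - t) eps
    target = z_batch - eps                       # conditional vector field along the path
    residual = u_theta(x, t) - target
    return np.mean(np.sum(residual ** 2, axis=1))

rng = np.random.default_rng(0)
z0 = np.array([1.0, -3.0])
batch = np.tile(z0, (256, 1))                    # toy dataset containing a single point z0

# For a single data point the conditional field (z0 - x) / (1 - t) is the exact answer,
# so its loss should vanish; t_max < 1 avoids dividing by zero.
oracle = lambda x, t: (z0 - x) / (1.0 - t)
loss_oracle = cfm_loss(oracle, batch, rng, t_max=0.99)
loss_zero = cfm_loss(lambda x, t: np.zeros_like(x), batch, rng, t_max=0.99)
print(loss_oracle, loss_zero)
```

In a real training loop `u_theta` would be a neural network and the returned loss would be differentiated with respect to its parameters.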

Score Matching

Score Network

{S_t}^{\theta} Goal is to

approximate {S_t}^{\theta} \approx \underbrace{{\nabla \log P_t }}_\text{ Marginal Score }

We have to show the same thing we dealt with above. Show that the marginal loss is the same as the conditional loss up to a constant.

\mathcal{L}_{SM} (\theta) = \mathbb{E} \left[ \lVert{S_t}^{\theta} (x) - {\nabla \log P_t (x) }\rVert_2^2 \right]

Denoising Score Matching Loss

\mathcal{L}_{CSM} (\theta) = \mathbb{E} \left[ \lVert{S_t}^{\theta} (x) - {\nabla \log P_t (x \mid z) }\rVert_2^2 \right]

The proof is similar to the one for the Flow Matching loss and the Conditional Flow Matching loss shown above.

Denoising Score Matching For Gaussian Probability Path

\nabla \log P_t (x \mid z) = - \frac{x - {\alpha}_t z }{ {{\beta}_t}^2}

Next we look at the Gaussian noise. If the Gaussian noise is scaled up by {\beta}_t \epsilon
it is going to have the distribution shown below.

\epsilon \sim \mathcal{N}(0,{I}_d) \Rightarrow x = {\alpha}_t z + \underbrace{{\beta}_t \epsilon}_\text{Scale it by}

We get the distribution {\alpha}_t z + \underbrace{{\beta}_t \epsilon}_\text{Scale it by} \sim \mathcal{N}({\alpha}_t z, {{\beta}_t}^2 {I}_d)

It is again similar to the Flow Matching formulas.

\mathcal{L}_{DSM} (\theta) = \mathbb{E}_{t \sim \mathrm{Unif},\ z \sim P_{Data},\ x \sim P_t(\bullet \mid z)} \left[ \lVert{S_t}^{\theta} (x) + \frac{x - {\alpha}_t z }{ {{\beta}_t}^2}\rVert_2^2 \right]

\mathcal{L}_{DSM} (\theta) = \mathbb{E}_{t \sim \mathrm{Unif},\ z \sim P_{Data},\ \epsilon \sim \mathcal{N}(0,{I}_d)} \left[ \lVert{S_t}^{\theta} ({\alpha}_t z + {\beta}_t \epsilon) + \frac{\epsilon}{ {\beta}_t }\rVert_2^2 \right]

The instructor at this stage mentioned that the above formula predicts the noise that was injected and so it
is termed ‘Denoising’. I need to understand this better.
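To make the "denoising" reading concrete, here is a small sketch of the loss for the Gaussian path. It assumes the schedule alpha_t = t, beta_t = 1 - t and a cut-off `t_max` below 1 (the loss divides by beta_t); the function name `dsm_loss` and the toy data point are mine.

```python
import numpy as np

def dsm_loss(s_theta, z_batch, rng, t_max=0.95):
    """Monte Carlo estimate of the denoising score matching loss for a Gaussian path."""
    n = z_batch.shape[0]
    t = rng.uniform(0.0, t_max, size=(n, 1))
    alpha, beta = t, 1.0 - t                  # assumed schedule
    eps = rng.standard_normal(z_batch.shape)  # the injected noise
    x = alpha * z_batch + beta * eps
    # The conditional score at x is -(x - alpha z) / beta^2 = -eps / beta,
    # so matching it amounts to predicting the (rescaled) noise that was added.
    residual = s_theta(x, t) + eps / beta
    return np.mean(np.sum(residual ** 2, axis=1))

rng = np.random.default_rng(0)
z0 = np.array([1.0, -3.0])
batch = np.tile(z0, (256, 1))                 # toy dataset containing a single point z0

true_score = lambda x, t: -(x - t * z0) / (1.0 - t) ** 2
loss_true = dsm_loss(true_score, batch, rng)
loss_zero = dsm_loss(lambda x, t: np.zeros_like(x), batch, rng)
print(loss_true, loss_zero)
```

The exact conditional score drives the loss to zero, which is the sense in which the network is recovering the noise.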

Conditional and marginal Score function

Conditional Score

\underbrace{ {\nabla  {\log  P_t( X \mid Z )}}}_\text{Gradient of log likelihood}

Derivation

{\nabla  {\log  P_t( X  )}} = \underbrace{ \frac{\nabla P_t( X )}{P_t( X )} } _\text{Chain rule applied to the log}

Substituting the density formula shown previously, we get this after moving the gradient inside the integral.

= {\frac {\nabla {\int { {P_t( X \mid Z )} {P_{data}(  Z )}} dZ}} {{P_t (X)} }}  = {\frac {{\int {\nabla {P_t( X \mid Z )} {P_{data}(  Z )}} dZ}} {{P_t (X)} }}

Using the result {\frac {d}{dx} {\log x}} = {\frac {1} {x} } in reverse, so that \nabla P_t( X \mid Z ) = P_t( X \mid Z ) \nabla \log P_t( X \mid Z )

= {\int {\nabla  {\log  P_t( X \mid Z )}} {\frac {{P_t( X \mid Z )} {P_{data}(  Z )}} {{P_t (X)} }}} dZ

What is the score of the Conditional Gaussian Vector Field ?

{u_t}^{target} (X \mid Z) = \left( \dot{\alpha_t} - {\frac{\dot{\beta_t}} {\beta_t}} \alpha_t \right) Z + {\frac{\dot{\beta_t}} {\beta_t}} X

\nabla {\log  {P_t( X \mid Z )}}  = - {\frac{X - \alpha_t Z}  {\beta_t^2} }
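The score of a Gaussian can be checked against a finite-difference gradient of its log density. This is only a sanity check with arbitrary numbers I chose for x, z, alpha_t and beta_t.

```python
import numpy as np

def log_density(x, z, alpha, beta):
    """Log density of N(alpha z, beta^2 I_d) at x."""
    d = x.shape[0]
    return -0.5 * np.sum((x - alpha * z) ** 2) / beta**2 - 0.5 * d * np.log(2 * np.pi * beta**2)

def conditional_score(x, z, alpha, beta):
    """Closed-form gradient of the log density with respect to x."""
    return -(x - alpha * z) / beta**2

def numerical_gradient(f, x, h=1e-5):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.shape[0]):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x = np.array([0.3, -1.2])
z = np.array([2.0, -1.0])
alpha_t, beta_t = 0.4, 0.6   # arbitrary values picked for the check
analytic = conditional_score(x, z, alpha_t, beta_t)
numeric = numerical_gradient(lambda v: log_density(v, z, alpha_t, beta_t), x)
print(analytic, numeric)
```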

Theorem : SDE extension trick

\text{ Let  } {u_t}^{target} (X) \text{  be as before. } Then for any diffusion coefficient {\sigma_t} \geq 0

{X_0 \sim {P_{init}}} , d{X_t} = \left[  {{u_t}^{target} (X_t)} + {\frac {{\sigma_t}^2}{2}} \nabla {\log  {P_t( X_t )}} \right] dt + {\sigma_t} d{W_t} \Rightarrow {X_t \sim P_t}, {0 \leq t \leq 1}

{\sigma_t} controls the amount of noise injected.
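Here is a rough numerical illustration of the theorem in one dimension. Everything specific in it is an assumption of mine: the data distribution is taken to be N(m, s^2) so that the marginal path is the Gaussian N(t m, t^2 s^2 + (1 - t)^2) under the schedule alpha_t = t, beta_t = 1 - t, and the closed-form marginal vector field for that case is my own working rather than something from the lecture.

```python
import numpy as np

m, s = 1.5, 0.5            # hypothetical 1-D data distribution P_data = N(m, s^2)
sigma = 0.5                # constant diffusion coefficient sigma_t

def mean_var(t):
    """Mean and variance of the marginal P_t = N(t m, t^2 s^2 + (1 - t)^2)."""
    return t * m, t**2 * s**2 + (1.0 - t) ** 2

def marginal_vector_field(x, t):
    """Velocity that transports N(mu_t, v_t) along the path (Gaussian special case)."""
    mu, v = mean_var(t)
    v_dot = 2.0 * t * s**2 - 2.0 * (1.0 - t)
    return m + 0.5 * v_dot / v * (x - mu)

def marginal_score(x, t):
    mu, v = mean_var(t)
    return -(x - mu) / v

rng = np.random.default_rng(0)
n_steps = 500
h = 1.0 / n_steps
x = rng.standard_normal(20_000)       # X_0 ~ P_init = N(0, 1)
for k in range(n_steps):
    t = k * h
    # Drift of the SDE extension: vector field plus (sigma^2 / 2) times the score.
    drift = marginal_vector_field(x, t) + 0.5 * sigma**2 * marginal_score(x, t)
    x = x + h * drift + sigma * np.sqrt(h) * rng.standard_normal(x.shape)

print(x.mean(), x.var())              # should be close to m and s^2
```

Even though noise is injected at every step, the score correction in the drift keeps the samples on the same marginal path, so the final samples still look like the data distribution.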

Lecture 4

\textbf{Recall : } \text{ So far we have focussed on } \textbf{ unconditional } \text{ generation }

\textbf{Problem : } \text{ Sample from } \mathrm{P_{Data}}

\textbf{Train : } \text{ (e.g) Use the conditional Flow Matching Objective } \mathrm{L_{CFM}(\theta)} = \mathrm{E}_{\square} \lVert{U_t}^{\theta} (x) - {U_t}^{target} (x \mid z )\rVert_2^2

\square = t \sim \mathrm{Unif},\ z \sim P_{Data},\ x \sim P_t(\bullet \mid z)

\textbf{Simulate  } \text{ the corresponding } \textbf{ ODE } \text{ or } \textbf{ SDE}

\mathrm{d{X_t}} = {U_t}^{\theta}(\mathrm{X_t}) \mathrm{dt}, X_0 \sim P_{init}

A Guided CFM objective

\textbf{ Observation :} \text{ For fixed y, guided generation reduces to unguided generation from } P_{Data}(\bullet \mid y) \text{, so the unguided objective can be reused }

\mathrm{L}_{\text{Guided CFM}}({\theta} ; y ) = \mathrm{E}_{\square} \lVert{U_t}^{\theta} (x \mid y) - {U_t}^{target} (x \mid z )\rVert_2^2

\square = t \sim \mathrm{Unif},\ z \sim P_{Data}( \bullet \mid y ),\ x \sim {P_t}(\bullet \mid z)

\textbf{ Observation :} \text{ By varying y, the above yields a guided objective } \textbf{ for general y }

\mathrm{L}_{\text{Guided CFM}}({\theta}  ) = \mathrm{E}_{\square} \lVert{U_t}^{\theta} (x \mid y) - {U_t}^{target} (x \mid z )\rVert_2^2

\square = t \sim \mathrm{Unif},\ \underbrace{(z,y)}_\text{(e.g) Image/label} \sim P_{Data}( z, y ),\ x \sim {P_t}(\bullet \mid z)

Classifier-free Guidance

\textbf{ Recall :} \text{ A Gaussian Conditional Probability Path }

{P_t}(x \mid z) = \mathcal{N}({\alpha}_t z ,{{{\beta}_t}^2}{I}_d) where {\alpha}_t , {\beta}_t \text{ are continuously differentiable, monotonic functions satisfying } {\alpha}_1 = {\beta}_0 = 1 \text{ and } {\alpha}_0 = {\beta}_1 = 0

{u_t}^{target} ( x \mid y ) = a_t x + b_t {\nabla \log P_t }(x \mid y ), \text{ where } (a_t, b_t) = \left ( \frac{{\dot{\alpha_t}}} {\alpha_t}, \frac{{\dot{\alpha_t}} {{\beta_t}^2} - {\dot{\beta_t}}{\beta_t}{\alpha_t}}{{\alpha_t}} \right)

Bayes’ Rule

\tilde{{U_t}} (x \mid y ) = {U_t}^{target} (x) + w b_t \nabla \log P_t (y \mid x )

= {U_t}^{target} (x) + w b_t ( \nabla \log P_t (x \mid y ) - \nabla \log P_t(x) )

= {U_t}^{target} (x) + w a_t x - w a_t x + w b_t ( \nabla \log P_t (x \mid y ) - \nabla \log P_t(x) )

= {U_t}^{target} (x) - w \underbrace{ ( a_t x + b_t {\nabla \log P_t (x) }) }_{{U_t}^{target} (x)} + w \underbrace{( a_t x + b_t \nabla \log P_t (x \mid y ) )}_{{U_t}^{target} (x \mid y)}

= ( 1 - w ){U_t}^{target} (x) + w {U_t}^{target} (x \mid y)

This procedure is for Classifier-free Guidance.

Classifier-free guidance sampling

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Classifier-free guidance sampling Procedure }\end{minipage}}

\textbf{ Require : } \text{ A trained guided Vector field } {U_t}^{\theta} (x \mid y)

\textbf{ Select a prompt } y \in \mathcal{Y} \text{ or take } y = \Phi \text{ for unguided sampling }

\textbf{ Select a guidance scale } w > 1

\textbf{ Sample } X_0 \sim P_{init}

\textbf{ Simulate } dX_t = \left[ ( 1- w ) {U_t}^{\theta} (X_t \mid \Phi ) + w {U_t}^{\theta} (X_t \mid y ) \right] dt \text{ from } t = 0 \text{ to } t = 1
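The sampling procedure above can be sketched with an Euler loop. Since I have no trained network here, `toy_field` is a hypothetical stand-in: a CondOT-style field that transports everything to a fixed target, with one target for the prompt y and one for the empty prompt.

```python
import numpy as np

def toy_field(x, t, target):
    """Hypothetical stand-in for a trained u_t^theta: moves x towards `target` by t = 1."""
    return (target - x) / (1.0 - t)

def cfg_sample(x0, y_target, null_target, w, n_steps=100):
    """Euler simulation of dX_t = [(1 - w) u(X_t | null) + w u(X_t | y)] dt on [0, 1]."""
    h = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * h                      # last evaluation is at t = 1 - h, so 1 - t stays positive
        u_null = toy_field(x, t, null_target)
        u_cond = toy_field(x, t, y_target)
        x = x + h * ((1.0 - w) * u_null + w * u_cond)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(2)            # X_0 ~ P_init
y_target = np.array([1.0, 1.0])        # where the guided toy field sends samples
null_target = np.zeros(2)              # where the unguided toy field sends samples

x_w1 = cfg_sample(x0, y_target, null_target, w=1.0)   # plain guided sampling
x_w2 = cfg_sample(x0, y_target, null_target, w=2.0)   # guidance scale above 1
print(x_w1, x_w2)
```

In this toy setting w = 1 lands on the guided target, while w = 2 pushes the sample further away from the unguided target, which gives some intuition for why a guidance scale above 1 strengthens the effect of the prompt.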

Lecture 4

It starts with this. What is a Score function ?

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Marginal probability path }\end{minipage}} \qquad P_t (x) \qquad \text{Conditional probability path }  \qquad P_t(x \mid z)

\fbox{\begin{minipage}[t][1cm]{0,35\textwidth}\centering \small \textbf{Marginal Score }\end{minipage}} \qquad \nabla \log P_t(x) \qquad \text{Conditional Score }  \qquad \nabla \log P_t(x \mid z)

Proof

\nabla \log P_t(x)= \frac{{\nabla}_x P_t(x)}{ P_t(x)}

We use this rule. \frac{\partial}{ {\partial} x} \log x = \frac{1}{x}

{\nabla}_x \log P_t(x) = \frac{{\nabla}_x P_t(x)}{P_t(x)}

Gradient log of the marginal is the Gradient of the function divided by the function itself.

= \frac{1}{P_t(x)}{\nabla}_x \int \underbrace{P_t(x \mid z) P_{Data} (z)}_\text{Density of the marginal} dz

Swap the derivative and integral. Which is the derivative ?

Apply the rule( shown in the first line) in reverse.

= \int {\nabla}_x \log P_t(x \mid z) \frac{P_t(x \mid z) P_{Data} (z)}{P_t(x)} dz
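The identity derived above can be verified numerically when the data distribution has only a few points, because the integral becomes a sum. The two-point dataset, the time t = 0.6 and the schedule alpha_t = t, beta_t = 1 - t below are assumptions chosen for the check.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

data = np.array([-2.0, 2.0])          # hypothetical two-point dataset (1-D)
weights = np.array([0.5, 0.5])        # P_data(z)
t = 0.6
alpha, beta = t, 1.0 - t              # assumed schedule

def marginal_density(x):
    """P_t(x) = sum over z of P_t(x | z) P_data(z)."""
    return np.sum(weights * gaussian_pdf(x, alpha * data, beta))

def marginal_score_via_posterior(x):
    cond = gaussian_pdf(x, alpha * data, beta)            # P_t(x | z)
    posterior = weights * cond / np.sum(weights * cond)   # P_t(x | z) P_data(z) / P_t(x)
    cond_scores = -(x - alpha * data) / beta**2           # grad log P_t(x | z)
    return np.sum(cond_scores * posterior)

x = 0.3
h = 1e-5
# Direct finite-difference gradient of log P_t(x) for comparison.
finite_diff = (np.log(marginal_density(x + h)) - np.log(marginal_density(x - h))) / (2 * h)
print(finite_diff, marginal_score_via_posterior(x))
```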

At this stage we are asked to read the notes. But the gist is that these two Gaussian examples

Reparameterization = \textbf{ Velocity Field } \rightarrow \textbf{ Score Function }

This proof is in the notes. It seems that early Diffusion models learnt the Score function and transformed it into a Vector field.

Score Matching

\textbf{SM loss } \mathrm{L}_{\text{SM loss}}(\theta)  = \mathrm{E}_{t,z,x} \lVert{S_t}^{\theta} (x) - \nabla \log P_t(x)\rVert_2^2

\textbf{Denoising SM loss } \mathrm{L}_{\text{Denoising SM loss}}(\theta)  = \mathrm{E}_{t,z,x} \lVert{S_t}^{\theta} (x) - \nabla \log P_t(x \mid z)\rVert_2^2

Key Points

Learning the marginal vector field and learning the Score function are equivalent for Gaussian Probability Paths.

Denoising score matching is a simple way of learning Marginal Score functions by approximating Conditional Score Functions.

Sampling with score models is achieved by adding the desired amount of noise and applying correction to the vector field.
