Diffusion Models in High-Fidelity Image Synthesis

Bijula Ratheesh
Mar 15, 2022

The diffusion model was developed by Ratcliff in 1978 and was originally used for reaction-time analysis in cognitive psychology. It gives a detailed account of behavior in two-choice decision-making tasks. The model translates behavioral data (accuracy, mean response times, and response time distributions) into components of cognitive processing. It assumes that decisions are made by a noisy process that accumulates information over time from a starting point toward one of two response criteria, or boundaries. The rate of accumulation of information is called the drift rate (v), and it is determined by the quality of the information extracted from the stimulus.

In recognition memory, for example, drift rate would represent the quality of the match between a test word and memory. A word presented for study three times would have a higher degree of match (i.e., a higher drift rate) than a word presented once. Several experiments have been conducted in this area; see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4534506/

Figure: The diffusion decision model. (Top panel) Three simulated paths with drift rate v, boundary separation a, and starting point z. (Middle panel) Fast and slow processes from each of two drift rates, illustrating how an equal-size slowdown in drift rate (X) produces a small shift in the leading edge of the RT distribution (Y) and a larger shift in the tail (Z). (Bottom panel) Encoding time (u), decision time (d), and response output time (w). The nondecision component is the sum of u and w, with mean Ter and variability represented by a uniform distribution with range st. (ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2474742/)

The diffusion model can be applied to many fields, but its use was long limited because it is computationally expensive. The model has gained traction in recent times due to the availability of advanced software and algorithms for simulating it. Its applications range from the econometrics of option pricing to image synthesis.

Here we will focus on image synthesis.

Diffusion Probabilistic Models

The story starts with the 2015 paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics, which shows how to gain both flexibility and tractability when modeling complex data through an iterative forward diffusion process. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process, and then learn a reverse diffusion process that restores structure in the data, yielding a highly flexible and tractable generative model.

The method uses a Markov chain to gradually convert one distribution into another, an idea used in non-equilibrium statistical physics (Jarzynski, 1997) and sequential Monte Carlo (Neal, 2001). Learning in this framework involves estimating small perturbations to a diffusion process. Estimating small perturbations is more tractable than explicitly describing the full distribution with a single, non-analytically-normalizable potential function.

The goal is to define a forward (or inference) diffusion process that converts any complex data distribution into a simple, tractable distribution, and then learn a finite-time reversal of this diffusion process. In other words: convert the data into pure noise, learn to reverse this process by de-noising, and treat synthesis as an optimization procedure that follows the gradient of the data density to produce likely samples.

Forward Diffusion Process

We label the data distribution q(x^(0)). The data distribution is gradually converted into a well-behaved (analytically tractable) distribution π(y) by repeated application of a Markov diffusion kernel T_π(y | y′; β) for π(y), where β is the diffusion rate.
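In the paper's notation, the kernel and the forward transition it defines are:

$$\pi(y) = \int dy'\; T_\pi(y \mid y'; \beta)\, \pi(y')$$

$$q\big(x^{(t)} \mid x^{(t-1)}\big) = T_\pi\big(x^{(t)} \mid x^{(t-1)}; \beta_t\big)$$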

This gives the forward trajectory, corresponding to starting at the data distribution and performing T steps of diffusion:
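$$q\big(x^{(0\cdots T)}\big) = q\big(x^{(0)}\big) \prod_{t=1}^{T} q\big(x^{(t)} \mid x^{(t-1)}\big)$$

For Gaussian diffusion, the kernel is simply a Gaussian whose variance is set by the diffusion rate:

$$q\big(x^{(t)} \mid x^{(t-1)}\big) = \mathcal{N}\big(x^{(t)};\; x^{(t-1)}\sqrt{1-\beta_t},\; \mathbf{I}\beta_t\big)$$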

Reverse Trajectory

The generative distribution will be trained to describe the same trajectory, but in reverse:
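$$p\big(x^{(T)}\big) = \pi\big(x^{(T)}\big)$$

$$p\big(x^{(0\cdots T)}\big) = p\big(x^{(T)}\big) \prod_{t=1}^{T} p\big(x^{(t-1)} \mid x^{(t)}\big)$$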

For both Gaussian and binomial diffusion, and for small step size β, the reversal of the diffusion process has the same functional form as the forward process.
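For Gaussian diffusion, the reverse kernel is therefore also a Gaussian, whose mean and covariance are functions (to be learned) of the current state and timestep:

$$p\big(x^{(t-1)} \mid x^{(t)}\big) = \mathcal{N}\big(x^{(t-1)};\; f_\mu\big(x^{(t)}, t\big),\; f_\Sigma\big(x^{(t)}, t\big)\big)$$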

During learning, only the mean and covariance for a Gaussian diffusion kernel, or the bit-flip probability for a binomial kernel, need be estimated. For all results in the paper, multi-layer perceptrons are used to define these functions. A wide range of regression or function-fitting techniques would be applicable, however, including non-parametric methods.
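As a concrete illustration (a minimal sketch of my own in PyTorch, not the paper's code), a single MLP can regress both the mean f_µ and a diagonal log-covariance f_Σ of the reverse kernel from the noisy state x_t and the timestep t; the class name and sizes are arbitrary choices:

```python
import torch
import torch.nn as nn

class ReverseKernelMLP(nn.Module):
    """Regresses the mean and diagonal log-covariance of p(x_{t-1} | x_t)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * dim),
        )

    def forward(self, x_t, t):
        # condition on the timestep by appending it as an extra input feature
        h = self.net(torch.cat([x_t, t.float().unsqueeze(1)], dim=1))
        mu, log_var = h.chunk(2, dim=1)  # f_mu(x_t, t), log of diagonal f_Sigma(x_t, t)
        return mu, log_var
```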

Model Probability

The probability the generative model assigns to the data is:
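$$p\big(x^{(0)}\big) = \int dx^{(1\cdots T)}\; p\big(x^{(0\cdots T)}\big)$$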

Naively, this integral is intractable. Taking a cue from annealed importance sampling and the Jarzynski equality, we instead evaluate the relative probability of the forward and reverse trajectories, averaged over forward trajectories:
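$$p\big(x^{(0)}\big) = \int dx^{(1\cdots T)}\; q\big(x^{(1\cdots T)} \mid x^{(0)}\big)\; p\big(x^{(T)}\big) \prod_{t=1}^{T} \frac{p\big(x^{(t-1)} \mid x^{(t)}\big)}{q\big(x^{(t)} \mid x^{(t-1)}\big)}$$

This quantity can be estimated rapidly by averaging over samples from the forward trajectory.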

Training

Training amounts to maximizing the model log likelihood:
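$$L = \int dx^{(0)}\; q\big(x^{(0)}\big) \log p\big(x^{(0)}\big)$$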

By using the log-likelihood lower bound familiar from variational Bayesian methods (Jensen's inequality), we can bound this quantity from below by a term K, which the paper reduces to analytically computable entropies and KL divergences:
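$$L \geq K$$

$$K = -\sum_{t=2}^{T} \int dx^{(0)}\, dx^{(t)}\; q\big(x^{(0)}, x^{(t)}\big)\; D_{KL}\!\left( q\big(x^{(t-1)} \mid x^{(t)}, x^{(0)}\big) \,\middle\|\, p\big(x^{(t-1)} \mid x^{(t)}\big) \right) + H_q\big(X^{(T)} \mid X^{(0)}\big) - H_q\big(X^{(1)} \mid X^{(0)}\big) - H_p\big(X^{(T)}\big)$$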

If the forward and reverse trajectories are identical, corresponding to a quasi-static process, then the inequality becomes an equality.

Training consists of finding the reverse Markov transitions which maximize this lower bound on the log likelihood:
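$$\hat{p}\big(x^{(t-1)} \mid x^{(t)}\big) = \operatorname*{argmax}_{p\left(x^{(t-1)} \mid x^{(t)}\right)} K$$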

Thus, the task of estimating a probability distribution has been reduced to the task of performing regression on the functions which set the mean and covariance of a sequence of Gaussians (or set the state flip probability for a sequence of Bernoulli trials).
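To make the regression view concrete, here is a minimal, self-contained sketch (my own illustration, not the paper's code) that trains a small network on 2-D swiss-roll data. It uses the simplified noise-prediction parameterization popularized later by Ho et al. (2020), in which the network regresses the noise injected by the closed-form forward diffusion; all hyperparameters and names here are my own choices:

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_swiss_roll

T = 100
betas = torch.linspace(1e-4, 0.1, T)            # diffusion rates beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # prod_s (1 - beta_s)

# 2-D data: project the swiss roll onto two of its three axes, rescaled
x0_np, _ = make_swiss_roll(n_samples=10_000, noise=0.3)
x0 = torch.tensor(x0_np[:, [0, 2]], dtype=torch.float32) / 10.0

# small MLP: input (x_t, t/T) -> predicted noise epsilon
net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    idx = torch.randint(0, len(x0), (256,))
    t = torch.randint(0, T, (256,))
    eps = torch.randn(256, 2)
    ab = alphas_bar[t].unsqueeze(1)
    # closed-form forward diffusion: x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps
    xt = ab.sqrt() * x0[idx] + (1 - ab).sqrt() * eps
    inp = torch.cat([xt, (t.float() / T).unsqueeze(1)], dim=1)
    loss = ((net(inp) - eps) ** 2).mean()  # plain regression on the noise
    opt.zero_grad(); loss.backward(); opt.step()
```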

Figure: The proposed modeling framework trained on 2-D swiss roll data. The top row shows time slices from the forward trajectory q(x^(0···T)). The data distribution (left) undergoes Gaussian diffusion, which gradually transforms it into an identity-covariance Gaussian (right). The middle row shows the corresponding time slices from the trained reverse trajectory p(x^(0···T)). An identity-covariance Gaussian (right) undergoes a Gaussian diffusion process with learned mean and covariance functions, and is gradually transformed back into the data distribution (left). The bottom row shows the drift term, f_µ(x^(t), t) − x^(t), for the same reverse diffusion process.

In the paper's image experiments, multi-scale convolutional networks define the mean and covariance functions; the architecture is described below:

Figure: Network architecture for the mean function f_µ(x^(t), t) and covariance function f_Σ(x^(t), t), for the experiments in Section 3.2 of the paper. The input image x^(t) passes through several layers of multi-scale convolution (Section D.2.1). It then passes through several convolutional layers with 1×1 kernels, which is equivalent to a dense transformation performed on each pixel. A linear transformation generates coefficients for readout of both the mean µ^(t) and the covariance Σ^(t) for each pixel. Finally, a time-dependent readout function converts those coefficients into mean and covariance images, as described in Section D.2.1. For CIFAR-10, a dense (fully connected) pathway was used in parallel to the multi-scale convolutional pathway. For MNIST, the dense pathway was used to the exclusion of the multi-scale convolutional pathway.

Architectural Improvements

The current best architectures for image diffusion models are U-Nets (Ronneberger et al., 2015; Salimans et al., 2017), which are a natural choice to map corrupted data xt to reverse process parameters (µθ , Σθ) that have the same spatial dimensions as xt . Scalar conditioning, such as a class label or a diffusion timestep t, is provided by adding embeddings into intermediate layers of the network (Ho et al., 2020). Lower resolution image conditioning is provided by channel-wise concatenation of the low resolution image, processed by bilinear or bicubic up-sampling to the desired resolution, with the reverse process input xt (Saharia et al., 2021; Nichol and Dhariwal, 2021).
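As an illustration of scalar conditioning (a minimal sketch assuming the standard sinusoidal timestep embedding of Ho et al., 2020; the class and function names are my own), the timestep is embedded, projected, and added into intermediate feature maps:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of integer timesteps: shape (B,) -> (B, dim)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)

class ConditionedBlock(nn.Module):
    """Conv block that adds a projected timestep embedding per channel."""
    def __init__(self, channels: int, emb_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.proj = nn.Linear(emb_dim, channels)

    def forward(self, x, emb):
        h = torch.relu(self.conv(x))
        # broadcast the (B, C) projection over the spatial dimensions
        return h + self.proj(emb)[:, :, None, None]
```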

The U-Net model uses a stack of residual layers and downsampling convolutions, followed by a stack of residual layers with upsampling convolutions, with skip connections between the layers of the same spatial size. In addition, it uses a global attention layer at the 16×16 resolution with a single head, and adds a projection of the timestep embedding into each residual block. The paper by Nichol and Dhariwal (2021) shows the same result on ImageNet 128×128, finding that architecture can indeed give a substantial boost to sample quality on much larger and more diverse datasets at a higher resolution. The following architectural changes were explored:

  • Increasing depth versus width, holding model size relatively constant.
  • Increasing the number of attention heads.
  • Using attention at 32×32, 16×16, and 8×8 resolutions rather than only at 16×16.
  • Using the BigGAN [5] residual block for upsampling and downsampling the activations, following [60].
  • Rescaling residual connections with 1/√2, following [60, 27, 28] (see the sketch after this list).
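As a sketch of the last item (my own minimal PyTorch illustration, not the papers' exact block), rescaling the sum of the residual branch and the skip connection by 1/√2 keeps the variance of two roughly unit-variance signals approximately constant:

```python
import math
import torch
import torch.nn as nn

class RescaledResBlock(nn.Module):
    """Residual block whose output is rescaled by 1/sqrt(2)."""
    def __init__(self, channels: int):  # channels must be divisible by 8
        super().__init__()
        self.body = nn.Sequential(
            nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        # (x + f(x)) / sqrt(2) keeps activation variance roughly stable
        return (x + self.body(x)) / math.sqrt(2)
```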
Table: Training compute requirements for the diffusion models compared to StyleGAN2 and BigGAN-deep. Training iterations for each diffusion model are given in parentheses. Compute is measured in V100-days. † ImageNet 256×256 classifier with 150K iterations (instead of 500K). ‡ ImageNet 64×64 classifier with batch size 256 (instead of 1024). * ImageNet 128×128 classifier with batch size 256 (instead of 1024).

References:

Dhariwal, P. and Nichol, A. (2021). Diffusion Models Beat GANs on Image Synthesis. https://arxiv.org/abs/2105.05233
