Diffusion Models in High-Fidelity Image Synthesis
The diffusion model was developed by Ratcliff in 1978 and was first used for reaction-time analysis in cognitive psychology. It gives a detailed account of behavior in two-choice decision-making tasks, translating behavioral data (accuracy, mean response times, and response-time distributions) into components of cognitive processing. The model assumes that decisions are made by a noisy process that accumulates information over time from a starting point toward one of two response criteria, or boundaries. The rate at which information accumulates is called the drift rate (v), and it is determined by the quality of the information extracted from the stimulus.
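The accumulation-to-boundary process can be sketched as a discrete-time simulation. This is a minimal illustration only, not Ratcliff's full model (which also includes across-trial variability in drift and starting point); the parameter names v, a, z follow convention, but the function itself is hypothetical:

```python
import numpy as np

def simulate_ddm(v, a, z, dt=0.001, sigma=1.0, max_t=5.0, rng=None):
    """Simulate one trial of a drift-diffusion process.

    v: drift rate, a: boundary separation, z: starting point (0 < z < a).
    Returns (choice, rt): choice is 1 for the upper boundary, 0 otherwise.
    """
    rng = np.random.default_rng() if rng is None else rng
    x, t = z, 0.0
    while 0.0 < x < a and t < max_t:
        # noisy accumulation: deterministic drift plus Gaussian increments
        x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return (1 if x >= a else 0), t

rng = np.random.default_rng(0)
trials = [simulate_ddm(v=1.0, a=2.0, z=1.0, rng=rng) for _ in range(200)]
acc = np.mean([c for c, _ in trials])      # fraction of upper-boundary responses
mean_rt = np.mean([t for _, t in trials])  # mean response time in seconds
```

With a positive drift rate, most trials terminate at the upper boundary, mimicking above-chance accuracy; lowering v slows and degrades the simulated decisions.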
For recognition memory, for example, drift rate represents the quality of the match between a test word and memory: a word presented for study three times would yield a stronger match (a higher drift rate) than a word presented once. Several experiments in this area are reviewed at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4534506/
The diffusion model applies to many fields, but its use was long limited by computational expense. It has gained traction recently as software and algorithms for simulating such models have matured. Its applications range from the econometrics of option pricing to image synthesis.
Here we focus on image synthesis.
Diffusion probabilistic models
The story starts with the 2015 paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics, which shows how to gain both flexibility and tractability when modeling complex data using an iterative forward diffusion process. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process, and then learn a reverse diffusion process that restores structure in the data, yielding a highly flexible and tractable generative model.
This method uses a Markov chain to gradually convert one distribution into another, an idea drawn from non-equilibrium statistical physics (Jarzynski, 1997) and sequential Monte Carlo (Neal, 2001). Learning in this framework involves estimating small perturbations to a diffusion process, which is more tractable than explicitly describing the full distribution with a single, non-analytically-normalizable potential function.
The goal is to define a forward (or inference) diffusion process that converts any complex data distribution into a simple, tractable distribution, and then learn a finite-time reversal of this diffusion process. In short: convert the data into pure noise, learn to reverse this process by de-noising, and treat synthesis as an optimization procedure that follows the gradient of the data density to produce likely samples.
Forward Diffusion Process
We label the data distribution q(x(0)). It is gradually converted into a well-behaved (analytically tractable) distribution π(y) by repeated application of a Markov diffusion kernel Tπ(y | y′; β) for π(y), where β is the diffusion rate:

π(y) = ∫ dy′ Tπ(y | y′; β) π(y′)

q(x(t) | x(t−1)) = Tπ(x(t) | x(t−1); βt)
The forward trajectory, corresponding to starting at the data distribution and performing T steps of diffusion, is then

q(x(0⋯T)) = q(x(0)) ∏t=1…T q(x(t) | x(t−1))
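For the Gaussian case, the forward trajectory can be sketched in a few lines. The linear β schedule below is an illustrative assumption, not one taken from a specific paper; each step draws x_t from a Gaussian with mean sqrt(1 − β_t)·x_{t−1} and variance β_t:

```python
import numpy as np

def forward_trajectory(x0, betas, rng):
    """Run T steps of a Gaussian forward diffusion.

    Each step applies q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I),
    so the signal shrinks and noise accumulates until x_T is close to N(0, I).
    """
    xs = [x0]
    for beta in betas:
        x_prev = xs[-1]
        xs.append(np.sqrt(1.0 - beta) * x_prev
                  + np.sqrt(beta) * rng.standard_normal(x_prev.shape))
    return xs

rng = np.random.default_rng(0)
x0 = rng.standard_normal(1000) * 3.0 + 5.0   # toy "data" far from N(0, 1)
betas = np.linspace(1e-4, 0.5, 100)          # illustrative diffusion-rate schedule
xs = forward_trajectory(x0, betas, rng)
# after T steps, the sample statistics approach the N(0, 1) prior
```

Whatever structure the data had (here, mean 5 and standard deviation 3) is destroyed step by step, leaving a distribution the reverse process can start from.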
Reverse trajectory
The generative distribution will be trained to describe the same trajectory, but in reverse:

p(x(0⋯T)) = p(x(T)) ∏t=1…T p(x(t−1) | x(t))
For both Gaussian and binomial diffusion, the reversal of the diffusion process has the same functional form as the forward process. During learning, only the mean and covariance of a Gaussian diffusion kernel, or the bit-flip probability of a binomial kernel, need be estimated. In the 2015 paper, multi-layer perceptrons are used to define these functions, but a wide range of regression or function-fitting techniques would be applicable, including non-parametric methods.
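A rough sketch of this parameterization: a toy (untrained, randomly initialized) MLP maps (x_t, t) to the mean and log-std of the Gaussian reverse kernel, and sampling runs the chain backwards from the prior. All names, sizes, and the scalar-data simplification are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy two-layer MLP: maps (x_t, t/T) -> (mu, log_sigma) of the reverse kernel
W1 = rng.standard_normal((2, 16)) * 0.1
b1 = np.zeros(16)
W2 = rng.standard_normal((16, 2)) * 0.1
b2 = np.zeros(2)

def reverse_kernel_params(x_t, t, T):
    """MLP f_theta(x_t, t) -> mean and log-std of p(x_{t-1} | x_t)."""
    h = np.tanh(np.array([x_t, t / T]) @ W1 + b1)
    mu, log_sigma = h @ W2 + b2
    return mu, log_sigma

def reverse_step(x_t, t, T, rng):
    """Sample x_{t-1} ~ N(mu_theta(x_t, t), sigma_theta(x_t, t)^2)."""
    mu, log_sigma = reverse_kernel_params(x_t, t, T)
    return mu + np.exp(log_sigma) * rng.standard_normal()

# generation: start at the tractable prior, then de-noise for T steps
T = 50
x = rng.standard_normal()
for t in range(T, 0, -1):
    x = reverse_step(x, t, T, rng)
```

With trained weights, repeated application of this learned kernel would carry a prior sample back to a data-like sample; here the point is only the shape of the computation.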
Model Probability
The probability the generative model assigns to the data is

p(x(0)) = ∫ dx(1⋯T) p(x(0⋯T))
Naively this integral is intractable, but taking a cue from annealed importance sampling and the Jarzynski equality, we instead evaluate the relative probability of the forward and reverse trajectories, averaged over forward trajectories.
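Reconstructed from the 2015 paper (writing x^{(0)} for the data and x^{(1⋯T)} for the latent trajectory), this averaged ratio of reverse to forward trajectories takes the form:

```latex
p\left(x^{(0)}\right)
= \int d x^{(1 \cdots T)} \;
  q\left(x^{(1 \cdots T)} \mid x^{(0)}\right) \,
  p\left(x^{(T)}\right)
  \prod_{t=1}^{T}
  \frac{p\left(x^{(t-1)} \mid x^{(t)}\right)}
       {q\left(x^{(t)} \mid x^{(t-1)}\right)}
```

This only requires sampling from the forward trajectory, which is tractable by construction.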
Training
Training amounts to maximizing the model log likelihood

L = ∫ dx(0) q(x(0)) log p(x(0))
By applying the log-likelihood lower bound familiar from variational Bayesian methods (Jensen's inequality), we arrive at a bound K ≤ L.
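Following the 2015 paper's notation, the lower bound has the form:

```latex
L \;\geq\; K
= \int d x^{(0 \cdots T)} \;
  q\left(x^{(0 \cdots T)}\right)
  \log \left[
    p\left(x^{(T)}\right)
    \prod_{t=1}^{T}
    \frac{p\left(x^{(t-1)} \mid x^{(t)}\right)}
         {q\left(x^{(t)} \mid x^{(t-1)}\right)}
  \right]
```

where L is the model log likelihood under the data distribution q(x^{(0)}).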
If the forward and reverse trajectories are identical, corresponding to a quasi-static process, then the inequality becomes an equality.
Training consists of finding the reverse Markov transitions which maximize this lower bound on the log likelihood
Thus, the task of estimating a probability distribution has been reduced to the task of performing regression on the functions which set the mean and covariance of a sequence of Gaussians (or set the state flip probability for a sequence of Bernoulli trials).
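As a concrete (and deliberately over-simplified) illustration of estimating a distribution by regression: for a single Gaussian diffusion step on 1-D Gaussian data, the optimal reverse mean is a linear function of the noised sample, so plain least squares recovers it. Real models replace this linear fit with a neural network shared across all timesteps; everything below is a toy assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.1

# toy 1-D data and one Gaussian forward diffusion step
x0 = rng.standard_normal(5000) * 0.5 + 2.0
x1 = np.sqrt(1 - beta) * x0 + np.sqrt(beta) * rng.standard_normal(5000)

# regress the reverse mean mu(x1) = a * x1 + b by least squares:
# the fitted function predicts the clean x0 that produced each x1
A = np.stack([x1, np.ones_like(x1)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, x0, rcond=None)

# covariance is held fixed in this sketch; the paper learns it as well
mse = np.mean((a * x1 + b - x0) ** 2)
```

The residual error of the fit is below the injected noise level β, because the regression exploits the known data statistics, which is exactly the sense in which "learning the reverse kernel" reduces to function fitting.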
In all the experiments here, U-Nets were used; the architecture is described below.
Architectural improvements
The current best architectures for image diffusion models are U-Nets (Ronneberger et al., 2015; Salimans et al., 2017), which are a natural choice to map corrupted data xt to reverse process parameters (µθ , Σθ) that have the same spatial dimensions as xt . Scalar conditioning, such as a class label or a diffusion timestep t, is provided by adding embeddings into intermediate layers of the network (Ho et al., 2020). Lower resolution image conditioning is provided by channel-wise concatenation of the low resolution image, processed by bilinear or bicubic up-sampling to the desired resolution, with the reverse process input xt (Saharia et al., 2021; Nichol and Dhariwal, 2021).
The U-Net model uses a stack of residual layers and downsampling convolutions, followed by a stack of residual layers with upsampling convolutions, with skip connections between layers of the same spatial size. In addition, it uses a global attention layer at the 16×16 resolution with a single head, and adds a projection of the timestep embedding into each residual block. Nichol and Dhariwal (2021) show the same result on ImageNet 128×128, finding that architecture can indeed give a substantial boost to sample quality on much larger and more diverse datasets at higher resolution. They explore the following changes:
- Increasing depth versus width, holding model size relatively constant.
- Increasing the number of attention heads.
- Using attention at 32×32, 16×16, and 8×8 resolutions rather than only at 16×16.
- Using the BigGAN [5] residual block for upsampling and downsampling the activations, following [60].
- Rescaling residual connections with 1/√2, following [60, 27, 28].
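The wiring described above (a down path, a bottleneck, an up path, skip connections at matching spatial sizes, and a timestep embedding injected into each block) can be made concrete with a schematic NumPy pass. The blocks here are placeholders: pooling and nearest-neighbour upsampling stand in for learned residual and attention layers, so only the topology matches the real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pool2(x):
    """2x2 average pooling (downsampling)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour 2x upsampling."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def timestep_embedding(t, dim=8):
    """Sinusoidal embedding of the diffusion timestep t."""
    freqs = np.exp(-np.arange(dim // 2) * np.log(10000.0) / (dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def tiny_unet(x_t, t):
    """Schematic U-Net pass: down path, bottleneck, up path with skips."""
    emb = timestep_embedding(t).mean()   # stand-in for a learned projection
    d1 = x_t + emb                        # 32x32 block
    d2 = avg_pool2(d1) + emb              # 16x16 block (attention would sit here)
    mid = avg_pool2(d2) + emb             # 8x8 bottleneck
    u2 = upsample2(mid) + d2              # skip from the 16x16 down block
    u1 = upsample2(u2) + d1               # skip from the 32x32 down block
    return u1

x_t = rng.standard_normal((32, 32))
out = tiny_unet(x_t, t=10)
```

The output keeps the spatial shape of the input, which is what lets the network parameterize (µθ, Σθ) with the same dimensions as x_t; the skip connections are what preserve fine spatial detail through the bottleneck.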
References: