Diffusion model
The name "diffusion" comes from an analogy: a drop of ink released into a glass of clear water slowly spreads out until the water becomes uniformly turbid. If this process could be reversed, we would have a way to recover the initial state of the ink drop from the turbid water.
A diffusion model is divided into two parts:
- The forward process (Forward Diffusion Process) adds noise to the image, just as the ink drop gradually spreads out. This process is used in the training phase;
- The reverse process (Reverse Diffusion Process) removes noise from the image, as if the turbid water gradually ran backward in time to the state of a single drop of ink. This process is used in the generation stage.
In the forward process, noise is gradually added to the image until it becomes pure Gaussian noise. In the reverse process, that Gaussian noise is gradually restored to an image; this reverse process is how images are generated, and it relies on repeated noise removal. Starting from a pure Gaussian noise image, the noise estimated by the trained U-Net network is removed step by step until the final image is reproduced.
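To make the forward process concrete: given a clean image and a noise schedule, the noisy image at any step can be sampled directly as a weighted mix of the clean image and Gaussian noise. The sketch below is a minimal illustration in PyTorch; the linear beta schedule and tensor shapes are assumptions chosen for the example, not tied to any particular implementation.

```python
import torch

# Illustrative linear noise schedule (assumed values, in the style of DDPM)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # per-step noise variance
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative signal-retention factor

def forward_diffuse(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t from q(x_t | x_0): mostly image early on, mostly noise late."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise

# Example: noise a dummy 3x512x512 image halfway through the schedule
x0 = torch.rand(1, 3, 512, 512)
xt = forward_diffuse(x0, t=500)
```

Training then amounts to showing the U-Net such noisy images and asking it to predict the noise that was added.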
Stable Diffusion
The difference between the Diffusion model and other generative models is that it does not go from image to latent variable and from latent variable back to image in a single step; instead, it decomposes and denoises gradually, step by step. This also leads to Diffusion's main disadvantage: the full-size image must be fed into the U-Net network at every step of the reverse diffusion process. When the image size and the number of time steps are large, Diffusion runs very slowly and the computing power it consumes is enormous. Stable Diffusion came into being to solve this problem.
Stable Diffusion itself is not a single model but a system architecture composed of multiple modules and models. It consists of three core components, each of which is a neural network in its own right, also known as the three base models:
1. CLIPText encodes the text prompt into numerical form:
- Input: input text (prompt);
- Output: 77 token embedding vectors, each with 768 dimensions;
2. Image Information Creator gradually processes/diffuses the information in the latent space:
- Input: the text embeddings and a starting multidimensional matrix of random noise (a structured array of numbers, also called a tensor);
- Output: processed information matrix;
3. AutoEncoder Decoder (mainly a VAE: Variational AutoEncoder) decodes the processed information matrix into the final image, converting the result in the latent space into actual image dimensions:
- Input: processed information matrix with dimensions (4, 64, 64);
- Output: the generated image with dimensions (3, 512, 512), i.e., three RGB channels and a 512 × 512 pixel grid.
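For orientation, these three components can be driven end to end with a few lines of Python. The sketch below uses the HuggingFace diffusers library; the checkpoint name, device, and parameter values are illustrative assumptions rather than part of the architecture itself.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion checkpoint (the name here is only an example)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# CLIPText encodes the prompt, the U-Net denoises in latent space over 50 steps,
# and the VAE decoder turns the final latents into a 512x512 RGB image.
image = pipe("a drop of ink diffusing in a glass of water",
             num_inference_steps=50).images[0]
image.save("ink.png")
```

The following sections look at each of the three components in turn.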
CLIPText text encoder
CLIPText is the text encoder, shown as the dark blue module in the previous figure. It is a special Transformer language model that turns the input text prompt into a matrix of token embeddings. Embedding refers to the process of mapping high-dimensional data (text, images, sound, etc.) into a lower-dimensional space; the resulting vectors are also called embeddings.
CLIP stands for Contrastive Language-Image Pre-Training, i.e., pre-training by contrasting language with images; it can be thought of as an image-text matching model. It combines natural language understanding with computer vision analysis, training on one-to-one correspondences between language and images to produce a pretrained model that can later be used for text-conditioned image generation tasks.
CLIP itself is also a neural network. It matches and jointly trains the semantic features extracted from text by a Text Encoder with the image features extracted from images by an Image Encoder. That is, it continuously adjusts the internal parameters of the two models so that the text features and image features they output can, after a simple comparison, confirm whether a given text-image pair matches.
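As a small illustration of the 77 × 768 output mentioned earlier, the sketch below encodes a prompt with the CLIP text encoder used by Stable Diffusion v1.x, via the HuggingFace transformers library; the checkpoint name and prompt are assumptions made for the example.

```python
from transformers import CLIPTokenizer, CLIPTextModel

# CLIP text encoder used by Stable Diffusion v1.x (checkpoint name is an example)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Pad/truncate the prompt to 77 tokens, then encode it
tokens = tokenizer("a drop of ink in clear water",
                   padding="max_length", max_length=77, return_tensors="pt")
embeddings = text_encoder(**tokens).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768])
```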
Image Information Creator
This module is the core of the Stable Diffusion architecture, and it is where most of the performance gains over previous Diffusion versions come from. The component runs repeatedly for multiple steps (Steps) to generate image information; the Steps value usually defaults to 50 or 100.
The Image Information Creator works entirely in “latent space,” which makes it 64 times more efficient than previous approaches that worked in pixel space. Technically, this component consists of a U-Net neural network and a scheduling algorithm.
The Image Information Creator runs its U-Net for a series of iterations, typically 50 or more, which can be pictured as a chain of connected U-Net modules. The trained noise predictor U-Net estimates the noise in the current latent and its amount; removing that predicted noise yields the result of each iteration, an image with less noise, until an essentially noise-free image is produced.
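The iterative denoising described above can be written out explicitly. The sketch below is a simplified version of a standard diffusers denoising loop; the checkpoint name, the placeholder text embeddings, and the absence of classifier-free guidance are all simplifying assumptions.

```python
import torch
from diffusers import UNet2DConditionModel, PNDMScheduler

# U-Net noise predictor and scheduler from a Stable Diffusion checkpoint (names are examples)
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = PNDMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")

scheduler.set_timesteps(50)                               # number of denoising iterations
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma
text_embeddings = torch.randn(1, 77, 768)                 # placeholder; in practice from CLIPText

for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        # The U-Net predicts the noise contained in the current latents, conditioned on the text
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    # The scheduler removes a portion of the predicted noise, producing a cleaner latent
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```

Each pass through the loop corresponds to one U-Net iteration described above; the result is a (4, 64, 64) latent ready for the image decoder.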
Image Decoder
The image decoder is an AutoEncoder Decoder (mainly the decoder of a VAE: Variational AutoEncoder), which draws the image from the information passed by the Image Information Creator. It is run only once, after the preceding Diffusion process has fully completed, decoding the image information in the latent space into the final pixel image.
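A minimal sketch of this final decoding step, again using the diffusers VAE component; the checkpoint name and the 0.18215 latent scaling factor follow common Stable Diffusion v1.x conventions and are assumptions here.

```python
import torch
from diffusers import AutoencoderKL

# VAE from a Stable Diffusion checkpoint (name is an example)
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

latents = torch.randn(1, 4, 64, 64)   # in practice: the output of the denoising loop
with torch.no_grad():
    # Undo the latent scaling used during training, then decode to pixel space
    image = vae.decode(latents / 0.18215).sample

print(image.shape)  # torch.Size([1, 3, 512, 512]), values roughly in [-1, 1]
```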