Playground for Stable Diffusion

Loges Siva
5 min read · Oct 16, 2022


An Easy Guide to Using the Stable Diffusion Model for Image and Video Generation

Hi, let’s go over a short, informative read on Stable Diffusion and how to use it. This post is for content creators and developers who are interested in computer vision and deep learning, who need to create image assets for their work, or who want to apply diffusion models in their own applications. It does not go into the math, architecture or research behind Stable Diffusion. If you would like to learn those details, links are listed in the References section.

Introduction 📜

What does the term “diffusion” mean?

From Wikipedia, “Diffusion is the net movement of anything (for example, atoms, ions, molecules, energy) generally from a region of higher concentration to a region of lower concentration.”

Diffusion models borrow this idea. In the forward diffusion pass, noise is gradually added to an image until the image becomes pure noise, which essentially diffuses the pixels in the image. In the reverse (backward) diffusion pass, the noisy image is denoised over the same number of steps until the data is recovered. Because generation is a gradual, sequential process, diffusion models tend to be less prone to mode collapse, a known problem with GANs.
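To make the forward pass concrete, here is a minimal NumPy sketch of the closed-form noising step used in DDPM-style diffusion models. The linear beta schedule, the 1000-step count and the function names are illustrative assumptions, not values taken from the Stable Diffusion Playground repository.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
image = rng.uniform(-1.0, 1.0, size=(8, 8, 3))  # stand-in for a normalized image

slightly_noisy = add_noise(image, 10, alpha_bars, rng)      # early step: mostly image
almost_pure_noise = add_noise(image, 999, alpha_bars, rng)  # last step: mostly noise
```

At an early step the result is still dominated by the image, while at the last step almost nothing of the image remains. The reverse pass is what the trained network learns to undo, one step at a time.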

Latent Diffusion Model 🧙‍♂️

A latent diffusion model combines three components: an autoencoder (trained in part with a GAN-style adversarial loss), a diffusion model and a transformer-based text encoder. Most diffusion models use a U-Net as the denoising network, whose output has the same dimensions as its input. Usually, diffusion models apply diffusion in pixel space, but Stable Diffusion applies diffusion in latent space, hence the term “latent diffusion model (LDM)”. The conversion between pixel space and latent space is done by the autoencoder (encoder and decoder), while the transformer encodes the text prompt that conditions the denoising. Because the latent representation is much smaller than the image, this method is more memory efficient than pixel-space diffusion and still produces highly detailed images.
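A quick bit of arithmetic shows where the memory saving comes from. The sketch below assumes the Stable Diffusion v1 autoencoder settings (spatial downsampling by a factor of 8 and 4 latent channels); the function names are made up for illustration.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for an image of the given size."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, image_channels=3, **kwargs):
    """How many times fewer values the latent holds compared to the RGB image."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (image_channels * height * width) / (c * h * w)

print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions, the denoising network works on roughly 48 times fewer values than it would in pixel space for a 512x512 image.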

Usage 🏃‍♂️

We will use the GitHub repository Stable Diffusion Playground to try out the different modes of application of the Stable Diffusion model.

The repository supports 5 modes at the time of writing this article:
1. Text to Image
2. Image to Image
3. Inpaint
4. Dream
5. Animate

Let’s go over each of the modes and its usage in detail.

Code Requirements

Follow the code requirements section of the repository README and set up the environment. This is essential for the upcoming sections of this article.

Hugging Face Access Token

The Stable Diffusion Playground codebase uses Hugging Face to download the models and call their APIs. To use the Hugging Face Hub, you must create an access token to authenticate your identity. To do so, create an account at huggingface.co, then go to Settings -> Access Tokens and create an access token with read permission.

The modes of the application use this access token. When prompted for the access token, copy it from Settings -> Access Tokens and paste it in. Please do not share the access token publicly.
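The repository asks for the token on the command line, so nothing extra is needed there. If you write your own scripts around the models, one way to avoid pasting the token into source code is to read it from an environment variable and fall back to a hidden prompt. This is only a sketch: the helper name and the choice of the HF_TOKEN variable are assumptions for illustration, not part of the repository.

```python
import os
from getpass import getpass

def get_hf_token(env_var="HF_TOKEN"):
    """Return the token from the environment, or prompt for it without echoing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed characters, so the token never shows on screen.
        token = getpass("Hugging Face access token: ").strip()
    if not token:
        raise ValueError("No access token provided")
    return token
```

Calling get_hf_token() in a script returns the token without it ever appearing in the file, which makes it harder to commit it to a public repository by accident.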

Text to Image Mode

Text to Image result

Given an input prompt, this mode generates an image based on the prompt description. Run the command below to start the application in Text to Image mode:

python run.py --mode txt2img --device gpu --save

Follow the command-line prompts to provide the Hugging Face user access token, the input prompt and the resolution of the image to be generated.
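When choosing a resolution, keep in mind that the Hugging Face Stable Diffusion pipelines expect the height and width to be divisible by 8. Whether the Playground application adjusts the values for you is not something this article confirms, so the helper below is a hypothetical sketch for picking a valid size yourself.

```python
def snap_resolution(height, width, multiple=8):
    """Round each side down to the nearest multiple, keeping at least one multiple."""
    def snap(value):
        return max(multiple, (value // multiple) * multiple)
    return snap(height), snap(width)

print(snap_resolution(512, 512))  # (512, 512)
print(snap_resolution(500, 770))  # (496, 768)
```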

Image to Image Mode

Image to Image result. Left is original input image. Right is generated image based on prompt and input image.

Given an input prompt and an image, this mode modifies the input image based on the prompt description. Run the command below to start the application in Image to Image mode:

python run.py --mode img2img --device gpu --save

Follow the command-line prompts to provide the Hugging Face user access token, input prompt, input image and strength. ‘strength’ accepts values in the range [0, 1], where 0 means no change from the initial input image and 1 means the input image is changed completely according to the input prompt.
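To build some intuition for ‘strength’, the sketch below mirrors how the Hugging Face diffusers image-to-image pipeline, to my understanding, turns strength into a number of denoising steps. That the Playground application behaves the same way is an assumption, and the function name is made up for illustration.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (skipped_steps, executed_steps) for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # Higher strength means more noise is added to the input image,
    # so more of the denoising schedule has to be run.
    executed = min(int(num_inference_steps * strength), num_inference_steps)
    skipped = num_inference_steps - executed
    return skipped, executed

for strength in (0.0, 0.5, 1.0):
    print(strength, img2img_steps(50, strength))
```

With 50 steps, a strength of 0.5 runs only the last 25 denoising steps, so much of the input image's structure survives. A strength of 1.0 runs all 50 steps and essentially ignores the input image.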

Inpaint Mode

Inpaint result. Left is initial input image. Center is mask image. Right is generated image based on prompt, initial image and mask image.

Given an input prompt, an initial input image and a mask image, this mode modifies the input image based on the prompt description, but only in the area specified by the mask image. Run the command below to start the application in Inpaint mode:

python run.py --mode inpaint --device gpu --save

Follow the command-line prompts to provide the Hugging Face user access token, input prompt, input image, mask image and strength. ‘strength’ accepts values in the range [0, 1], where 0 means no change from the initial input image and 1 means the masked area is changed completely according to the input prompt.
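The role of the mask can be illustrated with a simple blend. This is a simplified pixel-space sketch, not the repository's implementation: real inpainting pipelines typically apply the mask in latent space during denoising. It also assumes the convention that a mask value of 1 (white) marks the area to repaint.

```python
import numpy as np

def blend_with_mask(original, generated, mask):
    """Keep `original` where mask is 0 and take `generated` where mask is 1."""
    mask = mask.astype(np.float32)
    if mask.ndim == original.ndim - 1:
        mask = mask[..., None]  # broadcast a 2D mask over the channel axis
    return mask * generated + (1.0 - mask) * original

original = np.zeros((4, 4, 3), dtype=np.float32)   # stand-in for the input image
generated = np.ones((4, 4, 3), dtype=np.float32)   # stand-in for the model output
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0  # repaint only the central 2x2 region

result = blend_with_mask(original, generated, mask)
```

Only the central region of the result comes from the generated image. Everything outside the mask is copied from the original.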

Dream Mode

Given an input prompt, this mode generates a video based on the prompt description. Run the command below to start the application in Dream mode:

python run.py --mode dream --device gpu --save --num <number of frames>

<number of frames> represents the number of frames required in the generated video.

Follow the command-line prompts to provide the Hugging Face user access token, the input prompt and the resolution of the video to be generated.

This mode works by spherically interpolating between latents over a fixed number of steps and passing each interpolated latent to the model as its initial latent. This produces small changes between generated frames while preserving coherence between them. Hence, the video has a dream-like effect, as if imagined by the model.
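Spherical interpolation (slerp) can be sketched in a few lines of NumPy. This is a generic illustration rather than the repository's code: the latent shape, the frame count and the fallback threshold are assumptions.

```python
import numpy as np

def slerp(t, v0, v1, dot_threshold=0.9995):
    """Spherical linear interpolation between two arrays of the same shape."""
    v0_flat, v1_flat = v0.ravel(), v1.ravel()
    dot = np.dot(v0_flat, v1_flat) / (np.linalg.norm(v0_flat) * np.linalg.norm(v1_flat))
    if abs(dot) > dot_threshold:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return (1.0 - t) * v0 + t * v1
    theta = np.arccos(dot)  # angle between the two latents
    sin_theta = np.sin(theta)
    return (np.sin((1.0 - t) * theta) / sin_theta) * v0 + (np.sin(t * theta) / sin_theta) * v1

rng = np.random.default_rng(0)
start = rng.standard_normal((4, 64, 64))  # assumed latent shape for a 512x512 image
end = rng.standard_normal((4, 64, 64))

num_frames = 5
frame_latents = [slerp(t, start, end) for t in np.linspace(0.0, 1.0, num_frames)]
```

Each interpolated latent would then seed one frame. Unlike plain linear interpolation, slerp keeps the norm of the intermediate latents close to that of the endpoints, which matters because the model expects its initial latent to look like Gaussian noise.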

Animate Mode

This mode differs from the other modes in its usage and architecture. Animate mode supports 2D and 3D video generation based on the input prompt. It also supports video as input and converts the video to a style based on the input prompt. Run the command below to start the application in Animate mode:

python run.py --mode animate --device gpu --save

Follow the command-line prompts to provide the Hugging Face user access token. The prompt and the configuration for this mode must be set in the animation_mode/config.py file. Go through the README for a better understanding of the configuration options and their usage.

Demonstration video

Conclusion

Latent diffusion models are a step forward in image generation: they generate high-resolution images with fine detail while preserving the semantic structure of images.

The Stable Diffusion Playground application was developed as a result of breakthrough research and development in the diffusion model domain. Thanks to the amazing creators and developers for open-sourcing the paper, project and models for everyone to experiment with, use in a variety of applications and improve on previous work.

References

[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer, “High-Resolution Image Synthesis with Latent Diffusion Models”, arXiv:2112.10752, 2021

[2] Blattmann et al., “Latent Diffusion Models”, https://github.com/CompVis/latent-diffusion, 2022

[3] Logeswaran Sivakumar, Stable Diffusion Playground, https://github.com/Logeswaran123/Stable-Diffusion-Playground, 2022

[4] Hugging Face Stable Diffusion, https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines/stable_diffusion, 2022
