Project 5: Diffusion Models

Part 0. Setup

For this part, we set up the diffusion model with a fixed random seed (185), which is used throughout the project. In Part A we work with images of size 64x64 pixels, so the outputs are not especially sharp; in Part B the images improve significantly. Below, we show results with different numbers of inference steps.

20 Denoising Steps
50 Denoising Steps

Part 1. Implementing the Forward Process

Progressively adding noise to an image is the key to the diffusion process: it is exactly this noising that the model learns to reverse in order to recover the clean image. Below are examples of the test image at various noise levels:
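For reference, a minimal PyTorch sketch of this forward step, assuming a precomputed tensor `alphas_cumprod` holding the cumulative products ᾱt of the noise schedule (the names and the function itself are illustrative, not the project's exact code):

    import torch

    def forward(x0, t, alphas_cumprod):
        """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
        abar_t = alphas_cumprod[t]
        eps = torch.randn_like(x0)                      # fresh Gaussian noise
        x_t = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
        return x_t, eps                                 # eps is also returned, useful for training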

Image with 250 timesteps
250 timesteps
Image with 500 timesteps
500 timesteps
Image with 750 timesteps
750 timesteps

Part 2. Denoising Using Classical Gaussian Blurring

As a classical baseline, we try to denoise the images with Gaussian blurring. As the results below show, this method is not effective:
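A minimal sketch of this baseline using torchvision's Gaussian blur (the kernel size and sigma shown here are illustrative choices, not necessarily the exact values used):

    import torchvision.transforms.functional as TF

    def blur_denoise(x_noisy, kernel_size=5, sigma=2.0):
        """Classical 'denoising': simply smooth the noisy image with a Gaussian kernel."""
        return TF.gaussian_blur(x_noisy, kernel_size=kernel_size, sigma=sigma)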

Blurred image at 250 timesteps
250 timesteps
Blurred image at 500 timesteps
500 timesteps
Blurred image at 750 timesteps
750 timesteps

Part 3. One-Step Denoising Using the Diffusion Model

Here, a U-Net estimates the Gaussian noise in the image, which we then remove in a single step. The model takes the timestep as an input, which tells it how much noise to expect and remove:
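A sketch of the one-step estimate, assuming a noise-predicting `unet(x_t, t)` callable (the real model also takes a text-prompt embedding, omitted here for brevity):

    import torch

    def one_step_denoise(unet, x_t, t, alphas_cumprod):
        """Predict the noise in x_t and invert the forward equation to estimate x0."""
        abar_t = alphas_cumprod[t]
        eps_hat = unet(x_t, t)                          # predicted noise (placeholder call)
        x0_hat = (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
        return x0_hat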

Denoised image at 250 timesteps
250 timesteps
Denoised image at 500 timesteps
500 timesteps
Denoised image at 750 timesteps
750 timesteps

Part 4. Iterative Denoising Using the Diffusion Model

Iterative denoising works much better than the classical baseline. To speed up the process, we stride through the schedule and denoise every 30 timesteps rather than at every single timestep. Below is the progression of the denoising and the final comparison to the classical method:
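A simplified sketch of the strided denoising loop, using the same placeholder names as above (the added-noise/variance term of the full DDPM update is omitted for brevity):

    import torch

    def iterative_denoise(unet, x, timesteps, alphas_cumprod, stride=30):
        """Denoise from a high timestep down to t = 0, skipping `stride` steps per update."""
        strided = list(range(timesteps - 1, 0, -stride)) + [0]
        for t, t_next in zip(strided[:-1], strided[1:]):
            abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
            alpha = abar_t / abar_next                  # per-step alpha for the strided jump
            beta = 1 - alpha
            eps_hat = unet(x, t)                        # predicted noise
            x0_hat = (x - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
            # Blend the clean estimate with the current sample (DDPM posterior mean).
            x = (abar_next.sqrt() * beta / (1 - abar_t)) * x0_hat \
                + (alpha.sqrt() * (1 - abar_next) / (1 - abar_t)) * x
        return x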

Denoising begins
Beginning of denoising
Midway through denoising
Middle of denoising
Denoising completed
End of denoising
Final comparison to classical methods
Final comparison with classical denoising

Part 5. Diffusion Model Sampling

In this part, we used the trained diffusion model to generate five random images. These images were sampled starting from pure noise and progressively denoised to create realistic outputs. This demonstrates the model's ability to synthesize diverse images based on the learned data distribution.

Five Randomly Generated Images
Five randomly generated images using the diffusion model

Part 6. Classifier-Free Guidance (CFG)

To improve the quality of generated images, we implemented Classifier-Free Guidance (CFG). This technique combines a conditional noise estimate based on a text prompt with an unconditional noise estimate, weighted by a guidance parameter γ. The new noise estimate is computed as:

ϵ = ϵu + γ(ϵc − ϵu)

By setting γ > 1, we achieve significantly higher-quality images at the expense of diversity. For γ = 0, the model generates unconditional results, and for γ = 1, it creates conditional results. The images below demonstrate the improvement in quality using CFG with γ = 7.
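A minimal sketch of the guidance step, assuming a `unet(x_t, t, prompt_embedding)` callable (a placeholder signature, not the actual API used in the project):

    def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma=7.0):
        """Classifier-free guidance: push the noise estimate toward the conditional one."""
        eps_c = unet(x_t, t, cond_emb)      # conditional estimate (text prompt)
        eps_u = unet(x_t, t, uncond_emb)    # unconditional estimate (e.g. empty prompt)
        return eps_u + gamma * (eps_c - eps_u)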

Images generated using CFG with γ=7
Images generated using Classifier-Free Guidance (γ=7)

Part 7. Image-to-Image Translation

In this part, we explored the SDEdit algorithm to make creative edits to an image. By adding noise to a real image and denoising it using the diffusion model, we force the noisy image back onto the manifold of natural images. The extent of the edit depends on the noise level—higher noise levels lead to larger edits, while lower noise levels retain more of the original structure.

Using the prompt "a high quality photo", we applied this technique to the original test image at noise levels [1, 3, 5, 7, 10, 20]. The results demonstrate how the image transitions from noisy to more refined, with creative edits made by the model.
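A sketch of the SDEdit step, again with placeholder names (`unet_denoise` stands in for the iterative denoising loop starting at timestep t):

    import torch

    def sdedit(x_orig, t, unet_denoise, alphas_cumprod):
        """Noise a real image up to timestep t, then let the model denoise it back."""
        abar_t = alphas_cumprod[t]
        eps = torch.randn_like(x_orig)
        x_t = abar_t.sqrt() * x_orig + (1 - abar_t).sqrt() * eps   # forward process
        # More noise (a larger t) gives the model more freedom, hence larger edits.
        return unet_denoise(x_t, t)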

Results: Original Test Image

Edits at the first noise levels
Edits at the last noise levels

Results: Custom Test Images: Muhammad Ali

We repeated the process on two additional test images using the same noise levels. Below are the results for both custom images.

Muhammad Ali: early noise levels
Muhammad Ali: last noise levels
Muhammad Ali: original image
Second custom image: early noise levels
Second custom image: last noise levels
Second custom image: original image

Part 7.1 Image-to-Image Translation for Hand-Drawn Images

In this part, we apply the same image-to-image translation technique, but with hand-drawn images as input. Below are one hand-drawn image from the web and two images drawn by me.

Results: Web Image

Original Web Image
Original Image
Edited Web Image
Edit at Different Noise Levels

Results: Custom Test Images - Hand-Drawn

Original Drawing of Snoopy
Original Drawing of Snoopy
Edited Drawing of Snoopy
Edit at Different Noise Levels
Original Hand-Drawn Image
Original Drawing
Edited Hand-Drawn Image
Edit at Different Noise Levels

Part 7.2 Inpainting

In this section, we explore inpainting by following a method inspired by the RePaint paper. Given an image (xorig) and a binary mask (m), we create a new image that preserves the original content where the mask is 0 and introduces new content where the mask is 1.

To achieve this, we use the diffusion denoising loop with a small modification. At each denoising step, after obtaining xt, we force it to retain the original pixels wherever m is 0, so everything outside the mask stays intact while the masked region is regenerated. The process is defined as:

xt ← m * xt + (1 − m) * forward(xorig, t)

By iteratively applying this approach, the diffusion model fills in the masked area with new content while preserving the rest of the image. We applied this technique to inpaint the top of the Campanile using a custom mask.
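A sketch of this per-step correction, with the same placeholder names as before (`mask` is 1 where new content should be generated):

    import torch

    def inpaint_step(x_t, t, mask, x_orig, alphas_cumprod):
        """After each denoising update, re-impose the original pixels where mask == 0."""
        abar_t = alphas_cumprod[t]
        eps = torch.randn_like(x_orig)
        x_orig_t = abar_t.sqrt() * x_orig + (1 - abar_t).sqrt() * eps   # forward(x_orig, t)
        return mask * x_t + (1 - mask) * x_orig_t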

Results: Inpainting the Campanile

Stages of the inpainting process
Inpainted Image of the Campanile

Results: Custom Inpainting on Two Additional Images

We further experimented with inpainting on two custom images, using different masks to replace selected areas of each image.

Original image
Inpainted image
Inpainting stages
Inpainted image

Part 7.3 Text-Conditioned Image-to-Image Translation

In this part, we perform image-to-image translation with guidance from a text prompt. By combining the model's inpainting ability with language control, we can adjust the output to match a specific description. This goes beyond simple "projection to the natural image manifold" by incorporating semantic information from the prompt.

To achieve this, we modify the prompt from "a high quality photo" to a descriptive phrase, applying different noise levels [1, 3, 5, 7, 10, 20]. The resulting images should maintain characteristics of the original image while aligning with the text prompt.

Results: Text-Conditioned Edits of the Test Image with the Prompt "a rocket ship"

Low noise levels
High noise levels

Results: Text-Conditioned Edits on Custom Images

We applied the same process to two custom images using the prompts "a photo of a hipster barista" and "a pencil" at varying noise levels. The outputs show a blend of the original image (a banana) with the characteristics described in each prompt.

Pencil banana, high noise
Pencil banana, low noise
Barista banana, high noise
Barista banana, low noise

Part 8 Visual Anagrams

In this part, we use diffusion models to create visual anagrams—images that show two different scenes when viewed upright and flipped upside down. Our goal is to create an image that appears as "an oil painting of an old man" when viewed normally, and as "an oil painting of people around a campfire" when flipped.

To achieve this effect, we perform diffusion denoising on an image xt at a particular timestep t, with two different prompts. First, we denoise the image using the prompt "an oil painting of an old man", obtaining a noise estimate ϵ1. Next, we flip xt upside down and denoise it with the prompt "an oil painting of people around a campfire", resulting in ϵ2. We then flip ϵ2 back to its original orientation and average it with ϵ1 to create the final noise estimate ϵ.

Algorithm Steps

The steps for creating a visual anagram are as follows:

  1. Generate ϵ1 = UNet(xt, t, p1) using the prompt "an oil painting of an old man".
  2. Flip xt upside down and generate ϵ2 = flip(UNet(flip(xt), t, p2)) using the prompt "an oil painting of people around a campfire".
  3. Average the two noise estimates: ϵ = (ϵ1 + ϵ2) / 2.
  4. Perform a reverse diffusion step with the averaged noise estimate ϵ to obtain the final image.
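Under the same placeholder `unet(x, t, prompt)` signature as above, the combined noise estimate can be sketched as:

    import torch

    def anagram_noise_estimate(unet, x_t, t, p1, p2):
        """Average an upright noise estimate with a flipped one (flip along the height axis)."""
        eps1 = unet(x_t, t, p1)                                    # prompt 1, upright
        eps2_flipped = unet(torch.flip(x_t, dims=[-2]), t, p2)     # prompt 2, on the flipped image
        eps2 = torch.flip(eps2_flipped, dims=[-2])                 # flip the estimate back
        return (eps1 + eps2) / 2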

Results: Visual Anagrams

Below are examples of visual anagrams created using this method. The first image appears as "an oil painting of an old man" when viewed normally and as "an oil painting of people around a campfire" when flipped. We also include two additional anagram illusions that reveal different images when flipped upside down.

Anagram 1: "an oil painting of an old man" / "an oil painting of people around a campfire"
Anagram 2: "an oil painting of a snowy mountain village" / "an oil painting of people around a campfire"
Anagram 3: "an oil painting of a snowy mountain village" / "an oil painting of an old man"

Part 9 Hybrid Images

In this part, we implement Factorized Diffusion to create hybrid images, inspired by techniques from project 2. A hybrid image combines different elements that appear as one thing when viewed from a distance and something else up close. Here, we use a diffusion model to blend low and high frequencies from two different text prompts, creating a composite effect.

The process involves estimating noise using two prompts, combining the low frequencies from one estimate and the high frequencies from the other. This approach allows us to create an image that looks like one subject from afar and transforms into a different subject upon closer inspection.

Algorithm Steps

The steps for generating a hybrid image are as follows:

  1. Generate the first noise estimate, ϵ1 = UNet(xt, t, p1), with the first prompt.
  2. Generate the second noise estimate, ϵ2 = UNet(xt, t, p2), with the second prompt.
  3. Apply a low-pass filter to ϵ1 and a high-pass filter to ϵ2:
    ϵ = f_lowpass(ϵ1) + f_highpass(ϵ2).
  4. Use the composite noise estimate ϵ to complete a reverse diffusion step and obtain the final hybrid image.

For this part, we use a Gaussian blur with a kernel size of 33 and sigma of 2 for the low-pass filter, which smooths out the details in the first noise estimate, while the high-pass filter captures the finer details from the second estimate.
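A sketch of the composite estimate with those filter settings (placeholder `unet` signature as before; the high-pass is taken as the estimate minus its own blur):

    import torch
    import torchvision.transforms.functional as TF

    def hybrid_noise_estimate(unet, x_t, t, p1, p2, kernel_size=33, sigma=2.0):
        """Low frequencies of the first prompt's noise estimate + high frequencies of the second's."""
        eps1 = unet(x_t, t, p1)
        eps2 = unet(x_t, t, p2)
        low = TF.gaussian_blur(eps1, kernel_size=kernel_size, sigma=sigma)            # low-pass
        high = eps2 - TF.gaussian_blur(eps2, kernel_size=kernel_size, sigma=sigma)    # high-pass
        return low + high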

Results: Hybrid Images

Below are examples of hybrid images created with this technique. The first image appears as a skull from afar and as a waterfall up close. We also include two additional hybrid illusions that transform depending on viewing distance.

Hybrid Image - Skull / Waterfall
Hybrid Image: Skull from afar, Waterfall up close
Hybrid Image - Custom 1
Hybrid Image: A photo of a dog and a man in a hat
Hybrid Image - Custom 2
Hybrid Image: A man with a hat and a waterfall

Final remarks for part A

I thought this first part of the project was really cool. It is very interesting how the images are generated, and I can't wait for Part B, which I assume will be even cooler.

Part B

Warm-Up: Building a One-Step Denoiser

To start, we train a simple one-step denoiser Dθ that maps a noisy image xnoisy to a clean image xclean. The L2 loss is optimized as:

ℒ = ||Dθ(xnoisy) - xclean||²

1.1 Implementing the UNet

We use a UNet for the denoiser. It consists of downsampling and upsampling blocks with skip connections, capturing both global and local features. This structure is ideal for image-to-image tasks like denoising.

UNet Architecture
UNet Architecture for Denoising

1.2 Denoising Problem and Data Generation

In this part, we implement and visualize the noising process for generating training data. The objective is to train a denoiser to map noisy images xnoisy to clean images xclean, minimizing the L2 loss:

ℒ = ||Dθ(xnoisy) - xclean||²

To achieve this, clean MNIST digits xclean are noised by adding Gaussian noise, xnoisy = xclean + σ·ϵ with ϵ ∼ N(0, I), to create training pairs (xnoisy, xclean).
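A one-line sketch of this data generation (assuming the images are already tensors scaled to the range used for training):

    import torch

    def add_noise(x_clean, sigma):
        """Create a training pair by adding Gaussian noise: x_noisy = x_clean + sigma * eps."""
        return x_clean + sigma * torch.randn_like(x_clean)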

Results: Visualization of the Noising Process

Below is the output of the implemented noising process applied to a normalized MNIST digit. The images show the effect of increasing noise levels:

Noising Process Visualization
Progressive noising of a normalized MNIST digit

For training, we used the (xnoisy, xclean) image pairs. I used a batch size of 256, 5 epochs, σ = 0.5, and the Adam optimizer with a learning rate of 1e-4. Below you can find the loss during training.
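A sketch of a training loop matching those hyperparameters (the dataset transform and device handling are illustrative assumptions, not the exact setup used):

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def train_denoiser(model, sigma=0.5, epochs=5, batch_size=256, lr=1e-4, device="cpu"):
        """Train the one-step denoiser with the L2 loss on (noisy, clean) MNIST pairs."""
        data = datasets.MNIST(root="data", train=True, download=True,
                              transform=transforms.ToTensor())
        loader = DataLoader(data, batch_size=batch_size, shuffle=True)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        losses = []
        for _ in range(epochs):
            for x_clean, _ in loader:                   # class labels are unused here
                x_clean = x_clean.to(device)
                x_noisy = x_clean + sigma * torch.randn_like(x_clean)
                loss = ((model(x_noisy) - x_clean) ** 2).mean()   # L2 loss
                opt.zero_grad()
                loss.backward()
                opt.step()
                losses.append(loss.item())
        return losses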

Loss per epoch during training
Loss per iteration during training

The results from training can be seen below, where I show the performance after 1 epoch and after 5 epochs. As you can see, the model becomes much better after 5 epochs.

Results after 1 epoch of training
Results after the 5th and final epoch of training

The model was, as stated, trained with σ = 0.5, but it is interesting to see how well it performs with higher and lower values of σ, i.e., with more or less noise than seen during training. The image below shows that performance.

Results for different values of σ when the model was trained with σ = 0.5

Part 2. Time and Class Conditioning

Previously, our UNet model predicted the clean image. In this section, we update the approach to predict the noise ϵ added to the image instead. This allows us to start with pure noise ϵ ∼ N(0, I) and iteratively denoise to generate a realistic image x.

Combining Time and Class Conditioning

Instead of implementing time conditioning first and then adding class conditioning, we simultaneously condition the UNet on both the timestep t and the class of the digit. Using the equation:

xt = √(ᾱt) x0 + √(1 − ᾱt) ϵ,   ϵ ∼ N(0, I)

we generate a noisy image xt from x0 for a timestep t∈{0,1,…,T}. When t=0, xt is clean, and when t=T, xt is pure noise. For intermediate values of t, xt is a linear combination of the clean image and noise. The derivations for β, αt, and ᾱt follow the DDPM paper, and we set T=400 due to the simplicity of our dataset.

To handle time conditioning, we integrate fully connected layers to embed t into the UNet. For class conditioning, we use a one-hot vector to represent each digit class ({0,…,9}). We further add two fully connected layers to embed the class vector.
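A sketch of such an embedding block and one way it might be used, under the assumption that the resulting vectors modulate intermediate feature maps (the exact placement inside the UNet follows the architecture figure below):

    import torch
    import torch.nn as nn

    class FCBlock(nn.Module):
        """Small MLP that embeds a conditioning signal (e.g. t/T or a one-hot class vector)."""
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                     nn.Linear(out_dim, out_dim))

        def forward(self, c):
            return self.net(c)

    # Inside the UNet, the embeddings are broadcast over the spatial dimensions, e.g.
    # feature = class_embed[:, :, None, None] * feature + time_embed[:, :, None, None]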

Updated UNet Architecture

Below are the updated UNet architecture, which includes both time and class conditioning, and the new training algorithm:

UNet Architecture with Time and Class Conditioning
Training algorithm with Time and Class Conditioning

Training Algorithm

The training process involves generating noisy images xt at random timesteps, computing the one-hot class vectors, and training the UNet to predict ϵ. Class conditioning gives better control over the generated digits, while time conditioning enables iterative denoising. As the results below show, after the twentieth epoch the model works very well and produces clean, correct, and detailed digits.

Results from the Time- and Class-Conditioned UNet with Guidance Scale 5.0

Training loss per iteration
Results after one epoch of training
Results after five epochs of training
Results after the 20th and final epoch of training

Comparing with the Time-Conditioned-Only UNet

If we now look only at the time-conditioned UNet, i.e., with no class conditioning, the results are as presented below. As we can see, the digits are not as good without classifier-free guidance.

Results from the Time-Conditioned UNet, No Classes

Training loss per iteration
Results after one epoch of training
Results after five epochs of training
Results after the 20th and final epoch of training

Final remarks

Very cool and fun project. The most interesting part was definitely seeing how adding classes (CFG) increased the performance of our model, and how easy the change was to make.