For this part, we set up the diffusion model with a random seed of 185, which is used throughout the project. In part A, we work with images of size 64x64 pixels, which may not look very sharp; in part B, the images improve significantly. Below, we show results generated with different numbers of inference steps.
Adding noise progressively to an image is key to the diffusion process: it lets the model learn how to reverse the noise and recover the clean image. Below are examples of the test image at various noise levels:
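The noising step is just a weighted combination of the clean image and fresh Gaussian noise. Below is a minimal sketch of this forward process; the function name `forward` and the schedule name `alphas_cumprod` (the cumulative ᾱ values) are assumptions for illustration, not necessarily the exact code used for the figures.

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image to timestep t: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps."""
    alpha_bar = alphas_cumprod[t]        # cumulative alpha for this timestep
    eps = torch.randn_like(im)           # eps ~ N(0, I)
    return alpha_bar.sqrt() * im + (1 - alpha_bar).sqrt() * eps
```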
Gaussian blurring was used to denoise the images. This method is not effective, as seen below:
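For reference, this baseline is simply a Gaussian blur of the noisy image. A sketch using torchvision is below; the kernel size and sigma here are illustrative, not the exact values used for the figures.

```python
import torchvision.transforms.functional as TF

# Classical baseline: blur away the noise (and, unavoidably, the detail).
blurred = TF.gaussian_blur(noisy_image, kernel_size=5, sigma=2.0)
```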
A U-Net estimates the Gaussian noise, which we then remove. The model takes the timestep as an additional input, which makes it easier to determine how much noise to remove:
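A minimal sketch of this one-step denoising is below. It assumes `unet(x_t, t, prompt_embeds)` returns the predicted noise ϵ and that `alphas_cumprod` is the ᾱ schedule from the forward process above.

```python
eps = unet(x_t, t, prompt_embeds)                                 # estimate the noise in x_t
alpha_bar = alphas_cumprod[t]
x0_est = (x_t - (1 - alpha_bar).sqrt() * eps) / alpha_bar.sqrt()  # solve the forward equation for x_0
```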
Iterative denoising works better than classical methods. To speed up the process, we denoise with a stride of 30 timesteps rather than at every timestep. Below is the progression of the denoising and the final comparison to classical methods:
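Below is a minimal sketch of the strided denoising loop (the added variance term is omitted for brevity). It assumes `strided_timesteps` runs from the starting noise level down to 0 in steps of 30, `x_t` starts as the noisy image at the first of those timesteps, and `unet` predicts the noise as above.

```python
for i in range(len(strided_timesteps) - 1):
    t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
    a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha = a_bar / a_bar_prev
    beta = 1 - alpha

    eps = unet(x_t, t, prompt_embeds)                             # predicted noise
    x0_est = (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()      # current clean-image estimate

    # One strided DDPM step: interpolate between the clean estimate and the noisy image.
    x_t = (a_bar_prev.sqrt() * beta / (1 - a_bar)) * x0_est \
        + (alpha.sqrt() * (1 - a_bar_prev) / (1 - a_bar)) * x_t
```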
In this part, we used the trained diffusion model to generate five random images. These images were sampled starting from pure noise and progressively denoised to create realistic outputs. This demonstrates the model's ability to synthesize diverse images based on the learned data distribution.
To improve the quality of generated images, we implemented Classifier-Free Guidance (CFG). This technique combines a conditional noise estimate, based on a text prompt, with an unconditional noise estimate, weighted by a guidance scale γ. The new noise estimate is computed as:

ϵ = ϵ_u + γ(ϵ_c − ϵ_u)

By setting γ > 1, we obtain significantly higher-quality images at the expense of diversity. For γ = 0 the model generates unconditional results, and for γ = 1 it generates conditional results. The images below demonstrate the improvement in quality using CFG with γ = 7.
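In code, CFG needs only two forward passes per step and one extrapolation. The sketch below assumes `cond_embeds` and `uncond_embeds` are the text-prompt and null-prompt embeddings.

```python
eps_c = unet(x_t, t, cond_embeds)       # conditional noise estimate
eps_u = unet(x_t, t, uncond_embeds)     # unconditional noise estimate
eps = eps_u + gamma * (eps_c - eps_u)   # gamma > 1 extrapolates past the conditional estimate
```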
In this part, we explored the SDEdit algorithm to make creative edits to an image. By adding noise to a real image and denoising it using the diffusion model, we force the noisy image back onto the manifold of natural images. The extent of the edit depends on the noise level—higher noise levels lead to larger edits, while lower noise levels retain more of the original structure.
Using the prompt "a high quality photo", we applied this technique to the original test image at noise levels [1, 3, 5, 7, 10, 20]. The results demonstrate how the image transitions from noisy to more refined, with creative edits made by the model.
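A minimal sketch of the SDEdit loop over these noise levels is below. It assumes `forward` re-noises the image to the timestep at index `i_start` and `iterative_denoise` runs the CFG denoising loop from that index back to a clean image.

```python
edits = []
for i_start in [1, 3, 5, 7, 10, 20]:
    t_start = strided_timesteps[i_start]
    x_noisy = forward(test_image, t_start, alphas_cumprod)   # push the image off the natural-image manifold
    edits.append(iterative_denoise(x_noisy, i_start))        # let the model pull it back on
```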
We repeated the process on two additional test images using the same noise levels. Below are the results for one of the custom images.
In this part, we apply the same image-to-image translation technique but use hand-drawn images as input. Below is one hand-drawn image from the web and two images drawn by me.
In this section, we explore inpainting, following a method inspired by the RePaint paper. Given an image x_orig and a binary mask m, we create a new image that preserves the original content where the mask is 0 and introduces new content where the mask is 1.

To achieve this, we use the diffusion denoising loop with a small modification. At each denoising step, after obtaining x_t, we force it to retain the original pixels wherever m is 0, so only the masked region receives new content. The update is:

x_t ← m * x_t + (1 − m) * forward(x_orig, t)
By iteratively applying this approach, the diffusion model fills in the masked area with new content while preserving the rest of the image. We applied this technique to inpaint the top of the Campanile using a custom mask.
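A minimal sketch of the modified loop is below, assuming `denoise_step` performs one CFG denoising step of the loop from before and `mask` is m (1 where new content should appear).

```python
for i in range(i_start, len(strided_timesteps) - 1):
    t_next = strided_timesteps[i + 1]
    x_t = denoise_step(x_t, i)                                               # propose new content everywhere
    x_t = mask * x_t + (1 - mask) * forward(x_orig, t_next, alphas_cumprod)  # keep the original pixels where m = 0
```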
We further experimented with inpainting on two custom images, using different masks to replace selected areas of each image.
In this part, we perform image-to-image translation with guidance from a text prompt. By combining the SDEdit procedure with language control, we can adjust the output to match a specific description. This goes beyond simple "projection to the natural image manifold" by incorporating semantic information from the prompt.
To achieve this, we modify the prompt from "a high quality photo" to a descriptive phrase, applying the same noise levels [1, 3, 5, 7, 10, 20]. The resulting images maintain characteristics of the original image while aligning with the text prompt.

We applied the same process to two custom images, using the prompts "a photo of a hipster barista" and "a pencil" with varying noise levels. The outputs blend the original images (one of them a banana) with the characteristics described in the prompts.
In this part, we use diffusion models to create visual anagrams—images that show two different scenes when viewed upright and flipped upside down. Our goal is to create an image that appears as "an oil painting of an old man" when viewed normally, and as "an oil painting of people around a campfire" when flipped.
To achieve this effect, we perform diffusion denoising on an image x_t at a particular timestep t with two different prompts. First, we denoise the image using the prompt "an oil painting of an old man", obtaining a noise estimate ϵ_1. Next, we flip x_t upside down and denoise it with the prompt "an oil painting of people around a campfire", resulting in ϵ_2. We then flip ϵ_2 back to its original orientation and average it with ϵ_1 to create the final noise estimate ϵ.
The steps for creating a visual anagram are as follows:

1. Generate ϵ_1 = UNet(x_t, t, p_1) using the prompt "an oil painting of an old man".
2. Flip x_t upside down and generate ϵ_2 = flip(UNet(flip(x_t), t, p_2)) using the prompt "an oil painting of people around a campfire".
3. Average the two estimates: ϵ = (ϵ_1 + ϵ_2) / 2.
4. Perform a reverse diffusion step with ϵ to obtain the final image.
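A minimal sketch of the combined noise estimate is below (CFG is omitted for clarity); `p1` and `p2` are the two prompt embeddings, and flipping along the height axis turns the image upside down.

```python
eps1 = unet(x_t, t, p1)                                # "an oil painting of an old man", upright
x_flipped = torch.flip(x_t, dims=[-2])                 # flip the image upside down
eps2 = torch.flip(unet(x_flipped, t, p2), dims=[-2])   # campfire prompt on the flipped image, flipped back
eps = (eps1 + eps2) / 2                                # averaged estimate used for the reverse step
```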
Below are examples of visual anagrams created using this method. The first image appears as "an oil painting of an old man" when viewed normally and as "an oil painting of people around a campfire" when flipped. We also include two additional anagram illusions that reveal different images when flipped upside down.
In this part, we implement Factorized Diffusion to create hybrid images, inspired by techniques from project 2. A hybrid image combines different elements that appear as one thing when viewed from a distance and something else up close. Here, we use a diffusion model to blend low and high frequencies from two different text prompts, creating a composite effect.
The process involves estimating noise using two prompts, combining the low frequencies from one estimate and the high frequencies from the other. This approach allows us to create an image that looks like one subject from afar and transforms into a different subject upon closer inspection.
The steps for generating a hybrid image are as follows:

1. Generate ϵ_1 = UNet(x_t, t, p_1) with the first prompt.
2. Generate ϵ_2 = UNet(x_t, t, p_2) with the second prompt.
3. Apply a low-pass filter to ϵ_1 and a high-pass filter to ϵ_2, then combine them: ϵ = f_lowpass(ϵ_1) + f_highpass(ϵ_2).
4. Use ϵ to complete a reverse diffusion step and obtain the final hybrid image.

For this part, we use a Gaussian blur with a kernel size of 33 and a sigma of 2 for the low-pass filter, which smooths out the details in the first noise estimate, while the high-pass filter captures the finer details from the second estimate.
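A minimal sketch of the combined noise estimate is below; `p1` and `p2` are the two prompt embeddings, and the filters are the Gaussian blur described above and its residual.

```python
import torchvision.transforms.functional as TF

eps1 = unet(x_t, t, p1)                                          # prompt seen from afar
eps2 = unet(x_t, t, p2)                                          # prompt seen up close
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)          # f_lowpass(eps1)
high = eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0)  # f_highpass(eps2)
eps = low + high                                                 # hybrid noise estimate
```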
Below are examples of hybrid images created with this technique. The first image appears as a skull from afar and as a waterfall up close. We also include two additional hybrid illusions that transform depending on viewing distance.
I thought this first part of the project was really cool. It is very interesting how the images are generated, and I can't wait for part B, which I assume will be even cooler.
To start, we train a simple denoiser D_θ that maps a noisy image x_noisy to a clean image x_clean by optimizing the L2 loss:

ℒ = ||D_θ(x_noisy) − x_clean||²
We use a UNet for the denoiser. It consists of downsampling and upsampling blocks with skip connections, capturing both global and local features. This structure is ideal for image-to-image tasks like denoising.
In this part, we implement and visualize the noising process used to generate training data. The objective is to train a denoiser D_θ to map noisy images x_noisy to clean images x_clean, minimizing the L2 loss:

ℒ = ||D_θ(x_noisy) − x_clean||²

To achieve this, clean MNIST digits x_clean are progressively noised to create training pairs (x_noisy, x_clean).
Below is the output of the implemented noising process applied to a normalized MNIST digit. The images show the effect of increasing noise levels:
For training, we used the (x_noisy, x_clean) image pairs. I used a batch size of 256, 5 epochs, σ = 0.5, and the Adam optimizer with a learning rate of 1e-4. Below you can find the loss curve during training.
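A minimal sketch of the training loop with these hyperparameters is below; it assumes `unet` is the denoiser D_θ, `train_loader` yields batches of clean MNIST digits, and that noise is added on the fly for each batch.

```python
import torch

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
sigma = 0.5

for epoch in range(5):
    for x_clean, _ in train_loader:                            # class labels are unused here
        x_noisy = x_clean + sigma * torch.randn_like(x_clean)  # fresh noise each batch
        loss = ((unet(x_noisy) - x_clean) ** 2).mean()         # L2 loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```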
The results from training can be seen below, where I show the denoiser's performance after 1 epoch and after 5 epochs. As you can see, the model becomes much better after 5 epochs.

The model was, as stated, trained with σ = 0.5, but it is interesting to see how well it performs with higher and lower values of σ, i.e. more or less noise than it saw during training. The image below shows that performance.
Previously, our UNet model predicted the clean image. In this section, we update the approach to predict the noise ϵ added to the image instead. This allows us to start from pure noise ϵ ∼ N(0, I) and iteratively denoise it to generate a realistic image x.
Instead of implementing time conditioning first and then adding class conditioning, we condition the UNet on both the timestep t and the class of the digit simultaneously. Using the equation

x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ϵ,  ϵ ∼ N(0, I)

we generate a noisy image x_t from x_0 for a timestep t ∈ {0, 1, …, T}. When t = 0, x_t is clean, and when t = T, x_t is pure noise. For intermediate values of t, x_t is a linear combination of the clean image and noise. The derivations of β, α_t, and ᾱ_t follow the DDPM paper, and we set T = 400 due to the simplicity of our dataset.
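For concreteness, a sketch of these quantities and the forward process is below, assuming a standard linear β schedule (the endpoint values are illustrative).

```python
import torch

T = 400
betas = torch.linspace(1e-4, 0.02, T + 1)        # beta_t
alphas = 1.0 - betas                             # alpha_t = 1 - beta_t
alphas_cumprod = torch.cumprod(alphas, dim=0)    # cumulative product, a_bar_t

def noise_image(x0, t):
    """Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps (t may be a batch)."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)                   # eps ~ N(0, I), also the training target
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps
```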
To handle time conditioning, we integrate fully connected layers that embed t into the UNet. For class conditioning, we use a one-hot vector to represent each digit class ({0, …, 9}) and add two further fully connected layers to embed the class vector.
Below is the updated UNet architecture, which includes both time and class conditioning as well as the new training algorithm used:
The training process involves generating noisy images x_t for random timesteps, computing their one-hot class vectors, and training the UNet to predict ϵ. The addition of class conditioning gives better control over the generated images, while time conditioning enables iterative denoising. As the results presented below show, after the twentieth epoch the model works very well and produces clean, correct, and detailed digits.
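A minimal sketch of this training loop is below, reusing `noise_image` from above. It assumes `cond_unet(x_t, t_norm, c)` predicts ϵ, that the class vector is zeroed with probability 0.1 so the model also learns the unconditional case, and an illustrative learning rate.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(cond_unet.parameters(), lr=1e-3)

for epoch in range(20):
    for x0, labels in train_loader:
        t = torch.randint(1, T + 1, (x0.shape[0],))          # a random timestep per image
        x_t, eps = noise_image(x0, t)                        # noisy input and the target noise
        c = F.one_hot(labels, num_classes=10).float()        # one-hot class vectors
        c = c * (torch.rand(c.shape[0], 1) > 0.1).float()    # drop the class for unconditional training
        eps_pred = cond_unet(x_t, t.float() / T, c)          # predict the added noise
        loss = ((eps_pred - eps) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```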
If we now look only at the time-conditioned UNet, i.e. without class conditioning, we get the results presented below. As we can see, the digits are not as good without classifier-free guidance.
Very cool and fun project. The most interesting part was definitely seeing how adding class conditioning (and CFG) improved the performance of our model, and how easy the change was to make.