Training a Dog → Cat Autoencoder


Imagine setting out to build something great: pouring your blood, sweat, and (computational) tears into an image autoencoder, built from scratch. Spending hours and hours tuning hyperparameters until everything fits perfectly and the loss just keeps going down and down and down.

Bonus points to anyone who can guess the LR scheduler

And after 12+ hours of arduous training, you realize you built a model that just replaces dog features with cat features.

  1. The results are at the bottom of the page. Feel free to skip to them; the rest of the article is more of a story of how I got here.

  2. Only slight dog features, e.g. the nose, were converted into more cat-like features.

The original has a more rounded nose, while the reconstruction has a more triangular nose (image search “cat noses” to see what I’m talking about).

The reason for this is most likely that the model was trained solely on cat images: while it did learn some general image compression, it also got good at compressing feline-specific features such as the nose, ears, etc. These two compressions (general-purpose and feline-specific) compete, and that competition caused this dog’s nose to turn out a little different in the reconstruction.

Here’s a short story about how I ended up needing ~30,000 GB of VRAM for a hobby project. Not kidding.

It was a different time back then: ChatGPT wasn’t half as popular as it is now, and most companies hadn’t even heard of the acronym “AI”, much less worried that it might destroy their business model.

I remember thinking: there’s no way a computer program could talk in natural language beyond predefined conversation paths. No way this “AI” could ever understand, or make decisions, or even just execute npm run dev on its own.

Right?

Then I saw this Veritasium video. And I built one of those rectangle-or-circle classifiers.

And it blew my mind.

Fast forward a couple of months. I learnt about backpropagation and convolution and attention. I read about diffusion models for image generation. Looked up a guide. Copy-pasted the code. Got a dataset of Forza Horizon 5 images. Hooked everything up. Whoops! Got an error. What is it? Let’s have a look…

RuntimeError: CUDA out of memory. Tried to allocate 29.06 TiB (...)

I don’t know about you, but I usually don’t have 30TB of VRAM lying around on my desk.
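For a sense of scale: I never pinned down the exact allocation that triggered it, but here’s the kind of back-of-envelope arithmetic that produces numbers like that when you run self-attention directly on pixels. The resolution below is a stand-in, not my actual config.

height, width = 1080, 1920              # one modest-resolution image
tokens = height * width                 # self-attention treats every pixel as a token
scores = tokens ** 2                    # pairwise attention scores
print(f"{scores * 4 / 2**40:.2f} TiB")  # 4 bytes per float32 score -> 15.64 TiB

That’s ~15.6 TiB for a single attention head at batch size 1; two heads and you’re right in the neighborhood of that error.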

Naturally, I Googled the error. I did the usual torch.cuda.empty_cache() and that obviously didn’t work. After hours of finagling, I finally realized my mistake.

I needed an autoencoder.

If you don’t already know, diffusion models are really resource-intensive for large images. Virtually all diffusion models available today, including Stable Diffusion and FLUX, are trained in a compressed latent space instead of normal pixel space.

Shamelessly ripped & modified from EQ-VAE. Left: original image. Right: latent version.

This latent space is constructed by training a specialized model called an autoencoder. You train a network to reconstruct an image, and over thousands of images, it learns to compress information into a smaller latent space. Instead of encoding the specifics of the pixels, the latent space encodes general information and structure (or at least, that’s the goal). The Stable Diffusion autoencoder achieves 48x compression, turning a 512×512×3 image into a 64×64×4 latent (786,432 values down to 16,384). That’s a 48x decrease in resources required to train the diffusion model!
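To make this concrete, here’s a minimal sketch of the idea in PyTorch. This is not my actual architecture: the layer sizes and the plain MSE loss are placeholders. It just shows the shape of the thing, an encoder that shrinks the image, a decoder that blows it back up, and a reconstruction loss pushing the pair to preserve whatever matters.

import torch
import torch.nn as nn

# Minimal convolutional autoencoder (illustrative only). Downsamples 8x
# spatially, like the Stable Diffusion VAE: 512x512x3 -> 64x64x4 = 48x.
class TinyAutoencoder(nn.Module):
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),               # 512 -> 256
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),              # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(64, latent_channels, 3, stride=2, padding=1), # 128 -> 64
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1),  # 64 -> 128
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),               # 128 -> 256
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),                # 256 -> 512
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

model = TinyAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.rand(8, 3, 512, 512)  # stand-in for a batch of cat photos
reconstruction = model(images)
loss = nn.functional.mse_loss(reconstruction, images)  # reconstruction error
loss.backward()
optimizer.step()
optimizer.zero_grad()

Once it’s trained, the decoder sits on the shelf during diffusion training: you encode the dataset once, train the diffusion model on 64×64×4 latents instead of 512×512×3 pixels, and only decode at sampling time.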

I got right to work on that: building my own autoencoder. This went on and off over the last couple of years: I’d learn something new, try to implement it, fail horribly and then give up, usually ending up with a horribly mangled mess.

Image of a cat or nightmare manifestations?

Then, a couple of weeks ago, I tried to actually understand what was going on instead of just copy-pasting: what all the loss functions did, what worked with what, and what had to be thrown out the window. And a little more than 12 hours of training gave me (drumroll please)…

Slightly blurry, not the sharpest details, but on the whole: pretty good!

The autoencoder was trained solely on cat images from this Kaggle dataset.
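(If you’re curious, wiring up a dataset like that is the easy part. Here’s an illustrative loader; the path and sizes are placeholders, not my exact pipeline.)

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(512),      # shorter edge to 512 px
    transforms.CenterCrop(512),  # square crop
    transforms.ToTensor(),       # HWC uint8 -> CHW float in [0, 1]
])
# ImageFolder expects one subfolder per class; an autoencoder just ignores the labels.
dataset = datasets.ImageFolder("data/cats", transform=transform)  # hypothetical path
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)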

Naturally, my first question was: how would it perform on images different from what it had seen during training?

While I do intend to share some secrets regarding autoencoder training shortly (subscribe to know when that comes out), here are some results for now. Originals on the left, reconstructions on the right.

This one’s an image from the test dataset.
This one’s interesting: image models struggle with text, and mine is no exception.
This one shows that the model has learnt some general-purpose tricks for image compression.
Just look at the eyes: non-existent in the reconstruction.
Again, the eyes: completely gone.
You see the borders at the edges? Likely convolution artifacts.
Pretty good, but perceptibly blurry.
This one goes from a happy dog to a smug dog.
The fur is smoothed out; otherwise, pretty good!
This one’s the same as above, just pixelated. Interesting reconstruction!
This one shows how art features were made softer and slightly more realistic (in my opinion).

A couple of these images were from Unsplash, others from various sources.

