Generating Human Faces with Variational Autoencoders


This is part 7 of my series. In the previous post, Denoising Images of Cats and Dogs with Autoencoders, we looked at autoencoders and implemented the RED model to denoise some pictures of cats and dogs using residual autoencoders.

In this post, we will explore variational autoencoders (VAE). We will go over the concepts needed to understand them, such as the KL-divergence and the reparameterization trick. After that, we will generate some MNIST digits and play around with the latent space. Lastly, we will implement a Convolutional VAE and generate some human faces.

In the last post, we showed how disorganised and sparse the latent space of vanilla autoencoders was. When we tried randomly sampling from the latent space, the results we got were poor. This is where variational autoencoders come in: they were specially designed for generative tasks.

VAEs are the generative version of autoencoders that seek to put the latent space in order and allow us to sample from a normal distribution to generate new images. If you have some pre-existing knowledge on autoencoders, you will only need to know the KL-divergence and the reparameterization trick to get up to speed with VAEs.

The General Outline Of VAEs

In general terms, VAEs try to map the input data into a continuous, normally distributed latent space and back again. This is done by an encoder and a decoder.

The encoder takes the input data and transforms it into a mean and a variance. We then use this mean and variance to generate a random sample through the reparameterization trick. This sample is then fed to the decoder, which generates the image.

The mean and variance produced by the encoder are passed through the KL-divergence loss, which measures how far the distribution they define is from a standard normal distribution (mean 0 and variance 1). The generated image is then compared to the original with a loss function such as Mean Squared Error (MSE) or Mean Absolute Error (MAE) to see how close it is; this term is often called the reconstruction error.

So the full loss function of a VAE can be seen as the combination of the two:

loss = MSE + KL

This is essentially the ELBO loss you see floating around in papers (strictly speaking, its negative, since we minimise it) without the fancy mathematical notation. As you can see, VAEs try to create a good reconstruction of the image while also forcing the latent space to be normally distributed.

Weights can be added to either of the two loss terms to favour one objective over the other; if you want a better organised latent space, just weight the KL-loss a bit more.
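
For example, here is a minimal sketch of how such a weighting might look in code. The `beta` factor and the variable names (`decoded`, `originals`, `mu`, `log_var`) are illustrative placeholders, not part of the original implementation:

import torch
import torch.nn.functional as F

# decoded, originals, mu and log_var are assumed to come from a VAE forward pass
beta = 2.0  # beta > 1 weights the KL term more, favouring a tidier latent space

reconstruction = F.mse_loss(decoded, originals)                  # reconstruction error
kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())  # distance from N(0, 1)

loss = reconstruction + beta * kl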

KL-divergence And Ordering Latent Space

KL-divergence is a statistical measure of how different two distributions are from each other. The closer the two distributions are, the nearer the result is to zero: the output is always non-negative, with 0 meaning the distributions are identical and larger values meaning they are increasingly different. It is also asymmetrical, so

KL(a, b) ≠ KL(b, a)

The discrete version of the formula is easy to understand, but we will be using the continuous version, which involves the normal probability density function. In this post, we will not go too deep into the formula, as it is not the point; the main focus is the model.

If you want a deeper dive into KL-divergence, ritvikmath and Adian Liusie have amazing videos describing how the algorithm works.
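
For reference, the closed-form expression we will implement is the KL-divergence between the encoder's Gaussian N(μ, σ²) and the standard normal N(0, 1):

KL(N(μ, σ²) || N(0, 1)) = -0.5 * (1 + log σ² - μ² - σ²)

This is summed (or averaged) over the latent dimensions, and it is exactly what the code computes from μ and log σ².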

The code for our KL-divergence is below:

import torch
from torch import nn

class KL_Divergence(nn.Module):
    def __init__(self, reduction="mean"):
        super().__init__()
        self.reduction = reduction

    def forward(self, mu, log_var):
        # closed-form KL between N(mu, sigma^2) and N(0, 1), summed over the latent dimensions
        loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
        # reduce over the batch
        if self.reduction == "sum":
            return loss.sum()
        return loss.mean()

The Reparameterization Trick And Random Sampling

The reparameterization trick was created to solve the problem of randomly sampling from a normal distribution while still being able to do backpropagation. Backpropagation requires a deterministic, differentiable path from the loss back to the parameters we want to update; in other words, we cannot backpropagate through a purely random or stochastic operation.

To solve this issue, the reparameterization trick was created. The formula for the trick is below: μ and σ are generated by the encoder, and ε is a random vector sampled from a standard normal distribution with the same size as the latent space (in our case μ, σ and Z are all vectors of the same size). Because the randomness is isolated in ε, we can backpropagate through μ and σ to get the derivatives needed to update the weights of the model.

Z = μ + σ * ϵ

The code for the reparameterization trick that we will use in this post is below. In our case the encoder outputs the log-variance, so we first convert it to a standard deviation and then sample using the produced μ and σ.

def reparameterize(mu, log_var):
    std = torch.exp(0.5 * log_var)                    # log-variance -> standard deviation
    eps = torch.randn_like(std, requires_grad=False)  # random noise from N(0, 1)
    return mu + torch.mul(eps, std)

So, in essence, the reparameterization trick just allows us to perform backpropagation when a random or stochastic variable is involved.

Creating New Digits

In this section, we will generate some MNIST digits with variational autoencoders. We will create a simple fully connected VAE to showcase some of the concepts we just learnt.

The full code for this section can be found in the Kaggle notebook. We will only be focusing on the model itself, not including the code for the training loop, dataset, and graphs.

We will start with our encoder, which is just a fully connected linear layer. The output of this layer is then fed into separate mean and log-variance layers.

class Encoder(nn.Module):

    def __init__(self, layer_size, latent_dim):
        super().__init__()
        self.flat = nn.Flatten()
        self.dense_1 = nn.Linear(28*28, layer_size)
        self.relu = nn.ReLU()
        self.mu = nn.Linear(layer_size, latent_dim)
        self.logVar = nn.Linear(layer_size, latent_dim)

    def forward(self, x):
        x = self.flat(x)
        x = self.dense_1(x)
        x = self.relu(x)

        # mean and log-variance of the latent distribution
        mu = self.mu(x)
        log_var = self.logVar(x)

        return mu, log_var

The decoder half of our model is just a mirror of the encoder, without the mean and log-variance layers. It takes in a latent vector and outputs the generated image.

class Decoder(nn.Module):

    def __init__(self, layer_size, latent_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, layer_size),
            nn.ReLU(inplace=True),
            nn.Linear(layer_size, 28*28),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.model(x)

The full VAE model below just combines the encoder and decoder: it reparameterizes the mean and log-variance from the encoder to create a sample, which is then fed into the decoder to generate the image.

class VAE(nn.Module):

    def __init__(self, layer_size=16, latent_dim=2):
        super().__init__()
        self.encoder = Encoder(layer_size, latent_dim)
        self.decoder = Decoder(layer_size, latent_dim)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std, requires_grad=False)
        return mu + torch.mul(eps, std)

    def forward(self, x):
        mu, log_var = self.encoder(x)
        sample = self.reparameterize(mu, log_var)
        decoded = self.decoder(sample)
        return decoded, mu, log_var
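
Putting the pieces together, a single training step might look roughly like the sketch below. The names `vae`, `optimizer` and `images` are placeholders and the reductions are illustrative; the exact code lives in the Kaggle notebook:

bce = nn.BCELoss(reduction="sum")    # reconstruction term (the decoder ends in a Sigmoid)
kl = KL_Divergence(reduction="sum")  # the KL term defined earlier

decoded, mu, log_var = vae(images)   # images are expected to be scaled to [0, 1]
loss = bce(decoded, images.view(images.size(0), -1)) + kl(mu, log_var)

optimizer.zero_grad()
loss.backward()
optimizer.step()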

The model was trained for 20 epochs using binary cross-entropy as the reconstruction loss, and the generated images below are from the test set.

numbers

The model does a great job at reconstructing images given an input image, but a better test would be to see how it deals with generating random images. This is done by randomly sampling from a normal distribution and feeding this vector into the decoder. As you can see from the results below, some of these images are very blurry; this is one of the major drawbacks of variational autoencoders.

Grid of generated numbers
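
For reference, that random sampling step can be sketched as follows (assuming the trained `vae` above with its 2-dimensional latent space):

vae.eval()
with torch.no_grad():
    z = torch.randn(16, 2)                        # 16 random latent vectors from N(0, 1)
    generated = vae.decoder(z).view(-1, 28, 28)   # decode them into 28x28 images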

Messing Around With Our Model's Latent Space

In this section, we will explore the latent space of our model to see how it defines the relationships between different numbers. All the code for this is found in the Kaggle notebook.

We will use the numbers 3 and 0 to discover how the model defines their relationship. We will take their latent space representations, create a set of evenly spaced points between them, and decode each point into an image. This will show us which numbers appear between them in latent space.
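
A sketch of that interpolation is below; `image_3` and `image_0` stand for single MNIST images of shape (1, 28, 28) and are illustrative names, not the notebook's exact code:

vae.eval()
with torch.no_grad():
    # use the latent means as the two digits' representations
    mu_3, _ = vae.encoder(image_3.unsqueeze(0))
    mu_0, _ = vae.encoder(image_0.unsqueeze(0))

    # evenly spaced points on the line between the two latent vectors
    steps = torch.linspace(0, 1, 10).view(-1, 1)
    latents = mu_3 + steps * (mu_0 - mu_3)

    interpolated = vae.decoder(latents).view(-1, 28, 28)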

The results of the experiment are below, and as we can see, the model slowly converted the number 3 into some number closer to an 8 before becoming a 0. This means that somewhere in our latent space, 3, 0 and 8 have a relationship.

Linspace numbers

We can explore this idea further by plotting the latent space using t-SNE. This lets us see where certain numbers intersect other numbers in this ordered latent space.
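
One way such a plot can be produced is sketched below, assuming a batch of `test_images` and their `test_labels` (placeholder names; the notebook's exact code may differ):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

vae.eval()
with torch.no_grad():
    mu, _ = vae.encoder(test_images)      # latent means for the test digits

embedded = TSNE(n_components=2).fit_transform(mu.numpy())
plt.scatter(embedded[:, 0], embedded[:, 1], c=test_labels, cmap="tab10", s=5)
plt.colorbar()
plt.show()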

In the diagram below, we can see that 3, 5 and 8 are very close together in latent space. As the number 3 goes to 0, we see that it runs into the latent space of the numbers 5, 8 and 2.

MNIST number distribution

An interesting thing to note with our diagram is that 1's are very far away from the other numbers. 6's, 0's and 1's are often in their own clusters while the other numbers are mixed together. This shows how the model interpreted the relationship between all these numbers.

Note: When I trained the model with Tanh and Mean Squared Error, I got very bad results: the latent space was very disorganised and the generated images were poor. I recommend using binary cross-entropy with a Sigmoid output for training on the MNIST dataset.

Making New Faces

Let's move on to something a bit more fun. In this section, we will generate images of human faces using a convolutional VAE. We will be using the ffhq-face-dataset for the faces. All the code for this section can be found in the Kaggle notebook, as we will only be focusing on the model.

Our convolutional VAE encoder will consist of 5 convolutional blocks used to extract features from the images. These blocks use batch normalisation and LeakyReLU as the activation function.

class Encoder(nn.Module):

    def Encoder_block(self, in_channels, out_channels, stride=2, padding=1):
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, (3, 3), stride=stride, padding=padding),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def __init__(self, latent_dim):
        super().__init__()

        self.feature_extraction = nn.Sequential(
            self.Encoder_block(3, 32),
            self.Encoder_block(32, 64),
            self.Encoder_block(64, 128),
            self.Encoder_block(128, 256),
            self.Encoder_block(256, 512),
        )

        # non-trainable layers
        self.flat = nn.Flatten()
        self.relu = nn.LeakyReLU(0.2)

        # used for sampling
        self.dense = nn.LazyLinear(512)
        self.mu = nn.Linear(512, latent_dim)
        self.log_var = nn.Linear(512, latent_dim)

    def forward(self, x):
        features = self.feature_extraction(x)
        flattened = self.flat(features)
        result = self.dense(flattened)
        mu = self.mu(result)
        log_var = self.log_var(result)

        return mu, log_var

The decoder half of the model is a mirror of the encoder, but built with transposed convolution blocks instead. It takes in a latent vector and converts it back into an image.

class Decoder(nn.Module):

    def decoder_block(self, in_channels, out_channels, stride=2, padding=1, output_padding=1):
        return nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, (3, 3), stride=stride, padding=padding, output_padding=output_padding),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def __init__(self, latent_dim):
        super().__init__()

        self.latent2image = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.Linear(512, 512*3*3),
            nn.Unflatten(1, (512, 3, 3))
        )

        self.image_reconstruction = nn.Sequential(
            self.decoder_block(512, 256),
            self.decoder_block(256, 128),
            self.decoder_block(128, 64),
            self.decoder_block(64, 32),
            self.decoder_block(32, 3),
            nn.Sigmoid(),
        )

    def forward(self, sample):
        image_sample = self.latent2image(sample)
        return self.image_reconstruction(image_sample)

The Face_VAE model here is just used to put everything together.

class Face_VAE(nn.Module):

    def __init__(self, latent_dim=128):
        super().__init__()
        self.encoder = Encoder(latent_dim)
        self.decoder = Decoder(latent_dim)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std, requires_grad=False)
        return mu + torch.mul(eps, std)

    def forward(self, x):
        mu, log_var = self.encoder(x)
        sample = self.reparameterize(mu, log_var)
        generated_images = self.decoder(sample)
        return generated_images, mu, log_var

Sum reduction was used for both loss terms to allow bigger gradient updates to the network; in practice it gave the best training results.

optimizer = torch.optim.Adam(vae.parameters())
KL_loss = KL_Divergence(reduction="sum")
reconstruction_loss = nn.MSELoss(reduction="sum")

Samples Of Generated Faces

Here are some of the faces we got from randomly sampling the decoder. As you can see, the faces are very blurry, just like those from our MNIST model.

This is due to two main factors:

  1. The reconstruction loss, Mean Squared Error (MSE), penalises large errors far more heavily than small ones, which pushes the model towards safe, averaged predictions. Mean Absolute Error could have been used to make the output images a bit less blurry.

  2. The major factor in the blurriness is the VAE itself. Because each image is encoded as a distribution rather than a single point, the decoder effectively learns an average over the images that share nearby latent codes. Only the dominant features survive this averaging, while other features, such as the background, are washed out. In the generated faces you can see that features such as eyes, noses and mouths are prominent, but things such as necks and head shape are not; during training, our model learnt to prioritise these features over the others.

Generated pictures of faces

Series

  1. Demystifying LeNet-5: A Deep Dive into CNN Foundations
  2. Exploring pre-trained Convolutional layers and Kernels
  3. AlexNet: My introduction to Deep Computer Vision models
  4. VGG vs Inception: Just how deep can they go?
  5. ResNet and Skip Connections
  6. Denoising Images of Cats and Dogs with Autoencoders
  7. Generating Human Faces with Variational Autoencoders (you are here right now)
  8. [Generative Adversarial Networks](https://mayberay.bearblog.dev/all-gans-no-brakes/)