Variational Autoencoders (VAEs)

One of the biggest obstacles in machine learning is access to data. Now imagine that you could take some sample data and then generate new, similar data that looks like it comes from the same distribution without being identical. This could be game-changing in many fields, such as health care, the automotive industry, and manufacturing, where there is not enough data (specifically, image data).

So how does it work?

Here We Have The Data X And The Hidden Model Z

Latent space refers to an abstract multi-dimensional space containing feature values that we cannot interpret directly, but which encodes a meaningful internal representation of externally observed events.

Credit Jeremy Jordan

Here is a great example of latent space. The first image is the data given to you, and then you have the latent space with all its different attributes, and then from that, you can make a new image.

It is important to note that latent attributes are not single numbers; each one is a distribution. So when new images are created, all of the latent attributes are sampled from their respective distributions.

Gaussian Distribution
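As a small sketch of this idea, each latent attribute is its own distribution that gets sampled per image. The attribute names and numbers below are made up for illustration, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical latent attributes -- each is a (mean, std) pair,
# i.e. a Gaussian distribution rather than a single fixed number.
attributes = {
    "smile":     (0.8, 0.10),
    "skin_tone": (0.2, 0.05),
    "beard":     (-0.5, 0.20),
}

# Every new image draws each attribute from its own distribution,
# so no two generated images are exactly alike.
sample = {name: rng.normal(mean, std) for name, (mean, std) in attributes.items()}
```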

Variational autoencoders have two parts: the encoder and the decoder. The encoder takes the image and creates the latent space. The decoder then takes sampled points from the latent space and tries to recreate the image. The goal of variational autoencoders is to recreate an image that is not identical to the original, yet looks like it could come from the same batch of training points.

Visualization of Encoder and Decoder
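A minimal sketch of the two halves, using untrained linear maps as stand-ins for the real encoder and decoder networks (all dimensions and weights here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 4-pixel "image" and a 2-dimensional latent space.
# The untrained linear maps below stand in for real neural networks.
W_enc = 0.1 * rng.standard_normal((2, 4))
W_dec = 0.1 * rng.standard_normal((4, 2))

def encode(x):
    """Map an image to the mean and log-variance of the latent distribution q(z)."""
    mean = W_enc @ x
    log_var = np.zeros_like(mean)  # fixed unit variance, for simplicity
    return mean, log_var

def decode(z):
    """Map a latent sample back to image space."""
    return W_dec @ z

x = rng.standard_normal(4)                                # a fake "image"
mu, log_var = encode(x)                                   # build the latent space
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(2)   # sample from q(z)
x_new = decode(z)                                         # similar, but not identical
```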

Now Get Ready For Some Mathematics

Now, applying Bayes' theorem, we can rewrite the likelihood of the data as follows.

This states that the likelihood of X given z is a function with parameters theta. To get X, you will usually have a neural network output some value plus some random noise.
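The equation from the missing figure can be reconstructed from the surrounding text; in the notation used here, it is the marginal likelihood of the data:

```latex
P(X) = \int P(X \mid z;\, \theta)\, P(z)\, dz
```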

You may have noticed that there is also a new variable, P(z). This is called the prior over the latent variable z. P(z) holds all the information needed to generate X, and it captures all of the dependencies between dimensions.

If we want to optimize this function, we will maximize P(X) with respect to theta.

However, this term on the right is hard to calculate, so we take the log to use some of the logarithm laws (which we will see in action later).
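Reconstructing the missing equation in the same notation, the objective after taking the log is:

```latex
\max_{\theta}\, \log P(X) \;=\; \max_{\theta}\, \log \int P(X \mid z;\, \theta)\, P(z)\, dz
```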

This function is also hard to maximize because the integral is intractable, meaning it cannot be computed efficiently in closed form.

To solve this issue, machine learning experts have decided to turn this into what they know best: an optimization problem. This can be done through two techniques.

  1. Multiplying and dividing by a new distribution
  2. Utilizing Jensen’s Inequality

To reduce clutter, I will omit the max with respect to theta in the following equations.

To start, we will multiply and divide our equation above by a new distribution called q(z).

This will introduce more freedom into the equation. We will then group the division by q(z) together with P(X|z)P(z) into a function called f(z).

So the above equation will turn into:
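In the same notation (reconstructed, since the original figure is missing):

```latex
\log P(X)
= \log \int q(z)\, \frac{P(X \mid z)\, P(z)}{q(z)}\, dz
= \log \int q(z)\, f(z)\, dz,
\qquad \text{where } f(z) = \frac{P(X \mid z)\, P(z)}{q(z)}
```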

Now we know from expectation rules that the expectation of f(z) under a distribution q(z) is equal to the integral of f(z) multiplied by q(z).

Expectation Rule

So we can apply this same rule to our equation. Our equation can be turned into the expectation of f(z) coming from the distribution q(z).
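Applying that rule, the equation becomes:

```latex
\mathbb{E}_{q(z)}[f(z)] = \int q(z)\, f(z)\, dz
\quad \Longrightarrow \quad
\log P(X) = \log \mathbb{E}_{q(z)}[f(z)]
```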

Now we're done with step 1! Easy, right?

Now let's get to Jensen's Inequality.

So there are three things that Jensen's Inequality states.

  1. In general, a function of an expectation is not equal to the expectation of the function: g(E[f(z)]) ≠ E[g(f(z))].

  2. If the function g is convex (for example, an exponential), then the function of the expectation is less than or equal to the expectation of the function: g(E[f(z)]) ≤ E[g(f(z))].

  3. As you may suspect, the final rule states that if the function g is concave, the function of the expectation is greater than or equal to the expectation of the function: g(E[f(z)]) ≥ E[g(f(z))].

So now, knowing these rules, we can apply them to our function. A logarithm is a concave function, which means we will apply the 3rd rule.
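Reconstructing the missing equation, applying Jensen's inequality to the concave log gives:

```latex
\log \mathbb{E}_{q(z)}[f(z)] \;\ge\; \mathbb{E}_{q(z)}[\log f(z)]
```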

This is done to move the log inside of the expectation, which will bring us even closer to our goal of turning this into an optimization problem.

So let us now expand f(z) and use some more logarithm and expectation rules. We know from earlier that f(z) is equal to P(X|z)P(z)/q(z).

Now we can apply logarithm rules to expand the log of P(X|z)P(z)/q(z) to the log of P(X|z) + the log of P(z)/q(z).

We can then use expectation rules to convert this to two different expectations.
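Written out in the same notation, the expansion reads:

```latex
\mathbb{E}_{q(z)}[\log f(z)]
= \mathbb{E}_{q(z)}\!\left[\log P(X \mid z)\right]
+ \mathbb{E}_{q(z)}\!\left[\log \frac{P(z)}{q(z)}\right]
```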

Now we are getting very close to our ELBO loss function. The first term on the right-hand side of the equation is called the log-likelihood. It states how likely we are to get X given z, with z coming from the distribution q(z). That part is finished and is ready for optimization. However, the second term still needs a little work.

In variational autoencoders, we use what is called the Kullback-Leibler divergence. It measures the difference between two probability distributions.

Visualization Of KL divergence

The formula for the KL divergence is as follows.

Please notice that q(z) is in the numerator of this fraction, and P(z) is in the denominator. In our equation from before, P(z) is the numerator and q(z) is the denominator. So, expanded, it would look as follows.
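The KL divergence formula being described, written out:

```latex
D_{KL}\big(q(z) \,\|\, P(z)\big)
= \int q(z)\, \log \frac{q(z)}{P(z)}\, dz
= \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{P(z)}\right]
```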

So we know that the only thing left to do is switch the denominator and numerator. Thankfully, we have logarithm rules, which state that the log of (a/b) is equal to the negative log of (b/a).

So using this principle, we can change our equation to the -log.

And then this is just equal to the negative KL divergence that we mentioned before. So our final equation will be.

After applying all of these steps, we have arrived at the ELBO (Evidence Lower Bound) loss function, which we will want to maximize with respect to theta and phi, which are the parameters of their respective neural networks.
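Putting the pieces together, the bound (in the notation used throughout) is:

```latex
\log P(X) \;\ge\;
\mathbb{E}_{q_{\phi}(z)}\!\left[\log P_{\theta}(X \mid z)\right]
\;-\; D_{KL}\big(q_{\phi}(z) \,\|\, P(z)\big)
```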

It is important to note that we have parameters in our new density q(z) because we want to find the best z that will produce X. Since z is unknown, the only way to do this is to learn it.

ELBO Loss Function

Now for the optimization

The Reparameterization trick

Here you can see that the density is p(z) and theta is inside the expectation. This means that we can easily take the gradient with respect to theta by following the next couple of steps:

First, we have to expand the expectation into an integral, and then we can move the gradient inside the brackets. Then, we can switch back into expectation form, so we have the expectation of the gradient of f theta of z.
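The chain of steps just described, reconstructed in the same notation:

```latex
\nabla_{\theta}\, \mathbb{E}_{p(z)}[f_{\theta}(z)]
= \nabla_{\theta} \int p(z)\, f_{\theta}(z)\, dz
= \int p(z)\, \nabla_{\theta} f_{\theta}(z)\, dz
= \mathbb{E}_{p(z)}\!\left[\nabla_{\theta} f_{\theta}(z)\right]
```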

What we just did was very easy to do; however, what happens if the parameter that we want to take the gradient with respect to is in the density?

We will then end up with the same term as above, but we will also need to take the gradient with respect to theta in the density. This will not convert into an expectation, because when you take the gradient of a density, the result is no longer a normalized density, which is crucial for an expectation.

If we take a look back at our ELBO again, we can see that we have this same issue.

ELBO Loss Function

We will have to take the gradient with respect to phi, which is in the density.

So What Do We Do

Here you can see that if Y is sampled from a Gaussian with a mean of mu and covariance Sigma, it can be rewritten as mu plus L, a Cholesky decomposition of Sigma, multiplied by epsilon, which is sampled from a standard Gaussian.
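This identity can be checked numerically; mu and Sigma below are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example mean and covariance (any positive-definite Sigma works).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

L = np.linalg.cholesky(Sigma)    # lower-triangular L with Sigma = L @ L.T
eps = rng.standard_normal(2)     # epsilon ~ N(0, I)
y = mu + L @ eps                 # y ~ N(mu, Sigma); eps carries the randomness
```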

Going back to our example before where we had theta in our density, we could rewrite it like this.

Because it is now in Expectation form, we would be able to apply techniques like Monte Carlo Estimation.
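For example, here is a Monte Carlo estimate of an expectation under a reparameterized Gaussian (mu and sigma are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example parameters for z ~ N(mu, sigma^2).
mu, sigma = 1.5, 0.5

# Reparameterize z = mu + sigma * eps, then average over many samples
# to estimate E[z^2]. Closed form: E[z^2] = mu^2 + sigma^2 = 2.5.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps
estimate = np.mean(z ** 2)
```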

Now, knowing the Reparameterization trick, we can apply it to our ELBO loss function.

Credit Gregory Gunderson

Here is a great visualization that shows all the different steps in a VAE. First, we have the data point X, which we pass through the encoder, giving us a mean mu and a covariance Sigma. This allows us to do two things. First, we can compute the -KL divergence between the distribution q(z) and the prior P(z). Second, we can combine mu and Sigma with epsilon, which is sampled from a standard Gaussian, and feed the result into the decoder. That gives us a new data point X. We can then compare this reconstructed X to the original observed X and compute the mean squared error. Finally, we arrive at our final loss function by combining the mean squared error and the -KL divergence.
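The combined loss can be sketched in a few lines. The closed-form KL term below assumes q(z) is a diagonal Gaussian and the prior P(z) is a standard Gaussian; all the input values are toy stand-ins for real encoder/decoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_loss(x, mu, log_var, x_recon):
    """VAE loss: reconstruction MSE plus KL(q(z) || P(z)).

    Assumes q(z) is a diagonal Gaussian with parameters (mu, log_var)
    and the prior P(z) is a standard Gaussian, so KL has a closed form.
    """
    mse = np.mean((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return mse + kl

# Toy stand-ins for a real encoder/decoder pass (hypothetical values).
x = np.array([0.5, -0.3])                                # observed data point
mu, log_var = np.zeros(2), np.zeros(2)                   # pretend encoder outputs
z = mu + np.exp(0.5 * log_var) * rng.standard_normal(2)  # reparameterized sample
x_recon = x                                              # pretend perfect reconstruction
loss = vae_loss(x, mu, log_var, x_recon)                 # 0 when q(z) = P(z) and MSE = 0
```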

Wrap Up the Math

Where You Can Go From Here

These concepts are used in different RL algorithms, like Dreamer from Google DeepMind.


Dreamer is a three-step algorithm: first, you learn the latent space by sampling from the real world. Second, you learn the policy/behavior in the latent space. And third, you apply the policy back in the real world to see how good your policy is.

One of the more important parts of the Dreamer algorithm is to capture information in the real world and then build its own model of the environment. You can think of the latent space as X. We are given information and action, which would be the Z. The difference here is that this is now a time series problem because you care about past states and are computing rewards at each time step.

Here We Have The Data X And The Hidden Model Z

Another situation where the concepts from VAEs are used is in Bayesian Optimization (BO). This is an optimization technique for black-box functions.

It gets a little trickier for BO because here, we are given some input X and some corresponding output Z, and we are trying to model the unknown function F(x) that maps X to Z.


  • It can be used in various industries like the health sector, the automotive industry, manufacturing, and many others
  • Given an original data point X, it tries to learn a latent space that, when sampled from, will create an accurate image
  • Variational autoencoders have two parts: the encoder, which builds the latent space, and the decoder, which creates the new image
  • We use a combination of the -KL divergence and the mean squared error as our final loss function
  • You can then optimize that loss function with stochastic gradient descent
  • Other areas where these concepts are applied include RL algorithms like Dreamer from Google DeepMind, and Bayesian Optimization

Extra Resources