The following video shows the trajectories of ten randomly chosen digits from the MNIST dataset (one for each class) during the first five training epochs.
For each digit, the size of the circle reflects the variance computed by the VAE (more precisely, the squared radius r^2 is the geometric mean of the two variances along x and y). Initially it is close to 1, but it rapidly decreases.
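As a small illustration of how such a radius could be obtained, here is a minimal sketch assuming the encoder returns one variance per latent axis. The function name `circle_radius` and the sample values are hypothetical and not taken from the original experiments.

```python
import numpy as np

def circle_radius(var_x, var_y):
    """Radius r such that r**2 is the geometric mean of the two variances."""
    var_x = np.asarray(var_x, dtype=float)
    var_y = np.asarray(var_y, dtype=float)
    # geometric mean of the variances is sqrt(var_x * var_y); r is its square root
    return np.sqrt(np.sqrt(var_x * var_y))

# At initialization both variances are close to 1, so the radius is close to 1.
initial = circle_radius(1.0, 1.0)
# Hypothetical variances after some training: the circle shrinks.
later = circle_radius(0.04, 0.01)
```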
Intuitively, the area inside the circle can be understood as the region of the plane that gives rise to a reconstruction close to the input.
Its radius must eventually be small, since we need to accommodate 60000 different digits.
All digits progressively distribute around the center, in a Gaussian-like distribution.
This overall distribution is better appreciated in the next video, which is similar to the previous one but traces six different digits per category.
In order to focus on a smaller area, we start drawing from the beginning of the second epoch, and continue until the end of the tenth epoch.
The only purpose of sampling is to "occupy" the space around
each encoding of input points in the latent space. How much space
can we occupy?
In principle, all the available space: that is why we try to keep
the distribution P(z|X) close to a normal distribution.
In practice, you force each data point X to compete with all other points in the occupancy of the latent space, each one pushing the others as far away as possible.
The hope
is that this should induce a better coverage of the latent space,
resulting in a better generation of new samples.
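The two ingredients discussed above, sampling around each encoding and keeping the encoding distribution close to a normal one, can be sketched as follows. This assumes the standard VAE formulation, in which the encoder outputs a mean and a log-variance per latent variable and the closeness to the normal prior is measured by a Kullback-Leibler term. The function names and the example values are illustrative only.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps, with eps ~ N(0, I) (reparameterization)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.exp(0.5 * np.asarray(log_var, dtype=float))
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.5, -1.0]])        # hypothetical encoding of one digit
log_var = np.array([[-2.0, -2.0]])  # hypothetical (small) variance
z = sample_latent(mu, log_var, rng)           # a point "occupying" space near mu
penalty = kl_to_standard_normal(mu, log_var)  # cost of deviating from N(0, I)
```

The penalty grows both when the mean moves away from the origin and when the variance shrinks, which is what makes the data points compete for space.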
Let's check this. In the following pictures we compare the
distribution we get with the variational approach (first row)
with the one we get when sampling is removed (second row)
while maintaining the quadratic penalization on latent values.
In the latter case,
the size of the points in the latent space has no significance: we
use the default marker size of the pyplot library.
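To make the comparison concrete, here is a hedged sketch of what a quadratic penalization on latent values might look like, and how it relates to the variational penalty. The factor 0.5 and the function names are assumptions, since the exact weighting used in the experiments is not stated. With that factor, the penalty coincides with the Kullback-Leibler term of a Gaussian with unit variance.

```python
import numpy as np

def quadratic_penalty(z):
    """0.5 * ||z||^2 for each latent code (assumed weighting)."""
    z = np.asarray(z, dtype=float)
    return 0.5 * np.sum(z**2, axis=-1)

def kl_unit_variance(mu):
    """KL( N(mu, I) || N(0, I) ): variance terms cancel, leaving 0.5 * ||mu||^2."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.zeros_like(mu)  # unit variance
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

codes = np.array([[1.0, 2.0], [0.0, 0.0]])  # hypothetical deterministic encodings
no_sampling = quadratic_penalty(codes)
variational_fixed_var = kl_unit_variance(codes)
```

Under these assumptions the two regularizers pull the encodings toward the origin in the same way; what the model without sampling loses is the noise injected around each encoding during training.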
The effect of sampling, in this case, is not very evident, either in the distribution in the latent space or in the quality of random generation of new digits.
An important proviso of variational autoencoders is that the encoder must be sufficiently powerful to suitably approximate the desired distribution P(z|X) of latent variables given the input. Maybe the network of the previous example was not sufficiently sophisticated. We shall now repeat the same experiment with more complex neural networks.
For the first experiment, we just add a couple of dense layers, passing from the structure 784-256-2 to 784-256-32-8-2. Results are shown below.
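To get a feeling for how little the extra layers cost, the sketch below counts the trainable parameters of a stack of dense layers, reading the two structures as layer widths 784, 256, 2 and 784, 256, 32, 8, 2. It is a simplification: it counts a single encoder stack with biases, whereas an actual VAE encoder would typically have separate output heads for mean and variance, and the decoder is ignored.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected stack with the given widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

shallow = dense_param_count([784, 256, 2])
deeper = dense_param_count([784, 256, 32, 8, 2])
extra = deeper - shallow  # parameters added by the two intermediate layers
```

Almost all parameters sit in the first 784-to-256 layer, so the deeper encoder is only marginally larger than the shallow one.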
Here, we can readily observe that the distribution of points over the latent space is noticeably less uniform in the case without sampling, resulting in the generation of more distorted digits. It is also worth observing that, for both models, there is a noticeable improvement in the reconstruction loss with respect to the previous shallower net and that, in this respect, the network without sampling is even slightly better than the variational one.
Let us see what happens when we increase the dimension of the latent space, passing e.g. from 2 to 3 latent variables. We give a three-dimensional view of the distribution of digits in the latent space, as well as the result of sampling, organized in five horizontal slices of 25 points each.


We may observe that digits have a much more spherical disposition in the case with sampling. In the case without sampling, random sampling from the latent space generates very bad results.
One might easily wonder how the variational approach scales to higher dimensions of the latent space. The concern is, as usual, related to the curse of dimensionality: in a high-dimensional space the encodings of data in the latent space will eventually be scattered far apart, and it is not evident that sampling during training will be enough to guarantee a good coverage in view of generation.
Let us see what happens in four dimensions, starting with a 784-256-32-8-4 architecture.
The first observation is that the outcome of training is much more erratic than in the previous cases, resulting in quite different final errors. Not infrequently, some of the latent variables also collapse to 0 (meaning that their variance is close to zero; their mean always is), which is quite bad for generative sampling.
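One simple way to detect such a collapse, sketched here under the interpretation that "variance" refers to the spread of the encoded values over the dataset, is to compute the per-dimension variance of the encodings and flag the dimensions that fall below a threshold. The function name, the threshold, and the synthetic encodings are all hypothetical.

```python
import numpy as np

def collapsed_latents(encodings, threshold=1e-2):
    """Indices of latent variables whose variance over the dataset is near zero."""
    variances = np.var(np.asarray(encodings, dtype=float), axis=0)
    return np.flatnonzero(variances < threshold), variances

# Synthetic stand-in for the encodings of 1000 digits in a 4-dimensional latent space.
rng = np.random.default_rng(0)
encodings = rng.standard_normal((1000, 4))
encodings[:, 2] *= 1e-3  # simulate a collapsed latent variable

collapsed, variances = collapsed_latents(encodings)
```

A collapsed dimension carries essentially no information about the input, so moving along it at generation time has little or no effect on the decoded image.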
The overall quality of generated samples is modest. In this case, we do not attempt to show the distribution of digits in the latent space, and focus on sampling only. We sample 3 points for each dimension, resulting in 3^4 = 81 images. Below on the left is a typical example (a relatively good one).
Left: 784-256-32-8-4 architecture. Right: 784-256-32-4 architecture.
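The regular sampling used for these pictures (3 points per dimension in four dimensions, or five slices of 25 points in three dimensions) can be sketched as a Cartesian grid over the latent space. The helper name `latent_grid` and the range covered by the grid (`extent`) are assumptions, since the actual range used for the figures is not stated.

```python
import itertools
import numpy as np

def latent_grid(points_per_dim, dim, extent=2.0):
    """All combinations of evenly spaced coordinates in [-extent, extent]^dim."""
    axis = np.linspace(-extent, extent, points_per_dim)
    return np.array(list(itertools.product(axis, repeat=dim)))

grid_4d = latent_grid(3, 4)  # 3^4 = 81 latent points
grid_3d = latent_grid(5, 3)  # 5^3 = 125 points, i.e. five slices of 25
```

Each row of the grid would then be passed to the trained decoder (not shown here) to produce one image. The number of points grows exponentially with the dimension, which is why this kind of sampling quickly becomes impractical.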
Can we improve this situation?
The instability of training, as well as the collapse of some of the latent variables, may suggest that the model is too expressive. So, we may try to reduce its complexity, e.g. by removing one or two layers. This seems indeed to be beneficial: the 784-256-32-4 architecture is more stable during training, and even sampling looks slightly improved (previous picture on the right).
Let us be more aggressive, and try to work with 8 or 16 latent variables. In both cases we test a couple of architectures. Due to the number of dimensions, we give up uniform sampling and just show 100 samples generated according to a standard normal distribution for each latent variable.
784-256-32-16-8 architecture  784-256-32-8 architecture
784-256-32-24-16 architecture  784-256-32-16 architecture
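For reference, drawing the 100 latent points used for this kind of picture takes a couple of lines. This is a minimal sketch assuming independent standard normal latent variables; the function name and the seed are illustrative.

```python
import numpy as np

def sample_prior(n_samples, latent_dim, seed=0):
    """Draw latent codes with independent N(0, 1) components."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_samples, latent_dim))

codes_8 = sample_prior(100, 8)    # 100 codes for the 8-dimensional models
codes_16 = sample_prior(100, 16)  # 100 codes for the 16-dimensional models
```

Each row would then be fed to the trained decoder (not shown here) to generate one digit.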
The collapse to zero of several latent variables is confirmed, and is more evident for deeper architectures. In spite of this phenomenon, generative sampling from a latent space of dimension 16 seems to produce slightly better results with a deeper network.
A possible explanation is that, by exploiting a larger latent space, we may encode the information at a greater level of detail, but in order to take advantage of this additional information we also need more complex, highly nonlinear transformations.
In the next section, we shall tackle a slightly more complex generative task.