Your Diffusion Model Has Never Seen a Pixel
What VAEs are really for, and why modern image and video generators still depend on them.
When people talk about image generation, they usually talk about the diffusion model.
Stable Diffusion. Flux. Wan. HunyuanVideo. DiTs. U-Nets. Samplers. Guidance. Attention.
But in most modern image and video generators, the model that gets all the attention is not working with pixels at all.
It does not look at a 1024 by 1024 image as a grid of red, green, and blue values. It does not spend most of its time deciding the exact brightness of every pixel. Instead, the image is first compressed into a smaller internal space. The diffusion model generates inside that space. Only at the end does another network decode the result back into pixels.
That compressed space is usually built by an autoencoder, often called a VAE.
The VAE is easy to treat as plumbing. It sits below the main model. It gets a paragraph in many papers. It rarely gets the spotlight.
But that is misleading.
For visual generation, the space is not just an implementation detail. The space defines what the generator can afford to learn, what it can ignore, how stable its samples are, and how expensive training becomes.
Before a visual model can generate an image, it first needs a world small enough to think in.
Why Pixels Are a Terrible Place to Generate
Start with a simple question.
Why not generate pixels directly?
A 1024 by 1024 RGB image contains more than three million numbers (1024 × 1024 × 3 = 3,145,728). If you choose those numbers randomly, you do not get a strange-looking dog or a bad landscape. You get noise.
That tells us something important: almost all of pixel space is useless.
Natural images occupy a tiny, highly structured part of the full pixel universe. The photos, drawings, frames, and videos we care about live on a very thin surface inside an enormous space. Most possible pixel arrangements are not “bad images.” They are not images in any meaningful sense.
So the first problem is not generation. The first problem is representation.
We need a way to map real images into a smaller space where meaningful images are easier to describe.
An autoencoder does exactly that.
The encoder takes an image and compresses it into a latent code. The decoder takes that code and reconstructs the image. If reconstruction works, the latent code has captured the important information in the image.
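To make the encoder and decoder roles concrete, here is a minimal sketch of the simplest possible autoencoder: a linear one, fitted in closed form with an SVD instead of being trained by gradient descent. The toy data, the latent size k, and the function names are all assumptions made for illustration. Real image autoencoders are deep convolutional networks, but the contract is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 200 samples of 64 numbers that only vary along 4 hidden directions.
hidden = rng.normal(size=(200, 4))
mixing = rng.normal(size=(4, 64))
images = hidden @ mixing

# Fit a linear autoencoder in closed form: the top-k right singular vectors
# give the best rank-k linear encoder/decoder pair under squared error.
mean = images.mean(axis=0)
_, _, vt = np.linalg.svd(images - mean, full_matrices=False)
k = 4                      # assumed latent size, matched to the toy data
basis = vt[:k]             # shape (k, 64)


def encode(x):
    """Compress 64 numbers per image down to k latent numbers."""
    return (x - mean) @ basis.T


def decode(z):
    """Map latent codes back to the 64-dimensional 'pixel' space."""
    return z @ basis + mean


latents = encode(images)
reconstruction = decode(latents)
max_error = float(np.abs(images - reconstruction).max())
print(latents.shape, max_error)
```

Because the toy data really has only four degrees of freedom, four latent numbers per image are enough to reconstruct all 64 values up to floating-point error. Natural images are not this tidy, but the same idea applies: the code only has to describe the directions in which images actually vary.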
As compression, this is powerful.
As generation, it is not enough.
A Regular Autoencoder Learns Addresses, Not a Map
A normal autoencoder only has to reconstruct the training images.
That sounds harmless, but it creates a problem. The model is rewarded for assigning precise codes to the examples it sees. It has no reason to make the empty space between examples meaningful.
Imagine a city where every house has an address, but there are no roads between them. If you already know a valid address, you can get to a house. But if you choose a random coordinate, you may land in a river, a wall, or nowhere at all.
That is what can happen in a regular autoencoder’s latent space.
The training images decode well. But random latent points may not. Even the midpoint between two valid image codes may decode into something broken, because the decoder was never trained to understand that region.
The autoencoder did not fail. It did exactly what it was asked to do.
The problem is the contract.
A generator needs more than good reconstruction at known points. It needs a latent space where nearby points tend to decode into plausible images. It needs roads, neighborhoods, and continuity.
That is the job the VAE was designed to do.
The VAE Makes the Latent Space Harder to Cheat
The key idea behind a VAE is simple:
Do not encode each image as a single point. Encode it as a small cloud.
Instead of saying, “this image lives exactly here,” the encoder says, “this image lives around this region.” During training, we sample a point from that region and ask the decoder to reconstruct the image from that sampled point.
Now the decoder cannot rely on one perfect coordinate. It must learn to behave well around the image, not only at the exact location of the image.
This makes the latent space smoother.
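A minimal sketch of the "cloud" idea, assuming the encoder outputs a mean and a log-variance per latent dimension (the usual diagonal-Gaussian setup). The specific numbers and the function name sample_latent are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


# Hypothetical encoder output for one image: a cloud center and a log-variance.
mu = np.array([1.0, -2.0, 0.5])
log_var = np.log(np.array([0.04, 0.04, 0.04]))   # sigma = 0.2 in every dimension

# During training the decoder sees a fresh sample each time, never mu itself.
samples = np.stack([sample_latent(mu, log_var, rng) for _ in range(10_000)])
sample_mean = samples.mean(axis=0)
sample_std = samples.std(axis=0)
print(sample_mean, sample_std)
```

Writing the sample as mu plus scaled noise, rather than sampling from the distribution directly, keeps the path from the encoder outputs to the decoder input differentiable, which is what lets the encoder be trained with ordinary backpropagation.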
But the model can still cheat.
It could shrink every cloud until each one becomes almost a point. Then we are back to a regular autoencoder.
Or it could keep the clouds wide but push their centers far apart. Then the latent space is still an archipelago: many little islands of meaning separated by empty water.
The VAE prevents both failure modes with a KL penalty.
The reconstruction loss says: preserve the image.
The KL penalty says: keep the latent distributions close to a shared, simple prior, usually a standard normal distribution.
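In plain language, the KL term charges a price whenever a cloud shrinks toward a point or its center drifts far from the origin. That keeps the clouds from collapsing and keeps their centers from flying away.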
That is the real function of the VAE contract:
Reconstruction teaches the decoder what images look like.
KL teaches the latent space not to tear itself apart.
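The two cheats can be checked numerically. The sketch below uses the standard closed-form KL divergence between a diagonal Gaussian and a standard normal prior. The squared-error reconstruction term and the beta weight in vae_loss are simplifying assumptions, not the loss of any particular model.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus a weighted KL penalty (beta is an assumed knob)."""
    reconstruction = np.sum((x - x_hat) ** 2)
    return reconstruction + beta * kl_to_standard_normal(mu, log_var)


zeros = np.zeros(4)

# Cloud that exactly matches the prior: no penalty.
kl_matched = kl_to_standard_normal(zeros, zeros)
# Cheat 1: shrink the cloud until it is almost a point (variance 1e-4).
kl_collapsed = kl_to_standard_normal(zeros, np.full(4, np.log(1e-4)))
# Cheat 2: keep the cloud wide but push its center far from the origin.
kl_far_away = kl_to_standard_normal(np.full(4, 5.0), zeros)

print(kl_matched, kl_collapsed, kl_far_away)
```

Both cheats are charged: the collapsed cloud pays through the log-variance term, and the distant cloud pays through the squared mean. Only a cloud that stays near the prior is free.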
Why Old VAEs Looked Blurry
If VAEs make the latent space smoother, why did early VAE samples often look bad?
Because smoothness has a cost.
When many possible images overlap in latent space, the decoder may face uncertainty. Under simple pixel reconstruction losses, the safest output is often an average. Average faces become soft. Average textures lose detail. Average edges blur.
That blur was not a random bug. It was the price of making the latent space more continuous.
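The averaging effect is easy to reproduce in one dimension. In this toy sketch, one latent point is assumed to be compatible with a sharp edge at any of five nearby positions. The single output that minimizes expected squared error is the mean of those candidates, and the mean is a soft ramp rather than an edge.

```python
import numpy as np

width = 16


def sharp_edge(position):
    """A 1-D 'image' that jumps from 0 to 1 at the given pixel index."""
    return (np.arange(width) >= position).astype(float)


# Suppose one latent point is compatible with edges at several nearby positions.
candidates = np.stack([sharp_edge(p) for p in (6, 7, 8, 9, 10)])

# The single output that minimizes expected squared error is the candidate mean.
mse_optimal = candidates.mean(axis=0)


def expected_mse(output):
    """Average squared error of one fixed output against all candidates."""
    return float(((candidates - output) ** 2).mean())


blurry_error = expected_mse(mse_optimal)
sharp_error = expected_mse(candidates[2])   # commit to the middle edge instead
print(mse_optimal)
print(blurry_error, sharp_error)
```

Every individual candidate is perfectly sharp, yet the blurry average scores better under squared error than committing to any one of them. Perceptual and adversarial losses change this incentive, which is part of why later autoencoders look sharper.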
For a long time, this made VAEs look like weak generators. GANs produced sharper images. Later, diffusion models produced much better samples. The VAE seemed like an elegant idea with disappointing outputs.
Then latent diffusion changed the question.
Instead of asking, “Can the VAE generate great images by itself?” it asked a better question:
Which parts of an image are worth generating slowly?
A pixel-space diffusion model spends sampling steps on everything: structure, lighting, texture, sensor noise, tiny color variations, and details humans barely notice.
Latent diffusion splits the job.
The autoencoder performs perceptual compression. It keeps the important visual information and removes a lot of low-value detail. The diffusion model then generates in that compressed space. The decoder restores the final image in one pass.
The VAE stopped being the artist.
It became the canvas maker.
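To get a feel for how much smaller the canvas is, here is the arithmetic under one assumed layout: 8x spatial downsampling and 4 latent channels, which matches the Stable Diffusion 1.x style autoencoder. Other models use different factors and channel counts.

```python
def num_values(height, width, channels):
    """How many numbers it takes to store a height x width x channels grid."""
    return height * width * channels


# Assumed layout: 8x spatial downsampling, 4 latent channels.
downsample = 8
latent_channels = 4

pixel_values = num_values(1024, 1024, 3)
latent_values = num_values(1024 // downsample, 1024 // downsample, latent_channels)
ratio = pixel_values / latent_values
print(pixel_values, latent_values, ratio)
```

Under these assumptions the diffusion model works on a 128 by 128 by 4 grid, 48 times fewer numbers than the pixel image. The decoder fills in the rest in a single forward pass.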
Stable Diffusion Uses a Very Different Kind of VAE
The autoencoder inside Stable Diffusion is not a pure textbook VAE.
A classic VAE puts real weight on the KL term because it wants the latent space to match a simple prior. New images are generated by sampling directly from that prior and decoding the result.
Stable Diffusion has a different goal.
Its autoencoder is trained mostly for good perceptual reconstruction. It uses reconstruction losses, perceptual losses, and adversarial losses to keep images sharp. The KL term stays in the loss, but with a very small weight.
So why keep any KL at all?
Because even a small amount of regularization gives the diffusion model a better space to work in.
It helps keep latent values centered and reasonably scaled. That matters because the diffusion process depends on adding and removing noise at predictable scales.
It makes the decoder more tolerant of small errors. The diffusion model does not hand the decoder a perfect autoencoder latent. It hands over an approximate latent produced after many denoising steps.
It also helps keep the latent region connected enough for a generative model to cover.
So the modern diffusion VAE is not really trying to be a standalone generator. It is closer to a sharp perceptual compressor with a thin variational safety belt.
That small safety belt still matters.
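The point about predictable scales can be sketched directly. The latents below are random stand-ins, not outputs of a real encoder, and alpha_bar is a hypothetical noise-schedule value. The idea is that one global constant rescales the latents to roughly unit standard deviation, so that they mix with unit-variance noise on equal terms. Stable Diffusion 1.x ships a fixed constant of 0.18215 that plays this role.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder outputs on a batch: (batch, channels, height, width).
# Real latents would come from a trained encoder; these are faked with std ~5.5.
raw_latents = rng.normal(loc=0.0, scale=5.5, size=(64, 4, 32, 32))

# One global constant so the rescaled latents have roughly unit standard deviation.
scale_factor = 1.0 / raw_latents.std()
scaled = raw_latents * scale_factor

# Forward diffusion at a hypothetical signal level alpha_bar:
# x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise
alpha_bar = 0.5
noise = rng.standard_normal(scaled.shape)
noisy = np.sqrt(alpha_bar) * scaled + np.sqrt(1.0 - alpha_bar) * noise

print(scale_factor, scaled.std(), noisy.std())
```

With centered, unit-scale latents, the noisy latent keeps roughly unit variance at every noise level, which is what standard noise schedules assume. If the latents were left at their raw scale, the same schedule would destroy the signal much later than intended.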
Why Transformers Need Latents Even More
Latent diffusion was useful for U-Nets because it made generation cheaper.
For diffusion transformers, it is even more important.
Transformers use self-attention, and the cost of full self-attention grows with the square of the number of tokens. If you turn a high-resolution image into tiny pixel patches, the token count can explode. The model would need to compare too many things with too many other things.
A VAE reduces the spatial size before the transformer sees the data.
That does not just make the model faster. It changes the quality of the problem.
Pixels are a bad token system. A patch of blue sky and a patch of a human eye may receive similar compute, even though they carry very different amounts of information.
A good latent space removes redundancy. Each latent token can carry denser, more useful visual information. The transformer spends less effort on raw measurement and more effort on structure.
This is why the autoencoder is not just a compression trick. It shapes the workload of the generator.
A better latent space can make the same generator feel much smarter.
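The token-count argument can be made concrete with a little arithmetic. The patch size of 2 and the 8x autoencoder downsampling below are assumptions chosen for illustration. Real models vary, and pixel-space transformers would normally use larger patches.

```python
def token_count(height, width, patch):
    """Number of non-overlapping patch tokens for a height x width grid."""
    return (height // patch) * (width // patch)


image_size = 1024
patch = 2            # assumed patch size, used for both cases
downsample = 8       # assumed autoencoder spatial downsampling

pixel_tokens = token_count(image_size, image_size, patch)
latent_tokens = token_count(image_size // downsample, image_size // downsample, patch)

# Full self-attention compares every token with every token.
pixel_pairs = pixel_tokens ** 2
latent_pairs = latent_tokens ** 2
pair_ratio = pixel_pairs // latent_pairs
print(pixel_tokens, latent_tokens, pair_ratio)
```

Under these assumptions the latent grid yields 4,096 tokens instead of 262,144. Because attention cost scales with the number of token pairs, a 64-fold drop in tokens is a 4,096-fold drop in pairwise comparisons per attention layer.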
Video Makes the Problem Unavoidable
Video raises the stakes.
A video is not just a stack of images. Adjacent frames share most of their content. If we encode every frame independently with an image autoencoder, we waste computation and risk temporal instability.
Small encoding differences between similar frames can become visible flicker after decoding.
Modern video autoencoders solve this by compressing space and time together. They do not only reduce height and width. They also reduce the temporal dimension.
This matters for two reasons.
First, it removes redundancy across frames. The model does not need to relearn the same background 24 times per second.
Second, it helps preserve temporal consistency. The latent representation can describe motion and change more naturally than independent frame codes.
Many modern video VAEs use causal temporal structure, where a latent frame depends on the present and past, but not the future. That supports long videos, streaming, and chunked decoding.
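Here is a sketch of the shape bookkeeping for a causal video autoencoder. It assumes one common convention: the first frame is encoded alone, and each following group of t_stride frames becomes one latent frame. The strides, channel count, and clip size are illustrative assumptions, not the specification of any particular model.

```python
def video_latent_shape(frames, height, width, t_stride=4, s_stride=8, channels=16):
    """Latent shape for a causal video autoencoder that encodes the first frame
    alone, then compresses each following group of t_stride frames into one
    latent frame. All default values are assumptions for illustration."""
    assert (frames - 1) % t_stride == 0, "frame count should be 1 + a multiple of t_stride"
    latent_frames = 1 + (frames - 1) // t_stride
    return (channels, latent_frames, height // s_stride, width // s_stride)


shape = video_latent_shape(frames=81, height=480, width=832)

pixel_values = 81 * 480 * 832 * 3
latent_values = shape[0] * shape[1] * shape[2] * shape[3]
compression = pixel_values / latent_values
print(shape, compression)
```

With these numbers, 81 frames collapse to 21 latent frames on a 60 by 104 grid. Treating the first frame separately is what lets the same autoencoder handle a single image as a one-frame video, and the causal structure means earlier latent frames never have to be recomputed when more frames arrive.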
In video generation, the autoencoder is not a minor preprocessing step. It defines what the model can afford to see and how stable time will be.
The Next Latent Space May Not Be a VAE
There is still a limitation.
A VAE-style latent space is usually optimized for reconstruction. It knows how to preserve visual information, but it may not organize that information by meaning.
A dog, a wolf, and a stuffed animal may be visually related in some ways and semantically related in others. A reconstruction-oriented latent space is not guaranteed to organize those relationships in the way a generator would prefer.
This is why newer research explores representation autoencoders.
Instead of training the encoder only to reconstruct images, these systems use pretrained visual representation models, such as DINO-style or vision-language encoders, and train a decoder on top of those features.
The hope is simple: start from a latent space that already contains more semantic structure.
That helps the generator because it does not have to discover visual meaning from scratch.
But it creates a new problem. Semantic representation spaces were not designed to be easy generative spaces. They may be high-dimensional, irregular, or difficult to noise properly.
So the old trade-off returns in a new form.
A visual latent space has to balance three forces:
Fidelity: can it reconstruct the image?
Modelability: can a generator learn the space?
Semantics: is the space organized by meaning?
Classic VAEs emphasized modelability and paid with blur.
Stable Diffusion-style autoencoders emphasized fidelity and kept just enough regularization to remain usable.
Representation autoencoders push harder on semantics, then have to rebuild modelability by other means.
The field keeps moving around this triangle.
The Real Lesson
Text models have an advantage that vision models do not.
Language already comes compressed. Words and tokens are discrete. They carry dense information. They are shaped by human meaning before the model ever sees them.
Pixels are raw measurements. They are not a good thinking space.
That is why visual generation needs an intermediate world.
The VAE, or whatever replaces it, gives the generator a smaller and more meaningful arena. It decides which details are worth modeling, which details can be decoded later, how noise behaves, how expensive attention becomes, and how fragile the final image will be.
So when an image model improves, the generator may not be the only reason.
Sometimes the real improvement is the world underneath it.
The durable lesson is this:
A generative model is only as good as the space it learns to generate in.
Before a visual model can draw, it needs a world it can understand.

