Just Write a Prompt and Your Video Is Ready!

Have you ever wondered how a single-line prompt can create a video of your choice, one that is becoming increasingly difficult to distinguish from reality?

Let's look at the technical complexity an AI system works through to create such a video; some of it may surprise you.
Such video generation models are called "latent diffusion transformers." Let's unpack each of those words in turn.

First, let's clear up the confusion around "diffusion." During training, you take a picture (any scene), show it to the diffusion model of the image generation system, and then scatter random noise across its pixels.

Just like the static that used to fill TV screens in our childhood, noise is sprinkled over the picture until it is ruined, and then the diffusion model is asked to reverse the whole process, since it knows the original picture. In this way, the model is trained to remove that kind of noise from an image (and, eventually, to create such a picture from nothing but a written description).
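
To make the "spoil it, then reverse it" idea concrete, here is a minimal Python sketch of the noising step. The linear schedule and the function name are assumptions chosen for illustration; real models use carefully tuned noise schedules.

```python
import numpy as np

def add_noise(image, t, num_steps=1000):
    """Forward diffusion: blend the image with random static.

    At t=0 the image is untouched; at t=num_steps it is pure noise,
    like the old TV static described above. The linear schedule here
    is an illustrative assumption; real models tune it carefully.
    """
    alpha = 1.0 - t / num_steps                # how much of the image survives
    noise = np.random.randn(*image.shape)      # the "scattered pixels"
    noisy = np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise
    return noisy, noise                        # the model learns to predict `noise`
```

During training, the model sees `noisy` and is scored on how well it recovers `noise`; once it can do that at every step, it can run the process in reverse, starting from pure static.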

To create an image from text alone, the diffusion model needs a language model. Trained alongside millions of captioned images, this language model reads the prompt's text and turns it into guidance that steers the diffusion model at every step.
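
As a rough sketch of how that guidance flows: the prompt becomes an embedding, and that embedding is fed into every denoising step. Every name below (`text_encoder`, `denoise_step`, `generate`) is a hypothetical stand-in for illustration, not any real library's API.

```python
import numpy as np

def text_encoder(prompt):
    """Stand-in for the language model: map the prompt to a 'meaning' vector."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(128)

def denoise_step(x, t, guidance):
    """Stand-in for the trained denoiser, which would use `guidance`
    to decide what to reveal at step t. Here it just damps the noise."""
    return 0.98 * x

def generate(prompt, steps=50):
    guidance = text_encoder(prompt)       # the language model's reading of the prompt
    x = np.random.randn(64, 64, 3)        # start from pure noise
    for t in reversed(range(steps)):
        x = denoise_step(x, t, guidance)  # every step is steered by the prompt
    return x

image = generate("a cat surfing at sunset")
```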

The diffusion model creates these images (the video frames), and the language model supervises it. In this process, an encoder first converts each frame (image) into a latent representation, a compact mathematical code for that frame. The diffusion model acts on this latent code and gradually turns it into a clean code for the desired frame. That code then goes to a decoder, which turns it back into a high-resolution image.
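
To see what "latent" buys us, here is a toy encoder and decoder. Real encoders and decoders are trained neural networks; the average pooling and upsampling below are crude stand-ins used only to show the shapes involved.

```python
import numpy as np

def toy_encode(frame, factor=8):
    """Toy 'encoder': average-pool so each side shrinks by `factor`.
    Real encoders are trained neural networks; this only shows the shapes."""
    h, w, c = frame.shape
    return frame.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def toy_decode(latent, factor=8):
    """Toy 'decoder': expand the small code back to full resolution."""
    return latent.repeat(factor, axis=0).repeat(factor, axis=1)

frame = np.random.rand(512, 512, 3)   # one full-resolution video frame
latent = toy_encode(frame)            # shape (64, 64, 3): a far smaller code
restored = toy_decode(latent)         # back to (512, 512, 3)
```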

Now do you see what the "latent" in latent diffusion is there for?

If not, that's okay. Just remember that instead of the raw pixels of an image, the diffusion model processes a compact encoded version of it, and the decoder then converts the result back into a proper image. This makes video generation fast: the diffusion model never has to work on all the pixels of a frame, only on the frame's (much smaller) encoded data, while the decoder's job is to expand that processed code back into a full-size image.
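
As a back-of-the-envelope example (the specific sizes below are assumptions, typical of some popular latent diffusion image models):

```python
pixels = 512 * 512 * 3   # values in a full-resolution frame
latent = 64 * 64 * 4     # values in its latent code (8x smaller per side, 4 channels)
print(pixels / latent)   # 48.0 -> the diffusion model handles ~48x less data per frame
```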

Transformers

The last and most important part of video generation is a neural network architecture whose job is to connect these otherwise unrelated video frames in a specific order to create a living, moving video. Just as AI-written text forms whole sentences and paragraphs from a specific arrangement of words, arranging the frames is the job of this transformer model.

The transformer understands the context of the frames: it tracks how the scene changes from one frame to the next, which reduces the chance of incoherence in the final video.
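
Here is a minimal numpy sketch of the core mechanism, self-attention across frames. It is untrained and omits the learned projections and multiple attention heads real models have; it only shows how every frame gets to "look at" every other frame.

```python
import numpy as np

def temporal_attention(frame_latents):
    """Self-attention across frames: each frame's new code is a weighted
    mix of every frame's content, so information flows through time.
    Untrained toy version; real models learn Q/K/V projections and use
    many attention heads."""
    q = k = v = frame_latents
    scores = q @ k.T / np.sqrt(q.shape[-1])         # frame-to-frame relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ v                              # each frame "sees" the others

latents = np.random.randn(16, 256)    # 16 frames, each a 256-number latent code
mixed = temporal_attention(latents)   # still (16, 256), now mixed across time
```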

Thus, these pieces of a latent diffusion transformer work together to create a video. This article describes only the modern video generation approach; other kinds of models are also used for video generation.

This was only a brief introduction; real systems involve more technical complexity, and some details have been simplified.