Event 3: Gen AI Video Breakout & World Model

Recently, we partnered with IntelliGen to host a technical deep dive into Sora and the world of Gen AI videos! Our guest speaker lineup included:

Guests

  1. Lijun Yu: Mastermind behind Google's VideoPoet
    1. 📝 Relevant Work: "VideoPoet - A large language model for zero-shot video generation"
    2. 🔗 Learn More
  2. Ethan He: AI wizard with 20k followers on GitHub, senior research scientist at NVIDIA
    1. 📝 Relevant Work: "How Sora works - detailed explanation"
    2. 🔗 Explore Here

Moderator

  • Cindy Le: Tech influencer (150k+ followers), AI founder and researcher
    • 📘 Relevant Work: "EucliDreamer: Fast and High-Quality Texturing for 3D Models with Stable Diffusion Depth"
    • 🔗 YouTube channel

Cohost

  • Sylvia Tong: Co-founder of EntreConnect Community, early-stage investor, AI builder
    • 🌐 LinkedIn

🎉 Partner Host Community: IntelliGen (@intelligen)

During the session, we took a deep dive into the technicalities of Sora, uncovering everything from the model’s breakthrough capabilities to its limitations. We asked: what makes Sora tick, and how might it revolutionize video content creation?

For those who are eager to dive into the details, we've transcribed the full panel discussion. This is your chance to see insightful questions from our community answered by industry speakers working at the forefront of AI.

Let’s get to it.

Panel Discussion Transcript

Cindy (Moderator): Could you introduce yourself?

Lijun: Hi everyone, I'm Lijun Yu. I'm currently a graduating PhD student at Carnegie Mellon University, and I have also worked at Google as a student researcher for quite a few years. For the past two years, we have worked on video generation. From early on, I worked on transformer-based video generation, which led to the latest release called VideoPoet. It's a large-language-model-style model for multimodal video generation.

Ethan: My name is Ethan. I work on LLM and generative AI training frameworks at NVIDIA. We have two open-source libraries called Megatron and Nemo on GitHub, so you can train LLMs at a very large scale — like trillions of parameters. We have also released open-source training recipes for models like Llama and Mistral. Previously, I worked at Facebook AI Research on large-scale self-supervised learning. I graduated from Carnegie Mellon.

Cindy (Moderator): Why is Sora so powerful?

Lijun: I was expecting something similar to come by the end of 2024, but it actually came at the beginning of the year, much earlier than expected. And many people are trying to reproduce it because now the approach is verified. Sora uses a diffusion transformer in a latent space. We know how to build latent spaces with discrete or continuous tokenizers. We also know how to train diffusion transformers for images and later for videos. They really scaled this thing up — up to, say, tens of thousands of GPUs and hundreds of millions of data samples.

Ethan: I think my biggest impression is that the scaling law on vision is proven. If you have a self-supervised learning model like a masked autoencoder, even if you scale it a hundred times, it's not going to change how it affects the world. It's just going to be another model with better performance on some of the benchmarks. Stable Diffusion has been getting larger and better bit by bit, and everybody is hitting a ceiling. Now we are seeing the scaling law: we can add more data and scale the model and the infrastructure to get a better model and better intelligence.
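
For readers who want the "scaling law" Ethan refers to made explicit: a common formulation from the language-model literature (a Chinchilla-style fit; the speakers did not write this out) models the loss as a power law in parameter count N and dataset size D:

```latex
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

The point is that, empirically, video generation now appears to follow the same kind of predictable improvement as N and D grow, rather than plateauing.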

Cindy: What is DiT?

Image Credit: Scalable Diffusion Models with Transformers by Peebles and Xie (2022)

Lijun: DiT stands for Diffusion Transformer. It uses the standard transformer architecture with bidirectional attention. It's trained with the diffusion objective by doing denoising step by step. Nowadays, we've seen that autoregressive transformers are very scalable, as proved by the LLMs. This time, the diffusion transformer is also scalable.

Going back to the frameworks behind various products and open-source projects: Stable Diffusion 3 is a diffusion transformer-based model. Other systems like Runway, Pika, Sora, and Animate are latent diffusion models, which means we do not do the diffusion denoising process in the pixel space. Pixel diffusion models can be viewed as the first generation of models, starting from DDPM and later Imagen and Imagen Video. The models we talk about now are all latent diffusion models, which means we first use a variational autoencoder or a similar framework to downsample the videos, pixels, and images into a low-dimensional manifold so that processing is much faster. After we downsample, we do the diffusion denoising training in this latent space. People have been using the U-Net architecture for a long time; it became popular with the first version of Stable Diffusion and stayed in use until the latest, third version. Similarly, all of those video model companies have probably been using U-Net-based architectures until now. But I believe that after seeing Sora, everybody is shifting toward the diffusion transformer, which is much more scalable, though it may be more costly when you want to achieve really high quality.
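
To make the latent-diffusion pipeline Lijun describes concrete, here is a minimal PyTorch-style sketch: a VAE-like encoder compresses pixels into a latent space, a bidirectional transformer denoises latent tokens step by step, and a decoder maps the result back to pixels. All module choices, shapes, and the fixed step size are illustrative assumptions, not Sora's actual architecture.

```python
import torch
import torch.nn as nn

class LatentDiffusionSketch(nn.Module):
    """Illustrative latent-diffusion loop: encode -> iteratively denoise -> decode.
    Shapes and modules are toy placeholders, not Sora's real components."""
    def __init__(self, latent_dim=16, model_dim=256, num_layers=4):
        super().__init__()
        # VAE-style encoder/decoder: pixels <-> low-dimensional latents
        self.encoder = nn.Conv3d(3, latent_dim, kernel_size=4, stride=4)
        self.decoder = nn.ConvTranspose3d(latent_dim, 3, kernel_size=4, stride=4)
        # Diffusion transformer (DiT): bidirectional attention over latent tokens
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_in = nn.Linear(latent_dim, model_dim)
        self.proj_out = nn.Linear(model_dim, latent_dim)

    def denoise_step(self, z):
        # One denoising step: predict noise and remove a fraction of it.
        tokens = self.proj_in(z.flatten(2).transpose(1, 2))   # (B, T*H*W, model_dim)
        pred_noise = self.proj_out(self.backbone(tokens))
        pred_noise = pred_noise.transpose(1, 2).reshape_as(z)
        return z - 0.1 * pred_noise

    @torch.no_grad()
    def sample(self, shape, steps=50):
        z = torch.randn(shape)                 # start from pure noise in latent space
        for _ in range(steps):
            z = self.denoise_step(z)
        return self.decoder(z)                 # map latents back to pixels

model = LatentDiffusionSketch()
video = model.sample(shape=(1, 16, 8, 16, 16))   # (batch, latent_dim, T, H, W)
print(video.shape)                               # (1, 3, 32, 64, 64) after upsampling
```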

Ethan: I can talk about U-Net versus the diffusion transformer. U-Net excels in training stability and convergence; it's easier to train than DiT. That's probably one of the reasons why it was prevailing before OpenAI released Sora. Because U-Net has long skip connections, it connects embeddings from the beginning to the end: high-resolution features travel through the skip connections to the final layers, and low-level features also reach layers much further along. So it can easily learn image features. You're trying to approximate images from images, and this kind of skip connection helps with that. A diffusion transformer does not have this kind of inductive bias, so it's harder to train. It's similar to a vision transformer versus a convolutional neural network: convolutional neural networks have an inductive bias of locality, while a vision transformer is just global attention and has to learn that locality through training, so it's much harder to train. In the development of our Megatron repository, we have pipeline parallelism, where you slice the model into multiple stages. You can put the stages on different devices, so you can scale the model size almost infinitely.
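
As a minimal illustration of the long skip connections Ethan mentions, here is a toy U-Net-style module in PyTorch. It is purely illustrative; real diffusion U-Nets add attention blocks, timestep embeddings, and many more resolution levels.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net-style block: early high-resolution features are concatenated
    directly into late layers via a long skip connection."""
    def __init__(self, ch=32):
        super().__init__()
        self.down1 = nn.Conv2d(3, ch, 3, stride=2, padding=1)        # high-res features
        self.down2 = nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)   # low-res bottleneck
        self.up1 = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)
        # The decoder sees both the upsampled bottleneck AND the skipped early features.
        self.up2 = nn.ConvTranspose2d(ch * 2, 3, 4, stride=2, padding=1)

    def forward(self, x):
        h1 = torch.relu(self.down1(x))          # (B, ch, H/2, W/2)
        h2 = torch.relu(self.down2(h1))         # (B, 2ch, H/4, W/4)
        u1 = torch.relu(self.up1(h2))           # (B, ch, H/2, W/2)
        u1 = torch.cat([u1, h1], dim=1)         # long skip connection
        return self.up2(u1)                     # back to (B, 3, H, W)

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)   # torch.Size([1, 3, 64, 64])
```

The concatenation in `forward` is the inductive bias he refers to: the decoder never has to relearn high-resolution detail because the encoder hands it over directly, which a plain transformer must instead learn from data.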

On the other hand, there is autoregressive versus diffusion. Autoregressive modeling is usually used in transformers and language modeling, and it performs very well in the language domain. But diffusion can also be used for language modeling. I remember a previous talk from Ilya Sutskever of OpenAI, where he argued that autoregressive is naturally better than diffusion and masked language modeling. This is because autoregressive is next-token prediction, a conditional probability: you are trying to predict the future based on the past. However, masked language modeling like BERT masks some of the tokens and leaves out tokens in the middle, and diffusion constructs intermediate steps out of nowhere, so they both involve synthetic middle states that don’t exist in nature. That's part of the reason why autoregressive is stronger than diffusion and BERT. OpenAI also tried ImageGPT, which is autoregressive generation for images. Ultimately, once you have enough compute, you don't have to use diffusion; you can just generate in one step. But currently, since compute is limited, diffusion is still better. You compress a larger model into multiple diffusion steps, so it behaves kind of like a model 50 or 100 times larger than the original.
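
For reference, the "conditional probability" point can be written in standard textbook notation (this is the generic formulation, not something the speakers derived): an autoregressive model factors the joint distribution into next-token conditionals, while a diffusion model is trained to undo a fixed noising process at synthetic intermediate steps.

```latex
p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})
\qquad \text{vs.} \qquad
\mathcal{L}_{\text{diffusion}} = \mathbb{E}_{x,\,\epsilon,\,t}\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right) \right\|^2
```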

Cindy (Moderator): We are also interested in the limitations of Sora. Why might errors like this happen? How long do you think it will take to improve the model?

Open AI's Sora, a new text-to-video model, Explained | Encord

Lijun: Errors can come from two aspects: the data and the model. On the data side, first of all, we are not clear on what data Sora used in training. It's probably true that part of the data is real, like recordings from surveillance cameras, robotics, and self-driving cars. But there are also other sources, like people’s posts on YouTube, TikTok, and other platforms, or even TV shows and movies. Those contain special effects, and that content naturally does not follow physical rules. If the model is trained to produce these things, then the content we see may not follow the physical laws either. Maybe we need to either filter the data so that only what follows physical rules remains, or label each sample on whether it does.

On the modeling side, it's ultimately just the accuracy of the model's prediction. It's just like the hallucination of large language models. In the early days, we could do machine translation. Later, we could write novels, from short to long, with language models, and we didn't really care if the content was real or not. But nowadays, we use LLMs to do reasoning, to write code, and to solve math problems. This is when we require them to follow all the rules we define. And for videos, maybe Sora is at the level of a GPT-2 model. The earlier models were just doing the equivalent of machine translation; now it's a little more advanced, but it's still far from accurate enough to precisely mimic the real world. That's why in many samples we see it break physical laws, like giving a person six or four fingers, or four feet. This is fine for entertainment but not for robotics or embodied interaction. The speed at which we get there depends on a couple of factors: how much data we can end up collecting, how accurately we can label whether it is real or not, how much compute we can use, and how much improvement each generation of GPUs brings us.

Ethan: If you look at the improvement of Stable Diffusion, especially on fingers and faces, it went within about a year from generating imperfect fingers to perfect fingers. I can't even tell anymore whether the faces in current generated images are real or not. Initially, to identify whether an image was real, you could look for imperfections in the human body. But now, I need to look away from the main object in the image and into the background – for example, whether the stop signs in the background make sense. I guess in the near future, this kind of inconsistency problem will be solved really quickly. I think it's primarily a data problem. For example, to solve the finger problem, you need to collect a lot of data like this and balance it during training, or even just fine-tuning. In addition, once a model’s capability improves to another level, a lot of such problems automatically disappear.

Cindy (Moderator): OpenAI's CTO recently gave an interview in which she said they used “publicly available data and licensed data” to train Sora. What could be Sora’s data sources?

Ethan: The CTO didn’t really answer the question about the data. YouTube has a clear policy that its data cannot be used for machine learning, and it's unclear whether they used it or not. They have already admitted they use Shutterstock. Curating high-quality data is very common for large-scale training. For example, in GPT training, they had a small set of high-quality data, like books and Wikipedia, and they then trained a classifier to judge whether a piece of text was of good quality or not. For video, they have probably done the same. As for game engines, I'm not exactly sure, because OpenAI has a very small team, and there are already a lot of gaming videos on YouTube. I guess the difference is that once you want to apply the model to, for example, robotics, it will be helpful to have this kind of engine.
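
Here is a minimal sketch of the quality-filtering idea Ethan describes, applied to video clips instead of text. Everything in it (the `Clip` record, the linear scorer, the threshold) is a hypothetical stand-in for whatever classifier a real pipeline would train on a small, curated "high quality" set.

```python
from dataclasses import dataclass
import random

@dataclass
class Clip:
    caption: str
    features: list[float]   # e.g. embeddings from a pretrained video/text encoder

def quality_score(clip: Clip, weights: list[float]) -> float:
    # Hypothetical linear quality classifier: dot(features, weights).
    # In practice this would be a model trained on a curated high-quality subset,
    # the analogue of books/Wikipedia in the GPT example above.
    return sum(f * w for f, w in zip(clip.features, weights))

def filter_corpus(clips: list[Clip], weights: list[float], threshold: float) -> list[Clip]:
    # Keep only clips the classifier scores above the threshold.
    return [c for c in clips if quality_score(c, weights) > threshold]

random.seed(0)
corpus = [Clip(f"clip {i}", [random.random() for _ in range(4)]) for i in range(10)]
kept = filter_corpus(corpus, weights=[0.3, 0.3, 0.2, 0.2], threshold=0.5)
print(f"kept {len(kept)} of {len(corpus)} clips")
```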

Cindy (Moderator): What kind of training data is good quality?

Lijun: Right now, there are two paths. Maybe they will converge in the end, but for now, we can optimize either for visual appearance or for the physical rules the model learns. Say we have self-driving recordings: models trained on that data will have a really good sense of the real world, but if you used them for text-to-video, the results would be really boring. On the other hand, if we have a lot of videos, like movies, TV shows, or fun content people have posted, and we properly label them with rich text captions, we can use that to train a model very well for entertainment purposes, such as video creation and editing. For entertainment, we usually care about highly aesthetic data, which emphasizes style, appearance, and movement. People can use filters to enhance the data, and you can even hire artists to design the filters for you. These high-quality data are not abundant and can be used in the fine-tuning stage. My personal opinion, including when we developed VideoPoet, is that you can throw in whatever data you have to make the model learn general knowledge about the video domain and its connection with other modalities like audio, image, and text.

Cindy (Moderator): When do you think “the human feedback loop” will play a more important role in video generation?

Ethan: I think this will quickly become one of the biggest moats. It's going to align with the user's aesthetic. Collecting this kind of data is extremely expensive if you label it yourself, but with user interaction, you get this kind of data for free. ChatGPT is already generating something like a trillion tokens every day. Currently, they are not utilizing this data very well, but in the future, it will be extremely important, and it will be harder for other companies to catch up.

Lijun: This is a very important methodology to use in the development of video models, and it's actually very under-explored. If somebody is going to do video generation products like text-to-video, they can collect this human feedback just by spinning up the data flywheel. And it's dependent on what groups of users you have. They would have different preferences.

Ethan: Yeah, I think it's even possible for OpenAI to customize to each of the users. Currently, they have the memory function for GPT. It customizes your experience.

Lijun: I agree. And if we take one step further, we may say we have really precise profiles for each user. And then we just generate videos based on their profiles. This could replace the recommendation systems of social media entirely.

Cindy: During the training of Sora, infrastructure played an important role. For startups that don't have very good infrastructure, what are some solutions or alternatives?

Lijun: There are a lot of infrastructure requirements nowadays to rebuild something like Sora. As you mentioned, we need a couple of models to prepare the data. Even before that, we need a really scalable pre-processing pipeline to download, decode, and filter all of these video files, because they are huge. Adding one more dimension to images makes video the most expensive modality you can work with right now; you are going to be dealing with petabytes of data. Then you need models to label and select these data. For large language models, there are a couple of open-source variants you can use, such as Llama or Mistral, maybe for prompt rewriting. For captioning, you need models that understand the visual content, and if you are rich, you can just call the APIs and pay for them. And then the training itself, I think, is the most challenging part for infrastructure, because you will want to train with tens of thousands of GPUs. You will have to do sequence parallelism, among other types of parallelism, to train efficiently, and there are going to be a lot of challenges. Also, people have recently been complaining about the reliability of GPUs. You may only discover an error after the job has been running for a couple of hours, and when a single chip goes down, the entire job may fail if you do not handle it properly. You may need smart ways to reroute the compute from that chip to other chips, or you could just get rid of that machine and switch to another one.
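
A toy sketch of the pre-processing pipeline Lijun outlines (download, decode, filter) is below. The downloader is a stub and the filter thresholds are made up; the point is only the shape of the fan-out, which at petabyte scale would run on a distributed framework rather than one machine's process pool.

```python
from __future__ import annotations

from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass

@dataclass
class VideoRecord:
    url: str
    duration_s: float
    resolution: tuple[int, int]

def download_and_decode(url: str) -> VideoRecord:
    # Stub for the real downloader/decoder (e.g. an ffmpeg-based worker);
    # here we fabricate metadata so the sketch runs end to end.
    return VideoRecord(url=url, duration_s=42.0, resolution=(1280, 720))

def passes_filters(rec: VideoRecord) -> bool:
    # Cheap metadata filters applied before any expensive model-based labeling:
    # drop clips that are too short or below a minimum resolution.
    return rec.duration_s >= 2.0 and min(rec.resolution) >= 480

def process(url: str) -> VideoRecord | None:
    rec = download_and_decode(url)
    return rec if passes_filters(rec) else None

if __name__ == "__main__":
    urls = [f"https://example.com/video_{i}.mp4" for i in range(8)]
    # Fan the work out across processes; at scale this becomes a distributed job,
    # but the pipeline has the same shape: download -> decode -> filter -> keep.
    with ProcessPoolExecutor(max_workers=4) as pool:
        kept = [r for r in pool.map(process, urls) if r is not None]
    print(f"kept {len(kept)} of {len(urls)} videos")
```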

Ethan: Lijun also mentioned sequence parallelism, because people have speculated that Sora uses millions of tokens of context length. We actually already have solutions for this in Megatron: what we call sequence parallelism and context parallelism. Context parallelism slices the attention computation across multiple devices, kind of like how flash attention works.
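
Below is a single-process toy simulation of the context-parallel idea: the query sequence is sliced into chunks (standing in for devices), each chunk attends over the full keys and values, and the outputs are concatenated. Real context parallelism in Megatron exchanges K/V between devices with all-gather or ring schedules; this sketch only shows why slicing the sequence doesn't change the math.

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def context_parallel_attention(q, k, v, num_devices=4):
    # Slice the query sequence across "devices"; each slice attends over the
    # full K/V (which real systems would gather or stream between devices).
    outputs = [full_attention(q_chunk, k, v) for q_chunk in q.chunk(num_devices, dim=1)]
    return torch.cat(outputs, dim=1)

q = torch.randn(1, 1024, 64)   # (batch, seq_len, head_dim)
k = torch.randn(1, 1024, 64)
v = torch.randn(1, 1024, 64)

ref = full_attention(q, k, v)
cp = context_parallel_attention(q, k, v)
print(torch.allclose(ref, cp, atol=1e-5))   # True: each query's output is independent
```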

Cindy (Moderator): How big is Sora’s model, and what might be the inference time for generating a one-minute video?

Lijun: Maybe we can first go with inference time, since OpenAI people have also commented on that. I’d say a couple of minutes, maybe up to 10 minutes, which is better than we originally expected. On the other hand, it may be using a lot of GPUs in parallel, say 100 GPUs. As for the model, my personal opinion is that it's below 10 billion parameters, because they haven't really proved that it can scale to hundreds of billions. Even at the 10-billion scale, handling a sequence length of around 1 million tokens is already a very big challenge for the hardware. It's bigger than what people originally estimated on social media, say, maybe 3 billion. My estimate would be, let's say, between 7 and 10 billion.

Ethan: Even though it's a small model, it has a very large context length, which makes it not that small. From previous OpenAI talks, my impression is that they usually build the largest model they can physically fit into their machines: they choose the hidden dimensions up to the point where the memory is full, so I guess they would use the full capacity of their clusters. It's definitely going to be millions of tokens of context length and tens of billions of parameters.
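
To see why "millions of tokens" is plausible for a one-minute clip, here is a back-of-the-envelope count. All the numbers (frame rate, latent grid, patch size) are assumptions for illustration; OpenAI has not published Sora's actual tokenization.

```python
# All numbers below are assumptions for illustration, not published Sora specs.
fps = 30
seconds = 60
frames = fps * seconds                      # 1,800 frames for a one-minute clip

latent_h, latent_w = 128, 72                # assumed latent grid after VAE downsampling
patch = 2                                   # assumed spacetime patch size (2x2 spatial, 2 frames)

tokens_per_frame_pair = (latent_h // patch) * (latent_w // patch)
num_frame_groups = frames // patch
total_tokens = tokens_per_frame_pair * num_frame_groups

print(f"{total_tokens:,} spacetime patches")   # 2,073,600 under these assumptions
```

Under these assumed values, one minute of video is roughly two million spacetime patches, which is why sequence and context parallelism come up in the infrastructure discussion above.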

Cindy: Do you think Sora could be classified as a world model? And what's your own definition of a world model?

Ethan: The disagreement with Yann LeCun was primarily about whether it should have a decoder. He believes that you should not build a world model with a decoder, because there are millions or even billions of possible outcomes for the next timestep in the world; you cannot predict exactly what is going to happen in the future, and a decoder that tries to is going to fail miserably. He instead proposed JEPA, the Joint-Embedding Predictive Architecture. Instead of using a decoder, it uses an encoder to encode the state of the world and does the prediction in the hidden states. This avoids the millions of possibilities in the outcome: in the hidden states, it's no longer a one-to-one mapping, and one hidden state can map to multiple outcomes and multiple decoded results, so the model can better learn the world's states. That's his argument. I don't think it's necessary to discard the decoder; Sora has already demonstrated very good generation results.
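
Here is a deliberately simplified sketch of the "predict in embedding space, no decoder" idea Ethan summarizes. It is not Meta's actual JEPA code; the encoder, predictor, and stop-gradient target are toy stand-ins meant only to show where the loss lives (in latent space, not pixels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPredictorSketch(nn.Module):
    """Toy JEPA-style objective: encode two frames, predict the future embedding
    from the past one, and measure the error in embedding space. No pixel decoder."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def loss(self, frame_now, frame_next):
        z_now = self.encoder(frame_now)
        with torch.no_grad():                       # target embedding, no gradient
            z_next = self.encoder(frame_next)
        z_pred = self.predictor(z_now)              # predict the future *state*, not pixels
        return F.mse_loss(z_pred, z_next)

model = LatentPredictorSketch()
now, nxt = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
print(model.loss(now, nxt).item())
```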

Lijun: First of all, I think Sora is a big step toward a world model, but it's not there yet, for two reasons. One reason is, as we have discussed, that it does not follow physical rules very well. The other, more important reason is that it does not accept actions as inputs, so you cannot interact with it and control future outcomes in an autoregressive manner for infinitely long video generation. I think that's a critical property for a world model to be useful.

Going back to the discussion of Yann LeCun's JEPA architecture, I actually like some of his designs and, let's say, half of his arguments. I can agree with him that probabilistic models may not be very good as world models: when we try to learn that probability distribution, there is a very high chance that we go the correct way and follow all the physical rules, but there is always a chance that the sampling process selects some rare case that breaks the laws. You can never prevent that from happening. In the language modeling space it's less of an issue; if you just make a typo in text, it's okay. But for a world model, if you want it to simulate physics, I don't think that's the ideal approach. However, I also don't like JEPA, because I think you should go back to pixel space. Richard Feynman said that what you cannot create, you do not understand. You should be able to create those pixels. There is a lot of entropy, or uncertainty, when you want to create every pixel; or maybe you could discard the pixel representation and create all the particles in the world instead.

There is a lot of uncertainty given the things you observe. Even for video prediction problems, you cannot really predict accurately how a tree's leaves will move, because you don't know the wind; but if you have enough input, you can definitely predict precisely. JEPA incorporates a latent variable, essentially random noise, as its input, and this input is used to generate or predict the embedding within a distinct space. In my opinion, maybe you can just use that to predict the pixel-space output, or the real-space output of another modality, so long as you capture all the latent, unseen, and unknown dependencies in those latent variables instead of in the sampling procedure.

Cindy (Moderator): Do you think that to achieve a world model, we should adopt an end-to-end strategy? Maybe with the help of physics engines?

Lijun: I think we should not use a physics engine when we build a world model. The ultimate, superintelligence-style world model is one that discovers new physical rules and helps us understand the universe; maybe it could discover another law of relativity. To do that, we should not constrain it with all the physical priors we already have. Maybe, as a first step of verification, it should be able to rediscover the laws we already know, like Newton's laws in the low-speed regime. So I believe in a purely data-driven setup. Sora itself is also not an end-to-end model, because it has the encoders and decoders for pixel-to-latent and latent-to-pixel, and it also has the text encoders and even the prompt-rewriting engine with GPT. So it has many modules.

Cindy (Moderator): Do you think computer vision will be solved by a model like Sora or maybe its future version?

Ethan: Computer vision is already kind of solved by some standards. A model like Sora is able to maintain 3D consistency, which means the model can encode 3D objects in its hidden dimensions. It also has consistency across time, which kind of means it's able to do tracking by itself; that's why it can keep every object consistent across different frames. I think most computer vision tasks will be solved, probably within three years, by the same architecture.

Cindy (Moderator): Text-to-video generation is definitely going to have a lot of implications in other fields as well. How might the Sora type of video-generation model be used for autonomous driving?

Lijun: I think it's going to be a big boost for robotics and even self-driving. The action planner for these systems could be based on content generated by this type of video-generation model, or the model planning the actions could be the same one predicting future scenarios. Because the model has such a good understanding of these scenarios, it can choose the best action to proceed. That enables a lot of capabilities to help us in daily life.

Ethan: Some companies have already been training vision-language-action models, like GAIA-1 from Wayve, a UK autonomous driving company. They use large language models to do both understanding and planning. They have video-language pairs and ask the model how it should act based on the video, so the model is explainable: it can actually explain why it's taking a particular route, and it can also produce actions on top of the video.

🙌 Follow Us & Connect with Our Team!

X | LinkedIn

Full Event Recording