Getting to know Sora
Following ChatGPT 3.5 and GPT-4.0 Plus, OpenAI officially announced Sora, a large text-to-video model, on February 16, 2024:
Official website: openai.com/sora
According to OpenAI’s official website, the model can generate one-minute videos from text. The videos can contain complex scenes with multiple characters, specific types of motion, and accurate details of subjects and backgrounds.
How Sora works
The task Sora solves is easy to understand: given a piece of text, the model generates a corresponding video. In short, it is text-to-video (t2v). t2v itself is not a new problem, and many companies have been working on t2v models, but the videos produced by existing t2v models are generally of poor quality and fall short of industrial application. Before Sora appeared, the general consensus was that t2v was a hard problem and that an industrial-grade (that is, truly practical) t2v model would not arrive any time soon. OpenAI has proved everyone wrong again: the release of Sora means that day has come.
Based on the Transformer architecture
A brief description of Sora's training and modeling process: the original video is encoded into a latent space by a visual encoder, forming latent spacetime patches. These latent spacetime patches (combined with text information) are fed to a transformer, which performs diffusion training and generation [2, 3, 4]; the generated latent spacetime patches are then decoded back into pixel space by a visual decoder. The whole pipeline is therefore: visual encoding -> latent diffusion with a diffusion transformer (DiT) [4] -> visual decoding.
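To make the pipeline concrete, here is a minimal toy sketch of the encode -> patchify -> diffusion-transformer flow in PyTorch. All module sizes, layer choices, and names are illustrative assumptions for explanation only, not OpenAI's actual architecture, and the visual decoder is omitted for brevity:

```python
# Toy sketch of the Sora-style pipeline described above:
# encode video -> spacetime latent patches -> diffusion transformer.
# Sizes and module choices are illustrative assumptions, not OpenAI's design.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Compresses raw video (B, T, C, H, W) into a smaller latent video."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        # A single 3D conv stands in for a full VAE encoder; downsamples time and space by 2x.
        self.net = nn.Conv3d(in_ch, latent_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, video):                      # (B, T, C, H, W)
        return self.net(video.transpose(1, 2))     # -> (B, latent_ch, T/2, H/2, W/2)

def patchify(latent, p=4):
    """Cut the latent video into flattened spacetime patches (tokens)."""
    B, C, T, H, W = latent.shape
    x = latent.unfold(3, p, p).unfold(4, p, p)          # split H and W into p x p patches
    x = x.permute(0, 2, 3, 4, 1, 5, 6).reshape(B, -1, C * p * p)
    return x                                            # (B, num_patches, patch_dim)

class DiffusionTransformer(nn.Module):
    """Toy DiT: predicts the noise added to each latent patch, conditioned on text."""
    def __init__(self, patch_dim, text_dim=64, d_model=128):
        super().__init__()
        self.in_proj = nn.Linear(patch_dim + text_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, patch_dim)

    def forward(self, noisy_patches, text_emb):
        text = text_emb.unsqueeze(1).expand(-1, noisy_patches.size(1), -1)
        h = self.in_proj(torch.cat([noisy_patches, text], dim=-1))
        return self.out_proj(self.blocks(h))             # predicted noise per patch

# One illustrative training step on random data.
video = torch.randn(2, 8, 3, 32, 32)                      # tiny fake video batch
text_emb = torch.randn(2, 64)                             # fake text embeddings
patches = patchify(VisualEncoder()(video))
noise = torch.randn_like(patches)
noisy = patches + noise                                   # simplified "add noise" step
model = DiffusionTransformer(patch_dim=patches.size(-1))
loss = nn.functional.mse_loss(model(noisy, text_emb), noise)
loss.backward()
```

The key point the sketch illustrates is that, after encoding, a video becomes a plain sequence of patch tokens, which a standard transformer can consume like text tokens.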
Diffusion models and training stability
Sora adopts the diffusion-model approach, which offers better generation diversity and training stability than traditional GAN models. A diffusion model generates video by gradually removing noise, which effectively improves the quality of the generated videos. By adopting a diffusion model, Sora can also generate more realistic video scenes.
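The "gradually removing noise" idea can be sketched in a few lines. The sampler below is a deliberately simplified stand-in (a crude linear schedule and a dummy denoiser), not the sampler Sora actually uses:

```python
# Bare-bones sketch of iterative denoising: start from pure noise and repeatedly
# subtract the model's noise prediction. The schedule here is an assumption.
import torch

def sample(denoiser, text_emb, shape, steps=50):
    x = torch.randn(shape)                         # start from pure Gaussian noise
    for _ in range(steps):
        predicted_noise = denoiser(x, text_emb)    # model guesses the noise in x
        x = x - predicted_noise / steps            # remove a small fraction each step
    return x                                       # denoised latent patches

# Tiny usage example with a dummy denoiser that just echoes a scaled input.
dummy = lambda x, text: 0.1 * x
latents = sample(dummy, text_emb=None, shape=(1, 64, 64))
```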
Sora has the flexibility to use videos of different durations, resolutions and aspect ratios
OpenAI found that most previous methods trained models on fixed-size videos (for example, 4-second 256×256 clips), which differ greatly from the arbitrary durations and aspect ratios of real-world video, and that training on videos at their original size is more effective. Thanks to the transformer structure Sora adopts, it can take any number of visual patches (initially noise patches) as input and generate videos of any size.
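The practical consequence is that a video of any duration or aspect ratio simply turns into a different number of patch tokens. The toy calculation below (reusing the illustrative patchify logic from the earlier sketch, with made-up latent sizes) shows how the token count changes with clip shape:

```python
# Different clip shapes -> different numbers of spacetime patch tokens.
# Latent sizes below are made up for illustration.
import torch

def count_patches(T, H, W, C=4, p=4):
    latent = torch.randn(1, C, T, H, W)
    x = latent.unfold(3, p, p).unfold(4, p, p)
    x = x.permute(0, 2, 3, 4, 1, 5, 6).reshape(1, -1, C * p * p)
    return x.shape[1]

print(count_patches(T=4, H=16, W=16))   # short square clip  -> 64 tokens
print(count_patches(T=8, H=16, W=28))   # longer, wider clip -> 224 tokens
```

A transformer handles both cases with the same weights, because it only sees a variable-length sequence of tokens.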
Data processing and compression to generate video
Generating video requires processing large amounts of data. To address this, Sora relies on data processing and compression: by processing and compressing video data, it reduces storage requirements while maintaining video quality.
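As a rough back-of-the-envelope illustration of why compressing into a latent space matters, assume 8x spatial and 4x temporal downsampling with 4 latent channels (these ratios are assumptions, not published numbers):

```python
# Toy arithmetic: values per minute of raw 1080p RGB video vs. a compressed latent.
frames, H, W, rgb_ch = 60 * 24, 1080, 1920, 3                 # one minute at 24 fps
raw_values = frames * H * W * rgb_ch
latent_values = (frames // 4) * (H // 8) * (W // 8) * 4
print(raw_values, latent_values, raw_values / latent_values)   # -> about 192x fewer values
```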
Video quality and fidelity
While generating videos, Sora focuses on maintaining video quality and fidelity. By adopting the Transformer architecture and a diffusion model, it can generate coherent, highly realistic video scenes, which gives it broad potential in application fields such as film and television production and game development.
Reference link: www.openai.com/research/so…
Sora showcase
Case 1:
1.Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
Case 2:
2.Prompt: Several giant wooly mammoths approaching treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.
Case 3:
3.Prompt: The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from it’s tires, the sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The trees on either side of the road are redwoods, with patches of greenery scattered throughout. The car is seen from the rear following the curve with ease, making it seem as if it is on a rugged drive through the rugged terrain. The dirt road itself is surrounded by steep hills and mountains, with a clear blue sky above with wispy clouds.
Sora application prospects
- Video creation: Users can generate high-quality videos based on text;
- Extended video: Based on a given video or picture, you can continue to extend the video forward or backward;
- Video-to-video editing: For example, applying SDEdit [7] to Sora can easily change the style of the original video;
- Video joining/transitions: two videos can be cleverly blended together; Sora gradually interpolates between the two input videos, creating seamless transitions between videos with completely different themes and scene compositions (see the sketch after this list);
- Text-to-image: an image can be regarded as a single-frame video, so Sora can also perform text-to-image generation.
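As referenced in the transition item above, one simple way to picture "interpolating between two input videos" is blending their latent representations before decoding. The linear blend below is only an illustrative assumption, not OpenAI's actual interpolation scheme:

```python
# Hedged sketch of a latent-space transition between two clips.
import torch

def blend_latents(latent_a, latent_b, num_frames=16):
    """Produce a sequence of latents that morphs from clip A to clip B."""
    frames = []
    for i in range(num_frames):
        w = i / (num_frames - 1)                 # blend weight from 0.0 to 1.0
        frames.append((1 - w) * latent_a + w * latent_b)
    return torch.stack(frames)

a = torch.randn(64, 64)                          # toy latent for the first clip
b = torch.randn(64, 64)                          # toy latent for the second clip
transition = blend_latents(a, b)                 # (16, 64, 64) morphing sequence
print(transition.shape)
```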
It is foreseeable that once Sora is officially opened up, a large number of Sora-generated videos will appear on short-video platforms, and many ordinary people will make their first pot of gold from it.
The industry shock brought by Sora
- Short video content creation may enter a new era: Sora can provide rich video materials;
- Video cutting and editing: Sora has relevant application capabilities;
- More realistic digital people: users can get their own “ideal type”;
- Entertainment: generate videos from images with one click;
- Game industry: Game engines are challenged by Sora;
- Computer graphics: a future that may no longer exist.
Sora usage tutorial
Tips:
At present, OpenAI has not yet opened Sora for public use; it is still in internal testing. Judging from the rollout of the text-to-image model DALL·E, it will most likely be made available to paying ChatGPT Plus users first, so you may need to register for or upgrade to ChatGPT Plus. You can read this tutorial: Upgrade ChatGPT Plus with one click (Hello Rice tutorial).
Preparations before using Sora
Before starting, make sure you have an OpenAI account with access to Sora. Prepare a text description of the video you want to create; remember, the more detailed the better.
(Portal for friends who do not have a ChatGPT account: www.chatgptbom.com/new-chatgpt… )
Step 1 of using Sora: Text description
1. Describe the video content: First, give a clear description of what you want the video to show, including setting, characters, action, and overall tone. The more detail you provide, the better Sora can understand your vision.
2. Submit the text description and custom settings: After completing the text description, click the “Generate Video” button. Sora will begin processing your request, which may take a few minutes.
Step 2 of using Sora: Generate video
Preview and edit videos: After the video is generated, you can preview it. You also have the flexibility to edit and change the generated scene if needed to ensure the final video meets your expectations.
Sora FAQ
The motion generated by current video-generation models is generally not very good. The simplest example is a person walking: most models cannot generate a coherent, long, physically reasonable walking sequence. Sora's results are significantly ahead of previous models in coherence and plausibility. So what exactly produced this result? Was the model scaled up, and to what size? Or was it the data collection and cleaning, and to what extent?
Sora also cannot accurately model many basic physical interactions, such as glass shattering. Other interactions, such as eating food, are not always predicted correctly either. OpenAI lists other common failure modes on its landing page, such as inconsistencies in long samples or objects appearing out of thin air.