I have compiled some common technical questions about Sora, hoping to help everyone understand it better. I have roughly organized the questions into four categories:
- Sora’s technical knowledge
- Sora product-related questions
- Sora’s value and applications
- Gossip about Sora
Note that the answers here are all my personal opinions, and some are based on discussions from the original post thread. Many answers may not be accurate and are for reference only. Corrections and differing opinions are welcome.
Sora’s technical knowledge
What is Sora? What can it do?
Simply put, Sora is a model that generates videos of up to 60 seconds from text. It can also generate still images, since an image is essentially a single-frame video.
What upgrades does Sora have compared to previous AI video generation tools? How does it differ from other AI video generation tools on the market, such as Runway, Pika, and SVD?
The reason Sora has attracted so much attention is that the quality of the videos it generates is far higher than before. Not only can videos last up to 60 seconds, they also support camera cuts, keep characters and backgrounds consistent, and have very high image quality.
Pika is based on a diffusion model, which trains by progressively adding noise to images and videos until they become meaningless noise, then learns to reverse the process, generating images and videos by denoising from pure noise. It has two main modes: one expands a key-frame image into a video (for example, style transfer on existing footage); the other is trained directly on video, but due to GPU memory limits it can only train on, and generate, a few seconds of video at a specific resolution at a time.
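To make the "add noise, then learn to remove it" idea concrete, here is a minimal sketch of the standard DDPM forward (noising) process. This is illustrative only, not Pika's actual code, and all names and numbers are made up:

```python
import numpy as np

def forward_diffuse(x0, t, num_steps=1000):
    """Noise a clean image x0 (values in [-1, 1]) up to diffusion step t.
    DDPM closed form: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    betas = np.linspace(1e-4, 0.02, num_steps)   # linear noise schedule
    alpha_bar = np.cumprod(1.0 - betas)[t]       # remaining signal fraction at step t
    noise = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

x0 = np.random.uniform(-1, 1, size=(64, 64, 3))  # stand-in for a real image
x_noisy = forward_diffuse(x0, t=999)             # near step 1000: almost pure noise
```

Generation is the reverse: a network is trained to undo this process step by step, starting from pure noise.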
LLMs such as ChatGPT are Transformer models that generate text by predicting the next token. A token can be understood as a word or a fragment of a word.
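For a concrete feel of what a token is, here is a quick illustration using tiktoken, OpenAI's open-source tokenizer (the encoding name below is the one used by GPT-3.5/GPT-4-era models):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 tokenizer
ids = enc.encode("Sora generates videos")   # text -> list of integer token ids
pieces = [enc.decode([i]) for i in ids]     # each id maps back to a word or word fragment
print(ids, pieces)
```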
Sora is based on the Diffusion Transformer (DiT), which combines a diffusion model with a Transformer. However, what it predicts and generates are not text tokens but "spacetime patches": small chunks of video, each covering a small spatial region across a few frames (less than a second).
The main advantages are that training is no longer constrained to a fixed video size (and hence by GPU memory), generation is more diverse, and spacetime patches can be combined flexibly, as the sketch below illustrates.
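Here is a rough sketch of what "cutting a video into spacetime patches" might look like: the code splits a raw pixel tensor into patches of a few frames by a small spatial window and flattens each one into a vector, the video analogue of a text token. The patch sizes are invented, and the real model reportedly patchifies a compressed latent representation rather than raw pixels:

```python
import numpy as np

def spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a (T, H, W, C) video into pt-frame x ph x pw-pixel patches,
    returning one flattened vector per patch."""
    T, H, W, C = video.shape
    video = video[: T - T % pt, : H - H % ph, : W - W % pw]  # trim to multiples
    t, h, w = video.shape[0] // pt, video.shape[1] // ph, video.shape[2] // pw
    patches = video.reshape(t, pt, h, ph, w, pw, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)  # group each patch's dims together
    return patches.reshape(t * h * w, -1)              # (num_patches, patch_dim)

video = np.random.rand(16, 128, 128, 3)  # 16 frames of 128x128 RGB
tokens = spacetime_patches(video)
print(tokens.shape)                       # (256, 3072): 256 "video tokens"
```

Because any resolution, aspect ratio, or duration just changes the number of patches, the Transformer can train on heterogeneous videos instead of one fixed size.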
Cost of use: Sora can now generate 60-second videos. What does a 60-second video cost? What are the computing power requirements?
For reference: DALL-E 3 HD images are priced at $0.08 per image, and Runway Gen-2 at $0.05 per second.
OpenAI has not released any cost data for Sora. Pure speculation: inference may require around 8x A100 GPUs, generation speed may be on the order of one second of video per minute of compute, and half an hour of compute may cost roughly $10.
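Making the back-of-envelope arithmetic explicit (every number here is an assumption, not an OpenAI figure):

```python
# Speculative cost estimate for one 60-second clip.
gpus = 8                 # assumed: inference runs on 8x A100
usd_per_gpu_hour = 2.5   # assumed cloud rental rate (~$20/hour for the node,
                         # matching the "~$10 per half hour" guess above)
video_seconds = 60
compute_minutes = video_seconds * 1.0  # assumed: ~1 minute of compute per video second

compute_hours = compute_minutes / 60
cost = compute_hours * gpus * usd_per_gpu_hour
print(f"~{compute_hours:.1f} h on {gpus} GPUs -> ~${cost:.0f}")  # ~1.0 h -> ~$20
```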
Can it generate music (audio)? If not, what are the difficulties?
It should be possible in the future, but not yet, because:
- Different sounds must be produced depending on the environment, the types of objects, collisions between objects, and positions within the video
- Multiple sound sources need to be layered together
- Music must not only be high quality but also fit the scenes in the video
- Character dialogue must be aligned with each character's position, mouth shape, and expression
Sora product-related questions
Does it need to be deployed locally, or can it be used in some other way? When will it be available for commercial use?
There is no need to deploy it locally; it is expected to be offered in two ways: ChatGPT integration and API calls. However, generating videos is expensive and slow, so usage may be rate-limited or tied to a higher-tier subscription.
Access is expected to open up gradually within three to six months.
Will the same request at different times produce the same video? Can it support follow-up edits or the input of more specific constraints? Is the current model architecture capable of supporting this?
The same prompt will not produce the same video every time, but the same random seed should make the outputs similar (see the seeding sketch below);
Sora supports image-to-video and video-to-video, but whether characters can stay consistent across generations cannot be judged until the product is released.
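Sora's API is not public yet, so as an analogy, here is how a fixed seed works in an open diffusion pipeline (Hugging Face diffusers; the model checkpoint is just a common example):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Same prompt + same seed -> (near-)identical image; drop the generator
# and each call starts from fresh random noise, so the output changes.
generator = torch.Generator("cuda").manual_seed(42)
image = pipe("a corgi riding a skateboard", generator=generator).images[0]
image.save("corgi_seed42.png")
```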
When can longer videos be generated, such as 30 minutes, 60 minutes or even longer?
The longer the generated video, the higher the GPU memory requirements. Still, at the current pace of technological progress, an optimistic estimate is that 5-10 minutes should be reachable within a year, while 30-60 minutes may take 3-5 years.
Who owns the copyright of the generated video?
By analogy with the rules for image generation, copyright should belong to the creator, provided the generated work itself does not infringe on others' rights.
Virtual vs. reality: How can we tell which videos were filmed and which were made by Sora? What will still be trustworthy in the future? The deepfake problem: will fraud become easier, and how do we fight it?
Current Sora videos carry watermarks, and detection tools should appear in the future.
In addition, careful viewing reveals logical flaws, such as ants with only four legs or human hands that deform.
In fact, we have been through this before: photos can be unreal, TV can be unreal, movies can be unreal, and people's ability to discern will improve in step.
Faking and detecting fakes will be a long-running battle of offense and defense.
Sora’s next development trend?
- Cost reduction (faster and cheaper)
- Quality improvement (duration, image quality, camera switching, consistency, adherence to physical laws)
- New capabilities: integration of sound and GPT, full multimodality
Can it be used to make cartoons?
Short films are totally feasible; complex scenes and longer works are not yet, but we can look forward to them.
Sora’s value and applications
What application scenarios does Sora have? How practical is it? Commercial application value?
I summarized the value and application of Sora from four aspects:
- First, it amplifies ordinary people's ability to express themselves. Zhang Xiaolong said that cars are an extension of our legs and ChatGPT is an extension of our hands; Sora is a comprehensive extension of our expression, the legendary "mouth substitute". This means we can express our ideas better, no longer limited by our writing, painting, photography, video editing, or even speaking skills.
- Sora is a low-cost video tool. It will greatly reduce the cost of video production, which means more people can produce videos more cheaply, a big plus for video creators.
- A new human-computer interaction method: dynamically generated video. Sora has demonstrated the ability to generate Minecraft-like game footage. Perhaps in the future we can use Sora to dynamically generate game plots, quests, and scenes. We could also have Sora turn news and articles into videos instead of reading them.
- Emotional sustenance. Generating videos of deceased loved ones to preserve their memory; digital companions.
What is Sora’s logic for making money?
It depends on the value created around Sora:
- Emotional value: selling courses that relieve anxiety, providing entertainment and emotional sustenance
- Artistic value: micro-films
- Content value: derivative re-creations of novels, selling stock footage, teaching, storytelling, game generation, advertising
- Ecosystem value: prompts, easier-to-use tools, ways around restrictions
- Cost reduction and efficiency: quick MVPs to validate ideas, advertising, e-commerce, movie storyboards
How can ordinary people make good use of it? How can Sora be used for a side business?
- Use it: learn how to use it, know what it can do and where its boundaries are
- Choose a direction that suits you and prepare relevant materials or projects in advance
- Technical folks can start preparing products and tools: collecting prompts and building on top of the API
Gossip about Sora
Does the name really come from "Sorairo Days", the opening theme of Tengen Toppa Gurren Lagann?
I’m leaning toward yes.
Is the current buzz conceptual hype (for financing and stock prices), or is it genuinely useful?
It is genuinely useful and can be applied immediately to short videos; see OpenAI's account on TikTok, where the generated videos are realistic enough to pass for real footage.
Have you seen or heard some exaggerated and unrealistic statements on the Internet?
- One of "Sora's" key raw materials, maleimide resin, comes from a company in Mianyang, Sichuan
- Sora understands physics.
- Sora is connected to the game engine
- Sora is a key milestone for AGI, which can be achieved within a few years
How competitive is Sora among the world's top companies? How is China developing in this field? Which Chinese companies are working on this? What are the gaps between China and Europe/the United States?
OpenAI has invested more than a year in this and leads the industry by half a year to a year, or even more, as reflected in:
- Leading technology: the methods have not yet been made public, and it will take other companies time to reproduce them
- A large-model advantage: OpenAI can use its most advanced models to assist training, for example by automatically generating high-quality video captions

China should be able to catch up quickly: the talent, data, and computing power are there, but only a few large companies have the opportunity, because the bar for talent, data, and compute is very high.
It is unclear whether Chinese companies are already working in this exact direction, but ByteDance, Alibaba, Tencent, and Baidu all have deep accumulation in AI video.
The gap between China and Europe/the United States lies mainly in anticipating the direction of AI technology, but this is not unique to China: right now, every other company in the world is following OpenAI's lead. In addition, China's computing power is not fully self-sufficient.
A new industrial revolution? Some netizens point out that in just a few years, the hyped "epoch-making" technologies have included Web3, blockchain, the metaverse, Google Glass, Boston Dynamics robots, Vision Pro, ChatGPT, etc. Is this one confirmed to be epoch-making?
It depends on how you define it. For text-to-video, it is definitely epoch-making: a true "GPT moment" for text-to-video generation, continuing the line of:
- ChatGPT for text generation
- Stable Diffusion, Midjourney, and DALL-E for image generation
- Sora for text-to-video generation
How does Sora feel on the ground in Silicon Valley? What is the real response in the industry? What is the current mentality of entrepreneurs and investors in the AI video generation space, and how will they respond?
- The response has been enthusiastic, with mostly positive reviews
- Startups on the pure diffusion route are expected to find it harder to raise investment
- Entrepreneurs need to rethink their direction, e.g. moving to video editing, or pivoting to applications built on top of Sora's interface
What does it have to do with chips?
Video generation will remain popular for the next few years and will keep demanding enormous computing power, which means a lot of GPUs. However, NVIDIA will not stay the only GPU player; more companies should enter the market, so supply will grow, prices will become more reasonable, and performance will improve.