Will Sora Put Hollywood Out of Work?

2024-02-20 Source: 搜狐时尚

(About the author: Xu Siqing, founding partner and CEO of Alpha Startup Fund, is a serial entrepreneur who has founded three ventures. As a COO, he helped take a company public on NASDAQ in 2010. He served as an investment partner at Sinovation Ventures and as Chief Marketing Officer at Qihoo 360, and later joined WI Harper Group as Managing Director, overseeing investment and management operations in China. In 2015, he established the angel investment firm Alpha Startup Fund.

Xu has over 20 years of experience in the IT, Internet, and telecommunications industries, having established Microsoft's South China business and served as its first General Manager. He has also held positions such as General Manager of Data Business at China Netcom Co., Ltd. and Chief Marketing Officer at eLong Travel.

Xu was named to the "Forbes China Best Venture Capitalists Top 100" list in 2020, 2022, and 2023, and has received multiple awards, including 36Kr's "Most Popular Investor Among Entrepreneurs" in 2023. Xu holds a Bachelor's degree in modern mechanics from the University of Science and Technology of China and a Master's degree in material physics from the Chinese Academy of Sciences.)

BEIJING, February 19 (TMTPOST)—OpenAI released its first AI video generation model, Sora, last Thursday. This marks a historic milestone: the diffusion model, combined with OpenAI's highly successful Transformer, has achieved a breakthrough in visual generation comparable to that of large language models. A commercial revolution in visual generation will undoubtedly follow.

This article will discuss: 1. What is Sora and how it works; 2. The industrial opportunities Sora presents; 3. Will AI startups fail to survive?

What is Sora and how it works

Sora has redefined the standards for AI video generation models in several aspects:

· Sora increases video length from the previous five to fifteen seconds to a full minute, which fully meets the needs of short-video creation. According to OpenAI, generating videos longer than one minute is easy if necessary.

· Sora can generate multiple shots, and each shot maintains consistency in character roles and visual style.

· Sora can generate videos from text prompts and also supports video-to-video editing. It can generate high-quality images as well, and it can even stitch completely different videos into one coherent piece.

· Sora is a large visual model that combines a diffusion model with a Transformer. It has shown emergent capabilities, gaining a deeper understanding of, and ability to interact with, the real world, resembling an early form of a world model.

Sora can generate more realistic, highly consistent multi-shot long videos

OpenAI released dozens of sample videos, demonstrating the powerful capabilities of the Sora model.

Facial features such as pupils, eyelashes, and skin textures are so realistic that no flaws can be detected by the naked eye, an epic improvement in authenticity over previous AI-generated videos. The gap between AI videos and reality has become even harder to discern.

The drone's-eye view of Tokyo street scenes showcases Sora's strengths in complex settings, natural character movement, and more.

Driving on the mountain roads, the retro SUV looks highly realistic.

Sora can blend two input videos naturally, creating seamless transitions between clips with completely different themes and scenes.

How the Diffusion Model + Transformer Works

Inspired by the large-scale training of large language models, the OpenAI team took a similar approach. Just as large language models handle text data as tokens, they segmented visual data into patches: they first compressed the video into a lower-dimensional latent representation and then decomposed it into spacetime patches, which play a role similar to tokens in large language models and are used to train Sora.

Simply put, Sora has tokenized images and videos.
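As a rough illustration of the idea (a minimal sketch only, not OpenAI's code: the latent encoder is omitted and the patch sizes are assumed), the snippet below cuts a video tensor into fixed-size spacetime patches and flattens each patch into a token-like vector:

import numpy as np

def video_to_spacetime_patches(latent, t_patch=2, h_patch=4, w_patch=4):
    """Cut a latent video tensor of shape (T, H, W, C) into spacetime patches.

    Each patch spans t_patch frames and an h_patch x w_patch spatial window
    and is flattened into a single vector, loosely analogous to a text token.
    For simplicity, the dimensions are assumed to divide evenly.
    """
    T, H, W, C = latent.shape
    patches = (
        latent.reshape(T // t_patch, t_patch,
                       H // h_patch, h_patch,
                       W // w_patch, w_patch, C)
              .transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch indices first
              .reshape(-1, t_patch * h_patch * w_patch * C)
    )
    return patches  # shape: (num_patches, patch_dim)

# Toy example: a 16-frame, 32x32 "latent" video with 8 channels.
latent = np.random.randn(16, 32, 32, 8)
tokens = video_to_spacetime_patches(latent)
print(tokens.shape)  # (512, 256): 8*8*8 patches, each 2*4*4*8 values long

Because the patches form a simple sequence, videos of different lengths, resolutions, and aspect ratios can all be turned into the same kind of input, which is one reason this tokenization is attractive.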

Sora is a video model based on diffusion; more precisely, it is a Diffusion Transformer model. The Transformer has already proven its powerful capability across language, vision, and image generation.

It builds on research behind the DALL·E and GPT models: it adopts the re-captioning technique from DALL·E 3 and leverages the capabilities of GPT so that the model follows user text instructions more accurately when generating videos.

So, Sora is a large visual model that combines a diffusion model with a Transformer.
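To make that combination concrete, here is a heavily simplified sketch (assuming PyTorch, with text conditioning, the real noise schedule, and the latent encoder/decoder all omitted): a Transformer takes noisy patch tokens plus a timestep signal and learns to predict the noise, which is the core of a diffusion transformer.

import torch
import torch.nn as nn

class TinyDiffusionTransformer(nn.Module):
    """Minimal diffusion-transformer illustration: a Transformer encoder that,
    given noisy patch tokens and a timestep embedding, predicts the noise.
    This sketches the idea only; it is not Sora's architecture."""

    def __init__(self, patch_dim=256, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(patch_dim, d_model)
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                        nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, patch_dim)

    def forward(self, noisy_tokens, t):
        # noisy_tokens: (batch, num_patches, patch_dim); t: (batch,) in [0, 1]
        h = self.in_proj(noisy_tokens) + self.time_embed(t[:, None]).unsqueeze(1)
        return self.out_proj(self.blocks(h))   # predicted noise, one per patch

# One illustrative training step: corrupt clean patch tokens, predict the noise back.
model = TinyDiffusionTransformer()
clean = torch.randn(2, 512, 256)              # e.g. patches from the sketch above
t = torch.rand(2)
noise = torch.randn_like(clean)
noisy = (1 - t)[:, None, None] * clean + t[:, None, None] * noise   # toy noise schedule
loss = nn.functional.mse_loss(model(noisy, t), noise)
loss.backward()

At generation time the process runs in reverse: starting from pure noise, the model repeatedly removes its predicted noise until clean patch tokens remain, and a decoder turns those back into video frames.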

In addition to generating videos from text instructions, the model can also animate existing static images, bringing their content to life with targeted, detailed motion. It can likewise extend existing videos or fill in missing frames.

The emergence of Sora has further widened the gap between China and the U.S. in the field of AI.

Major flaws to be addressed

However, despite the significant improvements in technology and performance, Sora still has many limitations, particularly in understanding complex scenes that involve physical principles, cause-and-effect relationships, spatial detail, and the passage of time. For example, it does not model a glass shattering well.

Also, the flame shows no change before and after a candle is blown out.

It also made an error in the direction of a person running on a treadmill.

OpenAI has only released demonstration videos. Sora's release also sparked concerns about the misuse of video generation technology, so the company has not made Sora publicly available; instead, it carefully selected a group of "trusted" professionals for testing.

Industrial opportunities Sora presents

Firstly, this marks a milestone in technological advancement.

Secondly, in video applications, a powerful demo does not equal practical utility. If commercialization requires a score of 100 (60 points for technology plus 40 points for the usage scenario), humans can reach about 90 points, while Sora can reach 60, or at most 75. The remaining path to commercialization still has to be covered by manual effort or by a combination of technological and business innovation.

First, controllability. Whether in commercial or creative scenarios, videos need to follow human intent or objective laws, which poses a huge challenge for Sora.

For example, although Sora can generate beautiful, flashy motion, a specific scenario such as a rubber ball bouncing repeatedly off the ground requires the support of an explicit physical model, which goes beyond what current diffusion+transformer technology can provide.
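To make the contrast concrete, below is a minimal sketch in plain Python of the kind of explicit physical model a bouncing-ball scene relies on (the gravity value and restitution coefficient are assumed for illustration). A diffusion model has no such equations built in; it can only imitate their visual consequences statistically.

# Minimal physics sketch for a ball bouncing on the ground.
# G and RESTITUTION are assumed illustrative values, not Sora parameters.
G = 9.81           # gravitational acceleration, m/s^2
RESTITUTION = 0.8  # fraction of speed kept after each bounce
DT = 0.01          # simulation timestep, s

def simulate_bounces(height=2.0, velocity=0.0, duration=3.0):
    """Return (time, height) samples for a ball dropped from `height` metres."""
    trajectory = []
    t = 0.0
    while t < duration:
        velocity -= G * DT            # gravity pulls the ball downward
        height += velocity * DT
        if height <= 0.0:             # ground contact: reflect and damp the velocity
            height = 0.0
            velocity = -velocity * RESTITUTION
        trajectory.append((round(t, 2), round(height, 3)))
        t += DT
    return trajectory

if __name__ == "__main__":
    for t, h in simulate_bounces()[::25]:   # print a sample every 0.25 s
        print(f"t={t:.2f}s  height={h:.3f}m")

Each bounce loses a fixed fraction of its speed, so successive bounces get lower in a lawful way; a purely statistical video model is not guaranteed to respect that regularity.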

Second, prompting remains a technical challenge. In the visual field, it is generally difficult for non-professionals to use visual generation tools effectively; turning laymen into proficient users will require both training and further technological breakthroughs.
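One common workaround, in the spirit of the re-captioning approach mentioned above, is to expand a layman's short request into a detailed, structured prompt before it reaches the video model. The sketch below only illustrates the idea with a hand-written template; the function and its fields are hypothetical, and in practice a language model would do the expansion.

def expand_prompt(subject: str, action: str, setting: str,
                  camera: str = "slow dolly shot",
                  style: str = "photorealistic") -> str:
    """Turn a few plain-language fields into a detailed video prompt.

    A hand-written template for illustration only; real systems typically
    use a language model to rewrite the user's short prompt.
    """
    return (
        f"A {style} video of {subject} {action} in {setting}. "
        f"{camera.capitalize()}, natural lighting, consistent subject "
        f"across all shots, smooth motion, no abrupt scene changes."
    )

print(expand_prompt("a retro SUV", "driving along a winding road", "misty mountains"))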

As such, there is still ample room for improvement in creation grounded in practical scenarios. For work above the 60-to-75-point range, the opportunities lie in scenario innovation.

The opportunities for scenario innovation belong to creators who understand the scenarios and the models.

Those who have watched the TV drama "Blossoms" know that, for famous directors like Wong Kar-wai, technological innovation tools can at most improve the efficiency of presenting specific scenes. Characters like Bao Zong, Lingzi, and Ye Shu cannot be replaced by machines in the short term.

What we may expect is not AI making filmmakers unemployed, but empowering filmmakers to create better works.

Will AI Startups at Home and Abroad Fail to Survive?

First of all, winner-takes-all does not apply in every case. A notable feature of the U.S. business ecosystem is that top-tier companies build platforms, second-tier companies produce full product lines, and third-tier companies focus on winning customers.

OpenAI's Sora marks significant engineering progress, somewhat akin to industry taking the lead in work once driven by state-funded scientific research: the breakthrough was realized first in the industrial sector, not academia, and there is still some distance to go before commercialization.

Leading companies need to secure their position in key areas, make technological breakthroughs, build platforms, and develop vertical applications, but they place more emphasis on attracting a broad base of developers than on spreading themselves thin trying to cover every application.

Therefore, there is a lot to be achieved beyond 60 points. This can be clearly seen by looking at the thousands of applications on Salesforce.

Secondly, according to OpenAI's paper, the path to supporting 60-second videos is clearly articulated, saving many startups tens of millions in exploration costs, while also providing entrepreneurs with a great deal of imaginative space.

What if only 15 seconds are needed? What if the subject of the video needs to be more controllable, or its path through the video needs to be directed? Could there be other options? Could the diffusion Transformer be used in a better way? Again, the capability of the model sets the ceiling for a startup team, and above 60 points, the applications a model supports will set teams apart. Startups that understand both models and applications have great opportunities.

In the U.S. market, large companies playing catch-up like to narrow the gap through mergers and acquisitions, and small teams that move fast and start early command high valuations when they are acquired.

In China, mergers and acquisitions are less active, and big tech companies prefer to enter the field and do everything themselves. But with OpenAI moving so fast and so many opportunities emerging on such a large track, it is hard for big companies not to reconsider, in case a rival beats them to the punch.

Once again, this is a grand arena where everyone can play on an equal footing.

Admittedly, behind the large video models is the super-linear growth of training and inference computing power. The rise of demand, coupled with a greater need for computing power, infrastructure, and tools, has generated more new opportunities than ever before for Chinese and U.S. entrepreneurs.

(This article was first published on the TMTPost App)
