OpenAI, the organization behind ChatGPT and DALL·E 3, has revealed its latest creation: SORA, a text-to-video model. Named after the Japanese word for "sky," SORA can generate videos up to one minute long from short descriptive prompts. Beyond creating original videos, it can extend existing videos forward or backward in time and animate static images with close attention to detail.
SORA is a diffusion model: it starts with a video that resembles static noise and gradually transforms it by removing the noise over many steps. It is built on a transformer architecture, similar to the GPT models, which lets it scale well and handle visual data of varying resolutions, durations, and aspect ratios. The model also uses the recaptioning technique from DALL·E 3, in which highly descriptive captions are generated for the visual training data, so that generated videos follow the user's prompt more faithfully.
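The denoising loop described above can be illustrated with a toy sketch. This is not OpenAI's implementation, whose sampler and network are not described here. It assumes a deterministic DDIM-style update, a simple linear noise schedule, and a hypothetical "oracle" noise predictor that already knows the clean video and stands in for the trained transformer. All function names and parameters are invented for illustration.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.2):
    """Return alpha_bar[0..num_steps]; alpha_bar[0] = 1 means 'no noise'."""
    alpha_bars = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def oracle_noise_predictor(clean):
    """Hypothetical stand-in for a trained network: it 'cheats' by knowing
    the clean video, so it can report exactly which noise is present."""
    def predict(x_t, alpha_bar_t):
        signal = math.sqrt(alpha_bar_t)
        noise = math.sqrt(1.0 - alpha_bar_t)
        return [(x - signal * c) / noise for x, c in zip(x_t, clean)]
    return predict


def denoise(x_start, predict_noise, alpha_bars):
    """Walk from the noisiest step down to step 0, removing noise each time."""
    x = list(x_start)
    for t in range(len(alpha_bars) - 1, 0, -1):
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = predict_noise(x, ab_t)
        # Estimate the clean sample implied by the predicted noise.
        x0_hat = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                  for xi, e in zip(x, eps)]
        # Move to the next, lower noise level (deterministic DDIM-style step).
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
             for x0, e in zip(x0_hat, eps)]
    return x


# A tiny "video": 4 frames of 8x8 pixels, flattened into one list.
frames, height, width = 4, 8, 8
clean_video = [math.sin(0.3 * i) for i in range(frames * height * width)]

rng = random.Random(0)
noisy_start = [rng.gauss(0.0, 1.0) for _ in clean_video]  # pure static

alpha_bars = make_alpha_bars(num_steps=50)
result = denoise(noisy_start, oracle_noise_predictor(clean_video), alpha_bars)

max_error = max(abs(r - c) for r, c in zip(result, clean_video))
print(f"max error after denoising: {max_error:.2e}")
```

Because the oracle predicts the noise perfectly, the loop recovers the clean video almost exactly. A real model has to learn that prediction from training data and is guided by the text prompt, so its output is a plausible sample rather than a copy of a known target.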
With SORA, users can generate videos featuring complex scenes, multiple characters, specific types of motion, and precise subject and background details. What's more, the model can create multiple shots within a single generated video while maintaining consistency in characters and visual style. OpenAI showcased examples of SORA's capabilities, including woolly mammoths in a snowy meadow, a stylish woman strolling down a neon-lit Tokyo street, a space man wearing a knitted helmet in a movie trailer, and a close-up of a color-changing chameleon.
OpenAI views SORA as a stepping stone toward models that can understand and simulate the real world, a capability it considers an important milestone on the path to artificial general intelligence (AGI). However, the model has limitations. It may struggle to simulate the physics of complex scenes accurately and may confuse spatial details of a prompt, such as mixing up left and right. It may also fail to model specific instances of cause and effect: for example, a person might take a bite out of a cookie, yet the cookie may show no bite mark afterward.
Addressing the safety and ethical implications of SORA, OpenAI acknowledges that the model could be used to create misleading or harmful content. To mitigate these concerns, the company is working with red teamers, experts in areas such as misinformation, hateful content, and bias, who will test the model adversarially. OpenAI is also developing tools to detect videos generated by SORA, including a detection classifier and a metadata standard. It is reusing safety methods built for its DALL·E 3 products, such as a text classifier that rejects prompts violating its usage policies and image classifiers that review the frames of generated videos.
OpenAI emphasizes its commitment to transparency by sharing early research progress, seeking feedback from people outside the company, and giving the public a sense of upcoming AI capabilities. It is engaging with policymakers, educators, and artists worldwide to understand their concerns and identify positive use cases for the new technology. Although SORA is currently available only to a limited group of researchers and video creators, OpenAI plans to expand access in the future.