Yesterday, OpenAI CEO Sam Altman announced Sora, the company’s new AI video generator. Like DALL-E and ChatGPT before it, Sora accepts natural language prompts from the user, interprets the request, and delivers results as advertised. Only instead of generating a text response or an image, Sora generates full, realistic video, better than anything I’ve ever seen from an AI program. I don’t mean that as a compliment.
Sora’s initial impression: terror
OpenAI has posted a series of videos on Sora’s landing page showing off what it can do, and they are stunning – in the worst sense. Sora can generate animated content such as “a short fluffy monster kneeling beside a melting red candle” or “an animated kangaroo disco dancing”. While the results aren’t up to the quality of, say, Pixar or DreamWorks, they largely look professional (and some definitely look better than others). I doubt many people would guess at first glance that no humans were involved in making them.
But while the potential for animation is disturbing enough, the realistic videos are downright terrifying. OpenAI showed off “drone footage” of a historic church on the Amalfi Coast, a parade of people celebrating Chinese Lunar New Year, and a tracking shot of a snowy street in Tokyo, and I promise you would assume these videos were real on first watch. I mean, some of them still don’t feel AI-generated to me, and I know they are.
Even the videos with telltale AI flaws, such as assets warping and shifting, could be mistaken for video compression artifacts. There’s a video of puppies playing in the snow, and while you’ll notice some hiccups once you know it isn’t real, the physics and image quality sell the illusion. Why aren’t any of these puppies real? They clearly love snow. God, are we already living in the Matrix?
How does Sora work?
While we don’t have all the details, OpenAI describes Sora’s core processes in its technical report. First, Sora is a diffusion model. Like AI image generators, Sora creates a video by starting with static noise and removing it, step by step, until it resembles what you asked for.
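To give a rough sense of what that denoising loop looks like in code, here’s a minimal, hypothetical Python sketch. The function names, step count, and the toy “denoiser” are my own illustrative assumptions, not anything from OpenAI’s report; the real model uses a trained neural network guided by your prompt, operating on compressed video representations rather than raw pixel arrays.

```python
import numpy as np

def generate_video_by_diffusion(denoise_step, shape=(16, 64, 64, 3), num_steps=50):
    """Toy diffusion-style sampling loop (illustrative only, not OpenAI's code)."""
    x = np.random.randn(*shape)            # start from pure static noise
    for step in reversed(range(num_steps)):
        x = denoise_step(x, step)          # repeatedly strip away a bit of noise
    return x

# Dummy stand-in for a trained denoiser so the sketch runs end to end.
# A real diffusion model would predict the noise to remove at each step,
# conditioned on the text prompt, rather than simply shrinking the values.
def toy_denoise_step(x, step):
    return x * 0.9

video = generate_video_by_diffusion(toy_denoise_step)
print(video.shape)  # (16, 64, 64, 3): frames, height, width, channels
```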
Sora trains on units of data called patches: these are created by compressing images and videos into a “lower-dimensional latent space” and then breaking that down into “spacetime” patches, which are the units the model actually understands. These patches contain information about both where and when things happen in a given video. Sora then generates videos in this latent space, and a decoder maps them back to “pixel” space, producing the final result.
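As a rough illustration of what a “spacetime patch” could look like, the sketch below chops a video array into blocks that each span a couple of frames and a small pixel region. The patch sizes are arbitrary assumptions for this example, and real Sora patches are cut from the learned latent representation produced by the compression step, not from raw pixels as done here.

```python
import numpy as np

def to_spacetime_patches(video, t=2, h=8, w=8):
    """Chop a (frames, height, width, channels) array into spacetime patches.

    Each patch spans `t` frames and an `h` x `w` pixel region, so it carries
    both spatial and temporal information. (This is a simplification of the
    patches described in OpenAI's technical report.)
    """
    frames, height, width, channels = video.shape
    patches = []
    for f in range(0, frames - t + 1, t):
        for y in range(0, height - h + 1, h):
            for x in range(0, width - w + 1, w):
                patches.append(video[f:f + t, y:y + h, x:x + w, :].reshape(-1))
    return np.stack(patches)

patches = to_spacetime_patches(np.random.randn(16, 64, 64, 3))
print(patches.shape)  # (512, 384): 512 patches of 2*8*8*3 values each
```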
However, the company does not say where this video and image data comes from. (Interesting.) It does say that Sora builds on research done for the DALL-E and GPT models, using the same recaptioning technique as DALL-E 3 to train the model on highly descriptive captions.
What else can Sora do?
While it can of course generate videos from standard text prompts, OpenAI claims Sora can also generate video from still images. Apple researchers are working on a similar process with their Keyframer program.
It can also extend an existing video forward or backward in time. OpenAI showed an example of this with a video of a San Francisco streetcar, adding roughly 15 seconds of extra footage to the start of the clip in three different ways. Each version starts out differently, but all three converge on the same original clip by the end. The same technique can also be used to create “perfect loops”.
OpenAI believes Sora is ideal for simulating worlds. (Awesome!) It can create videos with consistent 3D elements, so people and objects stay in place and interact as they should. Sora doesn’t lose track of people and objects when they leave the frame, and it can remember actions that leave a lasting mark on the “world”, like someone painting strokes on a canvas. It can also, um, generate Minecraft on the fly, simulating the player while generating the world around them.
Sora is not perfect
To its credit, OpenAI acknowledges Sora’s current weaknesses and limitations. According to the company, the model may struggle to accurately reproduce the physics of a “complex scene”, as well as certain cause-and-effect situations. OpenAI gives the example of a video of a person eating a cookie, where the cookie shows no bite mark when you see it afterward. Apparently glass shattering is also a problem to render.
The company also says that Sora may confuse the “spatial details” of a prompt (for example, mixing up left and right) and may not be able to accurately render events that unfold over time.
Some of these limitations can be seen in the videos OpenAI presents as examples of Sora’s “mistakes”. In response to a prompt asking for a person running, Sora generates a man running the wrong way on a treadmill; when the prompt asks for archaeologists discovering a plastic chair in the desert, the “archaeologists” pull a sheet out of the sand and the chair essentially materializes out of nowhere. (That one is especially unsettling to watch.)
The future is not now, but very soon
If you read Sora’s intro page, you might have a mini panic attack. But keep in mind: with the exception of the videos OpenAI highlights as mistakes, these are the best videos Sora can currently produce, chosen to show off its capabilities.
Altman himself took to Twitter after the announcement and asked users to reply with prompts for Sora to render. He tweeted the results for about eight of the submissions, and I doubt any of them would have made it onto the announcement page. The first attempt, “a half duck, half dragon flies through a beautiful sunset with a hamster dressed in adventure gear on its back”, was laughably bad and looked like something out of a straight-to-DVD sequel to an early-2000s cartoon.
The final result, on the other hand, for the “two golden retrievers recording a podcast on top of a mountain” prompt was deceiving: it looks like someone gathered footage of all the assets and hastily layered them on top of each other. It doesn’t look “real” so much as Photoshopped, which again raises the question: what exactly is Sora trained on?
These brief demonstrations made me feel a little better, but only a little. I don’t think Sora is at the point where it can generate, on a whim, realistic videos that are indistinguishable from reality. There were probably thousands of results that OpenAI sifted through before settling on the highlights we see in its announcement.
But that doesn’t mean Sora isn’t terrifying. It won’t take much more time or research to improve it. I mean, this is where AI video generation was just 10 months ago. I wonder what Sora would spit out if given the same prompt.
OpenAI insists it’s taking the appropriate precautions here: it’s currently working with red teamers on harm-reduction research, and it plans to watermark Sora-generated content, as with its other AI programs, so you can always tell when something was generated using OpenAI technology.
But I mean, come ON: some of these videos are too good. Leave aside the ones that are deceiving at first glance but look fake in hindsight; some of these videos are hard to believe aren’t real. If these things can impress those of us who stare at AI-generated content for a living, how is the average social media user supposed to know that the realistic video in their Facebook feed was made by robots?
Not to get too dark here, but this year elections are being held in more than 50 countries, and in the United States, artificial intelligence has already been used in an attempt at voter fraud, and that was just audio. You’re really going to have to max out your bullshit detector this year, because I imagine we’ll see some of the most convincing multimedia hoaxes and disinformation campaigns in history.
People, you’d better hope these watermarks actually work. This is going to be a wild ride.