Microsoft researchers published a paper this week on VASA-1, a new AI tool that can generate a convincing video of a person speaking using only a still image and an audio clip. Microsoft has no immediate plans to make the tool publicly available, but it’s impressive. Well, it’s impressive if you don’t look too closely at the teeth. Just look at those teeth.
The VASA-1 model works by taking any photo of a human face – or, as in the examples published by Microsoft, an AI-generated face of a person who doesn’t actually exist – and pairing it with an audio file to create a synchronized video with facial nuances and natural-looking movement.
Again, the results are impressive, as you can see in one of the videos provided by Microsoft below. But the one area where VASA-1 seems to struggle is rendering teeth. If you focus on the teeth, they take on a cartoonish quality, looking slightly animated in a way that doesn’t quite match the hyper-realism of everything else.
The video’s odd teeth become even more apparent when you slow the whole thing down, as Gizmodo did in the GIF below. (You might feel bad judging someone’s appearance until you remember that the person below literally doesn’t exist.)
Another example video provided by Microsoft, which appears below, shows the same cartoon-like quality in the teeth – even though the other features look very realistic, especially when you remember that the only source material is a still image and an audio file.
For some reason, the teeth in the videos of men were slightly less noticeable, perhaps because the model didn’t show the men opening their mouths as wide when speaking. But anyone who looks closely can still sense that something is off.
One of the more fascinating things the researchers noted is that their model can generate relatively high-quality video very quickly, something other AI generators, like OpenAI’s Sora, have apparently struggled with. The paper reports a latency of just 0.17 seconds on a desktop computer with a single NVIDIA RTX 4090 graphics card.
This speed enables instant delivery of videos for various applications, such as real-time translation services.
“Our method not only provides high-quality video with realistic facial and head dynamics, but also enables online generation of 512 x 512 resolution videos at up to 40 frames per second with negligible startup delay. It paves the way for real-time interactions with realistic avatars that mimic human conversational behavior,” the new paper says.
The researchers are clearly aware of the dangers of this type of technology, which perhaps explains why Microsoft hasn’t announced plans to make it publicly available anytime soon. But the researchers have also identified use cases they believe will benefit humanity.
“The benefits — such as increasing educational equity, improving accessibility for people with communication difficulties, providing companionship or therapeutic support to those in need, and more — underscore the importance of our research and other related pursuits. We are committed to the responsible development of artificial intelligence with the goal of improving human well-being,” the researchers write in the paper.
“Given this context, we do not plan to provide an online demo, API, product, additional implementation details, or any related offerings until we are confident that the technology will be used responsibly and in compliance with applicable regulations.”
That’s probably a good idea, considering the number of scams this type of technology makes possible. After all, the 2024 US presidential election is only seven months away, and the threat of fascism around the world isn’t going away anytime soon. Humanity already feels powerless against AI-generated fakes these days. Huge companies like Microsoft should probably do everything they can to limit the potential damage before virtually everything on the internet becomes a fake.