During a recent appearance on the Possible podcast, Google DeepMind CEO Demis Hassabis revealed plans to eventually combine the company’s Gemini AI models with its Veo video generator, with the aim of teaching the AI more about the physical world.
Hassabis explained that the strategy aligns with Google’s vision for a “universal digital assistant” capable of aiding users in real-world scenarios. “We’ve always built Gemini, our foundation model, to be multimodal from the beginning,” he said on the podcast, which is co-hosted by Reid Hoffman.
This move reflects a broader industry shift towards versatile “omni” models. Google’s latest Gemini versions already handle audio, image, and text generation, while rivals like OpenAI enable image creation in ChatGPT, and Amazon intends to launch an “any-to-any” model.
Developing these comprehensive models demands vast datasets spanning video, images, audio, and text. Hassabis hinted that the video data fueling Veo largely originates from YouTube, a Google-owned platform.
He elaborated that by processing extensive YouTube content, Veo learns about real-world physics. “[Veo 2] can figure out, you know, the physics of the world,” Hassabis commented regarding the model watching “a lot of YouTube videos.”
Google previously acknowledged to TechCrunch that its models “may be” trained on “some” YouTube content, in accordance with its agreements with creators. Reports suggest Google updated its terms of service last year, potentially expanding the data it can draw on for AI training.