The more we learn about how AI is built, the more reports emerge of companies using copyrighted content to train AI without permission.
NVIDIA has been accused of scraping videos from YouTube, Netflix, and other sources to train commercial AI projects. 404 Media reports that the company used the downloaded videos to train AI models for products like its Omniverse 3D world generator and "digital human" efforts such as the GR00T embodied AI project.
Contacted by email, NVIDIA told Tom's Guide that it “respects the rights of all content creators” while asserting that its research efforts are “in full compliance with the letter and spirit of copyright law.”
"Copyright protects certain expressions, but not facts, ideas, data or information," the statement reads. "Anyone is free to learn facts, ideas, data or information from another source and use them to express themselves."
The company also argued that training AI models on existing content is a transformative fair use.
Netflix declined to comment, but YouTube disagrees with NVIDIA's assessment. Jack Malon, YouTube's head of policy communications, referred us to comments CEO Neal Mohan made to Bloomberg in April, saying that “our previous comments still stand.”
At the time, Mohan was responding to reports that OpenAI was training its Sora AI video generator on YouTube videos without permission. He said, "It doesn't allow for things like transcripts or video clips to be downloaded, and that's a clear violation of our terms of service. Those are the rules of conduct for content on our platform."
This isn't the first time this summer that NVIDIA has been accused of scraping YouTube. Several major companies, including Apple and Anthropic, have reportedly been mining a massive dataset called the Pile, which contains subtitles from thousands of YouTube videos, including those of popular creators like Marques Brownlee and PewDiePie.
Ethical concerns raised…and dismissed
404 Media reports that employees who raised ethical or legal concerns were told by their managers that the practice had the green light from "the highest levels of the company."
“This is a management decision,” responded Ming-Yu Liu, vice president of research at NVIDIA. “We have general approval for all the data.”
Some executives reportedly brushed the concerns aside, saying that downloading the videos was an open legal issue the company would deal with later.
YouTube and Netflix videos aren't the only data NVIDIA reportedly scraped. The company also reportedly pulled content from the MovieNet movie trailer database, libraries of video game footage, and the WebVid video dataset hosted on GitHub.
What counts as fair game?
Some of the videos NVIDIA used reportedly came from a massive library of YouTube videos whose usage license restricts them to academic research only. NVIDIA nevertheless reportedly treated the academic collection as a legitimate source for its commercial AI products.
Alphabet, YouTube's parent company, isn't immune to criticism that it's using the internet to build AI models. Last summer, Google updated its privacy policy to state that it uses "publicly available information to help train Google's AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities."
It's safe to assume that anything posted on Google platforms like YouTube is considered fair game, as is anything posted on the internet in general.
At the time, a Google spokesperson told Tom's Guide: “Our privacy policy has long stated that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply clarifies that newer services like Bard are also included. We integrate privacy principles and safeguards into the development of our AI technologies, consistent with our AI Principles.”
In other words, anything posted publicly at any time could end up fueling Google's own AI ambitions.
The full 404 Media report has much more detail and is worth reading.