CIOInsights - Insights From Technology Leaders

Apple, Anthropic, and other companies used YouTube videos to train AI

By CIOInsights

More than 170,000 YouTube videos are part of a massive dataset used to train AI systems for some of the biggest technology companies, according to an investigation by Proof News that was co-published with Wired.

Apple, Anthropic, Nvidia, and Salesforce are among the tech firms that used the “YouTube Subtitles” dataset, which was ripped from the video platform without permission.

The training dataset is a collection of subtitles taken from YouTube videos belonging to more than 48,000 channels. It does not include imagery from the videos.

Videos from popular creators like MrBeast and Marques Brownlee appear in the dataset, as do clips from news outlets like ABC News, the BBC, and The New York Times. More than 100 videos from The Verge appear in the dataset, along with many other videos from Vox.

AI companies are rarely transparent about the data that goes into their AI systems, and how YouTube content specifically is being used has been a key question in recent months.