YouTube Subtitles dataset
YouTube Subtitles is a dataset that comprises subtitles from 173,536 YouTube videos taken from more than 48,000 channels, including Khan Academy, MIT, Harvard University, the Wall Street Journal, NPR, the BBC, PewDiePie, and MrBeast.
The subtitles are often presented alongside translations into languages such as Japanese, German, and Arabic.
Released in 2020 as part of The Pile dataset, YouTube Subtitles has been used by multiple technology companies, including Anthropic, Nvidia, Apple, Bloomberg, and Salesforce.
Dataset 🤖
YouTube Subtitles
YouTube Subtitles dataset (GitHub)
Dataset info 🔢
Operator: Anthropic; Apple; Bloomberg; Databricks; Nvidia; Salesforce
Developer: EleutherAI
Country: USA
Sector: Media/entertainment/sports/arts
Purpose: Train AI models
Technology: Database/dataset
Issue: Cheating/plagiarism; Copyright