YouTube Subtitles

YouTube Subtitles is a dataset that comprises subtitles from 173,536 YouTube videos taken from more than 48,000 channels, including Khan Academy, MIT, Harvard University, the Wall Street Journal, NPR, the BBC, PewDiePie and MrBeast.

The subtitles are often presented alongside translations into languages such as Japanese, German and Arabic.

Released in 2020 as part of The Pile dataset, YouTube Subtitles has been used by multiple technology companies, including Anthropic, Nvidia, Apple, Bloomberg and Salesforce.

Dataset

Transparency and accountability

Risks and harms

The YouTube Subtitles dataset has been criticised for transcribing the output of video creators without their explicit permission, thereby potentially violating their copyright.

Page info
Type: Data
Published: June 2024
Last updated: October 2024