AIAAIC - YouTube Subtitles dataset

YouTube Subtitles - dataset

Page published: June 2024 | Last updated: October 2024


YouTube Subtitles is a dataset that comprises subtitles from 173,536 YouTube videos taken from more than 48,000 channels, including Khan Academy, MIT, Harvard University, the Wall Street Journal, NPR, the BBC, PewDiePie and MrBeast.

The subtitles are often presented alongside translations into languages such as Japanese, German and Arabic.

Released in 2020 as part of The Pile dataset, YouTube Subtitles has been used by multiple technology companies, including Anthropic, Nvidia, Apple, Bloomberg and Salesforce.

Dataset 🤖

Released: 2020
Developer: EleutherAI
Purpose: Train AI models
Type: Database/dataset
Technique: Machine learning

Transparency, accountability 🙈

Data sources. AI companies have not been open about the data used to train their AI models, including their use of YouTube Subtitles.
Accountability. EleutherAI declined to discuss allegations that YouTube videos used without permission to create YouTube Subtitles formed part of its Pile dataset.

Risks, harms 🛑

The YouTube Subtitles dataset has been criticised for transcribing the output of video creators without their explicit permission, thereby potentially violating their copyright.

Incidents, issues 🔥

July 2024. Apple, Nvidia and Anthropic used thousands of YouTube videos without permission to train AI models.

Related 🌐

AIAAIC Repository ID: AIAAIC1590
