OpenAI scrapes YouTube to train GPT-4

Occurred: April 2024

OpenAI quietly scraped and transcribed over one million hours of YouTube videos to train its GPT-4 large language model, raising questions about copyright and ethics. 

In an effort to source additional training data, OpenAI created Whisper, a speech recognition system, and used it to transcribe over a million hours of YouTube videos. The transcriptions were included in the training data for GPT-4. Experts have demonstrated that the performance of large language models improves with increased amounts of training data.

The practice appears to go against Google’s terms, which restrict automated access to YouTube videos and the use of its videos for independent applications outside of YouTube.

Google is thought likely to have known about OpenAI’s activities, but appears not to have acted, possibly because it has itself been accused of using YouTube videos to train its own AI models.

YouTube creators are thought to face a variety of actual and potential harms as a result of OpenAI’s actions, including copyright violations, financial loss, and privacy abuse. 

Incident databank

Operator: OpenAI

Developer: OpenAI

Country: Global

Sector: Media/entertainment/sports/arts

Purpose: Generate text

Technology: Chatbot; NLP/text analysis; Neural network; Deep learning; Machine learning; Reinforcement learning

Issue: Copyright; Ethics/values

Transparency: Governance