Copyright watchdog takes down Dutch language AI training dataset

Occurred: August 2024

A large dataset of copyrighted books and news articles was taken offline following action by the Dutch copyright enforcement group BREIN.

The dataset, which remains unnamed, contained material collected without permission from tens of thousands of Dutch-language books, news sites, and subtitles from numerous films and TV series, and was being offered for use in training AI models, notably large language models.

It is unclear how widely the dataset may already have been used by AI companies. BREIN director Bastiaan van Ramshorst said the group was trying to act preemptively to avoid future lawsuits.

The incident raised questions about the legality and ethics of using copyrighted material for AI training without permission.

The European Union's AI Act requires providers of general-purpose AI models to publish a summary of the content used to train their models.