C4 dataset is built from unsafe, copyright-protected web content