Perplexity AI ignores requests not to scrape websites

Occurred: June 2024


AI start-up Perplexity came under sharp scrutiny for apparently ignoring website operators' requests not to scrape their content, calling into question the company's business model and ethics.

The self-styled AI 'answer engine' was found to be bypassing robots.txt files, the standard mechanism by which a website tells web crawlers which of its content they may and may not access. Publishers including WIRED and MacStories found that Perplexity continued to access and use their content despite being explicitly blocked in their robots.txt files.
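For readers unfamiliar with the mechanism, the following is a minimal sketch of how a compliant crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The bot name `ExampleBot`, the `example.com` URLs, and the rules themselves are hypothetical and are not taken from any publisher's actual file. The sketch shows only the voluntary check that a well-behaved crawler performs; robots.txt does not technically prevent access, which is why ignoring it is possible.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the bot name and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A bot that is blocked site-wide should skip the page entirely.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/article"))
    # Other bots fall back to the wildcard rules.
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/page"))
```

A crawler that respects these rules would call such a check with its own declared user agent before each request and skip any URL for which it returns `False`.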

The company was reportedly using headless browsers to scrape content, ignoring robots.txt, and not sending its user agent string. Perplexity's CEO, Aravind Srinivas, admitted to Fast Company that the company uses third-party web crawlers on top of its own, and that the bot WIRED identified was one of them.

The finding raised concerns about intellectual property rights, publisher earnings, and the need for clearer guidelines or more regulation on the use of web data. It also raised questions about Srinivas' ethics, as well as those of the company's investors, which include Jeff Bezos and SoftBank.

Perplexity is not the only company seemingly flouting robots.txt signals. OpenAI and Anthropic, the respective creators of the ChatGPT and Claude chatbots, have also been bypassing robots.txt signals, according to Business Insider. Both companies had previously said that they respect "do not crawl" instructions that websites put in their robots.txt files.

➕ June 2024. Amazon Web Services said it was investigating Perplexity over its data scraping practices, and confirmed that all AWS clients must follow robots.txt file instructions.

System 🤖

Operator:
Developer: Perplexity AI
Country: USA
Sector: Media/entertainment/sports/arts; Politics
Purpose: Generate information
Technology: Chatbot; Machine learning; NLP/text analysis
Issue: Cheating/plagiarism; Copyright; Ethics/values; Transparency