Perplexity accused of breaking a major online AI scraping rule - but it says it has done nothing wrong


  • Perplexity seen ignoring signals like robots.txt to scrape websites
  • It even found protected and hidden test sites from Cloudflare
  • OpenAI adheres to responsible crawling, while Perplexity denies wrongdoing

Cloudflare has accused AI giant Perplexity of hiding its identity and using obfuscated crawling activity to scrape websites that had explicitly disallowed crawling via robots.txt and network-level rules.

Researchers from the company said they observed Perplexity using multiple user agents, including one impersonating Google Chrome on macOS, as well as rotating IP addresses and ASNs to evade detection.

Alarmingly, Cloudflare detected millions of daily requests across tens of thousands of domains, highlighting the sheer scale of the alleged scraping by one of the biggest companies in the space.

Perplexity is scraping sites it shouldn't be

According to Cloudflare's analysis, in many cases, Perplexity ignored or didn't fetch robots.txt files - which are plain-text files placed at the root of a site to tell automated agents (like search engines, AI crawlers and link checkers) which URLs may or may not be fetched.
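As a rough illustration of how a crawler is expected to consult these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the bot name "ExampleAIBot" and the example.com URLs are all hypothetical and are not taken from Cloudflare's tests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot everywhere,
# and blocks every other agent from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The named bot is disallowed everywhere on the site.
print(is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/articles/1"))  # False
# Other agents fall back to the wildcard rules.
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/articles/1"))      # True
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/data"))    # False
```

Note that robots.txt is purely advisory: nothing in the file stops a crawler that never fetches it or that ignores the answer, which is the behavior Cloudflare says it observed.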

Tellingly, Perplexity also attempted to access test websites Cloudflare created, even though they were blocked via robots.txt and not publicly discoverable, while using undeclared crawlers that weren't even associated with its official IP range.
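To show what "associated with its official IP range" can mean in practice, the sketch below checks whether a request's source address falls inside the ranges a bot operator has published. It is an illustration only and not Cloudflare's detection method; the bot name, the DECLARED_RANGES table and the addresses (taken from ranges reserved for documentation) are assumptions.

```python
import ipaddress

# Hypothetical published ranges for a declared crawler.
# These CIDR blocks are reserved for documentation, not real bot ranges.
DECLARED_RANGES = {
    "ExampleAIBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def matches_declared_range(bot_name: str, source_ip: str) -> bool:
    """Return True if source_ip lies in a range published for bot_name."""
    addr = ipaddress.ip_address(source_ip)
    for cidr in DECLARED_RANGES.get(bot_name, []):
        network = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if network.version == addr.version and addr in network:
            return True
    return False


print(matches_declared_range("ExampleAIBot", "192.0.2.15"))    # True
print(matches_declared_range("ExampleAIBot", "198.51.100.7"))  # False
```

A check like this can only confirm or reject traffic that claims to be a declared bot. Requests that present a generic browser user agent from unrelated addresses need other signals to attribute.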

"Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences," the researchers write.

In response to its findings, Cloudflare has de-listed Perplexity's bots from its verified bots list. The company has also added new managed rule heuristics to detect and block stealth crawling.

In contrast, OpenAI's crawlers have so far respected robots.txt and block pages, using transparent identifiers and documented behavior to obtain information.

Perplexity denied wrongdoing, calling Cloudflare's post a "sales pitch" and adding that the identified bots weren't even its own. TechRadar Pro has asked Perplexity for comment.

Cloudflare urges bot operators to respect website preferences by being transparent, being well-behaved netizens, serving a clear purpose, using separate bots for separate activities, and following rules and signals like robots.txt.


With several years’ experience freelancing in tech and automotive circles, Craig’s specific interests lie in technology that is designed to better our lives, including AI and ML, productivity aids, and smart fitness. He is also passionate about cars and the decarbonisation of personal transportation. As an avid bargain-hunter, you can be sure that any deal Craig finds is top value!
