Cloudflare has accused AI giant Perplexity of scraping websites that explicitly disallowed crawling via robots.txt and other network-level rules, and of hiding its identity to do so.

Researchers from the company said they observed Perplexity using multiple user agents, including one impersonating Google Chrome on macOS, as well as rotating IP addresses and ASNs to evade detection.

Cloudflare said it detected millions of daily requests across tens of thousands of domains, which points to the scale of the alleged scraping by one of the biggest companies in the space.

Perplexity is scraping sites it shouldn't be

According to Cloudflare's analysis, in many cases Perplexity either ignored robots.txt files or did not fetch them at all. These are plain-text files placed at the root of a site to tell automated agents (such as search engines, AI crawlers and link checkers) which URLs may or may not be fetched.
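As a rough illustration of how a well-behaved crawler is expected to consult robots.txt before fetching a page, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the bot name ExampleBot, the example.com URLs and the browser-like user agent string are all hypothetical and are not taken from Cloudflare's report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely and
# keeps every other agent out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

# Illustrative browser-like user agent string (an assumption, not
# the string described in Cloudflare's post).
BROWSER_LIKE_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The declared bot is told to stay away from the whole site.
declared_allowed = may_fetch(parser, "ExampleBot", "https://example.com/article")

# A request presenting a generic browser identity only matches the "*" rules.
disguised_allowed = may_fetch(parser, BROWSER_LIKE_UA, "https://example.com/article")

print(declared_allowed, disguised_allowed)
```

The sketch also shows why the rules depend on honest identification: robots.txt is matched against the user agent a client declares, so the same request is refused under the bot's own name but permitted under a generic browser identity. Compliance is voluntary, which is why site owners pair it with network-level blocks.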

Tellingly, Perplexity also attempted to access test websites Cloudflare created, even though they were blocked via robots.txt and not publicly discoverable, while using undeclared crawlers that weren't even associated with its official IP range.

"Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences," the researchers write.

In response to its findings, Cloudflare has de-listed Perplexity's bots from its verified bots list. The company has also added new managed rule heuristics to detect and block stealth crawling.

In contrast, OpenAI's crawlers have so far respected robots.txt and block pages, using transparent identifiers and documented behavior to obtain information.

Perplexity denied wrongdoing, calling Cloudflare's post a "sales pitch" and adding that the identified bots were not its own. TechRadar Pro has asked Perplexity for comment.

Cloudflare urges bot operators to respect website preferences by being transparent, being well-behaved netizens, serving a clear purpose, using separate bots for separate activities and following rules and signals like robots.txt.