How to crawl a site without getting blocked? 11 tips for Web Scraping

Scrapers. Spiders. Crawlers. Whatever you call them, they're extremely versatile tools for collecting website data quickly and efficiently.

Google and other search engines crawl websites to find suitable ones to display in search results. E-commerce platforms constantly scrape the web for vital business data. Vendors can also find information on potential leads and learn how to target audiences better.

Unfortunately, as any website developer knows, scrapers can generate huge amounts of web traffic - particularly if they're not configured correctly. This means many websites take active measures to prevent scraping.

In this guide you'll learn how to work around some of these measures to optimize your web scraping.

1. Respect the robots

The simplest and most effective way to ensure that your scraping tool won't be blocked is to respect the rules. Fortunately most websites make this easy for you via robots.txt - a plain text file which lists information for automated 'bots' like scraping tools.

The file should firstly state whether or not scraping is allowed. Even if it is, make sure that your bot follows the rules laid out in robots.txt, for instance by configuring it to crawl during off-peak hours and limit the number of requests coming from the same IP address.

Many robots.txt files also contain a directive called "crawl-delay", specifying how many seconds you should wait between requests. Adjusting the timing of your scraper tool's requests hugely reduces your chance of being blocked.
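
As a rough illustration of the checks described above, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents, the bot name "example-scraper" and the example.com URLs are all made up for the example. A real crawler would download the file from the target site instead of using a hard-coded string.

```python
from urllib import robotparser

# Hypothetical robots.txt contents; a real crawler would fetch this file
# from the target site rather than hard-coding it.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

USER_AGENT = "example-scraper"  # assumed bot name, for illustration only


def load_rules(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls):
    """Keep only the URLs that robots.txt lets our user agent fetch."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


def polite_delay(parser, default: float = 1.0) -> float:
    """Return the crawl-delay in seconds, or a default if none is set."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


rules = load_rules(SAMPLE_ROBOTS_TXT)
urls = ["https://example.com/products", "https://example.com/private/admin"]
to_crawl = allowed_urls(rules, urls)  # the /private/ URL is filtered out
delay = polite_delay(rules)           # seconds to sleep between requests
```

A crawl loop would then pause for `delay` seconds (for instance with time.sleep) between each request to the URLs in `to_crawl`.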

2. Rotate your IPs

One of the most common ways sites detect scrapers is by analyzing their activity, such as the delay between requests outlined above.

If any unusual behavior is detected, it's likely the site will automatically block the IP address from which the scraper seems to be connecting.

You can get around this problem using a proxy server. A proxy acts as a gateway that sits between your scraper bot and the website in question, masking the bot's true IP. A good proxy service has multiple devices and IP addresses, making it easy for the bot to appear to connect from multiple locations.
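
One simple way to spread requests across several proxies is round-robin rotation. The sketch below uses only the Python standard library. The proxy addresses are placeholders from a documentation-only IP range, and the helper names are invented for this example, so swap in whatever endpoints your proxy provider gives you.

```python
import itertools
import urllib.request

# Placeholder proxy addresses (documentation-only IP range); replace them
# with the endpoints supplied by your proxy provider.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy() -> str:
    """Return the next proxy in round-robin order."""
    return next(_proxy_cycle)


def opener_for(proxy: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch `url` through the next proxy in the pool."""
    opener = opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to `fetch` goes out through a different proxy until the pool wraps around. Many commercial proxy services handle rotation for you, in which case a single gateway address may be all you need.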

When it comes to data collection, the two main types of proxies are ISP proxies and datacenter proxies. The best (if more expensive) kind are ISP proxies (sometimes known as 'residential proxies') as they use devices owned by real people, so are much harder for sites to detect.

You can read more about the difference between datacenter and ISP proxies in our online guide.

3. Use CAPTCHA solvers

CAPTCHAs are another popular tool designed to keep bots out. If you've used the internet for any length of time, you will almost certainly have encountered a CAPTCHA challenge when registering with a site or filling in an online form. Usually they involve typing in some distorted text or selecting certain types of images, like fire hydrants, from a selection.

In theory these types of puzzles are easy for humans but hard for robots to solve. In practice you can leap this particular hurdle by using a CAPTCHA solver service. Some solver services actually pay human beings to solve the CAPTCHA challenges on your bot's behalf. Others deploy AI and machine learning to solve the puzzles.

You can learn more about what a CAPTCHA Solver service can do for you with our online guide.

4. Configure a real user agent

When your scraper bot connects to a server hosting a website, it sends an HTTP 'request header', providing essential information. This includes 'User Agent' data which identifies the device's operating system and version, as well as the connecting application, such as a web browser.

This can be helpful to make sure web pages load correctly but since this doesn't matter to a scraper, most don't have a user agent set up. Even if they do, it may not reflect the most up-to-date operating systems or web browsers, which can trigger the server's bot detection algorithms.

To get round this issue, set your 'user agent' string to that of a popular web browser like Chrome or Firefox. If you've no joy doing this, consider using Googlebot's User Agent data. Most websites will allow this as they want to be crawled by Google in order to appear higher in search rankings.

Servers may also find it suspicious if exactly the same User Agent appears to be connecting to the site over and over, so try to find a number of different valid configurations and switch between them regularly.
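
Switching between user agents can be as simple as picking one at random for each request. In the sketch below the user agent strings are illustrative examples, not a maintained list, so refresh them to match current browser releases. The `build_request` helper is a name made up for this example.

```python
import random
import urllib.request

# Illustrative user agent strings; the version numbers are examples only
# and should be kept in line with current browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def build_request(url: str) -> urllib.request.Request:
    """Create a request carrying a randomly chosen user agent."""
    user_agent = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Passing the result to urllib.request.urlopen sends the request with the chosen user agent.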

5. Update your request headers

Some websites go beyond simply checking the 'user agent' data and read other values in the HTTP request header. This includes values like 'Accept-Language', which indicates your client's preferred language and locale.

For an example of a typical HTTP request header, fire up your regular web browser and head here. You'll see here all the information your browser sends to make sure sites are optimized for viewing on your particular device.

As with the previously mentioned 'user agent' data though, many scrapers don't bother setting values like the preferred language. This means their HTTP request headers don't resemble a typical web browser, making them more likely to run afoul of anti-scraper programs.

Luckily, it's fairly simple to configure scraper bots to use a "real" looking HTTP request header. As with 'user agent' data, make sure to keep changing header settings regularly to avoid too many requests appearing to be from the same device.

While you're here, it's also wise to add a referrer to the HTTP header (note that the field name is spelled 'Referer'), e.g.:

"Referer": "https://www.google.com/"

If you do choose Google, make sure to change the TLD to the location of the site you're scraping - for instance in the case of a German site use https://www.google.de.

This means each time your scraper bot connects to the site, it will appear to have arrived from Google search results. This increases the chances of your bot traffic being seen as valid, given that the server would expect most referrals to be from Google.
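
Putting the header advice together, a helper along the following lines could build a browser-like set of headers and pick a Google referrer to match the target site's TLD. This is only a sketch. The `LOCALES` table, the 'Accept' value and the function name are assumptions made for illustration, and a real table would need an entry for every country you scrape.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from a site's country-code TLD to a matching Google
# domain and Accept-Language value; extend it for the sites you target.
LOCALES = {
    "de": ("https://www.google.de/", "de-DE,de;q=0.9,en;q=0.5"),
    "fr": ("https://www.google.fr/", "fr-FR,fr;q=0.9,en;q=0.5"),
}
DEFAULT_LOCALE = ("https://www.google.com/", "en-US,en;q=0.9")


def browser_like_headers(target_url: str, user_agent: str) -> dict:
    """Build request headers that resemble those sent by a web browser."""
    host = urlsplit(target_url).hostname or ""
    tld = host.rsplit(".", 1)[-1]  # e.g. "de" for shop.example.de
    referer, language = LOCALES.get(tld, DEFAULT_LOCALE)
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": language,
        "Referer": referer,
    }
```

For a German site such as the made-up https://shop.example.de/, the helper returns a 'Referer' of https://www.google.de/ and a German 'Accept-Language' value. Any TLD missing from the table falls back to google.com and English.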

6. Avoid the honeypots

Although it's rare, some cunning webmasters lay deliberate traps to detect bots. This is done by inserting links into web pages that only bots can see.

There are two common methods to do this:

If your bot does follow hidden links like these, the site will be able to detect and block it almost immediately.

Your best defense against these types of deliberate traps is to configure your bot to scan links for properties like those described above before following them. More sophisticated scraping tools may already have this feature built in.
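
As an example of this kind of check, the sketch below collects links from a page but skips any that appear to be hidden from human visitors. It assumes the traps are hidden with an inline style (display:none or visibility:hidden) or the HTML 'hidden' attribute. That is an assumption for illustration only, as links hidden via external stylesheets or colour tricks would need a more thorough check.

```python
from html.parser import HTMLParser

# Inline-style fragments assumed to mark a link as invisible to humans.
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect href values from <a> tags that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Normalise the inline style so "display: none" matches too.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        if "hidden" in attributes or any(m in style for m in HIDDEN_MARKERS):
            return  # likely a honeypot, so do not follow it
        self.links.append(href)


def visible_links(html: str) -> list:
    """Return the links in `html` that a human visitor could plausibly see."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

Your bot would then follow only the links returned by `visible_links` rather than every anchor on the page.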

7. Wipe your fingerprints

Even if you're regularly rotating your IP address via a proxy and using plausible-looking HTTP request headers, some sites can still identify specific devices through TCP/IP (Transmission Control Protocol/Internet Protocol) fingerprinting.

This relies on the fact that different operating systems (even differing versions of the same OS) implement the various configurations of TCP slightly differently, such as in the size of data packets, number of open ports, the services that run on them and so on. All this information can be combined to form a unique 'fingerprint' to identify devices.

The best way to protect against this is to limit the type of traffic to which the device running your scraping tool responds. If you're using a third-party scraping service, they may have anti-fingerprinting features already built in. Check with the app developer to confirm.

8. Use a headless browser

Some websites do more than just check the IP address and HTTP header used by your bot. Instead they'll check other browser data like installed extensions, fonts used and cookies to determine if a real person is connecting.

It's impossible to know all the criteria a site will set before deciding if a browser's being operated by a real human being.

Luckily there's a simple workaround through using a headless browser. These function in exactly the same way as a normal web browser but have no GUI (graphical user interface). Headless mode is supported in all Firefox and Chromium-based browsers like Google Chrome.

Using a headless browser does more than help resist being detected as a bot. While it's usually very difficult to scrape data being loaded via JavaScript (see below), using a browser in this mode allows you to do this.

Experienced programmers can use a tool suite like Selenium to implement headless browsers. If you're using a third-party data scraping service they may support this feature out of the box. Speak to the app developer to check if this is the case.
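
For a flavour of what this looks like with Selenium, here is a minimal sketch that drives headless Chrome from Python. It assumes the selenium package and a recent version of Chrome are installed. The helper names and the window size are choices made for this example rather than anything Selenium requires.

```python
def headless_chrome_arguments(window_size=(1920, 1080)):
    """Command-line switches for running Chrome without a visible window."""
    width, height = window_size
    return ["--headless=new", f"--window-size={width},{height}"]


def fetch_rendered_html(url: str) -> str:
    """Load `url` in headless Chrome and return the page source."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Calling `fetch_rendered_html` with a page's URL returns the HTML as the browser sees it, including content added by the page's scripts by the time the page has loaded.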

9. Avoid JavaScript

Many modern web pages use JavaScript to display content based on user actions, such as when users add an item to a cart or enter text into a search box.

Unless you're using a headless browser (see above), this kind of data is very difficult to gather. You can configure your bot to try but this can cause system slowdown, memory leaks and a host of other issues so you should avoid JavaScript unless absolutely necessary.

Scraper-controlled browsers often contain extra JavaScript environment information that indicates the browser is running without a GUI (headless browsers) or running on uncommon operating systems like Linux.

To prevent data leaking through JavaScript variables, you can configure your bot to use fake values. One common fix is to set 'navigator.webdriver' to 'false', as automated browsers typically report the far more unusual 'true' value.
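
If your bot drives a Chromium-based browser through Selenium, one possible way to apply this fix is to inject a small script before any page code runs. The sketch below assumes a Selenium driver that exposes `execute_cdp_cmd` (as Selenium's Chrome driver does). The function name is made up, and the trick will not defeat every detection method.

```python
# JavaScript injected before any page script runs; it makes
# navigator.webdriver report false instead of true.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => false});"
)


def mask_webdriver_flag(driver) -> None:
    """Register the override on a Chromium-based Selenium driver."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )
```

Call `mask_webdriver_flag(driver)` once after creating the driver and before loading the first page, so the override is in place for every document the browser opens.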

Naturally the values you set, and how effective they are at resisting fingerprinting, will depend on your bot and browser configuration. There are a number of online tools you can deploy, such as 'Headless Cat N Mouse', to check for JavaScript data leaks. Alternatively you can sign up for a scraper service that already has anti-JavaScript detection features built in.

10. Look out for website changes

Websites are regularly being updated and if changes to layout are too drastic, this can cause your scraper to break. This isn't usually an issue for the scraping tool itself as you can just reconfigure it to reflect the new site layout. Still, if a bot is repeatedly trying to access invalid pages, it could be red-flagged and blocked by the site.

Sadly, there's no quick fix for this. You'll need to research target sites before starting scraping to check for unusual layouts. It's also helpful to set up monitoring to make sure your scraper is still working. This is fairly easy to set up, as you can just check the number of successful requests your bot makes per crawl - if for instance your target website has 65 pages, you'd expect the number of successful requests to be 65.
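
A monitoring check of this kind can be very small. The sketch below counts the successful responses from a crawl and compares the total to the number of pages you expected. The function name and the shape of the report are invented for this example, and "successful" is taken here to mean any 2xx status code.

```python
def crawl_health(status_codes, expected_pages: int) -> dict:
    """Compare the number of successful responses with the expected total."""
    successes = sum(1 for code in status_codes if 200 <= code < 300)
    return {
        "successes": successes,
        "expected": expected_pages,
        "healthy": successes == expected_pages,
    }


# A crawl of a 65-page site where five pages came back as 404.
report = crawl_health([200] * 60 + [404] * 5, expected_pages=65)
```

Here `report["healthy"]` is False because only 60 of the 65 expected requests succeeded, which is your cue to pause the bot and investigate.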

You can also perform a unit test for your scraper bot against website pages with a known layout, such as the main product page. If you test on a URL by URL basis, your bot is far less likely to be flagged for making multiple invalid requests. If you detect a change which breaks the entire site, you can then reconfigure your bot before trying to connect again.
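
One way to write such a test is to check that a page still contains the elements your scraper relies on. The sketch below looks for a set of element ids in a page's HTML. The ids 'product-title' and 'product-price' are hypothetical, so substitute whichever ones your own scraper depends on.

```python
from html.parser import HTMLParser

# Hypothetical element ids the scraper depends on for a product page.
REQUIRED_IDS = {"product-title", "product-price"}


class IdCollector(HTMLParser):
    """Record every id attribute found in a page."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:
            self.ids.add(element_id)


def missing_ids(html: str, required=REQUIRED_IDS) -> set:
    """Return the required ids that no longer appear in the page."""
    collector = IdCollector()
    collector.feed(html)
    return set(required) - collector.ids
```

If `missing_ids` returns an empty set for your test page, the layout still matches what the scraper expects. Anything else tells you which elements have moved or been renamed.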

11. Cache out

If a website has resisted all attempts to be scraped, as a last resort you can program your bot to crawl Google's cached version of it.

This is very simple to do, as you only need to add the following to the start of any URL:

"https://webcache.googleusercontent.com/search?q=cache:"

For instance, the URL to access a cached version of the Internet Archive (archive.org) would be:

"https://webcache.googleusercontent.com/search?q=cache:https://www.archive.org"

This is a great workaround but if you load a page in this way you'll see an important caveat: as the page is cached, it can't be used to access real-time information. This is important if you need to scrape data like current sales prices or inventory numbers.
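
In code, building a cache URL is just a matter of adding the prefix to the original address. The helper below is a minimal sketch with a made-up function name. It only builds the URL and makes no promise that a cached copy actually exists for a given page.

```python
CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"


def cached_url(url: str) -> str:
    """Return the address of the cached copy of `url`."""
    return CACHE_PREFIX + url
```

Your bot would then request `cached_url(page)` in place of `page` itself.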

Some sites also don't allow Google to store cached copies, so use this at your discretion.

Nate Drake is a tech journalist specializing in cybersecurity and retro tech. He broke out from his cubicle at Apple 6 years ago and now spends his days sipping Earl Grey tea & writing elegant copy.