Sponsored by Decodo
How to perform web scraping at scale
Large-scale web scraping is a beast of its own
Scraping a few pages with a couple of popular tools is a straightforward process, but scaling to millions of pages moves beyond writing good code into creating a robust distributed system that can avoid blocks, manage rate and infrastructure limits, ensure data integrity, and not use up all available memory. At some point, it may seem like herding cats, but it’s a manageable task.
Limited offer - 10% discount on all residential proxy plans at Decodo (formerly Smartproxy)
Use code TECHRADAR10 at checkout and save 10% on all residential proxy subscriptions. Get started from only $2/GB and experience top-tier performance with 115M+ ethically sourced IPs from 195+ worldwide locations.
How does web scraping work?
Web scraping is a process that extracts massive amounts of data from websites automatically, with a scraper collecting thousands of data points in a matter of seconds. Depending on the scraper, it grabs the Hypertext Markup Language (HTML) code, the data the site renders from its database, and even JavaScript and Cascading Style Sheets (CSS) elements. The process typically consists of four key steps:
1. The scraper sends an HTTP request (e.g. GET) to the target URL.
2. The web server responds with the raw page source, commonly HTML.
3. The scraper analyzes the Document Object Model (DOM) of this HTML and begins parsing to identify relevant elements (typically using XPath or CSS selectors).
4. The raw data is cleaned, normalized, exported, and saved in the format best suited for that project, typically JSON, CSV, Markdown, Excel spreadsheet, or a specialized database.
Notably, advanced scrapers use headless browsers (like Puppeteer or Playwright) for dynamic sites that load their content with JavaScript.
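As a minimal sketch of those four steps, assuming the requests and BeautifulSoup libraries, a placeholder URL, and illustrative CSS selectors:

import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"              # placeholder target URL
response = requests.get(url, timeout=10)          # step 1: send the HTTP GET request
response.raise_for_status()                       # step 2: the server returns raw HTML

soup = BeautifulSoup(response.text, "html.parser")    # step 3: parse the DOM
rows = [
    {"title": item.select_one("h2").get_text(strip=True),
     "price": item.select_one(".price").get_text(strip=True)}
    for item in soup.select(".product")           # these selectors are assumptions
]

with open("products.csv", "w", newline="") as f:  # step 4: clean/export, here to CSV
    writer = csv.DictWriter(f, fieldnames=["title", "price"])
    writer.writeheader()
    writer.writerows(rows)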
How to web scrape at scale?
Web scraping at scale is a whole other level of scraping: millions of requests sent by thousands of workers around the globe. We’re not talking about a few hundred pages here, but hundreds of thousands or millions of pages scraped in one reliable, constantly running automated process, built on distributed systems instead of single scripts. The focus shifts from code to infrastructure, carefully stacking key layers to catch issues in time and get the most out of the project, prioritizing success rate over speed.
For a few thousand pages, a combination of simple tools, such as a Python script with BeautifulSoup and Playwright, is usually enough. But for large-scale scraping, we’ll need headless browsers for JavaScript rendering, requests distributed across numerous rotating residential proxies, dynamic throttling and rate-limited queues among other controls, and a monitoring system to catch errors early.
Generally, we want to include the following layers (a stripped-down code sketch follows the list):
- request orchestrator: the brain of the process that assigns goals to workers, manages the URL queue, implements scheduling logic, ensures efficiency, etc.;
- proxy: an intermediary layer between the scraper and the target website that masks the IP and distributes request load, bypassing bans and limits;
- rendering: JS-heavy sites need a headless browser (one that runs without a GUI) to execute JavaScript before the content can be extracted;
- parsers: to analyze raw and unstructured data and turn that HTML or JSON into structured records that make sense;
- a data pipeline with built-in validation to ensure accuracy and structure of the data;
- a monitoring system to observe data quality, success and block rates, and response times in real-time, to trigger alerts before issues compound.
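Here is a stripped-down sketch of how the orchestrator and its workers can fit together, with the proxy, rendering, parsing, and monitoring layers reduced to placeholders and httpx used as an example async HTTP client:

import asyncio
import httpx

async def worker(queue, results):
    async with httpx.AsyncClient(timeout=10) as client:
        while True:
            url = await queue.get()                     # the orchestrator hands out URLs
            try:
                resp = await client.get(url)            # fetch layer (proxy/rendering omitted)
                results.append({"url": url, "status": resp.status_code, "html": resp.text})
            except httpx.HTTPError as exc:
                results.append({"url": url, "error": str(exc)})   # feeds the monitoring layer
            finally:
                queue.task_done()

async def main(urls, concurrency=10):
    queue, results = asyncio.Queue(), []
    for url in urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()                                  # wait until every queued URL is processed
    for w in workers:
        w.cancel()
    return results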
Once you go large-scale, you’ll want to turn to your reliable proxies right away, because sending that many requests at once will certainly get you blocked. Focus on rotating IPs, such as Decodo's residential proxies, to bypass rate limits and geo-restrictions.
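Routing requests through a rotating gateway is a one-line change in most HTTP clients; the endpoint, port, and credentials below are placeholders, not any provider's real values:

import requests

# Placeholder credentials and gateway - substitute your provider's actual values.
proxy = "http://USERNAME:PASSWORD@gate.example-proxy.com:7000"
proxies = {"http": proxy, "https": proxy}

resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=15)
print(resp.json())   # each request can exit from a different residential IP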
Turn on the autopilot
A crucial step is to utilize automation and scheduling, the basis of our large-scale scraping systems. We need to make it as robust as humanly possible. One of the best methods is to use Python-based open-source workflow orchestration tools like Apache Airflow, Prefect, and Luigi, which manage, schedule, and monitor complex data pipelines. Airflow, for example, uses DAGs (Directed Acyclic Graphs) to manage workflows, schedule crawls, and retry failed requests.
Project structure example with DAG:
dags/
└── scrape_pipeline.py
scraper/
├── crawler.py
├── extractor.py
└── storage.py
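A hedged sketch of what scrape_pipeline.py could contain, assuming the crawl, extract, and store callables live in the scraper package shown above:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

from scraper.crawler import crawl          # assumed helpers from the layout above
from scraper.extractor import extract
from scraper.storage import store

default_args = {"retries": 3, "retry_delay": timedelta(minutes=5)}   # retry failed tasks

with DAG(
    dag_id="scrape_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                     # schedule_interval on older Airflow versions
    catchup=False,
    default_args=default_args,
) as dag:
    crawl_task = PythonOperator(task_id="crawl", python_callable=crawl)
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    store_task = PythonOperator(task_id="store", python_callable=store)

    crawl_task >> extract_task >> store_task   # crawl, then extract, then store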
Fail, log, recover
At scale, you’re bound to fail somewhere. The goal is to catch the issues before they grow and silently corrupt the data. Monitoring alerts us to issues, while logging each failure provides a clear record of what failed, when, how many times, the patterns, and other issues it led to. This tells us if we should investigate, ignore, or retry. Each request should immediately log the source URL, the time it took to get it, the transform version, the status code, the parser version, and the success or failure of parsing.
After registering increased error rates, scraping systems should be able to recover automatically by reducing concurrency, shifting to a different proxy, or quarantining the domain for further analysis. When a URL fails, we can save it in a database, a CSV, or a queue to reprocess with altered scraper logic, a longer timeout, a different proxy, etc.
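A minimal sketch of that logging-and-requeue idea, assuming a requests.Session and with the logged field names chosen for illustration:

import json
import logging
import time

logging.basicConfig(filename="scrape.log", level=logging.INFO)
retry_queue = []                           # could just as well be a DB table or message queue

def fetch_and_log(session, url, parser_version="v1"):
    start = time.monotonic()
    record = {"url": url, "parser_version": parser_version, "attempted_at": time.time()}
    try:
        resp = session.get(url, timeout=10)
        record.update(status=resp.status_code, elapsed=time.monotonic() - start, parsed_ok=resp.ok)
        if not resp.ok:
            retry_queue.append(url)        # quarantine for a later pass with different settings
        return resp
    except Exception as exc:
        record.update(error=str(exc), elapsed=time.monotonic() - start, parsed_ok=False)
        retry_queue.append(url)
        return None
    finally:
        logging.info(json.dumps(record))   # one structured log line per request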
Cleaning and normalization
Once the parsers have done their work, the data needs to be cleaned, standardized, and made consistent. Cleaning removes noise (e.g. HTML tags, whitespace, duplicates) and normalization standardizes formats.
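For example, a small normalization pass over a scraped product record, with the field names and formats assumed for illustration:

import re
from datetime import datetime

def normalize(record):
    # Strip stray tags and whitespace, then standardize numeric and date formats.
    title = re.sub(r"<[^>]+>", "", record["title"]).strip()      # drop leftover HTML tags
    price = float(re.sub(r"[^\d.]", "", record["price"]))        # "€1,299.00" -> 1299.0
    scraped_at = datetime.fromisoformat(record["scraped_at"]).date().isoformat()
    return {"title": title, "price": price, "scraped_at": scraped_at}

print(normalize({"title": "  <b>Laptop</b> ", "price": "€1,299.00",
                 "scraped_at": "2024-05-01T12:30:00"}))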
Validation
Validation ensures that the extracted data is accurate, complete, and standardized so that it can be stored or analyzed. This is the last step to identify bad data.
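A minimal validation pass over the normalized records, with the required fields and rules chosen purely for illustration:

REQUIRED_FIELDS = {"title": str, "price": float, "scraped_at": str}

def is_valid(record):
    # Reject records with missing fields, wrong types, or obviously bad values.
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            return False
    return record["price"] > 0 and bool(record["title"])

records = [
    {"title": "Laptop", "price": 1299.0, "scraped_at": "2024-05-01"},
    {"title": "", "price": -1.0, "scraped_at": "2024-05-01"},      # fails validation
]
clean = [r for r in records if is_valid(r)]          # keep only records that pass
rejected = [r for r in records if not is_valid(r)]   # route the rest to review or retry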
Store data properly
This is the final, key step. If we don’t properly store these gigabytes or terabytes of structured data we now have, we could lose it all, not to mention that the sheer size of the data could bring everything to a grinding halt. The format depends on the type and purpose of the project. Use a database (PostgreSQL or MySQL) for data with a consistent structure for analysis, running queries, feeding dashboards, etc.; CSV or JSONL are often enough for raw data meant for later analysis; and MongoDB is useful for mixed data.
Additional tips:
- Try splitting data storage into a raw HTML/JSON side and a structured/cleaned side for easy reprocessing;
- Extract and save as you go to unburden your memory and enable easier recovery (see the sketch below).
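A small sketch combining both tips: raw HTML kept on one side, cleaned records appended to a JSONL file as each page is processed; paths and naming are illustrative:

import hashlib
import json
from pathlib import Path

RAW_DIR = Path("data/raw")                 # raw HTML, kept for cheap reprocessing
CLEAN_FILE = Path("data/clean/items.jsonl")
RAW_DIR.mkdir(parents=True, exist_ok=True)
CLEAN_FILE.parent.mkdir(parents=True, exist_ok=True)

def persist(url, html, record):
    # Raw side: one HTML file per URL, named by hash so reruns overwrite cleanly.
    name = hashlib.sha256(url.encode()).hexdigest()[:16]
    (RAW_DIR / f"{name}.html").write_text(html, encoding="utf-8")
    # Structured side: append one JSON line per record, so memory stays flat
    # and a crash loses at most the page currently in flight.
    with CLEAN_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")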
However, if you want to make the process easier, consider one of the five plans offered by the Decodo Web Scraping API. Each includes proxies, 125M+ IPs globally, 99.99% success rate, 200 requests per second and results in HTML, JSON, CSV, XHR or PNG.
IP bans, anti-bot systems, and hitting the wall
A scraper doesn’t behave organically, and websites are designed to catch bot behavior. For example, a production scraper will send thousands of requests per session, unlike any human. Websites protected by behavioral analysis can recognize small details like mouse movement, scrolling timing, and reading behavior. Move too linearly, and the project hits a wall, often a CAPTCHA, a 403 Forbidden, or a 503.
- Do your best to avoid CAPTCHAs, and use solving services only when necessary.
- Make sure to use clean residential IPs.
- Use headless browsers so that the scraper behaves like a real browser.
- Limit the number of concurrent requests and add jitter to mimic human behavior (a sketch follows this list).
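A minimal sketch of that last point, using an asyncio semaphore and randomized delays; the concurrency cap, delay range, and httpx client are illustrative choices:

import asyncio
import random
import httpx

async def polite_get(client, semaphore, url):
    async with semaphore:                                # cap how many requests run at once
        await asyncio.sleep(random.uniform(1.0, 4.0))    # jitter: no fixed request cadence
        return await client.get(url)

async def main(urls, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with httpx.AsyncClient(timeout=15) as client:
        return await asyncio.gather(*(polite_get(client, semaphore, u) for u in urls))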
To handle behavioral evasion, maintain a diverse proxy pool, and prevent blocking, consider Decodo Site Unblocker as a proxy endpoint with fingerprint rotation and JavaScript rendering. With a 100% success rate, the tool helps users avoid CAPTCHAs and IP bans.
Handling failures
To try to prevent failures, wrap your fetch logic in a try/except block. Keep monitoring, log errors, and retry when possible, but not immediately. When faced with rate limits:
- Use exponential backoff with jitter so retries aren’t sent at fixed intervals (see the sketch after this list).
- Set strict rate limiters, such as a token bucket or fixed window, so requests don’t exceed the target site's capacity.
- After several failures, send the URL to manual review.
- Use rotating residential proxies, which move between hundreds of ISP-assigned IPs globally, behaving like legitimate users.
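A sketch of exponential backoff with jitter around a single fetch, with the retry ceiling and base delay as illustrative values:

import random
import time
import requests

def fetch_with_backoff(url, max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code == 429 or resp.status_code >= 500:
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            return resp
        except requests.RequestException:
            # 1s, 2s, 4s, 8s... plus random jitter so retries never align across workers.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            time.sleep(delay)
    return None   # give up: push the URL to the manual review / retry queue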
Datacenter proxies are the most cost-effective, but they’re also the easiest to detect. One of the best strategies is to start with datacenter proxies for lighter tasks, move to residential for the bulk of the work, and switch to mobile only when residential stops working.
| Proxies | Datacenter | Residential | Mobile |
| --- | --- | --- | --- |
| IP source | Data centers | Real household devices (ISPs) | Mobile carriers |
| Trust level | Low | High | Very high |
| Detection risk | High | Low | Very low |
| Block resistance | Weak | Strong | Very strong |
| CAPTCHA frequency | High | Medium-low | Very low |
| Geo-targeting | Limited | Precise | Good |
| Speed | Very fast | Fast | Slower |
| Stability | Very stable | Stable | Can fluctuate |
| Cost | Cheap | Expensive | Very expensive |
| Use cases | Bulk scraping, low-security sites | General web scraping, account management | Highly protected sites |
Infrastructure and architecture limits
A distributed system that handles complex extraction, rate limiting, anti-bot defenses, and geographic restrictions needs to be built on multi-region infrastructure so that requests can be routed through various local gateways.
This modular infrastructure relies on concurrency to move from sequential requests to parallel processing. For a concurrency model, choose multithreading or asynchronous I/O for I/O-bound tasks (e.g. downloading HTML), and multiprocessing for CPU-bound tasks (e.g. parsing large pages, rendering JavaScript).
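A compact sketch of that split, with a thread pool for network-bound downloads and a process pool for CPU-bound parsing; the pool sizes are illustrative:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import requests
from bs4 import BeautifulSoup

def download(url):                         # I/O-bound: mostly waiting on the network
    return requests.get(url, timeout=10).text

def parse(html):                           # CPU-bound: DOM parsing of large pages
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

def run(urls):
    with ThreadPoolExecutor(max_workers=32) as io_pool:
        pages = list(io_pool.map(download, urls))        # many threads share the waiting
    with ProcessPoolExecutor(max_workers=4) as cpu_pool:
        return list(cpu_pool.map(parse, pages))          # parsing spread across CPU cores

if __name__ == "__main__":                 # guard required for multiprocessing on some platforms
    print(run(["https://example.com"]))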
Is web scraping legal?
Web scraping lives in a gray area. Gathering publicly available data is generally legal, but scraping personal data, bypassing paywalls, or violating terms of service can cross into illegal territory.
Sead is a seasoned freelance journalist based in Sarajevo, Bosnia and Herzegovina. He writes about IT (cloud, IoT, 5G, VPN) and cybersecurity (ransomware, data breaches, laws and regulations). In his career, spanning more than a decade, he’s written for numerous media outlets, including Al Jazeera Balkans. He’s also held several modules on content writing for Represent Communications.
