Web scraping, also known as web data extraction and web harvesting, is the process of extracting data from a website. This means if you've ever copied and pasted from a web page to your device, technically you're a web scraper.
Still, web scraping is usually considered to mean something more than a simple copying of text. The process generally involves accessing a site's HTML code and extracting relevant information.
In this guide, you'll learn more about web scraping, as well as some of the most popular ways people use it to gather data.
What is Web Scraping used for?
Web scraping is done for a number of reasons, including:
Major search engines like Google will scrape or 'crawl' websites to determine relevant search results when users type in keywords.
Scraping is an excellent way for retailers to perform market research and check out the competition.
News websites often use scraping to gather information from multiple sources and populate a live feed.
Scraping can be used to build data sets for machine learning software, though this has proved controversial as original creators are sometimes not credited for derivative content the programs create.
Price monitoring is another very popular use for scraping (usually with automated tools). It's mostly used to check the prices of products on e-commerce websites.
Companies can use web scraping to collect data from social media sites like Facebook and Instagram to gauge the general sentiment surrounding their products/services. This can help them either to improve their service or create goods consumers are more likely to want.
Businesses can use scraping to collect contact information like email addresses and phone numbers for potential clients and/or partners.
Can any website be scraped?
In theory, it's possible to extract HTML data from virtually any website. Still, not all sites allow web scraping, as it can increase bandwidth usage, which the owners must pay for. Most sites with a significant number of visitors maintain a 'robots.txt' file which specifies whether scraping is allowed and any rules you need to follow, e.g. the number of data requests per minute. Scraping also isn't always legal, depending on the method used and your jurisdiction (see below).
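As a minimal sketch of how a script might consult these rules before scraping, the example below uses Python's standard urllib.robotparser module. The robots.txt content, the "example-bot" user agent and the example.com URLs are all made up for illustration, and the file is inlined so the example runs without a network connection. In practice the file would be downloaded from the site's root.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" is an assumed user-agent name for our scraper.
allowed_public = parser.can_fetch("example-bot", "https://example.com/products/")
allowed_private = parser.can_fetch("example-bot", "https://example.com/private/data")
delay = parser.crawl_delay("example-bot")  # seconds to wait between requests, if set

print(allowed_public, allowed_private, delay)
```

A scraper that respects these rules would skip any URL for which can_fetch returns False and wait at least the stated delay between requests.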
Web Scraping Methods
There are two broad methods for internet users to carry out web scraping: manually and automatically.
You've already encountered one manual method in the example cited above, whereby a user copies and pastes text from a website. This can actually sometimes be the quickest and easiest way to retrieve data, particularly if the site in question has measures to prevent programmed scraping tools.
Automated scraping tools can be a great way to quickly gather large amounts of data but can also increase web traffic. This is why many sites do not allow scraping of their data and even code their pages in a way that prevents it.
Manual scraping usually involves using custom scripts or a web browser's built-in tools to view and extract a page's source code, after which it can be analyzed, e.g. by pasting it into a spreadsheet.
The general process works as follows:
The advantages of scraping web data in this way include:
Many websites ban automated scraping of data and even have tools to block the IP of any device that seems to be using them. A real person using their own web browser is far less likely to be banned, particularly as an individual gathering data from a site probably doesn't go against the terms and conditions.
If the layout of a website changes, this can prevent automated scraping tools from gathering data until they've been reprogrammed. If the scraper in question is a human being, however, they can simply access the new page or page section to retrieve the data they need.
Manual web data scraping also has its downsides, however. These include:
Manually scraping data from a website is a little like picking apples from a tree one by one by hand. Sure, you can do it if you have a few hours to spare, but it's not nearly as efficient as a machine that can shake them all down in minutes. This issue is especially critical if you need to view real-time information like pricing.
Whether you're scraping yourself or paying someone else to, this amounts to a huge number of man hours which need to be paid for with your time or money. Automated tools, on the other hand, can be used for free or for a low subscription fee depending on how much data you want to gather.
While we're on the subject of how much data to gather, you also need to consider how to scale things up. If, for instance, your competitor opens 12 new websites, will you have sufficient manpower to scrape each of these in addition to those from which you're already harvesting data?
Automated scraping usually involves using tools like Python scripts and specialist bots (sometimes working via an API) to extract data from multiple web pages. There are a number of tools available to do this but in general the process works as follows:
Bear in mind this is a very simplified overview of how automated web scraping tools work. For instance, some include features like a 'CAPTCHA Solver', which can make use of human beings and/or AI to defeat CAPTCHA challenges.
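To give a rough feel for the extraction step, here is a minimal sketch using only Python's standard html.parser module. The sample HTML, the "product", "name" and "price" class names and the ProductParser helper are all hypothetical. A real scraper would first download the page with an HTTP request and would need selectors matched to the actual site.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for HTML fetched from a site.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">31.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects (name, price) pairs from <li class="product"> items."""

    def __init__(self):
        super().__init__()
        self.products = []   # completed (name, price) records
        self._current = {}   # fields of the item being read
        self._field = None   # which field the open <span> holds

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self._current = {}
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and {"name", "price"} <= self._current.keys():
            # Only keep items where both fields were found.
            self.products.append(
                (self._current["name"], float(self._current["price"]))
            )
            self._current = {}


def extract_products(html: str) -> list:
    parser = ProductParser()
    parser.feed(html)
    return parser.products


products = extract_products(SAMPLE_HTML)
print(products)
```

Third-party libraries offer more convenient ways to select elements, but the overall shape is the same: parse the markup, locate the elements of interest and store their contents in a structured form.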
There are definite advantages to using automated scraping tools, including:
Automated web scraping can be an excellent way to retrieve large amounts of information quickly. This is vital if you're monitoring data in real time like stock prices or product inventories.
Unlike manual methods, bots can adapt much better to scanning more sites and pages for data. It's simply a matter of configuring the URLs and HTTP requests correctly. This is particularly useful if you need to process large amounts of data.
Computers are excellent at carrying out repetitive tasks in the same way. This not only results in more accurate data but frees up time and resources for people within your organization to analyze the information the scraping tool gathers.
Automated scraping does have its drawbacks though, including:
Coding a scraping script from scratch can be complicated, so it's only recommended for experienced programmers. Libraries and frameworks such as 'Beautiful Soup' or 'Scrapy' can make this process easier. You'll also need to carefully select the URLs for web pages you intend to scrape to make sure the tool only gathers relevant data.
Not all websites allow data scraping and those that do may place restrictions on what information can be gathered, as well as how often your bot can query it. These can be found in the site's robots.txt. Breaking these rules may not be illegal but can result in the IP address for your scraping tool being blocked, making it harder to gather data.
If the structure of a page changes slightly, a human carrying out scraping tasks can adapt quickly and gather the data anyway. Automated tools tend to break unless they're configured specifically to gather data from each page. Worse, if they keep making repeated invalid requests for data they can trigger a site's anti-bot detection measures and be blocked.
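One way to reduce the risk of being blocked is to pause between requests and to stop as soon as several requests in a row fail, rather than hammering the site. The sketch below illustrates that idea under stated assumptions: polite_crawl, fake_fetch and the example.com URLs are hypothetical, the fetch function is passed in so the logic can be shown without real network access, and the delay and failure limit are arbitrary values you would tune to the site's own rules.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def polite_crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    delay_seconds: float = 1.0,
    max_consecutive_failures: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between requests.

    `fetch` returns the page text, or None if the request failed.
    The crawl stops early after too many failures in a row.
    """
    results: Dict[str, str] = {}
    failures = 0
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        page = fetch(url)
        if page is None:
            failures += 1
            if failures >= max_consecutive_failures:
                break  # something is wrong; stop instead of retrying blindly
            continue
        failures = 0
        results[url] = page
    return results


# Stand-in for a real HTTP request: "broken" URLs simulate failed requests.
def fake_fetch(url: str) -> Optional[str]:
    return None if "broken" in url else f"<html>{url}</html>"


urls = [
    "https://example.com/a",
    "https://example.com/broken1",
    "https://example.com/broken2",
    "https://example.com/b",
]
waits = []  # records the pauses instead of actually sleeping
pages = polite_crawl(
    urls, fake_fetch, delay_seconds=2.0, max_consecutive_failures=2, sleep=waits.append
)
print(list(pages), waits)
```

In this run the crawl gives up after the second failed request in a row, so the final URL is never requested.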
Is Web Scraping legal?
Scraping can occupy a grey area under law. If the scraping methods you use don't violate a website's terms and conditions and don't cause a significant spike in web traffic, you're unlikely to have any legal troubles. Still, this can depend on what data is scraped, how you scrape it and the law in your home jurisdiction. Some significant cases in which scraping has been tested in court include:
In eBay v. Bidder's Edge (2000), eBay sued Bidder's Edge, a third-party website, for scraping and displaying eBay's auction listings without permission.
Since 1997 Bidder's Edge (BE) had acted as an aggregator of auction listings, scraping auction information from various sites like eBay. The auction giant initially allowed this and the two parties even discussed a licensing agreement. However, they couldn't come to an agreement on how often eBay pages could be scraped.
BE's platform was accessing eBay around 100,000 times a day, accounting for a significant portion of its web traffic.
eBay initially tried to block BE's IP addresses but this proved ineffective, as they simply switched to using proxy servers, so the case ended up in court.
The court held that the automated web scraping by Bidder's Edge constituted trespass to eBay's computer servers and disrupted eBay's operations, so it issued an injunction against BE.
The case established that scraping could be considered disruptive to a business' activities and that simply having a website isn't an open invitation for others to do as they like with it.
Meltwater was a SaaS (Software as a Service) company that offered news monitoring services to subscribers. It did this by offering clippings of news content that had been scraped from sources all over the internet, including articles generated by Associated Press.
AP eventually took Meltwater to court for copyright infringement. Meltwater countered that their activities fell under the doctrine of "fair use" under US copyright law and that their platform aggregated content in much the same way as a search engine.
The court ruled in favor of the Associated Press, finding that Meltwater's copying of AP content (even as excerpts) did not qualify as fair use and ordered them to pay damages. Interestingly, when a case was brought against Meltwater in the UK, the Supreme Court actually sided with the news aggregator, even though British law has no concept of "fair use".
To Scrape or not to Scrape
As we've seen, there are many potential uses for scraping. If you want to gather useful business intelligence, monitor prices and generate leads then scraping can be an excellent way to get ahead, particularly if you make use of automated tools to gather and summarize data.
If, on the other hand, you plan to base your entire business model around web scraping, then it's probably best to take legal advice to avoid being sued by the data holders.
Whether it's for your personal curiosity or in the name of big business, make sure to follow our recommended best practices when scraping to avoid being banned.
Nate Drake is a tech journalist specializing in cybersecurity and retro tech. He broke out from his cubicle at Apple 6 years ago and now spends his days sipping Earl Grey tea & writing elegant copy.