A massive industry you've never heard of is destroying your favorite websites

web scrapers
(Image credit: Future)

Remember that interesting article you read just a few days ago on one of your favorite sites? Chances are it's now been spread around the web - and not in a good way. Spammy sites are constantly stealing content from legitimate websites, passing it off as their own, and making bank in the process.

Take a look at these TechRadar Pro articles that were copied and then posted on other sites. While the articles were copied in their entirety, the websites that stole them made sure to change the author names to avoid giving credit to the original author.

Unfortunately, it's not just one or two sites stealing articles - there are many. According to Paul Bischoff, Privacy Advocate at research firm Comparitech, scraping content like this is "incredibly easy".

"Even a novice programmer can make a scraper in Python, though some are more sophisticated than others at avoiding detection," he told us. "Despite this, scrapers are difficult to distinguish from legitimate users. Even big sites haven't figured out how to combat them."

To see the extent of the problem, copy a sentence from an article on your favorite site and paste it into Google in quotation marks. After doing so you will likely find many articles containing the exact same phrase, which is a good indication that the sites that posted them stole the content.

For example, a recent TechRadar Pro article entitled "It's time CEOs gave SEO the attention it deserves - here's why" had 77 results when a quote was copied and pasted into Google.

This kind of content stealing is both illegal and unethical. So how and why does it happen?

Scraper shenanigans

Plagiarizing sites steal content from legitimate websites via web scraping, which is the technical term for extracting information from a website. The simplest form is done manually - you’ve probably even done it yourself. For example, if you read an article you like, you might decide to copy and paste some text from it as a quote and link to it in a social media post. As you’ve extracted information from the site the article was posted on, you’ve scraped it.

Usually, though, web scraping is done by computer programs, because computers can do the job far faster than humans. So how does a web scraper work? It's usually made up of two parts that work together: a crawler and the scraper itself. The crawler finds the URLs you want to scrape and downloads their HTML files. The scraper then finds the information of interest in the HTML file and extracts it, storing it in, for example, a spreadsheet.
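To make that concrete, here is a minimal sketch of what such a scraper might look like in Python, the language Bischoff mentions. Everything specific in it - the URL, the HTML tags and the output file - is a hypothetical placeholder, since real scrapers are tailored to the layout of whichever site they target.

```python
# A minimal scraping sketch: fetch one page, pull out the headline and body
# text, and append them to a CSV file. The URL and the HTML tags below are
# hypothetical placeholders, not taken from any real site.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-article"        # the "crawler" step: a URL to fetch
html = requests.get(url, timeout=10).text       # download the page's HTML

soup = BeautifulSoup(html, "html.parser")       # the "scraper" step: parse the HTML
headline = soup.find("h1")
title = headline.get_text(strip=True) if headline else ""
body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))

# store the extracted data somewhere spreadsheet-friendly, e.g. a CSV file
with open("scraped_articles.csv", "a", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow([url, title, body])
```

A real operation would simply wrap something like this in a crawler that loops over many discovered URLs.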

If you know how to code, you can build your own scrapers from scratch. But you don’t need to be a skilled programmer to get up and running with web scraping - a quick Google search reveals plenty of pre-built scrapers that can be used without any programming knowledge.

The information you get from web scraping can be incredibly valuable. For example, it allows companies to gather data about their competitors’ prices, which they can then use to adjust their own. It can also be used to identify sites that are violating copyright by hosting stolen content or selling counterfeit goods, so that appropriate action can be taken. Web scraping is even valuable to scientific researchers, who often need to obtain and analyze data such as coroner's reports or clinical trial data, as scraping can make gathering the information much faster.

Content stealing sites also use scrapers, but for the malicious purpose of copying original articles from legitimate websites or from their RSS feeds. Then they create their own website and post the stolen content as their own, getting credit for other people’s hard work. This hurts the websites that you love and can also lead to legal troubles and hefty financial penalties for the content stealers.

In some instances, scraper sites even run stolen content through a synonym builder, which swaps in words at random in an effort to evade checks and balances. Inevitably, this results in poor-quality copy, and yet these fakes sometimes rank higher in search listings than the original article.

Why do content thieves steal content and try to pass it off as their own? Because unfortunately there are plenty of ways for them to make a profit doing so. The most obvious ones are advertising and affiliate links. But there is another lucrative, but illicit, option: selling backlinks.

So what is a backlink? Suppose you have a website and I have a blog. In one of my blog posts, I might cite a piece of information I found on your website and so include a link back to it. That link is called a backlink to your website.

Backlinks are a valuable commodity because they signal to search engines that your site has useful content. If lots of reputable websites link to you, then search engines are likely to rank your website higher in their search results. That’s why one of the main tasks of search engine optimization (SEO) is to build a high quality backlink profile.

Not all backlinks will help you rank higher in search engine rankings, though. For example, if a site links to you as part of a paid advertisement, Google requires them to give that link either a ‘nofollow’ or a ‘sponsored’ attribute. These attributes tell Google not to count those links when determining how to rank your site (though it does use them as “hints”). After all, if a backlink has been paid for, the site linked to you because it was paid, not because it found your content useful.
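In the page's HTML, those attributes are just values on a link's rel attribute - a paid link, for instance, might carry rel="sponsored". As a rough, hypothetical illustration, the short Python sketch below fetches a page and sorts its outbound links by that attribute; the URL is a placeholder, and real SEO tools do this far more thoroughly.

```python
# Hypothetical sketch: list a page's links and note which carry the
# 'nofollow' or 'sponsored' rel attributes. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(page, "html.parser")

for a in soup.find_all("a", href=True):
    rel = a.get("rel") or []                  # e.g. ["nofollow"] or ["sponsored"]
    if "nofollow" in rel or "sponsored" in rel:
        print(f"{a['href']}  -> discounted for ranking ({', '.join(rel)})")
    else:
        print(f"{a['href']}  -> counts as a normal, 'followed' link")
```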

Picture of the Earth with a web of links over the surface

(Image credit: Shutterstock / NicoElNino)

So how do you build up a high-quality backlink profile? Google’s advice is “to create unique, relevant content that can naturally gain popularity in the internet community”. Doing this successfully requires a lot of ingenuity, time and effort. For those with money to burn who want a quicker and easier approach, buying backlinks without the ‘nofollow’ or ‘sponsored’ attributes is an attractive option. It is not without its risks, however. If Google determines that you’re buying or selling backlinks without the appropriate attributes, your site’s ranking will suffer, or it may even disappear from Google’s search results altogether.

But plenty of backlink buyers and sellers are undeterred by the threat of penalties from Google. The SEO industry, of which backlink selling is a part, is a huge one, estimated to be worth approximately $80 billion. While the size of the backlink-selling business in particular doesn't seem to have been fully investigated, Joshua Hardwick, Head of Content at SEO software company Ahrefs, told TechRadar Pro that "it does seem like it's quite a popular way to get backlinks simply because getting backlinks is difficult to do in a legitimate way".

For example, Nathan Gotch, of Gotch SEO, describes how he earned more than $470,998 selling backlinks to 507 new customers from 2017 to 2019. Though Gotch is no longer in the backlink selling industry, plenty of other SEO agencies continue to sell backlinks and freelancer sites like Fiverr have an abundance of backlink sellers offering their services.

Ahrefs found that, on average, a paid backlink on a blog cost $356.92 in 2016, and $361.44 in 2018. It also found that paid backlinks were considerably cheaper if they came in the form of a guest post written by the buyer. Such a guest post, and the included backlink, cost $77.80 on average in 2018. 

Hardwick said that he would expect the results of a survey in 2021 to be similar to the results from 2016 and 2018. "I'd imagine links are a little bit more expensive. I imagine there's a few more people wanting to sell things, but I wouldn't say it's exploded,” he told us.

With good money to be made, it’s unsurprising that some of the sites stealing content from TechRadar are selling backlinks too. I contacted three of them - News Nation USA, NY Press News and USA Tribune Media - to see if they would sell me backlinks in the form of a guest post. They all responded positively.

News Nation USA charges $275 for a guest post that includes two 'do-follow' backlinks (i.e. backlinks without the ‘nofollow’ attribute). It has very few requirements for guest posts, noting only that adult content is not allowed and that each post should be longer than 350 words and accompanied by a picture.

The rates at USA Tribune Media are slightly cheaper, at least for regular posts. The company charges $250 for a guest post with a do-follow link, provided the topic isn't CBD or gambling; posts on those topics cost $80 more. It also offers link insertion, where backlinks are added to existing content without the buyer needing to write a guest post. The prices for this are slightly cheaper, at $200 for the insertion of a regular link and $300 for one about CBD or gambling. The site was also explicit that “fake news/information or defamation of any product or the company is totally prohibited.”

NY Press News has the most complicated lineup of backlink options to choose from. One guest post on a general or tech topic that includes two ‘dofollow’ backlinks costs $275, but it offers three such guest posts for the discounted price of $525. Guest posts on topics including health, e-commerce, gambling or cryptocurrency command a higher price, however. The site operators also offer link insertion, which is cheaper, but they only include one do-follow link instead of two.

How scraper sites harm legitimate websites

While imitation may be the sincerest form of flattery, content-stealing websites cause significant harm to the sites they steal from.  

Perhaps the most obvious problem is that these sites don’t just steal content - they steal traffic. A search for recent TechRadar articles, for example, brings up a whole host of other sites with similar or exactly the same content, meaning that readers may click on one of the content-stealing sites instead of the original. Worse, Google will sometimes even prominently feature plagiarized content in its “Top stories” box.

web scraping

A Bing search for the title of a recent TechRadar article, “It looks like macOS Monterey will launch with its best feature after all” by Matt Hanson. Note the plagiarized copy posted by News Nation USA. (Image credit: Future)

web scraping

The third article listed in the “Top stories” box is a stolen article from TechRadar that was posted on News Nation USA. (Image credit: Future)

Legitimate publications invest time, effort and money into producing useful, interesting content. Many are funded, at least in part, via advertising and affiliate commissions. So when content-stealing sites copy their articles and steal their traffic, the original sites lose income, making it harder for them to continue producing content that you love.

Content-stealing sites can also tarnish the reputation of legitimate sites. Some select domain names that are similar to legitimate websites, presumably in the hope that readers will mistake them for the real deal. When a reader is fooled, clicks through and is greeted by a sub-standard site and paid-for links to shady places, they may associate those bad features with the legitimate site.

As an example, the domain NewsNationUSA.com, which copies content from other sites, is very similar to NewsNationNow.com, a legitimate site owned by Nexstar Media Group. In fact, Nexstar filed a complaint in September against NewsNationUSA.com and eight other domains for trademark counterfeiting and infringement, as well as copyright infringement for stealing its content. As part of the complaint, it argues that NewsNationUSA.com and the eight other sites have harmed its reputation.

Content stealing websites harm legitimate ones in a less obvious way, too: by potentially damaging their SEO. The first way they do this is by creating duplicate content. This can be problematic as search engines might not be able to figure out which site originally created the stolen content and, in rare cases, may result in a plagiarizing site ranking above the legitimate one in search results.

The second SEO problem content-stealing sites can cause comes in the form of backlinks. While backlinks from reputable sites improve a site’s search engine rankings, backlinks from disreputable ones can harm it.  

If a website’s content is stolen by a scraper site, the scraper site may include backlinks to the original. For example, suppose an article on a legitimate site has internal links (i.e. links to other articles on the same, legitimate domain). If that article is copied and pasted by a scraper site, then the scraper site now has backlinks to the legitimate website.

Google is getting better and better at recognizing scraper sites and ignoring any backlinks from them, rather than penalizing the sites they link to. But if Google doesn't realize that the content-stealing site linking back to a legitimate site is really a scraper, the link might reflect poorly on the original site. One bad backlink won't do any harm, but if a legitimate publication accumulates backlinks from many such sites, this could in theory trigger a toxic backlink penalty, which would make its search engine rankings, as well as its traffic, plummet. This isn't a likely outcome, but it is at least a theoretical possibility.

For example, Kinsta, a WordPress hosting platform, earned a backlink from WordPress.org’s blog. That post was later plagiarized by hundreds of scraper sites. Although it wasn't Kinsta’s content that was scraped, it was "caught up in a massive scraping war" and so had hundreds of backlinks from scraper sites. Kinsta acknowledged in its post that Google is good at figuring out what is going on in situations like this and so was unlikely to penalize it for those spammy backlinks. The company didn’t want to leave anything to chance, however, so took action to protect itself.

Fighting back against content-stealing sites

Legitimate sites can fight back and protect themselves against plagiarizing scraper sites on a number of different fronts. First, they have ways to get the stolen content removed.

According to Jake Moore, cybersecurity expert and spokesperson for ESET, the first port of call is for the legitimate site to contact the owner of the plagiarizing site and ask for them to remove the stolen content. He told us it’s important to “make them aware that what they are doing is illegal and unethical”.

Unfortunately, however, in many cases the owner may be hard to find, or may not take kindly to being asked to remove content from their site. In this situation, if the content-stealing sites are based in the US, owners of the stolen content can submit a Digital Millennium Copyright Act (DMCA) takedown notice to whoever hosts the plagiarizing site. If it’s based in a different country, there will often be similar takedown procedures.  

Tools like Domain Tools and Who Is Hosting This? can help owners track down the host in order to send them a takedown notice. If the plagiarizing site is using Cloudflare, finding the real host will be difficult. Cloudflare does, however, have an abuse form that can be used for copyright violations, and it indicates that it will pass the information on to the host and the website owner.

Sometimes, however, even contacting the host doesn’t work. In that case, legitimate sites can ask for stolen content to be removed from search engines. For example, Google and Bing both have forms that can be used to report stolen content, and DuckDuckGo has a dedicated DMCA email address. The stolen content will still exist on the plagiarizing sites, but if it is removed from search engines, at least it will be much harder for people to find.

A much more time-consuming and expensive remedy is to file a copyright infringement suit. In the US, for example, if a legitimate site has registered the copyright of their content, they can file a complaint in federal court. This is the route Nexstar Media Group is currently taking against NewsNationUSA.com. The firm is seeking damages, which could be as much as $150,000 per stolen article, and is asking for the scraper sites to pay the legal fees as well.

The second way that legitimate website owners can defend themselves from scraper sites is to protect their SEO. One way to do this is by making use of the rel=canonical tag. This tag, which goes in the header of the HTML code of a site, tells search engines which copy of a page is the original. Placing a self-referential rel=canonical tag in the header protects against scraper sites that copy and paste the HTML code. That’s because they also copy the rel=canonical tag that identifies the legitimate site as the original source, reducing the risk that the content-stealing site will rank above the original in search engine results.

However, if the plagiarizing site doesn’t just copy and paste the HTML code then this won’t help much. TechRadar, for example, makes use of rel=canonical tags to indicate that its articles are original. But News Nation USA adds its own self-referential rel=canonical tag to the articles it copies.
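One way to see this for yourself is to look at what a page declares in its own head section, where the tag appears as something like <link rel="canonical" href="...">. The hedged Python sketch below fetches a page and prints the URL its rel=canonical tag points to; the address used is a placeholder.

```python
# Hypothetical sketch: fetch a page and print the URL its rel=canonical
# tag points to, if it has one. The URL below is a placeholder.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com/article", timeout=10).text
soup = BeautifulSoup(page, "html.parser")

canonical = soup.find("link", rel="canonical")
if canonical and canonical.get("href"):
    print("This page declares its canonical source as:", canonical["href"])
else:
    print("No rel=canonical tag found")
```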

Legitimate sites can also protect their SEO when they find toxic backlinks from shady sites by considering whether to use Google’s disavow tool. This lets legitimate sites tell Google that they have nothing to do with the toxic links and don’t want them to count towards their search engine rankings, meaning they won’t be penalized if Google takes a dim view of those links. Disavowing is exactly what Kinsta did to avoid any link penalties when it received hundreds of backlinks from scraper sites.

Before using the disavow tool, legitimate sites need to contact the content-stealing scraper sites that link to them and ask them to remove the backlinks. If that doesn’t work, they can use the tool to upload a list of the linking URLs they want Google to ignore. Google, however, discourages sites that haven’t paid for backlinks themselves from using the tool and warns that “if used incorrectly, this feature can potentially harm your site’s performance in Google Search results.”
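For reference, the list uploaded to the disavow tool is just a plain text file, with one URL or domain per line. A minimal, entirely hypothetical example (the domains are placeholders) might look like this:

```
# Scraper sites that republished our articles and ignored removal requests
domain:content-thief-example.com
# A single spammy page rather than a whole domain
https://spam-example.net/stolen-article/
```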

Legitimate sites can also take action to prevent scrapers from stealing their content in the first place. As Moore told TechRadar Pro via email, “other than reporting the site it is possible to turn off the right click function which enables copying of any text which can reduce the problem but not stop it completely.”  

If content is being scraped from a legitimate site’s RSS feed, then switching from a full feed that shows articles in their entirety, to one that only shows a summary of the articles, can also help prevent content theft. If legitimate websites can figure out the IP address of the malicious scrapers, another option is to block them from accessing the site. And sites that want to get really creative can set things up so that alternative content, like gibberish, is displayed to the malicious scrapers rather than the real content.
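As a rough sketch of those last two ideas, the hypothetical example below uses the Flask web framework (an assumption on our part - any server stack has its own equivalent) to refuse requests from one known scraper address and serve gibberish to another. The IP addresses are documentation-range placeholders.

```python
# Hypothetical sketch: refuse or mislead requests from known scraper IPs.
# The Flask framework and the IP addresses are illustrative assumptions.
from flask import Flask, request, abort

app = Flask(__name__)

BLOCKED_IPS = {"203.0.113.7"}         # scrapers to refuse outright (placeholder)
DECOY_IPS = {"198.51.100.42"}         # scrapers to feed gibberish (placeholder)

@app.route("/articles/<slug>")
def article(slug):
    ip = request.remote_addr
    if ip in BLOCKED_IPS:
        abort(403)                    # block the scraper entirely
    if ip in DECOY_IPS:
        return "lorem ipsum " * 200   # serve junk instead of the real article
    return f"The real article for {slug} goes here."
```

In practice, scrapers rotate through IP addresses, so blocklists like this are a cat-and-mouse game rather than a complete fix.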

What can you do?

It takes legitimate sites a lot of time, effort and money to fight scraper sites that steal their content and they don’t always succeed in getting the stolen content removed. But the unfortunate reality is that content stealing sites are not going to disappear, at least not when it is so easy to copy someone else’s content and use it to make a profit. This puts the future of legitimate sites in jeopardy.  

Is there anything that readers of legitimate sites can do to help? Perhaps. If it becomes less profitable to sell backlinks on content stealing sites, then plagiarists might decide that it’s more trouble than it’s worth.  

The first thing you can do to help is simply not to buy backlinks as part of your SEO strategy. That way, none of your money is lining the pockets of scammy sites. If you decide you absolutely must buy backlinks, then at least do your research thoroughly so that the links don't end up on sites that post stolen content. This protects you, too: if you pay for a backlink on a content-stealing scraper site and its host takes the whole site down for copyright violations a week later, that's money down the drain.

Beyond committing to not buying backlinks, there are some other things you can do to help protect your favorite sites. Content-stealing sites thrive on traffic and links: the more of both they get, the better they do. So don't give them either. That means being careful about the content we consume and share.

Some of the guidelines for identifying and avoiding fake news apply to content stealing sites as well. For example, if you’re looking at a news site you’re not familiar with, check the domain closely to make sure it’s not trying to trick you into thinking it’s another, better known site. And look to see if it has a genuine-sounding “About” section that provides information about who runs it and how it is run.  

If something seems off, try putting exact phrases from the site in quotation marks and search for them on Google to see if articles by better-known publications appear. If they do, chances are the suspicious site is stealing content. In that case, avoid it in future, don’t link to it in social media posts or anywhere else, and recommend that friends linking to that site link to the original articles instead.

That way we can all do something, even if it’s small, to protect the sites we love.

Rebecca Morris

Rebecca Morris is a freelance writer based in Minnesota.  She writes about tech, math and science.