The evolution of web scraping and what it means for the future

The world’s first scraper was born almost as soon as the Internet came into existence.

However, although the first web scraping API was launched at the turn of the millennium, the conversation around web scraping is as topical as it’s ever been. This is because of a recent first-of-its-kind decision by the U.S. Ninth Circuit Court of Appeals concluding that the collection of publicly available data does not breach the Computer Fraud and Abuse Act (CFAA). So, with that in mind, let’s explore the origins of this technology and its continuous cross-industry use.

Web scraping's beginnings

In 1993, the term ‘web crawling’ first popped up. This turned out to be a significant year for the technology: in June, Matthew Gray developed the World Wide Web Wanderer to measure the size of the Internet.

Later that same year, the Wanderer was used to generate an index called the ‘Wandex’, which enabled the first web search engine. Prior to the advent of JumpStation's web scraping technology, public web data collection was handled manually by an administrator who would collect and arrange data sets in the hope that they would match what customers were looking for.

We take web scraping technology for granted nowadays, with major search engines delivering a plethora of results in an instant and saving us long hours of manual work.

Knowledge is every industry's superpower

Decades later, collecting publicly available web data is a key foundation for many organisations. That’s because the Internet has become the biggest public data resource in the world. Public web content gives industry leaders fast, near-live access to actionable insights that strongly influence their organisational strategies as well as overall outcomes and even revenues.

Leading companies, for example, employ web scraping technology to acquire data on market conditions and realities such as product pricing and reviews, stock levels, and customer sentiment. The same technology is used by researchers, academics, investors, and journalists to acquire real-time insights and base their reporting on reliable data points. These insights include public sentiment and well-being, organisational team structures, growth opportunities, and the competitive landscape for target audience engagement.

Challenges to web scraping

Despite the clear, wide-ranging benefits of web scraping, in 2017 LinkedIn attempted to stop hiQ Labs, a data analytics company that collects publicly available data from LinkedIn profiles, from accessing its website. hiQ's technology is used by companies to retain highly desirable employees as well as to identify knowledge and skill gaps within the organisation. LinkedIn issued a cease-and-desist letter to hiQ Labs, demanding that it stop accessing LinkedIn's site, and a legal battle followed in the US.

This resulted in a court case in which a district court ruled in favour of hiQ, granting it a preliminary injunction. The case triggered a string of appeals in the following years and was eventually sent back to the Ninth Circuit. In April 2022, the Ninth Circuit upheld the preliminary injunction in hiQ’s favour, meaning LinkedIn could not block hiQ from accessing its website. The court held that LinkedIn’s claims that hiQ had breached laws such as the CFAA were unlikely to stand, as the data in question is publicly available.

What does the future hold then?

The Ninth Circuit's unprecedented decision confirms the premise on which the Internet, the world's largest public database, was built: democratising access to information for all. Scraping data that is freely available on the Internet is not a violation of the CFAA, according to the court.

Although the final outcome of this case is not yet known, and there could be more legal challenges to come, the latest ruling by the US courts is a big win for archivists, academics, researchers, journalists and businesses that rely on the insights that web scraping provides. Basically, everyone, big and small, has the right to access the same public online data sphere. Web data collection certainly has a bright future, especially considering the speed at which the amount of web data continues to grow. The fact that this data can be turned into valuable insights that bring progress and innovation fuels what should always remain an open market.
