Catching phish with web scraping

web scraping in the form of binary code, 3D illustration
(Image credit: Shutterstock/Profit_Image)

Phishing is, unfortunately, profitable, hard to detect, and relatively easy to engage in. With digital transformations expedited across the globe, phishing is bound to experience continued explosive growth.

According to Phishlabs, the number of phishing attempts over Q1 2021 increased by nearly 50%. There’s no reason to believe it will stop climbing either.

That means increased levels of digital harm and risk. To counteract such an uptick, new approaches to phishing detection should be tested or current ones improved. One way to improve existing approaches is to make use of web scraping.

Poking phish

Phishers would be hard-pressed to completely replicate the original website. Placing all URLs identically, replicating images, cooking the domain age, etc. would take more effort than most people would be willing to dedicate. 

Additionally, a perfect spoof would likely have a lower success rate due to the ability for the target to get lost (by clicking on an unrelated URL). Finally, just like with any other scam, duping everyone is not necessary, therefore the perfect replica would be a wasted effort in most cases.

However, those who do phishing aren’t dumb. Or at least those who are successful at it aren’t. They still do their best to make a believable replica with the least effort required. It may not be effective against those who are tech-savvy, but even a perfect replica might not be effective against the wary. In short, phishing relies on being “just good enough”.

Therefore, due to the nature of the activity, there’s always a glaring hole or two that can be discovered. Two good ways to get a head start is to either look for similarities between frequently-phished-websites (e.g. fintech, SaaS, etc.) and suspected phishing websites or to collect patterns of known attacks and work your way up from there.

Unfortunately, with the volume of phishing websites appearing daily and the intent to target less tech-savvy people, solving the issue may not be as simple as it seems at first glance. Of course, as is often the case, the answer is automation. 

Looking for phish

There have been more methods developed over the years. An overview article written in 2018 by ScienceDirect lists out URL-based detection, layout recognition, content-based detection. The former often lags behind phishers as databases are updated slower than new websites appear. Layout recognition is based on human heuristic and is thus more prone to failure. Content-based detection is computational heavy.

We will be paying slightly more attention to layout recognition and content-based detection as these are complicated processes that benefit greatly from web scraping. Back in the day, a group of researchers had created a framework for detecting phishing websites called CANTINA. It was a content-based approach which would check for data such as TF-IDF ratios, domain age, suspicious URLs, improper usage of punctuation marks, etc. However, the study had been released in 2007 when automation opportunities were limited.

Web scraping can improve the framework immensely. Instead of manually attempting to find the outliers, automated applications can breeze through websites and download the relevant content within. Important details such as the ones outlined above can be extracted from the content, parsed, and evaluated.

Building a net

CANTINA, developed by the researchers, had a drawback - it was only used to prove a hypothesis. For these purposes, a database of phishing and legitimate websites had been compiled. The status of both was known a priori.

Such methods are suitable for proving a hypothesis. They are not as good in practice where we don’t know the status of the websites ahead of time. Practical applications of projects similar to CANTINA would require a significant amount of manual effort. At some point, these applications would no longer stand as “practical”.

Theoretically, though, content-based recognition seems like a strong contender. Phishing websites have to reproduce content in a nearly identical manner to the original. Any incongruences such as misplaced images, spelling mistakes, missing pieces of texts can trigger suspicion. They can never stray too far from the original, which means metrics such as TF-IDF would have to be similar by necessity.

Content-based recognition’s drawback has been the slow and costly side of manual labor. Web scraping, however, moves most of the manual effort into complete automation. In other words, it enables us to use existing detection methods on a significantly larger scale.

First, instead of manually collecting URLs or taking them from an already existing database, scraping can create its own quickly. They can be collected through any content that has hyperlinks or links to these supposed phishing websites in any shape or form.

Second, a scraper can traverse a collection of URLs faster than any human ever could. There are benefits to manual overview such as the ability to see the structure and content of a website as it is instead of retrieving raw HTML.

Visual representations, however, have little utility if we use mathematical detection methods such as link depth and TF-IDF. They may even serve as a distraction, pulling us away from the important details due to heuristics.

Parsing also becomes an avenue for detection. Parsers frequently fall apart if any layout or design changes happen within the website. If there are some unusual parsing errors when compared to the same process performed on parent websites, these may serve as an indication of a phishing attempt.

In the end, web scraping doesn’t produce any completely new methods, at least as far as I can see, but it enables older ones. It provides an avenue for scaling methods that might otherwise be too costly to implement.

Casting a net

With the proper web scraping infrastructure, millions of websites can be checked daily. As a scraper collects the source HTML, we have all the text content stored wherever we’d like. Some parsing later, the plain text content can be used to calculate TF-IDF. A project would likely start out by collecting all the important metrics from popular phishing targets and move on to detection.

Additionally, there’s a lot of interesting information we can extract from the source. Any internal links can be visited and stored in an index to create a representation of the overall link depth.

It’s possible to detect phishing attempts by creating a website tree through indexing with a web crawler. Most phishing websites will be shallow due to the reasons outlined previously. On the other hand, phishing attempts copy websites of highly established businesses. These will have great link depths. Shallowness by itself could be an indicator for a phishing attempt.

Nevertheless, the collected data can then be used to compare the TF-IDF, keywords, link depth, domain age, etc., against the metrics of legitimate websites. A mismatch would be cause for suspicion. 

There is one caveat that has to be decided “on the go” - what margin of difference is a cause to investigate? A line in the sand has to be drawn somewhere and, at least at first, it will have to be fairly arbitrary.

Additionally, there’s an important consideration for IP addresses and locations. Some content on a phishing website might only be visible to IP addresses from a specific geographical location (or not from a specific geographical location). Getting around such issues, in regular circumstances, is challenging, but proxies provide an easy solution.

Since a proxy always has an associated location and IP address, a sufficiently large pool will provide global coverage. Whenever a geographically-based block is encountered, a simple proxy switch is all it takes to hop over the hurdle.

Finally, web scraping, by its nature, uncovers a lot of data on a specific topic. Most of it is unstructured, something usually fixed by parsing, and unlabeled, something usually fixed by humans. Structured, labeled data may serve as a great ground for machine learning models.

Terminating phish

Building an automated phish detector through web scraping produces a lot of data for evaluation. Once evaluated, the data would usually lose its value. However, like with recycling, that information may be reused with some tinkering.

Machine learning models have the drawback of requiring enormous amounts of data in order to begin making predictions of acceptable quality. Yet, if phishing detection algorithms start making use of web scraping, that amount of data would be produced naturally. Of course, labeling might be required which would take a considerable amount of manual effort.

Regardless of this, the information would already be structured in a manner that would produce acceptable results. While all machine learning models are black boxes, they’re not entirely opaque. We can predict that data structured and labeled in a certain manner will produce certain results.

For clarity, machine learning models might be thought of as the application of mathematics to physics. Certain mathematical modeling seems to fit exceptionally well with natural phenomena such as gravity. Gravitational pull can be calculated by multiplying the gravitational constant by the mass of two objects and dividing the result by the distance between them squared. However, if we knew only the data required, that would give us no understanding about gravity itself.

Machine learning models are much the same. A certain structure of data produces expected results. However, how these models arrive at their predictions will be unclear. At the same time, at all stages the rest is as predicted. Therefore, outside of fringe cases, the “black box” nature doesn’t harm the results too much.

Additionally, machine learning models seem to be among the most effective methods for phishing detection. Some automated crawlers with ML implementations could reach 99% accuracy, according to research by Springer Link.

The future of web scraping

Web scraping seems like the perfect addition to any current phishing solutions. After all, most of cybersecurity is going through vast arrays of data to make the correct protective decisions. Phishing is no different. At least through the cybersecurity lens.

There seems to be a holy trinity in cybersecurity waiting to be harnessed to its full potential - analytics, web scraping, and machine learning. There have been some attempts to combine two of three together. However, I’ve yet to see all three harnessed to their full potential. 

Chief Operating Officer at Oxylabs

With over 16 years of experience in the field, Juras has established himself as an expert in IT and product management. His ability to apply strategic problem solving, critical thinking, and people management skills have led him to occupy the position of COO at Oxylabs, a global provider of premium proxies and public web data scraping solutions.