The Panama Papers have made for the biggest leak in history: 2.6TB of data contained in 11.5 million files. To put that in perspective, the Offshore Secrets leak of 2013 came to 260GB, about a tenth the size of the Panama data trove.
The records, leaked from the database of offshore law firm Mossack Fonseca and passed to German newspaper Süddeutsche Zeitung, were shared with the International Consortium of Investigative Journalists (ICIJ), revealing how Mossack Fonseca helped its clients avoid tax and launder money.
The entire list of companies and people linked to them will be released in "early May", says the ICIJ, with a lot of information already revealed.
But 11.5 million files is a lot to sift through, so how was so much information analysed in such a relatively short amount of time?
Using a program called Nuix, a large-scale investigation platform built for big data, Süddeutsche Zeitung and the ICIJ dug through the Everest of information quite quickly. "The actual mining of the data started in September," said Carl Barron, Nuix's senior solutions consultant, who worked with the ICIJ and Süddeutsche Zeitung to conduct the investigation. "This is the largest data leak to a journalist body ever. It's 2.7 TB of information and it's 11.5 million items across lots of different file types."
Nuix first started working with the ICIJ more than four years ago. The two worked together on the Offshore Secrets leak of 2013, a collaboration successful enough that Nuix came on as a "theoretical" partner once again.
This time, the pile of information was substantially greater, but still - in theory - could have been indexed in a day and a half, according to Barron. Once the initial, readable data was indexed, which took a couple of weeks, Nuix was used to categorise items and pull out information based on file type, or on whether an item contained a person's name, credit card information or other details of interest.
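To make the categorisation step concrete, here is a minimal Python sketch of tagging an item by file type, by names of interest, and by anything that looks like a payment card number. It is an illustration only, not Nuix's method: the function names, the tag labels and the use of a Luhn checksum to filter card-number candidates are all assumptions.

```python
import re
from pathlib import PurePath

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def categorise(filename: str, text: str, names: list[str]) -> set[str]:
    """Return a set of illustrative tags for one item (tag names are made up)."""
    tags = set()

    # Tag by file extension, a crude stand-in for real file-type detection.
    suffix = PurePath(filename).suffix.lower().lstrip(".")
    tags.add(f"type:{suffix or 'unknown'}")

    # Tag any name of interest that appears in the text (case-insensitive).
    lowered = text.lower()
    for name in names:
        if name.lower() in lowered:
            tags.add(f"name:{name}")

    # Tag digit runs that also pass the Luhn check as possible card numbers.
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            tags.add("possible-card-number")

    return tags
```

The Luhn check only cuts down false positives; a number that passes it is not necessarily a real card.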
'We didn't realise the story was going to break'
As for Nuix, it didn't actually know what it was dealing with. "Due to the confidentiality and the top secret nature, we didn't actually see the data," said Barron. "And to be honest we didn't realise the story was going to break."
Nuix deals with massive data loads all the time. For a leak it was huge, but according to Nuix this was "quite a routine amount of information".
"We have some customers processing 300 terabytes worth of information in a matter of a month," said Barron, "so it's quite common for us to see this."
In the initial stages there was a lot of back and forth between the ICIJ, Süddeutsche Zeitung and Nuix, as they determined what was needed from a hardware and workflow perspective. Then, once it was set up, it was a pretty straightforward process.
Nuix continued to work in a consultancy capacity, dealing with technical and workflow questions - all the while unaware of the data that was being dealt with, and how much of a story it was going to create.
"My understanding is, and I've had this confirmed, is that they reduced the data size," said Barron. This was done through a process called deduplication. "Deduplication is used to identify the same item that may have been saved a number of times on a system," he explained, so investigators aren't required to look at the same data more than once.
Once the initial indexing had been carried out, the team could move on to identifying things like unsearchable items. A lot of information was unreadable, so optical character recognition (OCR) software was used to analyse "closed" files, such as PDFs, scans and images, and turn their contents into text that could be read by computers.
"There were a lot of files that were identified as non-searchable," said Barron on the Panama Papers. "So they spent a few more weeks, a couple more months, on that on the back of [the initial search], bringing the OCR items alongside the digital items."
Optical Character Recognition - damn useful
The indexing can also surface content that a reader would normally miss. For example, a document might contain white text that would not be visible if you opened it in Microsoft Word.
"It's quite a common thing we've seen in the past," said Barron. "So it just looks at the zeroes and ones in a forensic manner, and what it does is go through and index all the information that is available there, even if it's hidden or formatted in a way that would typically not be seen."
Barron explained that the information is all live on Nuix's platform, so the investigators can go back and easily search for other information if needed. If an investigation comes to light involving a particular individual, the researchers can start connecting the dots and pulling the info together from the database.
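The reason a "live" index allows this is that the expensive work of reading every item happens once, after which new search terms can be run against the index cheaply. A toy inverted index in Python shows the idea. The class and method names are hypothetical, and a real platform would add far more, such as phrase search, metadata and ranking.

```python
import re
from collections import defaultdict


class SearchIndex:
    """Toy inverted index: each word maps to the set of documents containing it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        """Index a document once; later searches never re-read the text."""
        for token in re.findall(r"\w+", text.lower()):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> set[str]:
        """Return the documents containing every word in the query."""
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return set()
        # Copy the first posting set so intersections do not alter the index.
        result = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result
```

When a new name surfaces months later, only `search` needs to run again; the documents themselves stay untouched.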
"So the good thing is, if anything comes to light off the back of the investigation, they can keep running their search terms over this information," Barron said.
"The system is live and there to be used."