Plain sailing with data lakes

About the author

Neil Barton is the Chief Technology Officer of WhereScape.

It seems like everywhere you turn, someone is talking about big data this, or data analytics that. Supporting this move to data-driven businesses is a whole range of different data infrastructures, but it can be difficult to wrap your head around where your data lakes and data warehouses meet, and why you might even need a data vault. 

Each of these concepts, though, simply boils down to finding ways to ingest and manage your data in an effective way for today’s level of insight-driven decision-making. So what are the options, how do they relate, and what are they used for?

Data lakes

Data lakes are huge collections of data, ranging from raw data that has not been organised or processed through to data sets with varying levels of curation. From an analytics perspective, one of their benefits is that different types of consumers can each access the data appropriate to their needs.

This makes them well suited to some of the newer use cases such as data science, AI and machine learning, which many companies view as the future of analytics work. A data lake is a great way to store masses of raw data on scalable storage without first attempting traditional ETL (extract, transform, load) or ELT (extract, load, transform), which can be expensive at this volume.
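To make the ETL and ELT distinction concrete, here is a minimal sketch in Python. The record layout, the `transform` rules and the in-memory lists standing in for a warehouse and a lake are all assumptions made for illustration, not part of any particular product. The only point is the ordering: ETL cleans data before it is stored, while ELT stores raw data first and transforms it when it is queried.

```python
# Hypothetical raw records, as they might arrive from a source system.
raw_events = [
    {"customer": " Alice ", "amount": "10.50"},
    {"customer": "BOB", "amount": "n/a"},
    {"customer": "carol", "amount": "7"},
]


def transform(record):
    """Clean one record; return None if the amount cannot be parsed."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    return {"customer": record["customer"].strip().lower(), "amount": amount}


def etl(source):
    """Extract, transform, load: only cleaned records reach the target."""
    target = []
    for record in source:
        cleaned = transform(record)
        if cleaned is not None:
            target.append(cleaned)
    return target


def elt(source):
    """Extract, load, transform: store raw records first, transform on demand."""
    lake = list(source)  # loaded as-is, nothing is discarded

    def query():
        # The transformation runs only when someone asks for curated data.
        return [c for c in map(transform, lake) if c is not None]

    return lake, query


warehouse_rows = etl(raw_events)
lake_rows, curated_view = elt(raw_events)
```

In this sketch the ETL target holds two cleaned rows, while the ELT lake still holds all three raw rows, including the one that failed parsing, which remains available for later exploration.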

However, for more traditional analytics, this type of data environment can be unwieldy and confusing – which is why organisations turn to other solutions to manage essential data in more structured environments.  

In terms of positioning, data lakes sit upstream of the rest of the data infrastructure. They can be used as a staging area for a more structured approach such as a data warehouse, as well as supporting data exploration and data science.

Data warehouses

A data warehouse, or an enterprise data warehouse as it is sometimes known, is a more curated repository of data. It is invaluable for providing business users with access to the right information in a usable format – and can include both current and historical information. 

As data enters the data warehouse environment, it is cleansed, transformed, categorised and tagged. This makes it easier to manage, use and monitor from a compliance perspective, and it is where automation comes in.

The volume and velocity of data that businesses experience today mean that manually ingesting data into a data warehouse, processing it, and making sure it is stored and accessible in a way that meets compliance requirements is no longer feasible.

However, with businesses constantly looking to data as the source of both reports and forecasts, a data warehouse is invaluable. It’s important that data lakes do not subsume the role of a more structured data infrastructure just because of the perceived effort of ingestion. Automation can speed up ingestion and processing, shortening the time to value for data-driven decision-making in a data warehouse.

Data marts

A data mart is a specific subset of a data warehouse, often used for curated data on one subject area that needs to be easily accessible in a short amount of time. Because it is so specific, it is often quicker and cheaper to build than a full data warehouse. However, a data mart cannot curate and manage data from across the business to inform wider business decisions.
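One simple way to picture a data mart as a subset of a warehouse is a view over a warehouse table, restricted to one subject area. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name `warehouse_sales`, the view name `marketing_mart` and the sample rows are all hypothetical, and real data marts are often separate physical stores rather than views.

```python
import sqlite3

# In-memory database standing in for a data warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE warehouse_sales (
        region TEXT, department TEXT, sale_date TEXT, amount REAL
    );
    INSERT INTO warehouse_sales VALUES
        ('EMEA', 'marketing', '2024-01-05', 120.0),
        ('EMEA', 'finance',   '2024-01-06', 300.0),
        ('APAC', 'marketing', '2024-01-07',  80.0);

    -- The "mart": one subject area, only the columns that team needs.
    CREATE VIEW marketing_mart AS
        SELECT region, sale_date, amount
        FROM warehouse_sales
        WHERE department = 'marketing';
    """
)

# Consumers of the mart query it without seeing other departments' data.
mart_totals = conn.execute(
    "SELECT region, SUM(amount) FROM marketing_mart "
    "GROUP BY region ORDER BY region"
).fetchall()
```

The finance row stays in the warehouse but never appears in the mart, which illustrates both why a mart is quick to build and why it cannot answer cross-business questions on its own.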

Data vaults

Data vault modelling is an approach to data warehousing which looks to address some of the challenges posed by transforming data as part of the data warehousing process. One of the great advantages of a data vault is that it makes no assessment as to what data is “valuable” and what isn’t, whereas once data is processed and cleansed into a warehouse environment, this decision has typically been made. 

Data vaults have the flexibility to manage this, and to address changing sources of data, leading the data vault approach to be credited with providing a “single version of the facts” rather than a “single version of the truth.”
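The "single version of the facts" idea can be sketched in a few lines of Python. Data vault models are commonly described in terms of hubs (business keys) and satellites (descriptive attributes recorded over time). Those terms, the MD5 hash key, and every name below (`hub_customer`, `sat_customer`, `load_customer`) are assumptions for illustration rather than anything prescribed by this article. The point of the sketch is that every source's version of a record is appended with its origin and load time, and nothing is overwritten or judged at load time.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(business_key: str) -> str:
    """Derive a stable key from a normalised business key (illustrative)."""
    normalised = business_key.strip().upper()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


hub_customer = {}   # hash key -> business key, one entry per customer
sat_customer = []   # append-only history of descriptive attributes


def load_customer(business_key, attributes, record_source, load_date=None):
    """Record what a source said about a customer, without overwriting."""
    key = hash_key(business_key)
    hub_customer.setdefault(key, business_key)
    sat_customer.append(
        {
            "hub_key": key,
            "load_date": load_date or datetime.now(timezone.utc),
            "record_source": record_source,
            "attributes": dict(attributes),
        }
    )
    return key


def history(business_key):
    """Return every recorded fact for a customer, in load order."""
    key = hash_key(business_key)
    return [row for row in sat_customer if row["hub_key"] == key]


# Two hypothetical sources disagree about the same customer; both are kept.
load_customer("C-001", {"email": "alice@example.com"}, record_source="crm")
load_customer("c-001 ", {"email": "a.smith@example.com"}, record_source="billing")
facts = history("C-001")
```

Deciding which email address is "true" is left to downstream business rules, so a new source can be added later without reworking what has already been loaded.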

For enterprises with large, growing and disparate datasets, a data vault approach to data warehousing can help tame the beast of big data into a manageable, business-centric solution, but can take time to set up. 

Data vault automation is critical to ensuring that organisations can deliver and maintain data vaults that adhere to the stringent requirements of the Data Vault 2.0 methodology, and that they can do so in a practical, cost-effective and timely manner.

Understanding the differences

Having a broad understanding of how each of these data approaches works, and how they fit together, could be invaluable to IT managers and business leaders as they grapple with what is and isn’t possible, now that big data is becoming as much a business prerogative as a technology one.

Finding ways to speed up the establishment and management of these practices using technologies such as automation is essential for helping organisations reduce the time to value and succeed in the data-driven business landscape.

Neil Barton is the Chief Technology Officer of WhereScape, a provider of data infrastructure automation software, where he leads the long-term architecture and technology vision for the company's software products.
