What is a data lake? Everything you need to know

When it comes to cloud computing, the terms we use are almost as important as the data we store and analyze. Companies that communicate about how cloud computing data is stored, retrieved, accessed and archived tend to maximize the use of that data. This leads to better products, higher revenue for the company, and more growth. More than anything, it leads to better communication between business units, the Information Technology department, and even the front office, sales, marketing, customers and business partners.

One of the terms that came into wide use over the last few years is data lake. Before the rise of cloud computing, and even before the Internet was widely used as a means of transmitting data, IT experts used the term data warehouse, but it wasn’t quite sufficient. As the name implies (a “warehouse” is highly organized), a data warehouse consists of data that a company processes, analyzes, and reuses as part of its cloud storage management. For a retailer, a data warehouse might contain all of the product information, SKUs (stock keeping units), and prices. A data warehouse is typically optimized for fast, reliable access.

A data lake is not so highly organized. Cloud computing experts started using the term data lake to distinguish a store that holds both structured and unstructured data from a data warehouse. With a data lake, there is no assumption that the data has been optimized.

Yet, there are clear advantages. A data lake can contain a wide assortment of data, but companies can still run cloud analytics on the data, operate a business dashboard, and use the data in an app or for other processing duties. While it is a catch-all term that can cover massive data stores that are highly scalable and useful for multiple purposes, a data lake is also a generic way of describing unorganized and organized data.
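The contrast between the two can be sketched in a few lines of Python. This is only an illustration, not how any particular product stores data: the column names, object keys, and sample records are all invented for the example. The warehouse holds rows that follow one fixed layout, while the lake holds raw objects in whatever format they arrived.

```python
# Hypothetical warehouse table: every row follows one fixed schema.
WAREHOUSE_COLUMNS = ("sku", "name", "price")
warehouse_rows = [
    ("A100", "Desk lamp", 24.99),
    ("A200", "Office chair", 129.00),
]

# Hypothetical data lake: raw objects kept as they arrived, in any format.
data_lake = [
    {"key": "products/a100.json",
     "body": '{"sku": "A100", "name": "Desk lamp", "price": 24.99}'},
    {"key": "exports/prices.csv", "body": "sku,price\nA200,129.00\n"},
    {"key": "reviews/2041.txt",
     "body": "Love the A100 lamp, but shipping was slow."},
]


def warehouse_lookup(sku):
    """Return the row for a SKU as a dict; the schema is known in advance."""
    for row in warehouse_rows:
        if row[0] == sku:
            return dict(zip(WAREHOUSE_COLUMNS, row))
    return None


def lake_formats(lake):
    """List the file extensions present in the lake; none is enforced."""
    return sorted({obj["key"].rsplit(".", 1)[-1] for obj in lake})
```

The warehouse lookup is fast and predictable because the layout was decided up front. The lake accepts anything, so any structure has to be worked out later by whoever reads the data.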

Key components

In order to understand a data lake and how it helps companies access cloud computing information in a way that does not require optimization or re-structuring of the data, it’s also important to understand the key components. A data lake often involves machine learning, which is a way to understand and process data using automated methods.

In the case of a retailer who needs to access product information, machine learning can determine which SKUs are stored in a data lake and pull that data into an app. Information Technology service management personnel do not need to organize the data first.
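As a rough sketch of that retailer scenario, the Python below scans a handful of raw objects and reports which ones mention each SKU. To be clear about the assumptions: a simple regular expression stands in for the machine learning step, and the SKU format (one uppercase letter followed by three digits), the object keys, and the contents are all made up for illustration.

```python
import re

# Assumed SKU format for illustration: one uppercase letter plus three digits.
SKU_PATTERN = re.compile(r"\b[A-Z]\d{3}\b")

# Hypothetical raw objects, stored without any prior organization.
raw_objects = {
    "products/a100.json": '{"sku": "A100", "name": "Desk lamp"}',
    "exports/prices.csv": "sku,price\nA200,129.00\nA100,24.99\n",
    "reviews/2041.txt": "Love the A100 lamp, but shipping was slow.",
    "logs/app.log": "GET /health 200",
}


def find_skus(objects):
    """Map each SKU to the sorted list of object keys that mention it."""
    index = {}
    for key, body in objects.items():
        # set() avoids listing the same object twice for one SKU.
        for sku in set(SKU_PATTERN.findall(body)):
            index.setdefault(sku, []).append(key)
    return {sku: sorted(keys) for sku, keys in index.items()}
```

Nothing had to be reorganized before the scan: the JSON file, the CSV export, and the free-text review are read exactly as they were stored, and the log file is simply ignored because it contains no SKUs.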

Another key component is analytics. With most structured business data, it’s important to have a database whereby IT professionals can generate reports, run SQL queries, or make use of the data in a logical, predictable way. Think of the typical health-care company that needs to have structured data available to medical staff in order to run analytics and reporting -- it typically has to be in a centralized cloud database and optimized for use (e.g., stored in a data warehouse). However, companies can still run analytics on a data lake without having to first optimize the data, and that is one of the key advantages. In fact, as machine learning and data optimization improve, a data lake of structured and unstructured data becomes even more valuable.
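To make the idea of running analytics without optimizing the data first more concrete, here is a minimal Python sketch. It assumes a hypothetical lake holding order records in two different formats; the file names, field names, and figures are invented. Each object is parsed only at the moment it is read, and the totals are computed across both.

```python
import csv
import io
import json

# Hypothetical raw order data, left in the formats it arrived in.
raw_objects = {
    "orders/2024-01.csv": "sku,quantity\nA100,2\nA200,1\n",
    "orders/2024-02.jsonl": (
        '{"sku": "A100", "quantity": 3}\n'
        '{"sku": "A300", "quantity": 5}\n'
    ),
}


def read_records(key, body):
    """Parse one raw object into dicts, choosing a parser from its extension."""
    if key.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(body)))
    if key.endswith(".jsonl"):
        return [json.loads(line) for line in body.splitlines() if line.strip()]
    return []  # Unknown formats are skipped rather than rejected.


def units_sold_by_sku(objects):
    """Total quantity per SKU across every readable object."""
    totals = {}
    for key, body in objects.items():
        for record in read_records(key, body):
            sku = record["sku"]
            # CSV values arrive as strings, so normalize to int here.
            totals[sku] = totals.get(sku, 0) + int(record["quantity"])
    return totals
```

In a data warehouse, the conversion to a common layout would happen before the data is stored. Here it happens inside the query itself, which is the trade-off a data lake makes: less preparation up front, more work at read time.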

One last component of a data lake: It is not always assumed that the data will be used in the cloud. While a data warehouse might be optimized for on-premise use or in the cloud, a data lake can involve moving data for on-premise use in an internal app (one that pulls data from your own servers) or can be used externally (using online cloud storage and computing data stores).

How the company benefits

One of the keys to understanding the term data lake is to think about how companies access data in the first place. It is not quite as “clean” as you would think. Sometimes, data arrives in a haphazard fashion (called unstructured data) and it’s dumped out to a repository; companies don’t always know the original source of the data. Sometimes, it’s stored in a relational database used for a business app; sometimes it’s a collection of social media data or something that feeds a mobile app used by external customers. The main point to make here is that a data lake provides increased flexibility over how a company can use the data.

So, while a data warehouse is a more structured and optimized way of hosting data in the cloud, and is meant for a specific purpose, a data lake is flexible enough for multiple purposes. There’s no need to first create a clear and obvious usage model for the data and to house it in a specific way in a database. It is always available, can be used for multiple purposes and disparate apps, and is intended for on-premise processing on your own servers or access from the cloud. It’s ready for anything.