What is AWS Glue?


For some people, managing data is quite literally a full-time job. At a larger company especially, there may be requests to run an analytics report, move data from one repository to another, or even create “clean data” for an important new web application. Cloud computing services offer a great deal of flexibility in how you store, prepare, and report on data, and there are quite a few tools available to help, especially for Amazon Web Services (AWS).

AWS Glue is one of those data and cloud storage management tools. It’s a managed ETL service, which means it is used to Extract, Transform, and Load data in preparation for reporting and analytics. AWS Glue also includes a data catalog that stores metadata in a central repository. It automates ETL: you point AWS Glue at the data stored within AWS, and the data becomes searchable and queryable for any of the reporting and cloud analytics you need.

It’s helpful to understand ETL before diving into AWS Glue and the benefits of using it. ETL is how data management employees at a company blend data so that it can be queried. There might be multiple data stores and multiple cloud databases, but ETL readies the data without requiring you to relocate the data stores themselves. ETL essentially preps the data for analytics and reporting. The alternative is to move the data by hand, isolate it, and then run queries against it before any analytics or reporting can begin.
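The three ETL steps described above can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not of how AWS Glue itself works: the sample data, the table layout, and every function name are hypothetical, and an in-memory SQLite database stands in for the query-ready target.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample data standing in for two separate data stores.
ORDERS_CSV = "order_id,customer_id,amount\n1,c1,19.99\n2,c2,5.00\n3,c1,12.50\n"
CUSTOMERS_JSON = '[{"id": "c1", "name": " Ada "}, {"id": "c2", "name": "grace"}]'


def extract():
    """Extract: read raw records from each source."""
    orders = list(csv.DictReader(io.StringIO(ORDERS_CSV)))
    customers = json.loads(CUSTOMERS_JSON)
    return orders, customers


def transform(orders, customers):
    """Transform: clean names, cast types, and blend the two sources."""
    names = {c["id"]: c["name"].strip().title() for c in customers}
    return [
        (int(o["order_id"]), names[o["customer_id"]], float(o["amount"]))
        for o in orders
    ]


def load(rows, conn):
    """Load: write the cleaned rows into a table that is ready to query."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run all three steps and return a connection to the loaded data."""
    conn = sqlite3.connect(":memory:")
    orders, customers = extract()
    load(transform(orders, customers), conn)
    return conn


if __name__ == "__main__":
    conn = run_etl()
    query = (
        "SELECT customer, ROUND(SUM(amount), 2) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    for customer, total in conn.execute(query):
        print(customer, total)
```

Once the load step has run, the reporting query at the bottom never touches the raw CSV or JSON again, which is the point of prepping the data up front.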

AWS Glue generates ETL code in Scala or Python. Once the catalog has been populated, you can search and query the data using cloud computing tools such as Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, all designed to help companies store and use data in applications. AWS Glue can also connect to data stores inside an Amazon Virtual Private Cloud (Amazon VPC), including databases running on Amazon EC2.

To understand what AWS Glue is, it’s helpful to understand how it works. For starters, data management employees, developers, and data scientists can use the AWS Management Console to register data sources. AWS Glue then crawls those sources and populates the catalog, using classifiers to recognize formats such as JSON, CSV, and Parquet. Next, employees select a source for the ETL job and generate the code needed for reporting and analytics. Finally, AWS Glue can run the job on a recurring schedule or in response to an event, such as a trigger from AWS Lambda.
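The registration, crawling, and scheduling steps described above can also be scripted rather than clicked through in the console. The following is a minimal sketch using the boto3 Glue client's create_crawler and start_crawler calls. The crawler name, IAM role, database name, and S3 path are hypothetical placeholders, and boto3 is imported lazily so that the request can be built and inspected without an AWS account.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the boto3 Glue client's create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue schedules use cron syntax, e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily.
        request["Schedule"] = schedule
    return request


def register_and_crawl(request):
    """Create the crawler and start a first run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    request = build_crawler_request(
        name="sales-crawler",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        database="sales_catalog",
        s3_path="s3://example-bucket/sales/",
        schedule="cron(0 2 * * ? *)",
    )
    print(request)
```

Running the script only prints the request. Calling register_and_crawl(request) would send it to AWS, which requires valid credentials and an IAM role that AWS Glue is allowed to assume.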

AWS Glue benefits

The main advantage of AWS Glue is flexibility. Many companies now use a data lake that contains a wealth of structured and unstructured data. In the past, companies were forced to move the data into a new repository, manage it endlessly, and worry about the servers and infrastructure needed for their apps. Talk about a full-time job! That was a complicated period in the history of information technology, all prior to the cloud.

With AWS Glue, there’s no need for an on-premises server (it is serverless and runs as a managed ETL service), your own data center, your own local data stores, or a dedicated employee who manages the data. Instead, AWS Glue is the glue that ties together disparate data and makes it ready and available for queries.

AWS Glue is also highly automated. It can crawl disparate data sources, identify the formats, and suggest schemas and transformations for the data. It can then generate the code you need for any data queries, transformations, or processes.
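To make the idea of crawling sources and identifying formats more concrete, here is a toy sketch in plain Python. It is not how AWS Glue's classifiers are implemented. The classify and crawl functions and the catalog layout are hypothetical, and the format checks are deliberately simple heuristics applied to small byte samples.

```python
import csv
import io
import json


def classify(sample: bytes) -> str:
    """Guess the format of a data sample, trying one check after another."""
    if sample[:4] == b"PAR1":  # Parquet files begin with these magic bytes
        return "parquet"
    try:
        text = sample.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    try:
        if isinstance(json.loads(text), (dict, list)):
            return "json"
    except ValueError:
        pass
    rows = list(csv.reader(io.StringIO(text)))
    widths = {len(row) for row in rows if row}
    # Treat the sample as CSV if every row has the same number of columns (> 1).
    if len(rows) >= 2 and len(widths) == 1 and widths.pop() > 1:
        return "csv"
    return "unknown"


def crawl(sources: dict) -> dict:
    """Build a toy catalog mapping each source name to its format and columns."""
    catalog = {}
    for name, sample in sources.items():
        fmt = classify(sample)
        columns = []
        if fmt == "csv":
            # Assume the first row is a header.
            columns = next(csv.reader(io.StringIO(sample.decode("utf-8"))))
        elif fmt == "json":
            record = json.loads(sample)
            if isinstance(record, list) and record:
                record = record[0]
            if isinstance(record, dict):
                columns = list(record)
        catalog[name] = {"format": fmt, "columns": columns}
    return catalog


if __name__ == "__main__":
    sources = {
        "orders": b"order_id,amount\n1,19.99\n2,5.00\n",
        "customers": b'[{"id": "c1", "name": "Ada"}]',
        # Fake bytes that only mimic a Parquet header, not a real file.
        "events": b"PAR1" + b"\x00" * 8,
    }
    for name, entry in crawl(sources).items():
        print(name, entry)
```

The resulting dictionary plays the role of a metadata catalog: it records what each source looks like without copying or moving the underlying data.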

An important distinction to make here is that AWS Glue does all of its ETL processing in the cloud. That means employees don’t have to do the data management and prep work that ETL often requires, such as managing endpoint security, configuring the data beforehand, moving the data to the right repository, configuring data stores, managing storage, and configuring servers.

AWS Glue removes much of the headache involved with preparing data for analysis. This chore of making structured or unstructured data ready for queries is known in the industry as “heavy lifting.” With AWS Glue, you no longer have to do it by hand. All of the discovery, cleansing, enriching, and moving of the data occurs behind the scenes as part of the ETL, making everything much easier for IT service management.

Because the cloud is so flexible, and there are so many different data stores, web applications, and business needs for reporting and analytics, AWS Glue helps bring some sanity to the data exploration process, without requiring any of the back-end work first. It’s powerful in that it saves time and effort, and the queries can be repeatable and automated.

John Brandon
Contributor
