What is Amazon EMR?

For any business that needs to analyze reams of data, there are several complex infrastructure challenges to consider. Massive amounts of data require massive storage, server performance has to be optimal, and there’s an array of networking and security concerns.

Fortunately, Amazon EMR (also known as Amazon Elastic MapReduce) is a service that can help with Big Data analysis needs for companies of all sizes. More than just about any other Amazon service, EMR is closely linked to other platforms that help with Big Data analytics, including Amazon EC2 (Elastic Compute Cloud) for renting virtual servers and Amazon Simple Storage Service (S3) for object storage. These products all work in tandem to give companies the infrastructure and platform they need to run Big Data projects.

Imagine the alternative. For a company that is conducting genomic research, analyzing traffic data for a city, or building a vast machine learning initiative to analyze business data across a large organization, the infrastructure would have to be deployed in a way that can handle all of that Big Data analytics: the servers, the online storage, the networking, and the security needed to deploy the frameworks that run the project.

Instead, EMR runs as a cloud computing service for deploying those frameworks without the related on-premises infrastructure management and deployment. With the ability to deploy Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto as services in the cloud, companies can then rely on EC2 and S3 to provide the autoscaling they need as the Big Data project evolves and gathers more and more data.

Infrastructure is the real headache most companies face when analyzing a large amount of data, and it often arrives as two simultaneous headaches. First, there are the project and business requirements (meaning, the reason the project is being conducted in the first place), the coding needed to make it all happen, the reporting and analysis deliverables, and all of the related project variables. That’s complex enough. Second, companies then also have to build the infrastructure required to handle a project of that magnitude. Known as petabyte-scale computing, this is a double burden because the data scientists and programmers developing the actual Big Data project are often not experts in IT infrastructure.

Like many Amazon Web Services offerings, Amazon EMR runs as a service that you manage remotely and that auto-scales to meet your needs, so there is little to no management involved.

Benefits of Amazon EMR

Some of the most important benefits of EMR relate to cost and reduced complexity. On cost, as mentioned previously, there is no need to build your own clusters in an on-premises data center. A compute cluster that runs on EMR can cost as little as 15 cents per hour for 10 nodes. Companies pay per instance for the time the cluster actually runs, which means they are not paying for infrastructure that sits idle. There is a minimum charge of only one minute, and companies can scale up from there, adding nodes as they analyze more and more data.
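
To make the pricing arithmetic concrete, here is a minimal Python sketch that estimates what a cluster run might cost. It uses the figures quoted above (15 cents per hour for a 10-node cluster and a one-minute minimum) and assumes that usage beyond the first minute is billed by the second. The function name and the rate are illustrative only; actual prices vary by instance type and region.

```python
from decimal import Decimal

# One-minute minimum charge, as described in the article.
MIN_BILLED_SECONDS = 60


def estimate_cluster_cost(cluster_rate_per_hour, runtime_seconds):
    """Estimate the cost of one cluster run.

    cluster_rate_per_hour: hourly rate for the whole cluster, as a string
        such as "0.15" (the article's example figure for 10 nodes).
    runtime_seconds: how long the cluster ran, in seconds.

    Assumption: time beyond the one-minute minimum is billed per second.
    """
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    return Decimal(cluster_rate_per_hour) * billed_seconds / 3600


# A 10-second test run is still billed as one full minute.
short_run = estimate_cluster_cost("0.15", 10)

# An eight-hour analysis job on the same cluster.
long_run = estimate_cluster_cost("0.15", 8 * 3600)

print(f"Short run: ${short_run}, eight-hour run: ${long_run}")
```

The sketch uses Decimal rather than floating point so that small currency amounts are not distorted by rounding.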

That scaling is important because it means companies don’t have to make plans to retrofit an existing infrastructure. As Big Data needs change and evolve to meet business requirements, companies can add dozens or even thousands of additional clusters and nodes. An example of this might be a pharmaceutical company that decides to analyze genomic data for drug discovery. The company starts with one product line and one project meant for genomic discovery, but then adds more projects on more clusters to aid in the drug discovery.

As for reducing complexity, it’s possible to launch and configure an EMR cluster in a few minutes. There is no hardware to provision and very little manual setup, which is remarkable considering what is normally required to configure a Big Data cluster running on an Apache Hadoop framework. This means data engineers, data scientists, and even non-programmers in a company can start using EMR without knowing about back-end infrastructure management.
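
As a rough illustration of how little configuration a cluster needs, the Python sketch below builds a cluster description as a plain dictionary. The keys follow the shape that the AWS SDK for Python (boto3) is understood to accept in its EMR `run_job_flow` call, but the helper name, release label, instance type, and role names are all assumptions chosen for illustration, and the code never contacts AWS. Check the current AWS documentation before using anything like this for real.

```python
def build_cluster_request(name, applications, core_nodes,
                          instance_type="m5.xlarge",
                          release_label="emr-6.15.0"):
    """Build a cluster description as a plain dict (hypothetical helper).

    Nothing here talks to AWS; the defaults are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Frameworks to install, for example Spark or Hive.
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": instance_type,
                    "InstanceCount": core_nodes,
                },
            ],
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed default role names; an account may use different ones.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("genomics-demo", ["Spark", "Hive"], core_nodes=9)
print(request["Applications"])
```

With credentials configured, a dictionary like this could in principle be passed to boto3 as `boto3.client("emr").run_job_flow(**request)`. Growing the cluster later is then mostly a matter of changing the node count rather than buying hardware.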

A final word about security: While data breaches are increasingly common and hard to predict or avoid, a benefit of EMR is that many security features are built into the service itself, including server-side encryption, virtual private cloud access, and firewall configuration. Much of this protection happens “behind the scenes” and is available even for the most basic clusters.

In the end, Amazon EMR is intended to help companies focus on what they do best: building the actual project and launching their reporting and analysis tools in the cloud, without the inevitable infrastructure problems related to scaling up a project. Companies can spend more time on the actual deliverables for the analysis and reporting they need, not the back-end.

John Brandon
Contributor

John Brandon has covered gadgets and cars for the past 12 years having published over 12,000 articles and tested nearly 8,000 products. He's nothing if not prolific. Before starting his writing career, he led an Information Design practice at a large consumer electronics retailer in the US. His hobbies include deep sea exploration, complaining about the weather, and engineering a vast multiverse conspiracy.