The "big" in big data is enough to make most IT architects worry about added load on an already taxed infrastructure. As enterprises move from experimentation to wide deployment of big data and other clustered applications, the network that underpins them becomes both more critical and more heavily loaded than ever.
IT leaders need to ask themselves one simple question: Is my network ready for big data?
Big data is big, but not how you think
When most people think about big data, they imagine massive applications spanning thousands of nodes in support of the largest web-scale companies. While it is true that these deployments do exist (Yahoo! notably has more than 40,000 Hadoop nodes), the average enterprise big data deployment is actually in the 100 to 150 node range.
So, if the average deployment is relatively small, is scale even an issue?
For most enterprises, scale isn't going to be about one or two big data applications. Today, enterprises already experimenting in this field are really just dipping their toes into the proverbial big data water. The deployments are small because they are more of an experiment than a business-critical application. However, if these initial forays into the space yield business success, expect the addition of other applications to quickly follow.
The likely course this will take is the proliferation of small big data applications, each consuming a few hundred nodes. While most companies will never experience the complexity of a 10,000-node deployment, they will start to experience the aggregate load of a few dozen smaller applications.
The role of bandwidth for big data
The entire premise of big data is to break large workloads into smaller, more consumable chunks. To do this, data has to be replicated to servers in a cluster. Since most big data applications make three copies of every piece of information (two in the rack, one in another rack for resiliency), the load on the network becomes large very quickly.
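To put rough numbers on that claim, here is a back-of-the-envelope sketch. It assumes the first copy lands on the node that receives the data, so only the remaining copies travel over the network; the function name and its parameters are illustrative and not taken from any particular big data platform.

```python
def replication_traffic(ingest_gb, replicas=3, in_rack_copies=2):
    """Estimate network traffic caused by writing ingest_gb of new data.

    Assumption: the first copy is written locally on the receiving node,
    so it generates no network traffic. Every other copy does.
    """
    if not 1 <= in_rack_copies <= replicas:
        raise ValueError("in_rack_copies must be between 1 and replicas")
    cross_rack_copies = replicas - in_rack_copies
    return {
        # Every copy except the local one crosses the network.
        "total_network_gb": ingest_gb * (replicas - 1),
        # Copies sent to other nodes in the same rack.
        "in_rack_gb": ingest_gb * (in_rack_copies - 1),
        # Copies sent to a different rack for resiliency.
        "cross_rack_gb": ingest_gb * cross_rack_copies,
    }


print(replication_traffic(1000))
```

Under these assumptions, writing 1,000 GB of new data puts 2,000 GB on the network, half of it crossing between racks. Multiply that by a few dozen applications and the aggregate load adds up quickly.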
Traditionally, load on a network is handled with a technology called Equal Cost Multi-Pathing (ECMP). Essentially, ECMP distributes flows across a small number of equal-cost paths in the network. So, even though there might be many ways to get from point A to point B, ECMP will select only the shortest paths and load balance across those. For big data flows, this can create problems. When you send a lot of traffic across the same few paths, you can get congestion in the network. Most big data applications deal with congestion by simply resending the request. But during times of congestion, retransmissions only exacerbate the problem.
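A simplified sketch of how ECMP-style flow placement behaves may help here. Real switches use vendor-specific hash functions over packet header fields; the CRC32 hash, the addresses, and the flow sizes below are stand-ins chosen only for illustration.

```python
import zlib


def ecmp_path(flow, num_paths):
    """Pick a path index by hashing the flow's header fields.

    CRC32 is a stand-in for whatever hash a real switch uses.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_paths


def path_loads(flows, num_paths):
    """Sum the traffic (in arbitrary units) that lands on each path."""
    loads = [0] * num_paths
    for flow, size in flows:
        loads[ecmp_path(flow, num_paths)] += size
    return loads


# Hypothetical flows: (src IP, dst IP, src port, dst port, protocol), size.
FLOWS = [
    (("10.0.1.11", "10.0.2.21", 41000, 50010, "tcp"), 40),
    (("10.0.1.12", "10.0.2.22", 41001, 50010, "tcp"), 35),
    (("10.0.1.13", "10.0.2.23", 41002, 50010, "tcp"), 30),
    (("10.0.1.14", "10.0.2.24", 41003, 50010, "tcp"), 5),
]

print(path_loads(FLOWS, num_paths=2))
```

The hash looks only at header fields, never at flow size or current link utilization, so nothing prevents several large replication flows from landing on the same path while another sits nearly idle.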
The hottest trend in networking is a technology called software-defined networking (SDN). SDN's core architectural tenet is the separation of control and forwarding. By creating a central control point, SDN is able to look at the network in its entirety. This makes it possible to forward traffic along longer but less congested paths. It could be that the adoption of non-equal-cost multi-pathing is one key to successfully scaling infrastructure for big data.
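The difference between the two approaches can be sketched in a few lines. This is a toy model rather than the behavior of any particular SDN controller: the topology, the utilization figures, and the selection rule (prefer the path whose busiest link is least utilized) are all assumptions made for illustration.

```python
def bottleneck(path, utilization):
    """Utilization (0.0 to 1.0) of the busiest link along a path."""
    return max(utilization[link] for link in zip(path, path[1:]))


def equal_cost_candidates(paths):
    """Paths an ECMP-style scheme would consider: only the shortest ones."""
    fewest_hops = min(len(path) for path in paths)
    return [path for path in paths if len(path) == fewest_hops]


def pick_least_congested(paths, utilization):
    """Choose among all paths using a global view of link utilization.

    Ties on congestion are broken in favor of the shorter path.
    """
    return min(paths, key=lambda p: (bottleneck(p, utilization), len(p)))


# Hypothetical topology: three ways to get from A to B.
PATHS = [
    ["A", "S1", "B"],
    ["A", "S2", "B"],
    ["A", "S3", "S4", "B"],
]
UTILIZATION = {
    ("A", "S1"): 0.90, ("S1", "B"): 0.90,
    ("A", "S2"): 0.85, ("S2", "B"): 0.80,
    ("A", "S3"): 0.20, ("S3", "S4"): 0.30, ("S4", "B"): 0.25,
}

print(equal_cost_candidates(PATHS))
print(pick_least_congested(PATHS, UTILIZATION))
```

In this example the two shortest paths are both heavily loaded, while the path through S3 and S4 costs one extra hop and is mostly idle. A selector with a view of the whole network can choose it; a shortest-path-only scheme never considers it.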
More than bandwidth
While SDN can help alleviate bandwidth issues by using more of the available paths in the network, scaling big data is not only about bandwidth. If the growth of big data in enterprise datacenters involves multiple applications, the more pressing scaling concern is how the network can account for different applications with different requirements.
Most networks today are built to be agnostic to the applications running on them. That means the network is designed to be general purpose, treating all applications in roughly the same way.
But not all big data applications are the same. Some are very bandwidth heavy (as with data backups). Others are more latency sensitive (like recommendation engines in AdTech). Others are sensitive to jitter or loss. And still others have strict compliance requirements (PCI or HIPAA). The point here is that it is impossible for a single network to treat these applications differently if that network is not at least somewhat application aware.
SDN has the potential to support application requirements via abstract policy expression. In other words, users can define an application and attribute to it the things that are most important. If bandwidth is important, the controller can dynamically create high-capacity links when necessary. If latency is important, the controller can help ensure the shortest possible path is always used. If isolating traffic for compliance reasons is critical, the controller can create tunnels.
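As a rough sketch of what such policy expression might look like, consider the following. The class names, requirement labels, and action strings are hypothetical and do not correspond to any real controller's API; the point is only that the user states what matters and the controller derives what to do.

```python
from dataclasses import dataclass
from enum import Enum


class Requirement(Enum):
    BANDWIDTH = "bandwidth"
    LATENCY = "latency"
    ISOLATION = "isolation"


@dataclass(frozen=True)
class AppPolicy:
    """What an application cares about, stated without network detail."""
    name: str
    requirements: frozenset


# Hypothetical mapping from a stated requirement to a controller action.
ACTIONS = {
    Requirement.BANDWIDTH: "create high-capacity links when necessary",
    Requirement.LATENCY: "keep traffic on the shortest possible path",
    Requirement.ISOLATION: "carry traffic in a dedicated tunnel",
}


def plan(policy):
    """Translate a policy into actions, in a fixed, predictable order."""
    return [ACTIONS[req] for req in Requirement if req in policy.requirements]


backup = AppPolicy("nightly-backup", frozenset({Requirement.BANDWIDTH}))
payments = AppPolicy(
    "payments-analytics",
    frozenset({Requirement.LATENCY, Requirement.ISOLATION}),
)

for app in (backup, payments):
    print(app.name, "->", plan(app))
```

The application owner never mentions links, paths, or tunnels. Those remain the controller's concern, which is what allows one network to treat different applications differently.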
Enterprise IT is changing dramatically, led by applications like big data. Fortunately, advances in the underpinning infrastructure should offer relief for enterprises looking to take advantage. However, IT architects will need to plot their infrastructure courses carefully and deliberately to ensure that the underlying infrastructure can support the applications they want to run.
- Michael Bushong, VP of Marketing at Plexxi