The Perils & Pitfalls of Distributed Computing: How to Get Your Operations Back on Track

September 12, 2016
Dez Blanchfield


Image credit: flickr @x-ray_delta_one

Pepperdata in the Tech Lab – Hands on With Powerful Distributed Computing Management Tools

blog by Dez Blanchfield, Chief Data Scientist, The Bloor Group

Pepperdata Tech Lab – Part 1



If you were to believe everything you read in the media about the current state of Big Data platforms, you could be forgiven for thinking that implementing the infrastructure and platform for a Big Data project was a simple plug-and-play exercise.

But the reality is that Big Data, and in particular the platforms and infrastructure needed to support any modern Big Data initiative, is not that simple. Big Data is not a matter of throwing infrastructure, a Hadoop cluster, and some data into a mixing bowl and magically producing an off-the-shelf solution.

Despite extraordinary efforts by some of the world’s best minds, running something as complex as an open-source distributed computing platform like Hadoop can be far more difficult, time-consuming, and costly than anticipated – especially if you end up with conflicting users and workloads.

Anyone who has been down this road has invariably learned, often the hard way, that Hadoop is hard. Hadoop clusters don’t build themselves, and they are certainly not a simple matter of plug-and-play, even for small clusters.

Pepperdata found that they could best communicate the value of their solution by challenging our Tech Lab team to spin up the Pepperdata platform on a small Hadoop cluster of 12 nodes, prove the value of their platform and tools, and share that experience with the world through a series of blogs and webinars. Our answer was “you had us at ‘hello’”.

Our challenge? Simple:

“Demonstrate the business and technical challenge of running a Hadoop cluster, highlight the usual pitfalls, prove in the Tech Lab how the Pepperdata solution addresses those business and technical challenges, and share our findings in the open and transparent manner The Bloor Group has built its brand on over the last decade.”

Why is Hadoop so hard? In part, it is because the Hadoop platform and its surrounding ecosystem are being used in ways that were not envisioned during its early design. And once a cluster is deployed, keeping it running smoothly is no less a challenge than running any IT system with many moving parts.

Hadoop as an ecosystem has indeed matured and grown substantially in the decade since it grew out of the open-source Nutch search engine project, but at its core the basic elements are not so dissimilar to the original problem Nutch solved: the large-scale collection, storage, and processing of structured and unstructured data.

As with any distributed computing system, Hadoop’s complexity renders manual tuning largely ineffective. And as the environment gets more complex (multiple tenants, mixed workloads, and so on), manual efforts to manage and control cluster activity only become less practical.

The key challenges and issues any distributed computing solution faces as it scales, especially a Hadoop cluster, are really no different from the core issues faced by any large scale high performance computing project. The larger the project, the more complex the environments, and the more likely you are to have to deal with competing users or user groups, and overlapping workloads.

Software tools, both open-source and proprietary, exist to help improve performance, but they are often suggestion-based, still relying on Hadoop operators to make manual configuration changes. Automated adjustments at the speed of compute are the only answer.
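To make the manual-tuning loop concrete, here is a sketch of the kind of YARN memory settings in yarn-site.xml that operators commonly adjust by hand; the property names are standard Hadoop/YARN settings, but the values shown are illustrative placeholders only, not recommendations:

```xml
<!-- yarn-site.xml: example memory settings an operator might tune by hand.
     Values below are illustrative placeholders, not recommendations. -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>24576</value> <!-- RAM each NodeManager offers to containers -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>  <!-- smallest container YARN will allocate -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>  <!-- largest single container request -->
  </property>
</configuration>
```

Each change like this typically means editing files, restarting services, and watching the cluster again – exactly the loop that becomes impractical as tenants and workloads multiply, and the reason real-time automated adjustment matters.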

In The Bloor Group’s inaugural 2016 Tech Lab Webcast I had the opportunity to don my “Hadoop evangelist” and “Data Scientist” hats, sharing my own journey through the evolution of high-performance and distributed computing, and discussing some of my own experiences around why automation is the key to making clusters hum.

I had the pleasure of co-presenting with Kirk Lewis of Pepperdata, one of the world’s leading experts in distributed computing on the Hadoop platform. It was truly an honor to discuss the topic in depth with Pepperdata, and to have Kirk represent the company’s views and experience around why distributed computing brings its own set of unique challenges.

In this first webcast with Pepperdata in the Tech Lab, I shared some of my real-world success stories of optimizing Hadoop in real time. We also had the opportunity to take questions from our audience and provide answers, all of which spoke directly to the topic at hand.

This first event was a strong kickoff for the Pepperdata Tech Lab series of interactive webcasts. Join me for Webinar 2, in which I take Pepperdata into the Lab and test it against all manner of workloads.

If you missed the live episode of this webinar, don’t panic – we record them for you to enjoy in your own time.

Pepperdata Tech Lab Webinar 1: The Perils & Pitfalls of Distributed Computing – How to Get Your Operations Back on Track

For more information about our sponsor partner Pepperdata, please visit their website:

