wwHadoop: Analytics without Borders

This week at EMC World 2012, the EMC technical community is launching a community program called World Wide Hadoop. I am really excited to be part of this collaboration across the EMC technical community, which has been looking to extend the “borders” of our Big Data portfolio. Building on the success of our Greenplum Hadoop distribution, it offers the open source community the ability to federate their HBase analytics across a distributed set of Hadoop clusters.

In the past month, the EMC Distinguished Engineer community has been collaborating with our St. Petersburg [RUS] Center of Excellence to demonstrate the ability to distribute analytic jobs (moving the code vs. moving the data) across multiple [potentially] geographically dispersed clusters, manage those jobs, and join the results.
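To make “move the code vs. moving the data” concrete, here is a minimal sketch in Python. It is not the wwHadoop implementation. The site names, the records, and the `run_federated` helper are all hypothetical, and the “clusters” are just in-memory lists standing in for remote sites. The point is only the shape: the same small function runs next to each data set, and only the compact partial results are brought together.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-ins for geographically separate clusters.
# Each site keeps its own records; nothing here is a real wwHadoop API.
CLUSTERS: Dict[str, List[str]] = {
    "site_a": ["error", "ok", "ok"],
    "site_b": ["ok", "error", "error"],
    "site_c": ["ok"],
}


def count_events(records: Iterable[str]) -> Counter:
    """The 'code' that travels: runs beside the data, returns a small summary."""
    return Counter(records)


def run_federated(job: Callable[[Iterable[str]], Counter]) -> Counter:
    """Run the job at every site and merge the partial results centrally."""
    total: Counter = Counter()
    for site, records in CLUSTERS.items():
        partial = job(records)  # assumed to execute remotely in a real system
        total.update(partial)   # only the small partial result is combined
    return total


result = run_federated(count_events)
print(dict(result))
```

In a real deployment the loop body would be a remote job submission plus result retrieval; the merge step is where the per-cluster answers are joined.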

The big problem that we are addressing is that of Reed’s law: the combinatorial value of networked resources, which in our case are information sets.

“[E]ven Metcalfe’s law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2^n. So the value of a GFN increases exponentially, in proportion to 2^n. I call that Reed’s Law. And its implications are profound.”
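A quick way to feel the difference is to count. The sketch below compares the number of two-member connections (the Metcalfe-style count) with the number of possible groups of two or more members. The function names are mine, and the exact group count used here is 2^n - n - 1 (all subsets minus the empty set and the n single-member sets), which grows in proportion to 2^n as the quote says.

```python
def metcalfe_pairs(n: int) -> int:
    """Number of distinct two-member connections among n members."""
    return n * (n - 1) // 2


def reed_groups(n: int) -> int:
    """Number of possible groups of two or more members.

    All 2**n subsets, minus the empty set and the n single-member sets.
    """
    return 2**n - n - 1


for n in (5, 10, 20, 30):
    print(f"n={n:>2}  pairs={metcalfe_pairs(n):>4}  groups={reed_groups(n):>13,}")
```

With only ten information sets there are 45 possible pairings but 1,013 possible groupings, which is why leaving even a few sets out of an analysis removes a disproportionate share of the possible combinations.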

A few of our Big Data Challenges

  • Valuable information is produced across geographically diverse locations
  • The data has become too big to move [thus we need to process in place]
  • Scientists and Analysts have begun to move partial sets vs. full corpora to try to save time
  • But this partial data can, and often does, create inadvertent variance or occlusions in correlation and value
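The last point can be shown with a toy example. The numbers below are invented purely for illustration, and `pearson` is a small helper written here rather than a library call. Within one site the two measurements move together perfectly, but once the second site's data is included the overall relationship reverses.

```python
from math import sqrt
from typing import List, Tuple

Point = Tuple[float, float]


def pearson(points: List[Point]) -> float:
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    var_y = sum((y - mean_y) ** 2 for _, y in points)
    return cov / sqrt(var_x * var_y)


# Invented data held at two hypothetical sites.
site_a = [(1, 10), (2, 11), (3, 12)]
site_b = [(11, 0), (12, 1), (13, 2)]

r_partial = pearson(site_a)          # analysis on the partial set only
r_full = pearson(site_a + site_b)    # analysis on the full corpus

print(f"partial: {r_partial:+.3f}  full: {r_full:+.3f}")
```

An analyst who only moved the first site's data would conclude the opposite of what the full corpus shows.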

EMC is demonstrating a working distributed cluster model for analytics across multiple clusters.

We want to work on this with the open community. We believe that there is tremendous value in enabling the community to both derive and add value with EMC in this space, and we invite all to join us at http://www.wwhadoop.com.

Big Data Exchanges: Of Shopping Malls and the Law of Gravity

Working for a storage systems company, we are constantly looking at both the technical and the social/marketplace challenges to our business strategy. Ever since the work that led to the coining of “Cloud Meets Big Data” last year, EMC has been looking at the trends that “should” tip the balance toward real “Cloud Information Management”, as opposed to the “data management” that really dominates today’s practice.

There are a couple of truisms [incomplete list]:

  1. Big Data is Hard to Move = get optimal [geo] location right the first time
  2. Corollary = Move the Function, across Federated Data
  3. Data Analytics are Context Sensitive = meta-data helps to align/select contexts for relevancy
  4. Many Facts are Relative to context = Declare contexts of derived insight (provenance & Scientific Method)
  5. Data is Multi-Latency & needs Deterministic support for temporality = key declarative information architectural requirement
  6. Completeness of Information for Purpose (e.g. making a decision) = dependent on stuff I have and stuff I get from others, which together must be everything that I need to decide.

I believe that 1) and 6) above point to an emerging need for Big Data Communities to arise, supporting the requirements of the others, whether we talk about these as communities of interest or as Big Data Clouds. There are some very interesting analogies that I see in the way we humans act; namely, the shopping mall. Common wisdom points to the mall as providing improved shopping efficiency, but also, in the case of inward-facing malls, a controlled environment (think walled garden). I think that both efficiency, in the form of “one stop”, and control are critical enablers in the information landscape.

[Figure: Big Data Mall slide]

This slide from one of my presentations shows the similarities between building a shopping mall and developing a big data community: things like understanding the demographics of the community (information needs, key values) and planning the roads to get in and out. And, of course, how to create critical mass = the anchor store.

The interesting thing about critical mass is that it tends to have a centricity around a key [Gravitational] Force. Remember:

Force = Mass * Acceleration (change in velocity).

This means that in order to create communities and maximize force you need Mass [size/scope/scale of information] and improving Velocity [timeliness of information]. In terms of mass, truism #1 above and the sheer cost and limited availability of bandwidth make moving 100TB of data hard, and petabytes impracticable. Similarly, change in velocity does matter: whether you are trading algorithmically on the street (you have to be in Ft Lee, NJ or Canary Wharf, London) or a physician treating a patient, the timeliness of access to emergent information is critical. So correct or not, gravitational forces do act to geo-locate information.
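To put a rough number on “hard to move”, here is a back-of-the-envelope calculation. The 1 Gbit/s link speed is an assumption chosen for illustration, as are decimal units (1 TB = 10^12 bytes) and a link that is fully utilized with no protocol overhead, so real transfers would take longer.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to push size_bytes over a link, ignoring all overhead."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


TB = 10**12   # decimal terabyte, in bytes
PB = 10**15   # decimal petabyte, in bytes
GBIT = 10**9  # 1 Gbit/s, an assumed link speed

days_100tb = transfer_days(100 * TB, 1 * GBIT)
days_1pb = transfer_days(1 * PB, 1 * GBIT)

print(f"100 TB at 1 Gbit/s: {days_100tb:.1f} days")
print(f"  1 PB at 1 Gbit/s: {days_1pb:.1f} days")
```

Under these assumptions 100TB takes a little over nine days and a petabyte about three months, during which the source data keeps changing.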

Not to take my physics analogy too far, but Energy is also interesting. It could be looked at as “activity” in a community. For energy there are interesting kinetic and potential models. In the case of the internet, the relative connectedness of information required for a decision could be viewed in light of “potential”. Remember:

Ep (potential energy) = Mass x force of Gravity x Height (mgh)

In our case Height could be looked at as the bandwidth between N participating information sites, Mass as the total amount of information that needs to be processed, and Gravity as the decentralization of information = the Outer Joins required for optimal processing. If I need to do a ton of outer joins across the Internet in order to get an answer, then I need to spend a lot of energy.

So if malls were designed for optimal [human] energy efficiency, then big data malls could do exactly the same for data.