wwHadoop: Analytics without Borders


This week at EMC World 2012, the EMC technical community is launching a community program called World Wide Hadoop. I am really excited to be part of this collaboration across the EMC technical community, which has been looking to extend the “borders” of our Big Data portfolio. Building on the success of our Greenplum Hadoop distribution, we are offering the open source community the ability to federate their HBase analytics across a distributed set of Hadoop clusters.


In the past month, the EMC Distinguished Engineer community has been collaborating with our St. Petersburg [RUS] Center of Excellence to demonstrate the ability to distribute analytic jobs (moving the code rather than the data) across multiple [potentially] geographically dispersed clusters, manage those jobs, and join the results.
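To make the "move the code, not the data" idea concrete, here is a minimal Python sketch. It is not the wwHadoop implementation: the cluster names, the in-memory data, and the functions `local_job` and `run_federated` are all hypothetical stand-ins, and threads simulate what would really be remote clusters. The same small job is sent to every site, each site returns only a compact partial result, and the partial results are joined centrally.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for three geographically separate clusters.
CLUSTERS = {
    "boston": [4.0, 8.0, 15.0],
    "st_petersburg": [16.0, 23.0],
    "shanghai": [42.0],
}

def local_job(records):
    """Runs where the data lives and returns a small partial aggregate."""
    return {"count": len(records), "total": sum(records)}

def run_federated(job, clusters):
    """Send the same job to every cluster, then join the partial results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(job, data) for name, data in clusters.items()}
        partials = {name: fut.result() for name, fut in futures.items()}
    count = sum(p["count"] for p in partials.values())
    total = sum(p["total"] for p in partials.values())
    return {"count": count, "total": total, "mean": total / count}

result = run_federated(local_job, CLUSTERS)
print(result)
```

Only the per-site counts and totals cross the "network" here; the raw records never leave their cluster, which is the property that matters once the data is too big to move.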

The big problem that we are addressing is that of Reed’s law: the combinatorial value of networked resources, in our case, information sets.


“[E]ven Metcalfe’s law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2^n. So the value of a GFN increases exponentially, in proportion to 2^n. I call that Reed’s Law. And its implications are profound.”
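A quick way to see the gap between the two laws is to count the groups directly. The Python sketch below is only illustrative, and the function names are my own. It compares the number of two-member pairings (the Metcalfe-style count) with the number of possible groups of two or more members, which is exactly 2^n - n - 1 and therefore grows in proportion to 2^n, as the quote says.

```python
from math import comb

def metcalfe_pairs(n):
    """Number of distinct two-member groups among n members."""
    return comb(n, 2)

def reed_groups(n):
    """Number of groups with at least two members: 2^n minus the
    n single-member sets and the one empty set."""
    return 2**n - n - 1

for n in (5, 10, 20):
    # Cross-check the closed form by summing over every group size.
    assert reed_groups(n) == sum(comb(n, k) for k in range(2, n + 1))
    print(n, metcalfe_pairs(n), reed_groups(n))
```

With just 20 information sets there are 190 possible pairings but over a million possible combinations, which is why the ability to combine sets across clusters matters so much.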

A few of our Big Data Challenges


  • Valuable information is produced across geographically diverse locations
  • The data has become too big to move [thus we need to process in place]
  • Scientists and analysts have begun to move partial sets rather than full corpora to try to save time
  • But this partial data can, and often does, create inadvertent variance or occlusions in correlation and value

EMC is demonstrating a working distributed cluster model for analytics across multiple clusters.


We want to work on this with the open community, as we believe that there is tremendous value in enabling the community to both derive and add value with EMC in this space, and we invite all to join us at http://www.wwhadoop.com.

SAP and Informatica join forces

Wow am I remiss for not posting this article sooner…

SAP has agreed to include data federation tools from Informatica with some of its enterprise resource planning and analytics products.

The combination of the two products is intended to help customers analyze data stored in third-party or legacy systems. It also solves a marketing problem for SAP, as the applications from Informatica will help the company sell into larger enterprises with heterogeneous environments.

Under the terms of their agreement, SAP will embed Informatica’s PowerCenter, PowerExchange and Metadata Manager software into its performance management and business analytic applications and the NetWeaver platform for master data management and business intelligence.

My take: SAP is going head to head with Oracle… Oracle has been acquiring EIM assets at a rapid clip with Sunopsis and Hyperion. SAP has been building its own NetWeaver-based BI offering, but there was huge traction with Hyperion to provide the enterprise-wide integrated reporting so key to leveraging SAP’s information; alas, this is no more. Now SAP must figure out how to fill that gap with an existing enterprise domain player to empower its own offering… enter Informatica. Hmmmm….