wwHadoop: Analytics without Borders


This week at EMC World 2012, the EMC technical community is launching a community program called World Wide Hadoop.  I am really excited to be part of a collaboration across the EMC technical community that is extending the “borders” of our Big Data portfolio: building on the success of our Greenplum Hadoop distribution, we are offering the open source community the ability to federate their HBase analytics across a distributed set of Hadoop clusters.


In the past month, the EMC Distinguished Engineer community has been collaborating with our St. Petersburg [RUS] Center of Excellence to demonstrate the ability to distribute analytic jobs (moving the code vs. moving the data) across multiple, potentially geographically dispersed clusters, manage those jobs, and join the results.
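The scatter/gather pattern described above can be sketched in a few lines of Python. Everything here is illustrative: the cluster names and the body of `submit_job` are placeholders I have invented, standing in for a real job-submission API on each Hadoop cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster names -- placeholders, not real endpoints.
CLUSTERS = ["cluster-boston", "cluster-stpetersburg", "cluster-cork"]

def submit_job(cluster, job_code):
    """Ship the analytic code to the cluster and run it against the
    data already resident there (move the code, not the data).
    Placeholder: a real version would call the cluster's job API."""
    return {"cluster": cluster, "partial": f"result of {job_code} on {cluster}"}

def run_everywhere(job_code):
    # Fan out: submit the same code to every cluster in parallel...
    with ThreadPoolExecutor(max_workers=len(CLUSTERS)) as pool:
        partials = list(pool.map(lambda c: submit_job(c, job_code), CLUSTERS))
    # ...then fan in: the caller joins the partial results.
    return partials

results = run_everywhere("count-events")
```

The essential point is in the structure, not the plumbing: the job code travels to each cluster, only small partial results travel back to be joined.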

The big problem that we are addressing is captured by Reed’s law: the combinatorial value of networked resources, in our case, information sets.


“[E]ven Metcalfe’s law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2^n. So the value of a GFN increases exponentially, in proportion to 2^n. I call that Reed’s Law. And its implications are profound.”
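The arithmetic behind the quote is easy to check. Using the common closed forms, Metcalfe's law counts n(n-1)/2 pairwise links, while Reed's law counts 2^n − n − 1 subgroups of two or more members:

```python
def metcalfe(n):
    # Metcalfe's law: value grows with the number of distinct pairs.
    return n * (n - 1) // 2

def reed(n):
    # Reed's law: all subsets of two or more members --
    # 2^n subsets, minus the empty set, minus the n singletons.
    return 2**n - n - 1

for n in (10, 20, 30):
    print(n, metcalfe(n), reed(n))
```

Even at n = 30 the gap is enormous, which is the point of the quote: group-forming value dwarfs pairwise value as the network grows.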

A few of our Big Data Challenges


  • Valuable information is produced across geographically diverse locations
  • The data has become too big to move [thus we need to process in place]
  • Scientists and analysts have begun to move partial sets vs. full corpora to try and save time
  • But this partial data can, and often does, create inadvertent variance or occlusions in correlation and value
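The risk in those last two bullets can be shown with a small simulation: compute a correlation over a full corpus, then over a convenient partial slice. The data and coefficients here are synthetic toys I have made up for illustration, not anything from the EMC demo.

```python
import random

random.seed(42)

# A synthetic "full corpus": y has a real but modest dependence on x.
full = [(i, 0.1 * i + random.gauss(0, 5)) for i in range(1000)]

def correlation(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / (vx * vy) ** 0.5

# Moving only a convenient slice (say, the first 5%) instead of the
# full corpus badly understates the correlation in this example.
partial = full[:50]
print(correlation(full), correlation(partial))
```

The full corpus shows a strong correlation; the partial slice, covering a narrow range of x, shows a much weaker one. That is exactly the inadvertent variance the bullet warns about.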

EMC is demonstrating a working distributed cluster model for analytics across multiple clusters.


We want to work on this with the open source community, as we believe there is tremendous value in enabling the community to both derive and add value with EMC in this space. We invite all to join us at http://www.wwhadoop.com.

Big Data Universe: “Too Big to Know”

I was surfing to WGBH on Saturday when I came across a lecture by David Weinberger (surrounding his new book Too Big to Know).

I was sucked in when he alluded to brick and mortar libraries as yesterday’s public commons, and pointed to the discontinuous and disconnected nature of books/paper. The epitaph may read something like this: “book killed by hyperlink; the facts of the matter are whatever you make them.”

Overall, David, in his look at the science of knowledge, points to many interesting transitions that the cloud is bringing:

  • Books/Paper are discontinuous and disconnected – giving way to the Internet, which is constantly connected and massively hyper/inter-linked
  • “the Facts are NOT the Facts” – which is to point out that arguments are what we make of the information presented, and our analytics, given a particular context. What we claim as a fact may in reality prove to be a fallacy; just look at Louis Pasteur and germ theory. History has so many moments like this.
  • Differences and disagreements are themselves valuable knowledge. For me this is certainly true; learning typically comes through the challenge of preconception.
  • There is an ecology of knowledge – a set of interconnected entities that exist within an environment. These actors represent a complex set of interrelationships, a set of positive and negative reinforcements, that act as governors on this system. These promoters/detractors act to balance fact/fallacy so as to create the system tension that supports insight (knowledge?). It’s these new insights that themselves create the next arguments – the end goal being?

I wanted to share this book because I believe it reinforces the need for businesses to think about their cloud and big data strategies. The question becomes less “do I move my information to the cloud?” and more “how do I benefit from the linkage that the Internet can provide to my information?” so as to derive new insights from big data.

Read the book, take the challenge!

Fallacies of Enterprise Information Management (part deux)…

With some hearty comments from Tom Maguire, I’ve been forced to adjust some of these fallacies:
1. Data quality is perfect – data is correct, complete and coherent across all enterprise contexts
- People will remediate bad data – if inaccuracies are found (contrary to the axiom above), users will willingly and proactively make changes, and all users will agree with those changes
2. Relationships are known – the linkages between data entities are well known, hierarchical, navigable and everlasting
3. There is a singular master model that is explicitly and consistently factorable for all enterprise uses
- One dictionary – there is a consistent dictionary with a well-agreed and complete set of metadata supporting the modeled domain
- Static model – the model is complete, and no changes will ever be necessary
4. Expectation of XA/2-phase transactionality – data exchanges will be ACIDly transactional, and based upon XA/2-phase transactional mechanisms
- Transactions complete in a timely fashion and are not affected by Deutsch’s fallacies
5. Idempotent data – there is one master copy of enterprise data, and application-specific “caches” are always synchronized and consistent
While currently working on more consistent “information exchanges,” I have found that these fallacies drive some specific architectural artifacts, including:
- the need for appropriately targeted abstractions providing consistent transformations to/from canonical forms,
- the need to support similar metadata/policy models and transformations in support of context bridges and securitization
- support for multi-master synchronization
- needs for distributed model governance, non-destructive model mutation and potentially late-modeled forms
(though OWL/RDF brings some new challenges in transactional systems wrt. transitive closure)
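As a toy illustration of the first artifact above (transformations to/from canonical forms), here is a hypothetical context-bridge sketch. The registry API, the contexts, and the field names are all invented for illustration; the point is only the shape of the abstraction.

```python
# Registry of per-context bridges: each application context supplies a
# pair of transformations to and from a shared canonical form, so two
# contexts can exchange entities without knowing each other's models.
canonical_bridges = {}

def register_bridge(context, to_canonical, from_canonical):
    canonical_bridges[context] = (to_canonical, from_canonical)

def exchange(entity, source, target):
    # Source context lifts the entity to canonical form;
    # target context lowers it back into its own model.
    to_c, _ = canonical_bridges[source]
    _, from_c = canonical_bridges[target]
    return from_c(to_c(entity))

# Two contexts that disagree on field names for the same customer entity.
register_bridge("crm",
                to_canonical=lambda e: {"name": e["customer_name"]},
                from_canonical=lambda c: {"customer_name": c["name"]})
register_bridge("billing",
                to_canonical=lambda e: {"name": e["acct_holder"]},
                from_canonical=lambda c: {"acct_holder": c["name"]})

print(exchange({"customer_name": "Ada"}, "crm", "billing"))
```

With N contexts, each context writes one bridge to the canonical form rather than N-1 point-to-point mappings, which is precisely why the canonical abstraction falls out of fallacies 2 and 3.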

Information Management Truisms

I have long been a fan of Peter Deutsch’s fallacies of network/distributed computing (btw I’m not alone; Google this AM produced over 22k references); they have served as a set of guiding checkpoints for every distributed system that I have built. What I have found to be missing, however, is a similar set of fallacies/truisms for managing information as we approach “internet scale” information infrastructure… the information explosion.

Truisms/principles for managing the information explosion:

  1. no one person/system is capable of managing all data
  2. optimizations will be continually applied, but by different vendors, thereby requiring an enterprise to distribute their information architecture
  3. information processing is inherently a pipelined process (though fork/join supports parallelism for reduction of latency)
  4. these pipelines can have “in parallel” replicas so long as sufficient locking is engineered, and compensation models are supported
  5. locking for a given pipeline should be owned discretely by a single application context (workflow) – though this workflow may be complex, it is stateless upon completion of end state
  6. loose coupling / jit integration require coherent, federable, data dictionaries and meta-data/structure maps
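Truism 3 can be sketched as a tiny pipeline with one fork/join stage. This is a toy example with made-up stage names, not a real Hadoop job; it only shows the shape: sequential stages, with parallelism inside a stage to reduce latency.

```python
from concurrent.futures import ThreadPoolExecutor

def extract(records):
    # Stage 1: normalize raw input.
    return [r.strip() for r in records]

def transform(record):
    # Stage 2 (parallelizable per record).
    return record.upper()

def load(records):
    # Stage 3: final ordering/commit step.
    return sorted(records)

def pipeline(records):
    staged = extract(records)
    # Fork: run the transform stage across records in parallel;
    # join: collect the results before the next stage.
    with ThreadPoolExecutor() as pool:
        transformed = list(pool.map(transform, staged))
    return load(transformed)

print(pipeline([" alpha", "beta ", " gamma "]))
```

Note the fork/join only buys latency, not a change in semantics: the pipeline is still logically sequential, which is what makes truism 5's single-owner locking model tractable.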

Translated to Fallacies… which, agreeing with SGG, I think are way more powerful, and in some cases hilarious.

  1. there is one enterprise data architect who is responsible for the master models
  2. there is a system who is the authoritative master for a given entity domain
  3. there is one vendor involved across the SOA and EIM domain
  4. the data models are largely fixed, and the business will not ask for further changes/enhancements to the model
  5. data exchange will be based upon XA/2-phase transactional mechanisms to achieve ACID properties (pessimistic transactionality)
  6. there will be a singular data dictionary, with complete meta-data for a given entity domain

Additions/Subtractions/debate most wanted!