wwHadoop: Analytics without Borders


This week at EMC World 2012, the EMC technical community is launching a community program called World Wide Hadoop. I am really excited to be part of a collaboration across the EMC technical community that has been looking to extend the “borders” of our Big Data portfolio: building on the success of our Greenplum Hadoop distribution, we are offering the open source community the ability to federate their HBase analytics across a distributed set of Hadoop clusters.


In the past month, the EMC Distinguished Engineer community has been collaborating with our St. Petersburg [RUS] Center of Excellence to demonstrate the ability to distribute analytic jobs (move the code vs. moving the data) across multiple [potentially] geographically dispersed clusters, manage those jobs, and join the results.
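A minimal sketch of the “move the code, not the data” pattern described above. All of the names here (`run_everywhere`, the cluster records) are hypothetical stand-ins, not the actual wwHadoop API; in a real deployment the function would be serialized and submitted to each remote cluster rather than called locally.

```python
# Sketch: ship the same analytic function to several clusters,
# run it against the data in place, then join the partial results.
from concurrent.futures import ThreadPoolExecutor

def run_everywhere(clusters, analytic_fn):
    def run_on(cluster):
        # In a real system this would submit analytic_fn to the
        # remote cluster; here we simply apply it to local data.
        return analytic_fn(cluster["data"])
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(run_on, clusters))
    # Combine per-cluster partial results into one answer.
    return sum(partials, [])

clusters = [
    {"name": "boston",        "data": [1, 2, 3]},
    {"name": "st_petersburg", "data": [4, 5]},
]
top = run_everywhere(clusters, lambda data: [x * x for x in data])
```

The point of the pattern is that only the (small) function and the (small) results cross the wire; the (large) data never moves.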

The big problem that we are addressing is that of Reed’s law: the combinatorial value of networked resources, in our case, information sets.


“[E]ven Metcalfe’s law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2^n. So the value of a GFN increases exponentially, in proportion to 2^n. I call that Reed’s Law. And its implications are profound.”
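A quick back-of-the-envelope comparison of the two growth laws, with illustrative member counts:

```python
# Metcalfe's law: network value grows ~ n^2 (pairwise links).
# Reed's law: value grows ~ 2^n (all possible subgroups).
def metcalfe(n: int) -> int:
    # Pairwise-connection value scales as n squared.
    return n * n

def reed(n: int) -> int:
    # The number of possible subgroups of n members is 2^n.
    return 2 ** n

for n in (10, 20, 30):
    print(n, metcalfe(n), reed(n))
```

Even at thirty members, the group-forming value dwarfs the pairwise value, which is why federating information sets (rather than keeping them siloed) is such a large prize.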

A few of our Big Data Challenges


  • Valuable information is produced across geographically diverse locations
  • The data has become too big to move [thus we need to process in place]
  • Scientists and Analysts have begun to move partial sets vs. full corpora to try to save time
  • But this partial data can, and often does, create inadvertent variance or occlusions in correlation and value

EMC is demonstrating a working distributed cluster model for analytics across multiple clusters.


We want to work on this with the open community, as we believe there is tremendous value in enabling the community to both derive and add value with EMC in this space. We invite all to join us at http://www.wwhadoop.com.

Big Data and Healthcare: Correlate, Root Cause, Operationalize & Disrupt

I have to say that it’s been massively fun lately at EMC talking about the impact of Big Data Analytics on vertical markets. It has been fun learning from Chuck’s experience, and his scars, from advocating for disruptive change in the IT market.

Recently, Chuck blogged about the emerging ACOs and the potential disruption that information-rich, cost/benefit-aligned incentives can bring to markets like healthcare.

I commented on his blog, and wanted to share the insight with my audience as well:

Thanks for your insightful “state of healthcare” review. I find that there is one more facet to healthcare that makes the data/information analytic landscape even more compelling, one captured in an interesting book by David Weinberger, Too Big to Know.

Dr. Weinberger brings forward the notion that books and traditional learning are discontinuous, and that the hyperlinked information universe is giving rise to a massive ecology of interconnected fragments that continually act with the power of positive and negative reinforcement. The net is that traditional views, or in David’s vocabulary “facts,” are largely based upon constrained reasoning, and that as the basis of this reasoning changes with the arrival of new facts, so should their interpretations.

There are a number of entities (the VA, certain ACOs, and certainly academic medical centers) sitting on top of 30 years of historic clinical data that can chart many of the manifest changes in treatment planning and outcome. Because of their constraints, they may surface significant correlations yet not be able to discern accurate causality, owing to the incompleteness of the information base: missing genomics, proteomics, or simply the very complex nature of the human biological system.

The interconnectedness of the clinical landscape is of paramount value in the establishment of correlation and derived causality, and with the arrival of new information, traditional best practices can be [and need to be] constantly re-evaluated.

I believe that this lack of causal understanding, due to highly constrained reasoning, has led to many of the derogatory statements about today’s outcome-based care model. Because today’s outcome measures are based more on correlation than on causality, determining causal factors, and then understanding the right control structures, should improve the operationalization of care. I further believe that the availability of substantially more complete clinical information, across higher percentages of populations, will lead to improvements in these outcome measures and the controls that can affect them. The net result of getting this right is improved outcomes at decreased costs, effectively turning the practice of medicine into a big data science with a more common methodology and more predictable outcomes. In effect, full circle: operationalized predictive analytics.

The opportunities for market disruption are substantial, but there remain high barriers. One constant in being able to exploit disruptions is the right visionary leadership: someone who has the political capital and will, but who is also willing to be an explorer, for change is a journey and the route is not always clear. Are you a change agent?

Join me at the 2nd Annual Data Science Summit @ EMC World 2012

Big Data Exchanges: Of Shopping Malls and the Law of Gravity

Working for a storage systems company, we are constantly looking at both the technical and the social/marketplace challenges to our business strategy. That thinking led to EMC’s coining of “Cloud Meets Big Data” last year, and EMC has been looking at the trends that “should” tip the balance toward real “cloud information management,” as opposed to the “data management” that really dominates today’s practice.

There are a few truisms [incomplete list]:

  1. Big Data is hard to move = get the optimal [geo] location right the first time
  2. Corollary = move the function across federated data
  3. Data analytics are context sensitive = metadata helps to align/select contexts for relevancy
  4. Many facts are relative to context = declare the contexts of derived insight (provenance & the scientific method)
  5. Data is multi-latency and needs deterministic support for temporality = a key declarative information-architecture requirement
  6. Completeness of information for purpose (e.g. making a decision) = dependent on what I have and what I get from others, but everything I need to decide

I believe that 1) and 6) above point to an emerging need for Big Data communities to arise supporting the requirements of the others, whether we talk about them as communities of interest or as Big Data clouds. There are some very interesting analogies in the way we humans act; namely, the shopping mall. Common wisdom points to the mall as providing improved shopping efficiency, but also, in the case of enclosed malls, a controlled environment (think walled garden). I think that both efficiency, in the form of “one stop,” and control are critical enablers in the information landscape.

This slide from one of my presentations draws out the similarities between building a shopping mall and developing a big data community: understanding the demographics of the community (information needs, key values), planning the roads in and out, and of course creating critical mass = the anchor store.

The interesting thing about critical mass is that it tends to center around a key [gravitational] force. Remember:

Force = Mass * Acceleration (change in velocity).

This means that in order to create communities and maximize force you need mass [the size/scope/scale of information] and improving velocity [the timeliness of information]. In terms of mass, truism #1 above, plus the sheer cost and limited bandwidth available, makes moving 100 TB of data hard and petabytes impractical. Similarly, changes in velocity matter: whether algorithmically trading on the street (you have to be in Ft. Lee, NJ or Canary Wharf, London) or treating a patient as a physician, the timeliness of access to emergent information is critical. So, correct or not, gravitational forces do act to geo-locate information.
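The arithmetic behind “100 TB is hard to move” is worth making concrete. A rough sketch, assuming a sustained 1 Gb/s WAN link (an illustrative number, not a measured one):

```python
# Rough transfer-time arithmetic for bulk data movement.
def transfer_days(terabytes: float, gbits_per_sec: float) -> float:
    bits = terabytes * 1e12 * 8            # decimal TB -> bits
    seconds = bits / (gbits_per_sec * 1e9) # link rate in Gb/s
    return seconds / 86400                 # seconds -> days

print(round(transfer_days(100, 1.0), 1))   # ~9.3 days at 1 Gb/s
```

At those rates a petabyte takes months, before counting protocol overhead, retries, or contention, which is why the function has to move to the data.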

Not to take my physics analogy too far, but energy is also interesting. It could be looked at as “activity” in a community, and energy has both kinetic and potential models. In the case of the Internet, the relative connectedness of the information required for a decision could be viewed in terms of “potential.” Remember:

Ep (potential energy) = Mass x force of Gravity x Height (mgh)

In our case, height could be looked at as the bandwidth between N information-participant sites, mass as the total amount of information to process, and gravity as the decentralization of information = the outer joins required for optimal processing. If I need to do a ton of outer joins across the Internet in order to get an answer, then I need to spend a lot of energy.
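A toy reading of that analogy in code. The function and its units are purely illustrative (my own mapping of the m·g·h terms, not anything from a real cost model), but it makes the direction of the argument visible: scattered data plus slow links multiplies the “energy” needed per answer.

```python
# Toy "potential energy" of a federated query, per the analogy:
#   mass    = total data touched (TB)
#   gravity = decentralization, proxied by cross-site outer joins
#   height  = inverse bandwidth (1 / Gb/s), i.e. distance penalty
def query_energy(mass_tb: float, outer_joins: int, gbps: float) -> float:
    height = 1.0 / gbps
    return mass_tb * outer_joins * height

# Co-located data on fast links vs. scattered data on slow links:
local     = query_energy(mass_tb=10, outer_joins=2,  gbps=10)  # 2.0
scattered = query_energy(mass_tb=10, outer_joins=50, gbps=1)   # 500.0
```

Same data volume, two orders of magnitude more “energy”: the mall wins on efficiency.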

So if malls were designed for optimal [human] energy efficiency, then big data malls could do exactly the same for data.

Information Era – Business Performance through Big Data Mining

In this article, “Mining of Raw Data May Bring New Productivity, a Study Says” (NYTimes), the NYTimes reinforces one of the key points that EMC has made during the acquisitions of Greenplum and Isilon, and most recently at EMC World: “Where Clouds Meet Big Data.” It seems the analysts are catching up to one of my key theses: “Since Big Data is created in the cloud, it needs to be managed there, AND monetized there.” This thesis, I believe, forces us to look exceedingly differently at all aspects of information management. The “Journey to the Cloud” isn’t just about the Enterprise projection to the Cloud (CDN styled), but the Enterprise exploitation of its cloud property (and position). Communities emerge around key topics, just as they do in [Hi-Tech] business: the Valley, Cambridge, and the like. But there needs to emerge a type of marketplace for the transfer of value. The job marketplace helped us with our initial “exchange” economy (1999-2010): as employees moved, they built a portfolio of knowledge, relationships, and tools. I think that a digital knowledge marketplace will emerge, extending insight based upon a more complete understanding of contextual information and the ability to exploit that information for improved insights.

I believe, for many reasons, that this marketplace will happen quickly, with big data sets acting as magnets for healthcare, financial services, public policy, even law enforcement / intelligence activities. Increasingly, if you are “far” from the market, your latency will be higher [time to insight] and your context lower [quality of insight], all leading to a marginalization of value.

Another key point leading me to these centroids of market value: at a recent Internet conference, we learned from a backbone provider that 2013 will be a seminal year, in which gains in data center bisection bandwidth will exceed the bisection bandwidth of the Internet. This crossover will substantially increase the benefits of co-residency, as moving big data can be quite expensive. The net result: we cannot just think about the hybridization of enterprise IT into clouds, we need to think about transacting in the cloud, close to the big data, improving value through deep operational analytics. Oh yeah, and mind your neighbor [friends close, enemies closer?].

Fallacies 3.0

I was given the unique opportunity to brief participants in CareCore National’s user conference late last month, in a panel discussion led by CCN’s CTO Bill Moore (@BowTie) around technology trends affecting the delivery of quality care through consistent, data-driven outcome evaluation. Presenting alongside Cisco and AT&T, I had the opportunity to talk about Internet-scale data analytics and the emergence of federated exchanges. With federated exchanges providing the opportunity to combine information from multiple sources and to support the delivery of DSS-enhanced information pipelines, we have the unique opportunity to look at an information-driven event model and to derive, from this aggregated information base, the optimal processing of each event.

Enabled by virtualization (for operational efficiency) and standards (for vocabulary normalization), and with a bit of magic, each participant can leverage his or her own data, along with that of other participants, to make better decisions, and, through improved path planning and process optimization, reduce the cost of care while optimizing positive outcomes.

During this session, I reviewed what is becoming an increasingly common topic: which assumptions are typically incorrect, and how we can begin to leverage these anti-patterns to help us accelerate the right solutions.

Enterprise Information Fallacies (part 3)

  • The information that you need is information that you own completely
  • Information is of known and high quality
  • You have a sufficient volume of information for a low standard deviation
  • Foreign dictionaries/schemas are easily mapped across domains
  • Your ETL and OLTP, structured and unstructured, infrastructures are coherent and consistent
  • ERwin models are sufficient to govern complex information
  • Controls are always correctly enforced by the business application
  • Information is static and non-perishable
  • Information contributors have similar contexts

Federated information exchanges serve to lower the friction of sharing information while enabling increased controls. The biggest problems are still the non-functional aspects of the exchange: scalability, availability, reliability (non-repudiation), security. Without these properties it becomes nearly impossible to use any derived information; thus my key focus.


RSNA Presentation TidBits

RSNA was an exceedingly interesting show this year, with the concept of a framework for integration based upon a Service Oriented Architecture (SOA), and more interest in meaningful use and the cross-integration of clinical platforms. I feel like we’re on the verge of a substantial trend toward integration, both for the delivery of care and for institutional operational enhancements.

a few pics here

21st Century Health Care?

Interesting use cases abound for Master Data Management, but few hit as close to home as medical health records. Though known colloquially as Personal Health Records (PHR), Electronic Health Records (EHR), or even Employee Health Records (also EHR), there is a substantial need across the care triumvirate (Patient-Provider-Payer) to begin to align content across purposes in a non-leaky way. A very interesting company, MedCommons, has been pushing for a Continuity of Care Record (CCR) designed to pull together the digital content “critical to the continuity of care” so that a patient can be transferred across providers with high fidelity. To my “digital” dismay, it seems that the majority of records are hand-carried; just ask an armed-forces member about the medical jacket they hand-carry from point to point across their care continuum. How absurd is it that online services like Spokeo can assemble your online footprint, but no one seems obliged to offer similar aggregation for the treatment of disease or the proactive management of health?

I mean, how absurd that we cannot agree on a non-deprecating set of formats and exchanges for the appropriate sharing of health records across participants. Sure, there are those worried that pre-existing conditions will impact their ability to receive insurance, or that a PCP may learn of a pre-existing STD that might, through loose lips, sink a marriage, but these are merely obstacles in need of the proper application of policy-driven “opt-in” authorizations.

At the end of the day, if everything is auditable, my decision to withhold information becomes MY decision, and I’ll need to take responsibility for it. Similarly, a provider’s or payer’s access also brings a certain set of responsibilities. In the “trust, but verify” culture of which we are a part, it seems that loose federations, policy management and enforcement, non-repudiable audits, and intentional design bear the right set of controls to build these systems.
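The non-repudiable audit piece is the part most amenable to a simple sketch. Below is a minimal tamper-evident audit log: each entry hashes the previous one, so rewriting history invalidates every later hash. This is my own illustration, not any specific health-IT standard; a production system would add digital signatures and trusted timestamps to achieve true non-repudiation.

```python
# Minimal hash-chained audit log: tamper-evident by construction.
import hashlib
import json

def append_entry(log, event):
    # Each entry commits to the hash of its predecessor.
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"event": event, "prev": prev_hash}
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})
    return log

def verify(log):
    # Re-derive every hash; any edit breaks the chain from there on.
    prev = "0" * 64
    for entry in log:
        body = {"event": entry["event"], "prev": entry["prev"]}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, "patient consented to share record")
append_entry(log, "specialist viewed PCP test results")
assert verify(log)
```

With a log like this, “I viewed the prior results” becomes a checkable fact rather than a claim, which is exactly the property the trust-but-verify model needs.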

It is astonishing that providers order additional, oft-repeated tests for self-protection against malpractice, and that providers sit idly by, not giving patients the ability to archive and share their disease/treatment portfolios in a way that might offset large segments of the cost. If my specialist had a copy of the tests run by my PCP, would they ask for a new assay? Would it be enough that we could know/record that they viewed the prior results, and therefore be absolved of negligence?

I’m sure that I’m oversimplifying, but when does the 80:20 rule take effect? Can we save 80% by managing the specific treatment of the 20%? I’m not talking about taking unfounded risks, but merely about exercising better judgment. With patients surfing the Internet, we have to think that an educated “consumer” is a “good” consumer. Why not support the broad, population-wide “best treatment” capabilities afforded by a statistically significant population: anonymizing our PHRs, enabling them to be mined, and determining which diagnostics were deterministic (and which wasteful), which treatments produced the best outcomes, and what the options are?

Here we are with 21st-century IT in retail lending and personalized advertising, yet we health consumers seem to be counting on the limited experience of our physicians as the sole director of our own mortality, when, let’s face it, the analytic resources and data do exist. Why doesn’t insurance-style actuarial analysis make it into our own wellness-management portfolios, helping us consumers lead healthier lives and giving the “smart consumer” the choice they desire: not choices assembled by the griping few, or the chance of a caring soul who shares a disease and blogs their experience, but an industry that takes the initiative to make a broad push to improve the 80%, and, who knows, this probably applies to the trailing 20% as well.

Spoken by a dad whose spouse has spent countless nights researching diseases based upon observed symptoms, leading her to substantial enlightenment. This knowledge, in turn, has more than once provided her with information that our family’s PCP didn’t have the time to accumulate through either experience or training.