Big Data Universe: “Too Big to Know”

I was surfing to WGBH on Saturday when I came across a lecture by David Weinberger (surrounding his new book Too Big to Know).

I was sucked in when he alluded to brick and mortar libraries as yesterday’s public commons, and pointed to the discontinuous and disconnected nature of books / paper. The epitaph may read something like this: “book killed by hyperlink, the facts of the matter are whatever you make them.”

Overall David, in his look at the science of knowledge, points at many interesting transitions that the cloud is bringing:

  • Books/Paper are discontinuous and disconnected – giving way to the Internet which is constantly connected and massively hyper/inter-linked
  • “the Facts are NOT the Facts” which is to point out that arguments are what we make of the information presented, and our analytics given a particular context. What we claim as a fact may in reality prove to be a fallacy; just look at Louis Pasteur and germ theory. History has so many moments like this.
  • Differences and Disagreements are themselves valuable knowledge. For me this is certainly true; learning typically comes through the challenge of preconceptions.
  • There is an ecology of knowledge – There is a set of interconnected entities that exist within an environment. These actors represent a complex set of interrelationships, a set of positive and negative reinforcements, that act as governors on this system. These promoters/detractors act to balance fact/fallacy so as to create the system tension that supports insight (knowledge?). It’s these new insights that themselves create the next arguments – the end goal being?

I wanted to share this book because I believe that it reinforces the need for businesses to think about their cloud – big data strategies. The question becomes less of “do I move my information to the cloud?” and more of “how do I benefit from the linkage that the Internet can provide to my information?” so as to provide new insights from big data.

Read the book, take the challenge!

Information Era – Business Performance through Big Data Mining

In this article: “Mining of Raw Data May Bring New Productivity, a Study Says – NYTimes” the NYTimes reinforces one of the key points that EMC has made during the acquisitions of Greenplum, Isilon and most recently at EMC World: “Where Clouds Meet Big Data”. It seems like the analysts are catching up to one of my key theses: “Since Big Data is created in the cloud, it needs to be managed there, AND monetized there.” This thesis, I believe, forces us to look very differently at all aspects of information management. The “Journey to the Cloud” isn’t just about the Enterprise projection to the Cloud (CDN styled), but the Enterprise exploitation of their cloud property (and position). Communities emerge around key topics, just as they do in [Hi-Tech] business: the Valley, Cambridge, and the like. But there needs to emerge a type of marketplace for the transfer of value. The job marketplace helped us with our initial “exchange” economy (1999-2010): as employees moved, they built a portfolio of knowledge, relationships and tools. I think that a digital knowledge marketplace will emerge, extending insight based upon a more complete understanding of contextual information, and the ability to exploit this information for improved insights.

I believe, for many reasons, that this marketplace will happen quickly; big data sets acting as magnets for healthcare, financial services, public policy, even law enforcement / intelligence activities. Increasingly, if you are “far” from the market, your latency will be higher [time to insight], your context lower [quality of insight], all leading to a marginalization of value.

Another key point leading me to these centroids of market value: At a recent Internet conference, we learned from a backbone provider that 2013 will be a seminal year in which the gains in Data Center bisection bandwidth will exceed the bisection bandwidth of the Internet. This crossover will substantially increase the benefits of co-residency, as moving big data can be quite expensive. The net result: we cannot just think about the hybridization of enterprise IT into clouds, we need to think about transacting in the cloud, close to the big data, improving value through deep operational analytics. Oh yeah, and mind your neighbor [friends close, enemies closer?]
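
To put a rough number on why moving big data is expensive, here is a back-of-the-envelope sketch. The function name, the dataset size, the link speeds and the efficiency factor are all my own illustrative assumptions, not figures from the conference.

```python
def transfer_hours(dataset_terabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move a dataset over a network link.

    dataset_terabytes: size in decimal terabytes (1 TB = 1e12 bytes).
    link_gbps: nominal link speed in gigabits per second.
    efficiency: assumed fraction of the nominal speed actually sustained.
    """
    if dataset_terabytes < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link speed > 0, efficiency in (0, 1]")
    bits_to_move = dataset_terabytes * 1e12 * 8
    sustained_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / sustained_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical comparison: a 100 TB data set over a 1 Gbps WAN link
    # versus a 40 Gbps path inside the data center.
    for label, gbps in (("WAN, 1 Gbps", 1), ("in data center, 40 Gbps", 40)):
        print(f"{label}: {transfer_hours(100, gbps):.1f} hours")
```

Under these assumptions the WAN copy takes on the order of weeks while the in-data-center copy takes hours, which is the co-residency argument in miniature.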

Greenplum the “Big Data” Cloud Database

It has been a long time since EMC completed the acquisition of Greenplum and we have been mighty busy. I’ve met with the biggest and smallest of customers, and have heard literally 50s of feature / product requests. We’re truly listening: Hadoop, ETL, systemic management, BI/BA deep integrations, and improvements in multi-tenancy for governed derivative cubes and marts. Leave me a comment, tell me what you’re thinking. If you want me to keep it private, just put <private>text here</private> in the contents, and I’ll just get back to you personally. Stay tuned, we’re up to something interesting.

The Greenplum “Big Data” Cloud Warehouse

The Data Warehouse space has been red hot lately. Everyone knows the top tier players, as well as the emergents. What have become substantial issues are the complexity of scale/growth of enterprise analytics (every department needs one) and the increasing management burden that business data warehouses are placing on IT. Like the wild west, a business technology selection is made for “local” reasons, and the more “global” concerns are left to fend for themselves. The trend toward physical appliances has only created islands of data, the ETL processes are ever more complex, and capital/opex efficiencies are ignored. Index/Schema tuning has become a full time job, distributed throughout the business. Lastly, these systems are hot because they are involved in the delivery of revenue… anyone looking at SARBOX compliance?

Today EMC announced the intent to acquire Greenplum Software of San Mateo, CA. Greenplum is a leading data warehousing company with a long history of exploiting the open-source Postgres codebase, with a substantial amount of work in taking that codebase to a horizontal scale-out architecture, along with a focus on novel “polymorphic data storage”, which supports new ways to manage data persistence and provides deep structural optimizations including row, column and row+column at sub-table granularity*. In order to begin to make sense of EMC’s recent announcement around Greenplum, one must look at the trajectory of both EMC and Greenplum.

For EMC, with its VMware/Microsoft and Cisco alliances, and recent announcements around vMAX and vPlex, virtual storage becomes a dynamically provisionable, multi-tenant, SLA policy driven element of the cloud triple (Compute, Network, Storage). But it’s one thing to just move virtual machines around seamlessly and provide consolidation and improved opex/capex: those are IT improvements. In my mind “virtual data” is all about end-user (and maybe developer) efficiency… giving every group within the enterprise the ability to have their own data either federated to, or loaded into, a data platform, where it can be appropriately* shared with other enterprise users as well as enterprise master data. The ability to “give and take” is a key value in improving data’s “local” value, and the ease with which this can be provisioned, managed, and of course analyzed defines an efficient “Big Data” Cloud (or Enterprise Data Cloud in GP’s terms).

The Cloud Data Warehouse has some discrete functional requirements, the ability to:

  • create both materialized and non-materialized views of shared data… in storage we say snapshots
  • subscribe to a change queue… keeping these views appropriately up to date, while appropriately consistent
  • support the linking of external data via load, link, link & index to accelerate associative value
  • support mixed mode operation… writes do happen and will happen more frequently
  • accelerate linearly with addition of resources in both the delivery of throughput and the reductions in analytic latency
  • exploit analyst natural language… whether SQL, MapReduce or other higher level programming languages
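
The first two requirements can be illustrated with a small sketch. This uses Python’s built-in sqlite3 module purely as a stand-in (it is not Greenplum), and the table, view and function names are hypothetical. A plain view is recomputed on every query, a snapshot table plays the role of the materialized view, and a trigger-fed change queue keeps the snapshot up to date when it is drained.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create an in-memory database with a base table, both kinds of view,
    and a change queue populated by a trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);

        -- Non-materialized view: evaluated against the base table on every query.
        CREATE VIEW sales_by_region AS
            SELECT region, SUM(amount) AS total FROM sales GROUP BY region;

        -- Stand-in for a materialized view: a snapshot table we maintain ourselves.
        CREATE TABLE sales_by_region_mat (region TEXT PRIMARY KEY, total REAL NOT NULL);

        -- Change queue: every insert into the base table is recorded here.
        CREATE TABLE change_queue (
            seq INTEGER PRIMARY KEY AUTOINCREMENT, region TEXT, delta REAL);
        CREATE TRIGGER sales_insert AFTER INSERT ON sales
        BEGIN
            INSERT INTO change_queue (region, delta) VALUES (NEW.region, NEW.amount);
        END;
        """
    )
    return conn


def apply_changes(conn: sqlite3.Connection) -> int:
    """Drain the change queue into the snapshot table; return the number applied."""
    changes = conn.execute(
        "SELECT seq, region, delta FROM change_queue ORDER BY seq").fetchall()
    for seq, region, delta in changes:
        updated = conn.execute(
            "UPDATE sales_by_region_mat SET total = total + ? WHERE region = ?",
            (delta, region))
        if updated.rowcount == 0:  # first change seen for this region
            conn.execute(
                "INSERT INTO sales_by_region_mat (region, total) VALUES (?, ?)",
                (region, delta))
        conn.execute("DELETE FROM change_queue WHERE seq = ?", (seq,))
    conn.commit()
    return len(changes)


if __name__ == "__main__":
    db = build_warehouse()
    db.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)",
                   [("east", 10.0), ("west", 5.0), ("east", 2.5)])
    print("live view:", db.execute("SELECT * FROM sales_by_region").fetchall())
    print("snapshot before refresh:",
          db.execute("SELECT * FROM sales_by_region_mat").fetchall())
    apply_changes(db)
    print("snapshot after refresh:",
          db.execute("SELECT * FROM sales_by_region_mat ORDER BY region").fetchall())
```

The snapshot is deliberately stale until the queue is drained; how often to drain it is exactly the “appropriately up to date, appropriately consistent” trade-off.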

These functions drive some interesting architectural considerations:

  • Exploit Massively Parallel Processing (MPP) techniques for shared minimal designs
  • Federate external data through schema & data discovery models, building appropriate links, indices and loads for optimization & governed consistency
  • Minimize tight coupling of schemas through meta-data and derived transformations
  • Allow users to self provision, self manage, and self tune through appropriately visible controls and metrics
    • This needs to include the systemic virtual infrastructure assets.
  • Manage hybrid storage structures within single database/table space to help ad-hoc & update perform
  • Support push down optimizations between the database cache and the storage cache/persistency for throughput and latency optimization
    • From my perspective, FAST = Fully Automated Storage Tiering might get some really interesting hints from the Greenplum polymorphic storage manager
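
As a rough illustration of the MPP consideration at the top of this list, here is a toy scatter/gather aggregation. It is my own sketch, not Greenplum’s implementation: threads stand in for segment hosts, and the row layout and function names are hypothetical. Rows are hash-distributed on the grouping key, each segment aggregates only its own rows, and the coordinator merges the partial results.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def segment_for(key: str, n_segments: int) -> int:
    """Pick a segment with a stable hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % n_segments


def distribute(rows, n_segments):
    """Scatter rows across segments by hashing the 'customer' column."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        segments[segment_for(row["customer"], n_segments)].append(row)
    return segments


def local_aggregate(segment_rows):
    """Work done independently on one segment: sum amounts per customer."""
    totals = defaultdict(float)
    for row in segment_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def parallel_total_by_customer(rows, n_segments=4):
    """Scatter, aggregate on every segment in parallel, then gather."""
    segments = distribute(rows, n_segments)
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partials = list(pool.map(local_aggregate, segments))
    merged = {}
    for partial in partials:
        # Keys never overlap between segments because the distribution key
        # is the same column we group by, so a plain update is enough.
        merged.update(partial)
    return merged


if __name__ == "__main__":
    sample = [
        {"customer": "acme", "amount": 10.0},
        {"customer": "globex", "amount": 4.0},
        {"customer": "acme", "amount": 2.5},
    ]
    print(parallel_total_by_customer(sample, n_segments=3))
```

When the grouping key differs from the distribution key, the gather step has to combine overlapping partial sums (or redistribute rows first), which is where much of the real optimizer work goes.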

Overall, the Virtual “Big Data” Cloud should be just as obvious an IT optimization as VDI and virtual servers are. The constraints are typically a bit different as these data systems are among the most throughput intensive (Big Data, Big Compute) and everyone understands the natural requirements around “move compute to the data” in these workloads. We believe that, through appropriate placement of function, and appropriate policy based controls, there is no reason why a VBDC cannot perform better in a virtual private cloud, and why the boundaries of physical appliances cannot be shed.

Share your data, exploit shared data, and exploit existing pooled resources to deliver analytic business intelligence; improve both your top line, and bottom.


BCDR Myths – May the Disasters ensue

While I was on Wikibon’s website, I saw an article, 4 Myths in BCDR, that piqued my interest. Myths, fallacies, truth-sayers seem to expose reality in ways that facilitate a new perspective. Of real interest is the statement:

BCDR is a business discipline enabled by technology. Technology creates business risk and the need for the BCDR discipline (e.g., system, data center, network failures), while at the same time enabling the discipline with capability and functionality to recover in the event of a loss (e.g., backup and recovery systems).

I couldn’t agree more. I’ve seen egregious programs where IT is leading a BCDR initiative without a firm grasp of the specific business risks, and therefore without being able to justify the resources (capital and operational) of the initiative.

That said, and as EMC has mentioned numerous times lately, vPlex makes it possible to transition traditional Active->Passive SRD “copies” into active->active consistent volumes. Now we can start talking about the DR site being more than a tax, and about making the non-production copy effective for classification: whether eDiscovery, Enterprise Search, or even Business Intelligence, it gives the ability to do classification styled analytics with substantially less impact on the production instance.



ETL & Hadoop/Map-Reduce… a match made in Orlando!

I’ve been thinking hard as of late on the challenges associated with exploiting massively parallel Hadoop/Map-Reduce clusters for analytics. As most know, the NoSQL movement has been growing at a strong pace. What very few seem to want to talk about is how NoSQL can actually present an analytic query language. Yes, the xQL…

We all know that MR is great for limited schema, large cardinality data, but DWHs typically have stronger schemas and substantial dimensional data, not to mention normal forms. Today Pentaho Corporation has released capabilities into its BI suite which extend their ETL (Pentaho Data Integration – PDI) to support processes that exploit (read and write) Hadoop structures. In talking with James Dixon, their CTO, the next step is to support a richer set of analytic query languages.

Press Release: Pentaho… Analytics & MR

MR is well suited for simple query tasks, but analytic workloads make extensive use of meta-data and dimension tables to optimize analytic performance and consistency. In a simple Tuple-Store model (name-value pair), this is a bit of a challenge, as is the availability of structural meta-data that helps provide basic typing and vocabulary mapping to an appropriate dictionary. Some warehouse implementations, like Hive, leverage a meta-store to define basic primitive types, which are recursively composed through maps/lists and vectors, and further support inspectors/evaluators for basic predicate operations across these type models. This meta-data, whether co-located or adjacent to the fact data, provides a valuable layer for query and analytics as we move from strongly typed, fully structured systems to late/lazy/loosely typed stores. It’s well known that many emerging DWH vendors (Aster Data, Greenplum, ParAccel, and Vertica) are listening to the NoSQL crowd, and it’s great to see the BI crowd begin to look at new ways to manage the analytic information across the data landscape.
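
To make the meta-data point concrete, here is a toy, single-process sketch of a map/reduce job over loosely typed name-value records. It does not use Hive’s or Hadoop’s actual APIs; METASTORE, STORE_DIM and the function names are hypothetical stand-ins for a meta-store and a dimension table.

```python
from collections import defaultdict

# Hypothetical meta-store: maps a field name to the type it should be read as.
METASTORE = {"store_id": int, "amount": float}

# Hypothetical dimension table: store_id -> sales region.
STORE_DIM = {1: "Northeast", 2: "West"}


def typed(record):
    """Late typing: apply meta-store types to a record of raw strings.
    Fields the meta-store does not know about stay as strings."""
    return {name: METASTORE.get(name, str)(value) for name, value in record.items()}


def map_phase(record):
    """Emit (region, amount), resolving the region through the dimension table."""
    row = typed(record)
    region = STORE_DIM.get(row["store_id"], "Unknown")
    yield region, row["amount"]


def reduce_phase(region, amounts):
    """Sum all amounts emitted for one region."""
    return region, sum(amounts)


def run_job(records):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    raw = [
        {"store_id": "1", "amount": "10.5"},
        {"store_id": "2", "amount": "4"},
        {"store_id": "1", "amount": "1.5"},
        {"store_id": "9", "amount": "2"},  # no matching dimension row
    ]
    print(run_job(raw))
```

Without the meta-store, every mapper would have to carry its own knowledge of which strings are numbers and which keys join to which dimension, which is the consistency problem described above.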

Great job Pentaho team, and I look forward to discussing your analytic strategy!


Fallacies 3.0

I was given the unique opportunity to brief participants in CareCore National’s user conference late last month, in a panel discussion led by CCN’s CTO Bill Moore (@BowTie) around technology trends affecting the delivery of quality care through consistent, data-driven outcome evaluation. Presenting with Cisco and AT&T, I had the opportunity to talk about Internet Scale Data analytics, and the emergence of federated exchanges. These federated exchanges provide the opportunity to combine information from multiple sources, supporting the delivery of DSS enhanced information pipelines. We have the unique opportunity to look at an information driven event model, and to derive from this aggregated information base the optimal processing of each event.

Enabled by virtualization – for operational efficiency, and standards – for vocabulary normalization, and with a bit of magic, each participant can leverage his or her own data, along with that from other participants, to make better decisions and, through improvements in path planning and process optimization, reduce the cost of care while optimizing positive outcomes.

During this session, I reviewed what is becoming an increasingly common topic… what assumptions are typically incorrect, and how do we begin to leverage these anti-patterns to help us accelerate toward the right solutions:

Enterprise Information Fallacies (part 3)

  • The information that you need is information that you own completely
  • The quality of information is known and high
  • You have sufficient volume of information for low standard deviation
  • Foreign dictionaries/schemas are easily mapped across domains
  • Your ETL and OLTP, structured and unstructured infrastructures are coherent and consistent
  • ERWIN model(s) is/are sufficient to govern complex information
  • Controls are always correctly enforced by the business application
  • Information is static and non-perishable
  • Information contributors have similar contexts

Federating Information Exchanges serve to lower the friction on the sharing of information, while enabling increased controls. The biggest problems still are the non-functional aspects of the exchange… scalability, availability, reliability (non-repudiable), security… without these properties it becomes nearly impossible to use any derived information. Thus my key focus.


RSNA Presentation TidBits

RSNA was an exceedingly interesting show this year, featuring the concept of a framework for integration based upon a Service Oriented Architecture (SOA), and more interest in meaningful use and the cross-integration of clinical platforms. I feel like we’re on the verge of a substantial trend toward integration for both the delivery of care as well as institutional operational enhancements.

a few pics here

21st Century Health Care?

Interesting use cases abound for Master Data Management, but few hit as close to home as Medical Health Records. Though known colloquially as Personal Health Records (PHR), Electronic Health Records (EHR) or even as Employee Health Records (also EHR)… there is a substantial need across the care triumvirate (Patient-Provider-Payer) to begin to align content across purposes in a non-leaky way. A very interesting company – MedCommons – has been pushing for a Continuity of Care Record (CCR) designed to help pull digital content “critical to the continuity of care” so that a patient can be transferred across providers with high fidelity. To my “digital” dismay, it seems that the majority of records are hand carried. Just ask an armed forces person about their medical jacket that they hand carry from point-to-point across their care continuum. How absurd is it that online services like Spokeo can assemble your online footprint, but no one seems obliged to offer similar aggregation for the treatment of disease and the proactive management of health?

I mean, how stupid that we cannot agree on a non-deprecating set of formats and exchanges for the appropriate sharing of health records across participants. Sure there are those worried that pre-existing conditions will impact their ability to receive insurance, or that a PCP may learn of a pre-existing STD that might through loose lips sink a marriage, but these are merely obstacles in need of the proper application of policy driven “opt-in” authorizations.

At the end of the day, if everything is auditable, my decision to withhold information becomes MY decision, and I’ll need to take responsibility for it. Similarly, a provider’s/payor’s access also brings a certain set of responsibilities. In the “trust, but verify” culture of which we are a part, it seems that loose federations, policy management/enforcement, as well as non-repudiable audits and intentional design bear the right set of controls to build these systems.
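
A minimal sketch of what policy driven “opt-in” plus an audit trail could look like. The class and method names are hypothetical and no real health-record system is implied. Note the limits of the sketch: chaining entries by hash only makes the log tamper-evident; true non-repudiation would also need each entry to be digitally signed by the party performing the action.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first audit entry


class HealthRecordStore:
    """Toy record store: reads require an explicit patient opt-in grant,
    and every grant and read attempt is appended to a hash-chained log."""

    def __init__(self):
        self._records = {}    # patient -> {category: value}
        self._grants = set()  # (patient, requester, category)
        self.audit_log = []   # each entry embeds the hash of the previous one

    def put(self, patient, category, value):
        self._records.setdefault(patient, {})[category] = value

    def opt_in(self, patient, requester, category):
        """Patient authorizes one requester for one category of data."""
        self._grants.add((patient, requester, category))
        self._audit("grant", patient, requester, category, True)

    def read(self, requester, patient, category):
        """Return the record if a grant exists; log the attempt either way."""
        allowed = (patient, requester, category) in self._grants
        self._audit("read", patient, requester, category, allowed)
        if not allowed:
            raise PermissionError(f"{requester} has no opt-in for {patient}/{category}")
        return self._records[patient][category]

    @staticmethod
    def _digest(body):
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

    def _audit(self, action, patient, requester, category, allowed):
        prev = self.audit_log[-1]["hash"] if self.audit_log else GENESIS
        body = {"action": action, "patient": patient, "requester": requester,
                "category": category, "allowed": allowed, "prev": prev}
        self.audit_log.append({**body, "hash": self._digest(body)})

    def verify_audit(self):
        """Recompute the chain; False if any entry was altered or reordered."""
        prev = GENESIS
        for entry in self.audit_log:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if body["prev"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Because denied reads are logged as well, both sides of the bargain are covered: the patient’s choice to withhold is on the record, and so is every access a provider or payor attempts.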

It is astonishing that providers order additional, oft-repeated tests, for self-protection against malpractice. That providers sit idly by, not giving patients the ability to archive and share their disease/treatment portfolios, in a way that might offset large segments of the cost…. If my specialist had a copy of the tests run by my PCP, would they ask for a new assay? Would it be enough that we could know/record that they viewed the prior results and therefore be absolved of negligence?

I’m sure that I’m oversimplifying, but when does the 80:20 rule take effect? Can we save 80% by managing the specific treatment of the 20? I’m not talking about taking unfounded risks, but merely about exercising better judgement. With patients surfing the internet, we have to think that an educated “consumer” is a “good” consumer. Why not support the broad – population wide – “best treatment” capabilities afforded by statistically significant populations… by anonymizing our PHRs, enabling them to be mined, determining which diagnostics were deterministic (and which wasteful), which treatments produced the best outcomes and what the options are?

Here we are with 21st century IT in retail lending and personalized advertising. But we, health consumers, seem to be counting on the limited experience of our physicians as the only director of our own mortality, when, let’s face it, the analytic resources and data do exist. Why doesn’t insurance-style actuarial analysis make it into our own wellness management portfolios, helping us consumers lead healthier lives – giving the “smart consumer” the choice that they desire…. not choice assembled by the griping few, or by the chance of a caring soul who shares a disease and blogs their experience, but rather by an industry that takes the initiative to make a broad push to improve the 80%? And who knows, this probably applies to the trailing 20% as well.

Spoken by a dad, whose spouse has spent countless nights researching diseases based upon observed symptoms, leading her to substantial enlightenment. This knowledge, in turn has, more than once, provided her with information that our family’s PCP didn’t have the time to accumulate through either experience or training.

Fallacies of Enterprise Information Management (part deux)…

With some hearty comments from Tom Maguire, I’ve been forced to adjust some of these fallacies:
1. Data quality is perfect - data is correct, complete and coherent across all enterprise contexts
- People will remediate bad data – if inaccuracies are found (contrary to the axiom above) users will willingly and proactively make changes, and all users will agree with those changes
2. Relationships are Known – The linkages between data entities are well known, hierarchical, navigable and everlasting
3. There is a singular master model that is explicitly and consistently factorable for all enterprise uses
- One dictionary – There is a consistent dictionary with well agreed and complete set of meta-data supporting the modeled domain
- Static model – The model is complete, and no changes will ever be necessary
4. Expectation of XA/2Phase Transactionality – The data exchanges will be ACIDly transactional, and based upon XA/2-phase transactional mechanisms
- Transactions complete in a timely fashion and are not affected by Deutsch’s fallacies
6. Idempotent Data – There is one master copy of enterprise data, and application specific “caches” are always synchronized and consistent
As I am currently working on more consistent “information exchanges”, these fallacies have been driving some specific architectural artifacts, including:
- the need for appropriately targeted abstractions providing consistent to/from canonical forms,
- the need to support similar meta data/policy models and transformations in support of context bridges and securitization
- support for multi-master synchronization
- needs for distributed model governance, non-destructive model mutation and potentially late-modeled forms
(though OWL/RDF brings some new challenges in transactional systems wrt. transitive closure)