The need to extend SIAM for the Digital Enterprise

In the new “as a Service” Enterprise, we talk aggressively about Service Integration becoming a combinator of core enterprise services, contextual services, and utility services (see Simon Wardley’s mapping taxonomy).  Core enterprise services are those that support the core industry: ERP for manufacturing, Core Banking for retail banks, Policy Management for Insurance and the like. These enterprise services provide access to core transactional services that operate to the industry regulations and corporate policies of the core enterprise business.  Contextual services are distinct, differentiated services that engage employees, partners and customers and provide the “secret sauce” of value-driven differentiation.  Diametrically opposed to these contextual services are utility services: services that are substantially undifferentiated, support multiple businesses and are provided as commodities (email, voice, IaaS, BURA), leveraging the scale of a demand pool to drive costs down toward the cost of power plus long-term debt service.

As we begin to think about Service Integration and Management (SIAM), I actually think that today’s paradigm is net-incomplete. We certainly need to manage vendors/suppliers in service-oriented and quantitative ways, but we also need to think more extensively about our approach: namely, using the Continuous Delivery model from the SDLC to help us operate within the aggressive cycle-time needs of the business and of continuous improvement.  What as a Service enterprises are really looking for is Managed Service Integrations (MSI), with full operational data commonly and consistently captured and analyzed.  Where SIAM provides frameworks and processes for integration, supporting procurement, contracts, risk and governance, the digitization of SIAM really looks more like MSI: a more automated, digital and quantitative process driven by full-scope operational analytics.
For me, as we begin to discuss “Services”, it becomes important to think about the end-state business and IT architecture and the natural move toward being a “Service Provider” to the business. With this comes a strategy that begins to morph ITSM toward Operational Support Services (OSS) / Business Support Services (BSS), and the desire to automate like crazy, with aggressive operational analytics guiding that automation.  This shift of the operating model toward massive automation, via consistency of deployed solutions on top of substantially less differentiated platforms, creates key economic advantages in OpEx and CapEx, but also results in higher predictability and thereby lower business risk.  But this Service Provider model does come at a cost: namely, hands need to come off keyboards as Continuous [Automated] Deployment and Continuous Monitoring emerge to improve ongoing operations via analytics on operational log data.
One of the key ongoing challenges is the ever-increasing complexity of the enterprise. SIAM is designed to address complexity via a governance strategy, but it is still largely dependent on people, its Achilles’ heel.  My proposal is to create a fully digital, auditable and requirements-driven (managed) set of enterprise service integrations.  This approach requires a few core elements:
  1. a Governance, Risk and Compliance (GRC) platform,
  2. a declarative and automated delivery platform that can automate policy conformance and the placement of controls,
  3. the adoption of a programmatic release management paradigm (today’s SIAM),
  4. the systematic, continuous logging of key operational services and controls,
  5. analytic reporting that can round-trip to the GRC platform to expose real-time risks, and
  6. GOTO 1.  The risks then generate the need to amend the delivery platform, which then updates/upgrades/replaces current services… and the looping continues.
With this virtuous cycle of consistent service improvement, both functional (business-driven) integrations and non-functional use-cases, including things like regulation, corporate policy and contract conformance, now become digitized and continuously monitored and improved… in effect governing digital business services, decreasing operational risks, and overtly complying with appropriate policies.
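To make the loop concrete, here is a minimal sketch, in Python, of how such a round-trip might be wired together. Everything in it is hypothetical (the class names, the payloads, the single “require tls” control); none of it is a real GRC or delivery-platform API. It simply shows the shape of the cycle: deploy from declared controls, collect operational logs, analyze them against those controls, and feed the resulting risks back into the GRC register.

    # Hypothetical sketch of the GRC -> deploy -> log -> analyze -> GRC loop.
    # These stand in for an eGRC platform, a declarative delivery platform,
    # and a log analytics service; they are not real product APIs.
    from dataclasses import dataclass, field

    @dataclass
    class Control:
        control_id: str          # e.g. a corporate policy or regulatory control
        rule: str                # declarative expression the platform can enforce

    @dataclass
    class Risk:
        control_id: str
        detail: str

    @dataclass
    class GRCRegister:
        controls: list = field(default_factory=list)
        open_risks: list = field(default_factory=list)

        def amend(self, risks):
            """Round-trip analytic findings back into the register (steps 5/6)."""
            self.open_risks.extend(risks)

    def deploy(controls):
        """Steps 2/3: declaratively place controls into the delivered services."""
        return [{"service": "payments-api", "controls": [c.control_id for c in controls]}]

    def collect_logs(deployment):
        """Step 4: continuous logging of key operational services and controls."""
        return [{"service": d["service"], "event": "tls_disabled"} for d in deployment]

    def analyze(logs, controls):
        """Step 5: turn operational events into risks against declared controls."""
        return [Risk(c.control_id, f"{log['event']} on {log['service']}")
                for log in logs for c in controls if "tls" in c.rule]

    grc = GRCRegister(controls=[Control("POL-017", "require tls for data in transit")])
    deployment = deploy(grc.controls)
    risks = analyze(collect_logs(deployment), grc.controls)
    grc.amend(risks)          # ...and the looping continues (GOTO 1)
    print(grc.open_risks)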
With this in place, investments in DevOps/automated orchestration, policy-driven Cloud Management Platforms (like the CSC Agility Platform), release management, operational logging / analytic platforms (like CSC BDPaaS for common distributed logging, and SIEM tools such as Splunk/ArcSight) and eGRC platforms (RSA Archer) all begin to fit together into an MSI future state.

The result: a continuously improving, fully digital, as a Service enterprise running fully compliant with industry regulations, corporate policies and business contracts.

Infochimps joins the CSC troop


Today, I am finally able to talk about a really cool company that we’ve been working with to bolster CSC “Big Data” Solutions in the marketplace.  One of the first valuable use-cases is to enable big-fast data clouds to be automagically built; in effect, to build a data lake haystack “as a Service”:

“without a haystack, there is no needle to find”…

And to enable our customers to quickly “Self”-provision, attach, stream, evaluate, store, and analyze their emerging big data workloads.

Storing the data in the cloud [via Apache Hadoop], developing insights from the stored data, and enabling those insights to become actionable through promotion into the streaming ingest services [via Storm / Kafka].
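On the ingest side, here is a minimal sketch of what streaming events into Kafka looks like, assuming the kafka-python client; the broker address, topic name and event shape are all made up for illustration.

    # Minimal sketch of the streaming-ingest side, assuming the kafka-python
    # client. Broker address, topic name and event shape are hypothetical.
    import json
    from kafka import KafkaProducer   # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="broker-1:9092",                     # hypothetical broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    event = {"sensor_id": "pump-42", "reading": 7.3, "ts": "2013-08-05T12:00:00Z"}
    producer.send("ingest.raw-events", value=event)            # hypothetical topic
    producer.flush()

On the consuming side, a Storm topology (or a batch Hadoop job over the stored data) would pick these events up; that half is omitted here.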


InfoChimps + CSC = Big [Fast] Data Made Better

Welcome Infochimps: Jim, Joe, Flip, Dhruv, Adam and the rest of the Austin team.  I have to say that this is an amazingly cool company, one that I believe generates massive synergies with CSC’s vertical industry knowledge, existing Big Data offers, and other key R&D initiatives that we have going on, including our Cloud 2.0 services / Internet of Things work.  They add some Chimp magic, some “open” magic sauce, including: WuKong (an open framework for Ruby streaming in Hadoop), IronFan (an open orchestration project for provisioning big data services on clouds), and Configliere (an open configuration project that makes project/module configuration more secure and more functional).  Their proven ability to stand up big data clusters on clouds and manage them with high availability establishes a key link in the overall cloud and big data lineup for CSC.
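For readers who haven’t seen the Hadoop Streaming model that WuKong wraps for Ruby, here is the classic word-count pair sketched in Python instead; the file name and launch command are indicative only. The point WuKong exploits is the same: any script that reads stdin and writes tab-separated key/value pairs to stdout can serve as a map or reduce step.

    # wordcount_streaming.py -- illustrative Hadoop Streaming job in Python,
    # analogous in spirit to what WuKong provides for Ruby. Run (roughly) as:
    #   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
    #       -mapper "python wordcount_streaming.py map" \
    #       -reducer "python wordcount_streaming.py reduce"
    import sys
    from itertools import groupby

    def map_phase():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word.lower()}\t1")

    def reduce_phase():
        # Hadoop delivers reducer input sorted by key, so groupby works here.
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        map_phase() if sys.argv[1] == "map" else reduce_phase()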

I do love this from their site: “Big Data is more than Hadoop”, along with a bit on the transition from their CEO, Jim Kaskade.

This is going to be fun!

Puppet, DevOps and Big Data

I was really excited to be asked to deliver one of the keynotes at Puppet Labs’ PuppetConf this year.  My initial excitement came from seeing that the DevOps movement, capitalized on by system administrators to make their lives easier, had parallels in the life of the software developer and information architect as they try to create automation and reproducibility within their own domains.

Reflecting on how to simplify, automate and increase the pace of information integration drove me to talk about the transitive uses of DevOps models in new domains.

Thanks to the amazing Puppet community for offering me the soapbox to discuss some of the parallels that I am seeing as we [EMC] try to make information integration more agile, predictable and reproducible in the face of increasing growth, complexity and urgency.

Also, I want to thank a few of the critical people who have contributed massive value to my thinking, including the Greenplum team working on the K2 cluster (especially Milind and Apurva), as well as Nick Weaver and Tom McSweeney, who have worked tirelessly to make Razor not just real but a leading provisioning platform for operating systems in large-scale environments, including provisioning Windows like never before!

wwHadoop: Analytics without Borders


This week at EMC World 2012, the EMC technical community is launching a community program called World Wide Hadoop.  I am really excited to be a part of a collaboration across the EMC technical community that has been looking to extend the “borders” of our Big Data portfolio by building on the success of our Greenplum Hadoop distribution and offering the open source community the ability to federate their HBase analytics across a distributed set of Hadoop clusters.


In the past month, the EMC Distinguished Engineer community has been collaborating with our St. Petersburg [RUS] Center of Excellence to demonstrate the ability to distribute analytic jobs (move the code vs. moving the data) across multiple [potentially] geographically dispersed clusters, manage those jobs, and join the results.
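As a toy illustration of the “move the code, not the data” pattern, here is a Python sketch in which the same (hypothetical) aggregation is submitted to each cluster and only the small per-cluster results travel back to be joined; submit_job() is a stand-in, not a real wwHadoop API.

    # Toy illustration of "move the code vs. moving the data": submit the same
    # aggregation to each (hypothetical) cluster endpoint, then join only the
    # small per-cluster results. submit_job() is a stand-in, not a real API.
    from collections import Counter

    CLUSTERS = ["hadoop-boston", "hadoop-spb", "hadoop-singapore"]   # hypothetical

    def submit_job(cluster, job):
        """Pretend remote execution: each cluster returns partial counts."""
        # In reality this would ship the job jar/script to the remote cluster.
        return Counter({"error": 10, "ok": 500}) if "count" in job else Counter()

    partials = [submit_job(c, "count events by status") for c in CLUSTERS]

    # Join the results: partial aggregates combine without moving the raw data.
    total = sum(partials, Counter())
    print(total)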

The big problem that we are addressing is that of Reed’s law, and the combinatorial value of networked resources, in our case, information sets.


“[E]ven Metcalfe’s law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2^n. So the value of a GFN increases exponentially, in proportion to 2^n. I call that Reed’s Law. And its implications are profound.”
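To see how quickly group-forming value outruns pairwise value, here is a quick comparison of Metcalfe-style value (roughly n^2 pairs) against Reed-style value (2^n possible groups):

    # Quick comparison of Metcalfe-style pairwise value (~n^2) with Reed-style
    # group-forming value (2^n) for a handful of network sizes.
    for n in (10, 20, 30, 40):
        print(f"n={n:>3}  pairs~{n * (n - 1) // 2:>6}  groups=2^n={2 ** n:,}")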

A few of our Big Data Challenges


  • Valuable information is produced across geographically diverse locations
  • The data has become too big to move [thus we need to process in place]
  • Scientists and analysts have begun to move partial sets vs. full corpora to try and save time
  • But this partial data can, and often does, create inadvertent variance or occlusions in correlation and value

EMC is demonstrating a working distributed cluster model for analytics across multiple clusters.


We want to work on this with the open community, as we believe that there is tremendous value in enabling the community to both derive and add value with EMC in this space, and we invite all to join us at http://www.wwhadoop.com.

Big Data and Healthcare: Correlate, Root Cause, Operationalize & Disrupt

I have to say that it’s been massively fun lately at EMC talking about the impact of Big Data Analytics on vertical markets. It has been fun learning from Chuck’s experience / scars associated with advocating for disruptive change in the IT market.

Recently, Chuck blogged about the emerging ACOs and the potential disruption that information and cost/benefit-aligned markets can provide to markets like healthcare.

I commented on his blog, and wanted to share the insight with my audience as well:

Chuck,
Thanks for your insightful “state of healthcare” review. I find that there is one more facet to healthcare that makes the data/information analytic landscape even more compelling, one captured in an interesting book by David Weinberger, Too Big to Know.

Dr. Weinberger brings forward the notion that books and traditional learning are discontinuous, and that there is emerging, due to the hyperlinked information universe, a massive ecology of interconnected fragments that continually act with the power of positive and negative reinforcement. The net is that traditional views, or in David’s vocabulary “Facts”, are largely based upon constrained reasoning, and that as the basis of this reasoning changes with the arrival of new facts, so should their interpretations.

There are a number of entities sitting on top of 30 years of historic clinical data, the VA, certain ACOs and certainly academic medical centers, that can chart many of the manifest changes in treatment planning and outcome and, within their constraints, present significant correlation. Just the same, they may not be able to discern accurate causality due to a lack of completeness of the information base: maybe genomics, proteomics, or the very complex nature of the human biological system.

The interconnectedness of the clinical landscape is of paramount value in the establishment of correlation and derived causality, and with the arrival of new information, traditional best practices can be [and need to be] constantly re-evaluated.

I believe that this lack of causal understanding, due to highly constrained reasoning, has led to many of the derogatory statements about today’s outcome-based care model. As today’s outcome measures are based more on correlation than on causality, determining causal factors, and then understanding the right control structures, should improve the operationalization of care. I further believe that the availability of substantially more complete clinical information, across higher percentages of populations, will lead to improvements in these outcome measures and the controls that can affect them. The net result of getting this right is improved outcomes at decreased costs, effectively turning the practice of medicine into a big data science with a more common methodology and more predictable outcomes. In effect, full circle, operationalized predictive analytics.

The opportunities for market disruption are substantial, but high barriers remain. One constant in being able to exploit disruptions is the right visionary leader, one who has the political capital and will, but who is also willing to be an explorer, for change is a journey and the route is not always clear. Are you a change agent?

Join me at the 2nd Annual Data Science Summit @ EMC World 2012

Big Data Exchanges: Of Shopping Malls and the Law of Gravity

Working for a Storage Systems company, we are constantly looking at both the technical and the social/marketplace challenges to our business strategy. This led to EMC’s coining of “Cloud Meets Big Data” last year; EMC has been looking at the trends that “should” tip the balances toward real “Cloud Information Management”, as opposed to the “data management” that really dominates today’s practice.

There are a couple of truisms [incomplete list]:

  1. Big Data is Hard to Move = get optimal [geo] location right the first time
  2. Corollary = Move the Function, across Federated Data
  3. Data Analytics are Context Sensitive = meta-data helps to align/select contexts for relevancy
  4. Many Facts are Relative to Context = declare contexts of derived insight (provenance & the Scientific Method)
  5. Data is Multi-Latency & needs Deterministic support for temporality = key declarative information architectural requirement
  6. Completeness of Information for Purpose (e.g. making a decision) = dependent on stuff I have, and stuff I get from others, but everything that I need to decide

I believe that 1) and 6) above point to an emerging need for Big Data Communities to arise, supporting the requirements of the others, whether we talk about these as communities of interest or as Big Data Clouds. There are some very interesting analogies that I see in the way we humans act; namely, the Shopping Mall. Common wisdom points to the mall as providing improved shopping efficiency, but also, in the case of inward malls, a controlled environment (think walled garden). I think that both efficiency, in the form of “one stop”, and control are critical enablers in the information landscape.

[Slide: Big Data Mall] This slide from one of my presentations supports the similarities of building a shopping mall alongside the development of a big data community: things like understanding the demographics of the community (information needs, key values), the planning of roads to get in/out, and of course how to create critical mass = the anchor store.

The interesting thing about critical mass is that it tends to have a centricity around a key [Gravitational] Force. Remember:

Force = Mass * Acceleration (change in velocity).

This means that in order to create communities and maximize force you need Mass [the size/scope/scale of information] and improving Velocity [the timeliness of information]. In terms of mass, truism #1 above, and the sheer cost / bandwidth availability, make moving 100TB of data hard, and petabytes impracticable. Similarly, velocity change does matter: whether algorithmically trading on the street (you have to be in Ft. Lee, NJ or Canary Wharf, London) or treating a patient as a physician, the timeliness of access to emergent information is critical. So, correct or not, gravitational forces do act to geo-locate information.
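A quick back-of-the-envelope check on the “too big to move” point, assuming an idealized, fully utilized link with no protocol overhead:

    # Back-of-the-envelope transfer times, assuming an idealized, fully
    # utilized link with no protocol overhead.
    def transfer_days(terabytes, gbps):
        bits = terabytes * 1e12 * 8
        return bits / (gbps * 1e9) / 86_400

    for tb, link in [(100, 1), (100, 10), (1000, 10)]:
        print(f"{tb:>5} TB over {link:>2} Gbps ~= {transfer_days(tb, link):.1f} days")

Even under those generous assumptions, 100TB takes on the order of nine days at 1Gbps, and a petabyte takes about that long even at 10Gbps.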

Not to take my physics analogy too far, but Energy is also interesting. This could be looked at as “activity” in a community. For energy there are interesting kinetic and potential models. In the case of the Internet, the relative connectedness of the information required for a decision could be viewed in light of “potential”. Remember:

Ep (potential energy) = Mass x force of Gravity x Height (mgh)

In our case Height could be looked at as the bandwidth between N information participant sites, Mass as the amount of total information needed to process, and Gravity as a decentralization of information = the Outer Joins required for optimal processing. If I need to do a ton of outer joins across the Internet in order to get an answer, then I need to spend a lot of energy.

So if malls were designed for optimal [human] energy efficiency, then big data malls could do exactly the same for data.

Big Data Universe: “Too Big to Know”

I was surfing to WGBH on Saturday when I came across a lecture by David Weinberger (surrounding his new book Too Big to Know).

I was sucked in when he alluded to brick-and-mortar libraries as yesterday’s public commons, and pointed to the discontinuous and disconnected nature of books / paper. The epitaph may read something like this: “book killed by hyperlink, the facts of the matter are whatever you make them.”

Overall David, in his look at the science of knowledge, points at many interesting transitions that the cloud is bringing:

  • Books/paper are discontinuous and disconnected, giving way to the Internet, which is constantly connected and massively hyper/inter-linked
  • “The Facts are NOT the Facts”, which is to point out that arguments are what we make of the information presented, and of our analytics given a particular context. What we claim as a fact may in reality prove to be a fallacy -> just look at Louis Pasteur and germ theory. History has so many moments like this.
  • Differences and disagreements are themselves valuable knowledge. For me this is certainly true; learning typically comes through the challenge of preconception.
  • There is an ecology of knowledge: a set of interconnected entities that exist within an environment. These actors represent a complex set of interrelationships, a set of positive and negative reinforcements, that act as governors on this system. These promoters/detractors act to balance fact/fallacy so as to create the system tension that supports insight (knowledge?). It’s these new insights that themselves create the next arguments; the end goal being?

I wanted to share this book because I believe that it reinforces the need for businesses to think about their cloud / big data strategies. The question becomes less “do I move my information to the cloud?” and more “how do I benefit from the linkage that the Internet can provide to my information?”, so as to provide new insights from big data.

Read the book, take the challenge!

Greenplum the “Big Data” Cloud Database

It has been a long time since EMC completed the acquisition of Greenplum, and we have been mighty busy. I’ve met with the biggest and smallest of customers, and have heard literally 50s of feature / product requests. We’re truly listening: Hadoop, ETL, systemic management, BI/BA deep integrations, and improvements in multi-tenancy for governed derivative cubes and marts. Leave me a comment and tell me what you’re thinking. If you want me to keep it private, just put <private>text here</private> in the contents, and I’ll get back to you personally. Stay tuned, we’re up to something interesting.

The Greenplum “Big Data” Cloud Warehouse

The Data Warehouse space has been red hot lately. Everyone knows the top-tier players, as well as the emergents. What have become substantial issues are the complexity of the scale/growth of enterprise analytics (every department needs one) and the increasing management burden that business data warehouses are placing on IT. Like the wild west, a business technology selection is made for “local” reasons, and the more “global” concerns are left to fend for themselves. The trend toward physical appliances has only created islands of data, the ETL processes are ever more complex, and capital/opex efficiencies are ignored. Index/schema tuning has become a full-time job, distributed throughout the business. Lastly, these systems are hot because they are involved in the delivery of revenue… anyone looking at SARBOX compliance?

Today EMC announced the intent to acquire Greenplum Software of San Mateo, CA. Greenplum is a leading data warehousing company with a long history of exploiting the open-source Postgres codebase, with a substantial amount of work not only in taking that codebase to a horizontal scale-out architecture, but also in a focus on novel “polymorphic data storage”, which supports new ways to manage data persistence to provide deep structural optimizations, including row, column and row+column at sub-table granularity*. In order to begin to make sense of EMC’s recent announcement around Greenplum, one must look at the trajectory of both EMC and Greenplum.

With EMC’s VMware/Microsoft and Cisco alliances, and recent announcements around vMAX and vPlex… virtual storage becomes a dynamically provisionable, multi-tenant, SLA-policy-driven element of the cloud triple (Compute, Network, Storage). But it’s one thing to just move virtual machines around seamlessly and provide consolidation and improved opex/capex: IT improvements. In my mind “virtual data” is all about end-user (and maybe developer) efficiency… giving every group within the enterprise the ability to have their own data either federated to, or loaded into, a data platform, where it can be appropriately* shared with other enterprise users as well as with enterprise master data. The ability to “give and take” is key in improving data’s “local” value, and the ease with which this can be provisioned, managed, and of course analyzed defines an efficient “Big Data” Cloud (or Enterprise Data Cloud in GP’s terms).

The Cloud Data Warehouse has some discrete functional requirements, the ability to:

  • create both materialized and non-materialized views of shared data… in storage we say snapshots (see the small SQL sketch after this list)
  • subscribe to a change queue… keeping these views appropriately up to date, while appropriately consistent
  • support the linking of external data via load, link, or link & index to accelerate associative value
  • support mixed-mode operation… writes do happen, and will happen more frequently
  • scale linearly with the addition of resources, in both the delivery of throughput and the reduction of analytic latency
  • exploit the analyst’s natural language… whether SQL, MapReduce or other higher-level programming languages
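As a small illustration of the first two bullets, here is what materialized and non-materialized views look like in PostgreSQL-family SQL (the lineage Greenplum derives from), driven from Python via psycopg2. The connection string, table and view names are invented, and the exact syntax and refresh behavior vary by product and version, so treat this as a sketch rather than Greenplum documentation.

    # Illustrative only: materialized vs. non-materialized views in
    # PostgreSQL-family SQL, driven via psycopg2. Names are invented.
    import psycopg2

    conn = psycopg2.connect("dbname=warehouse")      # hypothetical connection string
    with conn, conn.cursor() as cur:
        # Non-materialized: always reflects the underlying shared fact table.
        cur.execute("""
            CREATE VIEW emea_sales AS
            SELECT * FROM sales_facts WHERE region = 'EMEA'
        """)
        # Materialized: a stored snapshot that must be refreshed to pick up changes
        # (the "subscribe to a change queue" bullet is about keeping this current).
        cur.execute("""
            CREATE MATERIALIZED VIEW emea_sales_daily AS
            SELECT sale_date, sum(amount) AS total
            FROM sales_facts WHERE region = 'EMEA' GROUP BY sale_date
        """)
        cur.execute("REFRESH MATERIALIZED VIEW emea_sales_daily")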

These functions drive some interesting architectural considerations:

  • Exploit Massively Parallel Processing (MPP) techniques for shared minimal designs
  • Federate external data through schema & data discovery models, building appropriate links, indices and loads for optimization & governed consistency
  • Minimize tight coupling of schemas through meta-data and derived transformations
  • Allow users to self provision, self manage, and self tune through appropriately visible controls and metrics
    • This needs to include the systemic virtual infrastructure assets.
  • Manage hybrid storage structures within a single database/table space to help both ad-hoc and update workloads perform
  • Support push down optimizations between the database cache and the storage cache/persistency for throughput and latency optimization
    • From my perspective, FAST (Fully Automated Storage Tiering) might get some really interesting hints from the Greenplum polymorphic storage manager

Overall, the Virtual “Big Data” Cloud (VBDC) should be just as obvious an IT optimization as VDI and virtual servers are. The constraints are typically a bit different, as these data systems are among the most throughput-intensive (Big Data, Big Compute) and everyone understands the natural requirement to “move compute to the data” in these workloads. We believe that, through appropriate placement of function and appropriate policy-based controls, there is no reason why a VBDC cannot perform better in a virtual private cloud, and why the boundaries of physical appliances cannot be shed.

Share your data, exploit shared data, and exploit existing pooled resources to deliver analytic business intelligence; improve both your top line and your bottom line.


ETL & Hadoop/Map-Reduce… a match made in Orlando!

I’ve been thinking hard as of late about the challenges associated with exploiting massively parallel Hadoop/Map-Reduce clusters for analytics. As most know, the NoSQL movement has been growing at a strong pace. What very few seem to want to talk about is how NoSQL can actually present an analytic query language. Yes, the xQL…

We all know that MR is great for limited-schema, large-cardinality data, but DWHs typically have stronger schemas and substantial dimensional data, not to mention normal forms. Today Pentaho Corporation has released capabilities into its BI suite which extend their ETL tool (Pentaho Data Integration, PDI) to support processes that exploit (read and write) Hadoop structures. In talking with James Dixon, their CTO, the next step is to support a richer set of analytic query languages.

Press Release: Pentaho… Analytics & MR

MR is well suited for simple query tasks, but analytic workloads make extensive use of meta-data and dimension tables to optimize analytic performance and consistency. In a simple tuple-store model (name-value pairs), this is a bit of a challenge, as is the availability of structural meta-data that helps to provide basic typing and vocabulary mapping to an appropriate dictionary. Some warehouse implementations, like Hive, leverage a meta-store to define basic primitive types, recursively composed through maps/lists and vectors, and further support inspectors/evaluators for basic predicate operations across these type models. This meta-data, whether co-located with or adjacent to the fact data, provides a valuable layer for query and analytics as we move from strongly typed, fully structured systems to late/lazy/loosely typed stores. It’s well known that many emerging DWH vendors (Aster Data, Greenplum, ParAccel and Vertica) are listening to the NoSQL crowd, and it’s great to see the BI crowd begin to look at new ways to manage analytic information across the data landscape.
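To make the tuple-store vs. typed-schema contrast concrete, here is a tiny, purely illustrative Python example: the same record first as an untyped bag of name-value pairs, then with a declared dictionary of primitive types that an engine can use to evaluate predicates properly. Hive’s meta-store plays roughly this role; nothing below is Hive API.

    # Purely illustrative: the same fact as an untyped name-value bag, and with
    # a declared schema that lets an engine type-check a simple predicate.
    raw = {"cust": "1042", "amount": "100.00", "region": "EMEA"}  # everything is a string

    schema = {"cust": int, "amount": float, "region": str}        # declared primitive types

    def typed(record, schema):
        """Apply the declared types so predicates compare values, not strings."""
        return {k: schema[k](v) for k, v in record.items()}

    row = typed(raw, schema)
    print(row["amount"] > 20)   # True; compared as strings, "100.00" > "20" would be False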

Great job Pentaho team, and I look forward to discussing your analytic strategy!
