An Awesome New Role

I’ve been invited to join an amazing executive team at CSC. Mike Lawrie, the CEO, and his team have asked me to lead the technologists at CSC toward three declared technical foci: Big Data, Cloud and Cyber-Security.

Let me first say that I am incredibly thankful to EMC, an amazing leadership team and some massive talent. During my 6 years I have seen massive innovation (from within), transformation (through execution), and opportunity (through the Pivotal Initiative). I’ve worked with people that I’ll always call friends, and Mike has told me that he hopes EMC and CSC can grow closer through this transition.

I’m both excited and sad. I even found this, the other day, in my wallet:

[Image: fortune1]

I look back, and recognize that I’ve been innovating and executing around service providers, cloud and big data now for 6 years at EMC. Now, I’m being given the opportunity to lead in this same technology landscape and effect a more substantial value production. For perspective, consider the Gartner and Forrester articles on Big Data and Cloud: they always said that 60-70% of the revenue was services.

Gartner: $29B will be spent on big data throughout 2012 by IT departments.  Of this figure, $5.5B will be for software sales and the balance for IT services.

But what they didn’t say is how rewarding solving real customer problems is. It’s truly awesome, and something I’m looking forward to! And it is my intention to participate with the CSC community, to deliver innovation and value to our team, our shareholders and our customers.

Thanks to everyone! I cherish the memories, and look forward to the challenges ahead!


Value Proposition for Virtual Networks: The Dilemma

As we think about the corporate data centers and the [Inter]networks that join them, we begin to recognize that there are a number of critical points at which information is accessed, produced, cached, transited, processed and stored. One critical feature of the emerging landscape is that mobility and the mega-bandwidth edge have changed the underlying network basis of the design of Web Scale systems, and how they overlay with traditional ISP and Carrier Network topologies:

On the location taxonomy:

  1. PREMISE: Person, Device, Service = Ethernet/IP
  2. EDGE: Mobile Base Station, Distribution Point, DSLAM = Network Edge IP->Metro Transport Network [IP/MPLS, xWDM]
  3. EDGE/METRO/BACKBONE: Regional Data Center = Metro ->Core Carrier Transport Network [IP/MPLS, xWDM, PON]
  4. BACKBONE/CORE: Carrier Data Center  = Core Carrier Transport Network -> ISP Network
  5. INTERNET: ISP = ISP Network -> Internet
  6. repeat 5,4,3,2,1 to target.
You can see how the latency and costs can stack up. This model is hugely “North / South” targeted, with packets making huge horseshoes through the network in order to transit content. And for most carriers, physical and protocol barriers exist throughout the location tiers, preventing bypass or intermediating strategies that might support everything from local caching/routing to improved traffic and congestion management.
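To make the “horseshoe” concrete, here is a minimal sketch that adds up latency as a request climbs the location tiers and comes back down. The tier names follow the taxonomy above; the per-hop latencies are purely illustrative assumptions (not measurements), and the return path is assumed to mirror the outbound path.

```python
# Location tiers from the taxonomy above, ordered from the user outward.
TIERS = ["PREMISE", "EDGE", "REGIONAL_DC", "CARRIER_DC", "ISP"]

# Hypothetical one-way latency (ms) to cross from each tier to the next one up.
# These values are illustrative assumptions only.
HOP_LATENCY_MS = {
    ("PREMISE", "EDGE"): 5.0,
    ("EDGE", "REGIONAL_DC"): 3.0,
    ("REGIONAL_DC", "CARRIER_DC"): 8.0,
    ("CARRIER_DC", "ISP"): 10.0,
}


def climb_latency_ms(turnaround_tier: str) -> float:
    """One-way latency from PREMISE up to the tier where traffic turns around."""
    top = TIERS.index(turnaround_tier)
    return sum(HOP_LATENCY_MS[(TIERS[i], TIERS[i + 1])] for i in range(top))


def horseshoe_latency_ms(turnaround_tier: str) -> float:
    """Up through the tiers and back down (step 6: 'repeat 5,4,3,2,1')."""
    return 2 * climb_latency_ms(turnaround_tier)


if __name__ == "__main__":
    for tier in TIERS[1:]:
        print(f"turn around at {tier:12s}: {horseshoe_latency_ms(tier):5.1f} ms")
```

With these made-up numbers, content that can be served by turning around at the EDGE costs a fraction of the latency of a full trip through the ISP tier, which is the argument for caching and routing closer to the premise.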
Content Carriage Cost, PREMISE->EDGE and EDGE->BACKBONE bandwidth dilemma.
  • Reduce Cost, Complexity and Increase Service Flexibility
    • Ability to introduce new services quickly and adapt to changing traffic patterns is crucial.
    • Move network functions away from expensive hardware infrastructure to software running on lower cost COTS hardware.
    • Use Software Defined Networks (SDN) and Cloud Computing to provide the required flexibility, and the ability to adapt to changing traffic patterns.
  • N2 GN Principles
    • Separate Control & Data.
      • Open Standards Based Interfaces (OpenFlow)
    • Introduce application/policy -based traffic steering.
    • Introduce virtualization.
    • Standards and Automation for higher level functions will be critical to the transformation.

The Benefits:

  • Virtualized environment
    • You can load up servers to an appropriate production traffic loading
    • Hardware resilience can be disconnected from the service applications
      • this leads directly to reduced Power, AirCon, and other facility requirements
    • Standard hardware
      • Large scale drives cost down (Economy of Scale)
      • One Spare holding SKU, drastically reducing the number of units needed.
      • Reduces field force training requirements
      • Get away from the “new software release means a field firmware upgrade”, which takes many truck rolls and potentially many combinations to handle.
    • Disaster recovery/business continuity can be simplified
    • Software Implementation of Network Functions
    • Introduce new services without deploying an entire network of new hardware, and if it fails commercially you can reuse the hardware.
    • Move workloads around the network to tactically handle capacity issues in specific geographical areas.

    The Future

    Software Defined Content Networks (SDCN) via SDN:

    • Functional Agility
      • Mutable functions
      • Content/Application-aware routing and separation of packet forwarding from control
        • to rapidly introduce new services and adapt to changing traffic patterns
    • The Network Processing Slice
      • The network infrastructure consists of low cost network elements: Servers, Packet Optical Transport, Data Center Switches
      • Elements are edge located where needed and consolidated where needed
      • Network functions, caching, applications, enablers all run in virtualized distributed equipment
      • Transport is an integrated optical IP network
And just to wrap up on the importance, something that I thought was relevant to the re-invention of the network:
Akamai’s [now retired CEO] Paul Sagan w/ Om Malik @ Structure 2012, advice to carriers: “extend IP throughout [and increase transport network bandwidth] and limit protocol differences to enable the Instant Internet”
  • must deliver to user expectation in TV like experience
  • need 100fold increase in bandwidth & transactions by end of decade
  • 2 BTxns/day today
  • Content Delivery Services are < 50% of the business; the rest is providing traffic control, namely performance and security guarantees

To play off of Akamai’s requirements, Service Providers are building out well positioned [read distributed] data centers with a focus on capital and operational [power/people] efficiencies. As these distributed centers operate, a key opportunity is to avoid IP transit costs (e.g. the long haul inter-connect tariffs) and also to support workload prioritization to reduce over-provisioning and support premium offers.

These new data centers, built for changing workloads like Analytic Warehouses / Hadoop, require substantial East-West or bisection bandwidth across a “stretched” core network.  This emergence is markedly different from the North/South Bandwidth delivered through Content Delivery and Transactional Web applications.

Perhaps the most forgotten value proposition is in the mobile/access edge: pushing SDN capabilities all the way to the endpoint. At this point carriers typically dominate the last mile, and view content loads like NetFlix, iCloud and other services as parasitic. With the ability to create differentiated service offers all the way to the eyeballs, there’s a chance that, regulation permitting, QoS, lower latency or location aware services can again be recognized as a driver in the growth of ARPU/value premiums.

We are just starting to see the broad thinking about NV/SDN in the providers, but I do truly expect these technologies to unlock new services and new revenue streams.


Puppet, DevOps and Big Data

I was really excited to be asked to be one of the really cool Keynotes at Puppet Labs’ PuppetConf this year. My initial excitement was that I saw that the DevOps movement, capitalized on by system administrators to make their lives easier, had parallels in the life of a software developer and information architect as they tried to create automation and reproducibility within their own domains.

Reflecting on trying to simplify, automate and increase the pace of integration of information drove me to talk about the transitive uses of the DevOps models in new domains.

Thanks to the amazing Puppet community for offering me the soap-box to discuss some of the parallels that I am seeing as we [EMC] try to make information integration more agile, predictable and reproducible in the face of increasing growth, complexity and urgency.

Also, I want to thank a few of the critical people that have contributed massive value to my thinking including the Greenplum team working on the K2 cluster especially Milind and Apurva, as well as Nick Weaver and Tom McSweeney who have worked tirelessly to make Razor not just real, but a leading provisioning platform for operating systems in large scale environments including provisioning Windows, like never before!

Clouds, Service Providers, Converged and Commodity Infrastructure

Over the past several months I have seen a tremendous increase in customer and partner interest in infrastructure architectures that exploit Software Defined Networks (SDN), commodity infrastructure deployments (à la OCP) and an increasing desire for fine grained distributed tenancies (massive tenant scale). For me, many of these requests are harbingers of emerging IT requirements:

  1. Maximize CapEx return: maximize the utilization of capital [equipment] (ROI or maybe even ROIC)
  2. Minimize OpEx: drive agility and useful business value with a decreasing ratio of expense (Google is rumored to have a 10,000:1 server to administrator ratio), while also factoring in other key elements of cost including power, WAN networking costs, management licenses, rack/stack expenses, etc.
  3. Value Amplification: improve the overall ability to exploit clouds for value vs. cost arbitrage

Why the Cloud Hype? CapEx Reductions!

From a CapEx perspective the OCP project helps a company get onto commodity pricing curves that offer both choice, which enables competition on price, and volume, which puts a natural downward pressure on price. One often overlooked measure of advantage comes from an infrastructure that is suitably “fungible” to allow a maximum number of applications to be fulfilled through composition. Tracking the OpenStack Cloud model, there is an increasing trend to create independent composition of compute (Nova), network (Quantum) and storage (Cinder) in order to maximize the workloads that can be deployed and hopefully minimize the stranding of resources.


We’ll talk about some of the critical technology bits later.

There are obviously many other factors that contribute to capital efficiency and utilization, such as the efficiency of the storage in the protection of the information, from RAID and erasure encoding to de-duplication. Hadoop’s HDFS implementation is notoriously inefficient, keeping 3 copies of each object in order to fully protect the information. This is certainly an area in which EMC (with the Isilon implementation of HDFS), like other storage vendors, excels.
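A quick back-of-the-envelope sketch shows why the 3-copy approach hurts. The replication factor of 3 comes from the paragraph above; the 10+4 erasure coding layout and the 100 TB figure are illustrative assumptions, not a description of any particular product’s scheme.

```python
def raw_per_usable_replication(copies: int) -> float:
    """Raw bytes stored per usable byte when every object is kept `copies` times."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return float(copies)


def raw_per_usable_erasure(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per usable byte for a k+m erasure coded layout."""
    if data_fragments < 1 or parity_fragments < 0:
        raise ValueError("invalid fragment counts")
    return (data_fragments + parity_fragments) / data_fragments


def raw_needed_tb(usable_tb: float, raw_per_usable: float) -> float:
    """Raw capacity that must be purchased to hold `usable_tb` of data."""
    return usable_tb * raw_per_usable


if __name__ == "__main__":
    usable = 100.0  # TB of actual data; an assumed figure
    print("3x replication:", raw_needed_tb(usable, raw_per_usable_replication(3)), "TB raw")
    print("10+4 erasure  :", raw_needed_tb(usable, raw_per_usable_erasure(10, 4)), "TB raw")
```

Under these assumptions the replicated store needs 3 raw bytes per usable byte, while the erasure coded layout needs 1.4, so the same capital buys roughly twice the usable capacity.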

Another factor in efficiency in these emergent environments is multi-tenancy. In the server market we’ve traditionally talked about virtualization as providing the ability to stack multiple independent VMs in a server in a way that allows some resources to be shared, and even oversubscribed. Over-subscription is a technique that allows one to take a finite resource like memory or disk, and mediate access to the resource in a way that the total offered to the tenants exceeds the real available capacity. We have been doing this for years in storage and call it “thin provisioning”; the idea is to be able to manage the addition of resource to better match actual demand vs. contracted demand.
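As a rough illustration of the thin provisioning idea, the sketch below tracks contracted versus actual demand for a pool. The `Pool` class, its field names and the 80% threshold are all hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A thin provisioned pool: what exists, what was promised, what is used."""
    physical_gb: float           # real capacity installed
    provisioned_gb: float = 0.0  # total contracted to tenants
    used_gb: float = 0.0         # total actually consumed by tenants

    def oversubscription_ratio(self) -> float:
        """Contracted capacity divided by real capacity (> 1 means oversubscribed)."""
        return self.provisioned_gb / self.physical_gb

    def utilization(self) -> float:
        """Actual demand as a fraction of real capacity."""
        return self.used_gb / self.physical_gb

    def needs_more_capacity(self, threshold: float = 0.8) -> bool:
        """Add hardware when actual (not contracted) demand nears the real limit."""
        return self.utilization() >= threshold


if __name__ == "__main__":
    pool = Pool(physical_gb=100.0, provisioned_gb=250.0, used_gb=60.0)
    print(pool.oversubscription_ratio(), pool.utilization(), pool.needs_more_capacity())
```

The point of the design is that the purchase trigger follows `used_gb`, not `provisioned_gb`: capacity is added against actual demand while tenants still see their full contracted size.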

In many environments that have high Service Level Objectives (SLOs), we need to be more pessimistic with respect to resource provisioning. Infrastructure is increasingly providing the tools needed to provide substantially higher isolation guarantees, from Intel’s CPUs that are providing hard resource boundaries at the per-core level, to switch infrastructures that support L2 QoS and even storage strategies in flash and disk that optimize I/O delivery against ever more deterministic boundaries. These capabilities are emerging as critical requirements for cloud providers that anticipate delivering guaranteed performance to their customers.

In a recent paper that the EMC Team produced, we reasoned that Trusted Multi-Tenancy had a number of key requirements detailed in a prior post, but certainly relevant to our discussion around advancing tenant densities “with trust” in more commodity environments.

Reducing the OpEx Target

Operational Expenses do tend to dominate many conversations today, and we know that power consumption and WAN costs are beginning to eclipse manpower as the top expense categories. In fact, some studies have suggested that the Data Center power infrastructure accounts for ~50% of facilities costs and ~25% of total power costs; obviously, power is an issue. I’ll cover this in another post soon, but for the interested, James Hamilton provides an interesting analysis here.

I also believe that there is a further case for looking differently at workload selection criteria, with placement in advantaged cloud providers driven by proximity to critical data ingress, use, and egress points. As I’ve reviewed many hybrid plans, they tend to center on Tier 3 and lower applications in terms of what can be pushed offsite: backup, test & dev and other non-critical work. Moving low value work to cheaper providers is certainly driven by cost-arbitrage, and in some cases makes sense when one is really looking for elasticity in the existing infrastructure plant to make headroom for higher priority work. But I’m seeing a new use-case develop as the enterprise begins to focus on analyzing customer behavior, sharing information with partners, and even exploiting cloud data sets in social or vertical marketplaces. One can even look at employee mobility and BYOD as an extension, as employee devices enter through 4G networks across the corporate VPN/firewall complex. All of these cases point to information that is generated, interacted with and consumed much closer to the Internet Edge (and transitively further from the corporate data center). It is my continued premise that the massive amount of information transacted at the Internet edge generates the need to store, process and deliver information from that location, rather than backhaul it into the high cost Enterprise fortress.

[Figure: Relative Cost of Compute to Network]

An Opportunity at the Edge of the Internet

Service Providers similarly see this trend, and are beginning to stand up “trusted clouds” so that an enterprise’s GRC concerns can be better addressed, removing barriers; but they also recognize the opportunity. As processing and storage costs continue to decrease at a faster rate than transport and routing costs, companies that place information close to the edge, and are able to act on it there, will benefit from reduced cost of operation. Traditional Content Delivery Network companies like Akamai have been using edge delivery caching for years in order to decrease WAN transport costs, but the emerging models are pointing to a much richer experience in the provider network: the ability to manage data in the network, and bring compute to the data, in order to exploit improved availability, location awareness and REDUCED COST.

People and Process

The people and process cost may have the biggest immediate impact on OpEx. The massive success of Converged Infrastructure (CI), such as the VCE vBlock Data Center Cluster™, certainly points to the value of a homogeneous infrastructure plant for improvements in procurement, management, interoperability, process, and security/compliance, as detailed by ESG here. Today’s vBlocks are exceptionally well suited for low latency transactional I/O systems for applications like Oracle, SAP, Exchange and VDI, basically anything that a business has traditionally prioritized into its top tiers of service. The reduction in total operating costs can be terrific, and is well detailed elsewhere.

Public Cloud = Commodity Converged Infrastructure at the Edge

What’s exceedingly interesting to me is that web-scale companies like Facebook, Rackspace, Goldman Sachs and, though not in OCP, eBay have been looking at the notion of a CI platform for their scale-out workloads, and are all focused on data and compute placement throughout the network. These workloads and their web-scale strategies shift the management of non-functional requirements (reliability, availability, scalability, serviceability, manageability) away from the infrastructure and into the application service. In effect, by looking at cheap commodity infrastructure, and new strategies for control integration which favor out of band asynchronous models over in band synchronous ones, smart applications can run at massive scale, over distributed infrastructure and with amazing availability. These are the promises of the new public cloud controllers enabled by Cloud Foundry, OpenStack and CloudStack like control planes (InfoWorld’s Oliver Rist even suggests that OpenStack has become the “new Linux”). This all sounds scary good, right?

Well, here’s the rub: this shift from smart infrastructure to smart services almost always requires a re-evaluation of the architecture and technologies associated with a deployment. Simply review what NetFlix has had to build to manage their availability, scale, performance and cost. In effect, they built their own PaaS on the AWS cloud. This was no small undertaking and, as Adrian says frequently, is under continuous improvement against specific goals:

  • Fast (to develop, deploy and mostly to the consumer),
  • Scalable (eliminate DC constraints, no vertical scaling, elastic),
  • Available (robust and available past what a data center typically provides – think dial tone, and no downtime – rolling upgrades and the like), and
  • Productive (producing agile products, well structured and layered interfaces)

The net is that the architectural goals that NetFlix put forward, like those of its Web Scale peers, forced them to really move to a green field architecture, with green field services built to take on these new objectives. Web Scale services are really only possible on a substantially homogeneous platform built to provide composability of fungible units of Compute, Network and Storage scale, to enable the application to elastically [de]provision resources across a distributed cloud fabric with consistency. There is an implied requirement for a resource scheduler that is knowledgeable of locations and availability zones, and makes those factors available to the application services to align/control costs and performance against service requirements.
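To illustrate what a location-aware scheduler might look like in the small, here is a toy placement function. It is only a sketch: the `Zone` fields, the `place` function and the “preferred region first, then cheapest” rule are assumptions made for this example, not how any particular cloud controller works.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Zone:
    """An availability zone offering fungible units of capacity."""
    name: str
    region: str
    free_units: int
    cost_per_unit: float


def place(zones: List[Zone], units: int,
          preferred_region: Optional[str] = None) -> Optional[Zone]:
    """Pick a zone with enough free capacity.

    Zones in the preferred region win (proximity to the data); ties are
    broken by lowest cost. Returns None when nothing can hold the request.
    """
    candidates = [z for z in zones if z.free_units >= units]
    if not candidates:
        return None
    best = min(candidates,
               key=lambda z: (z.region != preferred_region, z.cost_per_unit))
    best.free_units -= units  # reserve the capacity
    return best


if __name__ == "__main__":
    zones = [
        Zone("east-1a", "east", free_units=4, cost_per_unit=1.0),
        Zone("west-1a", "west", free_units=10, cost_per_unit=0.5),
        Zone("east-1b", "east", free_units=10, cost_per_unit=2.0),
    ]
    chosen = place(zones, units=8, preferred_region="east")
    print(chosen.name if chosen else "no capacity")
```

The application passes its own requirements (size, region) and the scheduler exposes location and cost in the decision, which is the alignment of cost and performance against service requirements described above.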

Open Virtual Networks

The emergence of Software Defined Networks, the OpenFlow control strategies and fabric models finally make the transport network a dynamically provisionable and fungible resource. No longer are we constrained by internetwork VLAN strategies which lock you into a single vendor. As Nicira states:

Virtualizing the network “changes the laws of network physics”. Virtual networks allow workload mobility across subnets and availability zones while maintaining L2 adjacency, scalable multi-tenant isolation and the ability to repurpose physical infrastructure on demand. The time it takes to deploy secure applications in the cloud goes from weeks to minutes and the process goes from manual to automatic.

[Image: Google Reveals its Open Source Networking]

This ability to now virtualize the physical network turns a channel / path into an individually [tenant] controlled resource that can be both dynamic in its topology and elastic in its capacity. This is a huge step forward in creating an IaaS that is more fungible than ever before. Google has just started talking publicly about their use of OpenFlow for radical savings and efficiency. In fact, Urs Hölzle says that “the idea behind this advance is the most significant change in networking in the entire lifetime of Google”.


When you do allow substantial freedom in leaving some legacy behind, the economics can be remarkable, and certainly not everything has to be jettisoned. But it should be expected that the application and platform architectures will be substantially different. The new opportunities afforded by these distributed cloud technologies and commodity infrastructure advances will certainly entice startups without legacy first, but we must expect enterprises to look to these new architectures too. Everyone wants to benefit from reductions in their capital and operational expenses, improved service performance, and getting closer to their partners and customers AND the data behind these interactions.

Oh, and Google just launched IaaS with “Compute Engine”…

Cloud Building: Trusted Multi-Tenancy Requirements

There are two primary actors in a cloud deployment, the customer of the cloud (Tenant) and the provider of the cloud (Provider). They represent a standard client / server relationship requiring a strong contract and visible performance against that contract.

I believe that there are a set of key capabilities required in order to maximize the adoption [by Tenants] of [Provider] cloud infrastructure. They are detailed as follows.

  • Flexibility
    • The cloud is dynamic in nature; tenant boundaries shift to meet business needs and locations change in response to overall workloads to meet contractual SLOs and to manage availability
    • Virtualization is required to support this operational model
  • Agility
    • The speed of contracting for services, versus the capital budget, procurement, and deployment lifecycle needed to purchase the capacity, will be a significant advantage in responding to the speed of business.
  • Elasticity
    • Empowerment to expand and contract capabilities as and when business requirements change and opportunities arise.
  • Economy [of Scale]
    • Simply shifting infrastructure from Customer to Cloud Provider will not provide Economies of Scale.
    • Cloud Providers must be able to pool and share their infrastructure resources in order to achieve Economies of Scale for their Customers.
    • Multi-Tenancy is the Key to saving massive amounts of capital and maintenance costs through sharing of pooled resources via cloud operating systems.
  • Trust [and trusted]
    • Multi-Tenancy yields cost savings, but Customers will NOT adopt Cloud Services unless they can be assured that the Cloud environment provides a higher level of granularity of visibility and control than their existing infrastructure
    • Trusted Multi-Tenancy is the Key to Cloud adoption.

These capabilities follow with a set of Tenant requirements (expectations):

  • Improved Control of and Visibility into the Environment
    • Self-service using web-based controls
    • Improved visibility of both function and expense
    • Transparency in operations and compliance
  • Isolation from other tenants; must ensure
    • Privacy
      • protect data in use & data at rest
    • Non-interference
      • ensure their SLOs are met, regardless of other tenant workloads
  • Security
    • Identity
      • Single Sign-On (SSO) federated from Enterprise to SP
    • Ability to control access to shared resources (data path, control path, and key management/escrow)
  • Improved performance to expense ratio (shared capital)
    • Reliability
    • Operational agility (contract/expand)

As we distill the functional requirements, an architectural taxonomy of affected entities naturally emerges:


Then we must evaluate our core architectural tenets of trusted multi-tenancy (please excuse the pun):

  1. Make customer-visible units of resources logical not physical
    • Known MT properties/capabilities on any layer directly exposed to customers
  2. Put those logical objects into containers [nested] with recursive delegated administration capabilities @ the container layer
    • Separates the implementation of a resource from its contract
    • Provides a common point of mediation and aggregation
    • Hierarchical (Layered) relationships must be supported on the data path and the control path
  3. Implement out-of-band monitoring of management activity
    • Verify actual state of system remains in compliance throughout management / state changes across tenancies
    • Out-of-band monitoring must be done at the container boundary for the container to support multi-tenancy
    • Multi-Tenant correlation (actual vs. expected) becomes critical to GRC
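The three tenets above can be sketched in a few lines: logical containers that nest, administration delegated recursively down the hierarchy, and an out-of-band record of management activity taken at the container boundary. Everything here (`Container`, `delegate`, `AUDIT_LOG`, the user names) is a hypothetical illustration, not the design from the EMC paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Container:
    """A logical container of resources, e.g. provider > tenant > department."""
    name: str
    parent: Optional["Container"] = None
    admins: Set[str] = field(default_factory=set)


# Out-of-band record of management activity, kept outside the containers
# so that actual state can later be compared against expected state.
AUDIT_LOG: List[str] = []


def can_administer(user: str, container: Container) -> bool:
    """A user administers a container if named on it or on any ancestor."""
    node: Optional[Container] = container
    while node is not None:
        if user in node.admins:
            return True
        node = node.parent
    return False


def delegate(actor: str, new_admin: str, container: Container) -> None:
    """Delegate administration of `container`, logging at the container boundary."""
    allowed = can_administer(actor, container)
    AUDIT_LOG.append(f"{container.name}: {actor} -> {new_admin}: "
                     f"{'granted' if allowed else 'denied'}")
    if not allowed:
        raise PermissionError(f"{actor} may not administer {container.name}")
    container.admins.add(new_admin)


if __name__ == "__main__":
    provider = Container("provider", admins={"root"})
    tenant = Container("tenant-a", parent=provider)
    delegate("root", "alice", tenant)  # provider delegates to a tenant admin
    print(can_administer("alice", tenant), can_administer("alice", provider))
    print(AUDIT_LOG)
```

Note that authority only flows downward: a tenant admin can manage nested containers but never the provider container above it, and every attempt, allowed or not, leaves an entry for later correlation.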

And lastly we believe that these further distill into a set of discrete design factors and principles that help to create a matrix of critical requirements:

[Figure: matrix of design factors and principles. Provided under copyright © EMC Corporation, 2012, All Rights Reserved]

I hope that this discussion of Trusted Multi-Tenancy helps to clarify the complexity of the Tenant / Provider relationship as well as the complexity of the infrastructure required to fulfill Tenant service requirements in a cloud setting.

I want to thank the EMC team including Jeff Nick, John Field, Tom McSweeney, the RSA team at large as well as the 30+ members of our working group for the work that preceded this blog. – Dan

Project Razor: EMC and Puppet Labs

Today EMC and Puppet Labs are releasing, via open source community contribution, a key module that serves to ground the last mile of the DevOps process. Project Razor is a tool that I’ve personally been looking for, for years. Razor is a service oriented (RESTful), model based, and policy driven “Operating System provisioning tool” which uses each OS’s native installer for hardware compatibility.

During the project design meetings, we determined that we wanted a tool that could be:

  • a programmable power controller, exploiting BMC/IPMI or an intelligent power strip
  • a capability discovery tool, using a downloadable kernel to take and maintain system inventory
  • an OS delivery processor that is state machine driven, using native OS installers (ESX, Ubuntu, SLES, RedHat, even Windows), delivering what we might call a targeted “bare minimum” OS (with just enough packages to join the MCollective)
  • a consistent handoff of the explicitly provisioned computer to the DevOps environment for configuration control and customization
  • and have a pluggable, easily extensible state machine that could act as a framework for downstream work (since no one solution fits IT).

Having architected large scale clouds like Sun Cloud and developed for Cloud Foundry based platform environments, I found myself spending a huge amount of time provisioning bare metal boxes with an ever changing portfolio of hardware and Operating Environments. Then, once I put the box in service, I had another set of scripts (Post Install) that would consistently provision the application stacks. For me, there were really 3 things that were inefficient. One was the fact that the installation models (usually scripts) were brittle because of machine differences, software builds/updates, and configurations. The second was that I kept cobbling together all the different pieces: ipmi, tftp, httpd, dhcpd, syslog, plus a ton of scripts. The third was the fact that each OS distributor employed their own provisioning systems isolated to their OS, but my environment needed 3 Linux variants, VMware vSphere 5, Windows 8 and even Solaris.

Constantly looking for a better way, we evaluated a variety of DevOps tools to replace the “post OS” scripts, and settled on Puppet from Puppet Labs. For me, the ability to finally get back to my programming roots within a traditional “System Administration” context was just the flexibility that we needed. We could build a first class MVC, and exploit re-use, modern programming languages and an amazing flexibility of deployment.

In my view, this is an environment designed, not for the traditional system administrator, but for the hyper-productive, administrative developer. Not catering to the lowest common denominator, the flexibility afforded by the PL approach can help to tear down some of the old walls between development and deployment in a valuable and empowering way.

We built Razor to integrate with the DHCP and network boot services, to support the cycling of power, and to take inventory information. Interestingly, we used Facter here to dynamically deliver the “inventory taking modules”, so that we could keep the inventory kernel small but support the dynamic delivery of key sensors just in time. The inventory information is registered with Razor, and the system is put in a wait state until Razor decides what to do with this node.

Razor takes the inventory and provides a set of matching expressions to help one “tag” key features in order to add this machine to the appropriate sets. Razor then enables the administrator to define a model (what they want done) and a policy that maps the model against the tags, so that when a tag is matched a policy is triggered and a model delivered. The net result: if the box fits a known profile, the model (OS + configurations) is delivered, and Razor then deals with the boot / iPXE delivered OS images and the HTTP delivered customized package sets for that node (including the Puppet Agent), and when all is done (and the second reboot happens), hands this node off to Puppet for subsequent customization.
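The tag, policy and model flow described above can be sketched as plain data plus a lookup. This is not Razor’s actual API or data model; the fact names, tag names, model names and functions below are all hypothetical, chosen only to show the shape of the idea.

```python
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

Inventory = Dict[str, Any]

# Tag rules: a tag applies when its matching expression is true for the inventory.
TAG_RULES: Dict[str, Callable[[Inventory], bool]] = {
    "big_memory": lambda inv: inv.get("memory_gb", 0) >= 64,
    "virt_capable": lambda inv: bool(inv.get("cpu_virtualization")),
}

# Policies: (required tags, model to deliver), checked in order; first match wins.
POLICIES: List[Tuple[Set[str], str]] = [
    ({"big_memory", "virt_capable"}, "esx_host"),
    ({"big_memory"}, "ubuntu_hadoop_node"),
]


def tags_for(inventory: Inventory) -> Set[str]:
    """Evaluate every matching expression against the node's inventory."""
    return {tag for tag, rule in TAG_RULES.items() if rule(inventory)}


def select_model(inventory: Inventory) -> Optional[str]:
    """Return the model for the first policy whose tags all match, else None.

    None means the node fits no known profile and stays in the wait state.
    """
    tags = tags_for(inventory)
    for required, model in POLICIES:
        if required <= tags:
            return model
    return None


if __name__ == "__main__":
    node = {"memory_gb": 128, "cpu_virtualization": True}
    print(tags_for(node), select_model(node))
```

Keeping the matching expressions, policies and models as separate pieces is what makes the behavior repeatable: the same inventory always yields the same tags, and the same tags always yield the same model.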

What’s cool about this solution? Razor with Puppet now support the end-to-end configuration control capability for a brand new “bare metal” node, into a working, deployed, managed, configuration with repeatability, auditability, and flexibility.

Why open source? For EMC, we see that the cloud market place is emerging quickly, and further recognize that the transparency and openness of frameworks will be prerequisite to broad adoption, for innovation and safety respectively. This is just the first of many contributions that the EMC technology team will make to the open source communities in support of massive scale cloud computing, scale out architectures, and improved automation and operational excellence. Our architectural patterns for these environments are strongly influenced by DevOps strategies and SideBand controllers as typified by the OpenStack architecture, in areas like Quantum (Software Defined Networks), Cinder (Storage/Volume Controllers), and Nova (Compute Controllers). To add to this list, I believe that Puppet + Razor become a “Configuration Controller”, which can serve as mature, model based, policy driven glue-ware for cross controller customized and configuration managed integration.

Not to sell our VMware partners short! We believe that there is an exceptionally strong role for Razor + Puppet = top/bottom DevOps, both in the ability to stand up VMware clusters in the vCloud model and in scaling out the delivery of integrated applications on top of that environment.

We’ll be showing a really neat demonstration of Puppet and Razor at EMC World today, so please do tune in to “Chad’s World”. And, I do know that some now want more detail, but I want to call out the two key EMC developers who made Razor possible, namely, Nick Weaver and Tom McSweeney. Both of them worked tirelessly, first on trying to get the other industry configuration controllers (Baracus from SuSE and Crowbar from Dell) to work, and then to architect the pluggable dynamic state machine for Razor detailed here in Nick’s Blog. And of course the Puppet Labs team including Nan, Dan, Teyo, Nigel, Jose, Scott and Luke, who integrated with our team seamlessly and made this possible.

Lastly, I need to call out Chuck Hollis, who is constantly looking for the cool factor in the industry and in EMC, who has detailed Razor in his blog.

wwHadoop: Analytics without Borders


This week at EMC World 2012, the EMC technical community is launching a community program called World Wide Hadoop.  I am really excited to be part of a collaboration across the EMC technical community that has been looking to extend the “borders” of our Big Data portfolio. It builds on the success of our Greenplum Hadoop distribution by offering the open source community the ability to federate their HBase analytics across a distributed set of Hadoop clusters.


In the past month, the EMC Distinguished Engineer community has been collaborating with our St. Petersburg [RUS] Center of Excellence to demonstrate the ability to distribute analytic jobs (move the code vs. moving the data) across multiple [potentially] geographically dispersed clusters, manage those jobs, and join the results.

The big problem that we are addressing is that of Reed’s law: the combinatorial value of networked resources, in our case, information sets.


“[E]ven Metcalfe’s law understates the value created by a group-forming network [GFN] as it grows. Let’s say you have a GFN with n members. If you add up all the potential two-person groups, three-person groups, and so on that those members could form, the number of possible groups equals 2^n. So the value of a GFN increases exponentially, in proportion to 2^n. I call that Reed’s Law. And its implications are profound.”
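
To make the scale of Reed’s Law concrete, here is a minimal Python sketch comparing pairwise links (the Metcalfe view) with possible groups (the Reed view) for a network of n members. The function names are my own. The group count here leaves out the empty set and the single-member sets, so it comes to 2^n - n - 1, which still grows in proportion to 2^n as the quote says.

```python
from math import comb


def metcalfe_pairs(n: int) -> int:
    """Number of distinct two-member links among n members."""
    return comb(n, 2)


def reed_groups(n: int) -> int:
    """Number of possible groups with at least two members.

    All subsets (2**n) minus the empty set (1) and the n singletons.
    """
    return 2**n - n - 1


if __name__ == "__main__":
    # Illustrative sizes only; swap in the number of information sets you have.
    for n in (2, 5, 10, 20, 30):
        print(f"n={n:>2}  pairs={metcalfe_pairs(n):>4}  groups={reed_groups(n):>12}")
```

Even at a few dozen members the group count dwarfs the pair count, which is why I think about information sets in terms of the combinations they can form, not just the point-to-point links between them.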

A few of our Big Data Challenges


  • Valuable information is produced across geographically diverse locations
  • The data has become too big to move [thus we need to process in place]
  • Scientists and Analysts have begun to move partial sets vs. full corpora to try and save time
  • But this partial data can, and often does, create inadvertent variance or occlusions in correlation and value

EMC is demonstrating a working distributed cluster model for analytics across multiple clusters.
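
The “move the code vs. moving the data” idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the wwHadoop implementation: the cluster names, the sample records, and the `run_federated` helper are all hypothetical, and in-memory lists stand in for real Hadoop clusters. Each site runs the same small job against its local data and returns only a compact partial result, which a coordinator then joins.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical stand-ins for geographically dispersed clusters.
# In practice each list would be data that is too big to move.
CLUSTERS: Dict[str, List[str]] = {
    "site_a": ["login", "purchase", "login"],
    "site_b": ["purchase", "login"],
    "site_c": ["refund"],
}


def local_count(records: List[str]) -> Counter:
    """The job shipped to each cluster: count events where the data lives."""
    return Counter(records)


def run_federated(job: Callable[[List[str]], Counter],
                  clusters: Dict[str, List[str]]) -> Counter:
    """Run the job at every site, then join the small partial results."""
    total: Counter = Counter()
    for records in clusters.values():
        total.update(job(records))  # only the partial counts cross the network
    return total


result = run_federated(local_count, CLUSTERS)
print(dict(result))
```

Note that running `local_count` against a single site alone gives a different picture than the joined result, which is the partial-data problem from the list above in miniature.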


We want to work on this with the open community, as we believe that there is tremendous value in enabling the community to both derive and add value with EMC in this space and invite all to join us at

Big Data and Healthcare: Correlate, Root Cause, Operationalize & Disrupt

I have to say that it’s been massively fun lately at EMC talking about the impact of Big Data Analytics on vertical markets. It has been fun learning from Chuck’s experience / scars associated with advocating for disruptive change in the IT market.

Recently, Chuck blogged about the emerging ACOs and the potential disruption that information and cost/benefit-aligned markets can bring to sectors like healthcare.

I commented on his blog, and wanted to share the insight with my audience as well:

Thanks for your insightful “state of healthcare” review. I find that there is one more facet to healthcare that makes the data/information analytic landscape even more compelling, and that’s an interesting book by David Weinberger, Too Big to Know.

Dr. Weinberger brings forward the notion that books and traditional learning are discontinuous, and that there is emerging, due to the hyperlinked information universe, a massive ecology of interconnected fragments that continually act with the power of positive and negative reinforcement. The net is that traditional views, or in David’s vocabulary “Facts”, are largely based upon constrained reasoning, and that as the basis of this reasoning changes with the arrival of new facts, so should their interpretations.

There are a number of entities sitting on top of 30 years of historic clinical data (the VA, certain ACOs and certainly academic medical centers) that can chart many of the manifest changes in treatment planning and outcome, but may themselves, because of their constraints, present significant correlation. Just the same, they may not be able to discern accurate causality due to a lack of completeness of the information base: maybe genomics, proteomics, or the very complex nature of the human biological system.

The interconnectedness of the clinical landscape is of paramount value in the establishment of correlation and derived causality, and with the increase of new information, traditional best practices can be [and need to be] constantly re-evaluated.

I believe that this lack of causal understanding, due to highly constrained reasoning, has led to many of the derogatory statements about today’s outcome-based care model. As today’s outcome measures are based more on correlation than on causality, determining causal factors, and then understanding the right control structures, should improve the operationalization of care. I further believe that the availability of substantially more complete clinical information, across higher percentages of populations, will lead to improvements in these outcome measures and the controls that can affect them. The net result of getting this right is improved outcomes at decreased costs, effectively turning the practice of medicine into a big data science with a more common methodology and more predictable outcomes. In effect, full circle: operationalized predictive analytics.

The opportunities for market disruption are substantial, but there remain high barriers. One constant in being able to exploit disruptions is the right visionary leadership who has the political capital and will, but also is willing to be an explorer, for change is a journey and the route is not always clear. Are you a change agent?

Join me at the 2nd Annual Data Science Summit @ EMC World 2012

Big Data Exchanges: Of Shopping Malls and the Law of Gravity

Working for a Storage Systems company, we are constantly looking at both the technical and the social/marketplace challenges to our business strategy. Leading up to the coining of “Cloud Meets Big Data” last year, EMC has been looking at the trends that “should” tip the balance toward real “Cloud Information Management” as opposed to the “data management” that really dominates today’s practice.

There are a couple of truisms [incomplete list]:

  1. Big Data is Hard to Move = get optimal [geo] location right the first time
  2. Corollary = Move the Function, across Federated Data
  3. Data Analytics are Context Sensitive = meta-data helps to align/select contexts for relevancy
  4. Many Facts are Relative to context = Declare contexts of derived insight (provenance & Scientific Method)
  5. Data is Multi-Latency & needs Deterministic support for temporality = key declarative information architectural requirement
  6. Completeness of Information for Purpose (e.g. making decision) = dependent on stuff I have, and stuff I get from others, but everything that I need to decide.

I believe that 1) and 6) above point to an emerging need for Big Data Communities to arise, each supporting the requirements of the others, whether we talk about these as communities of interest or as Big Data Clouds. There are some very interesting analogies that I see in the way we humans act; namely, the Shopping Mall. Common wisdom points to the mall as providing improved shopping efficiency, but also, in the case of inward malls, a controlled environment (think walled garden). I think that both efficiency in the form of “one stop”, and control, are critical enablers in the information landscape.

[Slide: Big Data Mall]

This slide from one of my presentations shows the similarities between building a shopping mall and developing a big data community: things like understanding the demographics of the community (information needs, key values), planning the roads to get in/out, and of course creating critical mass = the anchor store.

The interesting thing about critical mass is that it tends to have a centricity around a key [Gravitational] Force. Remember:

Force = Mass * Acceleration (change in velocity).

This means that in order to create communities and maximize force you need Mass [size/scope/scale of information] and improving Velocity [timeliness of information]. In terms of mass, truism #1 above and the sheer cost / bandwidth availability make moving 100TB of data hard, and petabytes impracticable. Similarly, velocity change does matter: whether algorithmically trading on the street (you have to be in Ft Lee, NJ or Canary Wharf, London) or a physician treating a patient, the timeliness of access to emergent information is critical. So correct or not, gravitational forces do act to geo-locate information.
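
A rough back-of-the-envelope calculation shows why mass matters. The Python sketch below estimates how long a bulk transfer takes over a dedicated link. The link speeds and the 80% efficiency factor are assumptions of mine for illustration, not measurements, and the helper name is hypothetical.

```python
def transfer_days(size_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate days needed to move size_tb terabytes over a link.

    size_tb    -- data size in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- assumed fraction of nominal speed actually achieved
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for size_tb in (100, 1_000):      # 100 TB and 1 PB
        for gbps in (1.0, 10.0):      # assumed WAN link speeds
            days = transfer_days(size_tb, gbps)
            print(f"{size_tb:>5} TB over {gbps:>4} Gbps: {days:7.1f} days")
```

Under these assumptions, 100TB over a 1 Gbps link takes about 11.6 days, and a petabyte takes roughly 116 days, which is why getting the location right the first time matters so much.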

Not trying to take my physics analogy too far, but Energy is also interesting. This could be looked at as “activity” in a community. For energy there are interesting kinetic and potential models. In the case of the internet, the relative connectedness of information required for a decision could be viewed in light of “potential”. Remember:

Ep (potential energy) = Mass x force of Gravity x Height (mgh)

In our case Height could be looked at as the bandwidth between N information participant sites, Mass as the amount of total information needed to process, and Gravity as a decentralization of information = the Outer Joins required for optimal processing. If I need to do a ton of outer joins across the Internet in order to get an answer, then I need to spend a lot of energy.

So if malls were designed for optimal [human] energy efficiency, then big data malls could do exactly the same for data.

Big Data Universe: “Too Big to Know”

I was surfing to WGBH on Saturday when I came across a lecture by David Weinberger (surrounding his new book Too Big to Know).

I was sucked in when he alluded to brick and mortar libraries as yesterday’s public commons, and pointed to the discontinuous and disconnected nature of books / paper. The epitaph may read something like this: “book killed by hyperlink, the facts of the matter are whatever you make them.”

Overall David, in his look at the science of knowledge, points at many interesting transitions that the cloud is bringing:

  • Books/Paper are discontinuous and disconnected – giving way to the Internet which is constantly connected and massively hyper/inter-linked
  • “the Facts are NOT the Facts” which is to point out that arguments are what we make of the information presented, and our analytics given a particular context. What we claim as a fact may in reality prove to be a fallacy -> just look at Louis Pasteur and germ theory. History has so many moments like this.
  • Differences and Disagreements are themselves valuable knowledge. For me this is certainly true, learning typically comes through challenge of preconception.
  • There is an ecology of knowledge – There is a set of interconnected entities that exist within an environment. These actors represent a complex set of interrelationships, a set of positive and negative reinforcements, that act as governors on this system. These promoters/detractors act to balance fact/fallacy so as to create the system tension that supports insight (knowledge?). It’s these new insights that themselves create the next arguments – the end goal being?

I wanted to share this book because I believe that it reinforces the need for businesses to think about their cloud – big data strategies. The question becomes less “do I move my information to the cloud?” and more “how do I benefit from the linkage that the Internet can provide to my information?” so as to provide new insights from big data.

Read the book, take the challenge!