Friday, February 06, 2009

FORRESTER blog repost The Forrester Wave™: Enterprise Data Warehousing (EDW) Platforms Q1 2009: The Key Takeaway

http://blogs.forrester.com/information_management/2009/02/forrester-wave.html

By James Kobielus

Today we published the first Forrester Wave™ specifically focused on Enterprise Data Warehousing (EDW) Platforms. The final published report is now available on Forrester’s website to clients. Information and Knowledge Management (I&KM) professionals will find it a timely and actionable study of the leading EDW platform vendors: Teradata, Oracle, IBM, Microsoft, SAP, Sybase, and Netezza. I urge you to download and read it, and then engage me, the author-analyst, in inquiries and advisories to help you apply it to your EDW initiatives.

The key takeaway from this Wave is that scalability, flexibility, and affordability are the dominant requirements in today’s budget-stressed EDW platforms market. I&KM professionals are under the gun, trying to keep EDW and business intelligence (BI) costs under tight control while preserving the flexibility to grow and repurpose these investments to support an ever-changing array of decision-support requirements. Hence, an EDW platform--to score well in the Wave--should address the following high-bar requirements:

Extremely scalable: The EDW platform should be scalable to support petabytes of usable data; thousand-plus distributed compute/storage nodes; tens of thousands of concurrent users and queries; many terabytes of daily or continuous data loads; and expanding mixed workloads of reporting, query, OLAP, in-database analytics, real-time analytics, ETL, data cleansing, and other transactions. It should support this extreme scalability through scale-out, shared-nothing MPP, optimized appliances, optimized storage, dynamic query optimization, and mixed workload management technologies.

Extremely flexible: The EDW platform should be flexible to support diverse applications, including business intelligence, online analytical processing, data mining, predictive analytics, text analytics, closed-loop business process management, and complex event processing; and various deployment roles, including multi-domain data hubs, subject-specific data marts, operational data stores, master data management hubs, staging nodes, analytic data marts, multi-temperature hierarchical storage management and archiving, and source and/or target repository in data federation environments. It should support this extreme flexibility by being fluid, adaptive, and virtualized; enabling data to be transparently persisted, in diverse physical and logical formats, to an abstract, seamless grid of interconnected memory and disk resources, and delivered with subsecond delay to consuming applications; and ensuring application service levels through an end-to-end, policy-driven, latency-agile, distributed-caching and dynamic query-optimization memory grid.

Extremely affordable: The EDW platform should be affordable for all customer segments and use cases. It should support this extreme affordability through flexible packaging/pricing, including licensed software, modular appliances, and “pay as you go” subscription-based SaaS/cloud offerings.

EDW platform vendors that can’t address these key requirements--now or in their enhancement roadmaps over the coming 2-3 years--will not survive in this very competitive arena.

As noted above and in my blogpost last week, scalability, performance, and optimization are perhaps the most important criteria in today’s EDW market. And, of course, they are quite difficult to nail down into a single yardstick that does justice to different vendors’ approaches. Nevertheless, I believe this Wave accomplishes that. I have boiled down “scalability, performance, and optimization” (SPO) into a single criterion that defines five profiles (from 5 = most scalable to 1 = least scalable), focusing on the degree of parallelism in the underlying architecture.

For each of the vendors in this Wave, I got a deep dive on their SPO architecture, but I didn’t stop there. I asked each vendor for reference customers, and conducted a structured interview with each. I asked each for a list and description of their largest production customer deployments. And I asked each for published benchmarks, plus all the supporting info on the test environments, scenarios, and criteria. In other words, I applied the standard Forrester Wave methodology.

Essentially, the customer deployment and benchmark data corroborated whether a vendor in fact earned the particular SPO score associated with their architectural approach. Clearly, there were plenty of gray areas. Also, quite clearly, vendors had plenty of comments on the definitions of the SPO scales, and on where they fell on this spectrum. And, of course, many pointed out that being scored, say, a “2” rather than a “4” or “5” didn’t necessarily mean they were slower, less efficient, or incapable of processing various EDW and BI workloads. It also didn’t mean that they couldn’t, in practice and in customer deployments, push the scalability and speed envelope that one would associate with their architecture. Architecture isn’t destiny, but it definitely sets SPO constraints, which is the whole point of the scoring on this criterion in this Wave.

All the vendor feedback was excellent and helped me tweak and tune the scale to fit the EDW market’s current and emerging state of the art. With that said, here are the final SPO scales in this Wave:

5 = scale out through shared-nothing massively parallel processing (MPP), up to 100-1000+ storage/compute nodes in a single-tier grid of compute/storage nodes, and well beyond 1000s of terabytes (TBs) of online, usable production data across distributed deployment

4 = scale out in the storage tier to 100-1000+ nodes and/or up to around 1000 TBs of online, usable production data, but lacking support for single-tier-grid shared-nothing MPP and/or lacking the ability to scale out to 100-1000+ nodes in the compute tier

3 = scale-out through shared-nothing MPP and/or clustering, up to 2-100 storage and/or compute nodes and up to 100s of TBs of online, usable production data across distributed deployment

2 = scale-up through symmetric multiprocessing (SMP), and up to 10s of TBs of online, usable production data, and scale-out in a clustered deployment of 2-99 compute nodes

1 = scale-up through SMP and up to 10s of TBs of online, usable production data on a single-node deployment
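
As a rough illustration only, the five profiles above can be read as an ordered series of checks, from most to least scalable. The Python sketch below is not part of the Wave or its scoring methodology: the class name, the field names, and the numeric cut-offs (100 nodes, 100 TB, 1000 TB) are assumptions chosen to approximate the wording of the scale, and the gray areas mentioned earlier are deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class EdwProfile:
    # Hypothetical fields: what the platform can scale to,
    # not any one customer's current footprint.
    shared_nothing_mpp: bool
    single_tier_grid: bool
    max_compute_nodes: int
    max_storage_nodes: int
    max_usable_tb: float


def spo_score(p: EdwProfile) -> int:
    """Map a profile to an SPO level, 5 (most scalable) to 1 (least)."""
    big_compute = p.max_compute_nodes >= 100   # assumed cut-off
    big_storage = p.max_storage_nodes >= 100   # assumed cut-off
    multi_node = max(p.max_compute_nodes, p.max_storage_nodes) >= 2

    # 5: single-tier shared-nothing MPP grid, 100+ nodes, beyond ~1000 TB.
    if (p.shared_nothing_mpp and p.single_tier_grid
            and big_compute and big_storage and p.max_usable_tb > 1000):
        return 5
    # 4: storage tier reaches 100+ nodes and/or ~1000 TB,
    #    but the full level-5 grid is missing.
    if big_storage or p.max_usable_tb >= 1000:
        return 4
    # 3: more than one node and 100s of TB.
    if multi_node and p.max_usable_tb >= 100:
        return 3
    # 2: small cluster of SMP nodes, 10s of TB.
    if multi_node:
        return 2
    # 1: single-node SMP.
    return 1


# A hypothetical 1000-node single-tier MPP grid holding 5000 TB.
print(spo_score(EdwProfile(True, True, 1000, 1000, 5000)))  # 5
```

Checking the levels from the top down means a platform gets the highest profile whose conditions it fully meets, which mirrors how the scale treats architecture as a ceiling rather than a measured speed.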

To see how the vendors ranked, you’ll need to read the Wave. Or engage me in an inquiry or advisory. Or, preferably, both.