James Kobielus' Blog: Real-time drives database virtualization

Databases are evolving faster than ever, becoming more fluid to keep pace with an online world that's becoming virtualized at every level.

In many ways, the database as we know it is disappearing into a virtualization fabric of its own. In this emerging paradigm, data will not physically reside anywhere in particular. Instead, it will be transparently persisted, in a growing range of physical and logical formats, to an abstract, seamless grid of interconnected memory and disk resources, and delivered with subsecond delay to consuming applications.

Real-time is the most exciting new frontier in business intelligence, and virtualization will facilitate low-latency analytics more powerfully than traditional approaches. Database virtualization will enable real-time business intelligence through a policy-driven, latency-agile, distributed-caching memory grid that permeates an infrastructure at all levels.

As this new approach takes hold, it will provide a convergence architecture for diverse approaches to real-time business intelligence, such as trickle-feed extract, transform, load (ETL); changed-data capture (CDC); event-stream processing; and data federation. Traditionally deployed as stovepipe infrastructures, these approaches will become alternative integration patterns in a virtualized information fabric for real-time business intelligence.

The convergence of real-time business-intelligence approaches onto a unified, in-memory, distributed-caching infrastructure may take more than a decade to come to fruition because of the immaturity of the technology; lack of multivendor standards; and spotty, fragmented implementation of its enabling technologies among today's business-intelligence and data-warehouse vendors. However, all signs point to its inevitability.

Case in point: Microsoft, though not necessarily the most visionary vendor of real-time solutions, has recently ramped up its support for real-time business intelligence in its SQL Server product platform. Even more important, it has begun to discuss plans to make in-memory distributed caching, often known as "information fabric," the centerpiece middleware approach of its evolving business-intelligence and data-warehouse strategy.

For starters, Microsoft recently released its long-awaited SQL Server 2008 to manufacturing. Among this release's many enhancements are a new CDC module and proactive caching in its online analytical processing (OLAP) engine. CDC is a best practice for traditional real-time business intelligence because, by enabling continuous loading of database updates from transaction redo logs, it minimizes the performance impact on source platforms' transactional workloads. Proactive caching is an important capability in the front-end data mart because it speeds response on user queries against aggregate data.
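To make the CDC idea concrete, here is a minimal sketch in Python. It is not SQL Server's CDC module or its API; the `Change` record, the `apply_changes` function and the in-memory "log" are all hypothetical stand-ins. The point it illustrates is that the loader reads only log entries newer than its last high-water mark, so it never has to rescan the source tables.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One entry in a (hypothetical) transaction log."""
    lsn: int                      # log sequence number: position in the log
    op: str                       # "insert", "update" or "delete"
    key: str
    value: Optional[dict] = None  # new row image; None for deletes


def apply_changes(log, target, last_lsn):
    """Apply log entries newer than last_lsn to target.

    Assumes the log is ordered by lsn. Returns the new high-water mark,
    which the caller stores and passes in on the next run.
    """
    for change in log:
        if change.lsn <= last_lsn:
            continue  # already loaded on an earlier pass
        if change.op == "delete":
            target.pop(change.key, None)
        else:
            target[change.key] = change.value
        last_lsn = change.lsn
    return last_lsn


# Source system writes two transactions; the loader picks them up.
log = [
    Change(1, "insert", "order-1", {"amount": 100}),
    Change(2, "insert", "order-2", {"amount": 250}),
]
warehouse = {}
mark = apply_changes(log, warehouse, 0)

# Later activity on the source; only entries 3 and 4 are processed.
log.append(Change(3, "update", "order-1", {"amount": 120}))
log.append(Change(4, "delete", "order-2"))
mark = apply_changes(log, warehouse, mark)
```

Running the loader again with the same mark changes nothing, which is what lets it be scheduled continuously as a trickle feed.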

Also, Microsoft recently went public with plans to develop a next-generation, in-memory distributed-caching middleware code-named "Project Velocity." Though the vendor hasn't indicated when or how this new technology will find its way into shipping products, it's almost certain it will be integrated into future versions of SQL Server. With Project Velocity, Microsoft is playing a bit of competitor catch-up, considering that Oracle already has a well-developed in-memory, distributed-caching technology called Coherence, which it acquired more than a year ago from Tangosol. Likewise, pure-plays such as GigaSpaces, Gemstone Systems and ScaleOut Software have similar data-virtualization offerings.

Furthermore, Microsoft recently announced plans to acquire data-warehouse-appliance pure-play DATAllegro and to move that grid-enabled solution over to a pure Microsoft data-warehouse stack that includes SQL Server, its query optimization tools and data-integration middleware. Though Microsoft cannot discuss any road-map details until after the deal closes, it's highly likely it will leverage DATAllegro's sophisticated massively parallel processing, dynamic task-brokering and federated deployment features in future releases of its databases, including the on-demand version of SQL Server. In addition, it doesn't take much imagination to see a big role for in-memory distributed caching, à la Project Velocity, in Microsoft's future road map for appliance-based business-intelligence/data-warehouse solutions. Going even further, it's not inconceivable that, while plugging SQL Server into DATAllegro's platform (and removing the current Ingres open source database), Microsoft may tweak the underlying storage engine to support more business-intelligence-optimized logical and physical schemas.

Microsoft, however, isn't saying much about its platform road map for real-time business-intelligence/data-warehousing, because it probably hasn't worked out a coherent plan that combines these diverse elements. To be fair, neither has Oracle -- or, indeed, any other business-intelligence/data-warehouse vendor that has strong real-time features or plans. No vendor in the business-intelligence/data-warehouse arena has defined a coherent road map yet that converges its diverse real-time middleware approaches into a unified in-memory, distributed-caching approach.

Likewise, no vendor has clearly spelled out its approach for supporting the full range of physical and logical data-persistence models across its real-time information fabrics. Nevertheless, it's quite clear that the business-intelligence/data-warehouse industry is moving toward a new paradigm wherein the optimal data-persistence model will be provisioned automatically to each node based on its deployment role -- and in which data will be written to whatever blend of virtualized memory and disk best suits applications' real-time requirements.

For example, dimensional and column-based approaches are optimized to the front-end OLAP tier of data marts, where they support high-performance queries against large, aggregate tables. By contrast, relational and row-based approaches are suited best to the mid-tier of enterprise data-warehouse hubs, where they facilitate the speedy administration of complex hierarchies across multiple subject-area domains. Other persistence approaches -- such as inverted indexing -- may be suited to back-end staging nodes, where they can support efficient ETL, profiling and storage of complex data types before they are loaded into enterprise data-warehouse hubs.
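The role-based provisioning described above can be sketched as a simple policy lookup. This is an illustration only, not any vendor's product or API: the `Role` enum, the `PERSISTENCE_POLICY` table, the `provision` function and the one-second cutoff between memory and disk are all assumptions made for the example.

```python
from enum import Enum


class Role(Enum):
    """Deployment roles named in the text (hypothetical identifiers)."""
    STAGING = "staging"   # back-end staging nodes
    HUB = "hub"           # mid-tier enterprise data-warehouse hub
    MART = "mart"         # front-end OLAP data mart


# Persistence model per role, taken from the tiers described above.
PERSISTENCE_POLICY = {
    Role.MART: "dimensional, column-based",
    Role.HUB: "relational, row-based",
    Role.STAGING: "inverted indexing",
}

# Assumed cutoff: anything that must answer in under a second stays in memory.
SUBSECOND_MS = 1000


def provision(role, max_latency_ms):
    """Pick a persistence model by role and a storage medium by latency need."""
    medium = "memory" if max_latency_ms < SUBSECOND_MS else "disk"
    return {
        "role": role.value,
        "model": PERSISTENCE_POLICY[role],
        "medium": medium,
    }


mart_node = provision(Role.MART, max_latency_ms=200)
staging_node = provision(Role.STAGING, max_latency_ms=60_000)
```

A real fabric would presumably blend memory and disk rather than choose one or the other; the binary choice here just keeps the policy idea visible.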

For sure, all this virtualized data infrastructure will live in the "cloud," in a managed-service environment and within organizations' existing, premises-based business-intelligence/data-warehouse environments. It would be ridiculous, however, to imagine this evolution will take place overnight. Even if solution vendors suddenly converged on a common information-fabric framework -- which is highly doubtful -- enterprises have too much invested in their current data environments to justify migrating them to a virtualized architecture overnight.

Old data-warehouse platforms linger on generation after generation, solid and trusty, albeit increasingly crusty and musty. They won't get virtualized out of existence anytime soon, even as the new generation steals their oxygen. Old databases will expire only when someone migrates their precious data to a new environment, then physically pulls the plug, putting them out of their misery.

Thursday, August 21, 2008

James Kobielus