Tuesday, January 16, 2007

fyi Inside MySpace.com


Found content: http://www.baselinemag.com/article2/0,1540,2082921,00.asp?kc=BLBLBEMNL011607EOAD

My take:

This is one of the most fascinating case studies I’ve ever read in the IT literature. Free-of-charge social networks are among the most fragile creations in the web universe. They live and die by their ability to stoke the “network effect” of snowballing invitations among people within diverse social circles. If a service shows any chronic degradation in performance or reliability, users will abandon it quickly and without hesitation.

The article shows in painful detail how MySpace.com, without any fixed strategy, has continually evolved its distributed access, application, processing, storage, hosting, and management infrastructure to keep pace with surging membership, traffic, content, and expectations. It breaks the architectural evolution of MySpace.com into “membership milestones” (500,000 users, 1 million, 3 million, 9 million, 26 million, and so on) and shows how the service repeatedly broke and was quickly patched to avoid strangling the golden goose they had hatched.

What I found most fascinating about this case study is the following statement, in which a rival (Friendster) partly attributes MySpace.com’s runaway success to MySpace.com’s superior performance (and Friendster’s concurrent growing pains): “MySpace was launched in 2003, just as Friendster started having trouble keeping pace with its own runaway growth. In a recent interview with Fortune magazine, Friendster president Kent Lindstrom admitted his service stumbled at just the wrong time, taking 20 to 30 seconds to deliver a page when MySpace was doing it in 2 or 3 seconds.”

Once MySpace.com started to explode, they continually ran into bottlenecks in data access performance that threatened to derail them as well. The case study lays out the peril to MySpace in stark terms: “MySpace has tens of millions of people posting messages and comments or tweaking their profiles on a regular basis—some of them visiting repeatedly throughout the day. That makes the technical requirements for supporting MySpace much different than, say, for a news Web site, where most content is created by a relatively small team of editors and passively consumed by Web site visitors. In that case, the content management database can be optimized for read-only requests, since additions and updates to the database content are relatively rare. A news site might allow reader comments, but on MySpace user-contributed content is the primary content. As a result, it has a higher percentage of database interactions that are recording or updating information rather than just retrieving it. … Every profile page view on MySpace has to be created dynamically—that is, stitched together from database lookups. In fact, because each profile page includes links to those of the user's friends, the Web site software has to pull together information from multiple tables in multiple databases on multiple servers. The database workload can be mitigated somewhat by caching data in memory, but this scheme has to account for constant changes to the underlying data.”
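
To make the “stitched together from database lookups” point concrete, here is a toy sketch of that fan-out in Python. Everything in it is my own invention for illustration; MySpace’s real code ran against SQL Server (and, per the article, was later rewritten in ASP.NET), and the table and function names below are hypothetical stand-ins, not their schema. The point is simply that one profile view triggers several reads, potentially against different databases, and that any in-memory cache has to be kept in step with the constant writes:

    # Illustrative only: the "databases" are dicts standing in for tables that, at
    # MySpace's scale, live in different databases on different servers.
    profiles_db  = {1: {"name": "Tom", "headline": "Thanks for the add!"}}
    friends_db   = {1: [2, 3]}
    summaries_db = {2: "DJ_Kat", 3: "skater_mike"}
    comments_db  = {1: ["first!", "cool page"]}

    page_cache = {}  # naive in-memory cache: user_id -> rendered page

    def render_profile(user_id):
        if user_id in page_cache:                      # cheap path: no database work
            return page_cache[user_id]
        profile  = profiles_db[user_id]                # lookup 1: the user's own row
        friends  = friends_db.get(user_id, [])         # lookup 2: friend links
        cards    = [summaries_db[f] for f in friends]  # one lookup per friend shown
        comments = comments_db.get(user_id, [])        # lookup 3: recent comments
        page = f"{profile['name']} | friends: {', '.join(cards)} | {len(comments)} comments"
        page_cache[user_id] = page
        return page

    def post_comment(user_id, text):
        comments_db.setdefault(user_id, []).append(text)  # the write hits the database...
        page_cache.pop(user_id, None)                     # ...and must evict the cached page

    print(render_profile(1))       # cache miss: hits every "database"
    post_comment(1, "nice music")
    print(render_profile(1))       # rebuilt, because the write invalidated the cache

Multiply that fan-out by tens of millions of members, a large share of them writing as often as they read, and the article’s contrast between a read-mostly news site and MySpace becomes obvious.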

Since the beginning, MySpace.com has operated in ad-hoc fire-fighting mode, evolving its architecture to oil whatever new squeaks presented themselves. In reading this article, I scribbled down notes on the convoluted saga of ad-hoc fixes. Here (reading like the “and then, and then, and then” run-on ramblings of toddlers trying to make sense of an apparently pointless plot) are my notes on what they’ve done to keep their heads above water:

§ first: single database server, with two access/web servers

§ then: handle access/usage growth by throwing more web servers at the problem

§ then: divide database load among a single master database and two access databases holding replicated copies of data posted to the master, plus more database servers and bigger hard disks

§ then: vertical partitioning, with separate databases for the various functions of the MySpace.com service (see the first sketch after this list)

§ then: a storage area network, with a pool of disk storage devices tied together by a specialized high-speed network

§ then: give every database its own copy of the users table

§ then: distributed computing architecture that treats the website as a single app with one user table, split into chunks of 1 million accounts, each chunk living in a separate mirrored instance of SQL Server, and the web/access servers redirecting each login to the applicable database server (see the second sketch after this list)

§ then: rewrite the app in a faster, more efficient computing environment (ASP.NET), re-examining every function for streamlining opportunities

§ then: continually redistribute data across the SAN to reduce I/O imbalances, though it remained a manual process

§ then: virtualized storage architecture where the entire SAN is treated as one big pool of storage capacity, without requiring that specific disks be dedicated to serving specific applications

§ then: add a caching tier that keeps frequently accessed data in memory, so reads don’t always hit the databases (compare the toy cache in the sketch above)

§ then: faster version of the database server running on 64-bit hardware, which can address much more memory and so eases the memory bottleneck

§ then: turn off distributed denial of service protection in order to goose performance further (introducing risk)

§ then: implement backup data centers/SANs tied to different power grids

§ then: lengthen data-commit checkpoint intervals to goose performance (introducing more risk)

§ and always: with each new architectural fix or stopgap, no realistic way to do thorough load/performance testing beforehand, so they simply resign themselves to fixing new problems ad hoc as they spring up
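
Two of those steps are easier to picture with toy code. First, the vertical-partitioning step: each major function of the site gets its own database, so one feature’s load can’t drag the others down. This Python sketch is purely illustrative; the feature names and hostnames are my assumptions, not MySpace’s actual topology:

    # Illustrative only: route each site function to its own database.
    FUNCTION_DATABASES = {
        "profiles": "profiles-db.internal.example",
        "messages": "messages-db.internal.example",
        "blogs":    "blogs-db.internal.example",
    }

    def database_for(function: str) -> str:
        """Return the database host that owns a given site function."""
        return FUNCTION_DATABASES[function]

    assert database_for("messages") == "messages-db.internal.example"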
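
Second, the horizontal split of the single user table into chunks of 1 million accounts, each chunk owned by its own mirrored SQL Server instance. The chunk size comes from the article; the routing arithmetic and server names below are my own guesses at how such a lookup might work, sketched in Python rather than the ASP.NET/SQL Server stack MySpace actually used:

    # Illustrative only: map an account ID to the database server owning its chunk.
    ACCOUNTS_PER_CHUNK = 1_000_000

    CHUNK_SERVERS = {
        0: "userdb-01",   # accounts 1 .. 1,000,000
        1: "userdb-02",   # accounts 1,000,001 .. 2,000,000
        2: "userdb-03",   # and so on, one mirrored server per chunk
    }

    def server_for_account(account_id: int) -> str:
        """Redirect a login (or any per-user query) to the server owning that account."""
        chunk = (account_id - 1) // ACCOUNTS_PER_CHUNK
        return CHUNK_SERVERS[chunk]

    assert server_for_account(1) == "userdb-01"
    assert server_for_account(1_000_000) == "userdb-01"
    assert server_for_account(1_000_001) == "userdb-02"

The appeal of a scheme like this is that growth means adding new chunks and new servers, rather than making one monolithic user table ever bigger.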

Of course, MySpace.com’s teenage customers don’t care, and don’t want to care, about any of this. The service continues to grow smartly, which may be attributed, among many other factors, to the fact that it has yet to cross a “MySpace sucks, let’s leave” threshold. As the article states over and over, MySpace.com continues to experience significant performance and reliability problems, but they’ve never been showstoppers.

How long would it take for a social networking site to slow down and/or crash before it gets abandoned by its users? Are these sites so “group-sticky” that participants will tolerate poor performance for long periods? Are users’ performance/reliability expectations on these services lower than for standard corporate and e-commerce websites?

Or could the life or death of a social networking service, or of any online channel or forum, be driven more by the zeitgeist: fad, fashion, weariness, exhaustion, restlessness, cool new alternatives? Ten years ago, my 9-year-old son created a Digimon website. He’s in college now, and I don’t snoop into his doings, but I suspect that both of my kids have MySpace.com pages (no, I have better things to do than spy on them). Ten years from now, they and their peers will probably have long abandoned whatever social networking services they’re currently using.

They’ll write it off as youthful experimentation. To the extent they’ll still be participating in any online community that resembles today’s social networking services, it’ll be an act of nostalgia, more than anything. From a technical standpoint, it’ll probably be lightning-fast and scalable as can be, the fruit of lessons learned in the ‘00s by MySpace.com and other pioneers in this world of hyper-federated data service layers. And it will probably be far more navigation-friendly, as today’s chaotic MySpace homepage designs (which resemble the overcrowded Web-site home page designs that even big corporations were using in the ‘90s) settle into more consistent, pleasing patterns that everybody accepts without question.

But at that point the messy fun of the ungoverned social-networking frontier will be a distant, and slightly quaint, cultural memory. Like hippie communes in the ’60s.