Pointer to article:
This article is wrong and misleading on several levels. Most fundamentally, it doesn’t go into the broader picture: compact encodings are about shrinking XML’s footprint on network and storage resources, but also about reducing the XML processing overhead that burdens app servers and endpoints.
I published a feature article in Business Communications Review on this very topic last month. With the kind indulgence of Fred Knight, Eric Krapf, and Sandy Borthick, here’s the meat of that piece re compact XML encodings (for the rest of the piece, I refer you to that fine publication; it’s one of those enduring publications that hasn’t changed its format but somehow still feels fresh after all these years, just like my principal publication-host, Network World):
“XML content needn’t always be encoded as plain, bandwidth-hogging ASCII text, though XML’s human-readability appeals to application developers everywhere. One of the most important new approaches is use of improved XML encoding and serialization schemes in lieu of traditional reliance on ASCII plaintext. Binary encodings of XML are generally more compact than text encodings, producing smaller XML document file sizes to be transmitted over networks and stored in databases.
The core standard, XML 1.0, supports alternate approaches for serializing the document elements whose logical data model is described in XML markup syntax. To the extent that XML data models can be serialized to binary encodings, XML becomes a more efficient interchange and storage format.
At a mandatory minimum, all XML processors (the software components that generate and/or parse XML) must be able to read data encoded as Unicode Transformation Format 8 (UTF-8) or UTF-16. Both encodings can represent every Unicode character, so this mandatory feature of the core XML 1.0 standard allows XML documents to contain text in all the world’s character sets. Both UTF-8 and UTF-16 are considered text formats (not binary encodings). For XML documents that consist mostly of ASCII characters, such as English text in the Latin alphabet, UTF-8 is the more compact encoding: it uses 8 bits for each ASCII character and 16 to 32 bits for other characters, whereas UTF-16 uses at least 16 bits for every character. For some non-Latin scripts, such as Chinese, UTF-16 can be the more compact of the two.
In addition, CDATA sections, a standard feature of XML 1.0, may be used to carry unparsed character data within XML documents. However, this approach is fraught with limitations as a way to encapsulate binary data: a CDATA section may contain only legal XML characters, so binary content must first be converted to a text encoding such as base64, which inflates it by roughly one third.
Another approach is to rely on various industry specifications that use an XML-based SOAP message as a manifest for describing binary data files within SOAP’s surrounding HTTP packet. SOAP with Attachments (SwA) and Microsoft’s Direct Internet Message Encapsulation (DIME) transmit opaque, non-textual data, such as images and digital signatures, along with an XML document. However, they don’t support binary encoding of all content within XML documents.
Neither SwA nor DIME has achieved broad adoption within industry. Recognizing the critical need for a consensus standard for compact XML encodings, the World Wide Web Consortium (W3C) has developed new Candidate Recommendations for binary encoding of XML within SOAP 1.2 payloads: SOAP Message Transmission Optimization Mechanism (MTOM) [http://www.w3.org/TR/2004/CR-soap12-mtom-20040826/] and XML-binary Optimized Packaging (XOP) [http://www.w3.org/TR/2004/CR-xop10-20040826/]. In addition, W3C’s XML Binary Characterization Working Group has released the First Public Working Draft of its “XML Binary Characterization Properties” document (http://www.w3.org/TR/2004/WD-xbc-properties-20041005), describing properties desirable for MTOM, XOP, or any other serialization of the XML data model.
MTOM and XOP (which may be considered two halves of a single standard) have much broader vendor support than any predecessor specification for XML-to-binary serialization. MTOM and XOP describe how to produce optimized binary encodings of XML content within SOAP 1.2 payloads. MTOM and XOP preserve one of XML’s great strengths: the transparency of the tagged, logical data structure that a particular document implements. (Note: Where XML encoding schemes are concerned, the terms “optimized” and “efficient” are industry shorthand for “smaller XML file sizes.” We use both terms in that context in this article, as well as such synonyms as “compact” and “small.”)
Structural transparency is what distinguishes XML syntax from most text formats. XML’s tags, attributes, and other markup conventions call out the interrelationships, datatyping, and semantics of its constituent data elements. By calling out a document’s logical structure, XML facilitates fine-grained validation, transformation, and other processing on that document’s data elements by receiving applications. In fact, most XML-based Web services require that the underlying document markup syntax be transparent and self-describing. Web services require that all XML-processing nodes (software and/or hardware-based) be able to parse, validate, and transform all elements within SOAP/XML traffic. Deep content inspection is how XML firewalls and SOAP content routers operate. Take away XML’s logical transparency and you jeopardize Web services interoperability, management, and security.
For any given XML document, MTOM and XOP preserve its transparent logical structure by encoding that structure in a text-based “XML Information Set” manifest, while allowing any of the document’s contents to be serialized to any binary encoding. In particular, these specifications support binary encoding of XML content as Multipurpose Internet Mail Extensions (MIME) Multipart/Related body parts and encapsulation of those parts, along with the associated XML Information Set manifest, within SOAP 1.2 envelopes. The specifications also describe how to encapsulate binary-encoded XML body parts directly within HTTP packets (in cases where SOAP doesn’t enter the equation), thereby reducing the size of XML files for transmission and/or storage.”
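For readers who want to see the size argument in concrete terms, here is a rough Python sketch of the packaging idea described in that excerpt. It is not a conforming MTOM/XOP implementation. The element names (doc, photo), the boundary string, the Content-ID, and the function names are all made up for illustration, and the xop:Include namespace URI is my assumption about what the spec uses. The sketch compares a text-only XML document that carries its binary content inline as base64 against a MIME Multipart/Related style package in which the XML manifest merely points at a raw binary body part.

```python
import base64

# Assumed namespace for the xop:Include element (check against the spec).
XOP_NS = "http://www.w3.org/2004/08/xop/include"


def inline_base64(payload: bytes) -> bytes:
    """Text-only XML: the binary content is base64-encoded inside the element."""
    b64 = base64.b64encode(payload).decode("ascii")
    return f"<doc><photo>{b64}</photo></doc>".encode("utf-8")


def xop_style_package(payload: bytes,
                      boundary: str = "example-boundary",
                      cid: str = "photo@example.org") -> bytes:
    """Multipart/Related style package: XML manifest plus a raw binary part.

    Hypothetical, simplified layout. A real implementation would also pick a
    boundary that cannot occur in the payload and emit the outer headers.
    """
    # The manifest keeps the document's logical structure as readable XML and
    # replaces the binary content with a reference to another body part.
    manifest = (
        f'<doc><photo><xop:Include xmlns:xop="{XOP_NS}" '
        f'href="cid:{cid}"/></photo></doc>'
    ).encode("utf-8")
    parts = [
        f"--{boundary}\r\n".encode("ascii"),
        b'Content-Type: application/xop+xml; charset=UTF-8; type="text/xml"\r\n',
        b"Content-Transfer-Encoding: 8bit\r\n\r\n",
        manifest, b"\r\n",
        f"--{boundary}\r\n".encode("ascii"),
        b"Content-Type: application/octet-stream\r\n",
        b"Content-Transfer-Encoding: binary\r\n",
        f"Content-ID: <{cid}>\r\n\r\n".encode("ascii"),
        payload, b"\r\n",  # raw bytes, no base64 inflation
        f"--{boundary}--\r\n".encode("ascii"),
    ]
    return b"".join(parts)


if __name__ == "__main__":
    sample = bytes(range(256)) * 40  # 10 KB of stand-in binary data
    print("inline base64 XML:", len(inline_base64(sample)), "bytes")
    print("XOP-style package:", len(xop_style_package(sample)), "bytes")
```

The fixed overhead of the MIME headers means the package only wins once the binary content is more than a few hundred bytes; for larger content the roughly one-third base64 penalty dominates, and the manifest stays inspectable by intermediaries either way.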
Oh...almost forgot...my Network World column on the same topic was published this week (long lead times on these publications, folks--half the time I forget what's in the pipeline--much has changed in my life since I wrote these pieces in October-November--pardon me for not contributing any additional insights on the blog item--I'm distracted by any number of tasks concerning finding a new job, working through some tech issues with my local phone company, and helping my son apply to colleges--he's going to be an actor, or so he hopes--he's a funny talented guy--handsome dude too).
Note that the print version of the story has my current byline correct, but the editor of the online version still has the old byline (they need to get their internal workflow straightened out on synchronizing these sorts of things).
I'm an independent IT industry analyst currently looking for a new position. I'm no longer employed by that firm whose name is in the obsolete byline. So take note. Call me at 703-924-6224 or e-mail me at email@example.com.
Let's talk. You'll find that I'm easy to speak and work with. I'd prefer if you actually came out here to chat live in person. I miss that. I hate working in a dungeon surrounded by disembodied voices and words on the wire. It's much better to have human give-and-take, as opposed to take-and-take.