Monday, January 17, 2005

fyi Semantic Web Ontologies: What Works and What Doesn't


Pointer to article:

Kobielus kommentary:
This is a great interview with Peter Norvig, Google’s director of search quality. The core problem a search engine faces puts the semantic web question in stark relief. What do you trust more: what a webpage declares about itself (the metadata) or the information contained in that page?

I wholeheartedly agree with this statement toward the end of the article:

“You can't trust the metadata. You can't trust what people are going to say. In general, search engines have turned away from metadata, and they try to hone in more on what's exactly perceivable to the user. For the most part we throw away the meta tags, unless there's a good reason to believe them, because they tend to be more deceptive than they are helpful. And the more there's a marketplace in which people can make money off of this deception, the more it's going to happen. Humans are very good at detecting this kind of spam, and machines aren't necessarily that good. So if more of the information flows between machines, this is something you're going to have to look out for more and more.”

That’s right. You can’t necessarily trust self-assertions. Where there’s money to be made, careers to be advanced, jobs to be landed, and e-commerce customers to be lured, people will lie. Where reputations are made or lost, people will lie. Sometimes, they’ll just lie for the hell of it. Sometimes, they’ll lie and not realize they’re lying. They may be passing on deceptive information without ever realizing it’s wrong: they’re fooling themselves and, if you buy what they’re selling, they’ve fooled you too.
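To make Norvig's distinction concrete, here is a minimal Python sketch of checking a page's self-declared keywords against the text a reader would actually see. It is my own illustration, not how Google or any search engine works: the names PageScanner and audit_keywords are hypothetical, and the rule "a keyword counts as supported only if every word in it appears in the visible text" is a deliberately crude assumption.

```python
import re
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects declared meta keywords and the words in the page text.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.declared = set()       # phrases from <meta name="keywords">
        self.visible_words = set()  # words found outside <script>/<style>
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            for phrase in content.split(","):
                phrase = phrase.strip().lower()
                if phrase:
                    self.declared.add(phrase)
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.visible_words.update(re.findall(r"[a-z0-9]+", data.lower()))


def audit_keywords(html):
    """Split declared keywords into (supported, unsupported) by the page text."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    supported = set()
    for phrase in scanner.declared:
        words = re.findall(r"[a-z0-9]+", phrase)
        # Assumption: a phrase is "supported" only if all its words are visible.
        if words and all(w in scanner.visible_words for w in words):
            supported.add(phrase)
    return supported, scanner.declared - supported


SAMPLE = """<html><head>
<meta name="keywords" content="semantic web, cheap flights, ontology">
</head><body><p>Notes on the semantic web and ontology design.</p></body></html>"""

supported, unsupported = audit_keywords(SAMPLE)
print("supported:", sorted(supported))
print("unsupported:", sorted(unsupported))
```

In the sample page, "cheap flights" is declared in the meta tag but never shows up in what the reader sees, so it lands in the unsupported set. A real spam detector would need far more than word matching, which is rather the point of the quote.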

Think about resumes. They’re self-assertions, hence metadata, about the human object. Who can totally trust their truthfulness, completeness, or freshness? You can't trust all people to have the self-knowledge and self-description abilities that make for a good, solid, useful resume. You can't expect people to write resumes that give every potential employer precisely the information they're looking for, which is why employers often need to fill in the blanks through a live interview. You certainly can’t expect people all over the earth to converge on a single resume format so that potential employers can compare everybody apples-to-apples. Sometimes it’s better to trust what others (i.e., references) say about us, but they can lie too (especially if they’re friends of the person, and have been put up to it).

Think about the sheer volume, variety, and dynamicity of information and other resources on the Internet. Even if everybody tagged their metadata with OWL, RDF, or some other common schema, how can you trust those assertions?

In the broadest sense, the “semantic web” refers to the master “description layer” that informs the Internet and all networked interactions. We can describe semantics generally as referring to shared understandings of the “meaning” of various entities within a distributed environment. The entities to which one might ascribe “meaning” include services, messages, resources, applications, and protocols. An entity’s “meaning” might be described by any or all of the following characteristics: structure, behavior, function, operation, scope, context, reference, use, goal state, and implied processing.

Clearly, in the broadest sense, any WS-* specification in any layer describes some aspect of a service’s semantics: usually the structures, behaviors, scope, context, reference, and processing of data, messages, and interactions within a specific functional sphere (such as identification, messaging, security, reliable messaging, and publish-and-subscribe). WSDL is a semantic/description syntax for describing Web services; XML Schema is one for XML vocabularies; WS-Policy is one for policies; and so on and so forth.

But the industry narrowly refers to a separate, much smaller set of specifications as specifically addressing “semantics”: OWL and RDF (both W3C specifications, not part of WS-*), and a welter of others that have achieved minimal adoption, as this article states. Typically, “semantics specifications” provide frameworks for defining the scope, reference, use, goal state, and implied processing of a particular resource type: tagged XML data.

The “semantic web” concept can’t even begin to lift off until there’s general recognition that all description markup syntaxes need to be included in this framework. And until there’s a federated registry of all metadata descriptions of all networked resources, relationships, and other entities. And until there’s a trust infrastructure that helps the “relying party” to assess the quality/accuracy of any given piece of asserted metadata on any entity. And until there’s a search infrastructure that helps us to locate, aggregate, assemble, and present all that metadata (and assessments thereof) to human beings for their evaluation.
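As a thought experiment, here is one way a "relying party" might weigh conflicting metadata assertions before handing them to a person. This is a toy sketch under my own assumptions, not a description of any existing trust infrastructure: the Assertion record, the assess function, the per-source trust scores, and the example sources are all hypothetical, and the scoring rule (each value's share of the total trust behind it) is just one simple choice.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One piece of asserted metadata: who says what about which resource."""
    subject: str  # the resource being described
    prop: str     # the property being asserted
    value: str    # the asserted value
    source: str   # who made the assertion


def assess(assertions, trust, default_trust=0.1):
    """Return, per (subject, prop), each value's share of the total trust.

    `trust` maps a source name to a weight; unknown sources get
    `default_trust`. Both are assumptions made for illustration.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for a in assertions:
        totals[(a.subject, a.prop)][a.value] += trust.get(a.source, default_trust)

    report = {}
    for key, by_value in totals.items():
        weight_sum = sum(by_value.values())
        if weight_sum > 0:
            report[key] = {v: w / weight_sum for v, w in by_value.items()}
        else:
            report[key] = {}
    return report


# Hypothetical data: the page describes itself one way, two outside
# reviewers describe it another way.
assertions = [
    Assertion("example.com/deals", "topic", "travel deals", "example.com"),
    Assertion("example.com/deals", "topic", "link farm", "reviewer-a"),
    Assertion("example.com/deals", "topic", "link farm", "reviewer-b"),
]
trust = {"example.com": 0.2, "reviewer-a": 0.9, "reviewer-b": 0.7}

report = assess(assertions, trust)
shares = report[("example.com/deals", "topic")]
for value, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{value}: {share:.2f}")
```

Note that the sketch only ranks the competing claims. It does not decide which one is true; that judgment is still left to the human being reading the report.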

Because, as Norvig says, there are things that people catch that machines totally miss. People are the final arbiters of "meaning." If it isn't meaningful to a human being, it isn't meaningful, period.