ReadWriteWeb

A New Commercial Ontology from Hakia

Written by RWW Sponsor / July 30, 2009 5:00 AM / 6 Comments

Editor's note: we offer our long-term sponsors the opportunity to write 'Sponsor Posts' and tell their story. These posts are clearly marked as written by sponsors, but we also want them to be useful and interesting to our readers. We hope you like the posts and we encourage you to support our sponsors by trying out their products.

We at Hakia are proud to announce our upcoming commercial ontology, perhaps the world's first. What is a commercial ontology? If you're asking this question you have just touched on an important distinction: fantasy versus reality. In the context of the Web, a commercial ontology is a realistic version of an ontology, as we explain below.

Realities of the Web

Hakia has accomplished two important innovations in building its commercial ontology (CO). The first is the development of concepts and lexicons that follow strict guidelines based on the realities of Web operations. What are these realities? Most search queries on the Web reflect a single dimension of intent, almost exclusively relevant to commercial topics. "Commercial topics" here must be taken in the broadest sense possible. For example, if you were looking for "the benefits of foot massage" or "the director of the movie Last Emperor," your queries would fall into a commercial pattern. One particular distinction of commercial-pattern queries is that they come in short packages, either including a name (onomasticon) or referring to something sold, bought, watched, heard, etc.

In contrast, many (if not all) ontologies that have been built to date (or claimed to exist) are focused on the use of language in the general sense, but not in the sense of commercial patterns on the Web. Therefore, their usefulness when tackling Web search queries is greatly compromised, sometimes to the point of absolute failure. If such an ontology could disambiguate a dozen different senses of the word "kill," it would be sad news if the last 100,000 queries in the search logs did not include a single occurrence of the word "kill." Like drowning in two-inch-deep water, such ontologies do not use their disambiguation capacities for nearly 80% of queries because the queries include nothing but onomasticons or are too short (under-articulated).
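The coverage argument above can be made concrete with a small Python sketch. It is only an illustration: the function name `coverage` and the four sample queries are assumptions made up for this example, not Hakia data or code. It measures what fraction of a query log contains at least one word that a given lexicon knows how to disambiguate.

```python
def coverage(queries, vocabulary):
    """Return the fraction of queries containing at least one vocabulary word."""
    if not queries:
        return 0.0
    vocab = {word.lower() for word in vocabulary}
    # A query counts as covered if it shares any word with the vocabulary.
    hits = sum(1 for q in queries if vocab & set(q.lower().split()))
    return hits / len(queries)


# Hypothetical query log, invented for illustration only.
sample_log = [
    "last emperor director",
    "benefits of foot massage",
    "digital camera",
    "road kill",
]

print(coverage(sample_log, {"kill"}))
print(coverage(sample_log, {"murder"}))
```

A lexicon whose coverage on real logs is close to zero contributes little, however many senses it can tell apart.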

The Sequence Approach

The second innovation in the CO is the use of word sequences instead of single words. A single word, like "kill," is the most ambiguous state of information and is hardly used in human communication without a strong implied context. As a result, building natural-language processing (NLP) systems by taking individual words as units of computation is an invitation to disaster.

In contrast, word sequences (two or more words) are inherently safe and highly descriptive. Take "road kill," for example. This sequence describes the corpse of an animal killed on the road by a passing vehicle. If a language processing system takes the sequence of words as a unit of computation, 99% of the ambiguity problem vanishes. There is no need to process the words "kill" and "road" separately, trace their senses, and locate convergence to identify the meaning of "road kill" if you can just take the sequence "road kill" itself as your unit of computation for mapping. This is depicted below:

Note the number of traces required in a conventional ontology approach compared to the sequence approach. The sequence approach requires a lot of data storage space (which is dirt cheap), whereas the conventional ontology approach requires a lot of CPU time for a simple mapping task (which is expensive). But the bad news does not stop there. The trace routes in a conventional ontology require manual work (impossible to automate), whereas a sequence-based ontology can be easily built via automation.
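To make the idea of "the sequence as the unit of computation" more tangible, here is a minimal Python sketch. It is not Hakia's implementation: the table `SEQUENCE_TABLE`, its entries, and the function `map_sequences` are all hypothetical, and greedy longest-match lookup is simply one straightforward way to prefer a known multi-word sequence over its individual words.

```python
# Hypothetical sequence table; every entry is an assumption for illustration.
SEQUENCE_TABLE = {
    ("road", "kill"): "animal killed on a road by a passing vehicle",
    ("road",): "ambiguous single word",
    ("kill",): "ambiguous single word",
}
MAX_LEN = max(len(key) for key in SEQUENCE_TABLE)


def map_sequences(query):
    """Map a query to known sequences, trying the longest sequence first."""
    words = query.lower().split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest candidate starting at position i, then shorter ones.
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in SEQUENCE_TABLE:
                result.append((" ".join(key), SEQUENCE_TABLE[key]))
                i += n
                break
        else:
            # No known sequence starts here; keep the word unmapped.
            result.append((words[i], None))
            i += 1
    return result


print(map_sequences("road kill"))
```

In this sketch "road kill" resolves with a single table lookup, so the separate senses of "road" and "kill" are never consulted. The cost is a larger table, which matches the storage-versus-CPU trade-off described above.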

Perhaps not everyone will understand the second point above. Nevertheless, the scalability and performance of the end product will speak for themselves when Hakia puts the testing platform online.

Usage of the Commercial Ontology

The immediate use of the CO is for search queries or document characterizations that conventional systems do not tie to any advertising. This unrecognized domain of search queries and characterizations means loss of revenue. Hakia's CO is designed to fill this gap. For example, if the search query or page characterization is "beat generation," the CO can map it to "literature" on the fly. As a result, systems using the CO will have a much deeper understanding of the incoming terms, and thus will be able to recognize the underlying intent beyond the face value of the words. The same capability can be used in a number of places other than advertising with the same effect.
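The "beat generation" example can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the CO itself: the `CATEGORY_OF` table and the `categorize` function are hypothetical, and only the "beat generation" to "literature" pair comes from this post.

```python
# Hypothetical sequence-to-category table.
CATEGORY_OF = {
    "beat generation": "literature",           # example taken from the post
    "digital camera": "consumer electronics",  # assumed entry for illustration
}


def categorize(text):
    """Return the category of the longest known word sequence in text, or None."""
    words = text.lower().split()
    # Check longer spans before shorter ones so the most specific match wins.
    for n in range(len(words), 0, -1):
        for start in range(len(words) - n + 1):
            span = " ".join(words[start:start + n])
            if span in CATEGORY_OF:
                return CATEGORY_OF[span]
    return None


print(categorize("beat generation"))
print(categorize("Beat Generation poets"))
print(categorize("foot massage"))
```

A system placing ads or characterizing pages could use the returned category where the raw words "beat" and "generation" alone would give it nothing to work with.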

Stay tuned for the release of the first version of Hakia's commercial ontology.


Comments


  1. Just some constructive criticism: if you're going to use an unconventional term like 'Commercial Ontology' in a post pumping one of your products, you need to explain, CLEARLY, what that means right away. "In the context of the Web, a commercial ontology is a realistic version of an ontology, as we explain below" is not an explanation--you used the word itself in the definition! I stopped reading this post about six sentences in because at that point I realized I had gained no substantive information from what sounds like fluffy, pretentious double-speak. I'd yank this post and re-write it if I were you. Don't mean to be an a-hole, but I suspect I won't be alone on this.

    Posted by: Tom_Fishman | July 30, 2009 6:41 AM



  2. I agree with Tom, except I did slog through the whole article. The technology sounds vaguely interesting, but I have no idea what it actually is or how it can be used. It sounds like you've got an advanced concept-search method, or an improved method for analyzing search term trends for marketing purposes. Am I close?

    Posted by: chris.spizzirri.myopenid.com | July 30, 2009 7:39 AM



  3. Sounds very cool. Can you tell a little more?
    My understanding is that you built a large dictionary of possibly multi-word terms, categorized into various meanings, with some disambiguation mechanism. Am I right?

    Posted by: Elad Kehat | July 30, 2009 9:43 AM



  4. Responding to #3 above...
    Yes, you are right. However, it is not a dictionary; it is a concept-based ontology. Disambiguation is handled both at the data level (via sequences) and at the ontology level (where senses of sequences converge).

    Responding to #2 above...
    The ontology is built based on the commercial value of the concepts. The concept of a digital camera may be more important than the concept of German opera in the commercial world, thus the former gets more refinement and detail in its ontological definition and lexicon space.

    Responding to #1 above...
    Thanks for the constructive criticism. We will consider the readers with the attention span of a Twitter window next time. You've got a point.

    Posted by: Riza | July 30, 2009 2:27 PM



  5. Completely agree with Tom (#1), and you shouldn't be so fast to dismiss him. It's not an issue of "Twitter window attention span" but one of clear and precise language.

    And btw, multi-word identifiers for concepts are not really a new idea; we did this in 2002 as part of the KAON system, and I'm quite sure we weren't the first. If your sequence approach is something different from multiple (possibly multi-word) synonyms, I did not see it in the explanation.

    Anyway - looking forward to a chance to actually try this out.

    Posted by: Valentin | August 9, 2009 1:44 PM



  6. Hi Tom...
    I went through the whole article. It sounds interesting to me and I am trying to understand it fully.
    Please stay connected. Thanks for the post.

    Posted by: firewire | September 19, 2009 12:23 AM


