This week we reported that Cognition had announced "the largest commercially available Semantic Map of the English language." In our interview with Cognition CEO Scott Jarus, we asked him to compare Cognition's technologies to those of other semantic search companies Hakia and Powerset. Jarus pointed to Cognition's large Semantic Map as the main differentiator. Indeed, he told us that semantic search companies "must include a comprehensive semantic map" to be successful.
Is this true? We sought a response from both Hakia and Microsoft-owned Powerset on this semantically charged question.
Cognition claims that its Semantic Map has over 10 million semantic connections, including "over 4 million semantic contexts (word meanings that create contexts for specific meanings of other related words)".
Hakia CEO Riza C. Berkan responded in the comments to the original article that "hakia is deploying Ontological Semantics (OntoSem)", which he described as "a network of concepts reflecting ontology." He went on to say that hakia covers "over [a] million words in English".
However, Berkan noted that the size of a Semantic Map does not necessarily matter: "the sheer size of the collection of words or concepts does not represent, by any means, the capability of the system." Hakia's position is that "there is no silver bullet for a semantic solution that will succeed", as long as the system developed is scalable and imposes "minimum reliance on 'words'".
Semantopoly: Advance token to nearest Semantic Context
At this point we were still confused. Cognition uses the term "semantic map" and said it was necessary to have. One of the commenters on the original post agreed with that assumption. Yet Hakia's Riza Berkan didn't use the term "semantic map". So we asked Hakia in a follow-up email, does it or does it not have a semantic map? Dr. Christian Hempelmann, Hakia's Chief Scientific Officer, responded:
"The term sometimes comes up in the context of data integration, but "Semantic map" is not a term used in linguistics. I can only speculate that it is what is commonly called an ontology. To the degree that they let us on about it in the documentation on their website, Cognition operates with only 2 main relations, much like WordNet: hyperonymy/hyponymy (e.g. cat is-a feline is-a mammal; their "taxonomy") and synonymy (e.g., "buy" means almost the same as "purchase"; their "thesaurus"). Furthermore, this map is not independent of English, cannot grow into other languages. hakia, on the other hand, has an ontology with many more relations, effectively raising our "semantic map" to the size of a higher power, and can and is already growing into other languages."
We also tried to get a comment from Powerset, but as of writing we haven't received it.
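To make the two relations in Dr. Hempelmann's description concrete, here is a minimal Python sketch of a map that has only a taxonomy (is-a) and a thesaurus (synonymy). The entries come from his own examples; the names `IS_A`, `SYNONYMS`, `hypernym_chain`, `is_a`, and `are_synonyms` are illustrative assumptions, not Cognition's or hakia's actual data structures.

```python
# Toy data built from the examples in the quote above.
# This is an illustration only, not any vendor's real resource.
IS_A = {"cat": "feline", "feline": "mammal"}  # hyponym -> hypernym ("taxonomy")
SYNONYMS = [{"buy", "purchase"}]              # synonym groups ("thesaurus")


def hypernym_chain(word):
    """Return the list of increasingly general terms above `word`."""
    chain = []
    while word in IS_A:  # the toy taxonomy has no cycles
        word = IS_A[word]
        chain.append(word)
    return chain


def is_a(word, ancestor):
    """True if `ancestor` appears anywhere above `word` in the taxonomy."""
    return ancestor in hypernym_chain(word)


def are_synonyms(a, b):
    """True if both words appear in the same synonym group."""
    return any(a in group and b in group for group in SYNONYMS)
```

A map like this can answer "is a cat a mammal?" or "does buy match purchase?", but it has nothing to say about parts, causes, or contexts. Whether that accurately describes Cognition's product is exactly what the two companies dispute.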
So, are we all clearer now on what is a Semantic Map, is it needed, and does size matter? Er, it depends. If you think you know the answers, tell us in the comments please!
Comments
Richard,
Thank you for keeping this great discussion alive about semantics and Semantic Search. We would like to respond in a more thorough way next week, but for now, I’d like to comment on a few inaccuracies made by Hakia.com’s CEO, Riza Berkan, and Dr. Christian Hempelmann, Hakia's Chief Scientific Officer, in their recent comments to you and to your original story about Cognition’s Semantic Map.
Dr. Berkan notes that the sheer size of the Semantic Map doesn't matter, and that the capability of the system is what counts. We agree, which is why we stated in our announcement that Cognition's Semantic Map involves hundreds of types of relationships, not just ontology and synonymy. Those relationships include:
stem/sense, stem/morphology, sense/syntactic properties, hyponymy/hypernymy (of senses, not words), synonymy (of senses, not words), function, has-part (e.g. “hand” has-part “finger”), location, etc. (for noun senses), causes, consequences, etc. (for verb senses), selectional restrictions (for verb, noun and adjective senses), sense contexts (this sense is found in the context of this other sense) and many more.
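As a rough illustration of what "many relation types, attached to senses rather than words" could look like in code, here is a minimal Python sketch that stores typed relations as triples. Only the `has-part` entry comes from the list above; the sense labels (such as `hand/noun-1`), the other triples, and the function names are hypothetical and do not reflect Cognition's actual implementation.

```python
from collections import defaultdict

# (source sense, relation type, target sense)
TRIPLES = [
    ("hand/noun-1", "has-part", "finger/noun-1"),     # example from the list above
    ("hand/noun-1", "hypernym", "body part/noun-1"),  # assumed for illustration
    ("buy/verb-1", "synonym", "purchase/verb-1"),     # assumed for illustration
]


def build_index(triples):
    """Index triples as sense -> relation type -> set of target senses."""
    index = defaultdict(lambda: defaultdict(set))
    for source, relation, target in triples:
        index[source][relation].add(target)
    return index


def related(index, sense, relation):
    """Return the senses linked to `sense` by `relation` (empty set if none)."""
    return set(index.get(sense, {}).get(relation, set()))


INDEX = build_index(TRIPLES)
```

Keying on senses rather than words means that "hand" as a body part and "hand" as a unit of measure would be separate nodes, each with its own relations.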
Dr. Hempelmann is concerned that Cognition's Semantic Map is English only. The map itself and the concepts in it are language independent and work for any human language. The individual senses are, of course, language-dependent.
The foundation of Cognition's Semantic Map is the word sense. Cognition has 536,000 English senses, as opposed to Hakia's 100,000. To put an end to Dr. Berkan’s speculation, we didn’t hack the servers to get that number, we just read Dr. Berkan’s own published comments.
Dr. Berkan was taken aback by the high-level, simple diagram included in the original RWW story. He thought that the picture implied non-modularity in Cognition's Semantic Map. This picture was a very simple example only, showing some of the different kinds of relationships senses participate in. Cognition's Semantic Map is fully modular.
As for Dr. Hempelmann's comment, “…‘Semantic map’ is not a term used in linguistics”, all I can say is: do a Google search for “linguistics semantic map” and see what you get.
Thanks again, Richard, for covering an important emerging area of technology. And thank you Drs. Berkan and Hempelmann for your work in this area and for actively engaging in the forward progress of Semantic technologies.
Scott Jarus, CEO, Cognition Technologies (www.cognition.com)
Posted by: Scott Jarus | September 19, 2008 5:54 PM
I - human - reflecting - aggregating - appreciating - Guten Morgen - thx 4 tracking keep - as of johnson
Posted by: Steve | September 19, 2008 8:17 PM
Most literary men and women would use 5,000 to 10,000 words in daily speech. The most sophisticated novel, like War and Peace, would use 60,000 unique words. Beyond these limits, claiming 536,000 word senses means that these extra words or word senses are totally prehistoric (like Old English). Otherwise, they are partly onomasticon, or unstemmed versions of the same words. Or you could stick all of Latin in it for medical terms. There are not 536,000 word senses in English unless you are counting these extras.
If you count this way, hakia's resources are over a million, and keep increasing with the addition of newly acquired words in the lexicon.
Semantic Map is not a term in linguistics, academically speaking. And a linguist like Dr. Hempelmann does not have to check Google (what a bad source of reference anyway) for something he dedicated his life to. Yes, you will find anything in Google, mostly not credible and unsubstantiated information. Isn't this what we are working against collectively anyway?
But if you insist on using the term Semantic Map, be our guest. There is no law against it, and there is no harm either. It sounds good too.
Size does not matter, because any semantic system will have to cover all the word senses in a given language; this is the default requirement. Anything less is incomprehensible. What matters is how the system produces a text meaning representation (TMR) outlining the agent, instrument, event, etc., using the correct senses, and how it uses concept relationships to produce output in a scalable manner.
Conclusion: a semantic system is not a semantic system if it does not cover all the word senses, if it does not do a full morphological treatment, if it cannot extract clauses, etc. These are not measurements of capacity. These are default requirements.
If Cognition built a language-independent ontology, that is a very good sign, and a step in the right direction.
Measuring who has more horse-power can only be done by the success of the corresponding end-products, and by the users of these products. Until then, we should all work on spreading the awareness of this technology instead of measuring who has more marbles.
Best wishes to you all.
Posted by: Riza Berkan | September 19, 2008 10:52 PM
The posturing is a little silly; both companies have biased opinions on their take on the tech, of course. The size argument is similar to the pages-indexed pissing contest the big search engines had a while ago, and cuil with its bigger-than-you statement is actually pretty close.
The important bit, as it was with the search engines, is how you use your tech and how you build a business.
eg cuil might have more docs in the index (supposedly) but google have got adsense and a killer business model.
The only one of these companies that has a published business model is Cognition, and that business model is to sell the bit about them that's unique... so I'm confused as to how that's a long-term strategy.
Posted by: Ronald Hobbs | September 20, 2008 4:34 AM
The Oxford English Dictionary is widely considered to be the definitive source of the English language. The Oxford English Dictionary admits to 291,500 entries and that includes 47,100 entries for obsolete words. The OED admits to 615,100 word forms.
A larger vocabulary and broader worldview apparently make people more intelligent, but cultural literacy can be achieved with a vocabulary of fewer than 10,000 words.
For this reason, I am not sure that having a larger vocabulary, say one larger than 25,000 words, has any significance for improved recall and precision in computer search engines. It is more a matter of having the right words and the capability to induce information from them. None of the NLP-based search engines have demonstrated a capability for logically inducing meaning from the words they claim to be mapping semantically.
Let me be more explicit. While people are capable of realizing the import or significance of the structures of terms in a sentence, NLP algorithms are not. Computational NLP systems can look up the definitions of terms in a lexical dictionary, and they can disambiguate some word senses and list synonyms in a thesaurus or classes from an ontology. But they cannot induce synonyms or any other form of meaning from word roots, even though their claims imply that they can.
Consider the elements and dimensions of meaning in these examples:
Why do people fear the government is useless?
Why do people feel the government is useless?
None of the NLP-driven search engines can adequately address the meaning of either of these questions for several reasons, not the least of which is that they simply do not understand either question.
Without understanding the import of the verbs in the example sentences, the so-called "semantic" search engine is unable to induce the (semantic) relations to uncertainty, apprehension, agitation or anxiety from the word 'fear', and the NLP-based search engine is thereby unable to locate relevant articles about contemporary problems (causing fear in citizens) such as: failures of regulatory oversight, legal measures eroding liberty, and imposition of unlawful wars on its citizens, among many other relevant causes of fearing a useless government --as protector-- in the face of an increasingly hostile environment.
In the case of the other verb 'feel' the NLP-based search engine cannot locate relevant articles about government failures to avoid or to respond to natural and man-made catastrophes and other social calamities that are part of current events and definitely posted about on the internet.
In fact, it seems the verbs don't make much of a difference at all to these so-called semantic search engines. So my question is: Where's the semantics in the so-called semantic maps?
-Ken Ewell
Posted by: Ken Ewell | September 20, 2008 5:44 PM
Great point by Ken. Another query "How old was John F Kennedy when he died?"
Semantics is not merely about word senses and disambiguation, but also about relations and their interplay. And encoding all that "together" is a difficult problem, which none of the current companies have solved. As a person working in a similar space, I feel that other companies are creating unnecessary hype, making tall claims and throwing jargon around, the reality being that their products hardly "deliver".
We are only turning consumers away by doing this. Focus on delivery, create something really useful, consumers will surely come.
Posted by: 42 | September 20, 2008 11:43 PM
Don't many words have multiple meanings? It seems to me that the higher the word knowledge the better as that would show a more complete mastery of a language.
The academics can discuss the Olde English and definitive dictionaries that have a set number of words, but I'd prefer an NLP system that understands all the meanings of those dictionary entries. That one sounds like it can build a business by licensing "the bit about them that's unique."
Why don't you all let each company tout itself (as that's what companies do) and stop weighing in with your uneducated opinions about each other's companies. Everyone should stick to what they know best -- their own technology.
Since I have no horse in the race, I didn't let my arrogance get in the way of searching on Google for "linguistics semantic map" as suggested above. It has references to quite a few academic links discussing semantic maps in the field of linguistics. It sounds to me like some linguists, like Hakia's, don't like semantic maps. That seems fine, but why jump on linguists who see their value?
Posted by: Anthony C. | September 21, 2008 6:25 AM
There are a lot of opinions expressed here and I'm glad that this dialog is open. Powerset's opinion is relatively simple:
Building a semantic search engine requires many different components, and a semantic map is just one of them.
Powerset employs over 20 PhDs using techniques from computational linguistics, machine learning, AI, and computer science to create our linguistic core. Powerset was fortunate to license the XLE from PARC, the world’s most advanced parsing engine, but that’s by no means the only part of our technology stack. We pick-and-choose among the best components, whether they come from our proprietary resources, community resources, or 3rd parties.
Of course, the best way to test out any company's technology is to use it for yourself and decide whether we make your life easier!
-mark {powerset program manager}
Posted by: Mark Johnson | September 22, 2008 5:32 PM
You don't need a semantic map to do semantic search. Ontologies/semantic maps capture only the words and relationships that their designers think are important. It does not matter much whether these are part-whole relationships or more complex relations.
Dictionaries represent the formal words of the language. They are not the definitive source for the words that people actually use. Lexicographers argue whether a particular word is worthy of inclusion in every upcoming edition. In addition to the formal words, there are slang expressions, jargon terms, proper nouns, brand names, initialisms, acronyms, and neologisms. According to some sources, for example, 10% of Shakespeare's vocabulary was newly coined for his work. An analysis of the English Wikipedia, for example, yields about 13 million unique items. We think that this is an over-estimate of the number of words, but even if it is off by a factor of 10, that would still leave 1.3 million unique words (including variations), many of which will not be in the OED.
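A rough sketch of the kind of count described above, together with the factor-of-10 arithmetic. The comment does not say how the Wikipedia analysis tokenized text, so the regular expression, the `unique_items` helper, and the sample sentence are assumptions for illustration only; the 13 million figure is taken from the text.

```python
import re


def unique_items(text):
    """Return the set of distinct lowercase tokens in `text`.

    A token here is a run of letters, digits, or apostrophes. This is one
    simple choice among many and is an assumption, not the original method.
    """
    return set(re.findall(r"[a-z0-9']+", text.lower()))


sample = "Dictionaries follow the language; they don't lead the language."
vocab = unique_items(sample)  # repeated words such as "the" count once

# Figures from the comment: about 13 million unique items, discounted by 10.
WIKIPEDIA_UNIQUE_ITEMS = 13_000_000
OVERESTIMATE_FACTOR = 10
conservative_estimate = WIKIPEDIA_UNIQUE_ITEMS // OVERESTIMATE_FACTOR
```

Counting this way includes misspellings, names, and inflected variants, which is why the raw figure is treated as an over-estimate before dividing it down to 1.3 million.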
Dictionaries follow the language that is in actual use, they don't lead it. If the only words we could search on were properly spelled and in the dictionary, I think that we would have a very poor search experience.
There are other ways to get at the semantics of a text.
When a child learns a new word, she does not memorize part of an ontology. Rather, she learns the meaning of the word from the context in which it is used. There can actually be an infinite number of contexts in the sense that there are likely to be gradations between them. The distinction between contexts, like the distinction between word meanings is almost always fuzzy.
A given word or object can belong to an infinite number of categories. For example, what categories does a basketball belong in? Round things, bouncy things, brown things, etc. How long is this list? Do you ever reach a point where no one could add another category to it? Things with tiny dimples? Things that my brother hates? Things that Barack Obama likes? Things that float? For this last one, imagine that you are on a sinking ship, now that category is important and obvious.
Even if you know the dictionary definition of a word, and thus its "sense," you may not know enough to return relevant search results. WordNet lists two senses for the word "basketball" (one for the game and one for the ball) and three for the word "carbon" (~ element, ~ paper, and ~ copy). (There may be other senses as well.) A search of YAHOO! or Google for "carbon" turns up pages consistent with the element sense. A person looking for information from an environment and sustainability perspective would not be interested in most of these, but would be interested in pages talking about carbon credits and the like (see www.truevert.com), even though it is the same dictionary sense in both cases.
There is no doubt that Cognition, Powerset, and Hakia have invested heavily in developing their respective semantic representations. All seem to improve the quality of search results one can get, but whether you've spent 20 years, engaged 20 Ph.D.s, or taken a month to process 2.5 million documents, the real question, I think, is whether such efforts are necessary to achieve comparable results. I think that there is a simpler way that gives results that are at least comparable with much less effort and expense.
Posted by: Herbert L Roitblat | October 7, 2008 3:33 PM