By Alex Iskold
We've been writing recently about the rise of the semantic web and how in 2007 we'll see many interesting semantic technologies. The fundamental problem that all these technologies need to solve is explaining the meaning of things to computers. There are several approaches to this, all of which can work in principle.
There are companies and technologies that are doing it bottom up - by embedding semantic annotations (meta-data) right into the data. The opposite camp is exploring the top-down approach, which relies on analyzing existing information. The ultimate top-down solution would be a full-blown natural language processor that is able to understand text like people do.
In this post, we are going to look at ClearForest - one of the companies in the top-down camp. At first glance, you might not think much of the company's web site, but a deeper dive reveals that ClearForest is restructuring to apply its core natural language processing technology to next generation semantic applications. The fact that ClearForest has released both a Web Service and a Firefox extension that leverages an API to deliver the end-user application says that the company gets what the next generation web is all about.
The first ClearForest product that we looked at was the Firefox extension called Gnosis. Here is how it is described on the Mozilla extensions page:
"With a single click, Gnosis will identify the people, companies, organizations, geographies and products on the page you are viewing. Using the built-in navigation sidebar you can gain immediate understanding of the page’s contents."
Downloading and installing Gnosis was as easy as any Firefox add-on. We used the Read/WriteWeb home page to try the extension. With one click from the menu, the page was filled with various types of annotations. The current version of Gnosis recognized Companies, Countries, Industry Terms, Organizations, People, Products and Technologies - an impressive range of things. Each word that Gnosis recognized was colored according to its category.

This was interesting, but overwhelming. A better approach would be to have the coloring appear on a mouse over or another gesture. But this is a usability nuance that will get polished in the next iteration of the product. Overall, I was impressed. In an instant, the page was analyzed and annotated. It was not perfect (it thought that all the Jasons on the page were Jason Briggs), but it was more accurate than I expected it to be.
Next I turned my attention to the sidebar. The extension created a categorised tree of all words and phrases that it found on the page. We could expand and collapse each category to find the terms. It looked like a vertical search for a single page. It was interesting and would be very useful for blogs and lengthy pages.
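To make the sidebar idea concrete, here is a minimal sketch of how a list of recognized terms could be grouped into a category tree. The function name and the sample annotations are hypothetical; this is not how Gnosis is actually implemented, only an illustration of the data structure.

```python
from collections import defaultdict


def build_sidebar_tree(annotations):
    """Group (term, category) pairs into {category: sorted unique terms}."""
    tree = defaultdict(set)
    for term, category in annotations:
        tree[category].add(term)  # a set drops repeated mentions of a term
    # Sort categories and terms so the tree renders in a stable order.
    return {category: sorted(terms) for category, terms in sorted(tree.items())}


# Hypothetical annotations, as an extension might produce for one page.
annotations = [
    ("ClearForest", "Companies"),
    ("Firefox", "Products"),
    ("Alex Iskold", "People"),
    ("ClearForest", "Companies"),
    ("Gnosis", "Products"),
]

tree = build_sidebar_tree(annotations)
for category, terms in tree.items():
    print(category)
    for term in terms:
        print("  " + term)
```

Each category becomes a collapsible node and each term a leaf, which is roughly the shape the sidebar presents.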

Again, the interface needs to evolve - but the idea that key terms and concepts on any page can be identified and organized in such a way seems compelling. In addition to the organization, the extension offered to search for any keyword on Google, Wikipedia or Technorati. If you are interested in a keyword, you are likely to want to find more related information. So the context search seems like a logical extension of categorisation, as it makes this data further searchable.
Overall, this seemed unpolished but intriguing. The question is, how does this work? The Firefox page stated that this extension is based on a web service. So this is what I want to explore next...
Behind every great service there is an API. Modern web companies have re-discovered an old piece of software engineering wisdom - interfaces are a powerful way to build complex software. Today we are seeing the rise of the most complex software system yet - a service-powered web. ClearForest has also recognized the value of building a product on top of a service (both can be monetized independently). Gnosis leverages the interface to offer a powerful natural language processing service.
The Semantic Web Service (perhaps the name is a bit broad) offers a SOAP interface for analyzing text, documents and web pages. The service returns the categorization and annotation information, which can be further leveraged by consumer-facing applications (the company recommends building mashups). I am fairly certain that SWS is powered by a web crawler, because it is able to recognize people like Richard MacManus, Jason Biggs and Alex Iskold. My guess is that the crawler is used to build a giant index that is then used by the document parser to annotate the terms in the document.
The service right now is free to try, but you need to contact ClearForest to use it commercially. To encourage usage of the service, the company announced a mashup contest. The contest was advertised on ProgrammableWeb and ended December 11th. It is not clear to me that it was successful, as there are no announcements of winners and no showcase - but it certainly seems like a creative way to promote the new API.
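The guess above (a pre-built index consulted by a document parser) can be sketched in a few lines. Everything here is an assumption made for illustration: the `INDEX` contents, the `annotate` function, and the matching strategy are hypothetical and say nothing about how ClearForest's service really works.

```python
import re

# Hypothetical index of known terms; in the guess above, a crawler would build it.
INDEX = {
    "Richard MacManus": "People",
    "Alex Iskold": "People",
    "ClearForest": "Companies",
    "Firefox": "Products",
}


def annotate(text, index=INDEX):
    """Return (start, end, term, category) for each indexed term found in text."""
    if not index:
        return []
    # Put longer names first so a full name wins over a shorter overlapping entry.
    terms = sorted(index, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [
        (m.start(), m.end(), m.group(1), index[m.group(1)])
        for m in pattern.finditer(text)
    ]


sample = "Alex Iskold tried the ClearForest extension in Firefox."
for start, end, term, category in annotate(sample):
    print(f"{start:>3}-{end:<3} {category:<10} {term}")
```

A pure lookup like this cannot tell two people with the same first name apart, which may be one reason a real system would need more than an index.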
ClearForest might not have a glamorous/Ajaxy web site and might not have a polished product yet. But it is a company that has been around and has been backed by top tier VC firms. Both the approach and technology are worth attention and consideration. Their natural language processing technology, first applied to business data mining, is able to clearly distill useful information. To offer it as a service shows insight into and understanding of the new market opportunities (think Amazon). And to create a Firefox extension that showcases the technology demonstrates their desire and readiness to go mainstream.
All these factors indicate that ClearForest is worth watching. And it is yet another brick to support the top-down semantic web approaches. Let us know what you think about this company.
Comments
It's great that so many people are striving to reach a true semantic web, both from the bottom up and the top down.
I do have a question though: how will we know when we get there?
Will we need some sort of 'Semantic Test', much like the Turing Test, to verify a semantic solution? Or do we rely on the largest number of people telling us this is a great semantic solution? Or would the semantic solution that works in the US not work in England or New Zealand - i.e. each region, or even culture, may have its own semantic solution?
As everyone strives for what is acknowledged to be a holy grail, the question is not only how do we know when we get there, but how do we know when we're even making progress?
Perhaps just like the Turing Test, a Berners-Lee Test is needed to both define a goal and measure progress.
Posted by: John Milan | December 21, 2006 10:35 PM
>>There are companies and technologies that are doing it bottom up - by embedding semantical annotations (meta-data) right into the data. The opposite camp is exploring the top-down approach, which relies on analyzing existing information. The ultimate top-down solution would be a fully blown natural language processor, which is able to understand text like people do.
It's an endless loop. There is no way to explain the meaning to the machine except the case when the machine interprets the meaning between the two languages. It's called Translation Matrix or just the Matrix if all the languages are included in this system.
The implementation is simple. There is a company in Spain - Atril. It has a product - DejaVu 3. The main function is a pretranslation by assembling meanings and phrases. The translation project has an attribute - Subject. You choose the Subject and the system chooses the corresponding meaning during assembling. By default, there is a hierarchical classification system up to 999 subjects. This approach is completely wrong - the DejaVu system has the subsystem Lexicon that is responsible for specialized terminology (it's a glossary of a translation project) and has a priority during assembling - so the Lexicon is used for fine tuning of corresponding meanings for the project.
I did a simple thing. I divided the subjects into only two categories: General - 1 (it's a default setting for a project, it's used during assembling first) and Options - 2. It's a binary coding of meanings. In general terms, binary coding of information. My native language is Russian - so I gathered the Russian translations of English words and phrases together into a general terminology database (.tdb).
I've been working with this system for three years. It pretranslates any English literary texts. Exactly on the level of a three-year-old child. That's enough to say it's an artificial intelligence.
Posted by: Michael Molin | December 21, 2006 11:58 PM
Excellent comments. I personally do not think that the Semantic Web needs to equal strong AI. We do not have to explain meaning to the machine; instead we encode the meaning so that the machine can take advantage of it and drive productivity. That's what I think.
I spent 8 years of my life studying complexity science and I do not think that intelligence can be designed, it needs to be evolved.
Alex
Posted by: Alex Iskold | December 22, 2006 6:40 AM
It’s great to see so many people leverage the popularity of the term Semantic Web.
The Semantic Web though, *is* all about the plumbing of the Web. It's data about the data and nothing more. That’s not to say that cool applications shouldn’t be classed as enabling a more Semantic Web.
Posted by: Paul Walsh | December 22, 2006 3:33 PM
Paul,
Certainly, this is it. If you look on wikipedia, it talks about weak AI. Do you think that data about data has to be embedded in the data or can it be inferred?
Alex
Posted by: Alex Iskold | December 22, 2006 4:34 PM
Alex,
Thanks for noticing what we're trying to do!
A quick note - the contest you mentioned wrapped up last week. A page with the winning entries and honorable mentions is located here http://sws.clearforest.com/Blog/?p=37. Some very interesting work by some smart people.
Posted by: Thomas Tague | December 23, 2006 3:56 AM
Just wanted to offer a comment as one of the participants/winners in the ClearForest contest.
Generally, I just think it's exciting to see an API offered for this type of service. Most APIs offered have been for accessing pre-existing data in an easy way, or creating data in a system. This API is actually doing a smart nontrivial service for us. I can easily screen-scrape Google Images for results without outside services, but I can't easily deduce the semantic entities in a Google News article by myself.
I look forward to seeing more APIs released that offer interesting services like this.
Posted by: Pamela Fox | December 23, 2006 4:39 PM
this is great but a fraction of what is required for real information processing and sense making on the web which is incredibly difficult to do because of the primitiveness of the hypertext and one window browser paradigm that no one seems to want to break out of and is sorta hardcoded into browser technology. Ever tried iframe stacks, quick way to crash your browser, even a reasonable number of layers of DHTML windows has its limitations. Suck..
I'd much rather a snippet/bookmark linking tool + browser history and have that information editable (ala wiki but with depth and layers of windows that can be connected/chained together) so I can do some spatial and information reorganization. Ie Sense making..
Sucks, do a search on the web using google. Count how many clicks and moves on the mouse you are performing. A good search will take many hours and only then can you say with maybe 80-90% probability that you have covered said domain topic.
Put ajax to use and build a fluid search engine front end to google. Eg plug in search terms and the topics and pages start expanding out. let me tick or drag the pages i'm interested in (goddamn tab browsing is so 1 dimensional) to another area of the window (using larger canvas area than browser window). then let me start drilling down further and building and linking to build up my world view of the domain target of interest.
In my mind, whilst semantic and reasoning APIs are great, the interface is still butt ugly and being able to handle information load, or complex amounts of information, will rely on much more complex UI than the 1 window tab browsing paradigm we have suffered with for many years.
Blah, rambles, tired.
TiM
Posted by: Tim | December 29, 2006 2:24 AM
ok one more rant,
I'm quite sure some clever developer is going to use Microsoft WPF and build a replacement browser tool. Essentially the HTML rendering will be done with IE but the interface will be done in WPF. Imagine a slick 2/3d massively multi windowed app that is a cross between a mindmapping tool and an HTML rendering component :)
Certainly would make it easier to research and track topics as well as share/collaborate. Why keep your notes and other information separate from information you find on the web?
mmm global read/write web, christ just give me my own read/write web that I can 'rip,mix,share' with colleagues. let me pull in information and consume it, ie disseminate, collect, track, etc.
ranty rant, ooh k i shut up now
TiM
Posted by: Tim | December 29, 2006 2:45 AM
Alex wrote:
I personally do not think that Semantic Web needs to be equal strong AI. We do not have to explain meaning to machine, instead we encode the meaning so that machine can take advantage and drive productivity.
Well, as I see the AI as a development process is a "man-made" competition between the human intelligence and mathematical logics of the machine. As I see the Semantic Web concept - it's a new level for HTML tags as metadata for better search results and making decisions on them.
As I wrote describing the Translation Matrix, the question is only in the number of the levels. Two is enough and effective - the original information and its tags. What we have now on the Internet, the more tags the better the search system recognizes a Web site's goals. These particular goals are really not the objective meaning of the information presented on the Web site. It's advertising. And that's the top level of the hierarchy because that competition between the human psychology in a desire for success (paying for placing your Web site into the Sponsor Links category) and the machine (as a search robot) is already won by the human being by default. Politics is based on economics or science is based on business.
Political science is a decision-making process. So, the Internet is already semantic by man-made decisions.
Also, there is really a fundamental question in the comment of John Milan. About the tests for AI. I think that Turing's test based on the identities of the results made by the human being and the machine doesn't include the second part of the logics - to err is human, to forgive divine.
I can give an example from the work with the Translation Matrix. There is the main algorithm for assembling phrases - the longest source text phrase between the two overlapping source text phrases is chosen for translation. It's mathematical logics. Usually, the machine (taught gradually by human logics of translation) pretranslates everything right and consistently (words, phrases one by one) but sometimes the second phrase is longer than the first one so there is an inconsistency. The machine is right - math logics. My task is to combine these phrases into one source text phrase and assign a new general translation and options. That's a machine learning process to work with the meanings by human logics. Man-made intelligence - AI.
Posted by: Michael Molin | December 30, 2006 10:21 PM
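The longest-phrase rule described in the comment above can be illustrated with a small sketch. This is a simplified greedy longest-match over a phrase table, written under my own assumptions; the `assemble` function, the table, and the placeholder translations such as `T(new year)` are hypothetical and do not represent DejaVu's actual algorithm.

```python
def assemble(tokens, phrase_table):
    """Greedy longest-match segmentation of tokens against phrase_table keys."""
    max_len = max((len(p.split()) for p in phrase_table), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, shrinking down to a single token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in phrase_table or n == 1:
                # Unknown single words are kept with no translation (None).
                out.append((phrase, phrase_table.get(phrase)))
                i += n
                break
    return out


# Placeholder translations stand in for real target-language text.
table = {
    "new": "T(new)",
    "year": "T(year)",
    "new year": "T(new year)",
    "happy new year": "T(happy new year)",
}

print(assemble("happy new year to you".split(), table))
```

Because the scan is left to right, a shorter phrase chosen early can block a longer phrase that starts one word later, which is one way the inconsistency the comment mentions could arise.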
"Well, as I see the AI as a development process, it's is a "man-made" competition between the human intelligence and mathematical logics of the machine."
That's correct.
Happy New Year!
My best wishes.
Michael
Posted by: Michael Molin | December 30, 2006 10:34 PM
it's a "man-made" competition - you know :)
Cheers!
Posted by: Michael Molin | December 30, 2006 10:37 PM