As Richard MacManus recently predicted, in 2008 we'll witness the rise of semantic web services. From the native support for Microformats in Firefox 3,
to the New York Times' utilization of rich headers metadata, to this week's release of the Social Graph API by Google,
semantics are starting to slip onto the web. The impact is being felt because large companies are really starting to focus on structured information.
In the same vein, last week Reuters - an international business and financial news giant - launched an API called Open Calais.
The API performs semantic markup of unstructured HTML documents - recognizing people, places, companies, and events. This technology is the next generation of the offering from Clear Forest, which Reuters acquired last year. We have profiled Clear Forest on ReadWriteWeb, and in this post we will look at what Reuters opened up and why.
The idea behind Calais is simple - identify the interesting bits of a document and turn them into metadata. In this implementation the focus is on People, Companies, Places, and Events, but surely the technology can be adapted to other entities. The heavy lifting is done by the combination of a natural language processing engine and a massive hard-coded, learning database that Clear Forest has built.

For any document submitted into Calais, entities are identified, extracted and annotated. For example, when the press release about the acquisition of Clear Forest is analyzed, the following meta data is identified:
This is a rather impressive set of information. According to the documentation page, the response is delivered in under one second for larger documents, and much faster for smaller ones - in other words, real time or near to it.
What was not quite clear from the documentation is whether Calais can deal with raw HTML pages. It appears that the API requires an XML document, where the main text is marked differently from the header and footer. Ideally, an API like this should be able to accept URLs, because distilling structure from HTML would not be trivial for developers. Another thing that we noticed is that the resulting document is extensively marked up. What developers get back is literally the output of the Calais engine. It would be good to be able to get a lighter version, which simply identifies entities and their positions in the text.
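To make the idea of a "lighter version" concrete, here is a minimal sketch of what an entities-plus-positions result could look like. It is not the Calais engine or its actual output format: the `EntityMention` type, the `locate_mentions` helper, and the hand-written dictionary of known entities are all hypothetical, and a plain string lookup stands in for real natural language processing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityMention:
    """Hypothetical lightweight record: what was found and where."""
    kind: str   # e.g. "Company", "Person", "Place"
    name: str
    start: int  # offset of the first character of the mention
    end: int    # offset one past the last character


def locate_mentions(text, known_entities):
    """Return every occurrence of each known entity name, ordered by position.

    known_entities maps a name to its kind. A real engine would discover
    entities itself; this sketch only looks up names it is given.
    """
    mentions = []
    for name, kind in known_entities.items():
        pos = text.find(name)
        while pos != -1:
            mentions.append(EntityMention(kind, name, pos, pos + len(name)))
            pos = text.find(name, pos + len(name))
    return sorted(mentions, key=lambda m: m.start)


sample_text = "Reuters acquired Clear Forest. Reuters launched Calais."
sample_entities = {"Reuters": "Company", "Clear Forest": "Company"}
mentions = locate_mentions(sample_text, sample_entities)
```

With offsets like these, a developer could highlight or link each mention without parsing a heavily marked-up document.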
Currently the API is free for both commercial and non-commercial use, and Reuters says it is prepared to scale to massive concurrent demand. The question, then, is how can this be used?
There are quite a few interesting applications for this technology. First - better search. Knowing the kinds of entities in the text allows developers to build intelligent search engines that look for related content. For example, imagine a page on Reuters with this press release and in the sidebar links to learn more about Clear Forest, Reuters, Inxight, etc. Similarly, Calais could enable links to countries and cities mentioned in the document. And these searches need not be generic searches, but rather specific vertical ones.
Another application would be to build engines like Inform, which automatically inserts links into raw text. By automatically identifying entities in the document, Calais also identifies what should be linked. So a big piece of Inform's secret sauce is trivialized. The rest is basically a raw search through the archive, which can be done with a Google custom search engine, for example. It is possible that more tech savvy media companies could leverage Calais in exactly this way.
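The auto-linking idea can be sketched in a few lines, assuming the entity positions are already known. Everything here is illustrative: `link_entities` is a hypothetical helper, the spans are supplied by hand rather than by Calais, and the `example.com` search URL is a placeholder for whatever archive search a publisher actually uses.

```python
from html import escape
from urllib.parse import quote


def link_entities(text, spans, base_url="https://example.com/search?q="):
    """Wrap each (start, end) span of text in a link to a search for it.

    spans are character offsets; overlapping spans are skipped so the
    output stays well-formed. base_url is a placeholder, not a real service.
    """
    parts = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # overlaps a span that was already linked
        name = text[start:end]
        parts.append(escape(text[cursor:start]))
        parts.append(f'<a href="{base_url}{quote(name)}">{escape(name)}</a>')
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)


linked = link_entities("Reuters acquired Clear Forest.", [(0, 7), (17, 29)])
```

The hard part, deciding which spans deserve a link, is exactly what an entity extraction service would supply; the rest is string assembly.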
Another application is structured alerts. Modern alert systems are keyword based and suffer from false positives. Using Calais it is possible to build precise alerts for people, companies, places and events like corporate acquisitions. With the flood of junk in our RSS readers this is rather welcome news.
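A small sketch shows why entity-based alerts can be more precise than keyword alerts. The documents and their entity annotations below are made up for illustration, and the `keyword_alert` and `entity_alert` helpers are hypothetical; in practice the annotations would come from an extraction service rather than being written by hand.

```python
def keyword_alert(text, keyword):
    """Classic alert: fires whenever the keyword appears anywhere in the text."""
    return keyword.lower() in text.lower()


def entity_alert(entities, name, kind):
    """Structured alert: fires only if an entity with this name and kind was tagged."""
    return any(e["name"] == name and e["kind"] == kind for e in entities)


# Hand-annotated sample documents (illustrative data, not real extraction output).
docs = [
    {
        "text": "Apple reported quarterly earnings.",
        "entities": [{"name": "Apple", "kind": "Company"}],
    },
    {
        "text": "An apple a day keeps the doctor away.",
        "entities": [],
    },
]

keyword_hits = [d["text"] for d in docs if keyword_alert(d["text"], "apple")]
entity_hits = [d["text"] for d in docs if entity_alert(d["entities"], "Apple", "Company")]
```

The keyword alert matches both documents, while the entity alert matches only the one that is actually about the company.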
Yet another application would be to incorporate on-the-fly text analysis into browsers. In a way, this is not much different from having Microformat annotations on the page, except that the annotations are delivered on the fly. For example, a browser could call Calais on document load and obtain a list of people, places, companies, etc. which are embedded in the document. With this information the browser would be able to create a more interesting, more contextual, and relevant experience.
Reuters has opened up a generous API, but why? During our interview, Gerry Campbell, the President/Global Head of Search & Content Technologies at Reuters, explained that Reuters wants the world to be tagged. When the world's content is quickly and readily accessible to their customers, Reuters wins. Semantic technologies result in better, faster, more precise and relevant information, and Reuters, as a big player in the information space, wants to be one of the first companies delivering this kind of experience.
Beyond an outstanding customer experience, Calais leads to a unique, attractive set of assets. First - a growing semantic database of people, places, companies and events. With each new document submitted into Calais the database gets richer and more complete. This is a roadmap to a semantic business powerhouse, which is clearly a great position to be in for any business media company. And in a way, what grows beneath Calais will not be that unlike Freebase. Except of course, it is happening completely automatically.
The second big advantage of having an open API is training the system. Any AI-based solution like Clear Forest is in constant need of tuning and evolution. Having other companies use the system would allow the engineers to run into cases that they have not thought about and broaden the capabilities of the system. Campbell told us that Calais is already processing a significant subset of Reuters information in nearly real time. This is both impressive technically and smart from an engineering point of view - it is an "eat your own dog food" approach to building a great piece of software.
The Calais API is another big win for top-down semantic web technologies. Using a mix of natural language processing, AI techniques, and a massive database, Reuters' solution extracts important bits of information from raw HTML pages. People, Companies, Places, and Events are really at the heart of many business articles, so being able to instantly identify them in the text is a big deal. From better search to better cross-linking and more intelligent browsing, the Calais API is an invitation to tap into one of the most powerful and pragmatic semantic platforms that exists and works today.
What sort of things do you envision to be possible with Calais? What applications would you like to see built with this platform?
Comments
I have to say I really like the semantic focus of this site. Keep it up!
Posted by: Paul Jensen | February 6, 2008 4:25 AM
Seems very powerful to me. I wonder if or when this will be available for languages other than English.
Posted by: Thomas Fruetel | February 6, 2008 6:37 AM
structured information means more precise processing, which means better analysis results, which means 'mo money'!
Posted by: Esdee | February 6, 2008 7:31 AM
this is certainly a sweet candy and being embraced by a giant as reuters hints that others will soon follow, and when this happens - expect a new generation of search engines...
This looks very interesting. Reuters seems to be taking a very start-up approach to this, not burdened by legacy. They are a company who made tons of money on user generated content before most people had heard that term - pre Internet. At one level this looks like a much more focussed niche version of Freebase. I looked at Freebase for company info but this looks like a much better way. "Vertical Semantic Search"?
Posted by: bernard lunn | February 6, 2008 7:56 AM
The article about Reuters and Gnosis is interesting. I am just missing the real new context. What happened to the idea of an ontology where you have named entities and relations? To me it seems that Gnosis is only getting Named Entities but doesn't connect them to a real new content of meaning. Example: person is linked to a company, or a person lives in location. In my opinion only then the user gets real new value, meaning new knowledge. This should be the real semantic web or tagging of web pages. Ontos is working on the same topic but also has relations and supports German and Russian.
Posted by: Daniel | February 6, 2008 8:47 AM
Krista from Reuters here. Just a quick clarification -- the Calais Web service API is available free to both commercial and non-commercial developers.
Posted by: KDT
But what if we want to go much beyond people, companies, places and events? Isn't this just a fraction of the semantic space? I think we need a freewheeling mechanism that lets people experiment first and lets popular usage evolve to standards. Take a look at poshzones - would love to have your feedback.
Posted by: Sriram | February 6, 2008 9:11 AM
Great initiative by Reuters. I'd like to know, given their financial background and semantic tagging of that info, if they would be porting this into the world of XBRL? Or are they hoping the community will adapt that?
Posted by: Derek | February 6, 2008 9:33 AM
Gerry & team have done a great job with this one. This move really shows that large companies can be an innovative part of the web in meaningful ways. I like how they expose functionality that is expensive to build or buy to the masses of developers out there. Second, they adopted a Netflix-like incentive program where the best usage of the technology gets a prize (check out their bounties tab on opencalais.com/).
Hopefully, more large companies will expose binding technology like this for others to use and congrats to Gerry & Barak (ClearForest) in putting this out there.
Posted by: Abdur | February 6, 2008 9:53 AM
Hi all. Tom Tague here. I’m leading the Calais initiative at Reuters and thought I’d chime in with a few comments.
First, thanks to Alex for writing such a thoughtful piece. Having someone spend the time to think and talk about what the impact might be – not just what the service does – is really appreciated.
Second, and perhaps most important: it’s early days yet. What you’re seeing of Calais is just the very first tentative steps. Our roadmap (available at www.opencalais.com) lays out our plans for the year – but plans are meant to change. We’re listening to user feedback, suggestions and criticisms and will be revising the roadmap real-time for the next year at least.
So, a few specific points / responses to comments:
As Krista mentioned above, Calais is free for both commercial and non-commercial applications.
Yes, the domain that Calais excels in is business news. But that’s just today. We are working like wild in the background to expand and enhance the domain sets to include sports, entertainment and others.
There might be a misperception that Calais is just about named entity extraction – it’s not. Calais also returns dozens of facts and events such as linkages of people to organizations, people to positions, joint ventures, etc. Again – a universe that will be expanding dramatically over the short term.
You want to go beyond people, companies, places? Well, we have – it’s quite a lot richer than that. But – we want it richer still. First – we’ll be expanding the universe every month. Second, if you take a look at the roadmap you’ll see that in the second half of the year we’ll be implementing community-developed processing extensions. Think of them as something like Firefox plugins – you build it, check it in and share it with the whole Calais community. This is a tough technical challenge though and it’s going to take us a little while to get there.
XBRL isn’t on the immediate roadmap – but that’s up for discussion. We decided to start with the most fundamental knowledge representation format (RDF) and work our way up from there.
That’s it for now. Please feel free to drop by www.opencalais.com to learn more and thanks for the interest.
Posted by: Thomas Tague | February 6, 2008 10:59 AM
Finally meta data annotation coming out from the web pages
Posted by: Reza Rawassizadeh | February 8, 2008 1:28 AM
"So a big piece of Inform's secret sauce is trivialized."
Yes indeed - I think Reuters may have just put Inform out of business. So much for $15M+ in venture capital.
Posted by: Thom Mullen | February 11, 2008 12:25 PM
Check out this company and compare it to the new implementation of Reuters:
http://www.inform.com
Posted by: sql | February 12, 2008 11:46 AM
AFAIK investment banks ask for tagged news feeds for algorithmic trading. Maybe Open Calais is a part of this effort?
Posted by: Roman | February 13, 2008 4:30 AM
Actually, my PhD study (finished in March 2006) was completely on the extraction of person and organisation names from unstructured text data such as those of Reuters' news articles. One of my observations was that these articles contain a lot of information about people professionals (e.g. prime minister, chief executive, and so forth). Thus, I thought that one can use such info to develop a 'who is who' application. Actually, I have developed a simple prototype for such an application and wish to get it more sophisticated. So, how about having some collaboration to bring it to life together?
Posted by: Hayssam Traboulsi | February 25, 2008 3:27 AM