clear forest - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/clear forest en Copyright 2009 Richard MacManus readwriteweb@gmail.com Sat, 21 Nov 2009 05:00:00 -0800 http://www.sixapart.com/movabletype/?v=4.23-en http://blogs.law.harvard.edu/tech/rss

Reuters Open Calais Update: Apps Progress, Interview

A month ago we wrote about Reuters launching an API called Open Calais, a technology that "does a semantic markup on unstructured HTML documents - recognizing people, places, companies, and events." I mentioned Calais in my Media08 presentation last week entitled Web Technology Trends for 2008 and Beyond. It generated interest in the media-focused audience I presented to, so in this post we follow up with Reuters and ask what progress is being made. Specifically we look at what apps have been built so far on Calais and get feedback from Reuters' Tom Tague.

Quick Recap of Open Calais

Open Calais is a Semantic Web technology - and in this case the next generation of the Clear Forest product, which Reuters acquired in April '07 (see our Dec '06 review). Alex Iskold's post last month is a 'must read' to understand what Open Calais is and why Reuters bought it. This diagram summarizes:

The API is free for both commercial and non-commercial use and Reuters told us last month that it is prepared to scale for massive concurrent demand. The API is great for third party developers, because it gives them access to Reuters data. And it benefits Reuters, because it enables Reuters to aggregate metadata for its own uses.

Alex listed some possible uses: intelligent search engines that look for related content, automatically inserting links into raw text, structured alerts, on-the-fly text analysis within your browser.

Example Apps?

So it sounds great in theory, but are there any examples of Open Calais apps so far? Reuters has a "bounty" program set up, whereby developers are invited to create Open Calais applications and Reuters will pay for that. However, it seems there has been little - if any - takeup of the bounties.

Top of the list of wanted apps was a WordPress plugin. Tom Tague, who is leading the Calais initiative at Reuters, noted in the forum that "unfortunately - and unexpectedly - we haven't seen any reasonable applications for the bounty process so we'll most likely be contracting for the development of the WordPress plugin." Perhaps the amount of the bounty in this case was an issue - Reuters only offered $5000 for the WordPress plugin, which doesn't seem like much of an incentive.

So Reuters has been forced to take the initiative and release some apps of its own. One is a new web-based document submission tool and viewer. There is some sign of action in the Open Calais forum, on a page where developers can list what they're working on. A developer named Craig has built an example of Calais semantics using pure PHP and Abhay Kumar has a similar service. These are all 'data input' tools. For an 'output' example, check out Mark Choate's RSS implementation of Calais data (example below).

Interview with Reuters' Tom Tague

Clearly, it's early days. I asked Open Calais lead Tom Tague how the initiative is progressing. Tom replied that "we’re about where we expected to be in terms of applications for Calais." He told us that the service is "just a little over 45 days old and much of the effort we’re seeing is in building tools to explore the capabilities themselves."

At this time Open Calais has just over 1,500 developers signed up, with about 30% of those developers actually making calls to and experimenting with the service. "One of the more exciting things that’s going on," Tom Tague told us, "are several community-led efforts to build Calais libraries for Ruby, PHP, ASP.NET and others. These will provide a great accelerant for developers to gain access to the service."

How is Reuters using Calais In-house?

So, at this point there is nothing to see for non-developers - the apps that have come out so far are developer-focused and not something the rest of us can use. So my next question to Tom was: how is Reuters itself using the Calais technology?

Tom replied that Reuters has several things underway:

"We're in the process of adding rich metadata to over 20 years of historical news archives (many millions of articles) to improve searchability and organization. We’re doing a lot of work in automating and generally improving the efficiency of a massive real time content ingestion process. We’re working with one of the community platforms deployed for Reuters customers to improve the tagging and classification of user generated content. And, of course, we have significant efforts under way to generate “machine readable news” to drive low-latency algorithmic trading. All of these efforts are based on the same technology platform driving the Calais initiative."

Conclusion: Show Us The Apps!

I must admit that I was expecting to see some working apps by now. Perhaps it is a similar case to Marshall Kirkpatrick's experience of Twine (published earlier today), the Semantic knowledge management service that received much early hype. Marshall thinks that Twine is underdone at this time and that the 'consumer' experience is lacking. Calais is much newer of course and, as Tom Tague said, it has only been out in the open for 45 days. So it would be unfair to compare the two efforts. Nevertheless, it would be great to see some compelling consumer-facing apps for Open Calais; even better would be to see something from Reuters that shows the public the benefits of semantic technologies.

Alex Iskold listed a number of consumer apps that could be built using Calais, by Reuters or external parties. I think people need to see at least one of those pretty soon - in order to translate the interest that Open Calais is generating from media and other people, into something non-geeks can see working on the Web and producing noticeably better information results. To paraphrase the famous Jerry Maguire quote, 'Show me the apps!'.

http://www.readwriteweb.com/archives/reuters_open_calais_apps_interview.php Analysis Tue, 11 Mar 2008 14:37:08 -0800 Richard MacManus
Reuters Wants The World To Be Tagged

As Richard MacManus recently predicted, in 2008 we'll witness the rise of semantic web services. From the native support for Microformats in Firefox 3, to the New York Times' utilization of rich headers metadata, to this week's release of the Social Graph API by Google, semantics are starting to slip onto the web. The impact is being felt because large companies are really starting to focus on structured information.

In the same vein, last week Reuters - an international business and financial news giant - launched an API called Open Calais.

The API does a semantic markup on unstructured HTML documents - recognizing people, places, companies, and events. This technology is the next generation of the Clear Forest offering, which Reuters acquired last year. We have profiled Clear Forest on ReadWriteWeb and in this post we will look at what Reuters opened up and why.

Open Calais API Basics

The idea behind Calais is simple - identify the interesting bits in documents and turn them into metadata. In this implementation the focus is on People, Companies, Places, and Events, but surely the technology can be adapted to other entities. The heavy lifting is done by the combination of a natural language processing engine and a massive hard-coded, learning database that Clear Forest has built.

For any document submitted into Calais, entities are identified, extracted and annotated. For example, when the press release about the acquisition of Clear Forest is analyzed, the following metadata is identified:

  • Relations: Acquisition, CompanyInvestment, PersonProfessionalPast
  • Organization: Palo Alto Research Center
  • IndustryTerm: broader search development effort, text search, text analytics software, ...
  • Company: Time Warner Inc., Reuters, Pitango Venture Capital, Inxight, ClearForest Ltd, ...
  • Person: Gerry Campbell
  • Country: United States, Israel
  • City: Tel Aviv, SAN FRANCISCO, Waltham
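
One way a client might hold the metadata listed above is as a plain mapping from entity type to values. The sketch below is an illustration only: the dictionary layout and the `entities_of_type` helper are hypothetical, not the format the Calais API actually returns, and the entries truncated with "..." in the list are left out rather than guessed.

```python
# Hypothetical in-memory view of the metadata listed above.
# This layout is an assumption, not the real Calais response format.
extracted = {
    "Relations": ["Acquisition", "CompanyInvestment", "PersonProfessionalPast"],
    "Organization": ["Palo Alto Research Center"],
    "IndustryTerm": ["broader search development effort", "text search",
                     "text analytics software"],
    "Company": ["Time Warner Inc.", "Reuters", "Pitango Venture Capital",
                "Inxight", "ClearForest Ltd"],
    "Person": ["Gerry Campbell"],
    "Country": ["United States", "Israel"],
    "City": ["Tel Aviv", "SAN FRANCISCO", "Waltham"],
}


def entities_of_type(metadata, entity_type):
    """Return the values recorded for one entity type, or an empty list."""
    return list(metadata.get(entity_type, []))


print(entities_of_type(extracted, "Country"))
```

Keeping the types as keys makes the later uses discussed in this post (search, linking, alerts) simple lookups rather than text parsing.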

This is a rather impressive set of information. According to the documentation page, the response is delivered in under one second for larger documents, and much faster for smaller ones - in other words, real time or near to it.

What was not quite clear from the documentation is whether Calais can deal with raw HTML pages. It appears that the API requires an XML document, where the main text is marked differently from the header and footer. Ideally, an API like this should be able to accept URLs, because distilling structure from HTML would not be trivial for developers. Another thing that we noticed is that the resulting document is extensively marked up. What the developers get back is literally the output of the Calais engine. It would be good to be able to get a lighter version, which simply identifies entities and their positions in the text.
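
To make the "lighter version" concrete, here is a minimal sketch of what such a result could look like: a flat list of entities with their character offsets. Everything in it is an assumption for illustration. The `Mention` record and `find_mentions` function are hypothetical names, and plain substring search stands in for the real extraction engine, which this post does not describe in detail.

```python
from typing import Dict, List, NamedTuple


class Mention(NamedTuple):
    entity_type: str
    name: str
    start: int  # offset of the first character in the text
    end: int    # offset one past the last character


def find_mentions(text: str, known: Dict[str, List[str]]) -> List[Mention]:
    """Locate every occurrence of each known entity name in the text.

    Plain substring search is a stand-in for real entity extraction.
    """
    mentions = []
    for entity_type, names in known.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                mentions.append(Mention(entity_type, name, start, start + len(name)))
                start = text.find(name, start + len(name))
    # Report mentions in reading order.
    return sorted(mentions, key=lambda m: m.start)


sample = "Reuters acquired ClearForest Ltd. Gerry Campbell spoke for Reuters."
known = {"Company": ["Reuters", "ClearForest Ltd"], "Person": ["Gerry Campbell"]}
for mention in find_mentions(sample, known):
    print(mention)
```

A result in this shape would let a developer highlight or link entities without parsing the full marked-up output.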

Currently the API is free for both commercial and non-commercial use and Reuters says it is prepared to scale for massive concurrent demand. The question, then, is how can this be used?

What is Calais Good For?

There are quite a few interesting applications for this technology. First - better search. Knowing the kinds of entities in the text allows developers to build intelligent search engines that look for related content. For example, imagine a page on Reuters with this press release and in the sidebar links to learn more about Clear Forest, Reuters, Inxight, etc. Similarly, Calais could enable links to countries and cities mentioned in the document. And these searches need not be generic searches, but rather specific vertical ones.

Another application would be to build engines like Inform, which automatically inserts links into raw text. By automatically identifying entities in the document, Calais also identifies what should be linked. So a big piece of Inform's secret sauce is trivialized. The rest is basically a raw search through the archive, which can be done with a Google custom search engine, for example. It is possible that more tech savvy media companies could leverage Calais in exactly this way.
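
The linking step described above can be sketched in a few lines once the entity names are known. This is a hedged illustration, not how Inform or Calais works: `link_entities` is a hypothetical helper, the example.com URLs are placeholders, and the entity names are assumed to have been supplied by an extraction service.

```python
import re
from typing import Dict


def link_entities(text: str, targets: Dict[str, str]) -> str:
    """Wrap each known entity name in an HTML link, in a single pass."""
    if not targets:
        return text
    # Try longer names first so a long name wins over a shorter overlap.
    names = sorted(targets, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def replace(match):
        name = match.group(0)
        return '<a href="{}">{}</a>'.format(targets[name], name)

    return pattern.sub(replace, text)


targets = {
    "Reuters": "https://example.com/topics/reuters",  # placeholder URL
    "ClearForest Ltd": "https://example.com/topics/clearforest",  # placeholder URL
}
print(link_entities("Reuters acquired ClearForest Ltd.", targets))
```

Doing the substitution in one regular-expression pass avoids re-linking text inside links that were already inserted. The sketch assumes plain text input; real HTML would need escaping and tag-aware handling.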

Another application is structured alerts. Modern alert systems are keyword based and suffer from false positives. Using Calais it is possible to build precise alerts for people, companies, places and events like corporate acquisitions. With the flood of junk in our RSS readers this is rather welcome news.
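
As a rough sketch of the difference from keyword alerts, the rule below fires only when a document's extracted metadata contains both a given relation and a given company. The metadata layout, the `matches_alert` helper, and the two sample documents are all hypothetical and exist only to illustrate the idea.

```python
from typing import Dict, List


def matches_alert(metadata: Dict[str, List[str]], relation: str, company: str) -> bool:
    """True when a document's metadata has both the relation and the company."""
    return (relation in metadata.get("Relations", [])
            and company in metadata.get("Company", []))


# Hypothetical documents with already-extracted metadata.
documents = [
    {"title": "Press release",
     "metadata": {"Relations": ["Acquisition"],
                  "Company": ["Reuters", "ClearForest Ltd"]}},
    {"title": "Unrelated story",
     "metadata": {"Relations": [], "Company": ["Reuters"]}},
]

alerts = [doc["title"] for doc in documents
          if matches_alert(doc["metadata"], "Acquisition", "Reuters")]
print(alerts)
```

A keyword alert on "Reuters" would match both documents; the structured rule matches only the one that records an acquisition.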

Yet another application would be to incorporate on-the-fly text analysis into browsers. In a way, this is not much different from having Microformat annotations on the page, except that the annotations are delivered on the fly. For example, a browser could call Calais on document load and obtain a list of people, places, companies, etc. which are embedded in the document. With this information the browser would be able to create a more interesting, more contextual, and relevant experience.

What's In It For Reuters?

Reuters has opened up a generous API, but why? During our interview, Gerry Campbell, the President/Global Head of Search & Content Technologies at Reuters, explained that Reuters wants the world to be tagged. When the world's content is quickly and readily accessible to their customers, Reuters wins. Semantic technologies result in better, faster, more precise and relevant information, and Reuters, as a big player in the information space, wants to be one of the first companies delivering this kind of experience.

Beyond an outstanding customer experience, Calais leads to a unique, attractive set of assets. First - a growing semantic database of people, places, companies and events. With each new document submitted into Calais the database gets richer and more complete. This is a roadmap to a semantic business powerhouse, which is clearly a great position to be in for any business media company. And in a way, what grows beneath Calais will not be that unlike Freebase. Except of course, it is happening completely automatically.

The second big advantage of having an open API is training the system. Any AI-based solution like Clear Forest is in constant need of tuning and evolution. Having other companies use the system would allow the engineers to run into cases that they have not thought about and broaden the capabilities of the system. Campbell told us that Calais is already processing a significant subset of Reuters information in nearly real time. This is both impressive technically and smart from an engineering point of view - it is an "eat your own dog food" approach to building a great piece of software.

Conclusion

The Calais API is another big win for top-down semantic web technologies. Using a mix of natural language processing, AI techniques, and massive databases, Reuters' solution extracts important bits of information from raw HTML pages. People, Companies, Places, and Events are really at the heart of many business articles, so being able to instantly identify them in the text is a big deal. From better search to better cross-linking and more intelligent browsing, the Calais API is an invitation to tap into one of the most powerful and pragmatic semantic platforms that exists and works today.

What sort of things do you envision to be possible with Calais? What applications would you like to see built with this platform?

http://www.readwriteweb.com/archives/reuters_calais.php Products Wed, 06 Feb 2008 01:47:18 -0800 Alex Iskold