calais - ReadWriteWeb http://www.readwriteweb.com/feeds/search/calais en Copyright 2012 Richard MacManus readwriteweb@gmail.com Tue, 14 Feb 2012 16:29:00 -0800 http://www.sixapart.com/movabletype/?v=4.35-en http://blogs.law.harvard.edu/tech/rss

Calais Gets a Wordpress Plugin

Open Calais, a semantic markup API from Reuters that we've written about on ReadWriteWeb before, has finally gotten the Wordpress plugin it has been looking for since January, when it started a bounty program seeking one. The new plugins come from developer Dan Grossman and represent one of the first public-facing applications of the API (as opposed to private uses like that of the Powerhouse Museum).

As we reported in March, even with a $5000 bounty, Calais didn't receive much of a response. "Unfortunately - and unexpectedly - we haven't seen any reasonable applications for the bounty process so we'll most likely be contracting for the development of the WordPress plugin," wrote Reuters' Tom Tague at the time. We speculated then that the relatively small size of the bounty may have been the issue. However, Grossman's plugins took just "a few hours" to complete, and though they don't technically meet all of the bounty requirements (they don't do tag clouds or have GUIDs on the RSS items), Grossman estimates that "it'd take only a few hours more to have met all the bounty conditions."

Grossman's plugins, which are available as an auto tagger and an archive tagger (to go back and tag old posts), received over 500 downloads in the first two days. The plugins work by sending post text to Calais and retrieving a list of suggested tags. The plugins rely on an Open Calais PHP class, also written by Grossman. Eventually, the plugins will be released under a Creative Commons license. Grossman tells us he's waiting until the next Calais feature update, scheduled for May 1st, before adding any more features to his plugins.

As we've noted, because of Calais' roots as Clearforest, the rules it applies while parsing text are biased toward the language of business. That means that business or tech bloggers will likely find more utility in Calais for the time being. If you're writing about Fortune 500 companies, the Calais Wordpress Auto Tagger plugin might be very useful; if you routinely write about sewing teddy bears, though, its usefulness might be dubious.

Unfortunately for Grossman, the application deadline for the $5000 bounty passed in March and Reuters has since farmed out the work of creating a Wordpress plugin to a commercial firm. Though work on that plugin continues, we're told that people at Calais have expressed interest in working with Grossman on future Calais-related projects. Open Calais is one of the most interesting new semantic APIs, and we're keen to see developers finally start to embrace it and make some useful mashups.

http://www.readwriteweb.com/archives/calais_gets_a_wordpress_plugin.php http://www.readwriteweb.com/archives/calais_gets_a_wordpress_plugin.php Semantic Web Tue, 15 Apr 2008 11:25:48 -0800 Josh Catone
Reuters Launches Calais 2.0 - Now With Pop-Culture

Thomson Reuters' Calais, a semantic markup API that we first reviewed in February, has reached its 2.0 release. The latest version aims to fix one of the main issues with Calais -- that it was too focused on business. Because Calais has roots as Clearforest, the rules it applies while parsing text are biased toward the language of business, which meant that its utility was limited. Version 2.0 has added new semantic entity types in an effort to rectify that.

Calais 2.0 has a dozen new semantic entity types, which Reuters says will increase its utility for "pop-culture publishers and bloggers covering media, music, entertainment and sports, as well as those covering pharmaceuticals, medicine and healthcare." In addition to expanded semantic identification capabilities, Calais 2.0 can now print results in the Simple Tags format and Microformats, as well as the original RDF.

More than 3,200 developers have signed up to work with Calais since launch, according to product lead Thomas Tague, who said in a press release that Calais and plugins and services built on the API will "make it easy to kick-start metatagging and enter the era of the Semantic Web."

Along with an updated web site, a handful of new code samples and libraries, Thomson Reuters is announcing three new plugins that utilize Calais.

  • Calais Marmoset is a tool that enables developers to automatically create metadata for use with Yahoo!'s open search platform, Search Monkey (our coverage).
  • Calais is also announcing the official release of Tagaroo, a Wordpress plugin that allows bloggers to automatically tag relevant people, places and things in their posts, as well as pull in semantically relevant Flickr photos. We wrote recently about an unofficial Wordpress plugin for Calais, and noted that its utility would be limited mainly to business and tech bloggers because those were the API's strengths. Calais 2.0 should theoretically make both plugins more useful to a wider variety of bloggers.
  • Though they've been out since last month, Thomson Reuters is also officially introducing their Calais plugins for Drupal, a popular content management system, that it developed with Phase2Technology.

Calais is an awesome top-down semantic API that can help fuel the bottom-up approach by combing unstructured data and spitting out structured tags. We're excited for the second version of Reuters' product and the added utility that new semantic entity types should bring.

http://www.readwriteweb.com/archives/calais_20_launches.php http://www.readwriteweb.com/archives/calais_20_launches.php Product Reviews Sun, 18 May 2008 21:01:01 -0800 Josh Catone
Calais 4.0 Released: Linked Data Meets the Commercial Web

Thomson Reuters is today launching the latest version of its Calais web service and open API, Calais 4.0. Calais is a toolkit of products that enables publishers to incorporate semantic functionality within their properties - enabling them to categorize content as people, places, companies, facts, events, and more. Calais 4.0 is perhaps the most significant version since the launch of Calais one year ago, because it enables publishers to connect to the Linked Data web standard that Sir Tim Berners-Lee and others in the Semantic Web community have been promoting over the past few years.

Up till now, we have yet to see much commercial activity in Linked Data - developments have been largely confined to the academic and scientific communities. So we think Calais 4.0 represents an important move forward in the commercial Semantic Web - and we expect to see some big media companies using it before long.

Specifically, Calais 4.0 goes beyond metatagging and enables publishers to integrate their content with Linked Data assets from Wikipedia, GeoNames, the Internet Movie Database (IMDB), Shopping.com and others. Calais 4.0 also lets publishers share semantic metadata about their content with "content consumers" such as search engines, news aggregators, related stories recommendation services and more.

ReadWriteWeb named Calais as one of our top 10 Semantic Web Apps of 2008, due to the progress it made last year. Since launching the Open Calais API early in 2008, over 9,000 developers have registered with it and Calais has processed 200+ million articles.

What's New in 4.0

We spoke with Thomas Tague, Calais lead at Thomson Reuters, about what specifically is new with Calais 4.0 and what use cases we might see over the coming year for it.

Tague explained to ReadWriteWeb that there are 3 pillars to the Calais initiative:

1. Getting semantic data out of text, which is what the first 3 versions of Calais focused on.
2. Connecting that semantic data to the linked data world.
3. Providing some way for people to share metadata, for example syndicating it - which Tague termed the "transport" pillar.

Calais 4.0, explained Tague, fills in the final 2 of those pillars. It supports approximately 25 entity types in Linked Data - URIs are dereferenceable to Calais RDF pages. Thomson Reuters is also publishing their ontology in RDFS. Calais will contribute data too, which Thomson Reuters claims is "the first contribution to the Linked Data cloud made by a major publisher." The data that Thomson Reuters is giving to the Linked Data world includes company descriptions, stock tickers, management teams and more. This data will be available to external developers to programmatically use in their apps.

Thomas Tague told ReadWriteWeb that Thomson Reuters has some big data assets and that over time "we're going to populate linked data endpoints with Thomson Reuters data". We asked Tague whether he thinks Calais 4.0 is the biggest commercial use of the Linked Data standard yet. He thinks it is; in his opinion, Linked Data has mostly been used so far for open data projects and relatively small sets of data. Tague said that "we fundamentally believe that companies need to jump into this [Linked Data]".


The Linking Open Data dataset cloud; by Richard Cyganiak

In terms of pillar 3, the metadata transportation, Tague explained to us that a document gets a unique identifier - and to syndicate content, publishers just need to make available that unique identifier to external parties.

Conclusion

It will be interesting to see what companies make use of Calais over 2009. Last year we noted that IBM was using Calais - and we presume that with the extra Linked Data and transport functionality, other big companies will want to make use of Calais data too. Thomas Tague told us that they hope to announce 2 big product partners soon. He also said that they're seeing major traction around Drupal. Healthcare IT News from MedTech Publishing, a site developed in Drupal, features the full Calais suite for publishers including "More Like This", their related content plugin.

As we noted at the beginning of this post, we've been impressed with the progress Calais has made since its launch at the start of 2008. With 4.0, we expect to see it gain more traction among commercial publishers in 2009. Indeed as a (we like to think) ahead-of-the-curve 'new media' company ourselves, we're about to embark on our own project using Calais! Stay tuned for more information on that.

http://www.readwriteweb.com/archives/calais_4_linked_data.php http://www.readwriteweb.com/archives/calais_4_linked_data.php Product Reviews Thu, 15 Jan 2009 05:00:00 -0800 Richard MacManus
Australian Museum Uses Open Calais to Tag Collection

The Powerhouse Museum of Science and Design in Sydney, Australia has begun to utilize the Reuters Open Calais API (our coverage) to tag their collection. The museum's online collection database houses some 66,303 objects, so tagging them all by hand would be quite a task. By using the Open Calais web service, the museum is able to automate much of the process.

That the museum has so much of its collection online is actually quite impressive in its own right. About 70% of the museum's electronically documented collection is online in the database, which went live in June 2006. Museum objects are searchable, taggable (by humans) and painstakingly described.

However, there are so many objects that even though users can help to tag them, many of them haven't yet been tagged. Sebastian Chan, who is the Manager of Web Services at the museum, told us that Open Calais is being used to complement the people-powered tagging they've had running for two years. "What Open Calais lets us do now is connect people, places and companies across our collection and has already revealed many new pathways through our dataset (navigating by designer or inventor is now much easier for example)," he said.

The automatically generated tags at right were created by the API for some swim wear designed by Speedo for the 1991 Australian swimming team that competed at the World Swimming Championships in Perth. Open Calais was correctly able to identify some important locations in the document -- Perth where the competition took place, and Sydney where Speedo is based -- as well as an important corporation (Speedo). It also picked up the name of the designer, and the name of the person who owned the suits before the museum.

However, as you can see, the API made some mistakes too -- it classified "World Championships" as a company, and mistook the general text "international swimming organisation" as an actual organized body. It missed the actual organization (FINA) and probably should have picked up the MacRae Knitting Mills company, which was a predecessor to Speedo. Further, because Open Calais is built around people, places, and companies, general information about items may be lost on it. Tags that would be obvious to humans, such as swimming, swim wear, Olympics, or the year 1991, are beyond the scope of Open Calais.

"These errors and other like them reveal Open Calais' history as Clearforest in the business world," said Chan. "The rules it applies when parsing text as well as the entities that it is 'aware' of are rooted in the language of enterprise, finance and commerce." On the other hand, according to Chan, the technology has already revealed "many new connections between objects," even though it has so far been deployed only very sparingly across the collection.

Powerhouse's use of Open Calais may be the first large scale deployment of the technology across a large public data set. It will be interesting to see the results as they evolve. "It is important to remember that there is no way that this structured data could be generated manually - the volume of legacy data is too great and the burden on curatorial and cataloguing staff would be too great," reminded Chan.

http://www.readwriteweb.com/archives/australian_museum_uses_open_calais.php http://www.readwriteweb.com/archives/australian_museum_uses_open_calais.php Trends Tue, 01 Apr 2008 16:45:34 -0800 Josh Catone
Reuters Open Calais Update: Apps Progress, Interview

A month ago we wrote about Reuters launching an API called Open Calais, a technology that "does a semantic markup on unstructured HTML documents - recognizing people, places, companies, and events." I mentioned Calais in my Media08 presentation last week entitled Web Technology Trends for 2008 and Beyond. It generated interest in the media-focused audience I presented to, so in this post we follow up with Reuters and ask what progress is being made. Specifically we look at what apps have been built so far on Calais and get feedback from Reuters' Tom Tague.

Quick Recap of Open Calais

Open Calais is a Semantic Web technology - and in this case the next generation of the Clear Forest product, which Reuters acquired in April '07 (see our Dec '06 review). Alex Iskold's post last month is 'must read' to understand what Open Calais is and why Reuters bought it. This diagram summarizes:

The API is free for both commercial and non-commercial use and Reuters told us last month that it is prepared to scale for a massive concurrent demand. The API is great for third party developers, because it gives them access to Reuters data. And it benefits Reuters, because it enables Reuters to aggregate metadata for its own uses.

Alex listed some possible uses: intelligent search engines that look for related content, automatically inserting links into raw text, structured alerts, on-the-fly text analysis within your browser.

Example Apps?

So it sounds great in theory, but are there any examples of Open Calais apps so far? Reuters has a "bounty" program set up, whereby developers are invited to create Open Calais applications and Reuters will pay for that. However, it seems there has been little - if any - takeup of the bounties.

Top of the list of wanted apps was a Wordpress plugin. Tom Tague, who is leading the Calais initiative at Reuters, noted in the forum that "unfortunately - and unexpectedly - we haven't seen any reasonable applications for the bounty process so we'll most likely be contracting for the development of the WordPress plugin." Perhaps the amount of the bounty in this case was an issue - Reuters only offered $5000 for the Wordpress plugin, which doesn't seem like much of an incentive.

So Reuters has been forced to take the initiative and release some apps of their own. One is a new web based document submission tool and viewer. There is some sign of action in the Open Calais forum, on a page where developers can list what they're working on. A developer named Craig has built an example of Calais semantics using pure PHP and Abhay Kumar has a similar service. These are all 'data input' tools. For an 'output' example, check out Mark Choate's RSS implementation of Calais data (example below).

Interview with Reuters' Tom Tague

Clearly, it's early days. I asked Open Calais lead Tom Tague how the initiative is progressing. Tom replied that "we’re about where we expected to be in terms of applications for Calais." He told us that the service is "just a little over 45 days old and much of the effort we’re seeing is in building tools to explore the capabilities themselves."

At this time Open Calais has just over 1,500 developers signed up; with about 30% of those developers actually making calls to and experimenting with the service. "One of the more exciting things that’s going on," Tom Tague told us, "are several community-led efforts to build Calais libraries for Ruby, PHP, ASP.NET and others. These will provide a great accelerant for developers to gain access to the service."

How is Reuters using Calais In-house?

So, at this point there is nothing to see for non-developers - the apps that have come out so far are developer-focused and not something the rest of us can use. So my next question to Tom was: how is Reuters itself using the Calais technology?

Tom replied that Reuters has several things underway:

"We're in the process of adding rich metadata to over 20 years of historical news archives (many millions of articles) to improve searchability and organization. We’re doing a lot of work in automating and generally improving the efficiency of a massive real time content ingestion process. We’re working with one of the community platforms deployed for Reuters customers to improve the tagging and classification of user generated content. And, of course, we have significant efforts under way to generate “machine readable news” to drive low-latency algorithmic trading. All of these efforts are based on the same technology platform driving the Calais initiative."

Conclusion: Show Us The Apps!

I must admit that I was expecting to see some working apps by now. Perhaps it is a similar case to Marshall Kirkpatrick's experience of Twine (published earlier today), the Semantic knowledge management service that received much early hype. Marshall thinks that Twine is underdone at this time and that the 'consumer' experience is lacking. Calais is much newer of course and, as Tom Tague said, it has only been out in the open for 45 days. So it would be unfair to compare the two efforts. Nevertheless, it would be great to see some compelling consumer-facing apps for Open Calais; even better would be to see something from Reuters that shows the public the benefits of semantic technologies.

Alex Iskold listed a number of consumer apps that could be built using Calais, by Reuters or external parties. I think people need to see at least one of those pretty soon - in order to translate the interest that Open Calais is generating from media and other people, into something non-geeks can see working on the Web and producing noticeably better information results. To paraphrase the famous Jerry Maguire quote, 'Show me the apps!'.

http://www.readwriteweb.com/archives/reuters_open_calais_apps_interview.php http://www.readwriteweb.com/archives/reuters_open_calais_apps_interview.php Analysis Tue, 11 Mar 2008 14:37:08 -0800 Richard MacManus
Feedly Adds Bleeding Edge Tech to Feed Reading Tool

Feedly, a magazine style feed reader that syncs with Google Reader, just released a very interesting and useful integration with Mozilla's Ubiquity. Ubiquity gives Firefox a command-line interface that makes tasks like bookmarking a page on delicious, sending a quick message to Twitter, or searching Google and Flickr as easy as typing in a few letters without ever having to use the mouse. Among many other things, feedly's Ubiquity integration now lets you share any Web page on Google Reader and send a tweet with a link through Ubiquity.

To try this integration, you will have to live on the cutting edge, though. You will first have to install the latest beta version of Ubiquity (2.0pre7) and then the latest version of feedly (1.2.32).

Besides being able to quickly send a link to Twitter, one feature we really like is feedly's integration with Open Calais, Thomson Reuters' semantic web service. Feedly's Calais command overlays semantic metadata on the current page and then links to a page on feedly with related stories from your RSS subscriptions, Delicious, YouTube, and Twitter.

Commands

Feedly's developer Edwin Khodabakchian notes that he will add more commands soon. Here are all the feedly commands that are currently available in Ubiquity:

  • feedly-calais: Overlays semantic metadata from the Reuters Open Calais service on the current page
  • feedly-email: Allows you to email an article to a friend.
  • feedly-explore: Jump to the feedly explore page associated with the specified topic
  • feedly-mark-as-read: Marks the current page as read in both feedly and Google Reader
  • feedly-save-for-later: Save this page for later. Will also star it in Google Reader
  • feedly-share: Shares the current page in both feedly and Google Reader
  • feedly-tweet: Easily tweet a web page or an RSS article
  • feedly-view: View the current page as a feedly article
    http://www.readwriteweb.com/archives/feedly_integrates_with_ubiquity.php http://www.readwriteweb.com/archives/feedly_integrates_with_ubiquity.php News Thu, 15 Jan 2009 09:44:24 -0800 Frederic Lardinois
    Extractiv Launches "Semantics as a Service" Platform

    Extractiv has quietly launched a service that crawls the Web for text on a specific topic, then transforms it into "structured semantic data." It's a direct competitor to Thomson Reuters' Calais product, which has been doing this for a couple of years now. This type of service is potentially valuable to media companies, search services and monitoring applications - because it turns messy, unorganized HTML content into data that is organized into categories and given other semantic 'meaning.'

    I sat down with Extractiv CEO Shion Deysarkar at the recent Semantic Technology conference in San Francisco, to find out how Extractiv intends to compete with the more well-known and big media backed Calais.

    How Extractiv Works

    Extractiv is a joint venture between Houston-based web crawling service 80legs and natural language processing company LCC (which created Swingly, a Q&A service).

    Deysarkar explained that Extractiv uses technology from both of its parent companies, to crawl the Web for content on a particular topic and then - using natural language processing - transform it into structured data. This video, produced by Extractiv, explains how the service might be used to crawl the Web for stories about smart phones over the past month.

    The output of the crawl and analysis can be JSON or XML, two formats commonly used for structured data. Support for RDFa, a popular Semantic Web standard, will be available "soon" according to the company. Extractiv also offers an API, allowing customers to bypass the web site.

    Extractiv is free to try, but if you'll be a moderate or heavy user of the service then you'll have to pay (the pricing is as yet unavailable on the web site).

    Extractiv vs Calais

    Deysarkar told ReadWriteWeb that Extractiv is targeting "mid-market Calais customers" - such as media companies or those developing search applications, monitoring services, recommendation engines or aggregators. He also claimed that Extractiv goes beyond what Calais offers, because it can mine sentiment data (which is data about how people feel about products and services).

    Extractiv also wants to "provide access to more types of semantic information than any other provider." As CEO of partner company LCC, Andrew Hickl, put it, "if you're interested in baseball pitchers, a generic type like PERSON just won't cut it."

    At launch, Extractiv offers about 250 different types of named entities, but it aims to have more than 3000 different entity types by the end of the U.S. summer.

    Preparing For the Future of the Web

    The product is not aimed at the consumer market, so it's not for the faint hearted and you need to know what to do with all of that XML or JSON data! It also remains to be seen how competitive it is with Calais, which is a proven performer and has many reputable companies as its customers. Some startups have taken on Calais before, but fallen short.

    However, there is undoubtedly a need for products like Extractiv and Calais that turn the Web's unstructured data into meaningful, organized content. This is the future of the Web, because there is going to be a large increase in the quantity of data online over the next 5-10 years - and all of that data will need to be structured if we're going to make the best use of it.

    http://www.readwriteweb.com/archives/extractiv_launches_semantics_as_a_service_platform.php http://www.readwriteweb.com/archives/extractiv_launches_semantics_as_a_service_platform.php Structured Data Mon, 12 Jul 2010 01:58:13 -0800 Richard MacManus
    SemanticProxy: Jump-Starting the Semantic Web

    While it has great potential, the Semantic Web has failed to live up to its promises so far. Part of the problem, as Thomson Reuters sees it, is that developers will not add a lot of semantic features to their products until publishers start publishing more semantic data. Reuters' OpenCalais represents one way around this problem. But starting today, Reuters' newest project SemanticProxy will give developers an easier way to extract semantic data from any web site.

    Even though SemanticProxy is geared towards developers, Reuters has created a demo site that you can try out on the web by just copying and pasting the URL of any web page into a simple form. We tested it with articles on CNN, Wikipedia, and a number of blogs, and it always returned a highly relevant set of results (as long as the page was not excessively long). The service is optimized for performance on 30 of the world's largest news sites, but it also works just as well for other sites.

    For a news story, for example, SemanticProxy will identify politicians, cities, countries, etc. that are mentioned in the article. Once parsed, the service returns the semantic metadata of the page in three possible formats: RDF, MicroFormats, or standard HTML.

    As the name implies, SemanticProxy acts as a proxy and aggressively caches all its data, which should make it easy for a developer to scale a project that relies on this service.
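To illustrate why aggressive caching helps with scale, here is a minimal sketch of a caching proxy. It is our own illustration, not SemanticProxy's implementation: the `CachingProxy` class and the `fake_extract` function are hypothetical names, and the stand-in extractor simply returns canned metadata where a real one would call a remote service.

```python
class CachingProxy:
    """Wraps an extraction function and remembers its result per URL."""

    def __init__(self, extract):
        # `extract` is a hypothetical callable: url -> semantic metadata
        self._extract = extract
        self._cache = {}

    def get(self, url):
        # Only the first request for a URL triggers the expensive extraction.
        if url not in self._cache:
            self._cache[url] = self._extract(url)
        return self._cache[url]


calls = []


def fake_extract(url):
    """Stand-in for a remote semantic extraction call (assumption)."""
    calls.append(url)
    return {"url": url, "entities": ["Reuters"]}


proxy = CachingProxy(fake_extract)
first = proxy.get("http://example.com/story")
second = proxy.get("http://example.com/story")
```

The second request is answered from the cache, so the extractor runs only once however many clients ask for the same page. A production cache would also need expiry, since pages change.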

    Catalyst

    SemanticProxy is part of Reuters' attempt to jump-start the semantic web. As Tom Tague, the leader of the Calais initiative at Reuters, points out, SemanticProxy can hopefully act as a catalyst and get more developers to look at semantic data, which, in return, will give more developers a reason to publish this data themselves.

    Disclosure: Calais is a RWW sponsor

    http://www.readwriteweb.com/archives/reuters_semanticproxy_jump-start.php http://www.readwriteweb.com/archives/reuters_semanticproxy_jump-start.php Product Reviews Tue, 23 Sep 2008 08:19:34 -0800 Frederic Lardinois
    Reuters Wants The World To Be Tagged

    As Richard MacManus recently predicted, in 2008 we'll witness the rise of semantic web services. From the native support for Microformats in Firefox 3, to the New York Times' utilization of rich headers metadata, to this week's release of the Social Graph API by Google, semantics are starting to slip onto the web. The impact is being felt because large companies are really starting to focus on structured information.

    In the same vein, last week Reuters - an international business and financial news giant - launched an API called Open Calais.

    The API does a semantic markup on unstructured HTML documents - recognizing people, places, companies, and events. This technology is the next generation of the Clear Forest offering, which Reuters acquired last year. We have profiled Clear Forest on ReadWriteWeb and in this post we will look at what Reuters opened up and why.

    Open Calais API Basics

    The idea behind Calais is simple - identify the interesting bits in documents and turn them into metadata. In this implementation the focus is on People, Companies, Places, and Events, but surely the technology can be adapted to other entities. The heavy lifting is done by the combination of a natural language processing engine and a massive hard coded, learning database that Clear Forest has built.

    For any document submitted into Calais, entities are identified, extracted and annotated. For example, when the press release about the acquisition of Clear Forest is analyzed, the following meta data is identified:

    • Relations: Acquisition, CompanyInvestment, PersonProfessionalPast
    • Organization: Palo Alto Research Center
    • IndustryTerm: broader search development effort, text search, text analytics software, ...
    • Company: Time Warner Inc.,Reuters, Pitango Venture Capital, Inxight, ClearForest Ltd, ...
    • Person: Gerry Campbell
    • Country: United States, Israel
    • City: Tel Aviv, SAN FRANCISCO, Waltham

    This is a rather impressive set of information. According to the documentation page, the response is delivered in under one second for larger documents, and much faster for smaller ones - in other words, real time or near to it.

    What was not quite clear from the documentation is if Calais can deal with raw HTML pages. It appears that the API requires an XML document, where the main text is marked differently from the header and footer. Ideally, an API like this should be able to accept URLs, because distilling structure from HTML would not be trivial for developers. Another thing that we noticed is that the resulting document is extensively marked up. What the developers get back is literally the output of the Calais engine. It would be good to be able to get a lighter version, which simply identifies entities and their positions in the text.
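To show what such a lighter result might look like, here is a minimal sketch of an entity list with positions. This is our own illustration and not the Calais response format: `EntityMention` and `find_mentions` are hypothetical names, and the lookup is a plain dictionary match that stands in for the real natural language processing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityMention:
    entity_type: str  # e.g. "Company" or "City"
    name: str         # the text of the mention
    offset: int       # character index where the mention starts
    length: int       # number of characters in the mention


def find_mentions(text, known_entities):
    """Locate every occurrence of each known entity name in `text`."""
    mentions = []
    for name, entity_type in known_entities.items():
        start = text.find(name)
        while start != -1:
            mentions.append(EntityMention(entity_type, name, start, len(name)))
            start = text.find(name, start + len(name))
    # Return mentions in reading order.
    return sorted(mentions, key=lambda m: m.offset)


text = "Reuters acquired ClearForest Ltd, which has offices in Tel Aviv."
known = {"Reuters": "Company", "ClearForest Ltd": "Company", "Tel Aviv": "City"}
mentions = find_mentions(text, known)
```

Each mention carries just a type, a name and a span, which is enough for a client to highlight or link the entity without parsing a heavily marked-up document.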

    Currently the API is free for both commercial and non-commercial use and Reuters says it is prepared to scale for a massive concurrent demand. The question is then how can this be used?

    What is Calais Good For?

    There are quite a few interesting applications for this technology. First - better search. Knowing the kinds of entities in the text allows developers to build intelligent search engines that look for related content. For example, imagine a page on Reuters with this press release and in the sidebar links to learn more about Clear Forest, Reuters, Inxight, etc. Similarly, Calais could enable links to countries and cities mentioned in the document. And these searches need not be generic searches, but rather specific vertical ones.

    Another application would be to build engines like Inform, which automatically inserts links into raw text. By automatically identifying entities in the document, Calais also identifies what should be linked. So a big piece of Inform's secret sauce is trivialized. The rest is basically a raw search through the archive, which can be done with a Google custom search engine, for example. It is possible that more tech savvy media companies could leverage Calais in exactly this way.
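The link-insertion step can be sketched in a few lines once the entities are known. This is a simplified illustration under our own assumptions, not how Inform or Calais works: `link_entities` is a hypothetical helper, and the mapping from entity names to URLs is supplied by hand where a real system would derive it from the extraction results and an archive search.

```python
import re
from html import escape


def link_entities(text, entity_urls):
    """Return HTML in which each known entity name is wrapped in a link."""
    if not entity_urls:
        return escape(text)
    # Try longer names first so "Reuters Group" wins over "Reuters".
    names = sorted(entity_urls, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then emit the link.
        parts.append(escape(text[last:match.start()]))
        url = escape(entity_urls[match.group()])
        parts.append(f'<a href="{url}">{escape(match.group())}</a>')
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


urls = {"Reuters": "/topics/reuters", "ClearForest": "/topics/clearforest"}
html_out = link_entities("Reuters bought ClearForest.", urls)
```

Here `html_out` links both company names and leaves the rest of the sentence untouched.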

    Another application is structured alerts. Modern alert systems are keyword based and suffer from false positives. Using Calais it is possible to build precise alerts for people, companies, places and events like corporate acquisitions. With the flood of junk in our RSS readers, this is welcome news.
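The difference between keyword alerts and structured alerts can be illustrated with a small sketch. The entity list here is assumed output from some extraction engine, written by hand for the example; the function names are hypothetical.

```python
def keyword_alert(text, keyword):
    """Naive keyword alert: fires on any case-insensitive substring match."""
    return keyword.lower() in text.lower()


def entity_alert(entities, kind, name):
    """Structured alert: fires only when an entity of the given type and name was extracted."""
    return any(e_kind == kind and e_name == name for e_kind, e_name in entities)


doc_text = "She baked an apple pie in Paris."
# Assumed extractor output for the sentence above: no Company entity is present.
doc_entities = [("City", "Paris")]

keyword_hit = keyword_alert(doc_text, "Apple")               # fires: a false positive
entity_hit = entity_alert(doc_entities, "Company", "Apple")  # does not fire
```

An alert keyed on the pair (type, name) ignores the fruit and would fire only when the company is actually identified in the text.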

    Yet another application would be to incorporate on-the-fly text analysis into browsers. In a way, this is not much different from having Microformat annotations on the page, except that the annotations are delivered on the fly. For example, a browser could call Calais on document load and obtain a list of people, places, companies, etc. which are embedded in the document. With this information the browser would be able to create a more interesting, more contextual, and relevant experience.

    What's In It For Reuters?

    Reuters has opened up a generous API, but why? During our interview, Gerry Campbell, the President/Global Head of Search & Content Technologies at Reuters, explained that Reuters wants the world to be tagged. When the world's content is quickly and readily accessible to their customers, Reuters wins. Semantic technologies result in better, faster, more precise and relevant information, and Reuters, as a big player in the information space, wants to be one of the first companies delivering this kind of experience.

    Beyond an outstanding customer experience, Calais leads to a unique, attractive set of assets. First - a growing semantic database of people, places, companies and events. With each new document submitted into Calais the database gets richer and more complete. This is a roadmap to a semantic business powerhouse, which is clearly a great position to be in for any business media company. And in a way, what grows beneath Calais will not be that unlike Freebase. Except of course, it is happening completely automatically.

    The second big advantage of having an open API is training the system. Any AI-based solution like Clear Forest is in constant need of tuning and evolution. Having other companies use the system would allow the engineers to run into cases that they have not thought about and broaden the capabilities of the system. Campbell told us that Calais is already processing a significant subset of Reuters information in nearly real time. This is both impressive technically and smart from an engineering point of view - it is an "eat your own dog food" approach to building a great piece of software.

    Conclusion

    The Calais API is another big win for top-down semantic web technologies. Using a mix of natural language processing, AI techniques, and massive databases, Reuters' solution extracts important bits of information from raw HTML pages. People, Companies, Places, and Events are really at the heart of many business articles, so being able to instantly identify them in the text is a big deal. From better search to better cross-linking and more intelligent browsing, the Calais API is an invitation to tap into one of the most powerful and pragmatic semantic platforms that exists and works today.

    What sort of things do you envision to be possible with Calais? What applications would you like to see built with this platform?

    http://www.readwriteweb.com/archives/reuters_calais.php http://www.readwriteweb.com/archives/reuters_calais.php Product Reviews Wed, 06 Feb 2008 01:47:18 -0800 Alex Iskold
    The State of the Market in Semantic Technologies Tom Tague from Thomson Reuters' OpenCalais team did a keynote speech today at SemTech in San Jose. His presentation was a wonderful wrapup of current semantic technology trends, and what we can expect over the next few years.

    To open, he said that where we are now in the evolution of the Web is content rich, but information poor - plus "experientially deficient". He suggested that 'web 3.0' is about cleaning up the mess of web 2.0 and improving interfaces. In terms of semantic technology, he explained that over the past 5 years it has evolved from invention of standards to a period of commercial innovation on top of those inventions. While standards are still being worked on, now "we are at an inflection point where innovation is exploding."

    Tague called Calais, the project he leads at Thomson Reuters, "a web service a.k.a. plumbing". The team has had 13 releases, talked with 100+ customers about Calais, and has 13,000 registered developers. He put the ideas that he's been talking about with customers and developers into 6 buckets, which we've listed with sub-categories below.

    Tools

    • Semantic data mgmt
    • Semantic data generation
    • Databases
    • Integration and workflow

    Tague said that tools are important, particularly in the enterprise. He sounded a note of caution to tools vendors: they need to simplify their stories and offer "simple basic tools."

    Social

    • Semantics-powered link sharing
    • Network mining
    • News sharing
    • Tweet mining

    Tague said that we shouldn't focus on providing "frosting" on top of current social Web tools. He advised focusing on commercial imperatives, such as the categories above.

    Advertising

    • Semantic ad placement
    • Contextual ad placement
    • Semantically driven landing pages
    • Mashup ads

    There are clearly opportunities to improve advertising using semantic technology, said Tague.

    Search

    Tague noted that semantic search may be "the answer to the question nobody is asking." He said that we should look at general "semantic search" vs domain specific semantically-enhanced search. The latter is where the commercial opportunity actually is, but he questioned the economics of general semantic search.

    Publishing

    He put this into 3 sub-categories:

    • A-Content Producers - from back office to user experience
    • B-Editorial + Aggregation Publishing Models
    • C-Robotic publishing - aggregation only

    Tague explained that Calais has really focused on this over the last 8-9 months. He said that classic publishers can get an enormous amount of value from this. Right now the big focus is "back in the boiler room," for example to cut editors from 3 to 2. He expects that later on more focus will go on enhancing the user experience.

    Tague thinks that B is the biggest opportunity, using Huffington Post as an example. He said that it gives a "near newspaper like experience" at perhaps a 5th of the cost. It's an area where they're seeing adoption of Calais.

    Interface

    Tague noted that gaming is a huge industry that the semantic technology industry can learn from. He listed these attributes:

    • Great story line
    • High interactivity, immediate responsiveness
    • No interruptions
    • Graphically engaging
    • Seamless
    • Fun

    So he asked who out there is trying to really change the user experience in semantic technology? He listed 4 companies (all of whom we've profiled on ReadWriteWeb):

    • Zemanta
    • Apture
    • Feedly
    • Glue

    Tague told the audience that the next big innovation in interface will be something that stays with the user where they are, which will be mobile and in the browser.

    To sum up, Tague suggested that semantic technologies vendors should decide whether they care about semantics or about user value. If it's semantics, then be a tools vendor. He said the basic building blocks are out there already, so focus on user experience.

    Disclosure: SemTech has been a recent sponsor of ReadWriteWeb

    http://www.readwriteweb.com/archives/the_state_of_the_market_in_semantic_technologies.php http://www.readwriteweb.com/archives/the_state_of_the_market_in_semantic_technologies.php Conferences Tue, 16 Jun 2009 09:23:17 -0800 Richard MacManus
    Bit.ly: Please Use This TinyURL of the Future URL shorteners like TinyURL are a wildly popular way to share long links over email, IM, microblogging and other contexts. The millions of shortcuts that have been created through such services represent a huge opportunity to capture interesting data - but to date those opportunities have all just gone down the drain.

    Bit.ly, a new URL shortening service from the innovation network Betaworks, is launching today with a staggering feature set for both end users and forward-looking developers.

    We've been waiting for a more intelligent URL shortening service to hit the market but even in our most ambitious visions we haven't seen something like this coming. We hope you'll use it - the more we all do, the more everyone will benefit.

    What Bit.ly Does Today

    At launch Bit.ly is a relatively sophisticated URL shortener. It uses a cookie to remember the last 15 links you've shortened and displays that history on the home page when you visit. It allows you to set up a custom URL ending for your link. It automatically creates 3 thumbnails for every page you save a link to.

    How about these features, though? Bit.ly saves a cached copy forever of every page you shorten a link to, on Amazon's S3 storage (processing is done on EC2, as well, so uptime looks good). Bit.ly also tracks clickthrough numbers and referrers so you can see what kind of traffic your shortcut got and from where. There's a simple API for adding Bit.ly functionality to any other web app (Betaworks affiliated gaming site ImInLikeWithYou already has this live) and all the data, including traffic data and thumbnails, is easily accessible by XML and JSON feeds.
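For readers curious how a shortener can track clickthroughs and referrers, here is a minimal in-memory sketch. It is not Bit.ly's implementation: the `Shortener` class, the base-62 code scheme and the `"direct"` referrer label are all assumptions made for the example, and caching, thumbnails and custom endings are left out.

```python
import string
from collections import Counter

ALPHABET = string.digits + string.ascii_letters  # 62 symbols


def encode_base62(n):
    """Encode a non-negative integer as a short base-62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class Shortener:
    """In-memory URL shortener that counts clicks and referrers per short code."""

    def __init__(self):
        self.urls = {}       # code -> long URL
        self.codes = {}      # long URL -> code, so repeats reuse one code
        self.referrers = {}  # code -> Counter of referrer strings

    def shorten(self, url):
        """Return the short code for url, creating one on first use."""
        if url not in self.codes:
            code = encode_base62(len(self.urls) + 1)
            self.urls[code] = url
            self.codes[url] = code
            self.referrers[code] = Counter()
        return self.codes[url]

    def resolve(self, code, referrer="direct"):
        """Record one click and return the long URL (raises KeyError if unknown)."""
        url = self.urls[code]
        self.referrers[code][referrer] += 1
        return url

    def clicks(self, code):
        """Total clicks recorded for a short code."""
        return sum(self.referrers[code].values())
```

Because every redirect passes through `resolve`, the service sees each click and its referrer as a side effect of doing its job, which is the data opportunity the article describes.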

    Those are some pretty awesome features but that's only the beginning. A JavaScript submission bookmarklet and user accounts should be available soon. (Update: Bit.ly just added a simple bookmarklet that will make it easier to use casually.)

    The Future of Bit.ly: Semantic and Geo Spatial Analysis

    In the background, Bit.ly is analyzing all of the pages that its users create shortcuts to using the Open Calais semantic analysis API from Reuters! Calais is something we've written about extensively here. Bit.ly will use Calais to determine the general category and specific subjects of all the pages its users create shortcuts to. That information will be freely available to the developer community using XML and JSON APIs as well.

    As if that's not a whole lot of awesome already - Bit.ly is also using the MetaCarta GeoParsing API to draw geolocation data out of all the web pages it collects.

    You want to see all the web pages related to the US Presidential election, Barack Obama and Asheville, North Carolina? Or about Technology, Google and The Dalles, Oregon? That will be what Bit.ly delivers if it can build up a substantial database of pages. Once it does, it will open that data up to other developers as well.
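A query like the ones above amounts to filtering pages on several facets at once. The sketch below assumes each page has already been annotated with a category, a set of subjects and a set of places; the `Page` record and `find_pages` function are hypothetical and do not reflect any published Bit.ly API.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A shortened page plus the annotations assumed to come from analysis."""
    url: str
    category: str
    subjects: set = field(default_factory=set)
    places: set = field(default_factory=set)


def find_pages(pages, category=None, subject=None, place=None):
    """Return URLs of pages matching every facet that was supplied."""
    return [
        p.url
        for p in pages
        if (category is None or p.category == category)
        and (subject is None or subject in p.subjects)
        and (place is None or place in p.places)
    ]


# Placeholder data for illustration only.
pages = [
    Page("http://example.com/1", "Politics", {"Barack Obama"}, {"Asheville, North Carolina"}),
    Page("http://example.com/2", "Technology", {"Google"}, {"The Dalles, Oregon"}),
    Page("http://example.com/3", "Technology", {"Google"}, {"Mountain View, California"}),
]
matches = find_pages(pages, category="Technology", subject="Google", place="The Dalles, Oregon")
```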

    Why use a URL shortener to catalog all those pages? Why not? Each shortcut signals a page that's of importance to a real human user and an army of link-senders sounds like a great way to build up that database. Semantic indexing of the web through casual but opt-in and common user activity is a great strategy.

    Then we can all share access to that data. We're excited and we hope you'll put Bit.ly to use.

    http://www.readwriteweb.com/archives/bitly_alternative_to_tinyurl.php http://www.readwriteweb.com/archives/bitly_alternative_to_tinyurl.php Product Reviews Tue, 08 Jul 2008 11:50:53 -0800 Marshall Kirkpatrick
    Hakia Announces Semantic API Semantic search engine Hakia today announced a set of APIs that opens up their natural language processing and search platform to developers. Hakia's Syndication Web Services come in two parts: search queries, which allow developers to add web search functionality leveraging Hakia's five billion page index, and XML feed calls, which give developers access to Hakia's underlying natural language processing technology. The latter of the two is clearly the more compelling of the offerings.

    Mobile video firm Berggi released Berggi Search, a mobile search application that lets users search Hakia's index via the API from mobile phones. Berggi is leveraging the part of Hakia's API that lets developers lean on the company's search platform -- that, however, is not the part that really interests us.

    What is more interesting are the XML feed calls that Hakia is offering that give access to their underlying NLP engine. Right now, only the "Summarizer" element is available. Summarizer, which Hakia says can be used to suggest tags or abstracts, analyzes and extracts meaning from large blocks of text or the contents of URLs. Other elements that are not yet available are Categorizer, which identifies "categorical phrases" in text, Characterizer, which "identifies and expands descriptive keywords or tags," and Text Meaning Representation.

    Hakia has an XML testing form up on their Club Hakia page, and in our testing it seemed a little rough around the edges. Compared to our testing of Open Calais from Reuters (our coverage), the summaries and tags the XML testing form returned using the Summarizer element weren't very impressive. Mostly, it seemed to just return the headline or first sentence as the summary for articles we threw at it. And for RWW articles, Hakia Summarizer would suggest as tags the tags that we entered by hand in MovableType.

    Hakia's Syndication Web Services are free for up to 30,000 requests per day for search services (unlimited free queries for Quotes and Cartoons), and free for up to 1,000 requests per day for XML feed calls. Have you had a chance to play with Hakia's new semantic API? If so, what did you think? How does it compare to Calais or Semantic Hacker? Let us know in the comments below.

    Full Disclosure: Occasional ReadWriteWeb contributor Emre Sokullu is a technology evangelist at Hakia.

    http://www.readwriteweb.com/archives/hakia_announces_semantic_api.php http://www.readwriteweb.com/archives/hakia_announces_semantic_api.php Semantic Web Thu, 19 Jun 2008 12:56:42 -0800 Josh Catone
    Media Cloud Leverages Calais to Track News Trends Media Cloud, a new project from the Berkman Center at Harvard University, has an ambitious goal: It will do the heavy lifting of analyzing stories from thousands of traditional news sources, analyzing the semantics of the content through Calais (covered here and here), and then providing tools to quickly get trending results. This approach promises to bring what used to be an expensive and laborious process to anyone who has a need for this type of data but lacks the means to get it.

    At launch, Media Cloud will offer three primary trend visualization tools and a discussion forum for anyone to use. In addition, a news RSS feed and mailing list are available. We'll now take a moment to review how these visualization tools work, and we'd also like to point you to a very illustrative video interview of project developer Ethan Zuckerman, by the Nieman Journalism Lab.

    Disclosure: Reuters Calais is an RWW sponsor. And they are awesome.

    Top 10 Chart

    This tool lets you compare up to three media sources, generating a list of the top ten most mentioned terms for that source and relative frequency of use for each term. This chart can be useful in a number of ways, indicating not only what terms are considered most important by each news source at the moment you generate the chart, but also showing if there is a clear standout term that may indicate a very hot topic. Also, when comparing two similar media sources, say for example the New York Times and the Washington Post, the resulting chart can give you an idea of what each paper considers more important leading topics.
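The computation behind a chart like this is straightforward to sketch. The example assumes each story has already been reduced to a list of extracted terms; the `top_terms` function and the sample data are invented for illustration and are not Media Cloud's code.

```python
from collections import Counter


def top_terms(stories, n=10):
    """Return the n most-mentioned terms with their share of all mentions."""
    counts = Counter(term for story in stories for term in story)
    total = sum(counts.values())
    return [(term, count / total) for term, count in counts.most_common(n)]


# Hypothetical input: one list of extracted terms per story.
source_a = [["Obama", "Congress"], ["Obama", "economy"], ["Obama"]]
ranking = top_terms(source_a)
```

Running the same function over two sources and placing the rankings side by side gives the comparison the chart displays; a term whose share dwarfs the rest is the "clear standout" mentioned above.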

    Top 10 Term Pivot Chart

    You can put in your own search term and up to three media sources in this tool to see what terms are most frequently mentioned alongside the search term in those sources' stories. This allows you to gain insight on how frequently related terms cluster together. So, for example if you search for Obama, you might find that, while United States is the most common related term, CNN's focus is more on Congress while FOX News writes more about the White House.
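A pivot of this kind is a co-occurrence count: for every story that mentions the search term, tally the other terms in that story. The sketch below assumes per-story term lists as input; `pivot_terms` and the sample stories are hypothetical.

```python
from collections import Counter


def pivot_terms(stories, query, n=10):
    """Count the terms that appear in the same story as query, most frequent first."""
    counts = Counter()
    for terms in stories:
        unique = set(terms)  # count each term once per story
        if query in unique:
            counts.update(unique - {query})
    return counts.most_common(n)


# Hypothetical input: one list of extracted terms per story.
stories = [
    ["Obama", "United States", "Congress"],
    ["Obama", "United States", "White House"],
    ["Congress", "budget"],
]
related = pivot_terms(stories, "Obama")
```

Computing this separately for each media source is what would reveal differences in emphasis between outlets.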

    World Map Chart

    This tool shows global coverage of all terms in the Media Cloud database for the selected media sources. Naturally, a newspaper that is focused on national US news will not have the depth of coverage of a source that has an international perspective. But even when comparing similar international sources, the weight each source gives to news from different regions can differ greatly. Compare the New York Times with BBC News, for example: darker colors indicate heavier coverage, and the map shows that the BBC covers European affairs more heavily, while the NYT has stronger coverage of Canada and Mexico.

    Discussion

    Media Cloud is a project that developed from discussions around where story trends came from. This tool attempts to serve as a foundation to help move these conversations forward, and the Berkman Center is keeping the door open for new ideas and ways of using this data. To that end, they also have a discussion forum where people can contribute suggestions, thoughts and ideas to the project. Media Cloud also provides an RSS feed and an email list you can subscribe to if you want to stay in the loop for any new developments.

    Like our coverage of the New York Times R&D Labs, we see this as an example of how the Internet is driving traditional media to change and respond in new ways. We are excited by the scope and potential that Media Cloud brings to anyone interested in following news and media trends.

    http://www.readwriteweb.com/archives/media_cloud_leverages_calais_to_track_news_trends.php http://www.readwriteweb.com/archives/media_cloud_leverages_calais_to_track_news_trends.php News Wed, 11 Mar 2009 17:05:00 -0800 Phil Glockner
    Look Out TinyURL; Bit.ly Gets Hot Silicon Valley Cash Link shortening services are so common you can't throw a stone online without hitting one, but TinyURL is the undisputed champ. It's one of the oldest, its name says what it does and, despite repeated outages, its downtime is small enough that millions of people keep using it.

    TinyURL has also allowed incomprehensible amounts of value, both in terms of technology and in terms of money, to sit on the table unclaimed. For years. Now a group of some of the web's hottest investors are betting a few million dollars that a smart TinyURL competitor called Bit.ly can take advantage of being the conduit through which millions of people visit sites of interest to them.

    Today Bit.ly announced that it has raised about $2 million in its first round of funding. The round was led by Tim O'Reilly's venture fund and included money from Mitch Kapor (the founder of Lotus), Jeff Clavier (portfolio), Ron Conway (early Google investor), the Accelerator Group and Howard Lindzon's new fund Social Leverage. All of those names are some of the hottest in the startup scene and all the companies in those various portfolios will now have a close business connection to Bit.ly.

    We reviewed Bit.ly when the project launched last July and urged readers to use this service to shorten their long links instead of other services like TinyURL. Why do we care what service people use? Because we're fans of innovation and Bit.ly is aiming to be a platform for innovation like TinyURL should have been. If web 2.0 is about democratizing publishing, the next step is machine leveraging all the resulting data.

    The Bit.ly Magic

    What does Bit.ly do that's so special? They use all the data they see and make it available to third party developers who want to build on top of it. They keep track of the clickthrough numbers and can tell you what the hottest links on the web are at any time. See this @bitlynow Twitter account for one display of that information. Bit.ly says it resolved 20 million distinct URLs last week. That's the beginning of a really large database.

    Bit.ly also uses Reuters Calais to extract semantic terms out of the pages that shortcuts are created to. That's valuable information. Want to see the most popular web pages that talk about Dancing With The Stars, or the Federal Stimulus Package, or some other topic, in the last 30 minutes? Somebody wants to, you'd better believe, and that's the kind of real-time information that the Bit.ly API aims to make available. (Disclosure: Calais is an RWW sponsor.)

    We've had some concerns about the clickthrough numbers that Bit.ly has reported but the company says they are going through a list of reporting sources that give them problems and eliminating them one at a time. The company says it is now reporting real-time traffic stats that are within 10% of what Google Analytics reports much later. We've been watching the numbers improve in accuracy when it comes to our numbers and can confirm that they are getting much better.

    A number of people have looked at today's news and thought it was ridiculous that a link shortening business could raise $2 million in funding. We don't think it's ridiculous at all. Show us a service that can report in real time how many people are visiting millions of pages around the web and what those pages are about, that exposes that data in an API, and we'll show you a platform we're very excited to see work.

    http://www.readwriteweb.com/archives/look_out_tinyurl_bitly_gets_hot_silicon_valley_h.php http://www.readwriteweb.com/archives/look_out_tinyurl_bitly_gets_hot_silicon_valley_h.php News Mon, 30 Mar 2009 11:42:44 -0800 Marshall Kirkpatrick
    Guzzle Turns Out 2.0 Paris-based Lemonchik has announced Version 2.0 of their topic-by-topic aggregator, Guzzle. Do you...dig?

    Guzzle's new backend, Nibble, has been rewritten from scratch. Nibble receives PubSubHubbub notifications and every story is automatically processed with Reuters Calais technology, adding rich semantic encoding information. There's also a new user interface, categories and archives and a magazine-like "extended view."

    To set up a Guzzle page you enter search terms and the service hunts down rich content, tearing out spam by the roots and never even stopping to say it's sorry.

    "Guzzle constantly monitors hundreds of feeds. Each new article is carefully inspected, analysed. The language it has been written in is detected, and important keywords, places, companies and people's name are extracted and indexed."

    If you're like me, you looked at a thing called Guzzle by an outfit called Lemonchik from a place like Paris and you expected cocktails. A sidecar, perhaps, in a space-age bachelor pad set in an alley off the place de la Bastille.

    Well, who said life was fair?

    http://www.readwriteweb.com/archives/guzzle_turns_out_20.php http://www.readwriteweb.com/archives/guzzle_turns_out_20.php Wed, 05 May 2010 20:17:00 -0800 Curt Hopkins