Microformats - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/Microformats en Copyright 2009 Richard MacManus readwriteweb@gmail.com Sun, 22 Nov 2009 12:00:55 -0800 http://www.sixapart.com/movabletype/?v=4.23-en http://blogs.law.harvard.edu/tech/rss

Firefox Could Offer New Ways to View Data (Mock-ups)

ffsunglasses.jpg

Bees can see ultraviolet light that the human eye cannot see. Snakes and mosquitoes can see infrared light. The Firefox (browser) can see things that the human eye can't, too, but a lot of it doesn't get used for anything. So far.

Microformats are one thing that the browser notices while serving up web pages. This type of markup, which designates certain types of information, has just begun to be leveraged in real use cases. Alex Faaborg, Principal Designer on the Firefox team, has some interesting ideas about how the browser could leverage the microformatted information it comes across.

faaborg.jpg

Faaborg gave a presentation at this weekend's FOOCamp about some of the concepts he'd like to see played out in the future of the browser.

These are a few of the conceptual mock-ups he showed in his presentation; they aren't planned features - but it sure would be cool if they became reality.

Location

The gist of this idea is that information marked up with microformats as locations, events, etc. on pages around the web could be aggregated by Firefox and made available for viewing in other applications. Information made machine readable with the right markup could be passively captured and reused in different contexts to add new value. That's a pretty smart idea.

In this first mock-up you can see a user doing a search across multiple sites for apartments for rent. The browser captures all the locations viewed in the sidebar for organizing and viewing elsewhere.

ffmicro1.jpg

In this next image you can see one resulting use case for the data captured above: viewing browsed addresses together in Google Earth.

ffmicro2.jpg
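
To make the Google Earth step a little more concrete, here is a minimal Python sketch of how a list of captured locations could be written out as KML, the XML format Google Earth opens. The function name `locations_to_kml` and the sample listings and coordinates are our own illustration; nothing here comes from Faaborg's mock-ups or from Firefox itself.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def locations_to_kml(locations):
    """Build a KML document from (name, latitude, longitude) tuples."""
    kml = ET.Element("kml", xmlns=KML_NS)
    doc = ET.SubElement(kml, "Document")
    for name, lat, lon in locations:
        placemark = ET.SubElement(doc, "Placemark")
        ET.SubElement(placemark, "name").text = name
        point = ET.SubElement(placemark, "Point")
        # KML writes coordinates as longitude,latitude (in that order).
        ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical locations a browser might have collected while apartment hunting.
captured = [
    ("Studio on Example St", 37.7749, -122.4194),
    ("Loft near the lake", 37.8044, -122.2712),
]
kml_text = locations_to_kml(captured)
```

Saving `kml_text` to a `.kml` file would be enough to view the placemarks together; how the browser would gather the locations in the first place is left out of this sketch.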

Events

Location is just one type of microformat. Another is event listings. In the mock-up below, the browser has captured all of the event listings in a user's browsing history and made them available as a "ghost calendar" in Google Calendar. Just a reminder - that event you stumbled across is happening later today!

ffmicro3.jpg
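
As a rough illustration of the "ghost calendar" idea, the Python sketch below pulls events out of pages marked up with the hCalendar microformat (`vevent`, `summary`, and `dtstart` are that format's class names) and lists the ones that fall on a given day. The `EventCollector` class, the `events_on` helper, and the sample events are hypothetical; the sketch assumes well-formed markup, no void elements inside an event, and ISO-formatted dates in the `title` attribute.

```python
from datetime import date, datetime
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Collects hCalendar events (class="vevent") from an HTML page."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._event = None        # event being filled, or None outside a vevent
        self._depth = 0           # element nesting depth inside the vevent
        self._in_summary = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._event is None:
            if "vevent" in classes:
                self._event = {"summary": "", "start": None}
                self._depth = 1
            return
        self._depth += 1
        if "summary" in classes:
            self._in_summary = True
        if "dtstart" in classes and attrs.get("title"):
            # hCalendar commonly carries the machine-readable date in title="...".
            self._event["start"] = datetime.fromisoformat(attrs["title"])

    def handle_data(self, data):
        if self._event is not None and self._in_summary:
            self._event["summary"] += data

    def handle_endtag(self, tag):
        if self._event is None:
            return
        self._in_summary = False
        self._depth -= 1
        if self._depth == 0:      # closing tag of the vevent element itself
            self.events.append(self._event)
            self._event = None


def events_on(events, day):
    """Return the summaries of events that start on the given date."""
    return [e["summary"].strip() for e in events
            if e["start"] is not None and e["start"].date() == day]


# Made-up markup standing in for pages from a browsing history.
history_html = """
<div class="vevent"><span class="summary">Neighborhood potluck</span>
  <abbr class="dtstart" title="2009-04-21T18:00">April 21, 6pm</abbr></div>
<div class="vevent"><span class="summary">Book club</span>
  <abbr class="dtstart" title="2009-04-24T19:30">April 24</abbr></div>
"""
collector = EventCollector()
collector.feed(history_html)
today = events_on(collector.events, date(2009, 4, 21))
```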

Other types of microformats include designation of people, reviews, tags and more. All these item types could be pulled out of a person's browsing history and analyzed or viewed in new and different ways. There are websites and services doing this already but for the browser to do it too is a very interesting idea.

Firefox is a dynamic and widely used collection of software for which the future is wide open. This idea of capturing and leveraging microformats across applications is one of our favorite proposed directions for the future. The browser already sees this data, so doing something with it makes a whole lot of sense. To follow these and other ideas, check out Alex Faaborg's blog at Mozilla.

Firefox sunglasses image via Photobucket

http://www.readwriteweb.com/archives/firefox_could_offer_new_ways_to_view_data_mock-ups.php data portability Tue, 21 Apr 2009 11:13:27 -0800 Marshall Kirkpatrick
SocialWhois: Whois Lookups for the Social Web

socialwhois.gif

When you want to know about a domain name, you jump to whois to get all of the information on the person who registered it. But when you want to know more about the person who just started following you on Twitter or FriendFeed, it hasn't been that easy - even though we've tried to provide you with tools to do it. Now, a new service promises to simplify the process. It's a new take on whois for the social web: SocialWhois, a service that uses XFN, microformats, APML, and tagging to provide a more complete picture of that new follower's presence online.

The service works like any number of XFN crawlers we've seen, but it's simple enough that anyone can use it. Simply enter the Twitter or FriendFeed username of the person you'd like to look up. The service will do its best to guess who the person is. As we tested it, we found it doing an incredibly good job of guessing - finding all sorts of interesting and relevant links about the users we tried.
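
For readers who haven't seen an XFN crawler, the general technique can be sketched in a few lines of Python: follow links marked `rel="me"` (XFN's way of saying "this other page is also me") from one profile to the next. This is our own simplified illustration, not SocialWhois' code; the `PAGES` dictionary stands in for fetching real pages, and all URLs are made up.

```python
from html.parser import HTMLParser


class RelMeCollector(HTMLParser):
    """Collects the href of every <a> whose rel attribute includes "me"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if "me" in rels and attrs.get("href"):
            self.links.append(attrs["href"])


def crawl_identity(start_url, pages):
    """Breadth-first walk over rel="me" links; returns profile URLs in visit order."""
    seen, queue = [], [start_url]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.append(url)
        collector = RelMeCollector()
        collector.feed(pages.get(url, ""))   # a real crawler would fetch the URL here
        queue.extend(collector.links)
    return seen


# Hypothetical profile pages, keyed by URL.
PAGES = {
    "https://twitter.example/alice":
        '<a rel="me" href="https://blog.example/alice">my blog</a>',
    "https://blog.example/alice":
        '<a rel="me nofollow" href="https://photos.example/alice">photos</a>'
        '<a rel="friend" href="https://blog.example/bob">Bob</a>',
    "https://photos.example/alice":
        '<a rel="me" href="https://twitter.example/alice">twitter</a>',
}
profiles = crawl_identity("https://twitter.example/alice", PAGES)
```

Note that the `rel="friend"` link to Bob is not followed: only `rel="me"` links claim to describe the same person.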

socialwhoisScreen.gif

If you'd like to tweak the results for your profile - or hide your profile completely - you can always log in using your Twitter credentials or your FriendFeed key.

Oh great. Yet another profile to complete? Not exactly, thanks to SocialWhois' "voodoo" button: one click and you're likely to have your profile pre-populated with relevant information from your profiles across the Web.

This Isn't a Popularity Contest

One of the things that makes SocialWhois so appealing isn't what it is, but rather what it isn't. It isn't a popularity contest. It's a search for relevance:

"SocialWhois is about everything but popularity. You'll think that it's hypocrisy or irony, but I (SocialWhois' creator) am not popular on SocialWhois! And guess what, I like it that way! Really :) In fact, on SocialWhois, no one is popular.... You can navigate in the graph and discover new faces, and the way this graph is being traversed is different for everyone of us."

With the tagging functionality, you're more likely to find that user who shares similar interests with you. And in so doing, you're likely to have more engaging conversations.

It Just Works

One of the things we've always loved about whois is the fact that it just works. There are any number of services that allow you to look at the data held by the registrars, all of which have varying levels of usability and clutter. But by and large, we run whois lookups because they serve a specific purpose.

SocialWhois has a lot of that same appeal. Simple, straightforward, and it provides the information you're seeking. It will be interesting to add this to the collection of tools we use to find - and better understand - those around us on the social web.

http://www.readwriteweb.com/archives/socialwhois_social_web_whois.php Social Web Wed, 04 Feb 2009 23:45:47 -0800 Rick Turoczy
Semantic Web Patterns: A Guide to Semantic Technologies

In this article, we'll analyze the trends and technologies that power the Semantic Web. We'll identify patterns that are beginning to emerge, classify the different trends, and peek into what the future holds.

In a recent interview Tim Berners-Lee pointed out that the infrastructure to power the Semantic Web is already here. ReadWriteWeb's founder, Richard MacManus, even picked it to be the number one trend in 2008. And rightly so. Not only are the bits of infrastructure now in place, but we are also seeing startups and larger corporations working hard to deliver end user value on top of this sophisticated set of technologies.

Editor's note: Looking back over 2008, there were some posts on ReadWriteWeb that did not get the attention we felt they deserved - whether because of timing, competing news stories, etc. So in this end-of-year series, called Redux, we're resurrecting some of those hidden gems. This is one of them; we hope you enjoy (re)reading it!

The Semantic Web means many things to different people, because there are a lot of pieces to it. To some, the Semantic Web is the web of data, where information is represented in RDF and OWL. Some people replace RDF with Microformats. Others think that the Semantic Web is about web services, while for many it is about artificial intelligence - computer programs solving complex optimization problems that are out of our reach. And business people always redefine the problem in terms of end user value, saying that whatever it is, it needs to have simple and tangible applications for consumers and enterprises.

The disagreement is not accidental, because the technology and concepts are broad. Much is possible and much is to be imagined.

1. Bottom-Up and Top-Down

We have written a lot about the different approaches to the Semantic Web - the classic bottom-up approach and the new top-down one. The bottom-up approach is focused on annotating information in pages, using RDF, so that it is machine readable. The top-down approach is focused on leveraging information in existing web pages, as is, to derive meaning automatically. Both approaches are making good progress.

A big win for the bottom-up approach was the recent announcement from Yahoo! that their search engine is going to support RDF and microformats. This is a win-win-win for publishers, for Yahoo!, and for customers - publishers now have an incentive to annotate information because Yahoo! Search will be taking advantage of it, and users will then see better, more precise results.

Another recent win for the bottom-up approach was the announcement of the Semantify web service from Dapper (previous coverage). This offering will enable publishers to add semantic annotations to existing web pages. The more tools like Semantify that pop up, the easier it will be for publishers to annotate pages. Automatic annotation tools combined with the incentive to annotate the pages is going to make the bottom-up approach more compelling.

But even if the tools and incentive exist, making the bottom-up approach widespread is difficult. Today, the magic of Google is that it can understand information as is, without asking people to fully comply with W3C standards or SEO techniques. Similarly, top-down semantic tools are focused on dealing with imperfections in existing information. Among them are the natural language processing tools that do entity extraction - such as the Calais and TextWise APIs that recognize people, companies, places, etc. in documents; vertical search engines, like ZoomInfo and Spock, which mine the web for people; technologies like Dapper and BlueOrganizer, which recognize objects in web pages; and Yahoo! Shortcuts, Snap and SmartLinks, which recognize objects in text and links.

[Disclosure: Alex Iskold is founder and CEO of AdaptiveBlue, which makes BlueOrganizer and SmartLinks.]

Top-down technologies are racing forward despite imperfect information. And, of course, they benefit from the bottom-up annotations as well. The more annotations there are, the more precise top-down technologies will get - because they will be able to take advantage of structured information as well.

2. Annotation Technologies: RDF, Microformats, and Meta Headers

Within the bottom-up approach to annotation of data, there are several choices for annotation. They are not equally powerful, and in fact each approach is a trade off between simplicity and completeness. The most comprehensive approach is RDF - a powerful, graph-based language for declaring things, and attributes and relationships between things. In a simplistic way, one can think of RDF as the language that allows expressing truths like: Alex IS human (type expression), Alex HAS a brain (attribute expression), and Alex IS the father of Alice, Lilly, and Sofia (relationship expression). RDF is powerful, but because it is highly recursive, precise, and mathematically sound, it is also complex.
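
The three kinds of statements above map naturally onto subject-predicate-object triples, which is the shape of data RDF works with. The Python sketch below is only an illustration of that shape: real RDF identifies things with URIs and has standard serializations, whereas the `Triple` type, the plain-string names, and the `match` helper here are our own simplifications.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The example statements from the text, written as triples.
GRAPH = {
    Triple("Alex", "type", "Human"),        # type expression
    Triple("Alex", "has", "Brain"),         # attribute expression
    Triple("Alex", "fatherOf", "Alice"),    # relationship expressions
    Triple("Alex", "fatherOf", "Lilly"),
    Triple("Alex", "fatherOf", "Sofia"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern, sorted; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


children = [t.obj for t in match(GRAPH, subject="Alex", predicate="fatherOf")]
```

Because every statement has the same three-part shape, one generic pattern matcher can answer any question about the graph, which is a large part of RDF's appeal for interoperability.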

At present, most use of RDF is for interoperability. For example, the medical community uses RDF to describe genomic databases. Because the information is normalized, the databases that were previously silos can now be queried together and correlated. In general, in addition to semantic soundness, the major benefit of RDF is interoperability and standardization, particularly for enterprises, as we will discuss below.

Microformats offer a simpler approach by adding semantics to existing HTML documents using agreed-upon class names - the same class attribute that CSS styles hook into. The metadata is compact and is embedded inside the actual HTML. Popular microformats are hCard, which describes personal and company contact information, hReview, which adds meta information to review pages, and hCalendar, which is used to describe events.
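
To show how little machinery is involved, here is a minimal Python sketch that reads an hCard out of ordinary HTML using only the standard library. The class names `vcard`, `fn`, `org`, `tel`, and `url` come from hCard; the `HCardCollector` class and the sample contact are hypothetical, and the sketch handles only a small subset of the format (well-formed markup, no void elements inside the card).

```python
from html.parser import HTMLParser


class HCardCollector(HTMLParser):
    """Collects a few hCard properties from elements marked class="vcard"."""

    TEXT_PROPS = ("fn", "org", "tel")

    def __init__(self):
        super().__init__()
        self.cards = []
        self._card = None     # card being filled, or None outside a vcard
        self._depth = 0       # element nesting depth inside the vcard
        self._prop = None     # text property currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._card is None:
            if "vcard" in classes:
                self._card, self._depth = {}, 1
            return
        self._depth += 1
        if "url" in classes and attrs.get("href"):
            self._card["url"] = attrs["href"]
        for prop in self.TEXT_PROPS:
            if prop in classes:
                self._prop = prop

    def handle_data(self, data):
        if self._card is not None and self._prop:
            self._card[self._prop] = self._card.get(self._prop, "") + data

    def handle_endtag(self, tag):
        if self._card is None:
            return
        self._prop = None
        self._depth -= 1
        if self._depth == 0:  # closing tag of the vcard element itself
            self.cards.append(self._card)
            self._card = None


# Made-up contact markup; the visible text is unchanged by the annotations.
page = (
    '<div class="vcard">'
    '<a class="url fn" href="http://example.com/">Alex Example</a>, '
    '<span class="org">Example Corp</span>'
    '</div>'
)
collector = HCardCollector()
collector.feed(page)
cards = collector.cards
```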

Microformats are gaining popularity because of their simplicity, but they are still quite limiting. There is no way to describe type hierarchies, which the classic semantic community would say is critical. The other issue is that microformats are somewhat cryptic, because the focus is on keeping the annotations to a minimum. This, in turn, raises the question of whether embedding metadata into the view (HTML) is a good idea: what happens if the underlying data changes after someone has made a copy of the HTML document? Despite these issues, microformats are currently used by Flickr, Eventful, and LinkedIn, and many other companies are looking to adopt them, particularly because of the recent Yahoo! announcement.

An even simpler approach is to put metadata into the meta headers. This approach has been around for a while and it is a shame that it has not been widely adopted. As an example, the New York Times recently launched extended annotations for its news pages. The benefit of this approach is that it works great for pages that are focused on a topic or a thing. For example, a news page can be described with a set of keywords, geo location, date, time, people, and categories. Another example would be for book pages. O'Reilly.com has been putting book information into the meta headers, describing the author, ISBN, and category of the book.
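
Reading this kind of annotation takes very little code. The Python sketch below collects `<meta name="..." content="...">` pairs from a page. The meta names `isbn` and `category` and the sample values are our own placeholders; we are not claiming they match what O'Reilly or the New York Times actually publish.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs; names may repeat."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            self.meta.setdefault(name.lower(), []).append(content)


# A made-up book page header.
book_page = """
<html><head>
<title>Example Book</title>
<meta name="author" content="Jane Doe">
<meta name="isbn" content="0000000000" />
<meta name="category" content="Web">
<meta name="category" content="Programming">
</head><body></body></html>
"""
collector = MetaCollector()
collector.feed(book_page)
book = collector.meta
```

The trade-off is visible in the output: the annotations describe the page as a whole, so this works for a page about one book but not for a page listing twenty.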

Despite the fact that all these approaches are different, they are also somewhat complementary; and each of them is helpful. The more annotations there are in web pages, the more standards are implemented, and the more discoverable and powerful the information becomes.

3. Consumer and Enterprise

Yet another dimension of the conversation about the Semantic Web is the focus on consumer and enterprise applications. In the consumer arena we have been looking for a Killer App - something that delivers tangible and simple consumer value. People simply do not care that a product is built on the Semantic Web; all they are looking for is utility and usefulness.

Up until recently, the challenge has been that the Semantic Web focused on rather academic issues - like annotating information to make it machine-readable. The promise was that once the information is annotated and the web becomes one big giant RDF database, then exciting consumer applications would come. The skeptics, however, have been pointing out that first there needs to be a compelling use case.

Consumer applications based on the Semantic Web include generic and vertical search, contextual shortcuts and previews, personal information management systems, and semantic browsing tools. All of these applications are in their early days and have a long way to go before being truly compelling for the average web user. Still, even if these applications succeed, consumers will not be interested in knowing about the underlying technology - so there is really no marketing play for the Semantic Web in the consumer space.

Enterprises are a different story for a couple of reasons. First, enterprises are much more used to techno speak. To them utilizing semantic technologies translates into being intelligent and that, in turn, is good marketing. 'Our products are better and smarter because we use the Semantic Web' sounds like a good value proposition for the enterprise.

But even above the marketing speak, RDF solves a problem of data interoperability and standards. This "Tower of Babel" situation has been in existence since the early days of software. Forget semantics; just a standard protocol, a standard way to pass around information between two programs, is hugely valuable in the enterprise.

RDF offers a way to communicate using an XML-based language, which on top of that has sound mathematical foundations to enable semantics. This sounds great, and even the complexity of RDF is not going to stop enterprises from using it. However, there is another problem that might stop it - scalability. Unlike relational databases, which have been around for ages and have been optimized and tuned, XML-based databases are still not widespread. In general, the problem is in the scale and querying capabilities. Like object-oriented database technologies of the late '90s, XML-based databases hold a lot of promise, but we have yet to see them in action in a big way.

4. Semantic APIs

With the rise of Semantic Web applications, we are also seeing the rise of Semantic APIs. In general, these web services take as an input unstructured information and find entities and relationships. One way to think of these services is mini natural language processing tools, which are only concerned with a subset of the language.

The first example is the Open Calais API from Reuters that we have covered in two previous articles. This service accepts raw text and returns information about people, places, and companies found in the document. The output not only returns the list of found matches, but also specifies places in the document where the information is found. Behind Calais is a powerful natural language processing technology developed by ClearForest (now owned by Reuters), which relies on algorithms and databases to extract entities out of text. According to Reuters, Calais is extensible, and it is just a matter of time before new entities will be added.
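
The input and output shape of such a service (raw text in, typed entities with their positions out) can be illustrated with a toy Python extractor. To be clear, this is not how Calais works and not its API: we swap in a simple dictionary lookup where Calais uses natural language processing, and the `GAZETTEER` entries and sample sentence are made up for the example.

```python
import re

# Hypothetical lookup table of known names and their entity types.
GAZETTEER = {
    "Tim Berners-Lee": "Person",
    "San Jose": "City",
    "Reuters": "Company",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return known entities found in text, with character offsets, in document order."""
    found = []
    for name, kind in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text):
            found.append({"text": name, "type": kind,
                          "start": m.start(), "end": m.end()})
    return sorted(found, key=lambda e: e["start"])


sample = "Tim Berners-Lee spoke in San Jose about Reuters."
entities = extract_entities(sample)
```

Each result carries `start` and `end` offsets, so a caller can highlight or link the exact span in the original document rather than just learning that the entity appears somewhere.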

Another example is the SemanticHacker API from TextWise, which is offering a one million dollar prize for the best commercial semantic web application developed on top of it. This API classifies information in documents into categories called semantic signatures. Given a document, it outputs entities or topics that the document is about. It is kind of like Calais, but also delivers a topical hierarchy, where the actual objects are leaves.

Another semantic API is offered by Dapper - a web service which facilitates the extraction of structure from unstructured HTML pages. Dapper works by enabling users to define attributes of an object based on the bits of the page. For example, a book publisher might define where the information about author, ISBN and number of pages is on a typical book page and the Dapper application would then create a recognizer for any page on the publisher site and enable access to it via REST API.

While this seems backwards from an engineering point of view, Dapper's technology is remarkably useful in the real world. In a typical scenario, for websites that do not have clean APIs to access their information, even non-technical people can build an API in minutes with Dapper. This is a powerful way of quickly turning websites into web services.
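
The idea of "defining where the fields are" and getting a reusable recognizer back can be sketched in Python as follows. This is only our guess at the general shape, not Dapper's technology: here a rule simply maps a field name to an element `id`, and `make_recognizer`, the ids, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Pulls the text of elements whose id appears in the rules."""

    def __init__(self, rules):
        super().__init__()
        # rules maps field name -> element id; invert it for lookup by id.
        self._by_id = {element_id: field for field, element_id in rules.items()}
        self.record = {}
        self._field = None

    def handle_starttag(self, tag, attrs):
        self._field = self._by_id.get(dict(attrs).get("id"))

    def handle_data(self, data):
        if self._field:
            self.record[self._field] = self.record.get(self._field, "") + data

    def handle_endtag(self, tag):
        self._field = None


def make_recognizer(rules):
    """Turn a set of field rules into a function usable on any similar page."""
    def recognize(html):
        extractor = FieldExtractor(rules)
        extractor.feed(html)
        return {field: text.strip() for field, text in extractor.record.items()}
    return recognize


# Rules a publisher might define once for its book pages (ids are made up).
book_rules = {"author": "book-author", "isbn": "book-isbn", "pages": "page-count"}
recognize_book = make_recognizer(book_rules)

page = (
    '<h1 id="title">Example Book</h1>'
    '<p>by <span id="book-author">Jane Doe</span></p>'
    '<span id="book-isbn">0000000000</span>'
    '<span id="page-count"> 320 </span>'
)
record = recognize_book(page)
```

Exposing `recognize_book` behind a URL is what would turn the site into a web service; that part is omitted here.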

5. Search Technologies

Perhaps the first significant blow to the Semantic Web has been the inability thus far to improve search. The premise that a semantic understanding of pages leads to vastly better search has yet to be validated. The two main contenders, Hakia and PowerSet, have made some progress, but not enough. The problem is that Google's algorithm, which is based on statistical analysis, deals just fine with semantic entities like people, cities, and companies. When asked "What is the capital of France?", Google returns a good enough answer.

There is a growing realization that marginal improvement in search might not be enough to beat Google or to declare search the killer app for the Semantic Web. Likely, understanding semantics is helpful but not sufficient to build a better search engine. A combination of semantics, innovative presentation, and memory of who the user is, will be necessary to power the next generation search experience.

Alternative approaches also attempt to overlay semantics on top of the search results. Even Google ventures into verticals by partitioning the results into different categories. The consumer can then decide which type of answer they are interested in.

Yet search is a game that is far from won and a lot of semantic companies are really trying to raise the bar. There may be another twist to the whole search play - contextual technologies, as well as semantic databases, could lead to qualitatively better results. And so we turn to these next.

6. Contextual Technologies

We are seeing an increasing number of contextual tools entering the consumer market. Contextual navigation does not just improve search, but rather shortcuts it. Applications like Snap, Yahoo! Shortcuts, and SmartLinks "understand" the objects inside text and links and bring relevant information right into the user's context. The result is that the user does not need to search at all.

Thinking about this more deeply, one realizes that contextual tools leverage semantics in a much more interesting way. Instead of trying to parse what a user types into the search box, contextual technologies rely on analyzing the content. So the meaning is derived in a much more precise way - or rather, there is less guessing. The contextual tools then offer the users relevant choices, each of which leads to a correct result. This is fundamentally different from trying to pull the right results from a myriad of possible choices resulting from a web search.

We are also seeing an increasing number of contextual technologies make their way into the browser. Top-down semantic technologies need to work without publishers doing anything; and so to infer context, contextual technologies integrate into the browser. Firefox's recommended extensions page features a number of contextual browsing solutions - Interclue, ThumbStrips, Cooliris, and BlueOrganizer (from my own company).

The common theme among these tools is the recognition of information and the creation of specific micro contexts for the users to interact with that information.

7. Semantic Databases

Semantic databases are another breed of semantic applications focused on annotating web information to be more structured. Twine, a product of Radar Networks and currently in private beta, focuses on building a personal knowledge base. Twine works by absorbing unstructured content in various forms and building a personal database of people, companies, things, locations, etc. The content is sent to Twine via a bookmarklet, via email, or manually. The technology needs to evolve more, but one can see how such databases can be useful once the kinks are worked out. One of the very powerful applications that could be built on top of Twine, for example, is personalized search - a way to filter the results of any search engine based on a particular individual.

It is worth noting that Radar Networks has spent a lot of time getting the infrastructure right. The underlying representation is RDF and is ready to be consumed by other semantic web services. But a big chunk of the core algorithms, the ones that are dealing with entity extraction, are being commoditized by Semantic Web APIs. Reuters offers this as an API call, for example, and so moving forward, Twine won't need to be concerned with how to do that.

Another big player in the semantic databases space is a company called Metaweb, which created Freebase. In its present form, Freebase is just a fancier and more structured version of Wikipedia - with RDF inside and less information in total. The overall goal of Freebase, however, is to build a Wikipedia equivalent of the world's information. Such a database would be enormously powerful because it could be queried exactly - much like relational databases. So once again the promise is to build much better search.
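
What "queried exactly" means is easiest to see next to keyword search. In the Python sketch below, structured records are filtered by exact field values, so the answer is a specific record rather than a ranked list of pages. The record layout and the `query` helper are our own illustration and have nothing to do with Freebase's actual data model or query language.

```python
# A tiny, made-up structured knowledge base.
RECORDS = [
    {"name": "Paris", "type": "city", "country": "France", "capital": True},
    {"name": "Lyon", "type": "city", "country": "France", "capital": False},
    {"name": "Berlin", "type": "city", "country": "Germany", "capital": True},
]


def query(records, **conditions):
    """Return the records whose fields equal every given condition."""
    return [r for r in records
            if all(r.get(field) == value for field, value in conditions.items())]


capitals = [r["name"] for r in query(RECORDS, type="city", country="France", capital=True)]
```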

But the problem is, how can Freebase keep up with the world? Google indexes the Internet daily and grows together with the web. Freebase currently allows editing of information by individuals and has bootstrapped by taking in parts of Wikipedia and other databases, but in order to scale this approach, it needs to perfect the art of continuously taking in unstructured information from the world, parsing it, and updating its database.

The problem of keeping up with the world is common to all database approaches, which are effectively silos. In the case of Twine, there needs to be continuous influx of user data, and in the case of Freebase there needs to be influx of data from the web. These problems are far from trivial and need to be solved successfully in order for the databases to be useful.

Conclusion

With any new technology it is important to define and classify things. The Semantic Web is offering an exciting promise: improved information discoverability, automation of complex searches, and innovative web browsing. Yet the Semantic Web means different things to different people. Indeed, its definitions in the enterprise and consumer spaces are different, and there are different means to a common end - top-down vs. bottom-up and microformats vs. RDF. In addition to these patterns, we are observing the rise of semantic APIs and contextual browsing tools. All of these are in their early days but hold a big promise to fundamentally change the way we interact with information on the web.

What do you think about Semantic Web Patterns? What trends are you seeing and which applications are you waiting for? And if you work with semantic technologies in the enterprise, please share your experiences with us in the comments below.

http://www.readwriteweb.com/archives/semantic_web_patterns_a_guide_redux.php Trends Fri, 26 Dec 2008 09:00:00 -0800 Alex Iskold
Giftag: Social Wishlists Using Open Standards

Giftag may not be a revolutionary product, but it is kind of nifty. The product was created by Best Buy (BBY), a retailer that didn't have an online registry service. Instead of creating one, though, they decided to create Giftag: a browser plugin that lets you make online wishlists and share them with your friends. The technology will be integrated into Best Buy's web site in the coming months.

Why Giftag?

You may be wondering why you should use a Firefox plugin to create a wish list instead of simply using Amazon's new universal wish list service. The reason is openness. Where Amazon's tool comes from a somewhat closed platform, Giftag is using an open data format: hProduct.

hProduct is an emerging data standard that is suitable for embedding in (X)HTML, Atom, RSS, and arbitrary XML. The format will be related to several other microformats like hListing and hReview. Since we like to support open standards here at RWW, we like what Giftag has done.

The Giftag Homepage

How It Works

Using Giftag is simple, especially if the site you are on already supports hProduct. You just click the button in your Firefox toolbar and, at the bottom of the screen, a tray will appear where all the information about the product (name, description, price, etc.) displays. All you need to do is select which of your lists to put the item on. If the retailer's site doesn't support hProduct, you're still able to add items to the list by drawing a box around the item, but you'll have to fill in the information about the product yourself.
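
The two paths described above (fields prefilled from hProduct markup, or typed in by hand when the site has none) can be sketched in Python like this. It is our own illustration of the flow, not Giftag's code: `build_item`, `add_to_list`, the field names, and the sample product are hypothetical, and parsing the page markup itself is left out.

```python
def build_item(page_fields, manual_fields=None, required=("name", "price")):
    """Combine fields read from the page with fields the user typed in."""
    item = {k: v for k, v in (manual_fields or {}).items() if v}
    # Values found in the page's markup take precedence over manual entry.
    item.update({k: v for k, v in page_fields.items() if v})
    missing = [field for field in required if field not in item]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return item


wishlists = {}


def add_to_list(list_name, item):
    """Append an item to the named list, creating the list if needed."""
    wishlists.setdefault(list_name, []).append(item)


# Site with hProduct markup: everything arrives prefilled.
add_to_list("Birthday", build_item({"name": "Example Camera", "price": "199.99"}))

# Site without it: the user draws a box and fills in the details.
add_to_list("Birthday", build_item({}, {"name": "Example Tripod", "price": "39.00"}))
```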

All the lists you create can be selectively shared with others. You could choose some lists to be shared with family and friends and others could be shared publicly. You can also share your items with your friends on Facebook via the Giftag application. Developers can access Giftag APIs to build applications of their own.

The value of the hProduct standard is clear when you use an application like Giftag. Given Best Buy's involvement, we hope that this will push more retailers into adopting the standard on their own sites. A web of retail sites that support the same standard could open the door to even more applications that take advantage of the standard - and that's something we would like to see.

http://www.readwriteweb.com/archives/giftag_social_wishlists_using_open_standards.php Products Tue, 09 Sep 2008 18:40:35 -0800 Sarah Perez
Making the Web Searchable: The Story of SearchMonkey

Last week at the SemTech 2008 Conference that took place in San Jose, Yahoo! Researcher Peter Mika spoke in detail about the company's new SearchMonkey search platform initiative. Mika talked broadly about his work looking at metadata on the web, and how that led to the birth of SearchMonkey. This post is based on notes from that talk.

History of Web Page Annotations

The motivating question for Mika's presentation was: How can we make web search better by leveraging web annotation? There are many kinds of annotations, but Mika focused on simple data and lightweight semantics, and began by reviewing the history and evolution of annotations to explain how we got to where we are today.

One of the first methods of annotating HTML was Simple HTML Ontology Extensions (SHOE). This method allowed for the declaration of ontologies as well as relationships between the entities on HTML pages. The problem with it was that it introduced new tags that were not part of standard HTML and were not recognized by most browsers.

In 2003 Tantek Celik started work on Microformats - a way to embed light semantics using XHTML. Microformats are now driven by a community of developers, which evangelizes existing formats and is working on new ones. The major focus of this effort is to leverage standards, but Microformats are limited because they don't share common syntax. Every microformat looks different and there are no ontologies, and no schemas.

Things get particularly complicated when you start combining different Microformats, for example, when you describe that a person wrote a review at a particular event. In addition to this, Microformats have no concept of unique identity, and for this reason are largely incompatible with other Semantic Web efforts. Yet, Microformats took off and have become somewhat widespread. So, the takeaway here is that simple things can quickly gain adoption.

Another way of providing metadata that emerged recently is tagging. As an example, Flickr uses tags for photos to enable its users to annotate and describe the content. The problem with tags is that there is no agreement on meaning, so the same tag on Flickr and del.icio.us can mean different things, and there's no way to be sure which tag means what. Tags are a much more personal way of annotating information; they are not objective.

In 2005, Ian Davis, CTO of Semantic Web infrastructure company Talis, proposed eRDF - a form of RDF that can be embedded into HTML (compatible with HTML4). There is a simple mapping from eRDF to RDF so you can use any RDF/OWL vocabulary. But eRDF is not full RDF - it has limitations. For example, there are no data types and there are no blank nodes. Also, each page can only "talk" about itself and not about other pages.

Finally, the W3C published RDFa, the latest embedding of RDF in XHTML, which has full RDF support. RDFa adds complexity in terms of implementation, but at the moment it gives the best way to embed RDF into HTML.
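
To give a feel for what RDFa looks like to a program, here is a deliberately simplified Python sketch that reads `about`, `property`, and `content` attributes (all real RDFa attributes) and emits subject-property-value triples. It is far from a conforming RDFa processor: it ignores `rel`, `resource`, `typeof`, datatypes, and prefix resolution, and the `RDFaSketch` class and sample markup are our own.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting.
VOID = {"meta", "link", "br", "img", "hr", "input"}


class RDFaSketch(HTMLParser):
    """Extracts (subject, property, value) triples from a small RDFa subset."""

    def __init__(self, base):
        super().__init__()
        self.triples = []
        self._subjects = [base]   # stack: subject in scope for each open element
        self._pending = None      # (subject, property) waiting for text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # about="..." changes the subject; otherwise inherit from the parent.
        subject = attrs.get("about", self._subjects[-1])
        if tag not in VOID:
            self._subjects.append(subject)
        prop = attrs.get("property")
        if prop:
            if "content" in attrs:
                self.triples.append((subject, prop, attrs["content"]))
            else:
                self._pending = (subject, prop)

    def handle_data(self, data):
        if self._pending and data.strip():
            self.triples.append((*self._pending, data.strip()))
            self._pending = None

    def handle_endtag(self, tag):
        self._pending = None
        if tag not in VOID and len(self._subjects) > 1:
            self._subjects.pop()


# Made-up page: one block describes a book, the last line describes the page itself.
markup = """
<div about="http://example.org/book/1">
  <span property="dc:title">Example Book</span>
  <meta property="dc:creator" content="Jane Doe">
</div>
<span property="dc:title">A page about books</span>
"""
parser = RDFaSketch(base="http://example.org/page")
parser.feed(markup)
triples = parser.triples
```

Unlike eRDF, the `about` attribute lets a page make statements about something other than itself, which is what the first two triples show.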

How Much Metadata is Out There?

Given the increasing trend towards web annotations, the natural question is: just how much metadata is already out there? Peter Mika set out to answer this question and created a prototype, called Microsearch. The idea was to look at web pages and to see how much metadata was there. Beyond that, Mika was also interested in what type of metadata, as well as the ratio between annotated and plain HTML pages.

With the Microsearch exercise, Mika wanted to demonstrate what could be done to enhance search with this information. For each type of metadata, Mika augmented search results with additional links and information. For example, maps, events, information from hCard, etc. are presented in an enhanced way, unlike what we're used to seeing with today's search engines.

Mika discovered a few interesting things. First, about 53% of queries have 1 page with metadata in the top 10 results. However, lots of the data Mika saw was not clean and contained information that was not well formed, and performance was pretty poor due to lack of an index. So the unfortunate conclusion that Mika came to was that RDF templating was difficult and the approach was not easily scalable. Finally, Mika realized that metadata really needs to be on the page for users to see, because otherwise there is a big opportunity for semantic spam.

The Birth of SearchMonkey

The point of any experiment is to draw the right conclusions. Looking at the facts, Mika and the Yahoo! search team realized that they could not count on enhancing search by leveraging metadata on today's web - it simply does not exist to the extent needed. At the same time, it was clear that enhancing search results and cross linking them to other pieces of information on the web is compelling and potentially disruptive. Yahoo! realized that in order to make this work, it needed to incentivize and enable publishers to control search result presentation. And thus, SearchMonkey was born.

SearchMonkey is a system that motivates publishers to use semantic annotations, and is based on existing semantic standards and industry standard vocabularies. It provides tools for developers to create compelling applications that enhance search results. The main focus of these applications is on the end user experience - enhanced results contain what Yahoo! calls an "infobar" - a set of overlays to present additional information. For example, with SearchMonkey, LinkedIn is able to surface additional information from the user profile, Netflix can present a blurb about the plot and a rating for a movie, and Barnes & Noble can embed a preview of a book.

SearchMonkey's aim is to make information presentation more intelligent when it comes to search results by enabling the people who know each result best - the publishers - to define what should be presented and how.

A Better Search Experience Ahead

This first version of SearchMonkey is just the first small step towards creating a better search experience. Much more is planned, but even with this first simple version, we can clearly see the power of semantics and annotations in web pages. By creating the right incentive for publishers and putting them in control, Yahoo! is aiming to raise the bar on search results, and, who knows, maybe even start attracting converts from Google's plain-looking results.

http://www.readwriteweb.com/archives/semtech_making_the_web_searchable_searchmonkey.php http://www.readwriteweb.com/archives/semtech_making_the_web_searchable_searchmonkey.php Semantic Web Tue, 27 May 2008 20:29:34 -0800 Alex Iskold
Wine, Film and Books: Adaptive Blue Offers Open Format to Make the Web Smarter Semantic web company Adaptive Blue has published what it hopes will become a standard for publishers who want to signal in their header tags when a webpage is primarily about a particular book, film, wine or other type of object. From search to trend analysis to a richer browsing experience - the developments that could come from adoption of such a standard are many.

Called AB Meta, the format was developed in concert with a number of other web companies and is aimed to be part of a larger effort to pick up where existing Semantic Web and microformats markup leaves off. It's simple and extensible.

When the meaning of web pages becomes machine readable, magical things can happen.

Bloggers who want to mark up particular pages or post pages with AB Meta can do so using Dougal Campbell's HeadMeta WordPress plugin. Some post-level meta data editing is possible with Typepad but Blogger users are out of luck. Hopefully someone will build a UI for self-publishers.

For commercial publishers and retail sites, the AB Meta standard should be much easier to implement across their sites. In addition to the new spec drawn up to describe objects, AB Meta also leverages existing Dublin Core markup when available.

Picture 107.png

Above is a sample of some simple AB Meta, below is an extended version.

Picture 108.png

AB Meta is based largely on Adaptive Blue's work developing its BlueOrganizer smart browser plug-in and SmartLinks contextual reference tool. Now that the company has come up with a robust, simple and extensible format for designating the primary object of a web page and describing its various characteristics, the next logical step is to open that format up and do some biz dev building adoption in web pages themselves. Though anyone will be able to index AB Meta, Adaptive Blue's products will presumably be the most advanced at first in what they can do with the markup of the company's own creation.

We're big fans of the semantic web here at RWW and (disclosure) Adaptive Blue CEO Alex Iskold writes some of the smartest posts about it that you'll find here or anywhere.

http://www.readwriteweb.com/archives/wine_film_and_books_adaptive_b.php http://www.readwriteweb.com/archives/wine_film_and_books_adaptive_b.php Products Mon, 21 Apr 2008 17:36:00 -0800 Marshall Kirkpatrick
Yahoo! Pushes 26.5 Million Microformats Into the Wild It was just a couple of weeks ago that Yahoo! announced that it would begin indexing semantic markup language such as microformats in its search engine. That's a huge win for the bottom-up approach to building the Semantic Web, and provides an incentive for publishers to start adopting semantic markup like RDF and microformats. As a publisher, Yahoo! is also eating its own dogfood, so to speak, and putting microformats to use on its own sites.

Yesterday, Yahoo! announced that it had begun using microformats on its European shopping search engine Kelkoo. Specifically, Yahoo! Europe pushed out the biggest deployment yet of the draft hListing format, which is a new format used for marking up classifieds listings.

The actual number of hListings Yahoo! put out there was 26,456,448, as well as an additional 6,500 hCard listings describing merchants. "This bumper injection of structured data into Kelkoo’s pages makes it ripe for re-use, be that browser extensions to draw out product information on our pages, indexing services aggregating product listings together or mashing up the data for reuse in widgets," said developer Ben Ward of Yahoo! Europe.

Ward also indicated that Yahoo! hoped that other sites would adopt the hListing microformat. "After years of waiting for technology to move the web forward, it’s happening. There’s information out there now to pull off functionality we never had before. As web developers, there’s little to do but slip in microformatted mark-up wherever we can, and start having fun in consuming it," he said.

http://www.readwriteweb.com/archives/yahoo_kelkoo_microformats.php http://www.readwriteweb.com/archives/yahoo_kelkoo_microformats.php Products Fri, 28 Mar 2008 10:59:40 -0800 Josh Catone
Semantic Web Patterns: A Guide to Semantic Technologies In this article, we'll analyze the trends and technologies that power the Semantic Web. We'll identify patterns that are beginning to emerge, classify the different trends, and peek into what the future holds.

In a recent interview Tim Berners-Lee pointed out that the infrastructure to power the Semantic Web is already here. ReadWriteWeb's founder, Richard MacManus, even picked it to be the number one trend in 2008. And rightly so. Not only are the bits of infrastructure now in place, but we are also seeing startups and larger corporations working hard to deliver end user value on top of this sophisticated set of technologies.

The Semantic Web means many things to different people, because there are a lot of pieces to it. To some, the Semantic Web is the web of data, where information is represented in RDF and OWL. Some people replace RDF with Microformats. Others think that the Semantic Web is about web services, while for many it is about artificial intelligence - computer programs solving complex optimization problems that are out of our reach. And business people always redefine the problem in terms of end user value, saying that whatever it is, it needs to have simple and tangible applications for consumers and enterprises.

The disagreement is not accidental, because the technology and concepts are broad. Much is possible and much is to be imagined.

1. Bottom-Up and Top-Down

We have written a lot about the different approaches to the Semantic Web - the classic bottom-up approach and the new top-down one. The bottom-up approach is focused on annotating information in pages, using RDF, so that it is machine readable. The top-down approach is focused on leveraging information in existing web pages, as-is, to derive meaning automatically. Both approaches are making good progress.

A big win for the bottom-up approach was the recent announcement from Yahoo! that its search engine is going to support RDF and microformats. This is a win-win-win for publishers, for Yahoo!, and for customers - publishers now have an incentive to annotate information because Yahoo! Search will be taking advantage of it, and users will then see better, more precise results.

Another recent win for the bottom-up approach was the announcement of the Semantify web service from Dapper (previous coverage). This offering will enable publishers to add semantic annotations to existing web pages. The more tools like Semantify that pop up, the easier it will be for publishers to annotate pages. Automatic annotation tools combined with the incentive to annotate the pages are going to make the bottom-up approach more compelling.

But even if the tools and incentive exist, making the bottom-up approach widespread is difficult. Today, the magic of Google is that it can understand information as is, without asking people to fully comply with W3C standards or SEO techniques. Similarly, top-down semantic tools are focused on dealing with imperfections in existing information. Among them are the natural language processing tools that do entity extraction - such as the Calais and TextWise APIs that recognize people, companies, places, etc. in documents; vertical search engines, like ZoomInfo and Spock, which mine the web for people; technologies like Dapper and BlueOrganizer, which recognize objects in web pages; and Yahoo! Shortcuts, Snap and SmartLinks, which recognize objects in text and links.

[Disclosure: Alex Iskold is founder and CEO of AdaptiveBlue, which makes BlueOrganizer and SmartLinks.]

Top-down technologies are racing forward despite imperfect information. And, of course, they benefit from the bottom-up annotations as well. The more annotations there are, the more precise top-down technologies will get - because they will be able to take advantage of structured information as well.

2. Annotation Technologies: RDF, Microformats, and Meta Headers

Within the bottom-up approach to annotation of data, there are several choices for annotation. They are not equally powerful, and in fact each approach is a tradeoff between simplicity and completeness. The most comprehensive approach is RDF - a powerful, graph-based language for declaring things, and attributes and relationships between things. In a simplistic way, one can think of RDF as the language that allows expressing truths like: Alex IS human (type expression), Alex HAS a brain (attribute expression), and Alex IS the father of Alice, Lilly, and Sofia (relationship expression). RDF is powerful, but because it is highly recursive, precise, and mathematically sound, it is also complex.
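To make the three kinds of statement concrete, here is a minimal sketch in Python that stores them as (subject, predicate, object) triples and answers a simple pattern query. It is not real RDF: the predicate names (`is_a`, `has`, `father_of`) and the `match` helper are invented for illustration, and real RDF identifies things with URIs rather than bare strings.

```python
# A toy triple store: each statement is a (subject, predicate, object) tuple.
# Predicate names are illustrative and do not come from any real vocabulary.
triples = {
    ("Alex", "is_a", "Human"),        # type expression
    ("Alex", "has", "Brain"),         # attribute expression
    ("Alex", "father_of", "Alice"),   # relationship expressions
    ("Alex", "father_of", "Lilly"),
    ("Alex", "father_of", "Sofia"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return, in sorted order, the triples that agree with every given field."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Who are Alex's children?
children = [o for _, _, o in match(triples, "Alex", "father_of")]
print(children)
```

Because every fact has the same three-part shape, the same `match` call can answer questions about types, attributes, or relationships, which is a small-scale view of why the graph model is both general and more demanding than simpler formats.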

At present, most use of RDF is for interoperability. For example, the medical community uses RDF to describe genomic databases. Because the information is normalized, the databases that were previously silos can now be queried together and correlated. In general, in addition to semantic soundness, the major benefit of RDF is interoperability and standardization, particularly for enterprises, as we will discuss below.

Microformats offer a simpler approach by adding semantics to existing HTML documents using agreed-upon class names on ordinary HTML elements. The metadata is compact and is embedded inside the actual HTML. Popular microformats are hCard, which describes personal and company contact information, hReview, which adds meta information to review pages, and hCalendar, which is used to describe events.
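As a rough illustration of how little markup is involved, the sketch below embeds a small hCard in a string and reads it back with Python's standard `html.parser`. The contact details are made up, only three hCard properties (`fn`, `org`, `tel`) are handled, and the parser assumes each property's text sits directly inside its element, so it is a simplification rather than a conforming hCard parser.

```python
from html.parser import HTMLParser

# Made-up contact details marked up with hCard class names.
HCARD_HTML = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Corp</span>,
  <span class="tel">+1-555-0100</span>
</div>
"""


class HCardParser(HTMLParser):
    """Collects the text of a few hCard properties (simplified)."""

    PROPERTIES = {"fn", "org", "tel"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text we expect next

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


parser = HCardParser()
parser.feed(HCARD_HTML)
print(parser.card)
```

The page still renders as ordinary HTML for human readers; the class names are the only thing a machine needs to recover the structured contact record.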

Microformats are gaining popularity because of their simplicity, but they are still quite limiting. There is no way to describe type hierarchies, which the classic semantic community would say is critical. The other issue is that microformats are somewhat cryptic, because the focus is to keep the annotations to a minimum. This, in turn, brings up another question of whether embedding metadata into the view (HTML) is a good idea. The question is: what happens if the underlying data changes when someone makes a copy of the HTML document? Nevertheless, despite these issues, microformats are gaining popularity because they are simple. Microformats are currently used by Flickr, Eventful, and LinkedIn; and many other companies are looking to adopt microformats, particularly because of the recent Yahoo! announcement.

An even simpler approach is to put metadata into the meta headers. This approach has been around for a while and it is a shame that it has not been widely adopted. As an example, the New York Times recently launched extended annotations for its news pages. The benefit of this approach is that it works great for pages that are focused on a topic or a thing. For example, a news page can be described with a set of keywords, geo location, date, time, people, and categories. Another example would be for book pages. O'Reilly.com has been putting book information into the meta headers, describing the author, ISBN, and category of the book.
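A minimal sketch of reading such headers, using only Python's standard library, might look like the following. The page and its values are invented, and the meta names `isbn` and `category` are assumptions chosen for illustration; they are not the names any particular publisher is known to use.

```python
from html.parser import HTMLParser

# An invented book page; the "isbn" and "category" meta names are hypothetical.
PAGE = """
<html><head>
  <title>Example Book Page</title>
  <meta name="author" content="Jane Example">
  <meta name="isbn" content="0000000000">
  <meta name="category" content="Programming">
</head><body>...</body></html>
"""


class MetaHeaderParser(HTMLParser):
    """Collects name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if "name" in attrs and "content" in attrs:
            self.meta[attrs["name"].lower()] = attrs["content"]


meta_parser = MetaHeaderParser()
meta_parser.feed(PAGE)
print(meta_parser.meta)
```

Since the whole page is described by a flat set of key/value pairs, this works best in exactly the case the article names: a page that is about one topic or one thing.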

Despite the fact that all these approaches are different, they are also somewhat complementary, and each of them is helpful. The more annotations there are in web pages, the more standards are implemented, and the more discoverable and powerful the information becomes.

3. Consumer and Enterprise

Yet another dimension of the conversation about the Semantic Web is the focus on consumer and enterprise applications. In the consumer arena we have been looking for a Killer App - something that delivers tangible and simple consumer value. People simply do not care that a product is built on the Semantic Web; all they are looking for is utility and usefulness.

Up until recently, the challenge has been that the Semantic Web is focused on rather academic issues - like annotating information to make it machine readable. The promise was that once the information is annotated and the web becomes one big giant RDF database, then exciting consumer applications will come. The skeptics, however, have been pointing out that first there needs to be a compelling use case.

Some consumer applications based on the Semantic Web include generic and vertical search, contextual shortcuts and previews, personal information management systems, and semantic browsing tools. All of these applications are in their early days and have a long way to go before being truly compelling for the average web user. Still, even if these applications succeed, consumers will not be interested in knowing about the underlying technology - so there is really no marketing play for the Semantic Web in the consumer space.

Enterprises are a different story for a couple of reasons. First, enterprises are much more used to techno speak. To them utilizing semantic technologies translates into being intelligent and that, in turn, is good marketing. 'Our products are better and smarter because we use the Semantic Web' sounds like a good value proposition for the enterprise.

But even above the marketing speak, RDF solves a problem of data interoperability and standards. This "Tower of Babel" situation has been in existence since the early days of software. Forget semantics; just a standard protocol, a standard way to pass around information between two programs, is hugely valuable in the enterprise.

RDF offers a way to communicate using an XML-based language, which on top of it has sound mathematical elements to enable semantics. This sounds great, and even the complexity of RDF is not going to stop enterprises from using it. However, there is another problem that might stop it - scalability. Unlike relational databases, which have been around for ages and have been optimized and tuned, XML-based databases are still not widespread. In general, the problem is in the scale and querying capabilities. Like object-oriented database technologies of the late nineties, XML-based databases hold a lot of promise, but we have yet to see them in action in a big way.

4. Semantic APIs

With the rise of Semantic Web applications, we are also seeing the rise of Semantic APIs. In general, these web services take as an input unstructured information and find entities and relationships. One way to think of these services is mini natural language processing tools, which are only concerned with a subset of the language.

The first example is the Open Calais API from Reuters, which we have covered in two earlier articles. This service accepts raw text and returns information about people, places, and companies found in the document. The output not only returns the list of found matches, but also specifies places in the document where the information is found. Behind Calais is a powerful natural language processing technology developed by ClearForest (now owned by Reuters), which relies on algorithms and databases to extract entities out of text. According to Reuters, Calais is extensible, and it is just a matter of time before new entities will be added.
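The input/output shape described here (raw text in, typed entities with their positions out) can be sketched with a toy lookup-based extractor. This is not how Calais works and does not use its API: the `GAZETTEER` table and `extract_entities` function are hypothetical, and a real service relies on far richer language processing than exact string matching.

```python
import re

# Hypothetical lookup table standing in for a real entity-recognition model.
GAZETTEER = {
    "Reuters": "Company",
    "Yahoo!": "Company",
    "Paris": "Place",
    "Tim Berners-Lee": "Person",
}


def extract_entities(text):
    """Return (name, type, start, end) for each lookup hit, in text order."""
    hits = []
    for name, kind in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            hits.append((name, kind, m.start(), m.end()))
    return sorted(hits, key=lambda hit: hit[2])


sample = "Tim Berners-Lee spoke in Paris about Reuters."
entities = extract_entities(sample)
for name, kind, start, end in entities:
    # The offsets let a caller highlight or link the match in the original text.
    print(kind, name, sample[start:end])
```

Returning offsets alongside the entity names is what lets downstream tools annotate the original document in place rather than just listing what was found.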

Another example is the SemanticHacker API from TextWise, which is offering a one million dollar prize for the best commercial semantic web application developed on top of it. This API classifies information in documents into categories called semantic signatures. Given a document, it outputs entities or topics that the document is about. It is kind of like Calais, but also delivers a topical hierarchy, where the actual objects are leaves.

Another semantic API is offered by Dapper - a web service which facilitates the extraction of structure from unstructured HTML pages. Dapper works by enabling users to define attributes of an object based on the bits of the page. For example, a book publisher might define where the information about author, ISBN and number of pages is on a typical book page, and the Dapper application would then create a recognizer for any page on the publisher site and enable access to it via a REST API.
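The general idea of defining where each attribute lives on a page template and then applying that definition to any page can be sketched as follows. This is not Dapper's implementation or API: the `BOOK_FIELDS` patterns, the `extract_record` helper, and the page markup are all hypothetical, and regular expressions stand in for whatever recognizer a real tool would build.

```python
import re

# Hypothetical field definitions for one publisher's book page template.
BOOK_FIELDS = {
    "author": re.compile(r'<span id="author">(.*?)</span>'),
    "isbn": re.compile(r'<span id="isbn">(.*?)</span>'),
    "pages": re.compile(r'<span id="pages">(\d+)</span>'),
}


def extract_record(html, fields):
    """Apply each field's pattern to the page; missing fields become None."""
    record = {}
    for name, pattern in fields.items():
        m = pattern.search(html)
        record[name] = m.group(1) if m else None
    return record


page = (
    '<h1>Example Title</h1>'
    '<span id="author">Jane Example</span>'
    '<span id="isbn">0000000000</span>'
    '<span id="pages">320</span>'
)
book = extract_record(page, BOOK_FIELDS)
print(book)
```

Once the field definitions exist, every page that follows the same template yields a structured record, which is what makes it possible to expose a site's content as if it had an API.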

While this seems backwards from an engineering point of view, Dapper's technology is remarkably useful in the real world. In a typical scenario, for web sites that do not have clean APIs to access their information, even non-technical people can build an API in minutes with Dapper. This is a powerful way of quickly turning web sites into web services.

5. Search Technologies

Perhaps the first significant blow to the Semantic Web has been the inability thus far to improve search. The premise that semantic understanding of pages leads to vastly better search has yet to be validated. The two main contenders, Hakia and PowerSet, have made some progress, but not enough. The problem is that Google's algorithm, which is based on statistical analysis, deals just fine with semantic entities like people, cities, and companies. When asked "What is the capital of France?", Google returns a good enough answer.

There is a growing realization that marginal improvement in search might not be enough to beat Google, and to declare search the killer app for the Semantic Web. Likely, understanding semantics is helpful but not sufficient to build a better search engine. A combination of semantics, innovative presentation, and memory of who the user is, will be necessary to power the next generation search experience.

Alternative approaches also attempt to overlay semantics on top of the search results. Even Google ventures into verticals by partitioning the results into different categories. The consumer can then decide which type of answer they are interested in.

Yet search is a game that is far from won and a lot of semantic companies are really trying to raise the bar. There may be another twist to the whole search play - contextual technologies, as well as semantic databases, could lead to qualitatively better results. And so we turn to these next.

6. Contextual Technologies

We are seeing an increasing number of contextual tools entering the consumer market. Contextual navigation does not just improve search, but rather shortcuts it. Applications like Snap or Yahoo! Shortcuts or SmartLinks "understand" the objects inside text and links and bring relevant information right into the user's context. The result is that the user does not need to search at all.

Thinking about this more deeply, one realizes that contextual tools leverage semantics in a much more interesting way. Instead of trying to parse what a user types into the search box, contextual technologies rely on analyzing the content. So the meaning is derived in a much more precise way - or rather, there is less guessing. The contextual tools then offer the users relevant choices, each of which leads to a correct result. This is fundamentally different from trying to pull the right results from a myriad of possible choices resulting from a web search.

We are also seeing an increasing number of contextual technologies make their way into the browser. Top-down semantic technologies need to work without publishers doing anything; and so to infer context, contextual technologies integrate into the browser. Firefox's recommended extensions page features a number of contextual browsing solutions - Interclue, ThumbStrips, Cooliris, and BlueOrganizer (from my own company).

The common theme among these tools is the recognition of information and the creation of specific micro contexts for the users to interact with that information.

7. Semantic Databases

Semantic databases are another breed of semantic applications focused on annotating web information to be more structured. Twine, a product of Radar Networks and currently in private beta, focuses on building a personal knowledge base. Twine works by absorbing unstructured content in various forms and building a personal database of people, companies, things, locations, etc. The content is sent to Twine via bookmarklet or via email or manually. The technology needs to evolve more, but one can see how such databases can be useful once the kinks are worked out. One of the very powerful applications that could be built on top of Twine, for example, is personalized search - a way to filter the results of any search engine based on a particular individual.

It is worth noting that Radar Networks has spent a lot of time getting the infrastructure right. The underlying representation is RDF and is ready to be consumed by other semantic web services. But a big chunk of the core algorithms, the ones that are dealing with entity extraction, are being commoditized by Semantic Web APIs. Reuters offers this as an API call, for example, and so moving forward, Twine won't need to be concerned with how to do that.

Another big player in the semantic databases space is a company called Metaweb, which created Freebase. In its present form, Freebase is just a fancier and more structured version of Wikipedia - with RDF inside and less information in total. The overall goal of Freebase, however, is to build a Wikipedia equivalent of the world's information. Such a database would be enormously powerful because it could be queried exactly - much like relational databases. So once again the promise is to build much better search.

But the problem is, how can Freebase keep up with the world? Google indexes the Internet daily and grows together with the web. Freebase currently allows editing of information by individuals and has bootstrapped by taking in parts of Wikipedia and other databases, but in order to scale this approach, it needs to perfect the art of continuously taking in unstructured information from the world, parsing it, and updating its database.

The problem of keeping up with the world is common to all database approaches, which are effectively silos. In the case of Twine, there needs to be continuous influx of user data, and in the case of Freebase there needs to be influx of data from the web. These problems are far from trivial and need to be solved successfully in order for the databases to be useful.

Conclusion

With any new technology it is important to define and classify things. The Semantic Web is offering an exciting promise: improved information discoverability, automation of complex searches, and innovative web browsing. Yet the Semantic Web means different things to different people. Indeed, its definition in the enterprise and consumer spaces is different, and there are different means to a common end - top-down vs. bottom-up and microformats vs. RDF. In addition to these patterns, we are observing the rise of semantic APIs and contextual browsing tools. All of these are in their early days, but hold a big promise to fundamentally change the way we interact with information on the web.

What do you think about Semantic Web Patterns? What trends are you seeing and which applications are you waiting for? And if you work with semantic technologies in the enterprise, please share your experiences with us in the comments below.

http://www.readwriteweb.com/archives/semantic_web_patterns.php http://www.readwriteweb.com/archives/semantic_web_patterns.php Trends Tue, 25 Mar 2008 15:20:45 -0800 Alex Iskold
And Nerds Became Kings: Yahoo! to Announce Semantic Web Support TechCrunch and Search Engine Land are reporting this morning that Yahoo! will now be indexing Semantic Web and Microformats markup from around the web and will use that information to display more structured search results. Here is the Yahoo! post about the news.

We asked last month how vulnerable Google is in search, and the leveraging of standards-based structured data may be the most obvious approach to improving on the search industry's current best practices. As Tim Berners-Lee said just weeks ago, the time for the semantic web is now.

What Does This Mean?

Here's one example of what that could mean: Today, a web service might work very hard to scour the internet to discover all the book reviews written on various sites, by friends of mine, who live in Europe. That would be so hard that probably no one would try it. The suite of technologies Yahoo! is moving to support will make such searches trivial. Once publishers start including things like hReview, FOAF and geoRSS in their content, then Yahoo!, and other sites leveraging Yahoo! search results, will be able to ask easily what it is we want to do with those book reviews. Say hello to a new level of innovation.

This has been really geeky stuff for a long time, with little market traction and a whole lot of promises from academic research and outlying innovators. That will now change.

The basic idea behind Semantic Web technology is that by signaling what kind of content you are publishing on an item-by-item or field-by-field basis, publishers can help make the meaning of their text readable by machines. If machines are able to determine the meaning of the content on a page, then our human brains don't have to waste time determining, for example, which search results go beyond containing our keywords and actually mean what we are looking for.

Publishers will now be able to clearly designate content on a page as related to other particular content, as business card type information, as a calendar event, a review or as many other types of content. It will make Yahoo! a lot smarter and should shake up the world of Search Engine Optimization and web publishing, a lot.

Who Does the Markup?

Many observers of the Semantic Web, including us at times, have argued that it's unrealistic to expect web publishers to mark up their own content and that a more realistic path to market for technologies based on semantics is to build applications that can parse the semantics out of other people's content from outside.

In my interview with Mark Zuckerberg last week, for example, the Facebook CEO expressed disinterest in participating in the Semantic Web. I didn't publish it in the interview, but he indicated such a move would be up to a third party site organizing information via the Facebook Platform if it was going to happen at all. He will probably change his tune now, as adding hCard support to Facebook public profiles will now be a no-brainer. Other publishers will be faced with similar questions.

Semantic web markup will quickly become standard practice though for all CMS/publishing systems and we'll wonder what we ever did without it or why it seemed so hard.

Google Will Soon Follow

This move by Yahoo! will likely be followed up by Google; it's just too much opportunity for any search engine to pass up. Semantic markup is like a content-level site map, something all the search engines have agreed on a standard for already. Semantic web technology is next. There will be big job opportunities, more than there are for SEO in the short term, for people who can help publishers implement Semantic Web markup retroactively and into the future.

The Semantic Web was one of a handful of topics that we identified as key themes for the coming year in our RWW Toolkit for 2008. Check that toolkit out for resources you can use to follow this important topic as it unfolds.

http://www.readwriteweb.com/archives/yahoo_supports_semantic_web.php http://www.readwriteweb.com/archives/yahoo_supports_semantic_web.php Semantic Web Thu, 13 Mar 2008 09:35:11 -0800 Marshall Kirkpatrick
4 Technologies for Portability in Social Networks: A Primer Today Marshall Kirkpatrick interviewed Facebook CEO Mark Zuckerberg at SXSW, with the main topic of discussion being Data Portability. Later in the day at the festival, a star-studded panel discussed building portable social networks. The panel highlighted four technologies that help make identity and data more portable across social networks: hCard; XFN and FOAF; OpenID; OAuth.

This post serves as an introduction to each of these technologies.

hCard: Providing Your Contact Information

Users are tired of entering the same profile information over and over again. The hCard microformat solves this problem. Leslie Chicoine, an Experience Designer at Get Satisfaction, talked about how her company had created a sign-up process for its web application using hCard (see screenshot below).

HCardGetSatisfaction

XFN & FOAF: Who are your contacts

Another microformat, XFN, and the FOAF project are techniques for embedding relationships in links. This allows social networks to recommend contacts that should be shared, without scraping web-based email clients. Recently, Google introduced a Social Graph API, which "index[es] the public Web for XHTML Friends Network (XFN), Friend of a Friend (FOAF) markup and other publicly declared connections".
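In XFN, the relationship is carried in the `rel` attribute of an ordinary link. The sketch below, using Python's standard `html.parser`, pulls those declared relationships out of a made-up blogroll. The URLs and names are invented, and `XFN_VALUES` lists only a subset of the relationship values XFN defines.

```python
from html.parser import HTMLParser

# A subset of XFN relationship values; the XFN profile defines more.
XFN_VALUES = {"friend", "acquaintance", "contact", "met",
              "colleague", "co-worker", "me"}

# A made-up blogroll; the third link carries a non-XFN rel value.
BLOGROLL = """
<a href="http://alice.example/" rel="friend met">Alice</a>
<a href="http://bob.example/" rel="colleague">Bob</a>
<a href="http://news.example/" rel="nofollow">News</a>
"""


class XFNParser(HTMLParser):
    """Maps each linked URL to the XFN relationships declared on the link."""

    def __init__(self):
        super().__init__()
        self.relationships = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rels = set((attrs.get("rel") or "").split()) & XFN_VALUES
        if rels and attrs.get("href"):
            self.relationships[attrs["href"]] = sorted(rels)


xfn = XFNParser()
xfn.feed(BLOGROLL)
print(xfn.relationships)
```

A crawler that applies this kind of extraction across many public pages ends up with a graph of declared connections, which is the sort of data a service like the Social Graph API is described as indexing.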

Something very interesting that I wasn't aware of until today's panel was that both Plaxo & Six Apart were working on something similar before Google announced OpenSocial, according to Joe Smarr and David Recordon. However, once Google started focusing on this they were happy to hand it over to them - because Google "has the web on a hard drive", so it makes the crawling component of this far less difficult. For a good overview on Google's Social Graph API, check out the following introductory video:

OpenID: Authenticating Individuals

OpenID is a decentralized framework for allowing social networks (and other web applications) to authenticate users. In other words, it lets users log in using shared credentials across different services. It also allows individuals to decide what information they want to share with each application. For example, a user might decide not to provide their postal or email address.

OAuth: Authorizing Access

The final protocol discussed was OAuth. It is a protocol that is less about authentication (OpenID) and more about authorization. The protocol has been developed over the last year. The specification was released in December 2007 and was modeled on a number of authorization protocols, including the Flickr Authorization protocol. According to Chris Messina, a number of services have already started using it, including:

  • Fire Eagle
  • OpenSocial
  • Pownce
  • Get Satisfaction
  • Magnolia
  • Twitter (support coming soon)
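
To show what "authorization rather than authentication" looks like on the wire, here is a simplified sketch of request signing in the style of the OAuth Core 1.0 HMAC-SHA1 method: the application never sees the user's password, it signs each request with secrets it was issued. The function names and the example URL are hypothetical, and the sketch leaves out parts of the specification (duplicate parameter names, port and case normalization of the URL, the Authorization header), so treat it as an outline and not a conforming implementation.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode everything except unreserved characters."""
    return quote(value, safe="-._~")


def signature_base_string(method, url, params):
    """Join the method, the URL and the sorted, encoded parameters."""
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(normalized)])


def sign_request(method, url, params, consumer_secret, token_secret=""):
    """Return a base64 HMAC-SHA1 signature over the base string."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}".encode()
    base = signature_base_string(method, url, params).encode()
    digest = hmac.new(key, base, hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


if __name__ == "__main__":
    # Hypothetical request for a protected resource.
    params = {"oauth_nonce": "abc", "file": "vacation.jpg"}
    print(signature_base_string("GET", "http://example.com/photos", params))
    print(sign_request("GET", "http://example.com/photos", params,
                       "consumer-secret", "token-secret"))
```

The token secret is what the user grants to one application and can later revoke, which is why sharing it is safer than handing over an email password.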

Chris also pointed to a comment on a recent post of ours about email passwords that highlighted the need for tools like these. There was also a comment on RWW from Oren Michels at Mashery, indicating that it is the most requested feature for them right now.

Conclusion

Securely moving your data around the web has become an increasingly important concept. Arguably, it was the most discussed meme at this year's SXSW. While not an application, you could say it has been 'this year's Twitter'.

The Data Portability group deserves credit for educating the market. Beyond that, it is also an idea whose time has clearly come. It is interesting to think what applications will be built on top of these portability standards - they might be popular by next year's SXSW!

http://www.readwriteweb.com/archives/4_technologies_for_portability.php http://www.readwriteweb.com/archives/4_technologies_for_portability.php SXSW 2008 Mon, 10 Mar 2008 21:39:34 -0800 Sean Ammirati
Reuters Wants The World To Be Tagged

As Richard MacManus recently predicted, in 2008 we'll witness the rise of semantic web services. From the native support for Microformats in Firefox 3, to the New York Times' use of rich header metadata, to this week's release of the Social Graph API by Google, semantics are starting to slip onto the web. The impact is being felt because large companies are really starting to focus on structured information.

In the same vein, last week Reuters - an international business and financial news giant - launched an API called Open Calais.

The API performs semantic markup of unstructured HTML documents - recognizing people, places, companies, and events. This technology is the next generation of the Clear Forest offering, which Reuters acquired last year. We have profiled Clear Forest on ReadWriteWeb and in this post we will look at what Reuters opened up and why.

Open Calais API Basics

The idea behind Calais is simple - identify the interesting bits in documents and turn them into metadata. In this implementation the focus is on People, Companies, Places, and Events, but surely the technology can be adapted to other entities. The heavy lifting is done by the combination of a natural language processing engine and a massive hard-coded learning database that Clear Forest has built.

For any document submitted into Calais, entities are identified, extracted and annotated. For example, when the press release about the acquisition of Clear Forest is analyzed, the following meta data is identified:

  • Relations: Acquisition, CompanyInvestment, PersonProfessionalPast
  • Organization: Palo Alto Research Center
  • IndustryTerm: broader search development effort, text search, text analytics software, ...
  • Company: Time Warner Inc., Reuters, Pitango Venture Capital, Inxight, ClearForest Ltd, ...
  • Person: Gerry Campbell
  • Country: United States, Israel
  • City: Tel Aviv, SAN FRANCISCO, Waltham

This is a rather impressive set of information. According to the documentation page, the response is delivered in under one second for larger documents, and much faster for smaller ones - in other words, real time or near to it.

What was not quite clear from the documentation is whether Calais can deal with raw HTML pages. It appears that the API requires an XML document, where the main text is marked differently from the header and footer. Ideally, an API like this should be able to accept URLs, because distilling structure from HTML would not be trivial for developers. Another thing that we noticed is that the resulting document is extensively marked up. What developers get back is literally the output of the Calais engine. It would be good to be able to get a lighter version, which simply identifies entities and their positions in the text.
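
To illustrate the kind of lighter output we have in mind, here is a small sketch that takes a text and a set of already extracted entities and returns only their types and character offsets. The function name and the record layout are our own invention for illustration; this is not the Calais response format, and plain substring matching stands in for whatever the real engine does.

```python
def locate_entities(text, entities):
    """Return one record per entity mention, ordered by position in text.

    `entities` maps an entity type (e.g. "Company") to a list of names.
    Matching is plain, case-sensitive substring search.
    """
    hits = []
    for entity_type, names in entities.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                hits.append({
                    "type": entity_type,
                    "name": name,
                    "start": start,
                    "end": start + len(name),
                })
                start = text.find(name, start + len(name))
    return sorted(hits, key=lambda hit: hit["start"])


if __name__ == "__main__":
    text = "Reuters announced the acquisition of ClearForest Ltd."
    entities = {"Company": ["Reuters", "ClearForest Ltd"]}
    for hit in locate_entities(text, entities):
        print(hit)
```

A list like this is easy to consume: a developer can highlight, link or index each mention by slicing the original text with the start and end offsets.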

Currently the API is free for both commercial and non-commercial use, and Reuters says it is prepared to scale for massive concurrent demand. The question, then, is how can this be used?

What is Calais Good For?

There are quite a few interesting applications for this technology. First - better search. Knowing the kinds of entities in the text allows developers to build intelligent search engines that look for related content. For example, imagine a page on Reuters with this press release and in the sidebar links to learn more about Clear Forest, Reuters, Inxight, etc. Similarly, Calais could enable links to countries and cities mentioned in the document. And these searches need not be generic searches, but rather specific vertical ones.

Another application would be to build engines like Inform, which automatically inserts links into raw text. By automatically identifying entities in the document, Calais also identifies what should be linked. So a big piece of Inform's secret sauce is trivialized. The rest is basically a raw search through the archive, which can be done with a Google custom search engine, for example. It is possible that more tech savvy media companies could leverage Calais in exactly this way.

Another application is structured alerts. Modern alert systems are keyword based and suffer from false positives. Using Calais it is possible to build precise alerts for people, companies, places and events like corporate acquisitions. With the flood of junk in our RSS readers this is rather welcome news.
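
A structured alert can be sketched in a few lines once the entities are available. In the example below the metadata dictionary mirrors the categories listed earlier for the Clear Forest press release; the rule format and the matches_alert function are hypothetical and only meant to show the difference from keyword matching.

```python
def matches_alert(metadata, rule):
    """True if, for every entity type in the rule, the document contains
    at least one of the wanted values."""
    return all(
        set(wanted) & set(metadata.get(entity_type, []))
        for entity_type, wanted in rule.items()
    )


if __name__ == "__main__":
    # Entities as an extraction service might report them for one document.
    press_release = {
        "Relations": ["Acquisition", "CompanyInvestment", "PersonProfessionalPast"],
        "Company": ["Reuters", "ClearForest Ltd", "Inxight"],
        "Person": ["Gerry Campbell"],
        "Country": ["United States", "Israel"],
    }
    # Alert me about acquisitions that involve Reuters.
    rule = {"Relations": ["Acquisition"], "Company": ["Reuters"]}
    print(matches_alert(press_release, rule))
```

Because the rule tests entity types and relations instead of raw words, an article that merely uses the word "acquisition" in passing would not trigger it.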

Yet another application would be to incorporate on-the-fly text analysis into browsers. In a way, this is not much different from having Microformat annotations on the page, except that the annotations are delivered on the fly. For example, a browser could call Calais on document load and obtain a list of the people, places, companies, etc. that are mentioned in the document. With this information the browser would be able to create a more interesting, more contextual, and more relevant experience.

What's In It For Reuters?

Reuters has opened up a generous API, but why? During our interview, Gerry Campbell, the President/Global Head of Search & Content Technologies at Reuters, explained that Reuters wants the world to be tagged. When the world's content is quickly and readily accessible to their customers, Reuters wins. Semantic technologies result in better, faster, more precise and relevant information, and Reuters, as a big player in the information space, wants to be one of the first companies delivering this kind of experience.

Beyond an outstanding customer experience, Calais leads to a unique, attractive set of assets. First - a growing semantic database of people, places, companies and events. With each new document submitted into Calais the database gets richer and more complete. This is a roadmap to a semantic business powerhouse, which is clearly a great position to be in for any business media company. And in a way, what grows beneath Calais will not be that unlike Freebase. Except of course, it is happening completely automatically.

The second big advantage of having an open API is training the system. Any AI-based solution like Clear Forest is in constant need of tuning and evolution. Having other companies use the system would allow the engineers to run into cases that they have not thought about and broaden the capabilities of the system. Campbell told us that Calais is already processing a significant subset of Reuters information in nearly real time. This is both impressive technically and smart from an engineering point of view - it is an "eat your own dog food" approach to building a great piece of software.

Conclusion

The Calais API is another big win for top-down semantic web technologies. Using a mix of natural language processing, AI techniques, and massive databases, Reuters' solution extracts important bits of information from raw HTML pages. People, Companies, Places, and Events are really at the heart of many business articles, so being able to instantly identify them in the text is a big deal. From better search to better cross-linking and more intelligent browsing, the Calais API is an invitation to tap into one of the most powerful and pragmatic semantic platforms that exists and works today.

What sort of things do you envision to be possible with Calais? What applications would you like to see built with this platform?

http://www.readwriteweb.com/archives/reuters_calais.php http://www.readwriteweb.com/archives/reuters_calais.php Products Wed, 06 Feb 2008 01:47:18 -0800 Alex Iskold
How YOU Can Make the Web More Structured

We have written a lot here about the vision of building a structured layer on top of the current web. Annotating billions of HTML documents in a bottom-up way and building top-down tools that can automagically interpret the existing information are the two approaches that we have discussed. Together these approaches would result in a global database that would make the web even more connected. The ability to correlate content and concepts across web sites would reduce the time needed for searching and would enable the discovery of related information.

In previous posts we discussed the difficulties with the bottom-up approach to the Semantic Web - a sophisticated form of annotating information using tools like RDF and OWL. Among the factors that impair the web-wide adoption of these tools are complexity and the lack of clear end-user benefits.

On the other hand, the top-down approach that we discussed does not place any burden on content owners and delivers instant benefits to end users. Yet, the top-down tools run into a difficulty - interpreting raw information is not that simple. Typical solutions focus on a vertical, but still suffer from imperfections.

What if there was some minimal annotation in the content to help top-down tools interpret it? In this post we look at how content owners can implement simple annotation strategies which can help the top-down tools and search engines to make the web more structured.

Annotation Basics - Headers

It is striking how many sites today do not use meta tags in the head of the document to provide the bare minimum information about a page's content. Forget building a smarter web, this is just plain bad SEO practice. The work that is being put into generating great content can be offset by lack of a succinct, meaningful description of that content. Every page on the web should have the following information filled in:

  • title - a sentence briefly describing the site/page
  • description - a paragraph about the site/page
  • keywords - a list of keywords that describe the site/page

Note that it makes sense to provide different information for the root page and subsequent pages. For example, for a newspaper or a blog, the root page should provide information about the site at large, while individual article and post pages should contain information about that specific page, not the overall site.

The New York Times' web site provides a good example of how to properly use meta tags. For example, this article on Slowdown in US Growth includes the following meta data:

  • title - U.S. Growth Slowed Drastically in 4th Quarter
  • description - The economy expanded by a weak 0.6 percent in the latest indication of a substantial slowdown and perhaps a recession.
  • keywords - United States Economy,Gross Domestic Product
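
To show how little work it takes for a tool to use these three fields once they are present, here is a minimal sketch that reads them with Python's standard html.parser module. The class and function names are hypothetical, and the sample page reuses the New York Times values quoted above in abbreviated markup of our own.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collects the <title> text and every <meta name=... content=...>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data


def read_head_meta(html):
    """Return a dict with the title and named meta tags of a page."""
    parser = HeadMetaParser()
    parser.feed(html)
    parser.close()
    return parser.meta


if __name__ == "__main__":
    page = (
        "<html><head>"
        "<title>U.S. Growth Slowed Drastically in 4th Quarter</title>"
        '<meta name="keywords" content="United States Economy,Gross Domestic Product">'
        "</head><body><p>Article text</p></body></html>"
    )
    info = read_head_meta(page)
    print(info["title"])
    print(info["keywords"].split(","))
```

A page with no meta tags simply yields an empty dictionary, which is exactly the situation a crawler faces on many sites today.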

The New York Times is actually a great example of taking the basics of annotation and building on top of them. Each page includes an extended set of rich meta data, including the author of the article, the date it was published, a thumbnail image URL, creator, category and even ticker symbols for public companies that are mentioned in the article. Certainly, the New York Times provides a really great set of information, perhaps even wider than needed for most content, but let's focus on the ones that should be used on a wider scale.

author: Web content is produced by people and for people. With the rise of social culture we are increasingly interested in finding bits of everyone's identity around the web. If something piqued your interest enough for you to blog or to write an article, the least you can do is put your name on it. Having people attached to content would allow seamless navigation from one to another. There is already a standard meta tag for this, with a suggestive name: author.

thumbnail: We love pictures. Since the launch of Flickr we can't live without them. Facebook's success owes a lot to photo sharing. With bandwidth becoming cheap, we are becoming increasingly visual. We do not want text, we want pictures, so if a news article or blog post contains an image, it is simple to do what the Times did - generate a meta tag for it. There is no standard meta tag as far as I know, but any of these would do: thumb, image, picture, thumbnail, etc.

date: As we are becoming a real-time culture, the freshness of content becomes paramount. Tagging the page with a date is an important way of helping classify the page in time. Most blog posts and articles contain dates anyway, and having a standard date header would make it simple and obvious.

location: Location is becoming increasingly important as well. With GPS and widely available Internet access we can easily let people know where we are and take advantage of local services. If an article or a post is related to a specific location, there is a conventional way of annotating it. The technical term for annotating content with location information is geotagging. It generally means adding a pair of latitude and longitude coordinates. A more relaxed form is to specify country/region/city, as described in detail by the Geo microformat specification. While specifying exact position coordinates may be difficult, even something as simple as the geo header New York, NY would be very helpful.
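
As a sketch of how a blogging platform could emit these four headers from data it already has, consider the helper below. The author name is the standard one; the names date, thumbnail and geo.placename are assumptions on our part (as noted above, there is no agreed standard for thumbnails, and platforms may prefer other names), and the render_meta_tags function is hypothetical.

```python
from html import escape


def render_meta_tags(fields):
    """Render a dict of name -> content as <meta> tags, one per line.

    Empty values are skipped so that templates can pass everything they have.
    """
    lines = []
    for name, content in fields.items():
        if content:
            lines.append(
                '<meta name="%s" content="%s">'
                % (escape(name, quote=True), escape(content, quote=True))
            )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical post data a template would already know.
    print(render_meta_tags({
        "author": "Jane Doe",
        "date": "2008-01-30",
        "thumbnail": "http://example.com/images/post-thumb.jpg",
        "geo.placename": "New York, NY",
    }))
```

Escaping the values matters because titles and place names routinely contain quotes and ampersands that would otherwise break the attribute.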

Tags in Blog Posts

The concept of tagging, which was popularized by services like del.icio.us and Flickr, is now commonly understood and is ubiquitous. The idea of humans tagging content to categorize it and later to find it is a simple, yet important bit of the web infrastructure. Most major blogging platforms support tags. The tags are standardized based on the rel-tag microformat. You can see the implementation on ReadWriteWeb - each post is tagged with a set of tags.

For example, one of our recent posts contains this tag:
<a href="http://www.readwriteweb.com/tag/twitter" rel="tag">twitter</a>
The tag has several benefits:

  • Readers can instantly click to find other posts with this tag
  • Search engines can better classify the content
  • Semantic tools can offer additional services such as finding related content, pictures, and video
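
Here is a minimal sketch of how a semantic tool might read these tags. Under the rel-tag convention the tag itself is the last segment of the link's path, so the link text does not need to be trusted. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class RelTagParser(HTMLParser):
    """Collects tag names from <a rel="tag" href="..."> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if "tag" in rels and href:
            # The tag is the last path segment of the URL.
            path = urlsplit(href).path.rstrip("/")
            self.tags.append(unquote(path.rsplit("/", 1)[-1]))


def extract_tags(html):
    """Return the rel-tag tags found in the given HTML, in document order."""
    parser = RelTagParser()
    parser.feed(html)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    post = '<a href="http://www.readwriteweb.com/tag/twitter" rel="tag">twitter</a>'
    print(extract_tags(post))
```

With the tags in hand, finding related posts, pictures or video becomes a lookup on the tag name.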

Tags are similar in principle to keywords, but provide more flexibility because they are inside the post and can have richer content. In principle, it would be possible to add more information to the keywords meta tag in the head of the document, but it has existed in its current form for more than a decade and is thus unlikely to change. In any case, all modern blogging platforms make it trivial to tag content, so there should be no excuses.

Standardizing Blog Templates Across Platforms

In the nineties people created web sites. These days only companies have web sites; individuals have blogs and social network profiles. There is a great opportunity to standardize and structure the information because blogs and profiles are based on templates. Consider the common structure of a blog: one or a few sidebars and a central area for the content. In the content area of a post page there is a post body, date, author and tags - a minimum set of elements.

Why not standardize on a few things here?

  • <div class="post"> - a container for the post body
  • <div class="sidebar"> - a container for the sidebar
  • <div class="author"> - a container for the author
  • <div class="date"> - a container for the date
  • <div class="tags"> - a container for the tags
  • <div class="comments"> - a container for the comments
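
If platforms agreed on these class names, a crawler could pull the parts of any post apart with very little code. The sketch below assumes exactly the six class names listed above and nothing else; the parser class, the extract_post_parts function and the sample page are hypothetical.

```python
from html.parser import HTMLParser

STANDARD_CLASSES = {"post", "sidebar", "author", "date", "tags", "comments"}


class BlogTemplateParser(HTMLParser):
    """Collects the text inside each standard <div class="..."> container."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []  # one entry per open <div>: container name or None

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = set((dict(attrs).get("class") or "").split())
            named = sorted(classes & STANDARD_CLASSES)
            self._stack.append(named[0] if named else None)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Credit text to the innermost enclosing standard container.
        for name in reversed(self._stack):
            if name:
                self.fields[name] = self.fields.get(name, "") + data
                break


def extract_post_parts(html):
    """Return a dict mapping container name to its whitespace-normalized text."""
    parser = BlogTemplateParser()
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


if __name__ == "__main__":
    page = """
    <div class="post"><p>Hello world.</p></div>
    <div class="author">Jane Doe</div>
    <div class="date">2008-01-30</div>
    <div class="sidebar"><div class="widget">Blogroll</div></div>
    """
    print(extract_post_parts(page))
```

The same parser would work on any platform that adopted the convention, which is the whole point: one set of class names, one small tool.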

Platforms already do have very similar things in place and standardizing between them is rather simple. In no way would this be a competitive advantage or disadvantage to them, but it would be a big help towards making the web more structured. Extending on these basics, it would also be helpful if widgets were wrapped into standard enclosures. A simple widget tag can go a long way toward distinguishing widgets from the other content in the sidebar.

If blogging platforms standardized on these basic conventions, major newspapers would likely follow as well.

The situation with social network profiles is different, as the information contained in them is not public. In addition, there is a competitive advantage to Facebook in having its own proprietary structure. However, entities like the DataPortability group have been created precisely to deal with this problem and Facebook just joined. So we may yet see some progress on that front.

Beyond Basics - Microformats

The annotations that we discussed up to now are very basic and would require a minimum amount of work from newspapers, bloggers, and blogging platforms to deploy. Their advantage is that they are simple to implement but would deliver big bang for the buck. Yet, these are primitive ways to annotate content. The next step is to use bottom-up technologies like microformats, which offer a way to embed objects into HTML documents in a compact way.

Microformats have been around for a few years and have certainly caught the attention of some. Several major services are using microformats. For example, Flickr is using the geo microformat and headers to geotag photos. Eventful uses the hCalendar format to describe meta data for each event. Blogger pages contain hCards for each blogger. But the problem is that there needs to be more and better integration of microformats into the blogging platforms. For example, coming back to the Blogger hCard, right now most of them are not useful because the platform does not require people to fill in information and just generates the card based on the login. This does more harm than good, as semantic tools cannot take advantage of such cards and they do not look good to people either.

Similarly, there is not much support for geotagging photos and event microformats in the platforms. But even beyond the lack of support, the limitation of the current microformat specs is that they do not cover the basic range of things that people discuss on the web - books, music, movies, recipes, and restaurants are all noticeably absent (the existing hReview microformat does not have a way to express the type of the object or the attributes).

But it does look like, with a bit of a push on both the community behind the microformat specs and the blogging platforms, we could see microformats becoming a major way of annotating information inside blog posts. This would be a welcome development and would allow a large subset of the web - the blogosphere - to become quite structured.

Conclusion

The vision of the structured web is big and compelling and at the same time is hard to attain. At times, it is difficult to see how we can ever get there. But on some days we think that even if the web could be just a tiny bit more structured it would become so much more connected. And so in this post we considered a set of very basic bottom-up techniques that newspapers, bloggers, and blogging platforms can put in place to make the web more structured.

Putting meta information into page headers is easy and should be a must-do thing for everyone. Beyond that, providing information such as author, date, and location makes data that much more valuable. And if blogging platforms could also standardize on the key elements of the pages, crawlers and intelligent browsing tools could do a better job making sense of the content. Beyond that, microformats are the front runner in annotating the web with meta information about things, but they still need more pushing and effort.

What do you think about these basic structures? Are you going to fix up your blog after reading this post? What other things should we push to standardize on?

http://www.readwriteweb.com/archives/structured_web_microformats_tagging_meta_data.php http://www.readwriteweb.com/archives/structured_web_microformats_tagging_meta_data.php Trends Wed, 30 Jan 2008 22:48:55 -0800 Alex Iskold