ReadWriteWeb

Making the Web Searchable: The Story of SearchMonkey

Written by Alex Iskold / May 27, 2008 8:29 PM / 13 Comments

Last week at the SemTech 2008 conference in San Jose, Yahoo! researcher Peter Mika spoke in detail about the company's new SearchMonkey search platform initiative. Mika talked broadly about his work looking at metadata on the web, and how that led to the birth of SearchMonkey. This post is based on notes from that talk.

History of Web Page Annotations

The motivating question for Mika's presentation was: How can we make web search better by leveraging web annotation? There are many kinds of annotations, but Mika focused on simple data and lightweight semantics, and began by reviewing the history and evolution of annotations to explain how we got to where we are today.

One of the first methods of annotating HTML was Simple HTML Ontology Extensions (SHOE). This method allowed for the declaration of ontologies as well as relationships between the entities on HTML pages. The problem with it was that it introduced new tags that were not part of standard HTML and were not recognized by most browsers.

In 2003 Tantek Celik started work on Microformats - a way to embed light semantics using XHTML. Microformats are now driven by a community of developers, which evangelizes existing formats and is working on new ones. The major focus of this effort is to leverage standards, but Microformats are limited because they don't share a common syntax. Every microformat looks different, and there are no ontologies or schemas.

Things get particularly complicated when you start combining different Microformats, for example, when you describe that a person wrote a review at a particular event. In addition to this, Microformats have no concept of unique identity, and for this reason are largely incompatible with other Semantic Web efforts. Yet Microformats took off and have become somewhat widespread. So the takeaway here is that simple things can quickly gain adoption.
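
To make the "light semantics" idea concrete, here is a minimal sketch of how a program might pull an hCard out of a page. It relies only on the fact that hCard marks a contact with class="vcard" and labels fields with class names such as "fn" and "org". The HCardSketch class, the extract_hcards helper, and the sample markup are hypothetical names made up for illustration, and the sketch handles only a tiny subset of the format; it is not a conforming microformats parser.

```python
from html.parser import HTMLParser

# A small, hand-picked subset of hCard property class names (assumption:
# enough for illustration; the real format defines many more).
HCARD_PROPERTIES = {"fn", "org", "tel", "email"}
# Elements that never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCardSketch(HTMLParser):
    """Collects the text of hCard properties found inside class="vcard" elements."""

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per vcard root, in document order
        self._open = []   # one (is_vcard_root, property_names) pair per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        is_root = "vcard" in classes
        if is_root:
            self.cards.append({})
        # Property classes only count when we are somewhere inside a vcard.
        inside_card = is_root or any(root for root, _ in self._open)
        props = classes & HCARD_PROPERTIES if inside_card else set()
        self._open.append((is_root, props))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attach the text to every property element that is currently open.
        for _, props in self._open:
            for name in props:
                card = self.cards[-1]
                card[name] = (card.get(name, "") + " " + text).strip()


def extract_hcards(html):
    """Hypothetical helper: returns a list of dicts, one per vcard found."""
    parser = HCardSketch()
    parser.feed(html)
    parser.close()
    return parser.cards


# Hypothetical sample markup, not taken from any real page.
SAMPLE = (
    '<div class="vcard">'
    '<span class="fn">Jane Example</span>, '
    '<span class="org">Example Corp</span>'
    "</div>"
)
print(extract_hcards(SAMPLE))
```

Note that the extractor has to know the hCard class names in advance. That is the limitation described above in practice: each microformat needs its own hand-written rules, because there is no shared syntax or schema to fall back on.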

Another way of providing metadata that emerged recently is tagging. As an example, Flickr uses tags for photos to enable its users to annotate and describe the content. The problem with tags is that there is no agreement on meaning, so the same tag on Flickr and del.icio.us can mean different things, and there's no way to be sure which tag means what. Tags are a much more personal way of annotating information; they are not objective.

In 2005, Ian Davis, CTO of Semantic Web infrastructure company Talis, proposed eRDF - a form of RDF that can be embedded into HTML (compatible with HTML4). There is a simple mapping from eRDF to RDF, so you can use any RDF/OWL vocabulary. But eRDF is not full RDF -- it has limitations. For example, there are no data types and there are no blank nodes. Also, each page can only "talk" about itself and not about other pages.

Finally, the W3C published RDFa, the latest embedding of RDF in XHTML, which has full RDF support. RDFa adds complexity in terms of implementation but, at the moment, offers the best way to embed RDF into HTML.
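
As a rough illustration of what embedding RDF in markup buys you, the sketch below reads a few RDFa attributes and turns them into subject/predicate/object triples. It uses only the about, property and content attributes, which RDFa does define; everything else is a simplifying assumption. The RDFaSketch class, the extract_triples helper, the example.org URLs and the sample markup are all made up for this post, prefixes such as dc: are left unexpanded, relative addresses are not resolved, and rel/href, typeof and datatype are ignored. Treat it as a toy, not as a conforming RDFa processor.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they are kept off the stack.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input"}


class RDFaSketch(HTMLParser):
    """Toy reader for a small slice of RDFa: about, property and content."""

    def __init__(self, page_url):
        super().__init__()
        self.triples = []
        # The bottom frame makes the page itself the default subject.
        self._frames = [{"subject": page_url, "property": None, "text": []}]

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # An about attribute switches the subject; otherwise inherit it.
        subject = attributes.get("about") or self._frames[-1]["subject"]
        prop = attributes.get("property")
        if prop and attributes.get("content") is not None:
            # The value is given explicitly, so the triple is complete.
            self.triples.append((subject, prop, attributes["content"]))
            prop = None
        if tag not in VOID_TAGS:
            self._frames.append({"subject": subject, "property": prop, "text": []})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or len(self._frames) == 1:
            return
        frame = self._frames.pop()
        if frame["property"]:
            # No content attribute was given: the element's text is the value.
            literal = "".join(frame["text"]).strip()
            self.triples.append((frame["subject"], frame["property"], literal))

    def handle_data(self, data):
        for frame in self._frames:
            if frame["property"]:
                frame["text"].append(data)


def extract_triples(html, page_url):
    """Hypothetical helper: returns (subject, predicate, object) tuples."""
    parser = RDFaSketch(page_url)
    parser.feed(html)
    parser.close()
    return parser.triples


# Hypothetical sample markup and URLs, invented for illustration.
SAMPLE = """
<div about="http://example.org/book/1">
  <span property="dc:title">An Example Book</span>
  <meta property="dc:date" content="2008-05-27" />
</div>
<p property="dc:creator">Jane Example</p>
"""
for triple in extract_triples(SAMPLE, "http://example.org/page"):
    print(triple)
```

The first two statements are about a resource other than the page that carries the markup, which is exactly what the eRDF limitation above rules out. The last one, with no about in scope, falls back to describing the page itself.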

How Much Metadata is Out There?

Given the increasing trend towards web annotations, the natural question is: just how much metadata is already out there? Peter Mika set out to answer this question and created a prototype called Microsearch. The idea was to look at web pages and to see how much metadata was there. Beyond that, Mika was also interested in what type of metadata was present, as well as the ratio between annotated and plain HTML pages.

With the Microsearch exercise, Mika wanted to demonstrate what could be done to enhance search with this information. For each type of metadata, Mika augmented search results with additional links and information. For example, maps, events, information from hCard, etc. are presented in an enhanced way, unlike what we're used to seeing with today's search engines.

Mika discovered a few interesting things. First, about 53% of queries have one page with metadata in the top 10 results. However, much of the data Mika saw was not clean and contained information that was not well formed, and performance was pretty poor due to the lack of an index. So the unfortunate conclusion that Mika came to was that RDF templating was difficult and the approach was not easily scalable. Finally, Mika realized that metadata really needs to be on the page for users to see, because otherwise there is a big opportunity for semantic spam.

The Birth of SearchMonkey

The point of any experiment is to draw the right conclusions. Looking at the facts, Mika and the Yahoo! search team realized that they could not count on enhancing search by leveraging metadata on today's web - it simply does not exist to the extent needed. At the same time, it was clear that enhancing search results and cross-linking them to other pieces of information on the web is compelling and potentially disruptive. Yahoo! realized that in order to make this work, they needed to incentivize and enable publishers to control search result presentation. And thus, SearchMonkey was born.

SearchMonkey is a system that motivates publishers to use semantic annotations, and is based on existing semantic standards and industry standard vocabularies. It provides tools for developers to create compelling applications that enhance search results. The main focus of these applications is on the end user experience - enhanced results contain what Yahoo! calls an "infobar" - a set of overlays to present additional information. For example, with SearchMonkey, LinkedIn is able to surface additional information from the user profile, Netflix can present a blurb about the plot and a rating for a movie, and Barnes & Noble can embed a preview of a book.

SearchMonkey's aim is to make information presentation more intelligent when it comes to search results by enabling the people who know each result best - the publishers - to define what should be presented and how.

A Better Search Experience Ahead

This first version of SearchMonkey is just a small first step towards creating a better search experience. Much more is planned, but even with this simple version, we can clearly see the power of semantics and annotations in web pages. By creating the right incentive for publishers and putting them in control, Yahoo! is aiming to raise the bar on search results, and, who knows, maybe even start attracting converts from Google's plain-looking results.


Comments


  1. Good points, Alex - I agree, SearchMonkey is a great step forward, in that it enables and even encourages site developers and content creators to annotate content to make it easier for searching. This takes us a little bit further in the direction of the bottom-up Semantic Web, which is a positive sign.

    One question I've been asking, including to Amit Kumar, Director of Product Management for Yahoo! Web Search, is:
    How does SearchMonkey compare with/differ from Google Co-op (the "annotate search results" part of Co-op)?
    Unfortunately, I haven't found a good answer yet.

    There are also other concerns with SearchMonkey, which I've listed in my recent post:
    http://blog.softwareabstractions.com/the_software_abstractions/2008/05/yahoo-searchmon.html

    Posted by: NitinK | May 27, 2008 9:22 PM



  2. SearchMonkey and Google co-op - how does it really differ good question Nitink. I've asked that myself and yet also couldn't get a reasonable answer...?

    Posted by: Sesli Sohbet | May 27, 2008 11:36 PM



  3. Any technology that will improve the available metadata in a scalable way ranks high on my list. I love searchmonkey and I believe that it offers the right incentive for publishers to take better care of their content.

    I also agree with Mika's point that "metadata really needs to be on the page for users to see, because otherwise there is a big opportunity for semantic spam," but I am not sure how feasible it is.

    Posted by: Tal Keinan | May 28, 2008 12:27 AM



  4. SearchMonkey is a good attempt. And it is also a beautiful dream. But whether it can really succeed, just like many older Yahoo projects, we have to watch.

    I have a feeling that it is too ambitious a project. The success may fairly depend on whether Yahoo really is willing to give up the control on its data. If Yahoo indeed does what it says, there may be a chance for SearchMonkey. Otherwise, I doubt of the future of this project.

    Yihong

    Posted by: Yihong Ding | May 28, 2008 12:44 AM



  5. it is amazing to be made aware of how we are so stuck to the alphabet ... searchmonkey is basically tagging, isn't it?

    how does the mind do this? via meaning associations? certainly not via words....

    a way to do this problem via colors?

    Posted by: gregory | May 28, 2008 2:41 AM



  6. it is amazing to be made aware of how we are so stuck to the alphabet ... searchmonkey is basically tagging, isn't it?

    how does the mind do this? via meaning associations? certainly not via words....

    a way to do this problem via colors? categories of meaning coded via color field?

    really hard to beat consciousness, ain't it

    Posted by: gregory | May 28, 2008 2:42 AM



  7. SearchMonkey is great initiative and let us all hope it brings better search to the internet as the consequence.

    What I am seeing as a problem is that all this semantics only has any effect on the display of the results, not on actual inner workings of the engine or at least inter-result display.

    For example, sites could state where they first got news, and then search results could show how information flew between matches displayed, find out "who broke the news", etc...

    Also sites could on the other hand annotate part of the content as unimportant to be indexed (yes, this would mean less search keywords, but would also place the site higher when it actually has 'content match').

    I can't wait to see what comes next out of Yahoo in this regard.


    Andraz, Zemanta

    Posted by: andraz | May 28, 2008 4:01 AM



  8. @andraz you are spot on, this is just UI. It's a step forward though, and there is more to come.

    Posted by: Alex Iskold | May 28, 2008 5:23 AM



  9. "The point of any experiment is to draw the right conclusions."

    Umm...

    Posted by: Toby | May 28, 2008 5:25 AM



  10. Thanks Professor, as always a great explanation of a complex subject.

    I am interested in the "motivates publishers to use semantic annotations" bit. If the commercials work for publishers and the tools are there, then we will get the annotation.

    Posted by: bernard lunn | May 28, 2008 5:29 AM



  11. The author is to some extent a hypocrite. He knows that RDFa is the best approach to achieving a better semantic web yet his company AdaptiveBlue promotes ab meta which is yet another for.

    He deserves a boot up the backside.

    Posted by: pete | May 28, 2008 7:31 AM



  12. Pete,

    These are certainly some harsh words. Let me explain a couple of things to you:

    1) I report on RWW what is going on in the industry with completely open mind without any agenda. This work and the work that I do at AdaptiveBlue is separate.

    2) I do not know that RDFa is the best approach. It is canonical, but it is complicated. What I do know is that AB Meta is easy and simple and people can actually understand it.

    3) AB Meta and RDFa are not mutually exclusive. In addition, we are working on extending the AB Meta spec to describe how it can be used with RDFa and eRDF.

    So I think you are jumping to conclusions without knowing all the facts.

    Posted by: Alex Iskold | May 28, 2008 8:01 AM



  13. Hi Alex,

    Great article!

    A couple of comments, if I may...

    First, those writing comments along the lines that it's not yet certain whether SearchMonkey will succeed have a point, I guess, insofar as nothing is certain. But I do happen to think that the guys at Yahoo! are on to something. And even if it doesn't fly, the broader point is that the relationship between data, search and browsing is changing for good now, and there is no going back.

    This is an area I've been interested in for a number of years, and is in fact the reason I devised RDFa in the first place. But in the model I've been working to create, embedded metadata is an enabler for users to customise parts of any web page, not just search results.

    (If anyone is interested in this topic, I gave a tech talk at Google a few weeks ago, which is available on YouTube at:

    http://www.youtube.com/watch?v=mxE3FeOyS-E

    The talk first introduces the RDFa syntax, and then shows some demos of the way that RDFa can be used to create richer UIs.)

    My second point is just a little bit on the history of RDFa. You mentioned dates for Tantek and Ian, but almost imply that RDFa appeared from nowhere at the W3C!

    The first public draft of the syntax (then called RDF/XHTML) was in February 2004:

    http://www.w3.org/MarkUp/2004/02/xhtml-rdf.html

    although there had been a number of earlier drafts in 2003, that weren't made public. Also, my first public presentation on the topic was later that year, at XML Europe:

    http://webbackplane.com/node/57

    although once again, I had presented on the subject in private meetings. One example is at the W3C Technical Plenary earlier that year, as reported on xml.com:

    http://www.xml.com/pub/a/2004/03/03/deviant.html

    Anyway...the point is that RDFa has a solid history, which is at least as long as the other solutions that have been proposed.

    One final point; you say in the comment that you've added, that RDFa is complicated, counterposing that people can "actually understand" your syntax. I have to take issue with you there, and point out that whilst RDFa has sufficient power to support all of RDF, it is just as much at home expressing simple structures. For example, the well-known Microformat rel-license is also perfectly acceptable RDFa, and you can't get much simpler than that.

    Don't get me wrong, it's great to see this topic getting some air-time, and I enjoyed your article. But as RDFa is rapidly attracting interest, I thought it worth ensuring that your readers had no cause for misunderstanding.

    All the best,

    Mark

    webBackplane
    http://webBackplane.com/mark-birbeck

    Posted by: Mark Birbeck | June 6, 2008 4:06 PM


