powerset - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/powerset en Copyright 2009 Richard MacManus readwriteweb@gmail.com Mon, 23 Nov 2009 10:24:13 -0800 http://www.sixapart.com/movabletype/?v=4.23-en http://blogs.law.harvard.edu/tech/rss

Search and Rescue: 6 Approaches to Semantic Data Collection

It's been more than ten years since Tim Berners-Lee first spoke about the semantic web and computers indexing all web-based data. He said, "The day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The 'intelligent agents' people have touted for ages will finally materialize." Since then a handful of companies have attempted to tackle the issue of machine-based indexing and language interpretation. None of them is perfect. Below are 6 unique approaches to semantic data collection.

1. Powerset

This site was one of the first to publicly apply machine-based natural language processing to a consumer search engine. Nevertheless, because public expectations were so high, when Powerset launched a Wikipedia-only beta, reviewers were harsh. The site was acquired by Microsoft shortly after the initial launch and the team has been low-key ever since. While Powerset is one of the definitive semantic engines in existence, Microsoft is currently concentrating on using Powerset's technology to index Wikipedia pages in Bing. Powerset's search result pages actually contain a "Try this on Bing Reference" note in the sidebar of the site.

2. Cuil

This team touted its language processing product as being much faster to index pages than Google; however, consumers rarely covet speed over quality and the site was criticized right from the start. Expectations were not met, as Cuil's claim of 120 billion indexed pages did not measure up to Google's reported 1 trillion unique URLs. However, what Cuil did right was separate related search results from regular web results. That being said, without any human intervention, the related results are often bizarre and irrelevant. For instance, my name produces the rankings of Ultimate Fighting Championship champions.

3. Hakia

This is a natural language search engine where sponsored results, regular web results and "credible" web results are broken down visually into separate categories. Similar to Wikipedia, Hakia employs a community monitoring system for credibility and "credible" results must be peer reviewed and seemingly free of corporate interest. One of the great features of Hakia is that users can tab over the site to show only images or news.

4. Worio

Worio is considered a "discovery engine" as it is not technically a search engine destination site. While users are still required to visit the Worio destination, search is actually powered by Yahoo, Google or Windows Live search. Regular web results appear in the larger left-side column and natural language-based "discoveries" appear on the right. These discoveries are further refined by personal bookmarks and shared relevancy with Facebook friends.

5. Ubiquity

Ubiquity is perhaps the opposite of a semantic web engine, but it serves a similar function for those looking to aggregate useful data. The Firefox plugin allows users to create command lines that incorporate natural language search with a series of mashups. Users can then combine relevant data from Craigslist, translation tools, maps, reviews and social networks for easy user visualization. While the end product is an extremely useful document, users may not be ready for the drastic behavioral change of using command lines for semantic data collection.

6. Semanti

From a consumer standpoint, Semanti sits somewhere on the spectrum between Worio and Ubiquity. ReadWriteWeb reviewed the product earlier this week and, like Ubiquity, it is a Firefox plug-in rather than a destination site. However, like Worio, it employs leading search engines, bookmarking and Facebook friends to produce results. Semanti's key difference is that it prompts users to choose from multiple definitions prior to completing the search. Decision-making is actually human-powered rather than machine-powered. CEO Bruce Johnson said, "I tried machine-based semantic tagging, but my priority has always been a faster search experience." While this is not the "use of intelligent agents" that Berners-Lee suggested, it is a "semantic" tool in that it helps the user distill meaning and relevancy from language.

If you've got more examples of semantic data collection tools, list them in the comments below.

http://www.readwriteweb.com/archives/search_and_rescue_6_approaches_to_semantic_data_collection.php Semantic Web Thu, 25 Jun 2009 15:45:41 -0800 Dana Oshiro

Semantic Web Patterns: A Guide to Semantic Technologies

In this article, we'll analyze the trends and technologies that power the Semantic Web. We'll identify patterns that are beginning to emerge, classify the different trends, and peek into what the future holds.

In a recent interview Tim Berners-Lee pointed out that the infrastructure to power the Semantic Web is already here. ReadWriteWeb's founder, Richard MacManus, even picked it to be the number one trend in 2008. And rightly so. Not only are the bits of infrastructure now in place, but we are also seeing startups and larger corporations working hard to deliver end user value on top of this sophisticated set of technologies.

Editor's note: Looking back over 2008, there were some posts on ReadWriteWeb that did not get the attention we felt they deserved - whether because of timing, competing news stories, etc. So in this end-of-year series, called Redux, we're resurrecting some of those hidden gems. This is one of them; we hope you enjoy (re)reading it!

The Semantic Web means many things to different people, because there are a lot of pieces to it. To some, the Semantic Web is the web of data, where information is represented in RDF and OWL. Some people replace RDF with Microformats. Others think that the Semantic Web is about web services, while for many it is about artificial intelligence - computer programs solving complex optimization problems that are out of our reach. And business people always redefine the problem in terms of end user value, saying that whatever it is, it needs to have simple and tangible applications for consumers and enterprises.

The disagreement is not accidental, because the technology and concepts are broad. Much is possible and much is to be imagined.

1. Bottom-Up and Top-Down

We have written a lot about the different approaches to the Semantic Web - the classic bottom-up approach and the new top-down one. The bottom-up approach is focused on annotating information in pages, using RDF, so that it is machine readable. The top-down approach is focused on leveraging information in existing web pages, as is, to derive meaning automatically. Both approaches are making good progress.

A big win for the bottom-up approach was the recent announcement from Yahoo! that their search engine is going to support RDF and microformats. This is a win-win-win for publishers, for Yahoo!, and for customers - publishers now have an incentive to annotate information because Yahoo! Search will be taking advantage of it, and users will then see better, more precise results.

Another recent win for the bottom-up approach was the announcement of the Semantify web service from Dapper. This offering will enable publishers to add semantic annotations to existing web pages. The more tools like Semantify that pop up, the easier it will be for publishers to annotate pages. Automatic annotation tools, combined with the incentive to annotate pages, are going to make the bottom-up approach more compelling.

But even if the tools and incentive exist, making the bottom-up approach widespread is difficult. Today, the magic of Google is that it can understand information as is, without asking people to fully comply with W3C standards or SEO optimization techniques. Similarly, top-down semantic tools are focused on dealing with imperfections in existing information. Among them are the natural language processing tools that do entity extraction - such as the Calais and TextWise APIs that recognize people, companies, places, etc. in documents; vertical search engines, like ZoomInfo and Spock, which mine the web for people; technologies like Dapper and BlueOrganizer, which recognize objects in web pages; and Yahoo! Shortcuts, Snap and SmartLinks, which recognize objects in text and links.

[Disclosure: Alex Iskold is founder and CEO of AdaptiveBlue, which makes BlueOrganizer and SmartLinks.]

Top-down technologies are racing forward despite imperfect information. And, of course, they benefit from the bottom-up annotations as well. The more annotations there are, the more precise top-down technologies will get - because they will be able to take advantage of structured information as well.

2. Annotation Technologies: RDF, Microformats, and Meta Headers

Within the bottom-up approach to annotation of data, there are several choices for annotation. They are not equally powerful, and in fact each approach is a trade off between simplicity and completeness. The most comprehensive approach is RDF - a powerful, graph-based language for declaring things, and attributes and relationships between things. In a simplistic way, one can think of RDF as the language that allows expressing truths like: Alex IS human (type expression), Alex HAS a brain (attribute expression), and Alex IS the father of Alice, Lilly, and Sofia (relationship expression). RDF is powerful, but because it is highly recursive, precise, and mathematically sound, it is also complex.
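The three kinds of statement above can be sketched as subject-predicate-object triples. This is only an illustrative toy in Python: real RDF identifies things with URIs and is written in a serialization such as RDF/XML, and the predicate names used here (`type`, `has`, `fatherOf`) are made up for the example.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# The statements from the text, with hypothetical predicate names.
TRIPLES = [
    Triple("Alex", "type", "Human"),       # type expression
    Triple("Alex", "has", "Brain"),        # attribute expression
    Triple("Alex", "fatherOf", "Alice"),   # relationship expressions
    Triple("Alex", "fatherOf", "Lilly"),
    Triple("Alex", "fatherOf", "Sofia"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the fields that are not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]

# "Who are Alex's children?" becomes a pattern with one field left open.
children = [t.obj for t in match(TRIPLES, subject="Alex", predicate="fatherOf")]
print(children)  # ['Alice', 'Lilly', 'Sofia']
```

Because every fact has the same three-part shape, one generic pattern match answers any question about the graph, which is a large part of what makes the model attractive for interoperability.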

At present, most use of RDF is for interoperability. For example, the medical community uses RDF to describe genomic databases. Because the information is normalized, the databases that were previously silos can now be queried together and correlated. In general, in addition to semantic soundness, the major benefit of RDF is interoperability and standardization, particularly for enterprises, as we will discuss below.

Microformats offer a simpler approach by adding semantics to existing HTML documents using agreed-upon class names on ordinary elements (the same class attribute that CSS styles hook into). The metadata is compact and is embedded inside the actual HTML. Popular microformats are hCard, which describes personal and company contact information, hReview, which adds meta information to review pages, and hCalendar, which is used to describe events.
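To make this concrete, here is a minimal sketch of reading an hCard out of a page with Python's standard `html.parser`. The class names `vcard`, `fn`, `org` and `url` come from the hCard microformat; the sample markup, the person and the `HCardParser` class are invented for illustration, and a real parser would handle nesting and many more properties.

```python
from html.parser import HTMLParser

# Invented sample markup; only the class names are part of hCard.
HCARD_HTML = """
<div class="vcard">
  <span class="fn">Alex Example</span>
  <span class="org">Example Corp</span>
  <a class="url" href="http://example.com/">home page</a>
</div>
"""

class HCardParser(HTMLParser):
    """Collect a few hCard properties from class-annotated elements."""
    TEXT_PROPERTIES = {"fn", "org"}

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text we expect next

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "url" in classes and "href" in attrs:
            self.card["url"] = attrs["href"]
        for name in classes:
            if name in self.TEXT_PROPERTIES:
                self._current = name

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()
            self._current = None

hcard_parser = HCardParser()
hcard_parser.feed(HCARD_HTML)
print(hcard_parser.card)
```

The page still renders exactly as before for human readers; the annotations ride along in markup that browsers already ignore unless a stylesheet refers to it.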

Microformats are gaining popularity because of their simplicity, but they are still quite limiting. There is no way to describe type hierarchies, which the classic semantic community would say is critical. The other issue is that microformats are somewhat cryptic, because the focus is to keep the annotations to a minimum. This, in turn, brings up another question of whether embedding metadata into the view (HTML) is a good idea. The question is: what happens if the underlying data changes after someone has made a copy of the HTML document? Despite these issues, adoption continues: microformats are currently used by Flickr, Eventful, and LinkedIn, and many other companies are looking to adopt them, particularly because of the recent Yahoo! announcement.

An even simpler approach is to put metadata into the meta headers. This approach has been around for a while and it is a shame that it has not been widely adopted. As an example, the New York Times recently launched extended annotations for its news pages. The benefit of this approach is that it works great for pages that are focused on a topic or a thing. For example, a news page can be described with a set of keywords, geo location, date, time, people, and categories. Another example would be for book pages. O'Reilly.com has been putting book information into the meta headers, describing the author, ISBN, and category of the book.
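A short sketch of how a tool might read such headers, again with Python's standard `html.parser`. The page, the field names (`author`, `isbn`, `category`) and the values are placeholders; they are not the actual tags used by the New York Times or O'Reilly.

```python
from html.parser import HTMLParser

# Placeholder book page; the meta names and values are assumptions.
BOOK_HTML = """
<html><head>
  <title>Sample Book</title>
  <meta name="author" content="Jane Doe">
  <meta name="isbn" content="0000000000">
  <meta name="category" content="Web Development">
</head><body><p>Book description goes here.</p></body></html>
"""

class MetaHeaderParser(HTMLParser):
    """Gather name/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

meta_parser = MetaHeaderParser()
meta_parser.feed(BOOK_HTML)
print(meta_parser.meta)
```

The trade-off is visible in the code: the annotations describe the page as a whole, so this works well when a page is about one thing and poorly when it lists many.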

Despite the fact that all these approaches are different, they are also somewhat complementary; and each of them is helpful. The more annotations there are in web pages, the more standards are implemented, and the more discoverable and powerful the information becomes.

3. Consumer and Enterprise

Yet another dimension of the conversation about the Semantic Web is the focus on consumer and enterprise applications. In the consumer arena we have been looking for a Killer App - something that delivers tangible and simple consumer value. People simply do not care that a product is built on the Semantic Web; all they are looking for is utility and usefulness.

Up until recently, the challenge has been that the Semantic Web focused on rather academic issues - like annotating information to make it machine-readable. The promise was that once the information is annotated and the web becomes one big giant RDF database, then exciting consumer applications would come. The skeptics, however, have been pointing out that first there needs to be a compelling use case.

Consumer applications based on the Semantic Web include generic and vertical search, contextual shortcuts and previews, personal information management systems, and semantic browsing tools. All of these applications are in their early days and have a long way to go before being truly compelling for the average web user. Still, even if these applications succeed, consumers will not be interested in knowing about the underlying technology - so there is really no marketing play for the Semantic Web in the consumer space.

Enterprises are a different story for a couple of reasons. First, enterprises are much more used to techno speak. To them utilizing semantic technologies translates into being intelligent and that, in turn, is good marketing. 'Our products are better and smarter because we use the Semantic Web' sounds like a good value proposition for the enterprise.

But even above the marketing speak, RDF solves a problem of data interoperability and standards. This "Tower of Babel" situation has been in existence since the early days of software. Forget semantics; just a standard protocol, a standard way to pass around information between two programs, is hugely valuable in the enterprise.

RDF offers a way to communicate using an XML-based language, which on top of it has sound mathematical elements to enable semantics. This sounds great, and even the complexity of RDF is not going to stop enterprises from using it. However, there is another problem that might stop it - scalability. Unlike relational databases, which have been around for ages and have been optimized and tuned, XML-based databases are still not widespread. In general, the problem is in the scale and querying capabilities. Like object-oriented database technologies of the late '90s, XML-based databases hold a lot of promise, but we have yet to see them in action in a big way.

4. Semantic APIs

With the rise of Semantic Web applications, we are also seeing the rise of Semantic APIs. In general, these web services take unstructured information as input and find entities and relationships in it. One way to think of these services is as mini natural language processing tools, which are only concerned with a subset of the language.

The first example is the Open Calais API from Reuters, which we have covered previously. This service accepts raw text and returns information about people, places, and companies found in the document. The output not only returns the list of found matches, but also specifies places in the document where the information is found. Behind Calais is a powerful natural language processing technology developed by ClearForest (now owned by Reuters), which relies on algorithms and databases to extract entities out of text. According to Reuters, Calais is extensible, and it is just a matter of time before new entities will be added.
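The shape of that output (entities plus their positions in the text) can be illustrated with a deliberately naive sketch. This is not the Calais API and does no real language processing: it simply looks names up in a small hand-made table, and `GAZETTEER` and `extract_entities` are hypothetical names.

```python
import re

# Hand-made lookup table standing in for real entity recognition.
GAZETTEER = {
    "Reuters": "Company",
    "ClearForest": "Company",
    "Tim Berners-Lee": "Person",
    "San Francisco": "City",
}

def extract_entities(text):
    """Return (name, type, start, end) tuples ordered by position in text."""
    found = []
    for name, kind in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, kind, m.start(), m.end()))
    return sorted(found, key=lambda entity: entity[2])

sample = "Reuters bought ClearForest. Reuters is based far from San Francisco."
for entity in extract_entities(sample):
    print(entity)
```

Returning offsets rather than just a list of names is what lets a client highlight or link each mention in place, even when the same entity appears several times.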

Another example is the SemanticHacker API from TextWise, which is offering a one million dollar prize for the best commercial semantic web application developed on top of it. This API classifies information in documents into categories called semantic signatures. Given a document, it outputs entities or topics that the document is about. It is kind of like Calais, but also delivers a topical hierarchy, where the actual objects are leaves.

Another semantic API is offered by Dapper - a web service which facilitates the extraction of structure from unstructured HTML pages. Dapper works by enabling users to define attributes of an object based on the bits of the page. For example, a book publisher might define where the information about author, ISBN and number of pages is on a typical book page and the Dapper application would then create a recognizer for any page on the publisher site and enable access to it via REST API.
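The idea of defining a recognizer once and reusing it on every similar page can be sketched in a few lines of Python. This is a guess at the general technique, not Dapper's implementation: the field patterns, the `id` attributes in the sample page and the `recognize` function are all assumptions made for the example.

```python
import re

# One pattern per field, written once for the publisher's page template.
BOOK_RECOGNIZER = {
    "author": re.compile(r'<span id="author">(.*?)</span>'),
    "isbn": re.compile(r'<span id="isbn">(.*?)</span>'),
    "pages": re.compile(r'<span id="pages">(\d+)</span>'),
}

def recognize(html, recognizer):
    """Apply each field pattern to the page; missing fields become None."""
    record = {}
    for field, pattern in recognizer.items():
        m = pattern.search(html)
        record[field] = m.group(1) if m else None
    return record

# Invented page that follows the assumed template.
BOOK_PAGE = (
    '<h1>Sample Book</h1>'
    '<span id="author">Jane Doe</span>'
    '<span id="isbn">0000000000</span>'
    '<span id="pages">321</span>'
)

book = recognize(BOOK_PAGE, BOOK_RECOGNIZER)
print(book)
```

Serving the resulting record over HTTP is what would turn the recognizer into the kind of REST access described above; that part is omitted here.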

While this seems backwards from an engineering point of view, Dapper's technology is remarkably useful in the real world. In a typical scenario, for websites that do not have clean APIs to access their information, even non-technical people can build an API in minutes with Dapper. This is a powerful way of quickly turning websites into web services.

5. Search Technologies

Perhaps the first significant blow to the Semantic Web has been the inability thus far to improve search. The premise that a semantic understanding of pages leads to vastly better search has yet to be validated. The two main contenders, Hakia and Powerset, have made some progress, but not enough. The problem is that Google's algorithm, which is based on statistical analysis, deals just fine with semantic entities like people, cities, and companies. When asked "What is the capital of France?", Google returns a good enough answer.

There is a growing realization that marginal improvement in search might not be enough to beat Google or to declare search the killer app for the Semantic Web. Likely, understanding semantics is helpful but not sufficient to build a better search engine. A combination of semantics, innovative presentation, and memory of who the user is, will be necessary to power the next generation search experience.

Alternative approaches also attempt to overlay semantics on top of the search results. Even Google ventures into verticals by partitioning the results into different categories. The consumer can then decide which type of answer they are interested in.

Yet search is a game that is far from won and a lot of semantic companies are really trying to raise the bar. There may be another twist to the whole search play - contextual technologies, as well as semantic databases, could lead to qualitatively better results. And so we turn to these next.

6. Contextual Technologies

We are seeing an increasing number of contextual tools entering the consumer market. Contextual navigation does not just improve search, but rather shortcuts it. Applications like Snap, Yahoo! Shortcuts, and SmartLinks "understand" the objects inside text and links and bring relevant information right into the user's context. The result is that the user does not need to search at all.

Thinking about this more deeply, one realizes that contextual tools leverage semantics in a much more interesting way. Instead of trying to parse what a user types into the search box, contextual technologies rely on analyzing the content. So the meaning is derived in a much more precise way - or rather, there is less guessing. The contextual tools then offer the users relevant choices, each of which leads to a correct result. This is fundamentally different from trying to pull the right results from a myriad of possible choices resulting from a web search.

We are also seeing an increasing number of contextual technologies make their way into the browser. Top-down semantic technologies need to work without publishers doing anything; and so to infer context, contextual technologies integrate into the browser. Firefox's recommended extensions page features a number of contextual browsing solutions - Interclue, ThumbStrips, Cooliris, and BlueOrganizer (from my own company).

The common theme among these tools is the recognition of information and the creation of specific micro contexts for the users to interact with that information.

7. Semantic Databases

Semantic databases are another breed of semantic applications focused on annotating web information to be more structured. Twine, a product of Radar Networks and currently in private beta, focuses on building a personal knowledge base. Twine works by absorbing unstructured content in various forms and building a personal database of people, companies, things, locations, etc. The content is sent to Twine via a bookmarklet, via email, or manually. The technology needs to evolve more, but one can see how such databases can be useful once the kinks are worked out. One of the very powerful applications that could be built on top of Twine, for example, is personalized search - a way to filter the results of any search engine based on a particular individual.

It is worth noting that Radar Networks has spent a lot of time getting the infrastructure right. The underlying representation is RDF and is ready to be consumed by other semantic web services. But a big chunk of the core algorithms, the ones that are dealing with entity extraction, are being commoditized by Semantic Web APIs. Reuters offers this as an API call, for example, and so moving forward, Twine won't need to be concerned with how to do that.

Another big player in the semantic databases space is a company called Metaweb, which created Freebase. In its present form, Freebase is just a fancier and more structured version of Wikipedia - with RDF inside and less information in total. The overall goal of Freebase, however, is to build a Wikipedia equivalent of the world's information. Such a database would be enormously powerful because it could be queried exactly - much like relational databases. So once again the promise is to build much better search.

But the problem is, how can Freebase keep up with the world? Google indexes the Internet daily and grows together with the web. Freebase currently allows editing of information by individuals and has bootstrapped by taking in parts of Wikipedia and other databases, but in order to scale this approach, it needs to perfect the art of continuously taking in unstructured information from the world, parsing it, and updating its database.

The problem of keeping up with the world is common to all database approaches, which are effectively silos. In the case of Twine, there needs to be a continuous influx of user data, and in the case of Freebase there needs to be an influx of data from the web. These problems are far from trivial and need to be solved successfully in order for the databases to be useful.

Conclusion

With any new technology it is important to define and classify things. The Semantic Web is offering an exciting promise: improved information discoverability, automation of complex searches, and innovative web browsing. Yet the Semantic Web means different things to different people. Indeed, its definitions in the enterprise and consumer spaces are different, and there are different means to a common end - top-down vs. bottom-up and microformats vs. RDF. In addition to these patterns, we are observing the rise of semantic APIs and contextual browsing tools. All of these are in their early days but hold a big promise to fundamentally change the way we interact with information on the web.

What do you think about Semantic Web Patterns? What trends are you seeing and which applications are you waiting for? And if you work with semantic technologies in the enterprise, please share your experiences with us in the comments below.

http://www.readwriteweb.com/archives/semantic_web_patterns_a_guide_redux.php Trends Fri, 26 Dec 2008 09:00:00 -0800 Alex Iskold

Hakia Relaunches With 'Credible Sites'

Semantic search engine Hakia announced a major redesign of its site today, including the addition of 'credible sites' to its search index. In order to create this index of trustworthy sites, Hakia is asking volunteers to submit credible, peer reviewed sources. Credible sites are currently limited to health and environmental topics, but Hakia is planning to expand this quickly. By adding these credible sources, Hakia wants to go beyond '10 blue links' and give its users an alternative to popularity driven approaches like Google's PageRank. Hakia has also added a 'Galleries' section, which is a structured directory of some of the most popular search topics.

Credible Sources

In order to create this index of credible and trustworthy sites, Hakia is relying on volunteers. Hakia is specifically recruiting librarians, though it seems anybody can sign up, which could potentially leave the site open to spammers. Hakia asks submitters for their professional credentials, but it is not clear if the company will actually check these.

Hakia uses a very strict definition for what makes a site credible. To be included in the index, a site should have gone through a peer review process, not have any commercial bias, and the information should be current. The fact that Hakia insists on only adding peer reviewed sites should greatly enhance the signal-to-noise ratio of the search results.

Great Structured Results

In our tests, we were often impressed by Hakia's ability to structure its regular search results. For 'Sarah Palin', for example, Hakia organizes the results by official websites, images, news, biography, awards, and speeches. A search for 'Portland, OR,' on the other hand, first displays general information about the city, images, transportation options, and restaurant guides.

All results now also feature images and user-generated content.

Whenever we tried to ask more general questions ("What is a blog?"), however, Hakia's results were often underwhelming and uneven. Sometimes we got results that were spot-on, while at other times, the results barely had anything to do with our query.

Hakia also introduced 'my hakia,' a personal start page which still looks a bit unfinished, but seems to rely on Hakia's expertise in structuring search results to give users more background information about current events.

Overall, we liked Hakia's updates and we are looking forward to the expansion of the 'credible sources' to other topics, as we were quite impressed with the results it returns already.

http://www.readwriteweb.com/archives/hakia_relaunches_with_credible.php News Mon, 06 Oct 2008 10:24:55 -0800 Frederic Lardinois

Do Semantic Search Companies Need a Semantic Map? It's All Semantics...

This week we reported that Cognition had announced "the largest commercially available Semantic Map of the English language." In our interview with Cognition CEO Scott Janus, we asked him to compare Cognition's technologies to those of other semantic search companies Hakia and Powerset. Janus pointed to their large Semantic Map as the main differentiator. Indeed he told us that semantic search companies "must include a comprehensive semantic map" to be successful.

Is this true? We sought a response from both Hakia and Microsoft-owned Powerset on this semantically charged question.

Cognition claims that its Semantic Map has over 10 million semantic connections, including "over 4 million semantic contexts (word meanings that create contexts for specific meanings of other related words)".

Hakia CEO Riza C. Berkan responded in the comments to the original article that "hakia is deploying Ontological Semantics (OntoSem)", which he described as "a network of concepts reflecting ontology." He went on to say that hakia covers "over [a] million words in English".

However, Berkan noted that the size of a Semantic Map does not necessarily matter: "the sheer size of the collection of words or concepts does not represent, by any means, the capability of the system." Hakia's position is that "there is no silver bullet for a semantic solution that will succeed", as long as the system developed is scalable and imposes "minimum reliance on 'words'".

Semantopoly: Advance token to nearest Semantic Context

At this point we were still confused. Cognition uses the term "semantic map" and said it was necessary to have. One of the commenters on the original post agreed with that assumption. Yet Hakia's Riza Berkan didn't use the term "semantic map". So we asked Hakia in a follow-up email, does it or does it not have a semantic map? Dr. Christian Hempelmann, Hakia's Chief Scientific Officer, responded:

"The term sometimes comes up in the context of data integration, but "Semantic map" is not a term used in linguistics. I can only speculate that it is what is commonly called an ontology. To the degree that they let us on about it in the documentation on their website, Cognition operates with only 2 main relations, much like WordNet: hyperonymy/hyponymy (e.g. cat is-a feline is-a mammal; their "taxonomy") and synonymy (e.g., "buy" means almost the same as "purchase"; their "thesaurus"). Furthermore, this map is not independent of English, cannot grow into other languages. hakia, on the other hand, has an ontology with many more relations, effectively raising our "semantic map" to the size of a higher power, and can and is already growing into other languages."

We also tried to get a comment from Powerset, but as of writing we haven't received it.

So, are we all clearer now on what is a Semantic Map, is it needed, and does size matter? Er, it depends. If you think you know the answers, tell us in the comments please!

http://www.readwriteweb.com/archives/do_semantic_search_companies_need_a_semantic_map.php Analysis Fri, 19 Sep 2008 15:05:28 -0800 Richard MacManus

Live Search: Powerset Integration Already Going Live

Microsoft only acquired the semantic search engine Powerset a little more than a month ago, but today, the Powerset team announced the first integration of its search technology into Microsoft's Live Search. Specifically, Live Search will now show better instant answers for queries like "San Francisco weather" and return better results based on Freebase and Wikipedia articles. Currently, these Powerset enhanced results will only appear for a random set of users, but over time, we assume that most of these features will be rolled out for everybody.

Powerset has also integrated xRank biographies into Live Search, which, at least for us, appeared in almost every related search. Live Search will also make use of Powerset's Factz engine to display better related searches.

It is encouraging to see that Microsoft has been able to integrate Powerset's technology into its own products this quickly. Live Search, which is far behind Google in terms of market share, needs exactly these kinds of features to make its search more relevant.

After the acquisition was announced, we wondered if a combination of Microsoft and Powerset could indeed beat Google. Judging from these first results of the Powerset integration, we can at least conclude that Microsoft will make a strong effort to beat Google in terms of search relevance. Whether this is enough to challenge Google's dominance remains to be seen.

http://www.readwriteweb.com/archives/live_search_powerset_integrati.php http://www.readwriteweb.com/archives/live_search_powerset_integrati.php Products Wed, 17 Sep 2008 10:26:50 -0800 Frederic Lardinois
Cognition Announces "World's Largest Semantic Map" Cognition Technologies, a Semantic Web company that specialises in Natural Language Processing (NLP) search, is today announcing the release of what it claims is "the largest commercially available Semantic Map of the English language." We interviewed Cognition CEO Scott Janus to find out what this means.

We also discovered that Cognition, which currently licenses its technology to other organizations, is planning to build a general consumer search engine - which will compete with Google and others.

What is a Semantic Map?

A Semantic Map is kind of like a dictionary, in that it's a representation of Cognition's ability to define things. Cognition claims that its Semantic Map has over 10 million semantic connections; over 4 million semantic contexts (word meanings that create contexts for specific meanings of other related words); over 536,000 word senses (word and phrase meanings); 75,000 concept classes (or synonym classes of word meanings); 7,500 nodes in the technology's ontology or classification scheme; and 506,000 word stems (roots of words) for the English language.

Image from Cognition

The company says that its Semantic Map "is more than double the size of any other computational linguistic dictionary for English".

Cognition Technologies has been working on its technology for 24 years, with a lot of input from lexicographers and linguists over that time. Because it has used a mix of algorithms and human input, Cognition has been able to discern relevancy, meaning, and synonymy. Scott Janus told us that one of Cognition's strengths is that it can disambiguate words and phrases, which Janus says differentiates it from the keyword and pattern matching algorithms of Google, Yahoo and others.

For example, Janus told us that Cognition's technology can find results even if direct words are not used - which he says Google can't do.

Cognition Plans General Search Engine

The comparisons to Google led us to ask the obvious question: does Cognition's semantic technology have a more general application? In other words, does Cognition plan to take on Google by creating a search engine for consumers? CEO Scott Janus replied that yes, they do plan to "one day offer search on the general web". However, he said that they need more capital funding to index the entire Web, put infrastructure in place, etc.

As of now Cognition will continue to license its semantic technology to verticals like law and health. Janus told us that Cognition is "good for complex content where lot of synonyms are used", so right now data-intensive industries are where it is aiming.

Cognition's current applications include legal (e.g. LexisNexis Concordance's case management), health (e.g. MEDLINE), and a semantically charged version of Wikipedia.

Image from Cognition

Cognition vs Powerset and Hakia

Two other Semantic search engines we've been tracking closely on ReadWriteWeb are Powerset and Hakia. We asked CEO Scott Janus what makes Cognition different from those two products.

In a nutshell, Janus says that its Semantic Map is bigger and better.

Specifically, he said that Powerset is actually "not so similar" to Cognition. According to Janus, Powerset does "parsing" - which it licensed from Xerox Parc. That is 20-25% of the solution, said Janus, but Powerset "doesn't have a good semantic map". Cognition went so far as to write a white paper (pdf) explaining why it thinks Powerset "misses the point".

As for Hakia, Janus said that as far as he can see Hakia is focused on "ontological classifications" - classifying words and concepts together. But he says Hakia doesn't have as full a semantic map as Cognition, so he thinks Cognition has "a better understanding" compared to Hakia.

In summary, Janus told us that semantic search companies "must include a comprehensive semantic map" to be successful. We're sure that Powerset and Hakia will have different opinions on what makes a successful semantic search company, but it does make for a good differentiator for Cognition.

Open Question

Tell us in the comments what you think of Cognition and whether you think it can compete with Google in the long run.

http://www.readwriteweb.com/archives/cognition_semantic_map.php http://www.readwriteweb.com/archives/cognition_semantic_map.php Semantic Web Tue, 16 Sep 2008 09:55:00 -0800 Richard MacManus
Weekly Wrapup, 30 June - 4 July 2008 It's time to review the week that was on ReadWriteWeb. On the product side we looked at Adobe's announcement of searchable Flash, checked in with online TV service Hulu, reviewed a couple of innovative new web apps (Gnip and Identi.ca) and reviewed Firefox's recent world record. On the trends side, we analyzed Microsoft's acquisition of semantic search company Powerset, looked into the latest Yahoo stats, asked if email is in danger, and reported on a new Mobile Web standards initiative.

Web Products

Adobe Makes Flash Searchable

For years the big problem with Flash-based websites has been that they could not be properly indexed by search engines. Flash websites have been favored by marketers and advertisers for a long time, because of the ability to create rich, interactive Web experiences. However, for most other businesses, particularly those with a lot of information on their website (let's face it, that's everyone except marketers and advertisers), Flash has been nearly an automatic 'no' for website development. That may be about to change.

Hulu To Earn Up to $90M In First Year....But It's Not A Success Story Yet!

To the average user, Hulu.com, the free web site that offers high-quality streams of TV shows and movies in the U.S., looks like a runaway success: the selection of available content is more than decent, Hulu's Collections make browsing related videos easy, HD videos have been made available, embed codes are provided for re-posting the videos on the web, and the site gets a good amount of traffic, too. In fact, Hulu's CEO reported in March that 5 million visitors watched videos on the site during the past 30 days while the service was still in beta, and that number has been increasing ever since.

Gnip: Grand Central Station for the Social Web

gniplogo.jpgPing, ping, ping! That's the sound made day and night by the new social media technologies rapidly proliferating around the web... and the machines are getting tired. Polling for updates to user data streams, wishing they spoke the same language and dreaming they knew which accounts belonged to the same people across different services. Sounds like a great opportunity for an infrastructure provider, doesn't it? Enter the sexiest infrastructure provider we've seen in a long time: Gnip. Venture funded and built by exited MyBlogLog co-founder Eric Marcoullier, Gnip wants to serve as the grand central station and universal translation service for the new social web.

Identi.ca: May A Million Twitters Bloom

idneticalogo.jpgIdenti.ca is a new microblogging service that launched this week, but it's not just another also-ran. The service is an Open Source, CreativeCommons framework for a distributed network of federated microblogging services. If you've become interested in the paradigm-changing model of communication popularized by Twitter but have been frustrated by Twitter's frequent downtime or other shortcomings, then Identi.ca could be for you.

It's Official: Firefox Downloads Set Guinness World Record

firefox-logo.pngWe already knew that Mozilla had a record breaking day on June 17th when Firefox 3 was downloaded close to 8 million times, despite the download site not working for at least part of the morning. Now, Mozilla has announced that Firefox 3 has indeed made it into the Guinness Book of World Records with 8,002,530 downloads. Mozilla had set itself a goal of only 5 million downloads.

See also: Mozilla Releases Weave 0.2: Filling in for Browser Sync

SEE MORE WEB PRODUCTS COVERAGE IN OUR PRODUCTS CATEGORY

Web Trends

Does Microsoft + Powerset Beat Google?

What can the plan be with Microsoft's purchase of hot startup Powerset? The 3-year old company, founded by Dr Barney Pell, recently launched a semantic search experience for Wikipedia. It is doubtful that Microsoft bought the company just to enhance Live Search. Possibly the plan is to replicate the Wikipedia solution, then incorporate Powerset into Internet Explorer. In this post we look at what the thinking behind the acquisition might be.

See also: Microsoft Releases Interop Docs: Is This What Data Portability Looks Like?

Yahoo Would be Just Fine Without Search

yahoologo6.jpgHitwise Intelligence took an interesting look at the breakdown of Yahoo's properties this week. They come to the conclusion that, even if Yahoo sells off its search division, Yahoo's other properties probably wouldn't be too affected by this, as they get most of their traffic from Google's search anyway. Only Yahoo Image Search, Games, Maps, and News get most of their traffic from Yahoo Search.

Is Email In Danger?

Human history is one of progressive improvement in communication. From the 20th century mail was a fundamental form of communication. The invention of electronic mail (email) changed two things. It became cheap to send mail, and delivery was instant. Email became favored for both corporate and personal communication. But email faces increasing competition. Chat, text messages, Twitter, social networks and even lifestreaming tools are chipping away at email usage. In this post we take a look at what's happening and assess if email is in danger.

Mobile Web To Get Standards

A group of mobile operators have just unveiled a new initiative they're calling "BONDI" whose goal is to encourage development of new mobile web applications while not compromising customers' security. BONDI was created by members of the OMTP (Open Mobile Terminal Platform), an industry group that includes participants from all parts of the mobile world and whose members include operators like AT&T, Hutchison 3G, Orange, Telecom Italia, Telefónica, Telenor, T-Mobile and Vodafone.

SEE MORE WEB TRENDS COVERAGE IN OUR TRENDS CATEGORY

That's a wrap for another week! Enjoy your weekend everyone.

http://www.readwriteweb.com/archives/weekly_wrapup_30_june-4_july_2008.php http://www.readwriteweb.com/archives/weekly_wrapup_30_june-4_july_2008.php Weekly Wrapups Sat, 05 Jul 2008 05:00:00 -0800 Richard MacManus
Does Microsoft + Powerset Beat Google? What can the plan be with Microsoft's purchase of hot startup Powerset? The 3-year old company, founded by Dr Barney Pell, recently launched a semantic search experience for Wikipedia.

It is doubtful that Microsoft bought the company just to enhance Live Search. Possibly the plan is to replicate the Wikipedia solution, then incorporate Powerset into Internet Explorer. In this post we look at what the thinking behind the acquisition might be.

Most initial reviews found the Powerset product release underwhelming. Critics appreciated the innovative semantic UI and recognized its potential, but believed it didn't vastly improve Wikipedia. So in view of the lukewarm reviews, the acquisition by Microsoft was unexpected. The $100M price tag is around 5x the $12M Series A + $8M investment put into the company. Microsoft execs must believe Powerset can be a weapon in its battle with Google.

What Powerset is today

Given a set of unstructured information, Powerset applies Natural Language Processing techniques to extract the key semantic concepts out of the text. It then builds a semantic index (similar to Google's) as well as a conceptual graph of relationships between entities. This graph is typically expressed in RDF triples.
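
To make the idea of a graph expressed as triples more concrete, here is a minimal sketch in Python. It is not Powerset's implementation: the `Triple` type, the `match` helper, and the predicate names are all hypothetical, and the facts are hand-written rather than extracted from text. It only shows how (subject, predicate, object) statements can be stored and queried with a wildcard.

```python
from collections import namedtuple

# A triple is one statement: subject --predicate--> object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hand-written illustrative facts (assumption: a real system would
# extract statements like these from text automatically).
triples = [
    Triple("Henry_VIII", "born_in_year", "1491"),
    Triple("Henry_VIII", "married", "Anne_Boleyn"),
    Triple("Anne_Boleyn", "mother_of", "Elizabeth_I"),
]

def match(pattern, store):
    """Return every triple that fits the pattern; None matches anything."""
    return [
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]

# "When was Henry VIII born?" becomes a pattern with one open slot.
born = match(("Henry_VIII", "born_in_year", None), triples)
print(born[0].object)
```

The same `match` call with a different open slot, for example `(None, "mother_of", "Elizabeth_I")`, walks the graph in the other direction.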

One of the Powerset innovations is surfacing of semantics to the user interface. The contextual gadget is overlaid to help navigate the unstructured information.

Many thought Powerset to be a generic semantic search engine, but its first product is limited to Wikipedia. It is not trivial to scale the technology to the entire web.

Why Powerset is Powerful

When semantic technologies emerged a few years ago, people started talking about how semantic web and/or semantic search might be a Google killer. The talk was supported by logic that semantic search can deliver more relevant results because it "knows" the content.

The industry has since realized that this isn't the case. Semantic search has no huge advantage over the statistical approach used by Google. We discussed this in the post Semantic Search - Myth and Reality.

What is powerful about Powerset? Precisely that it doesn't try to search the web as a whole. Right now, the solution works on Wikipedia, but the infrastructure is generic, so any other site could also be enhanced. The contextual outline developed can be used to navigate any content.

Instead of dealing with the whole web, the idea may be firstly to build solutions for specific sites.

Head-on with Google?

Powerset as it is today is no Google killer. At this point only something with huge traction and momentum would stand a chance.

In the search market, Google has a strong hold - potentially stronger if the Yahoo deal goes through. People are conditioned to Google: it's simple and, yes, imperfect, but it's good enough and the results are still better than Live Search.

If Microsoft bought Powerset with the goal to incorporate it into Live Search, then it's likely to be another acquisition to make little impact on the bottom line. In fact, the announcement on the Live Search blog states just that. The number one reason is acquiring talent; the second is the belief that NLP and semantic algorithms will be able to patch holes in today's search.

Today Powerset brings only interesting technology; it doesn't bring traction. So what were they thinking up in Redmond? There may be a more subtle play, leveraging the fact that Powerset works well on knowledge sets like Wikipedia.

Possibly Microsoft plans to deploy Powerset across its own sites, then perhaps incorporate Powerset into Internet Explorer.

Imagine going to Wikipedia and having a semantic overlay on each page. Now imagine scaling this experience across major information sources around the web.

Providing contextual, semantic experience allows Microsoft to retain eyes longer, shaving off the time people spend searching Google.

This is an important point because Google doesn't make money on search - it makes money on advertising.

Can Microsoft ever beat Google in Advertising?

The real problem Microsoft is seeking to solve is advertising. Until now the web has figured out two fundamentals for advertising - portals and search.

Portals show ads on each page; the more people browse the content, the more ads are shown and the more money is made. The search model emerged as an alternative, now more successful, path to advertising dollars.

With Powerset and other semantic technologies, there's another model: contextual information exploration overlaid on existing content.

If Microsoft can figure out how to keep eyes off Google's home page, the game will shift dramatically. The browser is one of Microsoft's most powerful tools - and the default box is Live Search.

If Microsoft wants to win over advertisers, it might just do more with the browser. Incorporating aspects of Powerset's semantic navigator into the browser by default could be a game changer. This is not a straightforward play. A large company with bureaucracy and execution problems is unlikely to be able to merge semantics into the browser quickly and elegantly.

Conclusion

The Powerset acquisition is an interesting move by Microsoft. This hot semantic startup was on everyone's radar.

What can the plan be? It is doubtful that Microsoft bought the company just to enhance Live Search. Possibly the plan is to replicate the Wikipedia solution, then incorporate Powerset into Internet Explorer.

That is a bold play requiring exact execution - not the kind Redmond has shown lately.

What do you think Microsoft is going to do with Powerset? What are the other applications of this technology that you can think of?

http://www.readwriteweb.com/archives/does_microsoft_powerset_beat_google.php http://www.readwriteweb.com/archives/does_microsoft_powerset_beat_google.php Analysis Thu, 03 Jul 2008 01:39:30 -0800 Alex Iskold
Confirmed: Microsoft Acquires Powerset pset-livesearch.pngWe wrote about Microsoft possibly acquiring semantic search engine Powerset just a few days ago when it was still a rumor. Today, both Microsoft and Powerset have confirmed that they have reached a deal. When rumors about this acquisition first appeared, the price for Powerset was supposed to be somewhere around $100 Million, though neither company has disclosed the final price so far.

In a statement about the acquisition, Powerset says that it needed a bigger partner to expand its product beyond its current state of only searching Wikipedia - something we had speculated about when the rumors of the acquisition first appeared. In its own statement, Microsoft stresses how useful Powerset's technology will be for improving Microsoft's own search products and to "take Search to the next level."

So far, none of the larger search engines have been able to capitalize on the promises of semantic search. Most of the innovations in the space so far have come from small start-ups and even those never made any real inroads in terms of market share when compared to the keyword driven search engines of Google, Ask, Yahoo, and Microsoft.

Powerset's technology might just give Microsoft the ability to differentiate its Live Search product from the competition.

http://www.readwriteweb.com/archives/microsoft_acquires_powerset.php http://www.readwriteweb.com/archives/microsoft_acquires_powerset.php News Tue, 01 Jul 2008 13:50:25 -0800 Frederic Lardinois
Rumor: Microsoft to Acquire Powerset for $100 Million

Venturebeat reports that Microsoft might be close to acquiring the San Francisco based semantic search engine Powerset for about $100 Million. No announcement has been made yet by either party. We contacted Microsoft, but did not get an answer beyond "Microsoft does not comment on rumors or speculation." We will update this post once we receive more information.

Rumors about Microsoft's interest in Powerset had been swirling around the Valley since last month, when Dan Farber first brought up the possibility in a post on CNet.

The consumer-facing side of Powerset currently only searches Wikipedia articles, but Microsoft is most likely more interested in using the underlying technology for its own search products like Live Search. Powerset's specialty is providing answers through natural language queries like "When was Henry VIII born?" Powerset licensed this technology from Xerox PARC.

Having backing from Microsoft could help the small company to expand beyond Wikipedia and start indexing more of the Internet. Powerset's technology is still unproven to work well for anything but Wikipedia, but if Powerset does manage to scale beyond this, then it would allow users to by-pass Google's keyword driven search in favor of just getting a direct answer to a large number of their questions.

live.png

Microsoft's search products have struggled to gain any ground back from Google's search. Currently, Google has almost a 70% share of the search market, while MSN/Live Search has about 9.5%.

Powerset's capabilities have generally received very positive reviews, and in his original piece on this, Dan Farber already argued that Powerset's ability to create connections between concepts, relationships, and meanings could give it an edge over Google's keyword and PageRank driven search.

We first reviewed Powerset vs. Google in May and at the time, Josh Catone's impression wasn't quite as positive and he concluded that "Powerset doesn't do a markedly better job of finding answers than Google for most queries."

Powerset was funded in a $12.5 Million Series A round by Foundation Capital, Founders Fund and various angel investors.

For a more in-depth look at the state of semantic search in general, see also Alex Iskold's article on the myth and reality of semantic search.

http://www.readwriteweb.com/archives/rumor_microsoft_powerset.php http://www.readwriteweb.com/archives/rumor_microsoft_powerset.php News Thu, 26 Jun 2008 15:35:33 -0800 Frederic Lardinois
Evri Beta Launches: Search Less - Understand More

Evri, a Paul Allen backed semantic search engine, is launching into a limited beta tonight. Evri was first shown publicly at the D6 conference. Evri's CEO Neil Roseman likes to talk about Evri in terms of organizing content instead of calling it a search engine. At its core, however, Evri definitely is a search engine, though it adds a very sophisticated semantic layer on top of its results that emphasizes the relationships between different search terms.

In its early stages, Evri is only going to start out with a limited set of results and possible search terms, based on what it considers to be the most popular terms and people. This approach of starting with only the most popular terms is reminiscent of Mahalo. However, unlike Mahalo, which relies on paid editors and volunteers to create its results, Evri completely relies on its algorithms to create connections between people, products, concepts, and events.

Evri especially prides itself on having developed a system that can distinguish between grammatical elements such as subjects, verbs, and objects to create these connections. In his demo at D6, Roseman described the system as being similar to "an army of 7th grade grammar students graphing the Web."

evri-screen.png

Evri is entering in direct competition with a number of recent entries to the semantic search market, especially Powerset and Hakia. Powerset, however, only indexes Wikipedia articles, while Hakia tries to index all of the web, but focuses less on the relationships between objects and more on providing highly organized results for a given term.

You can sign up for invites to Evri on their homepage. The first wave of users should be receiving invites tonight.

For a more in-depth look at the state of semantic search, see also Alex Iskold's article on the myth and reality of semantic search.

http://www.readwriteweb.com/archives/evri_beta_launches_search_less.php http://www.readwriteweb.com/archives/evri_beta_launches_search_less.php News Tue, 24 Jun 2008 21:01:00 -0800 Frederic Lardinois
Semantic Search: The Myth and Reality For a few years now people have been talking about semantic search. Any technology that stands a chance to dethrone Google is of great interest to all of us, particularly one that takes advantage of long-awaited and much-hyped semantic technologies. But no matter how much progress has been made, most of us are still underwhelmed by the results. In head-to-head comparisons with Google, the results have not come out much different. What are we doing wrong?

For example, when asked, What is the capital of France? both approaches come back with the correct answer - Paris. Also, a lot of queries that we are used to typing into Google in abbreviated form come back with similar results if we type them using natural language. Clearly something is off. We all know that semantic technologies are powerful, but how and why? In this post we will show that the problem is that we are asking the wrong questions.

The mistake is that semantic search engines present us with a Google-like search box and allow us to enter free-form queries. So we type the things that we are used to asking - primitive queries. It never occurs to us to type in What actor starred in both Pulp Fiction and Saturday Night Fever? or What two US Senators received donations from a foreign entity? We type simple questions, but this is not where the power of semantic search lies. Let's look at the spectrum of semantic technologies from Google, to SearchMonkey, to Powerset, and Freebase to understand what is going on.

What Problem Are We Trying to Solve?

The first confusion in the space comes from the fact that semantic search is being positioned as the answer to all possible problems - from modern search, currently dominated by Google, to problems that are computationally impossible. The situation is made more difficult by the fact that right now there is only a thin range of problems where semantic search can clearly do better. This range is complex queries involving inferencing and reasoning over a complex data set.

As shown in the diagram above, basic queries are easily handled by Google. Sadly, natural language processing gives little advantage when it comes to this category of problems. Google correctly answers the question about Leonardo Da Vinci's birthday, leaving no opportunity to improve the search by understanding the nouns and the verbs that the user typed in.

Before looking at the problems that are perfect for semantic search, let's look at the hardest problems. These are computationally challenging problems that really have nothing to do with understanding semantics. The misconception has been perpetuated since the early days of the Semantic Web that somehow, because we will annotate the web, we will be able to solve these super complex problems. This is simply not true. There are fundamental limits to what we can compute, and a class of problems that have an exponential number of possible solutions is not going to be magically solved because we represent data as RDF.

The good news is that there is a set of problems that are great for semantic search. These are the problems we have been solving so wonderfully with relational databases. Way too often we forget that semantic technologies are here to help us represent relational data spread over the entire web - so it should be no surprise to us that it is relational queries that semantic search engines would excel at.
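
As a rough illustration of what a relational query looks like, the question about Pulp Fiction and Saturday Night Fever from earlier in this post can be written as a self-join. This is only a sketch under stated assumptions: the `appearances` table, its columns, and the handful of rows are hypothetical, and it uses Python's built-in `sqlite3` module rather than anything a semantic search engine actually runs.

```python
import sqlite3

# In-memory database with a hypothetical actor/film table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appearances (actor TEXT, film TEXT)")
conn.executemany(
    "INSERT INTO appearances VALUES (?, ?)",
    [
        ("John Travolta", "Pulp Fiction"),
        ("John Travolta", "Saturday Night Fever"),
        ("Samuel L. Jackson", "Pulp Fiction"),
        ("Karen Lynn Gorney", "Saturday Night Fever"),
    ],
)

# Self-join: the same actor must appear once for each film.
rows = conn.execute(
    """
    SELECT a.actor
    FROM appearances AS a
    JOIN appearances AS b ON a.actor = b.actor
    WHERE a.film = 'Pulp Fiction'
      AND b.film = 'Saturday Night Fever'
    """
).fetchall()

actors = [row[0] for row in rows]
print(actors)
```

A keyword engine has to hope that some page happens to state the answer; the join derives it from two separate facts, which is the kind of inference this section is describing.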

The Spectrum of Semantic Search Players

But semantic search is not just about the questions that we are asking. Because the web is just a bunch of unstructured HTML pages, semantic search is also about the underlying data. At its most structured extreme we find Freebase - the semantic database of everything. Freebase is accessible via free text search, but more importantly via MQL (Metaweb Query Language). MQL is essentially JSON with wildcards. Using it you can construct any query against Freebase and the result will be the same query with answers filled in.
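
The "same query with answers filled in" idea can be sketched in a few lines of Python. This is a toy stand-in, not the real MQL engine or the Freebase API: the `fill` and `run` helpers, the record layout, and the property names are assumptions made for illustration, with `None` playing the role of the wildcard.

```python
def fill(query, record):
    """Return the query with its None slots filled from record,
    or None if the record does not satisfy the fixed values."""
    result = {}
    for key, wanted in query.items():
        if key not in record:
            return None
        if wanted is None:
            result[key] = record[key]   # wildcard: copy the answer in
        elif record[key] != wanted:
            return None                 # fixed value does not match
        else:
            result[key] = wanted
    return result

def run(query, records):
    """Apply the query to every record and keep the filled-in matches."""
    filled = (fill(query, record) for record in records)
    return [f for f in filled if f is not None]

# Hypothetical miniature data set.
films = [
    {"type": "film", "name": "Pulp Fiction", "director": "Quentin Tarantino"},
    {"type": "film", "name": "Saturday Night Fever", "director": "John Badham"},
]

# The question "who directed Pulp Fiction?" as a template with one open slot.
query = {"type": "film", "name": "Pulp Fiction", "director": None}
answers = run(query, films)
print(answers)
```

The result has exactly the shape of the query, which is what makes this style of language easy to generate and consume from programs.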

Powerset, in a way, is just a relational database. It operates against certain, structured information. On the other end of the spectrum is Google, which is all about statistical frequencies and very little semantics. The recently launched SearchMonkey from Yahoo! is an interesting twist. It does not add anything to the result set, but instead uses semantic annotations to present a richer, more interactive and useful user interface.

Companies like Hakia and Powerset are probably working the hardest. These companies are trying to simultaneously build Freebase-like structures on the fly and then do natural language queries on top of them. The difference is that Hakia is using (likely similar) technology to query over the entire web, while Powerset has (probably shrewdly) chosen to restrict the search to Wikipedia.

Are Hakia, Powerset and Freebase All That Different?

This analysis brings up a question - which of these technologies are different and which are essentially the same? Let's get the easy one down first. Yahoo!'s SearchMonkey is no different from Google or any other search, as far as the core search technology is concerned. The difference is simply in the presentation layer. SearchMonkey is smart about creating a better user experience by letting publishers present the search results to the users in the best possible way.

But when it comes to Hakia, Powerset and Freebase the situation is much more complicated. On the surface all these products are different - Hakia lets you search the whole web, Powerset is restricted to Wikipedia (and Freebase!) and Freebase itself has two search interfaces - the search box and query language. Here is the problem - the natural language interface has nothing to do with the underlying data representation.

The fact is that all of these semantic search technologies allow people to type in arbitrarily complex questions and then interpret these queries and execute them against their databases. Fundamentally, Hakia, Powerset, and Freebase are databases. Fundamentally, all of them have some kind of Natural Language Processing that translates the question into a canonical query over the database.
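
A deliberately tiny sketch of that translation step is shown below. It is nowhere near real Natural Language Processing: the two regular-expression patterns, the `translate` function, and the property names in the output are all hypothetical. It only illustrates the shape of the pipeline, where a free-form question goes in and an unambiguous structured query (with `None` marking the slot to be answered) comes out.

```python
import re

# Each entry pairs a question pattern with a builder for the canonical query.
# None marks the slot the database is expected to fill in.
PATTERNS = [
    (
        re.compile(r"who directed (?P<name>.+?)\??", re.IGNORECASE),
        lambda m: {"type": "film", "name": m.group("name").strip(), "director": None},
    ),
    (
        re.compile(r"when was (?P<name>.+?) born\??", re.IGNORECASE),
        lambda m: {"type": "person", "name": m.group("name").strip(), "date_of_birth": None},
    ),
]

def translate(question):
    """Map a question to a canonical query dict, or None if unrecognized."""
    text = question.strip()
    for pattern, build in PATTERNS:
        found = pattern.fullmatch(text)
        if found:
            return build(found)
    return None

print(translate("Who directed Pulp Fiction?"))
print(translate("When was Henry VIII born?"))
```

Everything after this step is ordinary database work, which is why the interesting differences between these products lie in how the data is gathered and how results are presented.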

To gain insight into all of this, think about Freebase and its query language MQL. Unlike natural language, which allows all sorts of constructs, MQL is non-ambiguous. This JSON-like language allows users to construct precise statements against Freebase. The fact that Powerset allows natural language queries does not mean that inside Powerset there is no database. For sure, though, there is a similar kind of database as there is beneath the Freebase search box. What is really different about Freebase and Powerset is the data gathering approach and user experience.

Back to the Future: It's All About UI

Probably the most striking revelation about the semantic search space is User Interface. First, to go on a tangent, Powerset got it right by realizing that semantics needs to be surfaced in the UI. After a user searches Powerset, a contextual gadget, aware of the semantics of the results, helps the user complete the search experience.

Yet the biggest mistake that I think Powerset is making is also in the UI. The search box that everyone is familiar with via traditional web search engines needs to go. Having a simplistic search interface hurts Powerset and Hakia, and to a lesser extent Freebase, which is not positioning itself as generic search.

Think about the recent launch of Powerset. The company released a vastly better way to interact with one of the most important sources of information on the web - Wikipedia. But what did the critics say? Let's see if this is a Google killer. And the answer to that is "no."

But what if Powerset restricted what can be searched? What if instead of a search box there was another interface, or what if they told users not to look up things that they can find easily on Google? Why is it that new companies are expected to improve on the algorithm that has ruled the web for over a decade? Instead, the expectation should really be to solve the problems that cannot be solved by Google today.

Conclusion

Semantic search is an upcoming technology that has set expectations way too high. We have all been misled into thinking that these technologies are here to dethrone Google by delivering better search results. Neither of those things is true. What is true, however, is that semantic search is going to be big and it is going to help us answer questions that we simply cannot answer today - complex, inferencing queries asked over the entire web as if it were a database.

In order for these semantic search technologies to make a dent in the market, they need to clean up their messaging and, most importantly, their user interface. Presenting a search box is both misleading and detrimental, as people associate it with the simplistic questions that Google solves without any problems. To really showcase semantic search, these companies need to come up with innovative UIs that will help users to understand the power that is being put at their fingertips.

As always, please tell us what you think. What should semantic search companies do to gain their place in the marketplace?

http://www.readwriteweb.com/archives/semantic_search_the_myth_and_reality.php http://www.readwriteweb.com/archives/semantic_search_the_myth_and_reality.php Trends Thu, 29 May 2008 14:15:01 -0800 Alex Iskold
Powerset vs. Google: The Completely Premature Head-to-Head As our network blog AltSearchEngines reported this morning, the long-awaited and much-hyped natural language processing search engine Powerset launched today. Kind of. For now, the search service only uses Wikipedia and Freebase as source material for answers to your query. So it's not really fair to compare it to Google yet, but this is a search engine, and that means it will always be held to the gold standard set by the market leader.

Comparing the two is tricky, since Google searches the entire web and Powerset only processes two sites. The admittedly not very scientific method that we came up with was to compare a handful of searches on Powerset to the results for the same query on Google restricted to "site:wikipedia.org."
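
For anyone who wants to repeat the Google half of that comparison, here is a minimal sketch in Python. It is our own illustration, not a tool used for this post: the helper name `site_restricted_query` is hypothetical, and it assumes Google's standard `q` URL parameter and `site:` operator. It only builds the search URLs and does not fetch or score any results.

```python
from urllib.parse import urlencode

# The four questions used in this informal test.
QUERIES = [
    "Who invented dental floss?",
    "What is the capital of France?",
    "Where is Paris?",
    "Who is Joey Tribbiani?",
]


def site_restricted_query(question, site="wikipedia.org"):
    """Build a Google search URL limited to one site (hypothetical helper)."""
    # Appending "site:<domain>" to the query restricts results to that domain.
    params = {"q": f"{question} site:{site}"}
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    for question in QUERIES:
        print(site_restricted_query(question))
```

Dropping the `site:` term from the query gives the "Google at large" variant of each search.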

Powerset does some interesting things with general queries, such as displaying "Factz," which is an ontology showing various concepts related to your query and how they relate to one another, or "Dossiers," which are a summary of key information about your query. Sometimes it yields some odd results (such as this query for "ants" for which the key finding is that ants are "a fictional race from the video game Crash Twinsanity.") However, the real promise of NLP search engines, in our opinion, is that users will be able to make search queries using natural language -- or in other words, by asking a question. So we chose a few questions at random -- things we knew Wikipedia would have answers for -- and threw them at both Powerset and Google.

Query: Who invented dental floss?

Powerset's answer for this query was curious. The number one result comes from the Wikipedia entry for dental floss and highlights this line: "It was around this time, however, that Dr. Charles C. Bass developed nylon floss." Charles Bass, however, is not the correct answer. Earlier in the same article is this line, "Levi Spear Parmly, a dentist from New Orleans, is credited with inventing the first form of dental floss." Why didn't Powerset find it? Its second result, which comes from a Wikipedia entry on scientific achievements from the year 1815, correctly highlights Parmly as the inventor.

Google performed poorly for this query. The same 1815 article is identified in the sixth spot on the results, with the sentence mentioning Levi Spear Parmly highlighted, but the first few results aren't even close. Even though that's not as impressive as Powerset's results, both would require a user to click through to the article to verify the answer (because Powerset returned two different answers), and is scrolling to the 6th spot really that taxing? Taxing enough to make you switch to a new search engine? Interestingly, this query set loose on all of Google does quite well, returning the correct answer in a link to a trivia site in the first result.

Query: What is the capital of France?

Not surprisingly, both Google and Powerset nail this one. Both point to the Wikipedia entry on Paris, France in the number one spot with the sentence, "Paris is the capital of France" highlighted.

Query: Where is Paris?

This is a fundamentally more challenging query, because there are a large number of cities and towns called "Paris" in the world. And not surprisingly, neither search engine gives what we would call a "perfect" result.

Both return the article on Paris, France first. On Google, that's followed by a handful of other articles about the city and one about Paris, Tennessee. On Powerset, the second article is about Paris Hilton -- um? -- followed by one about Paris, Texas, and in fourth place the most helpful article it could have returned, the disambiguation page on Wikipedia for Paris. (Oddly, with the question mark, the query returned "Paris, Missouri" from Freebase, and without the question mark it returned "Paris, Texas.")

On Google at large, the results focus almost exclusively on Paris, France.

It would seem that both search engines generally understand that "where is Paris" means that Paris is a place (though upon reflection, perhaps we could have been searching for the location of Paris Hilton...), but neither recognizes very well that it could mean any number of different places.

Query: Who is Joey Tribbiani?

Both Powerset and Google correctly call up the article about this fictional character in their first spot, but Google actually does a better job of highlighting who he is. Compare:

  • Google: After the 2003/2004 final season of Friends, Joey Tribbiani became the main character of Joey, a spin-off TV series, where he moved to L.A. to polish his ...
  • Powerset: In the end of the series, Joey was the only Friend that ended up without a lover or a spouse, even though he is the one that dated the most women. ... Joey becomes good friends with an attractive female attorney named Alex, who, along with her husband, a travelling [sic] musician named Eric, is Joey's landlord.

Google has the names of both shows in which the character appears in its excerpt, while Powerset's excerpt is made up of information about the series that only someone who already knew the character would understand (without clicking through to read the full article) -- and it doesn't differentiate between the two: before the ellipsis the excerpt is talking about "Friends" and after it is talking about "Joey."

Google at large also finds the Wikipedia article first with the same excerpt -- it also finds clips of the show on YouTube, and the actor's (Matt LeBlanc) IMDB entry, as well as the official site for the spin-off "Joey."

Conclusion

This was really just a very quick and informal test, and we barely put Powerset through its paces. But our first snap impressions are that Powerset doesn't do a markedly better job of finding answers than Google for most queries. Some might argue that we didn't play to Powerset's strengths and frame our queries properly, or search for things obscure enough to notice any differentiation. But the promise of natural language search is that people don't have to learn how to search -- they can just ask questions as they normally would. We also can't expect that everything they're going to look for will be obscure and hard to find via traditional search engines -- more often than not, they probably won't be.

Powerset will have an immense uphill battle to make any sort of dent in the search market. Google controls 67% of searches in the US, and the top 4 search engines make up about 98% of searches. If Google remains "good enough," Powerset will have a hard time convincing people to switch. It will be easier to make a judgment about the company's future as a real Google competitor once it is crawling more than two sites, however.

What do you think about Powerset? Impressed? Not impressed? Let us know in the comments below.

http://www.readwriteweb.com/archives/powerset_vs_google.php http://www.readwriteweb.com/archives/powerset_vs_google.php Search Services Mon, 12 May 2008 14:32:50 -0800 Josh Catone
Report: Social Media Challenging Traditional Media Universal McCann has released a new report on the impact of social media (such as blogs, social networks, online video) on the media landscape. It surveyed 17,000 Internet users worldwide in March 2008. The report found that social media, in particular blogs, are "becoming a more important part of global media consumption for internet users than some traditional media channels." The report also found that social media is a global phenomenon (29 countries were surveyed), although there are cultural differences in how people use it.

The report states that "video clips, blogs, podcasts, social networks and RSS are all essential components of the online media diet." Here are some of the key findings:

- 83% watch video clips, up from 62% in the last study in June 2007
- 78% read blogs, up from 66%
- 57% of internet users are now members of a social network
- RSS consumption is growing rapidly, up from 15% to 39%
- Podcasts are now mainstream digital content, listened to by 48%

Social networks have been "a key driver for the growth of social media":

- 22% of social network users have installed a widget or application
- 55% have shared photos
- 22% have shared their videos
- 31% have started a blog
- The world’s biggest social network is MySpace with 32% weekly reach followed by Facebook on 23%

The report also states that social media is a global phenomenon:

- Top markets for blogging – China 70% of internet users write a blog, Philippines 66% and Mexico 60%
- Top markets for social networking – Philippines 83%, Hungary 76% and Poland 76%
- China is the world's largest blogging market with 42m bloggers versus 26m in the US

Those last stats will be an eye-opener for many, because the US web tech market gets most of the attention of the blogosphere and mainstream media. But with China having 42m bloggers compared to the US's 26m, there is large scope for social media to flourish there - despite China's political issues with social media.

http://www.readwriteweb.com/archives/report_social_media_challenging_traditional_media.php http://www.readwriteweb.com/archives/report_social_media_challenging_traditional_media.php Trends Mon, 28 Apr 2008 13:23:15 -0800 Richard MacManus
5 Places Your Opinion Counts - Debate Site Roundup While you're waiting for The Great Debaters to come out on DVD in a couple of weeks, there are a few places where you can put in some debate practice online in the meantime. One of the great things about writing a blog is that it is a platform for voicing your opinions. But it can also be rewarding to hear from the opposing side, and one thing we do often on this blog is ask for your views (as we did last week on the topic of video comments, for example). Below are 5 sites that organize debates around any topic.

CreateDebate is the newest debate site to hit the web. It moved from private beta to public late this morning and offers an extremely slick interface for online debate. Debates on CreateDebate can take multiple forms. They can be open ended questions, such as "Who had the best NFL draft?" or they can be head-to-head debates, such as "Is drug abuse a criminal or health problem, Yes or No?"

Users can vote in two-sided debates and add arguments in each. Arguments are voted up or down Reddit-style with the top arguments displayed at the top of the page. Users can also add rebuttals to arguments which can be further voted upon. Debates that are time sensitive (such as "Who will win the Democratic nomination for president?") can be set to expire. CreateDebate can also be used for simple yes/no polling on non-contentious issues.

One unique feature of CreateDebate is that each debate has a "research" page that pulls in news from RSS or Atom feeds. Whoever creates the debate can add new sources to the research page and news stories can be automatically made into the focal point of a new debate.

Riled Up! is a simpler debate site that uses the head-to-head format. Debaters are asked simple yes or no, or X vs. Y questions and asked to support a side. Choose wisely, because once you've picked your side, you can't go back.

Similar to CreateDebate, users vote arguments up and down and can post rebuttals, which can be tagged as supporting, neutral, or opposing.

Wis.dm is really a question and answer site that many have compared to Yahoo! Answers, but because it favors yes/no questions, it is actually more akin to the debate sites here. Wis.dm is set up very simply: someone asks a yes/no question, users vote, and people debate the answer in an unthreaded discussion forum below the question.

The free-form nature of the actual debate makes it a bit harder to follow everyone's position than on more polished debate sites, but Wis.dm is easily the most used of the sites in the roundup. Its simplicity makes it very approachable and probably contributes to its mainstream appeal.

outQuib is a social network focused on debate and discussion that we reviewed in January. Debates on the site take the form of a poll with multiple responses and forum-style commenting. But the focus of outQuib is really the social aspect -- debates are used as a means of connecting like-minded people who can form groups on the site.

Jyte is a product of JanRain, makers of MyOpenID, and I get the idea that it is really more of a proving ground for their OpenID products than it is a serious startup. Jyte allows people to make claims (like, "Tiger Woods is the best pro golfer of all time.") and then people can vote to agree or disagree.

Users can also add comments to the debate (arguments for or against) and give each other "cred" points in areas they think a particular user is especially credible -- though it appears that cred points don't really amount to much other than bragging rights.

Conclusion

With the US presidential election kicking into high gear over the summer and coming to a conclusion next fall (barring any repeat of what happened in 2000), debate sites can probably expect to see a bump in traffic as people head online in search of places to argue their opinions. Which of the sites above is your favorite? Did we miss any? Let us know in the comments below.

http://www.readwriteweb.com/archives/online_debate_sites.php http://www.readwriteweb.com/archives/online_debate_sites.php Products Mon, 28 Apr 2008 12:03:47 -0800 Josh Catone