natural language processing - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/natural language processing en Copyright 2009 Richard MacManus readwriteweb@gmail.com Tue, 24 Nov 2009 12:40:23 -0800 http://www.sixapart.com/movabletype/?v=4.23-en http://blogs.law.harvard.edu/tech/rss

Hakia Announces Semantic API

Semantic search engine Hakia today announced a set of APIs that opens up their natural language processing and search platform to developers. Hakia's Syndication Web Services come in two parts: search queries, which allow developers to add web search functionality leveraging Hakia's five billion page index, and XML feed calls, which give developers access to Hakia's underlying natural language processing technology. The latter of the two is clearly the more compelling of the offerings.

Mobile video firm Berggi released Berggi Search, a mobile search application that lets users search Hakia's index via the API from mobile phones. Berggi is leveraging the part of Hakia's API that lets developers lean on the company's search platform -- that, however, is not the part that really interests us.

More interesting are the XML feed calls that give access to Hakia's underlying NLP engine. Right now, only the "Summarizer" element is available. Summarizer, which Hakia says can be used to suggest tags or abstracts, analyzes and extracts meaning from large blocks of text or the contents of URLs. Other elements that are not yet available are Categorizer, which identifies "categorical phrases" in text; Characterizer, which "identifies and expands descriptive keywords or tags"; and Text Meaning Representation.

Hakia has an XML testing form up on their Club Hakia page, and in our testing it seemed a little rough around the edges. Compared to our testing of Open Calais from Reuters (our coverage), the summaries and tags that the form returned using the Summarizer element weren't very impressive. Mostly, it seemed to just return the headline or first sentence as the summary for articles we threw at it. And for RWW articles, Hakia Summarizer suggested as tags the same tags that we had entered by hand in Movable Type.

Hakia's Syndication Web Services are free for up to 30,000 requests per day for search services (unlimited free queries for Quotes and Cartoons), and free for up to 1,000 requests per day for XML feed calls. Have you had a chance to play with Hakia's new semantic API? If so, what did you think? How does it compare to Calais or Semantic Hacker? Let us know in the comments below.
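For developers planning around those free-tier limits, a client could keep its own daily count rather than wait to be refused. Here is a minimal sketch in Python. The two numbers come from the limits quoted above; the service names `search` and `xml_feed`, the `DailyQuota` class and the reset-at-midnight behavior are assumptions made for illustration, not anything Hakia documents.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional

# Free daily limits quoted in the article. The key names are hypothetical.
DAILY_LIMITS = {"search": 30_000, "xml_feed": 1_000}


@dataclass
class DailyQuota:
    """Client-side request counter that resets when the calendar day changes."""

    limits: Dict[str, int] = field(default_factory=lambda: dict(DAILY_LIMITS))
    day: date = field(default_factory=date.today)
    used: Dict[str, int] = field(default_factory=dict)

    def try_consume(self, service: str, today: Optional[date] = None) -> bool:
        """Record one request; return False if the free limit is already used up."""
        today = today or date.today()
        if today != self.day:
            # New day: start counting from zero again (assumed reset rule).
            self.day = today
            self.used = {}
        count = self.used.get(service, 0)
        if count >= self.limits[service]:
            return False
        self.used[service] = count + 1
        return True


quota = DailyQuota()
if quota.try_consume("xml_feed"):
    print("OK to send a Summarizer request")
else:
    print("Free XML feed quota used up for today")
```

The counter is kept per service because the two limits differ by a factor of 30, so heavy use of one should not block the other.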

Full Disclosure: Occasional ReadWriteWeb contributor Emre Sokullu is a technology evangelist at Hakia.

http://www.readwriteweb.com/archives/hakia_announces_semantic_api.php Semantic Web Thu, 19 Jun 2008 12:56:42 -0800 Josh Catone
Powerset vs. Google: The Completely Premature Head-to-Head

As our network blog AltSearchEngines reported this morning, the long-awaited and much-hyped natural language processing search engine Powerset launched today. Kind of. For now, the search service only uses Wikipedia and Freebase as source material for answers to your query. So it's not really fair to compare it to Google yet, but this is a search engine, and that means it will always be held to the gold standard set by the market leader.

Comparing the two is tricky, since Google searches the entire web and Powerset only processes two sites. The admittedly not very scientific method that we came up with was to compare a handful of searches on Powerset to the results for the same queries on Google, restricted with the "site:wikipedia.org" operator.
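The Google half of that method is easy to reproduce. The Python sketch below builds the site-restricted query strings for the four questions used in this test. The helper names are made up for illustration, and the URL uses Google's ordinary public search address with its `q` parameter.

```python
from urllib.parse import urlencode

# The four test questions from this comparison.
QUESTIONS = [
    "Who invented dental floss?",
    "What is the capital of France?",
    "Where is Paris?",
    "Who is Joey Tribbiani?",
]


def site_restricted_query(question: str, site: str = "wikipedia.org") -> str:
    """Append Google's site: operator so results come from a single domain."""
    return f"{question} site:{site}"


def google_search_url(question: str, site: str = "wikipedia.org") -> str:
    """Build a search URL for the restricted query (hypothetical helper)."""
    params = urlencode({"q": site_restricted_query(question, site)})
    return "https://www.google.com/search?" + params


for question in QUESTIONS:
    print(google_search_url(question))
```

Leaving the `site:` suffix off gives the "Google at large" variant.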

Powerset does some interesting things with general queries, such as displaying "Factz," which is an ontology showing various concepts related to your query and how they relate to one another, or "Dossiers," which are summaries of key information about your query. Sometimes it yields odd results (such as a query for "ants," for which the key finding is that ants are "a fictional race from the video game Crash Twinsanity"). However, the real promise of NLP search engines, in our opinion, is that users will be able to make search queries using natural language -- or in other words, by asking a question. So we chose a few questions at random -- things we knew Wikipedia would have answers for -- and threw them at both Powerset and Google.

Query: Who invented dental floss?

Powerset's answer for this query was curious. The number one result comes from the Wikipedia entry for dental floss and highlights this line: "It was around this time, however, that Dr. Charles C. Bass developed nylon floss." Charles Bass, however, is not the correct answer. Earlier in the same article is this line: "Levi Spear Parmly, a dentist from New Orleans, is credited with inventing the first form of dental floss." Why didn't Powerset find it? Its second result, which comes from a Wikipedia entry on scientific achievements from the year 1815, correctly highlights Parmly as the inventor.

Google performed poorly for this query. The same 1815 article is identified in the sixth spot on the results, with the sentence mentioning Levi Spear Parmly highlighted, but the first few results aren't even close. Even though that's not as impressive as Powerset's results, both would require a user to click through to the article to verify the answer (because Powerset returned two different answers), and is scrolling to the sixth spot really that taxing? Taxing enough to make you switch to a new search engine? Interestingly, this query set loose on all of Google does quite well, returning the correct answer in a link to a trivia site in the first result.

Query: What is the capital of France?

Not surprisingly, both Google and Powerset nail this one. Both point to the Wikipedia entry on Paris, France in the number one spot with the sentence, "Paris is the capital of France" highlighted.

Query: Where is Paris?

This is a fundamentally more challenging query, because there are a large number of cities and towns called "Paris" in the world. And not surprisingly, neither search engine gives what we would call a "perfect" result.

Both return the article on Paris, France first. On Google, that's followed by a handful of other articles about the city and one about Paris, Tennessee. On Powerset, the second article is about Paris Hilton -- um? -- followed by one about Paris, Texas, and in fourth place the most helpful article it could have returned, the disambiguation page on Wikipedia for Paris. (Oddly, with the question mark, the query returned "Paris, Missouri" from Freebase, and without the question mark it returned "Paris, Texas.")

On Google at large, the results focus almost exclusively on Paris, France.

It would seem that both search engines generally understand that "where is Paris" means that Paris is a place (though upon reflection, perhaps we could have been searching for the location of Paris Hilton...), but neither recognizes very well that it could mean any number of different places.

Query: Who is Joey Tribbiani?

Both Powerset and Google correctly call up the article about this fictional character in their first spot, but Google actually does a better job of highlighting who he is. Compare:

  • Google: After the 2003/2004 final season of Friends, Joey Tribbiani became the main character of Joey, a spin-off TV series, where he moved to L.A. to polish his ...
  • Powerset: In the end of the series, Joey was the only Friend that ended up without a lover or a spouse, even though he is the one that dated the most women. ... Joey becomes good friends with an attractive female attorney named Alex, who, along with her husband, a travelling [sic] musician named Eric, is Joey's landlord.

Google has the names of both shows in which the character appears in its excerpt, while Powerset's excerpt is made up of information about the series that only someone who already knew the character would understand (without clicking through to read the full article) -- and it doesn't differentiate between the two: before the ellipsis the excerpt is talking about "Friends," and after it the excerpt is talking about "Joey."

Google at large also finds the Wikipedia article first with the same excerpt -- it also finds clips of the show on YouTube, and the actor's (Matt LeBlanc) IMDb entry, as well as the official site for the spin-off "Joey."

Conclusion

This was really just a very quick and informal test, and we barely put Powerset through its paces. But our first snap impressions are that Powerset doesn't do a markedly better job of finding answers than Google for most queries. Some might argue that we didn't play to Powerset's strengths and frame our queries properly, or search for things obscure enough to notice any differentiation. But the promise of natural language search is that people don't have to learn how to search -- they can just ask questions as they normally would. We also can't expect that everything they're going to look for will be obscure and hard to find via traditional search engines -- more often than not, they probably won't be.

Powerset will have an immense uphill battle to make any sort of dent in the search market. Google controls 67% of searches in the US, and the top 4 search engines make up about 98% of searches. If Google remains "good enough," Powerset will have a hard time convincing people to switch. It will be easier to make a judgment about the company's future as a real Google competitor once it is crawling more than two sites, however.
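To put those two figures in perspective, here is the quick arithmetic as a few lines of Python. Only the 67% and 98% figures come from the numbers quoted above; the variable names are ours, and since the second figure is approximate, so are the results.

```python
# US search share figures quoted above, in percent.
GOOGLE_SHARE = 67
TOP_FOUR_SHARE = 98  # "about 98%", so the results below are approximate

# Share held by the three other leading engines combined.
rest_of_top_four = TOP_FOUR_SHARE - GOOGLE_SHARE

# Share left for every engine outside the top four.
everyone_else = 100 - TOP_FOUR_SHARE

print(f"Other three top engines combined: ~{rest_of_top_four}%")
print(f"Everyone outside the top four:    ~{everyone_else}%")
```

In other words, a newcomer either takes searches away from one of the four leaders or competes for the couple of percent left over.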

What do you think about Powerset? Impressed? Not impressed? Let us know in the comments below.

http://www.readwriteweb.com/archives/powerset_vs_google.php Search Services Mon, 12 May 2008 14:32:50 -0800 Josh Catone
BooRah: I Could Give up Yelp For This

BooRah is a semantic and natural language processing aggregator of restaurant reviews. The service pulls in reviews from numerous review sites and a substantial list of restaurant review blogs, then analyzes the emotional tone of the reviews it finds. Good reviews ("Rahs") and bad reviews ("Boohs") are collected concerning food, service and ambience.

It's a small but interesting site and the basic premise here is something that could be expanded beyond restaurants alone, something the company says it intends to do. I like it a lot.

Headquartered in Mountain View, CA, the company launched with information gleaned from over a half million online restaurant reviews in San Francisco, Los Angeles and New York. Last week it expanded to include a total of 20 cities, though information can be found on the site about restaurants almost anywhere in the US and in some cities internationally. The company says it is adding in-depth coverage of about one city a week, and it is now powering restaurant reviews on the directory site AmericanTowns.

BooRah uses affiliate services to display menus, make reservations and offer big discounts for restaurants in a long list of cities. These added features are a very nice touch, especially the menu display from AllMenus.com.

Semantic Analysis

The reviews that get processed are found by semantic analysis that picks out food blogs from among the 100,000 blogs being indexed. That number could be bigger, but it's unclear what percentage of those indexed blogs are in fact food blogs.

Inside the review excerpts you'll find food terms, like a particular dish, identified and linked out to a search results page displaying that same item in the same location you're currently looking at. That's really nice, so if I'm reading a review that says some place's dolmas are alright but aren't the best in town - I'm one click on the word dolmas away from finding out where in town is said to have better ones. Yelp lets you search for terms in a city of course, but making it one click automatically is nice.

I wrote a review this morning and the parsing is a little funky. The key term in my review is "raw," which should be discernible since the culinary category is "organic." Instead, BooRah pulls out a link to "cooked stuff" for searching. That's the opposite of what a user would want in this admittedly niche case. Food, like many other niche topics, needs strong long-tail analysis - doesn't it? Maybe it's unrealistic to expect semantic analysis to be strong in outlying, long-tail use cases - perhaps full text search à la Google is going to serve said user better. I hope not, though.

Yelp doesn't do a lot of what BooRah does. The final bit of semantics I found on the site was a "semantic cloud" for selected cities. That gives you a good idea what kinds of foods and issues people are talking the most about for a given location and lets you click through to read those reviews.
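As a rough illustration of what a cloud like that boils down to, the Python sketch below counts the most frequent terms across a handful of reviews for one city. It uses plain word frequency with a tiny stopword list, which is an assumption made for the sake of the example; BooRah's semantic analysis is surely more involved, and the sample reviews and the `term_cloud` function are invented.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# A deliberately tiny stopword list, just enough for the sample text.
STOPWORDS = {"the", "and", "but", "was", "were", "for", "with", "that", "this"}


def term_cloud(reviews: Iterable[str], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return the top_n most frequent terms across all reviews."""
    counts: Counter = Counter()
    for text in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        # Skip stopwords and very short words such as "in" or "a".
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(top_n)


# Invented sample reviews for one city.
sample_reviews = [
    "The dolmas were great and the service was quick.",
    "Best dolmas in town, but slow service.",
    "Great patio, average dolmas.",
]

for term, count in term_cloud(sample_reviews, top_n=3):
    print(term, count)
```

In a real cloud each term would link through to the reviews that mention it, as described above.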

Further Differentiation

The site searches for reviews across a lot of different sources, depending on the location. Yelp is not included, which is a real shame, but sites like CitySearch, Yahoo Travel, Tripadvisor and many more are included. In some locations the local newspaper website is included in review sources. You can easily filter between sources or choose to just look at food review blogs.

Reviews can also be written on the BooRah site itself. When you sign up for an account you're prompted to select between 3 different charities; presumably ad revenue you generate will be shared with those charities. That's a nice touch. I don't see Yelp doing that, do you?

RSS feeds for new reviews of restaurants in a particular city? I'll subscribe to that! I'd like to have some more granular control of such a feed: new reviews, new restaurants or new restaurants with 3 or more reviews. Yelp has pretty limited RSS feeds.

Finally, the Boohs and the Rahs are probably the biggest differentiator here. It is hard for systems like this to recognize things like sarcasm or other peculiarities of human communication - but BooRah seems to be doing a fairly good job in the little bit that I looked around it. I really like the way it pulls out emotive quotes from reviews. My initial skepticism has subsided, but I'll be keeping a close eye on this feature as I use the site more.

Seeing positive and negative reviews around three different parts of a restaurant (food, service and ambience) really is far better than just seeing a number of stars. This method of displaying reviews scales for the individual user, far better than stars and full text reviews do.
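That three-part display can be pictured as a simple tally per aspect. In the Python sketch below, each extracted quote is assumed to arrive already labeled with an aspect and a positive or negative tone; the labeling itself is the hard part and is not shown. The `Mention` type, the function names and the sample quotes are all invented for illustration and say nothing about how BooRah is actually built.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple

# The three parts of a restaurant that reviews are grouped under.
ASPECTS = ("food", "service", "ambience")


class Mention(NamedTuple):
    """One emotive quote pulled from a review (hypothetical structure)."""
    aspect: str
    positive: bool
    quote: str


def tally(mentions: Iterable[Mention]) -> Dict[str, Counter]:
    """Count Rahs (positive) and Boohs (negative) for each aspect."""
    counts: Dict[str, Counter] = {aspect: Counter() for aspect in ASPECTS}
    for mention in mentions:
        counts[mention.aspect]["rah" if mention.positive else "booh"] += 1
    return counts


def render(counts: Dict[str, Counter]) -> str:
    """Format the tally as one line per aspect."""
    return "\n".join(
        f"{aspect}: {counts[aspect]['rah']} rahs, {counts[aspect]['booh']} boohs"
        for aspect in ASPECTS
    )


# Invented sample data.
sample = [
    Mention("food", True, "the dolmas are wonderful"),
    Mention("food", True, "best raw menu around"),
    Mention("food", False, "dessert was a letdown"),
    Mention("service", False, "we waited forty minutes"),
    Mention("ambience", True, "lovely patio"),
]

print(render(tally(sample)))
```

Keeping the counts separate per aspect is what lets a reader see at a glance that a place has, say, loved food and slow service, which a single star rating would blur together.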

The Down Sides

BooRah has been around for a little while but it still feels like its database could be better fleshed out. The user experience is very good, but (for example) the slideshow viewer is broken right now. I don't know about on the iPhone, but on Windows Mobile the site is effectively unusable for me. That's a real shame, as Yelp Mobile is fantastic.

Not including Yelp in the reviews being indexed seems like a pretty big downside. Maybe most of the world doesn't need to read the musings of the yuppie restaurant-philanderer 2.0 crowd, but as one of those myself - I like Yelp reviews. At the same time, it is nice to read what the rest of the world has to say too. In fact, I'm going to try using BooRah instead of Yelp for a while - when I'm at home on my laptop at least.

Shortcomings aside, the combination of semantic indexing and natural language sentiment-processing is a very interesting one. I look forward to BooRah getting better and bringing the same strategy and feature-richness to other niche topics.

Disclosure: I have a consulting relationship with a somewhat related, still-unlaunched, service provider.

http://www.readwriteweb.com/archives/boorah_semantic_restaurant_reviews.php Reviews Mon, 28 Apr 2008 10:24:57 -0800 Marshall Kirkpatrick