This is a guest post by Nitin Karandikar, author of the Software Abstractions blog.
Recently I was looking at the log files for my blog, as I regularly do, and I was suddenly struck by the variety of search queries in Google from which users were being referred to my posts. I write often about the different varieties of search - including vertical search, parametric search, semantic search, and so on - so users with queries about search often land on my blog. But do they always find what they're looking for?
All the major search engines currently rely on the proximity of keywords and search terms to match results. But that approach can be misleading, causing the search engine to systematically produce incorrect results under certain conditions.
To demonstrate, let us take a look at three general use cases.
[Note: The examples given below are all drawn from Google. To be fair, all the major search engines use similar algorithms, and all suffer from similar problems. For its part, Google handles billions of queries every day, usually very competently. As the reigning market leader, though, Google is the obvious target - it goes with the territory!]
1. Difficulty in Finding Long Tail Results
Take Britney Spears. Given the current popularity of articles, news, pictures, and videos of the superstar singer, the results for practically any query with the word "spears" in it will be loaded with matches about her - especially if the search involves television or entertainment in any way.
Let's say you're watching the movie Zulu and you start wondering what material the large spears that all the extras are waving about are made of. So, you go to Google and type in "movie spears material" - this is an obviously insufficient description, as the screen shot below shows.

What happens if you expand on the query further - say: "what are movie spears made out of?" - does it help?

The general issue here is that articles about very popular subjects accumulate high levels of PageRank and then totally overwhelm long tail results. This makes it very difficult for a user to find information about unusual topics that happen to lie near these subjects (at least based on keywords).
2. Keyword Ordering
Since the major search engines focus only on the proximity of keywords without context, a user search that's similar to a popular concept gets swamped with those results, even if the order of keywords in the query has been reversed. For example, a tragic occurrence that's common in modern life is that of a bicycle getting hit by a car. Much less common is the possibility of a car getting hit by a bicycle, although it does happen. How would you search for the latter? Try typing "car hit by bicycle" into Google; here's a screen shot of what you get. [Note the third result, which is actually relevant to this search!]
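As a rough illustration of why word order gets lost, here is a toy model in which a query is reduced to an unordered bag of keywords. This is an assumption made for the sake of the example, not a description of how Google actually ranks pages; the function name `bag_of_words` and the stopword list are made up for this sketch.

```python
from collections import Counter

# Hypothetical stopword list for this sketch; real engines use their own.
STOPWORDS = {"a", "by", "the", "of"}


def bag_of_words(query: str) -> Counter:
    """Reduce a query to an unordered multiset of lowercase keywords."""
    return Counter(w for w in query.lower().split() if w not in STOPWORDS)


common = bag_of_words("bicycle hit by car")
rare = bag_of_words("car hit by bicycle")

# Both queries collapse to the same bag, so a matcher that only looks at
# keywords cannot tell which vehicle did the hitting.
print(common == rare)  # prints True
```

Under this simplification the rare query and the common one are indistinguishable, which is consistent with the popular results crowding out the unusual ones.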

3. Keyword Relationships
Since the major search engines focus only on the keywords in the search phrase, all sense of the relationship between the search terms is lost. For example, users commonly change the meaning of search terms by using negations and prepositions; it is also fairly common to look for the less common members of a set.
This takes us into the realm of natural language processing (NLP). Without NLP, the nuances of these query modifications are totally invisible to the search algorithms.
For example, a query such as "Famous science fiction writers other than Isaac Asimov" is doomed to failure. A screen shot of this search in Google is presented below. Most of the returned results are about Isaac Asimov, even when the user is explicitly trying to exclude him from the list of authors found.
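A minimal sketch of the effect, assuming a scorer that simply counts shared keywords: the two one-line "documents" below are invented for illustration, and `keyword_score` is a hypothetical helper, not any search engine's real ranking function.

```python
def keyword_score(query: str, document: str) -> int:
    """Count how many distinct query words also appear in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return len(query_words & document_words)


query = "famous science fiction writers other than isaac asimov"

# Made-up sample documents, used only to demonstrate the scoring.
docs = {
    "asimov_page": "isaac asimov was one of the most famous science fiction writers",
    "clarke_page": "arthur c clarke was a famous science fiction writer",
}

# Rank documents by keyword overlap, highest first.
ranked = sorted(docs, key=lambda name: keyword_score(query, docs[name]), reverse=True)
print(ranked)
```

Because "isaac" and "asimov" are counted as ordinary keywords, the page about the author the user wants to exclude scores higher than the page about a different author. The words "other than" contribute nothing that changes the ranking.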

All of the searches shown above look like gimmicks - queries designed intentionally to mislead Google's search algorithms. And in a sense, they are; these specific queries can be easily fixed by tweaking the search engine. Nevertheless, they do point to a real need: the value of understanding the meaning behind both the query and the content indexed.
That's where the concept of semantic search comes in. I attended a media event earlier this year at stealth search startup Powerset (see: Powerset is Not a Google-killer!), at which they showcased a live demo of their search engine, currently in closed alpha, that highlighted solutions to exactly this type of issue.
For example, type "What was said about Jesus" into a major search engine, and you usually get a whole list of results that consist of the teachings of Jesus; this means that the search engine entirely missed the concepts of passive voice and "about." The Powerset results, on the other hand, were consistently on target (for the demo, anyway!).
In other words, when you look at just the keywords in the query, you don't really understand what the user is looking for; by looking at them within context, by taking into account the qualifiers, the prepositions, the negatives, and other such nuances, you can create a semantic graph of the query. The same case can be made for semantic parsing of the content indexed. Put the two together, as Powerset does, and you can get a much better feel for relevance of results.
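To make the idea of reading qualifiers concrete, here is a deliberately tiny sketch that recognizes a single qualifier, "other than", and separates what the user wants from what the user wants excluded. It is a hand-written rule for one phrase, nowhere near a semantic graph and not Powerset's method; `split_exclusion` is a hypothetical name chosen for this example.

```python
import re
from typing import Optional, Tuple


def split_exclusion(query: str) -> Tuple[str, Optional[str]]:
    """Split 'X other than Y' into (X, Y); return (query, None) otherwise."""
    match = re.match(r"^(.*?)\s+other than\s+(.+)$", query, flags=re.IGNORECASE)
    if match is None:
        return query, None
    return match.group(1), match.group(2)


wanted, excluded = split_exclusion(
    "Famous science fiction writers other than Isaac Asimov"
)
print(wanted)    # the part the user is looking for
print(excluded)  # the part the user wants left out
```

Even a crude rule like this recovers the user's intent for this one query. The hard part, of course, is handling the open-ended variety of qualifiers, prepositions, and negatives that real users type.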
What about Google? I'm sure the smart folks in Google's search-quality team are busily working on this problem as well. I look forward to the time when the major search engines handle long tail queries more accurately and make search a better experience for all of us.
Update: for an expanded version of this article with real-life user queries, see my blog.
Comments
Hi,
For the search : "Famous science fiction writers other than Isaac Asimov"
of course, you know that you should have typed, with the brackets
"famous science fiction writers" -Asimov
where "-" replaces "other than".
Everything is explained in the advanced search, that is for searches that are not giving "correct results"...
John :)
Posted by: John | January 9, 2008 1:55 AM
John makes a good point. Using advanced search options would solve that problem.
The bicycle problem is solved by using the terms:
bicycle hits car
Of course, from a business perspective, I guess you have to figure out some way to deal with the fact that most people are not going to do advanced searches or learn any kind of keyword discipline on their own.
I think the push to develop natural language search engines makes sense but it also seems to be leading to a place where folks don't actually develop any knowledge or tools for understanding information search and retrieval.
Posted by: Clyde Smith | January 9, 2008 2:37 AM
Another good and actually simpler example of the second problem is the query "police state". Even if you search for this as an exact query, you will get results with "state police". Actually the problem is worse on Yahoo than on Google, where "state police" results first appear only on the second results page. However, it is pretty clear that there is still a lot to improve in search...
Posted by: Endre Jofoldi | January 9, 2008 2:47 AM
“The general issue here is that articles about very popular subjects accumulate high levels of PageRank and then totally overwhelm long tail results.”
Yes, but -britney overcomes that, if you insist that you need to search for the word “spears.” The bigger issue is the assumption that the movie spears are made out of something fake. Well, what are spears made out of, period?
http://www.google.com/search?hl=en&q=how+do+you+make+a+spear
http://www.google.com/search?hl=en&q=how+do+you+make+a+zulu+spear
There are some answers, and Britney doesn’t come up once. Still, doesn’t give what you want. If I had more time, I’d try some searches involving props and other terms. Frankly, I’m not sure the answer is actually out there.
On the car thing, you can control word order at least within quotes:
http://www.google.com/search?q=%22bike%20hit%20car%22
http://www.google.com/search?hl=en&q=%22cyclist+hit+car%22&
That brought up one relevant match. After doing some other searches, I suspect the bigger issue is simply that it doesn’t happen that much.
Having said that:
http://www.google.com/search?hl=en&q=hit+car+on+bike
Actually gets some correct results, word order understood. But I agree, there have been many occasions when I wish I could indicate that a word order should be followed without resorting to quotes. Maybe someday. It shouldn’t be a hard command to implement, and I’m pretty sure other search engines offered it in the past.
On authors, I can easily solve your problem:
http://www.google.com/search?q=famous+science+fiction+authors
Pick whatever list you want. Asimov will be on it. So what. The majority of those listed are exactly what you want, authors who are famous and are NOT Asimov.
Posted by: Danny Sullivan | January 9, 2008 5:52 AM
Good - the first commenter got to it before I did. Each Google search in this article could be improved by using the slightly advanced search features Google offers, like phrase searching (inserting quote marks around phrases), etc.
You say "all sense of the relationship between the search terms is lost" - but with simple quotes around multiple keywords (i.e., "science fiction writers" instead of science fiction writers without the quotes), you suddenly DO provide a powerful relationship between the search terms - one of proximity!
Or, you could just get one of those sexy librarians this blog has mentioned lately to do the search for you :-)
Posted by: david lee king | January 9, 2008 6:08 AM
All this semantic search stuff is utterly useless, and I will explain why -
The only people not adept enough to do the kinds of "semantic" search examples Nitin presents are people who would fall under the less technically adept category. I speculate that this demographic (probably older people, technophobes, uneducated people, etc) will increasingly *cease to matter*, especially as the newer, younger generations will all be 'fluent' in current-day web navigation and search methods.
Google-esque search is entrenched in the new wave of Internet users, and even if a GREAT semantic search engine is ever brought to light, it will be too little and too late.
Posted by: Paul Capestany | January 9, 2008 6:47 AM
These types of examples, and there can be many more, just go to show how incredibly complex search is. Even better use of natural language in search is still not going to solve them, although hopefully it will improve things.
Compare it with learning another language. I have friends from non-English-speaking countries who have reasonable English, but it is still their second language. It is so easy and common to have misunderstandings between us, because of the subtleties of language and the importance of context in providing meaning.
Another example: if you were sitting around with a group of friends discussing movies, Hollywood or similar and 'Paris Hilton' came up, as a human you can expect it to mean the person. If a week later you were discussing travelling to Europe and said 'Paris Hilton' a person would guess you were interested in a Hilton hotel in the city of Paris. But how to get a computer to work out the difference?
What then if you were talking about fashion - where potentially you could be referring to either the person or the city hotel.
Searchers also have to do their part in thinking about what terms they enter, and using more advanced search techniques to get themselves better results.
Posted by: Robert | January 9, 2008 6:55 AM
"Natural language processing" has been a real holy grail for many years, as much as real A.I.
I don't see it coming yet, though, but I'm pretty sure that Google, with its buying power and intellectual potential, is researching it heavily, as are Microsoft, IBM, and any other player you care to name.
Reasonably cheap processing power will also help, but I'm afraid the killer algorithm is still far from making it to market, so the search engines are far from perfect as well, not to mention that the internet articles they are based on, like us human beings, are far from perfect either :)
Posted by: Esdee | January 9, 2008 7:33 AM
Many people overthink searches. Use natural language. For instance, the phrase "cyclist hits car" returns many on-target articles:
Halifax, The Daily News:
Cyclist hits car A cyclist was sent to hospital after a collision with a car yesterday afternoon. Shortly after 1:30 p.m., the cyclist and a vehicle were ...
Cyclist hits car and pedestrian in Blackpool...
Dangerous cyclist collides with car minutes after hitting pedestrian on pavement while cycling carelessly or aggressively at Lytham Road and Highfield Road
Use tricks like the synonymous search using the tilde.
For example, if you want to build a treehouse in your back yard try:
treehouse ~build
This gives you sites including plans, building treehouses and designs.
I would hope that many students today have the advantage of learning search techniques from a qualified teacher-librarian and will leave school knowing how to search like a librarian!
Posted by: Lesley | January 9, 2008 2:52 PM
Those of us who are online business owners down in the long tail will appreciate the day that the search engines get better...I think!
Posted by: Don Jones | January 9, 2008 5:30 PM
For the folks who pointed out the advanced search options in Google:
[@John, @David Lee, @Paul, ...]
Yes, you're right - I realize that Google has search options that allow you to do exactly some of these things. In the future, I wouldn't be surprised if there is an API that allows increasingly complex/nested power searches. But my point is that most users don't know or care to use them; nor do most people use a librarian for a majority of their day-to-day searches.
In real life, users tend to use the simplest possible terms for search. In fact, it's the current mainstream search engines that have trained searchers to use "keyword-ese".
The shorter version of this article posted here makes it look like I'm just picking on Google! Actually, I started by analyzing real search queries used by actual users to get to my blog. (See the link at the bottom of the post for the expanded article.)
@Endre: Terrific example - I like it!
@Robert: Again, excellent example! The meaning of your search term changes completely with the context.
@Danny:
Great comment.
On the Britney Spears issue: I agree, if you play around with the searches long enough, you can probably figure out a way to get at the information, essentially by excluding/negating the keywords that cause the popular search results. But that just seems to bolster my argument that long-tail search results are overwhelmed by the mega-popular ones, such as "spears".
In this case, that's easy to change to "spear", but users may not always have that luxury, say if the word was jargon or a proper noun.
On authors - I'm looking for authors who are NOT Isaac Asimov. This is a trivial example, but in this case, the search engine is definitely not solving my problem.
Don't get me wrong - I think PageRank is a tremendous step forward in search. I'm just arguing that if the search engine could also recognize modifiers, prepositions et al and use that information to improve the relevance of results, that would be yet another step forward.
Posted by: NitinK | January 10, 2008 1:19 AM
Hi again Nitin,
when you write in 11:
"But my point is that most users don't know or care to use them; nor do most people use a librarian for a majority of their day-to-day searches"
"In real life, users tend to use the simplest possible terms for search. "
That means that when "most users" are googling for "SPEARS", they want results about "Britney Spears".
The person "watching the movie Zulu and starting wondering what material the large spears that all the extras are waving about are made of" is NOT the average person... ;)
Posted by: John | January 10, 2008 8:12 AM
The real average person is just eating the popcorn not thinking!
Best
John
According to Google researchers, only 5% of users ever use quotes, by far the most popular advanced feature. The use of other advanced features is much, much lower. So, those 95% of users who might possibly be people who "fall under the less technically adept category [who] will increasingly *cease to matter*" currently do constitute 95% of the market. 95%. By the time the "newer, younger generations [who] will all be 'fluent' in current-day web navigation and search methods" constitute the bulk of the market, search methods that depend on PageRank and obscure notation will have been replaced by newer methods.
Posted by: Elder | January 10, 2008 11:18 AM