Written by Alex Iskold
Earlier this week we wrote about The Race to beat Google. In that article we discussed the various approaches that startups are taking to unseat the web giant. In this post we zoom in on one of those companies, Clusty, and its search clustering technology. Before looking at the specifics of Clusty, we will discuss the issues with search at large and give an overview of clustering.
What is perfect search?
It is interesting to ask: what do we expect when we enter a term into a search box? Ideally, we'd like to get the perfect answer right away. Often we have an idea of what that perfect answer should be, and when the computer does not produce it we are disappointed. But are we being reasonable? Can we expect the "perfect" answer all the time?
Consider, for example, our interactions with an information clerk at the mall. When we ask for the location of a store, she may or may not give us the "perfect" answer. She might not know where the store is, she might not understand us, or we might not understand what she said. So for many reasons we may not get the "perfect" answer right away.
What is qualitatively different between our experience with the information clerk and with a search engine is that with the clerk we have a dialog. When she does not understand what we asked, she has a chance to say, "Excuse me, what do you mean?" Google does not do that; it just gives us the results. If we do not like the answer, we have to start from scratch.
The problem is that human interactions are fundamentally iterative, while our interactions with computers are mostly stateless. Perhaps we could get to the perfect search results if we could have a dialog with the computer? Clustering technologies, particularly the one offered by Clusty, give the computer a chance to clarify: "Excuse me, when you searched for Alex Iskold, did you mean to look for Read/Write Web or AdaptiveBlue, or perhaps you were looking for static analysis tools that Alex worked on while at IBM?"
What is clustering?
Clusters are a very common phenomenon both in nature and in human society. Examples of clusters include cities, galaxies, families and, of course, web sites focused on a similar topic. At its core, clustering is simply grouping by similarity. A good visual way to think about clustering web sites is to picture a network, like the one shown below.

The image above is from Bradley Huffaker Research
There are many clustering techniques, and the exact ones used by Clusty or other search engines are certainly a secret. Here, however, is a simplistic view of how clustering works. Each web page is run through a statistical frequency analyzer that outputs a list of the most commonly occurring words and phrases. Each word and phrase then becomes a node in the network.
When two words occur in the same document, a link between them is formed. If the two words co-occur again, the weight of the link between them is increased. This process is repeated iteratively over billions of web pages. The result is a network or, more mathematically speaking, a weighted graph. Since some words gravitate to each other more than others, this weighted graph will be clustered.
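The simplistic view above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not Clusty's method (which is secret): the function names, the tiny stopword list, the choice of the top three terms per page, and the rule that a link needs a weight of at least 2 to count are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = frozenset({"the", "a", "of", "and", "is", "to", "in"})

def top_terms(text, k=3):
    """Frequency analyzer: the k most common non-stopword terms of a page."""
    words = [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]

def cooccurrence_graph(documents, k=3):
    """Weighted graph: weight = number of pages where both terms are top terms."""
    weights = Counter()
    for doc in documents:
        terms = sorted(set(top_terms(doc, k)))
        for a, b in combinations(terms, 2):
            weights[(a, b)] += 1
    return weights

def clusters(weights, min_weight=2):
    """Group terms joined by links of weight >= min_weight (connected components)."""
    neighbors = {}
    for (a, b), w in weights.items():
        if w >= min_weight:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    seen, result = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbors[node] - component)
        seen |= component
        result.append(sorted(component))
    return result

# Made-up "pages" for illustration only.
pages = [
    "search engine search engine google",
    "search engine clusty search engine",
    "tennis ball tennis ball spring",
    "tennis ball racket tennis ball",
]
print(clusters(cooccurrence_graph(pages)))
```

On these four made-up pages, "search" and "engine" end up in one group and "tennis" and "ball" in another, because only those pairs co-occur more than once. A real engine would need far better term extraction and a proper graph clustering algorithm.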
The Basic Web Search with Clusty
It is remarkable that the clusters formed in this way capture meaning. For example, pages where Alex Iskold is the founder of AdaptiveBlue will be distinct from the pages where Alex Iskold is described as a Read/WriteWeb contributor. Clusty takes advantage of this and uses clusters to refine the search. Every time we perform a search, Clusty pulls together data from other engines like Ask, MSN and Wisenut. It then organizes the search results in a way that helps us navigate away from ambiguity towards a specific cluster of results:

The clusters appear in the left navigation bar, while the main results are shown in the center section. The clustering performed by Clusty is hierarchical, so within each cluster there are sub-clusters that the user can drill into. This is a good idea because it allows the user to further refine the results. As the user clicks on a link, the results in the main section reload. All of this speaks in Clusty's favor, but there are also things that need to be improved.
First, the names of the clusters need to be normalized. For example, when I drill into AdaptiveBlue, under the results for Alex Iskold, I get sub-clusters such as "Demo", "Interview, CTO" and "Tagged, Technorati", which is not intuitive. This may not be an easy thing to fix, but as is, it's just very difficult to understand. Another issue is rather cosmetic, but it also has a negative impact on the user experience: pages reload every time the user clicks on a cluster link. Using Ajax here would make the experience much more pleasant.
Beyond the basic search
Clusty's technology is generic: it lets the user perform vertical searches for Blogs, Images, News, Jobs and Wikipedia. In addition to using the same representation of results, they all have a feature called Find in a cluster. This is essentially a secondary search, which allows the user to slice the results by another criterion. I particularly like the implementation here, which highlights the matching clusters:
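To illustrate the idea of a secondary search over clusters, here is a minimal sketch in Python. We do not know how Clusty implements Find in a cluster; the function name, the data layout (cluster name mapped to a list of result texts) and the plain substring match are our own assumptions.

```python
def find_in_clusters(clusters, query):
    """Return the names of clusters with at least one result mentioning the query."""
    needle = query.lower()
    return [
        name
        for name, results in clusters.items()
        if any(needle in text.lower() for text in results)
    ]

# Made-up clusters for illustration only.
example = {
    "AdaptiveBlue": [
        "Alex Iskold is the founder and CEO of AdaptiveBlue",
        "AdaptiveBlue demo",
    ],
    "Read/WriteWeb": ["Alex Iskold writes about search for Read/WriteWeb"],
    "IBM": ["Static analysis tools built at IBM"],
}
print(find_in_clusters(example, "founder"))
```

The names returned are the clusters a user interface would highlight; the results themselves stay grouped as before.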

Another thing that we found interesting is the application of clustering to building tag clouds. In 2006 we saw a lot of sites offering tag clouds to help users navigate through popular topics. Clusty applied its technology to generating a cloud that can be used on any site:

Does Clusty have traction?
Clusty's technology is certainly interesting, but is it popular? The company has been around for a while, but has not really been able to sway people away from Google. Clusty's Alexa rank is slightly above 5,000 now, but a quick comparison with Snap over the last year does not point to a bright future:

Conclusion
So what do we make of this company and the clustering approach overall? We think that the approach has potential if done well, and that Clusty is on the right track. Being able to "have a dialog" with a computer by drilling into a subset of results is a good idea. However, the current implementation of Clusty needs to be polished before people will be willing to spend more time with it. So in principle this can work, but Vivisimo, the company behind Clusty, needs to figure out how to make it flawless.
Comments
I don't use Clusty, but here's a superficial comment in their favor--they have an awesome logo.
Posted by: Adam Jusko | January 5, 2007 11:36 AM
I prefer Quintura. Their UI is superior to Clusty's IMHO, and so I placed Quintura in the Top 10 of my Top 100 Search Engines. See www.quintura.com. For a list of the entire Top 100 Search Engines for 2006, including the Top 10 and the #1 Search Engine of the Year, plus the Top 10 to watch in 2007 and more, email me at Charles@CharlesKnightSEO.com. Thanks!
Posted by: Charles Knight | January 5, 2007 12:15 PM
Thanks for the analysis. You might take a look at Firstgov.gov, the US govt search engine. It uses Clusty on top of MSN/Live search. The "private label" approach might be the best direction for Vivisimo; rather than try to compete by offering Clusty as a general search site it could position it as a way to improve search results on other large scale sites. Of course there's lots of competition in the corporate search space too...
Posted by: Steve Fleckenstein | January 5, 2007 1:53 PM
The most visually advanced cluster engine is Grokker. The Java driven Zoomable Map View blows my mind. It's worth a look-see: http://live.grokker.com/grokker.html.
Don't forget these
Infocious - http://search.infocious.com
QueryServer - http://www.queryserver.com/QServerExe/QServer.exe/web.ini
DumbFind - http://www.dumbfind.com/
Mooter - http://www.mooter.com
Posted by: rickdog | January 5, 2007 4:37 PM
I agree with Charles Knight above - Quintura uses the same paradigm of clustering results, but has a supercool UI that allows you to dynamically move between the various clusters.
Metamojo addresses this problem of richer specification of search criteria in a slightly different way - the user can specify a "Category" to qualify the search results, and the engine also provides a rich results set (video, reference lists, blogs etc.).
Posted by: NitinK | January 5, 2007 6:00 PM
I would not draw any conclusions from Alexa data... If you google (clusty?) the latest research you'll find that the alexa rank can be easily 'gamed'/influenced.
That said: I like clusty, and sure hope it will survive.
Posted by: John Smythe | January 5, 2007 9:53 PM
I agree, Clusty really works, I like it.
Posted by: Emre Sokullu | January 6, 2007 12:20 AM
Hi, could you explain the following sentence a little?
"Since some words gravitate to each other more - this weighted graph will be clustered."
How is it clustered? I thought from your explanation that the space of the above diagram is the words in the set of documents, and each word is a point in that space. Lines are drawn from one point to another if the corresponding words are in the same document. The more frequently the words occur together, the bolder the line between them, so that the boldness of the lines indicates strong relationships or clusters of related words. But visually this kind of map does not really produce a "cluster". I could see that moving words closer to each other, rather than increasing the weight of the line, would produce clusters, but not the process as described above.
Thanks.
Posted by: Willie | January 6, 2007 2:24 AM
Seriously, I do believe that Google has too big a head start...too much cash...and way too much branding. In this particular niche of search...some of these companies would do well to heed the advice. It's simply a case of picking your battles carefully....
Posted by: Adrian keys | January 6, 2007 3:22 AM
One thing I like about clustering is that even if you don't use the clusters it still gives you a ton of info about your subject at a glance.
Alex, I'm surprised you didn't mention Google's use of labels when you perform certain searches, like for video games. It seems that Google is willing to adopt the clustering mindset, yet they are using human intelligence to do so.
Posted by: Hashim | January 6, 2007 4:13 AM
@Willie
Imagine that the nodes in the graph are physical tennis balls attached by springs. The boldness of the line equals the strength of the spring.
Alex
Posted by: Alex Iskold | January 6, 2007 5:15 AM
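To make the tennis-ball picture concrete, here is a minimal sketch in Python. It is our own toy model, not anything Clusty is known to use: the balls sit on a line rather than a map to keep the arithmetic easy to follow, every pair of balls pushes apart (an assumption added so that they do not all collapse into one point), and the function name, step count and rate are hypothetical.

```python
def spring_layout_1d(positions, springs, steps=2000, rate=0.01):
    """Relax balls on a line until the pushes and pulls roughly balance.

    positions: dict of node -> starting coordinate (values must be distinct)
    springs:   dict of (node_a, node_b) -> spring strength (the link weight)
    """
    pos = dict(positions)
    nodes = list(pos)
    for _ in range(steps):
        force = {n: 0.0 for n in nodes}
        # Every pair of balls pushes apart, more strongly when they are close.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                push = 1.0 / (pos[b] - pos[a])
                force[a] -= push
                force[b] += push
        # Each spring pulls its two balls together in proportion to its strength.
        for (a, b), strength in springs.items():
            pull = strength * (pos[b] - pos[a])
            force[a] += pull
            force[b] -= pull
        # Take a small step in the direction of the net force.
        for n in nodes:
            pos[n] += rate * force[n]
    return pos

# Three words, evenly spaced at the start; "a"-"b" co-occur five times as often.
layout = spring_layout_1d(
    {"a": 0.0, "b": 1.0, "c": 2.0},
    {("a", "b"): 5.0, ("b", "c"): 1.0},
)
print(layout)
```

After relaxing, "a" and "b" sit much closer together than "b" and "c", even though all three started evenly spaced. That is the sense in which bolder lines turn into visible clusters.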
when it comes to search, the future is visual
google just acquired a visual recognition, face/object company
the future of the search should be --> www.liveplasma.com
Posted by: Adeelnkhan | January 6, 2007 7:00 AM
Another thing about Quintura. In addition to presenting keywords in a word/label cloud, it is possible to add words/labels directly to the cloud. This changes both the cloud and the result set. See my blog entry on the subject at http://nightcleaner.blogspot.com/2007/01/google-is-boring.html
Interactive clustering is perhaps more interesting than Clusty clustering.
Also, it may be more satisfying to work with Quintura clouds than with Clusty clusters because they are visual. Also, the clouds will become 3-d before too long which will be very cool.
Posted by: nightcleaner | January 6, 2007 7:16 AM
I like Clusty very much, but their management is very slow to implement. Since the dawn of Clusty/Vivisimo, the entire transition has been negative.
Clusty was originally supposed to be used for showing off their search product; instead they are focusing on monetizing Clusty through ads. Big mistake: it has now become a second-hand engine with no consistency to the search standards.
Great technology but bad marketing = great engineers, but bad management.
....
Posted by: John Doe | January 6, 2007 2:10 PM
I think clustering could help people to start obtaining information from web search engines, and not just webpages.
[...] For more details I refer to this Overview of Clustering and Clusty Search Engine on Read/Write Web. [...]
Posted by: Franciov | January 6, 2007 6:20 PM
You may also want to take a look at the Open Source clustering engine at:
http://www.carrot2.org
For more details about the Carrot2 project and the source code, please see the project website at:
http://project.carrot2.org
Posted by: Stanislaw Osinski | January 7, 2007 2:27 AM
Clusty segments a result set. Pick a cluster and the result set is narrowed/refined.
Quintura grows/flows a result set. Hover over a label in the current cloud and you preview a new cloud/result set.
Select the second label to go with the original keyword, and you "AND" the two clouds and their result sets.
This raises the question whether Clusty, in fact, adds knowledge to search. Or, alternatively, if it is simply boring.
Posted by: nightcleaner | January 7, 2007 4:12 AM
Clusty is my preferred search engine. It works wonderfully with single-word searches. The clusters allow me to look at only the subsets that interest me. However, results become fractured, and rather muddled, when given multiple term searches and terrible when using quotation marks.
Posted by: PyramidView | January 7, 2007 4:55 AM
A search on the term "exchange" seems to be too hard for Quintura to cluster. Clusty handles this ok. Of course, this is just a single data point.
Posted by: jason | January 7, 2007 2:50 PM
Nice write up. I couldn't agree more that "stateless" query results, while an important tool in addressing a user's navigational search needs, are not very good at addressing "discovery oriented searches" where a computer must simulate a more iterative or interactive dialog with the user to provide a meaningful response. Clustering can help in solving this problem. I've provided some additional examples of the problem and possible implications for Google in a recent post on the Lightspeed Blog (see link above).
Posted by: Ravi Mhatre | January 7, 2007 3:56 PM
Vivisimo's (and Clusty's) clustering is secret, but it doesn't depend on offline clustering of words taken from web pages. Instead, it clusters search results based on the overall similarity of one search result (title and snippet) to another. That's why it works regardless of the language (English, Japanese at Clusty.jp, etc.) and why good results are obtainable on content that's never been seen before (as on corporate intranets).
I'm curious: why are the results under AdaptiveBlue unintuitive?
Posted by: Raul | January 8, 2007 2:36 PM
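For readers who want to see what clustering the search results themselves, rather than words from web pages, might look like, here is a minimal sketch in Python. Vivisimo's algorithm is secret, so this is explicitly not it: we swap in Jaccard similarity over the words of each title and snippet and a simple greedy grouping, and the function names, the 0.3 threshold and the sample results are all hypothetical.

```python
def tokens(result):
    """Lowercased set of words from a result's title and snippet."""
    text = f"{result['title']} {result['snippet']}"
    return {w.strip(".,:;!?()\"'") for w in text.lower().split()} - {""}

def jaccard(a, b):
    """Share of words two results have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_results(results, threshold=0.3):
    """Greedy grouping: join the first cluster whose first member is similar enough."""
    clusters = []
    for result in results:
        words = tokens(result)
        for cluster in clusters:
            if jaccard(words, cluster["tokens"]) >= threshold:
                cluster["members"].append(result["title"])
                break
        else:
            clusters.append({"tokens": words, "members": [result["title"]]})
    return [cluster["members"] for cluster in clusters]

# Made-up search results for illustration only.
results = [
    {"title": "AdaptiveBlue founder", "snippet": "Alex Iskold founder of AdaptiveBlue"},
    {"title": "AdaptiveBlue demo", "snippet": "Alex Iskold of AdaptiveBlue demo"},
    {"title": "Static analysis at IBM", "snippet": "tools for static analysis of Java code"},
    {"title": "IBM static analysis tools", "snippet": "static analysis tools at IBM"},
]
print(cluster_results(results))
```

Because the grouping looks only at the titles and snippets returned for one query, nothing has to be computed offline, which fits the point above about working on content that has never been seen before.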
@Raul,
The set that I got was:
# Demo
# Interview, CTO
# Alex Iskold is the founder and CEO of AdaptiveBlue
# Tagged, Technorati
I do not see this as a list that I can easily comprehend.
Alex
Posted by: Alex Iskold | January 8, 2007 6:26 PM
Clustered search engines are very old, they're just a directory displayed in a different fashion. Sorry clusty, but no luck.
Posted by: Phill Midwinter | January 11, 2007 6:41 AM
Phill, it's not how they are displayed, it's how they are built. I suppose you could call a clustered search engine a "directory" with a different interface, but directories are typically built with direct human involvement. Clustering is done automatically, and is centered around the pages matching an arbitrary search term, rather than some editor's top-down view of how the world should be ordered.
Posted by: eas | January 12, 2007 2:16 PM