Based on log file evidence from a friend who runs a personal website, Rich Skrenta claims that only 11 search startups are actually crawling the web, and he wonders where all the alt search engines are. For some reason, Rich doesn't link to Charles Knight's Top 100 Alt Search Engine List in asking that question, but to Don Dodge's post linking to us. Nevertheless, this brings up some interesting questions: Why are only a few of the hundreds of alternative search engines crawling? Are many of them using a licensed index? Are many of them using alternative ways to get their data?
AltSearchEngines editor Charles Knight has asked his many contacts for more information on this, so we will report back soon on the results. Meanwhile, Yakov from alt search engine Quintura (a sponsor of AltSearchEngines.com) says in a comment on Skrenta's post that "having its own index is a necessity for search startup". In another comment, Tailrank's Kevin Burton points out that some alt search engines have a limited scope: "Well with Spinn3r we only crawl blog content so we shouldn't show up on a historical site. I wonder if other crawlers/startups have similar limitations." Rafael Cosentino also says that his service Congoo uses feeds to gather content, so it doesn't need to crawl websites. FAROO uses a special kind of distributed crawler, which crawls "below the radar".
Rich Skrenta clarifies in a comment that he's talking about "web scale" search engines, not niche ones. Even so, it is indeed strange that only 11 crawlers showed up in his friend's website logs.
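For readers who want to run the same kind of check against their own logs, here is a minimal Python sketch. It assumes the server writes the common "combined" access log format, in which the user agent is the last quoted field on each line. The function names are hypothetical, and the 1,000-hit default is the cutoff discussed in the comments below. The sketch only counts hits per user agent; deciding which agents are search crawlers is still a manual step.

```python
import re
from collections import Counter

# Assumption: "combined" log format, e.g.
# 1.2.3.4 - - [06/Aug/2007:12:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "SomeBot/1.0"
# The pattern matches: "request" status bytes "referer" "user agent" at end of line.
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')


def count_agent_hits(lines):
    """Count requests per user-agent string; lines that do not match are skipped."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            hits[match.group("agent")] += 1
    return hits


def agents_above_threshold(hits, min_hits=1000):
    """Keep only the agents with at least min_hits requests in the log window."""
    return {agent: n for agent, n in hits.items() if n >= min_hits}
```

To use it, pass an open log file to `count_agent_hits` and then filter the result with `agents_above_threshold`. Lowering `min_hits` shows how sensitive the final crawler count is to the chosen cutoff.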
Do R/WW readers have any more information about this?
Image: changturtle
Comments
(1) Because they only work on a subset of the web until their IT and processing scales well enough to go full monty...
(2) Because they are special-purpose and only look at specific hand-picked sites...
(3) Since they use Alexa's index directly and don't need to crawl, as someone else does it for them (much better for everyone's bandwidth)...
(4) Since they are "meta" and use other SE's search results...
Lots of possible reasons. (3) makes a lot of sense IMO: if crawling/indexing is a service you can just pay for, instead of crawling yourself and building up the whole infrastructure, it saves a lot of resources and you can direct your efforts towards _search_ instead of _indexing_.
Posted by: Oli | August 6, 2007 12:51 PM
I've corrected the missing attribution to Charles Knight's original article.
Posted by: Rich Skrenta | August 6, 2007 1:02 PM
Because they want to attract investors, not to be useful to visitors...
Posted by: hombrelobo | August 6, 2007 1:57 PM
Perhaps search engines are making a distinction of some kind. For example, in the last 6 days TeamDirection has been crawled by 33 spiders. The total number of unique crawlers in July was 58, and not all of them are duplicates of Rich's list.
What I do notice is a vast disparity in the number of hits each crawler registers. It could be that Rich's metric of 1000 web hits is filtering out the alternatives.
Posted by: John Milan | August 6, 2007 4:56 PM
It is very hard for small search engines to build an infrastructure that can crawl the entire web corpus, which grows every second. Only a few companies have the "SCALE" to deal with this.
Posted by: valleyblogzine | August 6, 2007 5:06 PM
John -
Greg's site is about 7k pages. I figured fewer than 1000 hits over 3 months meant that the site basically wasn't being crawled to any real extent or at any reasonable interval.
Agree that it's all about scale...
Posted by: Rich Skrenta | August 6, 2007 6:04 PM
Thanks for the article and the links back to the Top 100. Great list!
Posted by: Larry | August 6, 2007 10:35 PM
Thanks for the great piece. It's inspired my blog post for today (www.garystew.com). Migoa (my company) is a vertical search engine, so I guess technically we're not really covered by the article. But our information suggests that of the European vertical search players, only we and Extate have proprietary crawlers. Of course, this is only based on publicly available data, so it's possible that there are more European vertical search players with proprietary crawlers. In any case, thanks for the great article.
Posted by: Gary Stewart | August 7, 2007 7:45 AM
For those of us in the human-powered space, it might be because we actually visit sites ourselves instead of sending a spider out to do the dirty work.
Posted by: Adam Jusko | August 7, 2007 8:06 AM
I can't speak for them all, but personally... we spend more time in development phases, where a lot of crawling doesn't go on, comparatively speaking.
Other search startups/alts may not report a user agent, or possibly use other people's indexes.
Posted by: Phill Midwinter | August 7, 2007 8:58 AM