Disclosure: the writer of this article, Emre Sokullu, joined Hakia as a Search Evangelist in March 2007. The following article in no way represents Hakia's views - these are Emre's personal opinions only.
Google is like a young mammoth, already very strong but still growing. Healthy quarterly results and rising expectations in the online advertising space are the biggest factors helping Google keep its pace on NASDAQ. But now let's think outside the square and try to figure out a Google killer scenario. You may know that I am obsessed with open source (e.g. my projects openhuman and simplekde), so my proposition will be open source based - and I'll call it Google@Home.
First let me define what my concept of Google@Home is. Briefly, Google@Home is an open source, distributed clone of Google. We already have many open source search engine projects - Apache Lucene (whose sub-projects include Nutch and the Hadoop distributed file system) being the most credible one. So this Google@Home concept can be based on one of those open source search engines. Of course it will have a long way to go before reaching Google's utility and reach. But more importantly, Google@Home will be a distributed, decentralized system. What this means is that our desktop computers' idle time will become part of this new search engine's computational power. In effect this allows it to compete with Google's beefy data centers. This is not a new concept either: SETI@Home and Folding@Home are two well-known scientific projects built on the same grid computing idea. Indeed Google itself is the biggest supporter of the Stanford University based Folding@Home, dedicating the resources of its toolbars to this project.
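To make the idle-time idea a little more concrete, here is a minimal sketch of how crawl work might be split among volunteer machines. It is only an illustration under assumptions: the peer names and the functions `assign_peer` and `partition` are hypothetical and not part of Nutch, Lucene or any existing project. Each URL's host is hashed so that every page of a site lands on the same volunteer, which would make per-site politeness limits easier to enforce.

```python
import hashlib
from urllib.parse import urlparse


def assign_peer(url, peers):
    """Pick one peer for a URL by hashing its host name (hypothetical scheme)."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    index = int.from_bytes(digest[:8], "big") % len(peers)
    return peers[index]


def partition(urls, peers):
    """Group URLs into one work list per peer."""
    work = {peer: [] for peer in peers}
    for url in urls:
        work[assign_peer(url, peers)].append(url)
    return work


if __name__ == "__main__":
    peers = ["peer-a", "peer-b", "peer-c"]  # made-up volunteer IDs
    urls = [
        "http://example.org/a",
        "http://example.org/b",
        "http://example.com/",
    ]
    for peer, batch in partition(urls, peers).items():
        print(peer, batch)
```

A plain modulo scheme like this reshuffles most hosts whenever a peer joins or leaves; a real system would more likely need something like consistent hashing.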
The distributed nature of the engine is what makes it different from Wikipedia co-founder Jimmy Wales' Wikiasari project, which is an open source wiki-inspired search engine. While Wikiasari's power may come from Wikipedia, its weakest link is too much human dependency. The power of the masses worked well in the open, community driven encyclopedia project, Wikipedia, but vandalism has still been present - albeit at a manageable level. I'm not sure this can work as well in search engines though.
Well, the concept is clear, but you may wonder about the motivation behind it - why would anyone, an organization or a loosely formed group of people, unite around such a project, and why would people dedicate their computers' idle time to it? Here are some reasons:
Who would create an open source Google clone?
Perhaps Google itself. Or Google competitors such as Ask or Yahoo. It might also be something that P2P kings Niklas Zennstrom and Janus Friis are up to - besides their Joost project. Everything is possible, but in my opinion the most plausible option would be a joint attack by direct competitors. Indeed, perhaps the best fit would be the classic "closed source" company Microsoft! This could be a mirror response to Google, which up till now has directed most of its PR against Microsoft's 'evil' closed source approach (i.e. the subtle 'don't be evil' mantra of Google). Stranger things have happened.
Another idea: this Google@Home project could make more use of the power of the masses at its core. Google is still reluctant to apply that idea directly in its search. Yahoo, on the other hand, seems more ambitious in this arena with its new unified Social Search Unit. As a total underdog, Google@Home would be more open to such innovations and could probably profit from these new paradigms.
How could you support this type of search engine with a complementary distributed and open source ad network? Baris Karadogan has more about this in his blog. (I met him at a conference last week and it turned out that, surprisingly, we had hatched and blogged about similar concepts at the same time!)
Yes, this is my 'Google killer' scenario. There are many open questions though - some of them are:
Let us know what you think, and share your own 'Google killer' scenarios too!
Comments
Well, not that the idea of a distributed search engine isn't nice (even though it really has no direct relation to open source) - but if the problem in competing with Google was the computing power or the size of their data centers, this market wouldn't have looked the way it does. Google simply implements search better - it brings better results (or at least most users act as if they do) - and that's what counts. It's the algorithms, not the computing resources.
Now, for the algorithms. Needless to say, Lucene is ages away from Google - and there's no reason to believe that making it distributed would change that. MS and Yahoo! both have many very smart people working for them, and I'm sure they can equal Google at some point (maybe they have, at least at some point in time - but I don't search there so I wouldn't know ;-) ) - it's safe to assume the search algorithms used by these giants are evolving pretty dynamically. Would such a (momentary) advantage make users switch? Not likely, because Google is good enough. It would take a really radical advantage to make people switch. Personally, I haven't seen this disruptive search technology or product yet.
A minor point is also that open source development is, well, open. If a researcher working on this Mega-Lucene comes up with a great idea, Google can implement it too - there's nothing stopping them, because the GPL doesn't apply here (the software runs on Google servers), let alone other Open Source licenses. And patents probably wouldn't be used to protect this hypothetical researcher's ideas - they're not really compatible with the open source way of thinking.
(Note the difference in this aspect between this case and the case of Windows vs. Linux).
So IMHO - no, this is really far from being a "Google killer". A "Google [search] killer" needs to at least redefine the search problem as we know it ("I give you some keywords, you give me links") - and it still, of course, wouldn't kill Google :-)
You may have a better chance in conjuring up a "Google ads killer" - that's where the real killer potential is.
Posted by: Chas | May 10, 2007 5:24 PM
Kindly omit 30 Boxes from the broken dreams caused by Google Calendar. It simply isn't true. We continue to crush their product every day.
Posted by: Narendra | May 10, 2007 6:12 PM
@Chas - see the link in revenues section, Baris Karadogan has what you want.
Posted by: Emre Sokullu | May 10, 2007 6:14 PM
@Narendra - sorry, it was just a misinterpretation of my thoughts; I just meant Google Calendar disrupted the online calendar industry with the ubiquity of Google, that's all.
Posted by: Emre Sokullu | May 10, 2007 6:21 PM
Very good idea! In fact so good that it has already been implemented: see http://www.yacy.net/yacy/ or just do a search on 'p2p search engine' on Google. Not of Google quality, but anyway... If you at Hakia and YACY pooled knowledge, resources and user base, it might become something more than an idea...
Posted by: Gert-Jan van Engelen | May 10, 2007 10:43 PM
Yes Gert-Jan, good to see that, but this doesn't even have a user interface - it seems too geekish, but it can be used as a base; the one I'm talking about should be as accessible as current search engines. Average user Joe should not even feel that it is P2P.
Posted by: Emre Sokullu | May 10, 2007 10:51 PM
Nothing new here, Emre.
- distributed search engine has been done.... what was the name... Gruby? Looksmart bought the small team + sw many years ago, obviously did nothing with it.
- I know there is one active search engine like that somewhere in the UK. Can't remember the name now, obviously nothing spectacular.
- Jimbo's SE project might be distributed after all - check the list archives for May.
- Comparing Lucene to Google makes no sense. Lucene is a low-level library. It's up to the application to make clever use of it.
Posted by: Otis Gospodnetic | May 10, 2007 11:13 PM
@Otis: a quote from the article:
"Of course it will have a long way to go before reaching Google's utility and reach"
Thanks for informing about these projects but unfortunately we don't have anything people can really **use** - we obviously lack such a project.
Posted by: Emre Sokullu | May 10, 2007 11:24 PM
Brilliant. :)
Posted by: Remi | May 11, 2007 12:03 AM
Emre, it's great that you and Charles both started this topic of a Google killer. I agree with Chas that the search problem should first be redefined.
Posted by: Yakov | May 11, 2007 12:43 AM
If you really want it to happen, there is the classical open source comment: show me your code.
Yes, it is a good idea, but it will stay a good idea until there is some code out there.
There is a lot of search engine code published with a free license, and some experiments on distributed crawling, but I have not seen anything that scales up to the size of Google. And it is not a novel idea out there, it is just very hard to climb up to the scale of Google.
Posted by: Patrik Wallström | May 11, 2007 1:01 AM
@Patrik - The scale problem can be solved with a P2P approach, but as you say, the movement should start somehow and someone should lead it, just like Linus did years ago. Then the trust and backing of big corporations should come... The pieces are out there (Nutch, dmoz, Otis' propositions, there was an open source P2P framework too - I just don't remember the name), but someone should glue these together and make it a product. I would like to do that myself and show you the code :-) but unfortunately I don't have time... that's why I'm speaking out and making a call, actually.
Posted by: Emre Sokullu | May 11, 2007 1:41 AM
This topic has been one of my personal obsessions for the past year now. Google's search result quality has dropped significantly in recent times (starting around last summer) and I've left the little used multitasker at the back of my brain to keep thinking about this topic :)
"We already have many open source search engine projects - Apache Lucene (which is composed of Nutch and Hadoop distributed file system sub-projects) being the most credible one. So this Google@Home concept can be based on one of those open source search engines."
I follow Chas's sentiments on this. Lucene has its place, but Google was successful primarily because they developed a great algorithm (which is beginning to get stale, granted) and then developed an architecture designed to use that algorithm at its full potential. Google is an optimized grid running a grid algorithm. Creating a grid and then running a generic algorithm would be ineffective and a poor design.
Building great things takes pain (if only the architects of more modern e-mail systems had realized this before SMTP became so prevalent) and Google's founders took the initial pain of developing a reasonably risky, unproven algorithm into a system which then became a business. They weren't looking for full results on day one, month one, or even year one, but started with the principles they thought were right and then built the whole system up from there.
Starting with something like Lucene almost means throwing away ideals and many new ideas and concepts merely to get a time bonus. Torvalds, Brin and Page didn't do it that way with their projects, and I'm not sure an open source "Google killer" could do it that way either.
Chas is right in that we need to redefine the problem, and the potential solution, although I am not entirely convinced that redefining it must mean it would change significantly. There are a lot of poor two-bit search engines out there that seem to live in the smugness of being different.. and that's another mistake to avoid.
Posted by: Peter Cooper | May 11, 2007 1:55 AM
It's also possible to make the project profitable (i.e. text ads) and give the profit to charities. This would be a huge incentive for people to contribute to the project. Imagine all of Google's income being distributed yearly. The money could also be used to push open-source projects and educate people in poor countries.
Posted by: Essam Alzamel | May 11, 2007 3:21 AM
I totally agree. But you will also need a linguistic database as in http://www.infocodex.com
Posted by: Zeno Davatz | May 11, 2007 6:09 AM
Have a look at FAROO, a peer-to-peer web search engine (although not open source).
Posted by: Wolf | May 11, 2007 8:02 AM
Am I correct in saying that depending on the masses (distributed) is not always so successful at killing big companies (products)? Linux itself will take decades to kill Windows. Unless Google makes mistakes, it is on its way to being a major player in the future.
Posted by: Sagar | May 11, 2007 9:10 AM
@Wolf, thanks for sharing, this Faroo thing looks promising - it has successive o's in the brand name - that's it, it will surely do it :-) Kidding, but IMO it had better be open source, otherwise PR, collaborative development and everything else become difficult.
@Sagar, you have to be patient, this is not an easy job and will definitely take time!
@Zeno, the good thing is if this was an open source system, you could create this add-on by yourself and let everyone use it; or perhaps you could fork the project too :-)
Posted by: Emre Sokullu | May 11, 2007 10:30 AM
Emre:
My point was - there is a reason why all those past and current attempts at distributed search engine are not succeeding. Crawling part is cheap. Delivering search results is expensive. Crawling is easily distributed. Search is not (latency).
Posted by: Otis Gospodnetic | May 11, 2007 4:46 PM
I have thought of a distributed search engine, and I foresee a few serious problems.
Google is constantly defending itself against search engine optimizers who use dubious techniques to get their clients' websites higher in the search results. Google manages to some extent by keeping the details of its ranking algorithm secret. This is one of the few examples where I think security through obscurity works.
A distributed search engine has two problems: first, someone has to manage some kind of ranking algorithm - but who? And second, by placing part of a web index database on my computer, it places me in a position to manipulate the data.
Without these problems solved, the distributed search engine is bound to fail because of manipulation by search engine optimizers.
I haven't come up with an idea to solve these problems. I'm very curious if they can be solved.
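One partial mitigation used by volunteer-computing projects is to hand the same work unit to several peers and accept a result only when enough of them agree. The sketch below illustrates that idea under assumptions: `accept_result`, the peer IDs and the digest strings are hypothetical, and it does nothing about colluding peers or about who governs the ranking algorithm.

```python
from collections import Counter


def accept_result(reports, quorum=2):
    """Return the digest reported by a strict majority of peers, else None.

    reports maps a peer ID to the digest of the index fragment it computed.
    The winning digest must also be reported by at least `quorum` peers.
    """
    if not reports:
        return None
    value, count = Counter(reports.values()).most_common(1)[0]
    if count >= quorum and count > len(reports) / 2:
        return value
    return None


if __name__ == "__main__":
    # Two honest peers agree, one peer reports a tampered fragment.
    reports = {"peer-a": "digest-1", "peer-b": "digest-1", "peer-c": "digest-2"}
    print(accept_result(reports))
```

The cost is that every unit of work is done several times, which eats into the "free" computing power the distributed design is supposed to provide.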
Posted by: Onno Zweers | May 11, 2007 4:59 PM
Google Killer... dreaming..!
Posted by: 0neway | May 11, 2007 5:19 PM
@Onno, the data manipulation problem could exist in any P2P system like Kazaa, but encryption algorithms can solve it at some cost in speed and latency, as Otis says.
@Otis, crawling is not cheap. Consider the 10MB sitemaps that webmasters include in their web directories: just for one site, parsing 10MB of sitemap, crawling those pages and extracting the data is a tremendous job, and P2P can help here. I agree with your latency point, that's the problem that should be tackled, but at least this can be solved with a semi-P2P approach - a Wikipedia/P2P hybrid.
Posted by: Emre Sokullu | May 11, 2007 6:04 PM
Emre:
cost(crawl)
Posted by: Otis Gospodnetic | May 11, 2007 8:46 PM
@Otis - you can **cache** the crawled bits extracted from the background processes (which are the heaviest part of it) via P2P; then serve them using inexpensive open source solutions - just like Wikipedia does. Serving the content should not be a problem. And that's the first thing that comes to my mind, it's a semi P2P solution though. But if we delve into it, we can find more.
Posted by: Emre Sokullu | May 11, 2007 11:17 PM
This is not a new idea. It's been tried before and failed.
Back in 2003 LookSmart (through Grub http://searchenginewatch.com/showPage.html?page=2177021) used users' unused computer resources in a grid to spider sites, but it did little more than compress the pages for the indexing process.
That doesn't mean that it's not possible (I've been dreaming about it myself for a long time and it's been one of my dream projects) but there are inherent problems with it.
Firstly the biggest problem search engines face is "search engine spam". "Social" components to a search engine are weaknesses in this fight for search relevancy.
Furthermore the "social" components have yet to show benefits that can scale. For example, the rating of sites works well in limited scope (like a digg, a delicious or a stumbleupon) but does a piss poor job with web-wide searching. Link-based algorithms still dominate the quest for search relevancy and nothing else has proven to do as well just yet (not saying it won't, just that it's been tried thousands of times without success, indicating a degree of difficulty).
But most importantly, people underestimate the effect that the latency of p2p has on usability. Nobody wants searches to take as long as they do on p2p platforms, where the query is passed to peers in sequenced handshakes and rare results come back minutes (or more) later.
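A back-of-the-envelope sketch of why sequenced hand-offs hurt: if a query is relayed from peer to peer, the delays add up, whereas a query fanned out to all peers in parallel waits only for the slowest one. The per-hop numbers below are made up for illustration, and the two function names are hypothetical.

```python
def sequential_latency(hop_ms):
    """Total delay when the query is relayed peer to peer, one hop at a time."""
    return sum(hop_ms)


def fanout_latency(hop_ms):
    """Delay when the query is sent to all peers at once: the slowest reply wins."""
    return max(hop_ms) if hop_ms else 0


if __name__ == "__main__":
    hops = [120, 80, 200]  # assumed round-trip times in milliseconds
    print("sequential:", sequential_latency(hops), "ms")
    print("fan-out:", fanout_latency(hops), "ms")
```

Even the fan-out case is bounded by the slowest home connection in the set, which is part of why serving results from centralized machines stays attractive.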
Look, I want to beat google as much as the next guy, both because I dislike the power they have over the web and because of my inherent interest in building leading technologies.
But the p2p platform isn't the solution. Open source is. I've followed nutch and lucene since their inceptions with hopes of building decent search engines out of them, but they aren't ready yet. But there's hope:
The biggest reason I haven't put my passion for search and my skill with algorithms to use in the quest for a better search engine is foundational issues that bore me (distributed filesystems etc) as well as scale issues that impede me (one example is the bandwidth needed to spider the web and keep it fresh).
The hope comes from next-generation services, the likes of which companies like Amazon are currently realizing. For example they offer their "300 terabyte" index as a service to build search engines upon. They also provide cloud computing as a service as well as a storage service.
The evolution of services like those will drive overhead down for problems of the scale of web search and more algo guys like me who think they can tweak their way to better relevancy (and competition for google) will have at it.
To summarize: by standardizing protocols (it'd be nice if the index API's protocol was an open standard so you could switch from alexa to another provider if needed) you'd avoid the lock-in that might discourage developers (I'm not ready to invest a boatload of time in someone else's search service just yet), and by making open source foundations for search with easy ways to tweak the ranking algos you will see more engines out there.
p2p can only help with some of the spidering and indexing; serving the serps needs to be done from centralized servers hosting the indexes, for speed reasons. With future services out there ready to help small developers scale, what's really missing is something like lucene but with an easy ability to customize the indexing and ranking algos.
Give me that and the maturation of cloud-computing as a service and I'll give you a very relevant search engine.
P.S.
Here's a very basic primer on making a search engine:
Why Writing Your Own Search Engine is Hard
http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=143
Posted by: Craven de Kere | May 12, 2007 4:10 AM
You're forgetting one thing: the algorithms are a closely kept secret for a reason, so people cannot manipulate search results. If we had the full algorithm, we could just create a bullshit website about "viagra", make it comply with the algorithm and be set to make a lot of money, but sadly it isn't that simple.
Posted by: Dan | May 12, 2007 6:48 AM
@Dan - I think this is a lie; we more or less know how PageRank-like algorithms work. The ways of preventing spam sites are obvious, and spam sites know them well too: they try to replace viagra with v1agra, for example (human-readable modifications), but the algorithms that can detect this are pretty obvious too. I don't believe in the legitimacy of this argument.
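As a rough illustration of the v1agra case, here is a minimal sketch that maps common look-alike characters back to letters before checking a term against a blocklist. The substitution table and the names `LOOKALIKES`, `normalize` and `is_blocked` are assumptions made up for this example; it covers only simple one-character swaps and says nothing about how any real search engine handles spam.

```python
# Hypothetical table of digit/symbol look-alikes and the letters they imitate.
LOOKALIKES = str.maketrans({
    "1": "i",
    "0": "o",
    "3": "e",
    "4": "a",
    "5": "s",
    "@": "a",
    "$": "s",
})


def normalize(token):
    """Lower-case a token and undo simple look-alike substitutions."""
    return token.lower().translate(LOOKALIKES)


def is_blocked(token, blocklist):
    """True if the normalized token appears in the blocklist."""
    return normalize(token) in blocklist


if __name__ == "__main__":
    blocklist = {"viagra"}
    for word in ["v1agra", "V1AGR@", "violin"]:
        print(word, is_blocked(word, blocklist))
```

A table like this also rewrites legitimate tokens such as years or prices, so on its own it would produce false matches; it only shows that the simplest obfuscations are easy to reverse.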
Posted by: Emre Sokullu | May 12, 2007 4:37 PM
Wikia Search _is_ Google@Home. Check it out for yourself!
I just wanted to make it perfectly clear that Jimmy (Jimbo) Wales' (yet-to-be-named) Wikia Search project is already discussing ideas along these lines in some detail (primarily on the search-l discussion list).
The distributed computing thread there has been active for some time.
Posted by: Nathan Braun | May 13, 2007 6:01 PM
Emre:
"you can **cache** the crawled bits extracted from the background processes (which are the heaviest part of it) via P2P; then serve them using inexpensive open source solutions" -- this is *super* vague! :) I don't even see how this relates to Wikipedia. Use of open-source Media-Wiki? That'd be a pretty weak comparison, I think.
Of course you "cache" crawled content. That's essential for creating a searchable index.
I think Craven at #25 put it well. He mentioned the same thing I mentioned in my original comment above.
You also mention PageRank in #27. PageRank is but *one* component of the scoring algorithm. An *old* one, too, predating all those PhDs hired by Google since the PageRank algo was published. As for spam, I don't think defeating spam is easy. How many spam emails did you get today? I got at least 500. You may not see as much spam in SERPs as in your inbox because of the nature of SERP/queries/ranking and Inbox/time/freshness/sorting, but you know it's there, eating Google's resources.
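For reference, the published PageRank idea can be sketched in a few lines of power iteration. This is a textbook-style illustration over a tiny made-up link graph, with the commonly cited damping factor of 0.85; the function name and graph are assumptions, and it is in no way a description of what Google runs in production today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}  # toy graph
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 4))
```

Because the method is public, the link graph it depends on is exactly what link spammers try to distort, which is why it can only be one signal among many.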
Posted by: Otis Gospodnetic | May 13, 2007 8:33 PM
@Nathan - this is good news. I don't follow the Wikia mailing list, but it would be great indeed if they took this track.
Posted by: Emre Sokullu | May 13, 2007 10:30 PM
Oh yes, Wikia may be up to something like that; here are some discussion links:
http://lists.wikia.com/pipermail/search-l/2007-May/000418.html
http://lists.wikia.com/pipermail/search-l/2007-May/000351.html
Posted by: Emre Sokullu | May 13, 2007 10:35 PM
www.open-search.net
Posted by: erik | May 14, 2007 2:17 AM
Emre:
Re: #30/#31 - heh, didn't I mention this already? Yes, look at point no. 3 in comment #7. But that's nothing concrete, just one of the ideas of possible approaches so far, as far as I know.
Posted by: Otis Gospodnetic | May 14, 2007 8:43 AM
Nothing against Google really, but I have talked about an open source distributed search engine with colleagues on several occasions... think of it as a similar implementation to p2p and other non-central data-crunching projects.
Posted by: Anthony Ettinger | May 15, 2007 4:26 PM