ReadWriteWeb

Sponsor Post: The Limits of Tweet-Based Web Search

Written by RWW Sponsor / September 22, 2009 5:00 AM / 7 Comments

Editor's note: we offer our long-term sponsors the opportunity to write 'Sponsor Posts' and tell their story. These posts are clearly marked as written by sponsors, but we also want them to be useful and interesting to our readers. We hope you like the posts and we encourage you to support our sponsors by trying out their products.

Many of the recent real-time search engines are based on Twitter. They use the URLs enclosed in tweets to discover and rank new and popular pages. In this post, we'll take a look at the quantitative structure of the underlying foundation, to determine the feasibility and limits of this approach. We'll also look at how to overcome these limitations by using the implicit Web.

You may have recently seen an interesting visualization of Twitter statistics. It shows that, as with other social services, only a small fraction of users actively contribute.

But it also shows something else: those who do contribute publish only a tiny fraction of the information they know.

Both of these factors account for the huge difference in efficiency between implicit and explicit voting. Explicit voting, as the name implies, requires users to actively express interest in a page; for example, by tweeting a link. Implicit voting requires no deliberate action on the part of the user; a simple visit to a Web page would count as a vote.

A Quick Calculation

Twitter now has 44.5 million users and delivers about 20,000 tweets per minute. If every second tweet contained a URL, that would be 10,000 URLs shared per minute.

According to Nielsen, the number of visited Web pages per person per month is 1,591.

At that rate, Twitter's 44.5 million users visit about 70.8 billion Web pages per month, or roughly 1.6 million per minute, yet they explicitly vote for only 10,000 per minute. In other words, implicit voting and discovery generate about 160 times more attention data than explicit voting.

This means that about 280,000 users voting implicitly could provide as much information as 44.5 million users voting explicitly. Put another way, as many Web pages are discovered implicitly in one day as are discovered explicitly in roughly half a year.
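The arithmetic above can be retraced in a few lines. The following is a minimal sketch in Python that uses only the figures quoted in this post; the 30-day month and all constant and function names are assumptions made for illustration.

```python
# Figures quoted in the post. The 30-day month is an assumption of this sketch.
USERS = 44_500_000                  # Twitter users, as quoted above
TWEETS_PER_MINUTE = 20_000          # tweets delivered per minute
URL_SHARE = 0.5                     # "every second tweet" contains a URL
PAGES_PER_USER_PER_MONTH = 1_591    # Nielsen figure quoted above
MINUTES_PER_MONTH = 30 * 24 * 60    # assumed 30-day month


def explicit_votes_per_minute():
    """URLs shared in tweets per minute (explicit votes)."""
    return TWEETS_PER_MINUTE * URL_SHARE


def implicit_votes_per_minute():
    """Web pages visited per minute by all users (implicit votes)."""
    return USERS * PAGES_PER_USER_PER_MONTH / MINUTES_PER_MONTH


def efficiency_ratio():
    """How many implicit votes occur for each explicit vote."""
    return implicit_votes_per_minute() / explicit_votes_per_minute()


def equivalent_implicit_users():
    """Users voting implicitly that match all users voting explicitly."""
    return USERS / efficiency_ratio()


print(f"explicit votes per minute: {explicit_votes_per_minute():,.0f}")
print(f"implicit votes per minute: {implicit_votes_per_minute():,.0f}")
print(f"efficiency ratio:          {efficiency_ratio():.0f}")
print(f"equivalent implicit users: {equivalent_implicit_users():,.0f}")
```

Without intermediate rounding, the ratio comes out at about 164 and the equivalent number of users at about 272,000. The post rounds the page visits to 1.6 million per minute first, which gives the 160 and 280,000 used in the text.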

This shows how limited Web search is when it relies solely on explicit votes and mentions, and how much additional potential the implicit Web could unlock.

Beyond the Mainstream

This becomes even more important if we look beyond mainstream topics and the English language. Then it becomes simply impossible to achieve the critical mass of explicit votes needed to have statistically significant attention-based ranking or popularity-based discovery.

Time and Votes Are Precious

Time is also a crucial factor, especially with real-time search. We want to discover new pages as soon as possible, and we want to assess almost instantly how popular they are. If we fail to rank a page reliably and quickly, it gets buried in the noise. But speed and reliability conflict: the less time a page has been available, the fewer votes it has collected.

Again, the much higher frequency of implicit votes would help.

Relevance vs. Equality

We could also get more out of explicit votes by not treating them all as equal, because they are not. We trust some voters more than others, and our interests overlap with some more than others, for the very same reason that we follow some people and not others. Weighting votes accordingly helps us get more value and meaning out of that very first vote.
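As an illustration of unequal votes, here is a minimal sketch in Python. It is not a description of any particular search engine: the function name `rank_pages`, the trust values, the default weight for unknown voters, and the example URLs are all hypothetical.

```python
from collections import defaultdict


def rank_pages(votes, trust, default_trust=0.1):
    """Rank URLs by the summed trust of the people who voted for them.

    votes: iterable of (voter, url) pairs
    trust: dict mapping a voter to a weight (hypothetical values in [0, 1])
    default_trust: weight used for voters we know nothing about (assumption)
    """
    scores = defaultdict(float)
    for voter, url in votes:
        scores[url] += trust.get(voter, default_trust)
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical example: one trusted voter against two less trusted ones.
votes = [
    ("alice", "a.example/x"),
    ("bob", "b.example/y"),
    ("carol", "b.example/y"),
]
trust = {"alice": 0.9, "bob": 0.2}

ranking = rank_pages(votes, trust)
for url, score in ranking:
    print(f"{score:.2f}  {url}")
```

In this example a single vote from a highly trusted person outranks two votes from people we trust less, so a page can be ranked meaningfully before it has collected many votes.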

FAROO is moving in this direction by combining real-time search with a peer-to-peer infrastructure.

A Holistic Approach

Discovering topical, fresh, and novel information has always been an important aspect of search. But the perception of what "recent" is has changed dramatically with the popularity of services such as Twitter, and it has led to the emergence of real-time search engines.

Real-time search shouldn't be a silo, but rather should be part of a unified and distributed approach to Web search.

The era of purely document-centered search is over. The equally important roles of user and conversation, both as targets of search and as contributors to discovery and ranking, should be reflected in the infrastructure.

A Distributed Infrastructure

As long as both the source and the recipient of information are distributed, the natural design of search is distributed, too. P2P offers an efficient alternative to the ubiquitous concentration and centralization of search we find today.

A peer-to-peer client allows every visited Web page to be implicitly discovered and ranked according to attention received. This is important, because the majority of pages in a real-time search are in the long tail. They appear once or not at all in the Twitter stream and can't be discovered or ranked through explicit votes.
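The long-tail effect can be made concrete with a toy example. The following Python sketch is only an illustration under invented data: the visit log, the tweeted URLs, and all variable names are hypothetical, and it says nothing about how an actual peer-to-peer client is implemented.

```python
from collections import Counter

# Hypothetical page visits observed by clients (implicit votes).
visits = [
    "news.example/a",
    "blog.example/niche",
    "news.example/a",
    "blog.example/niche",
    "wiki.example/z",
]

# Hypothetical URLs that appeared in tweets (explicit votes).
tweeted = ["news.example/a"]

implicit = Counter(visits)   # attention per page from visits
explicit = Counter(tweeted)  # attention per page from tweets

# Pages that received attention but never showed up in the tweet stream.
invisible = sorted(url for url in implicit if url not in explicit)

print("ranked by implicit votes:", implicit.most_common())
print("ranked by explicit votes:", explicit.most_common())
print("invisible to tweet-based search:", invisible)
```

Here two of the three visited pages never appear in a tweet, so a purely tweet-based index can neither discover nor rank them, while the visit counts still order all three.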

With real-time search, the amount of indexed data is limited, because only recent documents that have gained a lot of attention and a high reputation are included in the index. This allows for a centralized infrastructure at a moderate cost. But as soon as search moves beyond the short head of real-time search and aims to fully index the long tail of the entire Web, a distributed peer-to-peer architecture provides a huge cost advantage.


Comments


  1. I think it would be a good idea to identify the author of these 'sponsored articles' and explain what their role or title is. This would probably also be good for the author and good for the company (sponsor).

    Posted by: Brice Dunwoodie (CMSWire.com) | September 22, 2009 5:24 AM



  2. 20k tweets per minute? way to go.

    Posted by: ITrush | September 22, 2009 6:10 AM



  3. The “if twitter were 100 people” graphic is wrong. There should be overlap between the blue and purple groups, and the green group should only be thirty users, not fifty users.

    See the original post with stats. I think it is important to vet visualizations for accuracy, no matter how pretty they are. Anyways thanks for sharing the information.....

    Posted by: usb verlangerung | September 22, 2009 6:48 AM



  4. The stat "Twitter now has 44.5 million users" is misunderstood. It's 44.5m unique visitors, not 44.5m users. How can you deduce like that?

    Posted by: nuphero | September 22, 2009 7:15 AM



  5. @nuphero: Yes, the figures are approximate and simplified to keep it compact.

    To be precise, 44.5 million users accessed Twitter from the web in June 2009. The number of users changes continuously, the measurement has its deviations, and it covers only web-based access to Twitter.com.
    And there are Twitter users who are subscribed to the service but were not active in that month. One may argue whether they are really users of a real-time microblogging service if they are not active during a month.

    But none of those additional factors changes the post's central point. We have a baseline consideration here. If additional Twitter users are considered for a constant number of tweets, then the efficiency ratio per user of a purely tweet-based search would be even lower, which would emphasize the benefit of implicit voting even more.

    Posted by: Wolf Garbe | September 22, 2009 9:57 AM



  6. You have some great points in the article.
    Buuuut,
    your search just doesn't deliver.

    There aren't even any results when you search for faroo with sort order timeline (this article should at least be there)

    http://www.faroo.com/search?q=faroo&qt=timesearch&start=0&language=en&fq=mentionLanguage:en%20&tr=en.xsl

    Posted by: slopfry | September 22, 2009 12:42 PM



  7. @slopfry: Well, it's there ;-)
    http://www.faroo.com/search?q=limits+tweet+based+search+&language=en&fq=mentionLanguage%3Aen+&tr=en.xsl&start=0&qt=timesearch
    While the complete article is indexed, the search is by default limited to the title for relevancy.

    Posted by: Wolf Garbe | September 22, 2009 1:22 PM


