ReadWriteWeb

Larry Page on Real Time Google: We Have To Do It

Written by Marshall Kirkpatrick / May 19, 2009 10:16 AM / 47 Comments

Is Google interested in searching the Real Time Web? Are they at all threatened by Twitter? Are Google spiders already so fast that this emergence of Real Time is old news to them? Further fodder for pondering these types of questions was offered by Google co-founder, Larry Page, today at the Google Zeitgeist conference in Hertfordshire, UK.

Page says that Twitter has demonstrated that real time search is essential. Loic Le Meur, founder of microblogging service Seesmic and European tech conference Le Web (where this year's topic is the real time web), asked Page today what he thought about Twitter. Page's response was interesting.

"I have always thought we needed to index the web every second to allow real time search," Le Meur quotes Page as saying. "At first, my team laughed and did not believe me. With Twitter, now they know they have to do it. Not everybody needs sub-second indexing but people are getting pretty excited about realtime."

Page's statement comes less than two weeks after Google execs told reporters that the company is looking at ways of integrating microblogging capabilities, such as those popularized by Twitter, into its search product.

It's clear that Twitter in particular, and the real time web of status updates in general (most popular on Facebook), is changing the direction Google is going. Google execs probably prefer to talk about Twitter instead of Facebook because they are on friendlier terms with the smaller company and Facebook is closed to outside search. Neither company has a clear corner on the real time market, though.

For a look at one type of real time search functionality Google might aim for, check out the newly relaunched OneRiot. (Our review.) Or FriendFeed search. Much more is possible than simple "most recent" search à la search.twitter.com. Relevance has to be figured out on top of timeliness.

For an in-depth look at the real time web, see our recent overview titled Introduction to the Real Time Web and yesterday's post on Search Engine Land, where real time and circles of influence (social search, essentially) were identified as Google's two primary weaknesses and likely directions for the future.


Comments


  1. It will be interesting to see how Google deals with the issue of relevancy. Merely indexing faster is not going to cut it; real-time requires a fundamentally different measure of relevancy than Pagerank can provide.

    I think companies like bit.ly have a better chance of solving the relevancy problem than google does by merely running pagerank in real time.

    I've posted about this in more detail here: http://realtimethoughts.com

    Posted by: Tony Haile | May 19, 2009 10:52 AM



  2. I can't imagine why I would want to search twitter in real time. Seriously. I'm not a luddite, and I do follow a few people on Twitter, but there is literally no reason for me to need or want to know about their posts in real time.

    After all, as generally worthless as blogs and forum postings are, at least those writers usually spend more than 20 seconds typing up their thoughts. Maybe even a whole minute or two, like this posting.... But Twitter feeds... well, it's just low quality, low urgency stuff, for the most part, and the small fraction that isn't the "most part" just isn't worth wasting time, energy, attention and bandwidth on for instant indexing.

    Google wants to index and access all information, so naturally they will want twitter as well as everything else -- eventually. But I think pretty much every other form of information and communication known to humanity would rate a higher priority.

    Posted by: Miramon | May 19, 2009 10:55 AM



  3. Miramon, you should check out the new OneRiot.com and Mark Carey's Twitter on Google greasemonkey script and see how those implementations of real time treat you. They are quite useful.

    Posted by: Marshall Kirkpatrick | May 19, 2009 10:57 AM



  4. Twitter on Google script rocks.
    I would also love to see a Google + Friendfeed integration. Greasemonkey or otherwise.

    Posted by: Eric | May 19, 2009 11:33 AM



  5. Real time search is addicting. I'm excited to see Google integrate this implementation of the 4th dimension into their products.

    Posted by: Andrew Mager | May 19, 2009 12:07 PM



  6. 2 years ago at Searchology, I was fortunate enough to be sitting at the same table as Larry Page at lunch, and he mentioned that ultimately content would be indexed as users are typing it into a given medium, see:

    http://www.webanalyticsworld.net/2007/05/google-searchology-lowdown-part-1.html

    Posted by: Manoj | May 19, 2009 12:23 PM



  7. I actively use Twitter but use Twitter Search infrequently. Most search use occasions are not ultra-time-dependent. I see real-time search being useful for supplemental entertainment/info at planned events (conferences, concerts, sports games, TV shows, earnings calls, etc.) or as a substitute news source for unplanned events (e.g. natural disasters, accidents, etc.).

    Posted by: Ryan | May 19, 2009 12:37 PM



  8. Searching Twitter etc. in real time might not seem like a good idea to some of you, but that's a far cry from not being able to index it in real time.

    Trending topics, people retweeting links to articles that took "longer than 20 seconds" to write and the like can be valuable sources of information.

    There are a lot of topics where freshness is important, and PageRank results often surface more established information first because more people have linked to it. Google News begins to solve that problem in one direction by offering search results for recent news stories.

    For example, today I wanted to find an article that Bruce Sterling linked to a couple of days ago. No combination of words in Google would get me that result anywhere near the top. A result that combined relevance and (where appropriate) freshness could be very useful indeed.

    Posted by: Tim Maly | May 19, 2009 12:53 PM



  9. Is real time really important, or just the latest big thing? We've had websites capable of polling for updates for many years but the term "real time" seems to be growing as Twitter, Facebook and other web properties implement the automatic F5.

    Indexing content in real time is interesting but presumably it would still take time for a page to move up the SERPs as they gain authority and energy from inbound links.

    I guess I'm just not sure what real time really means for Google (and search) and whether or not I should be excited about it.

    Posted by: David | May 19, 2009 2:02 PM



  10. Marshall said...
    Are Google spiders already so fast that this emergence of Real Time is old news to them?

    You can deploy software agents to crawl and watch each and every site or new sites that emerges on the web, so this is not the main problem.

    Marshall said...
    Page says that Twitter has demonstrated that real time search is essential.

    Page is mistaken here. If Page & researchers at Google can dig further to what Twitter is doing, such as benchmarking the Twitter search against the same dataset that Google has indexed, they would be in for a surprise. The benchmarking would widely expose the so called Twitter real-time search. Inefficient and low recall & precision.


    It appears real-time because it indexes only the raw data, and I suspect that the searches are run on this data, but it is not actually real-time feature crunching.

    What's the difference?

    If you index only the raw data (of course, some pre-processing steps go into it), then you're not imposing any huge penalty on retrieval time, since the cleansed data is all there is to it. You're not crunching the cleansed data at all in real time, because that is impossible with current technology.
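    As a minimal sketch of what "indexing only the raw data" can look like, the toy inverted index below makes each message searchable the moment it is added and returns keyword matches newest first, with no relevance scoring. The class and method names are illustrative assumptions, not Twitter's implementation.

```python
from collections import defaultdict


class RawIndex:
    """Toy inverted index: each message is searchable as soon as it is added."""

    def __init__(self):
        self.postings = defaultdict(list)  # word -> list of message ids
        self.messages = []                 # message id -> original text

    def add(self, text):
        """Index one message; the cost depends only on that message's length."""
        msg_id = len(self.messages)
        self.messages.append(text)
        for word in set(text.lower().split()):
            self.postings[word].append(msg_id)
        return msg_id

    def search(self, word):
        """Return matching messages, newest first (no relevance scoring)."""
        ids = self.postings.get(word.lower(), [])
        return [self.messages[i] for i in reversed(ids)]


index = RawIndex()
index.add("earthquake felt downtown")
index.add("lunch was great")
index.add("another earthquake report")
print(index.search("earthquake"))
```

    Adding a message never touches the entries of older messages, which is why this kind of index can keep up with new arrivals.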

    But why do you need to crunch the cleansed data in order to index?

    Well, first, the data may contain redundant features which need to be eliminated; a clear example is when you have 2 documents that are almost the same. If you don't do feature crunching, you won't be able to eliminate redundancies. Second, the dataset must be reduced to a smaller, compressed dataset that represents the original massive one. This is important for fast retrieval compared to using the original massive dataset itself. Algebraically, we say that the 2 datasets are similar in their properties but different in size, which means that the original information is almost completely retained even after the feature reduction process (although there is some insignificant loss). Third, the search is done on the reduced index data rather than the original massive dataset, because in this reduced domain one can clearly see (visually, if plotted) the patterns of related concepts and how close the terms (words) are in the feature space (reduced data). Automated algorithms are easier to apply to this reduced dataset, which gives higher precision & recall than searches done on the raw indexed data.

    Why bother with feature computation or feature decomposition at all?

    Well, when someone sees sunlight every day, he sees white light. Once you pass that sunlight through a prism, you see the 7 colors of the spectrum. You cannot see those 7 colors without a filter such as a prism, water vapour/steam, etc. In information retrieval terms, it is hard to see any patterns in the original raw indexed data (white sunlight), but when you apply feature computation or feature decomposition to the raw indexed data, you will see patterns emerge, such as similar concepts (the 7 colors). Terms such as 'car' and 'vehicle' end up very close to each other in the new reduced feature space. In summary, what is not obvious in the original dataset becomes obvious when feature decomposition is applied to the data.
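    The 'car' and 'vehicle' effect can be sketched on a toy corpus. The three documents below are hypothetical, and the decomposition uses plain power iteration with deflation to find the top right singular vectors of the term-document matrix; a real system would call a library SVD routine instead. In the raw matrix 'car' and 'vehicle' never share a document, yet in the reduced 2-dimensional space they land on the same direction because both co-occur with 'engine'.

```python
import math

# Rows are terms, columns are documents (hypothetical toy corpus):
#   doc 0: "car engine", doc 1: "vehicle engine", doc 2: "banana fruit"
TERMS = ["car", "vehicle", "engine", "banana", "fruit"]
A = [
    [1, 0, 0],  # car
    [0, 1, 0],  # vehicle
    [1, 1, 0],  # engine
    [0, 0, 1],  # banana
    [0, 0, 1],  # fruit
]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_eigenpairs(m, k, iters=300):
    """Top-k eigenpairs of a small symmetric matrix: power iteration + deflation."""
    n = len(m)
    m = [row[:] for row in m]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-zero start vector
        for _ in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: subtract the component just found so the next pass finds the next one.
        for i in range(n):
            for j in range(n):
                m[i][j] -= lam * v[i] * v[j]
    return pairs


def reduced_term_vectors(a, k):
    """Project each term (row of a) onto the top-k right singular vectors of a."""
    n_docs = len(a[0])
    gram = [[sum(row[i] * row[j] for row in a) for j in range(n_docs)]
            for i in range(n_docs)]
    basis = [vec for _, vec in top_eigenpairs(gram, k)]
    return [[sum(x * b for x, b in zip(row, vec)) for vec in basis] for row in a]


raw = dict(zip(TERMS, A))
reduced = dict(zip(TERMS, reduced_term_vectors(A, k=2)))
print("raw     car~vehicle:", cosine(raw["car"], raw["vehicle"]))
print("reduced car~vehicle:", cosine(reduced["car"], reduced["vehicle"]))
print("reduced car~banana: ", cosine(reduced["car"], reduced["banana"]))
```

    The raw similarity of 'car' and 'vehicle' is zero, the reduced similarity is close to 1, and 'car' stays close to 0 against 'banana'.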

    There are other techniques in search that can be used, but I am using a popular search engine technique here called LSI (latent semantic indexing), which Google is reported to use in conjunction with its PageRank.

    A good read on what I have described above is the following paper, which is freely downloadable (PDF):


    Using Linear Algebra for Intelligent Information Retrieval

    So, here is what you call real-time. I will use a simple analogy with a concept we are all familiar with, the moving average. The analogy extends to search engines, which work in much the same way, except that a search engine works on vectors (a row or a column of numeric data) that accumulate into a matrix, while a moving average works on single numbers that accumulate into a vector. Suppose that you start with a data series of 2 elements (i.e., 2 numbers) and more numbers accumulate as time goes on. Let's say that our data is the daily sales data of company XYZ. The moving average used here (the mean of everything seen so far) is not the same as the one applied in technical analysis of the financial markets, but they're similar.

    day-1 = [2, 4] ==> moving-average (ma) = [3]
    day-2 = [2, 4, 1] ==> ma = [3, 2.33]
    day-3 = [2, 4, 1, 6] ==> ma = [3, 2.33, 3.25]
    day-4 = [2, 4, 1, 6, 2] ==> ma = [3, 2.33, 3.25, 3]
    day-5 = [2, 4, 1, 6, 2, 9] ==> ma = [3, 2.33, 3.25, 3, 4]
    ...
    ...

    The XYZ sales moving average goes on until the company files for bankruptcy; we assume that XYZ will keep trading indefinitely, so the data series keeps growing indefinitely.

    To explain the moving average: it is ma=3 for day-1, because you add 2 and 4, then divide by 2. On day-2, there is only one new number at the end of the series, which is 1. You sum up all the data for day-2, i.e. 2 + 4 + 1, which is 7, then divide by 3 to get the second element, ma=2.33. Note that the moving average from day-1 is still there, which is 3, and the only new entry is 2.33, as in ma=[3, 2.33], to indicate that the moving average has moved on with the arrival of the new sales data for that day.

    Now, imagine that the moving average runs to day-100,000,000 or even more. You have to recompute the moving average every time new data arrives. In terms of computer memory, this is massive to compute in real time, since you have to recompute everything starting from the data that arrived on day-1.
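    The day-by-day series above can be reproduced with a short sketch. The first function re-sums the whole series for every new day, as described; the second, included for comparison, carries a running total forward. Both function names are illustrative.

```python
def averages_from_scratch(series):
    """Mean of every prefix of length >= 2, re-summing the prefix each time."""
    return [round(sum(series[:n]) / n, 2) for n in range(2, len(series) + 1)]


def averages_incremental(series):
    """Same values, but carrying a running total instead of re-summing."""
    total = series[0]
    out = []
    for n, value in enumerate(series[1:], start=2):
        total += value
        out.append(round(total / n, 2))
    return out


sales = [2, 4, 1, 6, 2, 9]
print(averages_from_scratch(sales))
print(averages_incremental(sales))
```

    For this simple mean the running total gives the same numbers without the re-scan; the analogy is aimed at computations that are treated as having no such shortcut.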

    Now, let's squeeze the time scale of our example to milliseconds instead of days, as shown below:

    millsec-1 = [2, 4] ==> moving-average (ma) = [3]
    millsec-2 = [2, 4, 1] ==> ma = [3, 2.33]
    millsec-3 = [2, 4, 1, 6] ==> ma = [3, 2.33, 3.25]
    ...
    ...

    This is a realistic time scale for the rate at which new documents arrive on the internet, each of which in turn forces Google to recompute the whole PageRank from the beginning. We assume here that Google PageRank is not online (i.e., that it does not compute only the new arrivals but the whole dataset again), because I am not aware that Google has made PageRank online yet. To the best of my knowledge PageRank is still an offline algorithm (i.e., it recomputes everything from the beginning).

    The Google PageRank algorithm was reported to have reached a data matrix of 2 billion by 2 billion in 2003. Today I roughly estimate that PageRank crunches a matrix of more than 10 billion rows by 10 billion columns of data.

    In our moving average example, the data grows in a one-dimensional fashion, i.e., you add numbers to the end of a single row. In Google PageRank or LSI, the data grows in a 2D fashion, because whenever a new document appears on the internet, every millisecond, it is formatted into a row/column of the matrix, which grows both horizontally and vertically.

    Here is the catch: the Google matrix is 10,000,000 rows by 10,000,000 columns. There are 10^14 (10 raised to the power of 14) double-precision floating-point elements in this matrix. I will leave it to the programmers out there to work out how many bytes this huge matrix requires in memory while the feature computation is taking place. Why doesn't Google compute PageRank whenever a new document appears on the internet? Because it would have to recompute PageRank each time, and the memory required (despite 100s of thousands of clusters) is unachievable, i.e., physically unachievable.
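    The byte count left as an exercise can be worked out directly, assuming the matrix is stored densely with 8 bytes per double-precision element (the function name is illustrative):

```python
def dense_matrix_bytes(rows, cols, bytes_per_element=8):
    """Bytes needed to hold a dense rows x cols matrix of 8-byte doubles."""
    return rows * cols * bytes_per_element


n = 10_000_000
elements = n * n
size = dense_matrix_bytes(n, n)
print(f"{elements:.0e} elements, {size:.0e} bytes, {size / 10**12:.0f} TB")
```

    For 10^14 elements that comes to 8 x 10^14 bytes, i.e. 800 terabytes under the dense-storage assumption.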

    Ladies & gentlemen, this is the reason why no one has been able to achieve real-time search in today's environment. That doesn't mean it is undoable; the technology will be there soon, but it isn't yet.

    So, I say to Twitter or anyone who worships Twitter, that twitter is not real-time (in the sense of feature computations, LSI, PageRank, etc,...). I believe that they're misleading their followers and readers. I say it again, don't fall into the hype of Twitter real-time search because it ain't real-time. I have explained why realtime can't be true by summarizing the process involved.

    Posted by: Falafulu Fisi | May 19, 2009 2:08 PM



  11. Search experts are really smart dudes - that's why my bet is on Google, and why Google requires PhD's for those positions. It is heavy math, heavy computation, and extremely strong theory. It's very tough not to get that in school. His comment is beyond my knowledge, but also why I don't try to do search.

    Posted by: Jesse Stay Posted on FriendFeed   | May 19, 2009 2:36 PM



  12. @Falafulu Fisi

    Dude, regardless of all the crap you just wrote, one fact remains:

    If i do a search for #realTimeSearch on twitter, it will show me the most recent tweets containing that hashtag. It also alerts me when new updates come in. To test, I can tweet something with that hashtag, and it shows up in a matter of seconds.

    If I write a post on my blog, then a few seconds later go to google and search for the entry I just made, I don't see it.

    When I can search twitter for 'earthquake' and read news about a recent earthquake before it shows up on news sites, or in google. That is a real time search. Despite what your math says.

    Posted by: RTS | May 19, 2009 2:42 PM



  13. Perhaps I should try to clarify my position here. I am not disputing that Twitter is indexing raw data in real time, because the problem doesn't lie in indexing data in real time. The problem lies in doing feature computation on the dataset in real time. You can just index the raw data without doing feature computation. The downside is that your retrieval precision & recall are worse, which makes the results less useful. If you do feature computation, then your retrieval relevancy increases, but this comes at a cost, i.e., you can't do it in real time, unless someone has developed a quantum computer to do the crunching. I guess what Twitter is doing is just indexing the raw data itself, and the searches run on this data, where relevancy is not important as long as new documents are being captured by the system. On the other hand, Google uses the power iteration method to do the PageRank feature computation, which is impossible to run in real time.
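    For reference, a minimal sketch of the textbook power iteration for PageRank on a three-page toy graph. The damping factor 0.85, the tolerance and the graph itself are assumptions for illustration; this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration over a dict {page: [pages it links to]}.

    Every link target must also appear as a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new[target] += damping * rank[page] / n
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank


web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(ranks)
```

    In this batch form, every pass touches every link in the graph, so adding one page means running the loop again over the whole graph.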

    Posted by: Falafulu Fisi | May 19, 2009 2:54 PM



  14. I agree with Falafulu Fisi, it's unfeasible to recalculate some "classical" search engine measures such as PR. But, on the other hand, real time search could point in another direction.

    Real time search is only "critical" for pieces of information which depend heavily on the time they were submitted. As an example, imagine you write in twitter about a TV show you are watching. It only matters for a short period of time (for example, the time the TV show is on air) and it's not interesting enough to calculate some measures (e.g. PR or duplicate detection) which are time consuming. The information matters now or never.

    Of course, a lot of problems arise. If we only search text and don't calculate anything... How can we filter? How can we rank? I think this is the real challenge. Adding to a database and searching is not the most difficult part of the problem, but how much can we refine the results and still be in "real time"?

    Another challenge is how to distinguish the information which must be treated this way from the documents on which we need to run some extra algorithms (i.e. usual web documents).


    And, of course, it can be just a stand-in for a full search algorithm run, which can be performed later to refine the results even more.

    @RTS

    At which position in the ranking must your blog post appear? On the Internet, being the latest does not mean being the most useful or important resource.

    In twitter the answer is easy (ordered by the time it was written) because twitter information is relevant for only a short span of time, but Google must show web documents written some years ago which are perfectly valid.

    Posted by: Brenes | May 19, 2009 2:59 PM



  15. http://www.yauba.com works well for me.

    All the benefits of Google and all the benefits of twitter search in one click.

    Posted by: Benares Joe | May 19, 2009 3:07 PM



  16. RTS ,

    It is obvious that you're a Twitter worshiper.

    You say...
    If i do a search for #realTimeSearch on twitter, it will show me the most recent tweets containing that hashtag. It also alerts me when new updates come in.

    But that's not what search is. An alert is not search.

    You say...
    To test, I can tweet something with that hashtag, and it shows up in a matter of seconds.

    Again, that is not what search is; you've just described what instant messaging is.

    You say...
    If I write a post on my blog, then a few seconds later go to google and search for the entry I just made, I don't see it.

    Again, this is not search. You have to differentiate real-time indexing and real-time feature computation. See my message above.

    You say...
    When I can search twitter for 'earthquake' and read news about a recent earthquake before it shows up on news sites, or in google. That is a real time search.

    No, that is real-time indexing, because the fact of the matter is that you're not benchmarking your Twitter results against Google to find out whether there are more relevant documents Twitter might have missed. Search is about relevancy, not recency.

    If you want recency, then tune in to a radio.

    Posted by: Falafulu Fisi | May 19, 2009 3:07 PM



  17. Falafulu, I would argue that: real time search = recency X probably a new form of relevancy X probably a new form of authority

    Posted by: Marshall Kirkpatrick | May 19, 2009 3:10 PM



  18. @Falafulu Fisi

    "Page is mistaken here. If Page & researchers at Google can dig further to what Twitter is doing, such as benchmarking the Twitter search against the same dataset that Google has indexed, they would be in for a surprise. The benchmarking would widely expose the so called Twitter real-time search. Inefficient and low recall & precision."

    Well, recall and precision are usually seen in an inverse type relationship, so saying that both are low might be misleading to readers; furthermore, you can't state that Twitter search has low recall and low precision without knowing how they (Twitter) quantify recall and precision.

    If the goal of Twitter is to return a word match query, which it appears to be doing, its recall and precision are very high. Do a Twitter search for 'google' and you will find everything in its universe that has the word 'google' in it. By definition, that's high precision and recall.
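    The definitions at stake can be sketched in a few lines. The three tweets are made up, and the point of the example is the assumption in the middle: when "relevant" is defined as "contains the query word", a keyword match scores perfectly on both measures, and the numbers change only when relevance is defined some other way.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical tweets, keyed by id.
tweets = {
    1: "google launches new search feature",
    2: "lunch time",
    3: "is google threatened by twitter",
}

# What a keyword match returns for the query 'google'.
retrieved = {i for i, text in tweets.items() if "google" in text.split()}

# Assumption A: relevant means "contains the query word".
relevant_by_keyword = set(retrieved)
print(precision_recall(retrieved, relevant_by_keyword))

# Assumption B: a human judge says only tweet 1 is really about Google search.
relevant_by_judge = {1}
print(precision_recall(retrieved, relevant_by_judge))
```

    Under assumption A both values are 1.0; under assumption B recall stays at 1.0 while precision drops to 0.5.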

    Comparing precision and recall on different datasets, Twitter v. Google, with the same criteria for what constitutes precision and recall is a terrible mistake. What is required of precision and recall in a super clean 140 character piece of text, and a multi-kilobyte poorly formatted HTML document are two totally different universes.

    "So, I say to Twitter or anyone who worships Twitter, that twitter is not real-time (in the sense of feature computations, LSI, PageRank, etc,...). I believe that they're misleading their followers and readers."

    Another misleading and mostly pointless argument... Saying that Twitter is not real-time in doing something it ISN'T doing is... pointless. What Twitter is doing in real-time is exactly what they, and the users, expect/want.

    However, for the most part, I agree with the rest of your analysis. Google will have a tough time doing feature crunching in real-time.

    Thanks for the interesting discussion though :)

    Posted by: ebusta | May 19, 2009 3:18 PM



  19. Marshall, yes, I agree that recency is very important, and this is why Larry Page is focusing research at Google on it, but they're not going to do it in the simple way that Twitter is doing it, because that sacrifices relevancy. I wouldn't be surprised if they try to add a time-stamp dimension to PageRank in a single computational framework that still achieves high relevancy and recency, rather than taking the relevancy output of PageRank (& LSI) and then sorting the retrieval list by time-stamp afterwards. One possibility, which I have talked about before here and elsewhere, is using tensor calculus. Google researchers are already aware of tensors, because they attended the Stanford Workshop on Modern Massive Datasets in 2006 & 2007, where several papers on tensors were presented. Google will do it the proper way, i.e., derive a formal algorithm that doesn't compromise relevancy while still getting results where relevancy matters.

    I don't think that Google sees any real technology at all in what Twitter is doing. Put it this way: the way Twitter indexes information in real time is no different from an IR system in a public library, where whenever the librarian enters new titles into the system, you get those new titles instantly at whichever library terminal you're using for your book search. I mean, it is not something totally new. We only hear about it because someone else (Twitter) is doing what has been available in public library IR systems for years. Google already knows how to do this, and so does every IR software house out there. The only question now is how Google will do it better by integrating document time-stamps into its PageRank.

    I find that Twitter is hyping its real-time for no reason other than to raise the expectations of other companies (that might want to acquire them) or of its investors that they have found something that Google's (or Microsoft's) army of PhDs has missed. This makes people want to buy them for more than a billion $. If someone wants to buy them, it is because of the Twitter brand, not Twitter technology.

    Posted by: Falafulu Fisi | May 19, 2009 3:43 PM



  20. @falafulu, et al. yes, it's not latent semantics, yes it's not clear what the precision/relevance trade off is, and twitter will never be as awesome as google... But.. maybe the problem is calling this 'search' in the first place. It only bears a family resemblance to what one would do on google, and is more akin to tuning and filtering than to searching a vast sea of possibilities. tweets aren't possibilities: they are ephemeral bits of conversation that i'm straining to listen to when i go to search.twitter.com. when i do a google search, on the other hand, i am trying to identify information from a vast heterogeneous set of media that are most certainly not ephemeral. these two possibilities are similar, but i would use them under completely different circumstances. comparing them from an engineering perspective is beside the point, and largely irrelevant considering that relevancy/precision and 'meaning' are going to be traded off with immediacy and pattern-finding based on the circumstances. it would be great to know...

    Posted by: Arvind Venkataramani Posted on FriendFeed   | May 19, 2009 3:45 PM



  21. Interesting. But I believe "already indexed 2 minutes later" is not so real-time. Fast, but don't call it real time.

    Posted by: MacStories | May 19, 2009 4:08 PM



  22. @MacStories

    I don't need to reach deep into my time theory pockets to remind you that "real-time" is a term that denotes a relative perceived currency. "Now" doesn't truly exist because "now" has already existed for some infinitesimally small amount of time. Nothing can truly be "real-time" but that doesn't stop us from qualifying it as something extremely recent. Now that we can say that "real-time" doesn't exist, we can move to more important discussions about what constitutes "extremely recent". Having experience in search algorithms and how these things work programmatically, 2 minutes is not that bad. It's not great, but it's pretty damn good for what is going on under the hood.

    Posted by: ebusta | May 19, 2009 5:26 PM



  23. funny story: i met with larry page in 2005 for about half an hour and one of the items i raised was the need for an "uber" interface that pushed - real time, as things are published - all media on the searched topic to the interface. this includes pictures, video, blog comments, chat room mentions, usenet references, etc. the ui might be something like what some financial news television channels display.

    he responded with a question: isn't that google alerts?

    now i rely on google alerts for so much, and love the service. but, mmm, not really.

    Posted by: rick | May 19, 2009 8:55 PM



  24. Falafulu, you are spot on!

    Posted by: Marc | May 19, 2009 9:31 PM



  25. Yesterday I was quite surprised to post on Identi.ca and find my dent in Google results minutes later. So at least Google indexes the microblogging sites in near-real time - but I wonder how many other sites benefit from that sort of attention, and how they are selected.

    Posted by: Jean-Marc Liotier Posted on FriendFeed   | May 19, 2009 11:57 PM



  26. Going to real-time will be a tremendous change as Google has been heading to document and store the past of the world since their beginnings: Google Search, Google Earth, Street View, and all the characteristics of their users based on previous behavior.

    Instead of storing billions of pictures or unstructured data Google should have invested into real-time data.
    This is a big change in direction for Google.

    Posted by: LEADSExplorer | May 20, 2009 12:41 AM



  27. Ahh..linear algebra and LSI...What feature computation do we expect from a 140 char message ? What dimensionality reduction are we really looking at ? Why are we mapping one approach to a different problem ? Is Twitter search PageRank[link analysis] driven ? Do retweets/followers indicate "links"? And come on, LSI is no magic bullet the way someone here has repeatedly harping in his comments to earlier articles.

    Twitter performs a keyword match fairly well, use of temporal features is clearly seen. If you want to do the LSI stuff and compress features, I dont think its a difficult problem to do it offline and complement the "online" index gradually. Btw, I would appreciate a link to the report where Google has admitted using LSI. All that I found were many seo sites talking abt it.

    Posted by: 42 | May 20, 2009 1:25 AM



  28. Quick and dirty fix: Have a sort by relevance and a sort by freshness feature.
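    A minimal sketch of that two-button approach, with made-up results and scores: the same result list is simply ordered by a different key depending on what the user picks. The field names and the `sort_results` helper are hypothetical.

```python
from datetime import datetime

# Hypothetical search results; the "score" values are made up for illustration.
results = [
    {"title": "Established overview", "score": 0.92, "published": datetime(2007, 5, 1)},
    {"title": "Breaking report", "score": 0.40, "published": datetime(2009, 5, 19)},
    {"title": "Recent analysis", "score": 0.75, "published": datetime(2009, 5, 18)},
]


def sort_results(items, by="relevance"):
    """Order results by relevance score or by publication time, highest first."""
    key = "score" if by == "relevance" else "published"
    return sorted(items, key=lambda r: r[key], reverse=True)


print([r["title"] for r in sort_results(results, by="relevance")])
print([r["title"] for r in sort_results(results, by="freshness")])
```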

    The results seem way too different to be sorted using the same algorithm. But I've been wrong before. (fyi, twitter plans to twitrank its tweets)

    Posted by: H M Elius Posted on FriendFeed   | May 20, 2009 2:30 AM



  29. @Falafulu Fisi

    I'm no twitter worshiper. What I'm saying, is that 'Real Time Search' is a concept that people care about, not what technologically is happening in the backend.

    All an end user needs to know about google is they type a keyword, and get results about that keyword.

    On search.twitter.com they have the same experience. They type a keyword, and get results about that keyword. To them, it is search. It's what differentiates a user from simply seeing all the most recent tweets, and having to manually read through all of them to find what they want.

    Posted by: RTS | May 20, 2009 10:56 AM



  30. A sort-by-freshness feature is not a bad idea; for some search queries this could be useful.

    Posted by: Earrings | May 20, 2009 11:36 AM



  31. Great article.

    Posted by: Guias Local Posted on FriendFeed   | May 20, 2009 12:05 PM



  32. Wow, I think you hit that nail right on the head dude!

    RT
    www.whos-watching.se.tc

    Posted by: Spammy | May 20, 2009 7:08 PM



  33. 42 said...
    What feature computation do we expect from a 140 char message ?

    It is not the number of characters, it is the number of messages available at a single time, with a new one appearing perhaps every second. Multiply the messages accumulated since Twitter started a few years ago by the total number of potential English words in the dictionary. Let's say that there are 25 million (25,000,000) messages by 80,000 words. Can you calculate the number of bytes that need to be held in memory during computation? It is huge, isn't it?
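    The byte count asked for here can be sketched under two storage assumptions. The dense figure follows directly from the numbers above with 8-byte doubles. The sparse figure rests on assumptions that are not in the comment: at most 30 distinct words per 140-character message, and 12 bytes per stored non-zero entry.

```python
messages = 25_000_000
vocabulary = 80_000
bytes_per_double = 8

# Dense storage: one double for every (message, word) pair.
dense_bytes = messages * vocabulary * bytes_per_double

# Sparse storage, ASSUMING at most 30 distinct words per message and
# a 4-byte column index plus an 8-byte double per non-zero entry.
assumed_words_per_message = 30
sparse_bytes = messages * assumed_words_per_message * (4 + bytes_per_double)

print(f"dense:  {dense_bytes / 10**12:.0f} TB")
print(f"sparse: {sparse_bytes / 10**9:.0f} GB")
```

    That is 16 TB if every cell is stored, against 9 GB if only the non-zero cells are kept under the stated assumptions.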

    42 said...
    And come on, LSI is no magic bullet the way someone here has repeatedly harping in his comments to earlier articles.

    NO it is magic bullet, and also you entirely missed the point. The point is that LSI can outperform whatever Twitter is using, period. Some on the internet, including Twitter users, have described Twitter real-time search as some form of breakthrough, which it is not at all. One of the breakthroughs today is tensor-based information retrieval, i.e., the state of the art. Twitter real-time search is not state of the art, period.

    BTW, there are different variants of LSI (linear & non-linear), if you don't know that, which means that some LSI variants perform better than others. There are also LSI-based algorithms which are online (update in real time) and some which are offline (batch update). Even if your search engine is using one variant of LSI, I can still compete with you if I use a faster or more accurate variant of LSI.


    42 said...
    Btw, I would appreciate a link to the report where Google has admitted using LSI.

    I came across it here : Google Latent Semantic Indexing

    My argument here is: if Twitter is real-time, then so what? All the talk that Google or Microsoft should buy Twitter because of that simply makes it out that Twitter has invented something revolutionary that the army of PhDs at Microsoft and Google has missed. Anyone can do what Twitter is doing, even Joe the Plumber.

    Posted by: Falafulu Fisi | May 20, 2009 10:16 PM



  34. I agree about the "real time" search hype. I am sure that folks at Google and MSFT are capable of seeing through it, and I seriously doubt that they see "real time search" as one of the principal reasons to buy Twitter.


    Falafulu Fisi said: Large number of bytes..
    If the point is still about computing features, do you really need to hold all of that "in memory"? We can precompute features from the previous message index using MapReduce or Hadoop and augment the features every "n" intervals.
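
One way to picture the "precompute, then augment every n intervals" idea is the single-process sketch below. It is a stand-in, not a MapReduce or Hadoop job: the class name `DocumentFrequencies`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration. The point it shows is that each new batch of messages only updates running counts, so the full history never has to be reloaded.

```python
import math
from collections import Counter


class DocumentFrequencies:
    """Running document-frequency counts that are folded in batch by batch."""

    def __init__(self):
        self.doc_count = 0
        self.df = Counter()   # term -> number of messages containing it

    def fold_in(self, batch):
        """Augment the counts with one interval's worth of new messages."""
        for text in batch:
            self.doc_count += 1
            # Count each term once per message (naive whitespace tokenizer).
            self.df.update(set(text.lower().split()))

    def idf(self, term):
        """Smoothed inverse document frequency from the current counts."""
        return math.log((1 + self.doc_count) / (1 + self.df[term.lower()])) + 1


stats = DocumentFrequencies()
stats.fold_in(["google real time", "twitter real time search"])  # precomputed index
stats.fold_in(["google search"])                                  # next interval
print(stats.doc_count, stats.df["google"], round(stats.idf("twitter"), 3))
```

In a distributed setting the per-batch counting would be the map and reduce steps, and merging the resulting counters into the stored totals would be the periodic augmentation.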

    Falafulu Fisi said: NO it is magic bullet
    Good to hear that. I am sure Google will crank up a superior Wolfram Alpha using this "magic bullet" and its variants, which are dimension agnostic, extract semantics through classification and clustering, and produce wonders. Perhaps like the "intelligent, ground breaking" suggestions that Google offers at the bottom of the search results? What broad, in-depth coverage of related concepts around the terms!! Real AI!! LOL. Guess they should take your help with your suggested variants in this process; at least you rattle off theory very well.

    My point about Twitter was that they seem to be doing well with the current search they have. Users do not tweet to "search" and they don't care about the backend. If Twitter has been capable of pushing the marketing gimmick, so be it, as long as it delivers "real time" fresh results.

    Is LSI really necessary? Can we design solutions to fit the features, peculiarities and constraints of the problem? A tweet is about a "context": a short, pointed message. What are your thoughts on hashtags? Given that a tweet is constrained to be short, you won't have the same problem as general tagging for websites and blogs. It is simple, stupid, and could also function as an interesting feature to compress the T*D space.
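
To make the hashtag suggestion concrete, here is a minimal sketch of an inverted index keyed on hashtags rather than on every word, with results returned newest first. The class `HashtagIndex`, its methods, and the sample tweets are hypothetical; nothing here describes how Twitter search actually works.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")


class HashtagIndex:
    """Maps each hashtag to the tweets that carry it."""

    def __init__(self):
        # tag -> list of (timestamp, tweet_id)
        self._postings = defaultdict(list)

    def add(self, tweet_id, timestamp, text):
        """Index one tweet under each distinct hashtag it contains."""
        for tag in set(HASHTAG.findall(text.lower())):
            self._postings[tag].append((timestamp, tweet_id))

    def search(self, tag, limit=10):
        """Return up to `limit` tweet ids for a tag, newest first."""
        key = tag.lower().lstrip("#")
        hits = sorted(self._postings.get(key, []), reverse=True)
        return [tweet_id for _, tweet_id in hits[:limit]]


index = HashtagIndex()
index.add(1, 100, "Watching the #Zeitgeist keynote")
index.add(2, 105, "Page on real time search #zeitgeist #google")
index.add(3, 101, "No tags in this one")
print(index.search("#Zeitgeist"))
```

The index has one key per hashtag in use instead of one column per dictionary word, which is the sense in which hashtags could shrink the term-by-document space. The trade-off is that untagged tweets, like the third one above, are invisible to it.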

    Posted by: 42 | May 21, 2009 12:58 PM



  35. 42 said...
    Do you really need to hold all of that "in memory" ??

    Not necessarily, but even with the high-performance, parallel (block) matrix decomposition that Google is running on its hundreds of thousands of clusters, it is clear that they don't recompute PageRank every second, do they? If they did, we would see their search results in real time.

    42 said...
    Guess they should take your help with your suggested variants in this process, at least you rattle off theory very well.

    They don't need my help; they're already aware of the potential of tensors. WHY? Because some of their researchers attended the following workshop at Stanford in 2006, where several papers on tensors were presented:

    Papers on Tensors and Their Use in Information Retrieval

    Another interesting paper on Tensor LSI is found here, if you're curious about what this beast called a tensor is. It shows that LSI has been tensorised, and I wouldn't be surprised if Google is working on a tensorised PageRank (my pure speculation), where the time-stamp of a document could be used as another dimension, such as [inbound x outbound x time]. Such a 3D or 3-way tensor PageRank could capture both document authority and recency simultaneously (in one framework). You should read it (free PDF) so that you can understand what I am talking about.

    Tensor Space Model for Document Analysis
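
The [inbound x outbound x time] idea can be illustrated with a much simpler stand-in than a tensor decomposition. The sketch below treats links as (source, target, timestamp) triples, collapses the time dimension with an exponential decay so that older links count for less, and then runs ordinary PageRank power iteration on the weighted graph. The function name, the half-life parameter, and the decay scheme are assumptions for illustration only; this is not the method of the linked paper and not anything Google is known to use.

```python
def time_weighted_pagerank(links, n_pages, now, half_life=7.0,
                           damping=0.85, iterations=50):
    """PageRank over links whose weight halves every `half_life` time units.

    links: iterable of (source, target, timestamp) triples.
    """
    # Collapse the time mode: weights[source][target] sums the decayed links.
    weights = [[0.0] * n_pages for _ in range(n_pages)]
    for source, target, timestamp in links:
        age = now - timestamp
        weights[source][target] += 0.5 ** (age / half_life)

    rank = [1.0 / n_pages] * n_pages
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n_pages] * n_pages
        for source in range(n_pages):
            out_total = sum(weights[source])
            if out_total == 0.0:
                # A page with no outbound links spreads its rank evenly.
                share = damping * rank[source] / n_pages
                for target in range(n_pages):
                    new_rank[target] += share
            else:
                for target in range(n_pages):
                    w = weights[source][target]
                    if w:
                        new_rank[target] += damping * rank[source] * w / out_total
        rank = new_rank
    return rank


# Page 0 links to page 1 today and linked to page 2 seventy days ago.
ranks = time_weighted_pagerank([(0, 1, 100.0), (0, 2, 30.0)], n_pages=3, now=100.0)
print([round(r, 3) for r in ranks])
```

With these inputs the freshly linked page ends up ranked above the page whose only inbound link is old, which is the "authority plus recency in one framework" behaviour the comment speculates about, reached here by a cruder route.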

    42 said...
    Is LSI really necessary ?

    Not necessarily, but if you care about relevancy, then you should use it. With the explosion of new messages on Twitter every second, the relevancy of their current search will deteriorate over time, and they will have no choice but to use LSI.

    Posted by: Falafulu Fisi | May 21, 2009 2:23 PM




  36. Real time search panel video including Google, Yahoo, Twitter folks

    http://jellytalks.yahoo.com/

    Posted by: Sean | May 21, 2009 5:19 PM



  37. Man, I really didn't know any of what has been discussed in the comments above... But I would say some real experts are talking. Guess the article was less relevant than the discussion...
    Thanks for enhancing my semantics, especially Falafulu Fisi!!!

    Posted by: Rock | May 21, 2009 11:25 PM



  38. Great article, Marshall, and interesting analysis by Falafulu.

    Posted by: Neo | May 21, 2009 11:37 PM



  39. Happy to see that Google will be developing real time search. Consumers always want the latest breaking news and information, which can be easily found within search requests and having real time search will enable this. It will also benefit organisations in circulating the latest buzz more effectively.

    Posted by: Zoe Sands Posted on FriendFeed   | May 24, 2009 11:31 AM



  40. I definitely agree that Google really needs to make strides in this area.

    I would also like to see them make their custom search engine service as reliable with results as their main search engine.

    This would truly spawn a multitude of niche search engines.

    Posted by: Free People Search | May 27, 2009 11:08 PM



  41. I think that Google will start to feel pressure with Twitter's new real-time searches.

    Posted by: Google People | June 15, 2009 5:42 PM



  42. We offer Nepal trekking, hiking trips, white water rafting, jungle safari, bird watching, cultural tours, expeditions and mountaineering, and we also deliver trips out to Tibet, Bhutan and India.

    Posted by: shankar pandey | July 7, 2009 4:36 AM



  43. It's clear that Twitter in particular, and the real time web of status updates in general (most popular on Facebook), is changing the direction Google is going.

    Posted by: ugg | September 10, 2009 1:13 AM



  44. I think Google will certainly have to watch its back when it comes to Twitter; I think it will have to revamp some of its searches to keep ahead of the competition.

    Posted by: hunger suppressants | September 24, 2009 3:32 AM



  45. Only one word to characterize such a great post: “WOW”. That was a very interesting read.
    Uggs on sale

    Posted by: anny | October 23, 2009 2:53 AM



  46. Is there any other search engine which does it? I suppose Scoopler is one, but aren't these only searching the Google and Twitter databases at the end of the day?

    Posted by: Neytri | November 16, 2009 9:58 AM



  47. I think Google will certainly have to watch its back when it comes to Twitter

    Posted by: 007.87id.com | December 17, 2009 12:27 AM


