After my post about Personalized Megite, I got taken to task by both Gabe Rivera from Memeorandum and Nik Cubrilovic from OmniDrive - two developers who have had a lot of experience trying to develop such systems. As Gabe wrote in Scoble's comments:
"I agree with Nik that there’s a huge technical chasm to cross before a “personalized meme tracker” gets really useful. I think progress I make on the memeorandum engine is approaching that, but it’s still far off enough that I’ll pass on hyping it for now."
Gabe then proceeded to give me an earful in a Skype conversation about the issue ;-) I was also interested to read Greg Linden's thoughts on the matter, as he is another very smart developer with experience in this domain. Actually Greg seemed to like my suggestion of introducing clustering to Findory, which would definitely get me using it more.
Because let's face it, Personalization + Clustering is the next big step in RSS. If 2005 was about Aggregation, then 2006 is all about Filtering.
Nik wrote up his thoughts today, in a post entitled Memetracking Attempts at Old Issues. While he mentions lack of link data as being an issue, it seems to me the crux of the problem is this:
"generating a personal view of the web for each and every person is computationally expensive and thus does not scale, at all."
He goes on to say that "this is why you don’t have personalized Google results – we just don’t have the CPU cycles to care about you."
So it's mainly a computational and scaling problem. Damn hardware.
Nevertheless, there is a big demand for personalized clustering - among the edge cases, it must be said. And Megite and TailRank are both trying to capture that demand, which to be frank I'm very pleased about. I understand why Gabe and Nik don't want a bar of it, but there are lots of squeezed bloggers out there who are desperate for a good RSS filtering solution. The first web app that solves this, or at least gives me decent filtered RSS feeds, is going to get my business for sure.
Comments
Is this where desktop apps trump hosted apps? I've got plenty of CPU cycles in my basement. Of course, the problem is gathering all the data worth analyzing.
Maybe I'm not quite understanding the term "Personalized Clustering". I've got a script [1] that scans through about 1000 RSS feeds from my subscriptions, extracts all the links, then groups and sorts the links by number of entries containing the links. This gives me a pretty decent overview of the day's memes from the perspective of people I watch.
Now, this script is simpleminded and doesn't pull in any other factors like "blog authority" or popularity amongst sites not included in my subscription list, but it's been pretty useful as my first daily stop.
[1]: http://decafbad.com/blog/2006/01/03/sharing-attention-while-reading-feeds (a script)
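For anyone who wants the gist without reading the linked post, here is a minimal sketch of that kind of link counting. It is not the script linked above. The entry format (a dict with `feed` and `content` keys) and the function names are assumptions made for illustration, and fetching and parsing the feeds themselves is left out.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def links_in(html):
    """Return the set of distinct link targets found in one entry body."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def rank_links(entries):
    """Return (link, entry_count) pairs, most-cited link first."""
    counts = Counter()
    for entry in entries:
        # links_in returns a set, so a link repeated inside one entry counts once.
        counts.update(links_in(entry["content"]))
    # Sort by descending count, then by URL so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical entries standing in for items pulled from subscribed feeds.
entries = [
    {"feed": "blog-a", "content": '<p>See <a href="http://example.com/x">this</a></p>'},
    {"feed": "blog-b", "content": '<a href="http://example.com/x">x</a> and <a href="http://example.com/y">y</a>'},
    {"feed": "blog-c", "content": '<a href="http://example.com/x">x</a> <a href="http://example.com/x">again</a>'},
]
ranked = rank_links(entries)
```

Counting entries rather than raw link occurrences keeps one link-heavy post from dominating the list.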
Yep... it seems as if we had totally forgotten the hype that came before syndication, blogs and RSS: p2p. Of course it would be hard to make money from. At least it would be harder than with a simple web site, but it would solve the CPU and the bandwidth problem. Do collaborative filtering with a p2p architecture. Let my peer swap data with others (similar users, friends, FOAFs, etc.)
Saying that personalization is too computationally hard is a copout. You don't have to personalize for every single person who visits your site. You can personalize just enough. No site is going to have 100% unique users needing exact personalization. This is why sites like memeorandum work in the first place, because my interests intersect with a large group of other users. Just break that down a bit and determine if your user likes Blog news and Enterprise news, etc. and give them filtered results based on groups of interests.
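A rough sketch of the "personalize just enough" idea: build one filtered view per distinct set of interests and reuse it for every user who shares that set. The story list, topic labels and function names here are all made up for illustration.

```python
from functools import lru_cache

# Hypothetical (title, topic) pairs standing in for the day's clustered stories.
STORIES = (
    ("Flock adds clustering", "blogs"),
    ("New ERP release", "enterprise"),
    ("Gadget roundup", "gadgets"),
)

render_count = 0  # how many times a view was actually built


@lru_cache(maxsize=None)
def view_for(interests):
    """Build the filtered story list for one frozenset of interests."""
    global render_count
    render_count += 1
    return tuple(title for title, topic in STORIES if topic in interests)


def view_for_user(user_interests):
    # frozenset makes the interest group hashable and order-independent,
    # so users with the same interests share one cached view.
    return view_for(frozenset(user_interests))


users = {
    "alice": ["blogs", "enterprise"],
    "bob": ["enterprise", "blogs"],
    "carol": ["gadgets"],
}
views = {name: view_for_user(tags) for name, tags in users.items()}
```

Three users, but only two views get built, because alice and bob fall into the same interest group. The cost grows with the number of distinct groups, not the number of users.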
I think the keys to personalized clustering are limited expectations and incremental improvements. No, you can't get an app that reads memes for you, categorizes them and alerts you based on relevance (those apps do exist today, but they have a phenomenal cost in terms of hardware and software). But you can set up some more basic filtering and categorization tools and make improvements from there. Something is generally considered to be better than nothing.
I believe that there are less hardware-intensive methods of personalization and clustering that can be developed without a huge resource investment. There have been some intriguing developments in data mining and data warehousing recently that give me hope that handling the massive amounts of data necessary to develop a personalized meme tracker isn't too far away. I think the biggest issue is that people look at the challenges from an established perspective and say "Eh...that's too difficult." If more people develop applications, it will inspire others to come along.
(It's early, I apologize for any typos or spelling mistakes. And I'm not sure how to add newlines in this forum.)
The desktop is the way to go. But someone needs to figure out a good business model for that. Shareware may not work. Advertising may not work either. Maybe sponsorship.
Alright this has been killing me every time I've seen a post on Memeorandum for the last few months, but I can't hold back any longer (as much as we have been trying to avoid creating more hype). I built this exact feature for Flock; it's been in the hourly test builds since the beginning of December and will be in the version released real soon now.
As for the problems you cite, you're absolutely right - link data isn't good enough for personalized feeds, so while we try to use it when it's there, the bulk of the clustering is done using text analysis. And because you can cluster any arbitrary set of RSS feeds, we do the clustering ON-DEMAND, so efficiency is a big deal (and we manage - for most views it happens on my laptop in 1 or 2 seconds - feels more or less like loading any other web page).
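To give a feel for what clustering by text rather than links can look like, here is a toy sketch. It is not Flock's method, which isn't described here. It assumes plain word counts, cosine similarity and a greedy single pass, and the 0.5 threshold is an arbitrary choice for the example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Lowercased word counts; a crude stand-in for real text analysis."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(posts, threshold=0.5):
    """Greedy single-pass clustering.

    Each post joins the first cluster whose seed post is similar enough,
    otherwise it starts a new cluster with itself as the seed.
    """
    clusters = []  # list of (seed_vector, member_posts)
    for post in posts:
        vec = vectorize(post)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(post)
                break
        else:
            clusters.append((vec, [post]))
    return [members for _, members in clusters]


posts = [
    "Google launches personalized search beta",
    "Personalized search beta launched by Google",
    "Flock ships new browser build",
]
groups = cluster(posts)
```

The first two posts share no link but land in one cluster because they share most of their words. A single pass like this is cheap enough to run on demand over a few hundred entries, though anything serious would want better term weighting than raw counts.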
I personally don't think the scaling problem is intractable, but it's difficult, and a desktop version is a liberating avenue to explore. Memeorandum is designed to run on the desktop, so I can with little effort release such a beast for all to personalize. The problem is that I doubt it will see much adoption outside of the memetracker-obsessed digerati, due to the problems associated with limited metadata and limited content. (Let's even forget for a moment that few people know what OPML is.)
Ari's doing text analysis in Flock. That definitely helps, but my guess is not enough. On a scale of 1-10 in "personal memetracker" usefulness, link analysis gives 2 points, state-of-the-art text analysis gives another 2 points, but that's only 4/10 when most people demand higher. Yes, I just made up those numbers, but it's one way to illustrate my point.
If I'm wrong, and "personal memetrackers" make the cover of Time Magazine, well, I'll just release my desktop client.
Seems like Flock are doing the two things that need to happen:
1. Pushing the calculation load onto the client
2. Using contextual analysis to overcome the lack of links problem
I look forward to trying it out as that approach seems to make a lot more sense, but I believe that a destination site such as memeorandum can live alongside having relevance in your desktop aggregator - two different things. Ari it seems you guys can pull this off well and create a decent aggregator (with 147 feeds in my bloglines, I usually skip whole sections of posts because I can't keep up).
Contextual analysis is needed in the current crop of memetrackers. Look at TailRank today and you will see that the same story appears in a few separate clusters, and both Megite and memeorandum sometimes have the same problem, though some of them solve it better than others. Relying just on links isn't even working well for the bigger stories.
I find the quote "this is why you don’t have personalized Google results – we just don’t have the CPU cycles to care about you" a bit strange considering that Google has a service (beta, obviously) called "Personalized Search", which does the following: "Get the results most relevant to you, based on what you’ve searched for in the past"
More information here: http://www.google.com/support/bin/answer.py?answer=26651&topic=1593
Am I missing something subtle here?
"generating a personal view of the web for each and every person is computationally expensive and thus does not scale, at all."
I would guess these people don't have more than passing knowledge of database systems. I would say what is being proposed here is trivial.
I create a personalized view for each and every member on my site on each and every pageview. I peak out at 1 million pageviews/hour, so what I'm doing is several orders of magnitude more complex than anything proposed here.
Great post. The key is scalability. I've given a great deal of thought to personalized recommendations over the years. Personalized recommendation is where I think TailRank is really going to shine and I do believe we've solved the scalability issue. Just because it's a hard problem doesn't mean it lacks a solution. Of course I intend to put my money where my mouth is, so stay tuned.
One reason some of the memetrackers have entered the market so fast is that they have a somewhat trivial problem. All they have to do is bake one static HTML page and serve the same page to all their readers. This is a somewhat easy problem. TailRank's architecture is a bit more complicated than this obviously and when memetrackers start to be used by more and more people TailRank will be able to scale.
Scalable search is hard but for the most part Google solved it.
Maybe TailRank is the Google of memetrackers. God I hope so :)
Anyway. Wanted to follow up with a comment, but I have a blog post on this pending soon. Just too swamped with work right now to write the post :-/
Onward!
Kevin
I love it when developers try to 'out-complex' one another. My app is more complicated than yours, heh heh. Markus, your comment was particularly enjoyable.
I thought you'd like that :). It puts it in perspective though: what is being discussed here isn't that complex. When it comes to complexity, Google and many others have us beat hands down.
Well I think it's definitely a hard problem though Markus, otherwise I would have a plethora of personalized news trackers to choose from. :-)
Gabe: agreed...I like to think that we get a bit better than "4/10" but you're definitely right that it's far from perfect, and in fact I did some (fairly unscientific) testing on interannotator agreement and found that even humans don't quite do this perfectly (it was about 95% agreement).
Nik: I hope we can make a good aggregator, but don't get your hopes up too high, we still have a lot of work to do (even I still rely on bloglines...at the very least we need to finish adding in read/unread which we have code for but won't quite be ready for this release). You're definitely right that there is room for both Memeorandum-style and this personalized style to coexist: I'd say the Memeorandum style is more accessible (you don't need to go through and pick your own feeds to read, you just get a good view of what's going on as a whole) but for some people there is that need to read every post on a certain set of feeds you like which is where I think the personalized ones fit in.
Aapo: ugg, I hate this personalized search thing. Useful personalized search would be if you're searching for something that you saw once before and it says "here's that page you were looking at last week about this". Personalized search that a lot of people love to talk about is "you're a tech person, so when you search for Windows you want Microsoft". Dammit, people interested in technology still need to get their windows washed every once in a while. Personalization is not the solution to query disambiguation and I wish people would stop fixating on this. Yahoo has the right idea with its query refinement suggestions: "Also try: jaguar cars, jaguar animal, jaguar parts, jaguar pictures".
It seems that everyone has glossed over the other issue. It is hard to come up with a good personalization model regardless of scalability. Maybe someone here who is in the know can describe an existing personalization model (within this context) that performs well.
Richard,
It really depends on how you define "hard".
1. This looks like a one-person company. What are the chances that person has the necessary skills to make a kick ass database?
2. Hard in most online applications is rarely an exponential problem. Search/personalization has already been proven to be solvable in linear time.
3. Is it "hard" to allow personalization and still make any kind of profit/revenue?
I think the reason we don't see a ton of personalized news trackers is because there is no money in it. I make that comment because I see no monetization on Memeorandum. If there is money in it, who cares about spending 10 grand and buying a bunch of servers?
Did someone say 'edge case'?
Markus:
You didn't describe a specific personalization method that could easily solve the problem.
It's easy in verticals! Check out http://dappr.com for personalized and user-filterable fashion industry news. Not totally relevant to most of your readers, but it should make a splash on 7th Ave.
It's not going completely live until Monday, but nothing beats being topical :)
Rich Collins,
Allow a user to define a filter containing blogs and subjects they don't care about. Then you can continue to expand on those filters. I get the feeling this is something you have to throw a lot of possibilities out there and then see what works best.
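As a rough illustration of that kind of exclusion filter, something like the following would do for a first pass. The item format and the `MuteFilter` name are assumptions for the example, and matching subjects by substring in the title is only the simplest possible starting point.

```python
from dataclasses import dataclass, field


@dataclass
class MuteFilter:
    """Blogs and subjects a user does not want to see (hypothetical structure)."""

    blogs: set = field(default_factory=set)
    subjects: set = field(default_factory=set)

    def allows(self, item):
        """Return True if the item is from an unmuted blog and mentions no muted subject."""
        if item["blog"] in self.blogs:
            return False
        title = item["title"].lower()
        return not any(subject.lower() in title for subject in self.subjects)


def apply_filter(items, mute_filter):
    """Keep only the items the user's filter allows, preserving order."""
    return [item for item in items if mute_filter.allows(item)]


items = [
    {"blog": "blog-a", "title": "Windows Vista delayed again"},
    {"blog": "blog-b", "title": "Personalized memetrackers compared"},
    {"blog": "blog-c", "title": "Fashion week recap"},
]
mute = MuteFilter(blogs={"blog-c"}, subjects={"vista"})
kept = apply_filter(items, mute)
```

From there the filter can be expanded one rule type at a time to see what actually works.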