ReadWriteWeb

Collaborative Filtering: Lifeblood of The Social Web

Written by Muhammad Saleem / June 30, 2008 2:48 PM / 20 Comments

Collaborative Filtering (Wikipedia definition) is a mechanism used to filter large amounts of information by spreading the process of filtering among a large group of people. Unlike mainstream media, where one editor or a handful of editors set the guidelines, the collaboratively filtered social web can have an effectively unlimited number of editors, and it gets better as the number of participants increases.

There are two basic principles involved in Collaborative Filtering.

1. The Wisdom of Crowds and the Law of Large Numbers suggest that a large (diverse, independent, etc.) community makes better decisions than a handful of editors, and that the larger the community gets, the better its decisions will be. Therefore, we can hypothetically create a collaboratively filtered newspaper, television channel, radio station, etc., which would be better (for the community) than any arbitrarily selected medium. In fact, as we will see, services like Digg, YouTube, and Last.fm are trying to do exactly that: build (CF) based media outlets.

2. The second principle of Collaborative Filtering suggests that in any such large community, with enough data on individual participants and on how the individual participants collaborate or correlate with each other, we can make predictions about what these users will like in the future based on what their tastes have been in the past, i.e. develop a collaboratively filtered recommendation engine. This, of course, relies on the fact that people's interests, preferences, and ideologies don't change too drastically over time.
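As a rough illustration of the second principle, here is a minimal sketch of user-based collaborative filtering in Python. The vote data, the user and story names, and the choice of cosine similarity are all assumptions made for this example; the article does not say which correlation measure any of these sites use.

```python
from math import sqrt

# Hypothetical vote history: user -> {item: +1 (liked) or -1 (disliked)}
votes = {
    "alice": {"story_a": 1, "story_b": 1, "story_c": -1},
    "bob":   {"story_a": 1, "story_b": 1, "story_d": 1},
    "carol": {"story_a": -1, "story_c": 1, "story_d": -1},
}

def similarity(u, v):
    """Cosine similarity between two users' vote vectors."""
    shared = set(votes[u]) & set(votes[v])
    if not shared:
        return 0.0
    dot = sum(votes[u][i] * votes[v][i] for i in shared)
    norm_u = sqrt(sum(x * x for x in votes[u].values()))
    norm_v = sqrt(sum(x * x for x in votes[v].values()))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' votes on an item."""
    num = den = 0.0
    for other in votes:
        if other == user or item not in votes[other]:
            continue
        s = similarity(user, other)
        num += s * votes[other][item]
        den += abs(s)
    return num / den if den else 0.0

# alice has not voted on story_d; a positive value predicts she will like it
print(predict("alice", "story_d"))
```

In this toy data alice tends to agree with bob and disagree with carol, so bob's up vote and carol's down vote on story_d both push the prediction for alice in the positive direction.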

The two aspects of the (CF) system produce two very different and important results.

The first gives you new, interesting, entertaining, and newsworthy information as judged by the community (in a way, this is content that is the average of the interests of the entire community). A good example of this is Digg's front page. Not all the content will be directly relevant to your tastes, and in fact some of it will be completely irrelevant to you. However, as the community grows and becomes more diverse and independent, the average news story promoted to the front page will be of interest to the average community member. Not satisfied with averages? This is where the second aspect comes into play.
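The community-average front page described above can be sketched as a simple vote tally. This is only an illustration: the ballot records and the net-votes ranking rule are assumptions for the example, and sites like Digg do not publish their actual promotion algorithms.

```python
from collections import Counter

# Hypothetical (user, story, vote) records; +1 is an up vote, -1 a down vote.
ballots = [
    ("u1", "story_a", 1), ("u2", "story_a", 1), ("u3", "story_a", -1),
    ("u1", "story_b", 1), ("u2", "story_b", 1), ("u3", "story_b", 1),
    ("u1", "story_c", -1), ("u2", "story_c", 1),
]

def front_page(ballots, size=2):
    """Rank stories by net votes and return the top `size` story ids."""
    net = Counter()
    for _user, story, vote in ballots:
        net[story] += vote
    # Highest net score first; ties broken alphabetically by story id
    ranked = sorted(net.items(), key=lambda kv: (-kv[1], kv[0]))
    return [story for story, _score in ranked[:size]]

print(front_page(ballots))
```

Every visitor gets the same list here, which is exactly the limitation the article goes on to describe: user u3 voted story_a down but still sees it on the front page.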

The second aspect of the (CF) system collects information on what kind of content and commentary you like and dislike, and it builds a profile of you based on your submission and voting habits. This user profile helps the site recommend content that has been submitted by users (or from sources) you generally agree with and find interesting, as well as topics that you usually vote up and tend to comment on. What this means is that by collecting enough information on how you interact with the site and with other users, the (CF) system can recommend content to you. The system finds the content and delivers it to you rather than requiring you to scout for it. Furthermore, the more you use the recommendation system and vote up or down, the better its recommendations become.
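A minimal sketch of this kind of profiling, assuming each vote is recorded along with the story's topic and submitter. The field names, the sample history, and the additive scoring rule are hypothetical and chosen only to keep the example short.

```python
from collections import defaultdict

def build_profile(history):
    """Sum a user's votes per topic and per submitter.

    `history` is a list of (topic, submitter, vote) tuples, vote being +1 or -1.
    """
    profile = {"topic": defaultdict(int), "submitter": defaultdict(int)}
    for topic, submitter, vote in history:
        profile["topic"][topic] += vote
        profile["submitter"][submitter] += vote
    return profile

def score(profile, topic, submitter):
    """Higher means the item looks closer to what the user has voted up."""
    return profile["topic"].get(topic, 0) + profile["submitter"].get(submitter, 0)

# Hypothetical voting history for one user
history = [("tech", "dave", 1), ("tech", "erin", 1), ("politics", "dave", -1)]
profile = build_profile(history)

# Rank unseen (topic, submitter) candidates by how well they match the profile
candidates = [("tech", "erin"), ("politics", "frank"), ("sports", "dave")]
ranked = sorted(candidates, key=lambda c: score(profile, *c), reverse=True)
print(ranked)
```

Each additional vote adds to the totals, which is one simple way to picture why recommendations improve the more you use the system.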

The important thing, one that not many social sites realize, is that a (CF) system that doesn't automatically match content to your preferences is inherently flawed. The reason for this is simple: Unless you can achieve perfect diversity and independence of opinion, one point of view will always dominate another on a particular platform. The dominant point of view on the social web is a left-leaning one, and without the ability to get the most appropriate pieces of content to the people that care most about them, the right-wing point of view gets buried almost every time.

A perfect example of this was the Ron Paul supporters and the ease with which they were able to manipulate the social news sites. Now if you could match the right-wing viewpoint to the right-wingers, and the left-wing viewpoint to the left-wingers, and get both points of view across to people that are interested in healthy debate rather than partisan politics, you're getting closer to the ideal system. A filtering system with preference-based recommendations, in essence, is the future of the social web.

Who is using what system?

The (CF) system is without a doubt the lifeblood of the social web. Even though different platforms apply it to varying extents, the system is still there at the core, and the social web would look more like rush hour in downtown Lahore if the community wasn't actively policing the traffic.

Social News

In the social news space, Digg and Propeller use the system only insofar as the front page is concerned (although Digg is set to release its recommendation engine this week). Once the content is promoted to the front page, the system's job is done. The system works in that it gets rid of spam and unoriginal thought, but it isn't the best because it relies on averages rather than the direct preferences of each participant. While these sites try to catch up and develop recommendation engines of their own, Reddit and StumbleUpon have been ahead of them for a while now by having recommendation engines in place. These two sites also have similar concepts of a community front page (based on the average interests of the average community member), but they enhance your experience and incentivize increased participation by using your history of likes and dislikes to deliver the highest-quality and most relevant content to you. Furthermore, the normalization of Reddit's front page has shown how a one-front-page-for-all approach forces conformity and dilutes the individual experience, whereas normalization ensures that each user controls how content is distributed to him or her.

Ultimately, even though there are some sites with little or no filtering (Slashdot, Fark, etc.), sites that use their (CF) based recommendation engines will continue to diminish the importance of active filtering from upcoming submission queues and improve the quality of user experience on an individual level.

Video Streaming and Sharing

Online video hosting and sharing sites are not much different. Sites like YouTube have multiple filtering mechanisms that often perform the same functions without requiring votes per se. Viewability, for example, is determined by:

1. Number of people currently watching a video

2. Number of comments on a video

3. User ratings and favorites.
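One way to picture how signals like these three could be combined is a weighted sum. The sketch below is purely illustrative: the weights, the field names, and the linear formula are assumptions, and nothing here reflects how YouTube actually ranks videos.

```python
from dataclasses import dataclass

@dataclass
class VideoStats:
    viewers_now: int    # people currently watching
    comments: int       # number of comments
    avg_rating: float   # average user rating, e.g. on a 1-5 scale
    favorites: int      # times marked as a favorite

# Hypothetical weights: explicit signals count for more than raw views
WEIGHTS = {"viewers_now": 1.0, "comments": 2.0, "avg_rating": 10.0, "favorites": 5.0}

def viewability(stats):
    """Weighted sum of the signals listed above."""
    return (WEIGHTS["viewers_now"] * stats.viewers_now
            + WEIGHTS["comments"] * stats.comments
            + WEIGHTS["avg_rating"] * stats.avg_rating
            + WEIGHTS["favorites"] * stats.favorites)

well_liked = VideoStats(viewers_now=100, comments=10, avg_rating=4.0, favorites=3)
just_viewed = VideoStats(viewers_now=150, comments=0, avg_rating=1.0, favorites=0)
print(viewability(well_liked), viewability(just_viewed))
```

With these weights, the video with fewer viewers but better ratings and more favorites scores higher than the one that is merely watched a lot.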

The problem with an impressions-based system (like the one used by the now understandably dead content aggregator Spotplex) is that just because you viewed something or commented on something doesn't mean that it's good. In fact, there are dozens of YouTube videos that I click on, don't like, and then close the window (I see other people writing negative comments in poor English but I doubt that helps either). Some other sites like Break and Funny or Die use a StumbleUpon-like up/down voting system to determine what gets promoted to the front page. Again, while there are options to view similar/related videos and more videos from a user you like, there is no recommendation system using your rating and favoriting habits (and tags you like).

Blogging and Microblogging

For the most part, blogs use a combination of most viewed, most linked, most commented, and highest rated as mechanisms for displaying content that you might like. While this is a better idea than letting people go through trial and error, it doesn't ensure that every visitor will be happy with what they see. For example, two very different posts on two entirely different topics can be the most viewed posts on your blog, and I might like one and not the other. At the same time, one has to wonder, at what point is it economical or time-efficient to start monitoring each individual user?

StumbleUpon solves this problem for the 'big guys' by letting you StumbleThru one site for the content that you might like the most. The feature, however, is not available for all sites yet.

Most microblogging sites, unfortunately, have no filtering system at all. The signal-to-noise ratio debate rages on with respect to Twitter and its ilk. FriendFeed, however, launched a rudimentary recommendation feature that simply displays the top 'liked' and commented links.

Photo Hosting and Sharing

When I was thinking about the concept of (CF) systems, photo-sites like Flickr and Photobucket weren't even on my radar. Part of the reason for this is how most people I know use these sites, i.e., primarily for hosting and sometimes for finding creative commons images for embedding on their sites. I was, however, quite pleasantly surprised to see that Flickr has gone a long way to help people explore and discover excellent photography.

The feature that most people are probably familiar with is Interestingness. The feature is quite robust. It takes into account things like where the referral traffic to the image is coming from, who is commenting on it and when, who marks it as a favorite and how many people like it, among other more nuanced things. But in addition to that, the site also has other great features such as exploring based on geotagging on a map of the world, popular tags, subject-based and quality-based groups, camera finder, and most recent uploads.

The only thing left to add is a 'photos you might like' based on photos you have liked and commented on.

Music Streaming and Discovery

The best implementations of a Collaborative Filtering (CF) system along with a preference based recommendation/discovery system that I have seen are always on music streaming and discovery sites. The implementation on Last.fm for example, is almost perfect in my opinion. First of all, whether you use their online streaming widget or use their desktop software, they monitor every single song you listen to and aggregate that data. They also track how artists jump on and fall off your radar on a week to week basis. They use that data to make specific recommendations and automatically create a radio station for you that plays Last.fm's recommendations for you based on what you like.

While that in itself is more than enough, they don't stop there. They have another radio station for you that plays songs you usually like to listen to, and they show you what the entire Last.fm community is generally listening to, what your friends are listening to, and what your friends are recommending. It is a very robust system for aggregation, filtering, and recommendation. Here's how the recommendation engine works:

As you can see, they look at the musicians I listen to a lot and then recommend people that are either similar in sound or people who were influencers of or influenced by my favorites. These are followed by recommendations from friends and music-based groups on the site.

So, collaboratively filter and recommend or die?

These are only some of the major players that have embraced (CF) and personalized recommendations; Netflix and Amazon come to mind, among others. As you can see from the above, it is certainly possible to have a good collaborative filtering system without a recommendation engine (as seen in Flickr). It is optimal, however, for the users (because their experience is better) and your site (because users will participate more often and generally be happy with your product) if you throw in a recommendation system à la Last.fm, the most robust of the lot by far.

This is a guest post by Muhammad Saleem, a social media consultant and a top-ranked community member on multiple social news sites. You can follow Muhammad on Twitter.


Comments


  1. nice article msaleem - very thorough as well. collaborative filtering not only lets us pick out the best content but easily filters out the spam and inauthentic content.

    Posted by: Adam Singer | June 30, 2008 4:27 PM



  2. Strange. You completely left out Loomia, Baynote, Aggregate Knowledge and basically every B2B company providing recommendations. Collaborative filtering is widely used in the B2B space to socialize ecommerce, media sites, etc... This should be in there. Amazon uses it as well. Pretty big segment.

    Posted by: John Wallace | June 30, 2008 4:55 PM



  3. Nice post :)

    Posted by: fritz | June 30, 2008 5:35 PM



  4. @John Wallace,

    "These are only some of the major players that have embraced (CF) and personalized recommendations - Netflix and Amazon come to mind among others."

    Thanks for your comment though, I would love to read your coverage of the rest. I'm not an expert on B2B.

    Posted by: Muhammad Saleem | June 30, 2008 5:54 PM



  5. Very nice overview of the current state of social media content recommendation. You make an important point when you say that websites like stumbleupon can "incentivize increased participation" by utilizing your voting data to learn what you like.

    I'm currently developing a social media website and have been trying to find the right balance between automated recommendations and maintaining a social/humanistic feel. Stumbleupon does a pretty good job with this. Automated social recommendations as the main content discovery mechanism with side social features, ability to explore user's favorites, etc.. seems like the right strategy.

    It will be interesting to see how the introduction of a Digg recommendation engine will affect digging-rates. I'm guessing that people will be much more inclined to digg stories and we will begin to see greater diversity within top dugg stories due to a huge demographic that now has an incentive to register and start voting.

    Posted by: Gabe Ragland | June 30, 2008 10:24 PM



  6. Muhammad Saleem said...
    ...recommendation engines will continue to diminish the importance of active filtering from upcoming submission queues and improve the quality of user experience on an individual level.

    Muhammad, did you mean to say recommendation engines will continue to diminish the importance of manual filtering, rather than recommendation engines will continue to diminish the importance of active filtering?

    Automated filtering by the collaborative filtering engine is active, that is why it is called realtime filtering.

    Muhammad said...
    ...it is certainly possible to have a good collaborative filtering system without a recommendation engine (as seen in Flickr).

    Collaborative filtering system and recommendation engine are the same thing, ie, a very very minor difference but overall they are the same thing.

    Anyway, good educational article for readers here at RWW.

    Posted by: Falafulu Fisi | July 1, 2008 3:13 AM



  7. It s really great article, thanks

    Posted by: Yuce Zerey | July 1, 2008 4:52 AM



  8. you may want to check polymeme.com which just launched yesterday. it crawls "communities" of blogs -- economics blogs, law blogs, books blogs -- to apply a Techmeme-like mechanism to determine what's important/who's talking about it

    Posted by: Evgeny | July 1, 2008 4:54 AM



  9. Like others have said there are a lot of B2B companies providing recommendations. I have attempted to keep track of most of them on my site at http://www.tomprinty.com/commercial-recommender-systems/


    If you are a rec company please contact me I would like to include you on my list.

    Posted by: Tom Printy | July 1, 2008 6:11 AM



  10. @Falafulu Fisi,

    Active Filtering = when community has to actively participate in the filtering process.

    Passive Filtering = when it is more machine or algorithm based and the community only has to vote on articles recommended by the system.

    The difference between collaborative filtering and recommendations is that the collaborative filtering determines the homepage whereas the recommendations determine what you see individually, and has no impact on what others see on the front page.

    Posted by: Muhammad Saleem | July 1, 2008 6:19 AM



  11. Muhammad,

    The below statement is not entirely accurate:

    "The difference between collaborative filtering and recommendations is that the collaborative filtering determines the homepage whereas the recommendations determine what you see individually, and has no impact on what others see on the front page."

    Any recommender system theoretically can be used to dynamically display "personalized" content at the homepage. CF is just one means to achieve the required businesses objective - either more click throughs, greater time at site, increase add to cart, etc.

    Recommendations can be achieved via CF or other algorithmic approaches. CF is one of many recommendations strategies.

    Also, your distinction between B2B and B2C recommender systems is unclear to me. I don't think these classifications are helpful, but we could argue that there are "B2C destinations" employing recommender systems (however sophisticated) and whose revenue model is primarily advertising. Secondly there are SaaS providers that are B2B2C services. These B2B2C systems provide recommendations as a platform enabling business to learn from their user interactions and then distribute that intelligence across their enterprise, both in front of and behind the firewall.

    In the first camp I place Digg, StumbleUpon, etc. Their goal is to resell users' attention to advertisers. With respect to the B2B2C players (Baynote, Loomia, et alia), a 3rd party provides the recommendation to the end user by way of a web service. In each case, the recommender system observes the behaviors of users (either explicitly submitted data like diggs and star ratings, as with Bazaarvoice or PowerReviews, or implicitly calculated recommendations, as with Baynote and Loomia). In some cases, the consumer sees a branded message in the form of "Powered By..." Arguably, this is a B2C brand message designed to communicate directly to the end users. I think the most important distinction here is how the services make money. In the case of a destination play, the goal is for them to sell advertising in a more targeted way. For the B2B2C players, they can take a % of sales lift, % of ad gain (more pages seen, more CTR), they can attempt ad buying arbitrage by buying up remnant inventory and then reselling it for a gain, or simply charge a fee to the displaying site (traditional enterprise software).

    Also, recommender systems do not end at the home page or even the website for that matter. Recommendations can be extended into "web-enabled environments" such as mobile (Aggregate Knowledge and Baynote both claim to do this) or within email; Baynote also claims to provide this, along with another player / competitor, MyBuys. Recommender systems are also at work w/in kiosk and set top box environments. A more compelling use of these web services would be the horizontal application of these technologies across marketing channels, enabling businesses (both consumer facing and behind the firewall) to extract insights and deploy those content and product recommendations wherever they add value. This is a very different use case than the Diggs and StumbleUpons of the world. Here, recommendations are seen as an important marketing intelligence AND distribution vehicle. After all, what good is a recommendation for YESTERDAY'S customer? Or a great recommendation delivered late? The recommendation + the timing are critical here and distribution becomes an important counterbalance to the content itself.

    Lastly, you left out an important additional data source for recommendations and that is the search query itself. Social Search, or recommendations embedded w/in the search results, employs various algorithmic processes to rerank search results as well as include relevant but non-indexed results (non-indexed due to linguistic variation and other reasons) in order to offer a social vector to enterprise search. The uses for this are manifold, as you can imagine: groupware, team sites, project mgt, basic consumer-facing site search, etc. Moreover, the recommender system, when deployed as a platform, enables the learnings from the community to be sourced, analyzed and deployed across the enterprise and across any distribution or marketing channel. I would categorize this approach toward recommendation as B2B, but I hesitate to do this b/c it implies that the end-consumer is left out of the equation. This is not true. The end-customer is at the heart of the recommender system b/c he/she is who the recommendation is attempting to serve. Both Collarity and Baynote claim to do this. Collarity attempts to help better target advertising; Baynote attempts to fix under- and overinclusion w/in enterprise search results via "community wisdom."

    So all this to say, CF is but one way to achieve a recommendation. Where that "recommendation" is deployed shouldn't necessarily determine what algorithmic science one uses to display that content. Of course, one must be mindful of the environment (e.g. kiosk vs. handset) but conceptually, these are constraints in display and data gathering.

    Hope this helps.

    Posted by: VK Srinivassan | July 1, 2008 10:09 AM



  12. Muhammad Saleem said...
    Active Filtering = when community has to actively participate in the filtering process.

    I call that manual (passive), when humans are involved.

    Muhammad Saleem said...
    Passive Filtering = when it is more machine or algorithm based and the community only has to vote on articles recommended by the system.

    I call that active, since the algorithm is turned on & off. Once the application is turned on, the algorithm is alive/active which it has to do its job in realtime. There is no lapse at all, ie, there is no turning off today/this hour and turning on tomorrow/that hour. It runs continually to do its recommendation in realtime (active).

    Anyway, our differences in interpretation are semantic only.

    Posted by: Falafulu Fisi | July 1, 2008 4:15 PM



  13. Nice article on what is an increasingly expansive subject. It's tough to cover everything in a blog post. Here are a few additional thoughts.

    Weighting must play heavily into a CF system. By this I mean that a view of an item by itself carries less weight than a view coupled with a ranking and/or comment on that item. Any additional effort after a view typically implies a higher degree of interest in the item.

    Isn't tagging another important aspect to CF?

    What about the use of Attention data (APML) in CF for news readers and group blogging sites? NewsGator and others are doing some interesting things in this area.

    And one final point.

    "At the same time, one has to wonder, at what point is it economical or time-efficient to start monitoring each individual user?"

    CF and recommendations play a huge role in the highly-competitive social networking market. Maybe smaller sites can't justify the cost, but you can bet that the big guys see the ROI potential.

    Posted by: Chad Kieffer | July 1, 2008 4:18 PM



  14. Muhammad Saleem said...
    The difference between collaborative filtering and recommendations is that the collaborative filtering determines the homepage whereas the recommendations determine what you see individually, and has no impact on what others see on the front page.

    Muhammad, I think that is why you got it wrong: you think that somehow collaborative filtering and online recommendations are only applied to the web, which is why you brought up the front page.

    Collaborative filtering and automated recommendations existed before the web in the domain of knowledge management. A user could look up some info in a corporate database for certain topics/info, and the system would recommend certain guidelines or other related info that the user might be interested in exploring further. Another example: in the medical field, when data about a new patient (blood pressure, lab tests, temperature, etc...) is entered into a medical decision support system, the recommender system would guide the physician/doctor toward a possible correct diagnosis & appropriate treatment. See, this has got nothing to do with the front page or a browser.

    The internet just happens to bring the awareness about Collaborative filtering and automated recommendations technology to the mass, and Amazon is the most notable vendor that did that.

    I contend that Collaborative filtering and automated recommendations are more or less the same thing. I am not going to debate on the subject but if you want to read further about Collaborative filtering and automated recommendations, then there are lots of peer reviewed research papers here (more than 100 papers on recommender systems) which you can refer to.

    If you see a title that you're interested in from the link shown above, then search the title at Citeseer, because you can download (PDF or PS format) it from there for free, since the references (site for 100 papers) are only showing the publishers website (which isn't free).

    PS : I consult in this area which is one of my domain of interest and I also develop algorithms for automated recommendation engine. The technology is getting better and better as I have noted recent publications from the literatures about new algorithms that are more robust than existing ones.


    Posted by: Falafulu Fisi | July 1, 2008 4:57 PM



  15. Very well put together - this article pretty much encapsulates it all very well.

    From the users perspective, it's extremely rewarding to see increasingly relevant recommendations as you build up a legacy and invest more time into a site - ie. Amazon. It's not only a huge hook to revisit the site, but it makes you dependent on it.

    Posted by: Shammara | July 2, 2008 10:45 AM



  16. Muhammad

    This has produced a lot of interesting comment but I think you get to the nub of a distinction between recommenders.

    A lot of claimed personalization is really "most popular" and all visitors get the same list when viewing an item (e.g. people who viewed this also viewed). Only if different visitors can get different recommendations for the same item is it truly personalization.

    The difficulty is getting personalization from implicit information only and few systems offer this.

    Posted by: Paul | July 2, 2008 2:04 PM



  17. Muhammad,

    Solid round up of how consumer sites are using recommendations to help people discover content.

    One thing both you and Falafulu Fisi (in the comments) do though is equate Collaborative Filtering with Online Recommendations. As VK Srinivassan points out - this is only one technique that one could use to provide recommendations.

    CF is a good tool but you have to be careful to apply the right tool to the right situation.

    At Aggregate Knowledge we actually use a variety of different algorithms types to determine what someone might be interested in. For example taking into account seasonality, recency/frequency and lexical matches to show you things you're most likely to want to see.

    Posted by: Chris Law | July 2, 2008 5:21 PM



  18. Keep in mind there are various algorithms for recommendations and prediction engines. Some are based on collaborative filtering, clustering, supervised learning or some variations. Most of the engines are based on known and well studied methods. The prize-winning NetFlix solution was interesting, as it combined a few different approaches to maximize results. Ultimately what matters is the predictiveness and quality of the recommendations, which the end-user ultimately judges.

    Innovative approaches matter when they improve relevancy. Collarity introduces a totally new method for metric space construction based on natural topic emergence from spontaneous users’ behavior. Embedding users, communities and different types of content in this space provides a natural measure of similarity between those different entities, the Euclidean distance. This distance is used to improve relevancy of recommendations, search results or targeted ads. Collarity’s solution works for the content world, especially well for video where all this active tagging skews things otherwise.

    Posted by: Dan Tzur | July 8, 2008 8:07 PM



  19. Dan Tzur said...
    Keep in mind there are various algorithms for recommendations and prediction engines.

    True, but your comment is redundant. The link I quoted in my previous message, for references to peer review articles on collaborative filtering & recommender systems listed various algorithms in those publications.

    Dan Tzur said...
    Some are based on collaborative filtering, clustering, supervised learning or some variations.

    Redundant comment. Again, see the link from the first paragraph above, where you will find papers for collaborative filtering & recommender systems using clustering (un-supervised learning), neural network (supervised learning) and so forth.

    Dan Tzur said...
    Most of the engines are based on known and well studied methods.

    Redundant comment. We already know that, didn't I quote a link in my first message on this thread about peer review publications? Well studied methods are peer reviewed publications.

    Dan Tzur said...
    The prize-winning NetFlix solution was interesting, as it combined a few different approaches to maximize results.

    Irrelevant. Netflix ran a competition and its aim was to produce an algorithm that would be superior (as they hoped for with the incentive of $1 million) in performance (see my next comment below) compared to existing well studied algorithms. If the competition didn't produce anything with better performance than existing algorithms available in the literature, then it was a pointless exercise, but I believe that the competition produced some algorithms that were shown to be better than anything which is already available in the literature. There shouldn't be any surprise there, since the prize for developing the best algorithm was reported to be $1 million. Even if Netflix's competition produced an algorithm that is superior to anything available today, it is guaranteed that this superiority will not last forever, since researchers from around the world are continually pouring out (in peer reviewed publications) new algorithms that mostly outperform the existing algorithms that are publicly available in the literature. This is what I mean when I say that the comment about Netflix is irrelevant to the subject of collaborative filtering / recommender systems, since Netflix is only one of many vendors which develop/adopt the collaborative filtering & recommender systems technology.

    Dan Tzur said...
    Ultimately what matters is the predictiveness and quality of the recommendations, which the end-user ultimately judges.

    No, wrong. It is not the user that decides the accuracy of the predictiveness of the quality of recommendations. The metric that determines this quality is called classification error. The recommender system with the lowest classification error is the best performance algorithm. Note that speed performance is different from precision classification error. Some algorithms are superior in both, however I believed that Netflix's competition specified that the winning algorithm must be the one with the lowest classification error.

    Posted by: Falafulu Fisi | July 9, 2008 2:08 AM



  20. It s really great article, thanks

    Posted by: promosyon şapka | July 26, 2008 11:55 AM


