ReadWriteWeb

Aggregate Knowledge's Content Discovery - How Good is it, Really?

Written by Richard MacManus / March 19, 2008 9:20 PM / 20 Comments

Aggregate Knowledge, which operates a content discovery network under the brand name Pique, today announced a deal with BusinessWeek to deliver "user-driven content suggestions" on their website. It's the latest in a string of similar deals - Aggregate Knowledge powers "discovery" of both editorial content and product recommendations for over 100 websites, with a particular focus on retail and media. In this post we take a closer look at the implementation at BusinessWeek - and ask if the results come up to scratch.

At last year's Supernova, Aggregate Knowledge CEO Paul Martino referred to his company as the "world's largest implicit social network." The company told ReadWriteWeb today that media sites like BusinessWeek.com, WashingtonPost.com and LATimes.com are using Aggregate Knowledge's Pique Discovery Network "to help users discover new and exciting content on their site." The company has some high-powered backing, including uber VC firm Kleiner Perkins.

How Well Does it Work?

Here's how Aggregate Knowledge describes the system for BusinessWeek.com:

"When a reader clicks on a breaking news story on the site, the Aggregate Knowledge Pique Discovery Window automatically provides user-driven content suggestions in the form of “More from BusinessWeek.” These suggestions are based on what visitors are actually reading across BusinessWeek.com."

I clicked some stories on the BusinessWeek.com homepage, and noticed a "More from BusinessWeek" list of links to the right of each story. However, none of these links seemed very relevant to the story. Check out this example from a story about Apple iTunes:

No Apple or even tech stories are linked to. Here's another example - about Russian police visiting BP offices. Curiously, this one lists an Apple story!

No Actual Content Analysis?

So based on my tests, it doesn't seem like there is much - if any - semantic analysis of the page content in order to come up with the "More from BusinessWeek" links. Reading between the lines of the AK quote above, this discovery system is based on clicks and not content.

It appears as if this is collaborative filtering - i.e. users who clicked X also clicked Y. This is basically the system that Amazon and Netflix use. For Aggregate Knowledge, collaborative filtering is still going to give interesting results. But how is it better than - for example - the 'Related Entries' plugin that we use here on ReadWriteWeb (which is based on tags, and so is much more closely aligned to the content itself)? See the bottom of this post for an example.
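To make the "users who clicked X also clicked Y" idea concrete, here is a minimal sketch of click-based collaborative filtering. It is only an illustration and not Aggregate Knowledge's actual algorithm, which is not public; the story identifiers, the sessions, and the function names are all made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_co_click_counts(sessions):
    """Count how often each pair of stories was clicked in the same session."""
    co_counts = defaultdict(lambda: defaultdict(int))
    for clicked in sessions:
        # De-duplicate within a session so repeat clicks count once.
        for a, b in combinations(sorted(set(clicked)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts


def also_clicked(co_counts, story, top_n=3):
    """Return the stories most often co-clicked with `story`."""
    neighbours = co_counts.get(story, {})
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


# Hypothetical click sessions, one list of story ids per visitor.
sessions = [
    ["apple-itunes", "bp-russia", "mba-society"],
    ["apple-itunes", "bp-russia"],
    ["bp-russia", "mba-society"],
    ["apple-itunes", "iphone-sdk"],
]

co_counts = build_co_click_counts(sessions)
recommendations = also_clicked(co_counts, "apple-itunes")
```

Note that nothing in this sketch looks at what the stories are about. In this toy data the top suggestion for the iTunes story is the BP story, simply because the two were clicked together most often.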

Surely for media sites a content discovery system that analyzes the content of a page, such as Reuters Open Calais does, would give better results. Please let us know your opinion in the comments.
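For contrast, a content-based approach can be sketched with nothing more than tag overlap. The snippet below ranks stories by Jaccard similarity of their tag sets. It is an assumption-laden illustration, not the code behind our 'Related Entries' plugin or Open Calais; the tags, story ids, and function names are hypothetical.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity of two tag collections: |A & B| / |A | B|."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_by_tags(current, tags_by_story, top_n=3):
    """Rank other stories by tag overlap with `current`, dropping zero overlap."""
    current_tags = tags_by_story[current]
    scored = [
        (jaccard(current_tags, tags), title)
        for title, tags in tags_by_story.items()
        if title != current
    ]
    scored = [(score, title) for score, title in scored if score > 0]
    # Highest similarity first; ties broken alphabetically.
    scored.sort(key=lambda st: (-st[0], st[1]))
    return [title for _, title in scored[:top_n]]


# Hypothetical tags assigned by editors.
tags_by_story = {
    "apple-itunes": {"apple", "music", "tech"},
    "iphone-sdk": {"apple", "mobile", "tech"},
    "bp-russia": {"energy", "russia"},
    "mba-society": {"education", "management"},
}

related = related_by_tags("apple-itunes", tags_by_story)
```

Here the only suggestion for the iTunes story is the other Apple story, since it alone shares tags with it. The trade-off is that someone (or something) has to assign the tags in the first place.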

Comments


  1. Interesting. How to actually use the service? Widgets? :-)

    Posted by: 113.com | March 19, 2008 9:31 PM



  2. Richard, I asked myself the same question this morning, i.e. how well does it work? With 100+ sites in the Pique Network, let's hope one is a reader of RWW...

    Posted by: Kevin McDonald | March 19, 2008 10:20 PM



  3. The problem Aggregate Knowledge is trying to solve has two very difficult components: 1) implicitly inferring user intent from actions (clicks), as opposed to explicitly from the entry of additional information in the form of either preferences or keywords, and then 2) mapping this inferred intent onto a set of documents and calculating the relevancies in order to produce recommendations. Furthermore, they must do all this in real-time. Collaborative filtering would make the task easier and faster, with a possible drop in accuracy, but it's still non-trivial.

    As such, I would suggest that we don't put the bar too high. There will, naturally, be a lot of misses, much like there are with all information retrieval systems, especially those employing implicit personalization.

    Nevertheless, implicit personalization is an extremely significant element of the future of IR. There is simply too much information on the internet, news or otherwise, to access it all using only keywords. IR systems must have additional information regarding users' intent, which can come in many forms, in order to successfully find relevant information. While doing all that in real-time might be the most challenging, it holds the most promise for dramatically improving the user experience.

    Posted by: Mark Cramer | March 19, 2008 10:36 PM



  4. On the other hand, Sphere is doing a pretty good job, though it does poorly if articles are written in a language other than English.

    Sphere
    http://www.sphere.com/

    Posted by: Alan | March 20, 2008 6:44 AM



  5. There are a few important points to make.

    (1) The AK system optimizes for click thru rate and total page views. The choice of algorithm (personalization, collaborative filtering, Bayesian inference, a mix of the above, or something else) is determined by a yield management calculation. Some algorithms involve lexical analysis and some have none whatsoever.

    (2) Different and serendipitous results happen as a result of this optimization. For example, many results might end up going into a new category of stories that are not from the category of story being looked at. For example, people will flow to Sports stories after reading Entertainment stories on a newspaper site. The objective is to continue to allow the user to discover new and fresh content. Much of that content will be from different sections of the site.

    (3) Doing a pure lexical match leads generally to only more stories that are “much of the same”. Sometimes a system will converge to that answer but rarely. There is simply not enough serendipity in such a data set. Further that is the kind of data set that the editor him/herself could produce. The point of having an automated tool is specifically to get to relations that are not obvious to the editor. What AK is actually doing at the end of the day is turning the audience into implicit editors that help others find the interesting content.

    I appreciate your interest in what we are doing.

    Paul Martino
    CEO
    Aggregate Knowledge

    Posted by: Paul Martino | March 20, 2008 8:45 AM



  6. It appears as if the CEO of AK has Discovered himself in a fit of Pique!

    His explanation of approach, while mildly interesting, fails to address the central thesis of the article. In other words, what is the discrete value of the AK-driven recommendation? It would appear to me that articulating how ir-relevance was achieved or the "non-trivial" nature of achieving said "irrelevance" dodges the author's inquiry. The real customer / market question still stands: please explain how *these* results were useful to the readers of those specific BusinessWeek articles?

    Wallowing in academic minutiae and spinning distinctions w/out difference between collaborative filtering and Bayesian or non-Bayesian approaches seems to me a redirection from BusinessWeek's and/or the community's fair inquiry.

    Let us summarize the author's implicit question thusly:

    1. Were the returned AK recommendations in fact relevant?

    AND

    2. Did those AK-driven content recommendations lead to CPM or CTR gains above and beyond what editors and/or Most-viewed, Most-emailed, etc. could provide in a way that pays for AK's take -- its management overhead, opportunity costs from not running AdSense, DoubleClick display ads, etc.?

    Posted by: Bob | March 20, 2008 12:06 PM



  7. When the business requirement is, as Paul said in #5, to afford the user new and fresh content, then collaborative filtering is definitely better.

    In that sense AK's suggestions to the articles selected by Richard are not curious at all, they do precisely what is advertised-- link to-- "More from this site". (It is what others are looking at right now.)

    Obviously others, including Richard, clicked on the Apple article, so it is not curious, to me at least, that the Apple article appeared in the list when viewing the second article.

    Given the diverging interests and almost voyeuristic nature of people, collaborative-filtering is certainly not absurd. It is better because it did not have to resort to tags or any sort of semantic or lexical analysis at all. However negligible, even tags take personal or editorial time to assign and structure for ease of use. Why do it when the business requirement does not need it?

    On the other-hand, content-filtering using tags or semantic analysis that delivers links that are related to the article can be useful and appreciated by users, as Richard suggested.

    I think whether collaborative or content filtering is used depends very much on the business requirement and the objectives of the owners of the web property. Surely there is a place for both.

    I also think there is a way to merge these divergent filtering methods that raises the possibility of giving the user the best of both. Suggestions could include both articles that are popular and related to the one at hand as well as those that are perhaps less related yet highly trafficked at one or more related web properties.

    -Ken Ewell

    Posted by: Ken Ewell | March 20, 2008 12:41 PM



  8. Ken,

    I defer to your expertise on the subject of semantic analysis and semiotics. And I am impressed by the intellectual firepower on this thread.

    "On the other-hand, content-filtering using tags or semantic analysis that delivers links that are related to the article can be useful and appreciated by users..."

    Agreed. I don't believe that point is in dispute. The initial query did not attack AK's approach or methodology but rather its own execution w/ BW. And taking that a step further, did those recommendations provide any measurable, demonstrable value to BW?

    Thus far, the simple query re: value remains, unfortunately, unanswered.

    Bob

    Posted by: Bob | March 20, 2008 1:12 PM



  9. I think AK's business model speaks to whether the customer receives any value from the suggestions. The way they are adding customers, it seems it does.

    The flaw in Richard's article is that he made the erroneous observation that it was about relevance. It was not. It was about traffic. AK said their suggestions were based on "what visitors are actually reading across BusinessWeek.com". That does not imply that those articles, whatever they are, should be related.

    Even the title "More from businessweek.com" does not imply that the suggestions are related. Compare that with RWW's title of "Related Entries" where expectations are more specifically set.

    -Ken

    Posted by: Ken Ewell | March 20, 2008 1:52 PM



  10. Thanks all for your comments, it is a fascinating thread with some deep questions.

    Like Bob, I am not sure that the core questions are being answered. I accept Paul's observation that this system produces more "serendipitous results" - and apps like StumbleUpon have proven that there is value in such a system on the Web. However, it all seems a bit random to me...

    Ken said: "The flaw in Richard's article is that he made the erroneous observation that it was about relevance."

    But surely that's the whole point of content discovery? To find other, relevant, articles? Paul himself touched on this when he said: "The point of having an automated tool is specifically to get to relations that are not obvious to the editor." Relations implies relevancy, at least a little bit anyway. Otherwise it is just pure randomness.

    I think it would help if AK told us a little more about how the algorithm works -- there's a fine line between randomness and serendipity and right now it's hard to see that line.

    Posted by: Richard MacManus | March 20, 2008 5:24 PM



  11. Agreed. AK has refused to address the simplest of questions, which is: please lead us from "Recommendations To Revenue" for BusinessWeek.

    Let's redirect the inquiry a smidge.

    Ken -

    Your background, from what we can gather, is in lexical analysis, semantics, NLP, and the like. We find it curious, however, that you would jump in to this specific thread to defend a "serendipity" or "discovery" approach that, according to the CEO's position, disavows the need for semantic / lexical analysis. Paul's comments on stage, in press releases, events, etc. vociferously and aggressively denounce the need and use of lexical search. What gives?

    1. Disclosure requested: what is your relationship to Aggregate Knowledge?

    2. A comment on AK's co-opting of "Serendipity" and "Discovery"

    To those suffering from ADD, everything he/she "Discovers" has apparent "serendipitous" relevance. It would appear that Paul Martino wants us to suspend disbelief and retroactively imbue relevancy on to what is actually randomness -- or AK's flavor of randomness, anyway. Explaining how irrelevancy was arrived at is in itself irrelevant. Which perhaps explains the tediousness of this thread... but we digress.

    What would definitively put this issue to rest would be for AK to come forward and explain HOW these particular content recommendations were derived, and THEN how BusinessWeek readers found value from those recommendations. Secondly, and perhaps more importantly, AK should demonstrate how THEIR recs w/in their "Discovery Window" fare against, say, editors placing their own content recommendations within that same window. Do they have data on which links received more clicks: theirs, random ones, or those of BW's editors? And of course AK would have to control for position of the links, ordering, etc. If they could draw a line from recommendation to revenue enhancement - above and beyond the cost associated with AK's engagement, managing said "network" etc. - then we'd all be very impressed indeed.

    Until such time, AK's commentary remains simply noise.

    Bob

    Posted by: Bob | March 21, 2008 9:59 AM



  12. I think you are being too hard on AK and Paul Martino, Bob. You are right, though: I should not want to defend collaborative filtering, seeing as I am all about text analysis. And I am not speaking for Paul Martino.

    I believe I said that there was room for both methods depending upon business requirements. I do not see myself defending AK's methods or Paul Martino's lexical ambivalence. I rather feel that I am clarifying the key facts for this community of readers.

    You next raised the point of disclosure and I can say that there is a business relationship.

    My development team has been formalizing a foundation, algorithms and a software API that parses text for semantic entities and features and also maps them using (customer-determined) classifiers and topics (actuators). Because Readware methods are conceptual and indexical rather than lexically-bound, we can mine behavioral, sentimental, interpersonal and other social features and relationships as readily as catalog copy.

    Paul Martino found that interesting and we reached an agreement for licensing technology and working together. We are working with the scientific and research side of AK, where we are exploring whether such analysis is useful for their customers, i.e., whether there is a direct line from conceptual analysis to increased revenue. That is what is driving our research together. One way a company serves its customers is by conducting advanced research for better products. I can also confirm we have not yet built any products.

    This is what drew me to this article. I knew AK did not have any content filtering or analysis in their Pique product and when I scanned Richard's introduction, my interest was aroused.

    I commented because I felt Richard's observation that it was content discovery was confusing. I also did not think it was fair to AK to compare the AK links to the RWW links in that way. That, as everyone reading can sense by now, is like comparing apples with oranges-- which is better?

    With that said, I still think you are being too hard. There are these pesky little facts that are being glossed over or overlooked entirely.

    A: AK never claimed to be doing "content analysis" or "content discovery".

    You and Richard have repeated how Paul Martino goes everywhere saying there is no lexical component and no need for it in traffic analysis. Traffic analysis is not content analysis. I remember the AK one liner used to say: the message is in the traffic --or something like that.

    If you go to AK's web site, you find this introduction:

    The Pique Discovery™ Network by Aggregate Knowledge™ is the internet’s first discovery network. The Pique Network delivers highly targeted products and content based on what is actually being purchased and viewed on the web in real time.

    Notice it does not read: the Pique "Content Discovery" Network.

    In the first sentence of this article, Richard said:

    "Aggregate Knowledge, which operates a content discovery network under the brand name Pique, ..."

    So I pointed out this flaw in Richard's observation, in #9. Here I will make it more clear. It is not content discovery AK is doing, it is the discovery of network traffic patterns.

    B: AK stated the results were obtained - from "what visitors are actually reading across BusinessWeek.com".

    In Paul Martino's response #5, in the first sentence of the first point, he elaborated:

    "The AK system optimizes for click thru rate and total page views."

    That is not some sort of dodge. It is not content discovery; it is traffic (page views and click-through) pattern discovery. And that is why, Richard, the links have no obvious or apparent relation to displayed content.

    The answer is they are not supposed to have any relation to the content. They are links to other items on the site. They could have serendipitous value to the reader. I can imagine, and I am only imagining here, as I do not know AK's decision formula, that there are ads on the linked pages that have a high click-through. That is a way to maximize revenue: show the best ads to the most people.

    What remains unanswered is the question of Recommendation to Revenue.

    I can add, from what little I know about AK's business, that AK makes a unique business proposition to their customers. Customers only pay when AK shows increased revenue over what is already in place. They operate in parallel until the proof is in place. I do not know how the value accrues. As far as I am aware, the business depends on showing a direct and measurable increase in revenue.

    I don't know if that satisfactorily addresses the question of value for you Bob. It should be enough for this community of readers. Surely you are not expecting AK to open their books here?

    Let me add one last comment about the serendipity vs. randomness issue that Richard raised, which is a very interesting one in itself. Bob also wondered about the serendipitous value of the "Discovery" in #11.

    I was interested enough in the link "Society Counts on its MBAs" that I Googled it to read it. I enjoyed it, as it echoed my sentiments about the inept business managers that run government research and development programs. So in my case, the answer is that at least one of the links was a serendipitous discovery for me, that is to say it was fortunate and accidental.

    That is what I suppose that Paul is saying, the value to the reader, if any, is serendipitous. The value, and relevance to the business property is that it helps optimize the uptake among multiple streams of revenue competing for space and show time.

    I try to be careful, Richard, not to make judgments about relevance because sometimes the relevance is hard to grasp. For example, what might not be obvious to an editor is that columnist X is drawing 15% more visitors than any other columnist on Thursday through Sunday. The click-through on the ladies' teddies ad is 14.4% greater in the evening than in the morning.

    Of course the trick is always to try and extract the relevant signal or sign from all the noise. Identifying those relevant actuators and effectively employing them in business processes and decisions is far more important than the type of analysis done in the first place.

    This has been engaging. Thanks.

    -Ken Ewell

    Posted by: Ken Ewell | March 21, 2008 2:21 PM



  13. bob,

    you have asked everyone who they are and how they are related. so who are you and what relationship do you have to any of the players here?

    Posted by: robert smith | March 21, 2008 9:00 PM



  14. Hey Bob, What about Bob?

    AK's BusinessWeek value comes from aggregating the knowledge over time, so if they just did a business deal then the benefits might not be there in the short term. Richard, that is why you might not have gotten the "aha" when looking at the BusinessWeek site.

    Posted by: What about Bob | March 22, 2008 9:39 PM



  15. Great post and thread here. The Aggregate Knowledge model is wisdom of crowds at its purest: what does the overall crowd look at? The BusinessWeek example highlights that the crowd may not be a really good indicator of relevance, more of general interest.

    Ken Ewell's post defends AK against the notion that they're trying to provide relevance. But he includes this statement from their site:

    "highly targeted products and content based on what is actually being purchased and viewed on the web in real time"

    Maybe it's just me, but I hear 'targeted', and I somehow think 'relevant'.

    If Aggregate Knowledge could apply some sort of sub-group filtering, their model would be better. Not just the wisdom of crowds. But the wisdom of men ages 18-34 working in the legal industry crowd. Then the content discovery would be much more targeted.

    They need to narrow the pipe of implicit activity used for the content discovery. More thoughts on the subject here: http://tinyurl.com/ysg6sx

    Posted by: bhc3 | March 25, 2008 7:11 AM



  16. This entire market and discussion is circa '98-'99, when all of the blind content classification companies were in full swing. Blind content classification is completely implicit in that it uses no a priori information, and thus yields accurate perceptual event classification of content at a given instance or session. Sadly, due to the bubble, many companies went belly up or were acquired. Many of the same companies look to be treading back over the same territory, yet collaborative filtering can be gamed and only yields results as good as a mediocre crowd. No long tail outliers or truly interesting nuggets. Blind content classification can be tailored to the individual and then deployed across a larger domain footprint, which by definition can yield accurate operations for B2C, not the usual B2B.

    I have heard that these folks are heading in that direction (given the dual nature of the name of the company):

    www.beliefnetworks.net

    Posted by: theodore | March 26, 2008 9:33 AM



  17. Great thread here, and in my opinion also related to the mother of all questions around marketing campaigns: what is the return for me? We all know that ROI analysis that is not linked to sales is always subject to discussion, as it is difficult to prove.

    And that's exactly what we have here.

    AK, and Netmining in Europe, are showing clearly that you can make a difference when you use behavioral engines to increase online sales or detect web leads to drive off-line sales. The ROI model is clear and straight forward.

    But do we have the same clear ROI model when it comes to delivering the right content to the right people?


    Filip Lauweres
    COO NETMINING
    http://www.netmining.com

    Posted by: Filip Lauweres | April 3, 2008 3:55 AM



  18. The problems with social filtering are well noted here and have been known before - for example, Walmart had a case where "Planet of the Apes" was recommended next to a Martin Luther King biography.

    I think very detailed content analysis is the only solid basis for both finding similar content and analysing an individual's interests. During the last seven years we've built a system that is based on a 65k category ontology, which links any topics in the text with first very specific, and then more and more general context. A full picture of each topic related to a content item and its relevance is then stored.

    This means finding the most closely related content even if none of the same words are used in the content. The content profiles are also used in analysing user clicks to build up a real-time profile of user interest for automatic personalisation.

    IMHO social filtering is a good addition to content and user interest analysis, but it's a very limited basis for a recommendation system, which is clear from your examples.

    Petrus
    http://www.leiki.com

    Posted by: Petrus Pennanen | April 11, 2008 8:13 AM



  19. Petrus, I do agree that deep content analysis can be effective. However, I feel like content analysis can leave out important information. By purely comparing content, you will often miss out on content that might not appear to be related, but users have deemed it so.

    I feel that the only definitive similarity of content can truly be determined by observing user activity. By tracking user patterns, we can use successful past users to guide new visitors to content. Essentially, this simulates communication from users that found and consumed content to those looking for similar content. Baynote is one solution that takes this approach. Perhaps there are others?

    Posted by: greydout | April 12, 2008 12:23 AM



  20. I agree with several previous posts that in this context the discussion should be focused on business value. But the engineer in me had to point out that there is a recurring confusion of terms.

    Collaborative filtering, content analysis, classification and so on are techniques. Techniques can be used by themselves or combined to achieve personalization, i.e. recommendations.

    Bayesian inference, neural networks, genetic algorithms and so on are generally referred to as algorithms (though they are not algorithms in the pure definition of the term. Computational model is a better term. But that is a different discussion). Algorithms can be used by themselves or combined to implement one or several techniques.

    Posted by: Henrik S | April 14, 2008 11:08 PM


