We have recently written here about the
ongoing transformation of Web Sites into Web Services. In that post we noted that
with the rise of APIs, scraping technologies and RSS, web sites are really turning into
data services and collectively the web is becoming one gargantuan database. As such, the
web is quickly becoming a platform or foundation that powers new kinds of applications
that remix information in ways not possible before. The web is also becoming much more
connected, not just on the level of links - but at a much more fundamental, semantic
level.
The big picture is always exciting and important, but the mechanics matter too. How exactly do we unlock and correlate information from separate web sites? Ideally, we'd like all web sites to offer simple and elegant APIs - like Amazon, del.icio.us and Flickr do today. Alas, this is not feasible today, and it isn't clear that it can be done quickly on a Web scale. In the meantime, solutions like Dapper, which help you process unstructured information from HTML, clean it, transform it and re-emit it as structured XML, are worth serious consideration. So in this post we take a close look at all aspects of Dapper: how it works, what can be done with it, the company's business model and the legal implications of this service.
Dapper offers a way of turning any Web Site into a special kind of Web Service - a Data Service. The difference between a general Web Service and a Data Service is that the latter offers passive, read-only access to information. The former (general web service) may also offer ways to manipulate and change the underlying information. Nevertheless, Data Services are powerful because they unleash information that otherwise would not be accessible. Here's an illustration of this in Dapper:

The idea behind Dapper is to create an automatic, visual way of extracting information from HTML pages. It works by taking a few sample pages as input and then letting users visually specify the information that should be extracted. Each page is treated like a record in a database. For example, consider the transformation of a movie page from IMDB:

Dapper runs a quick similarity analysis between sample pages. Even though the analysis is very quick, there is a non-trivial tree-matching algorithm - fine-tuned for HTML - that powers this aspect of Dapper. After analyzing the pages, Dapper presents the user with a highlighter tool for selecting attributes of a record. For example, below you can see how to select a title, highlight a row with title and year, and then chop off pieces using parentheses.
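In code terms, the title-and-year selection might look roughly like the sketch below. It is only an illustration of the idea of turning a page into a record: the sample markup, the class names and the `FieldExtractor` helper are all assumptions made up for this example, and a hand-written selector stands in for Dapper's own tree-matching algorithm, which is not public.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup; real IMDB pages are laid out differently.
SAMPLE_PAGE = """
<html><body>
  <h1 class="title">Babel (2006)</h1>
  <div class="director">Alejandro González Iñárritu</div>
</body></html>
"""


class FieldExtractor(HTMLParser):
    """Collects the text of elements whose class attribute names a field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None  # field currently being captured, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.fields:
            self.current = css_class
            self.record[css_class] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.record[self.current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture, so nested
        # markup inside a field is not handled by this sketch.
        self.current = None


def split_title_and_year(row):
    """Chop 'Title (YYYY)' into its two parts using the parentheses."""
    match = re.fullmatch(r"\s*(.+?)\s*\((\d{4})\)\s*", row)
    if match is None:
        return row.strip(), None
    return match.group(1), int(match.group(2))


def extract_movie(html):
    """Turn one page into one record, like a row in a database table."""
    parser = FieldExtractor(["title", "director"])
    parser.feed(html)
    title, year = split_title_and_year(parser.record.get("title", ""))
    director = parser.record.get("director", "").strip()
    return {"title": title, "year": year, "director": director}


record = extract_movie(SAMPLE_PAGE)
```

Feeding a second page with the same layout through `extract_movie` would yield a second record with the same attributes, which is the sense in which each page is treated like a database row.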

The Dapper team has worked hard to make text selection easy, but the interface is still somewhat confusing. In particular, the top controls that allow refinement of a selection need more work. Right now these controls let the user tune the similarity matching algorithm. Since the user only has a vague idea of what this is, the control is not terribly useful. In any case, presenting this control as a pulldown with text - instead of a heatmap - would probably be clearer. The other controls are also unclear, and since there are no instructions, the only way to figure them out is by trial and error.
Still, a technical person can use Dapper fairly efficiently. Once you isolate the information that you want captured by a single attribute, you can name the attribute and move on to the next one. When you are done, the next step is to review and group the content (if you wish). You can then save this application and start using it in a variety of ways.

So how can this be used? The first use is straightforward - you can use a "Dapp" to process a different URL. For example: if instead of Babel, you pass the IMDB link to Departed, you will get back the information for that movie instead. So this Dapp can be used to turn any IMDB page into a movie record.
You can also output results into many other formats. Among them you can get results in RSS, Email and HTML output - which to me do not seem as useful for a single record, but become much more interesting when you are looking at a set of records. For example, using the above Dapp and a bit of PHP, you can build an application that generates a formatted RSS feed of new movies shown on the IMDB home page. In addition to the movie title, the feed would include information about release year, director, stars and keywords.
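As a rough sketch of that last step, the snippet below turns a list of movie records into an RSS 2.0 feed. It uses Python rather than PHP, and the record fields, the example.com URLs and the keyword values are placeholders invented for illustration; it does not call Dapper itself and assumes the records have already been extracted.

```python
import xml.etree.ElementTree as ET

# Hypothetical records, shaped like the output of a movie Dapp.
SAMPLE_MOVIES = [
    {
        "title": "Babel",
        "year": 2006,
        "director": "Alejandro González Iñárritu",
        "stars": ["Brad Pitt", "Cate Blanchett"],
        "keywords": ["placeholder-keyword"],
        "url": "https://example.com/movies/babel",
    },
    {
        "title": "The Departed",
        "year": 2006,
        "director": "Martin Scorsese",
        "stars": ["Leonardo DiCaprio", "Matt Damon"],
        "keywords": ["placeholder-keyword"],
        "url": "https://example.com/movies/the-departed",
    },
]


def movies_to_rss(movies, feed_title, feed_link):
    """Build an RSS 2.0 document with one <item> per movie record."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = (
        "Movie records extracted from HTML pages"
    )
    for movie in movies:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"{movie['title']} ({movie['year']})"
        ET.SubElement(item, "link").text = movie["url"]
        details = [
            f"Director: {movie['director']}",
            f"Stars: {', '.join(movie['stars'])}",
            f"Keywords: {', '.join(movie['keywords'])}",
        ]
        ET.SubElement(item, "description").text = "; ".join(details)
    return ET.tostring(rss, encoding="unicode")


feed = movies_to_rss(SAMPLE_MOVIES, "New movies", "https://example.com/movies")
```

Building the feed through an XML library rather than string concatenation means titles or names containing characters such as `&` are escaped automatically.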

You can also imagine applications that combine different Dapps. For example, movie information from IMDB can be combined with movie information from Netflix to deliver extended information about a film. Going back to our discussion of the Web as a Database, this is essentially like doing a join between two tables.
The problem that these applications will face is identity. How can you know that two movies - one at IMDB.com and another one at Netflix - are actually the same movie? There are various ways of determining this, but all boil down to establishing an identifier for a movie that is different from the URL. For example, the combination of a title and director would be a good candidate for such a unique identifier.
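To make the join and the identity problem concrete, here is a minimal sketch that matches records from two sources on a normalized (title, director) key. The field names, the `availability` attribute and the sample values are assumptions for illustration only, and the key is a heuristic: it would fail, for instance, if the same director made two films with the same title.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so small differences match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def identity_key(movie):
    """Site-independent identifier: normalized title plus normalized director."""
    return (normalize(movie["title"]), normalize(movie["director"]))


def join_movies(left, right):
    """Inner join of two record lists on the identity key.

    Fields from `left` win when both sides define the same field.
    """
    index = {identity_key(movie): movie for movie in right}
    joined = []
    for movie in left:
        match = index.get(identity_key(movie))
        if match is not None:
            joined.append({**match, **movie})
    return joined


# Hypothetical records from two different sites.
imdb_records = [
    {"title": "The Departed", "director": "Martin Scorsese", "year": 2006},
    {"title": "Babel", "director": "Alejandro Gonzalez Inarritu", "year": 2006},
]
netflix_records = [
    {"title": "the departed ", "director": "Martin  Scorsese", "availability": "DVD"},
]

combined = join_movies(imdb_records, netflix_records)
```

Only The Departed appears in both lists, so `combined` holds a single merged record carrying the year from the first source and the availability from the second.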
So in a nutshell, once the information is extracted, it can be remixed and presented in many new ways. Freed from HTML presentation, raw information from a web page is basically the same as a database record. And we know how powerful relational databases are - for the past twenty years they have been the backbone of enterprise IT.
Clearly what Dapper is powering is interesting and useful technically. But the business and legal questions are pressing. Is this monetizable? And more importantly: is this even legal? Content scraping is a shady area. Some people claim that it is flat-out illegal. Others say that it is fine, because the content is out there anyway. My take is that it all depends on how the content is used. If the content is scraped and then reused without attribution to the original content provider, that is a straight copyright violation. If on the other hand, the attribution is preserved and the content is remixed in creative ways that still drive traffic to the original source - then it is probably fine. In any case, this is an area without much legal infrastructure - so all players need to be careful.

Now Dapper's approach to the problem is entirely different - the company is attempting to both monetize and legalize scraping by acting as a marketplace that connects content owners with companies that want to remix the content. This is both an ambitious and a clever play that might just work. The owners of the content often do not have the technical resources and business channels to sell their content. They are not against it in principle, they just do not have the means to do it. On the other hand, the companies that want to leverage existing content are wary of scraping - it just seems like the wrong way of doing it. No one would question calling an Amazon API, but parsing the data out of HTML just does not sound clean.
So Dapper's answer seems to be spot on - connect the content owners with content consumers. In the process, establish rules for content distribution, track how it's used and help content owners monetize the content. And yes, of course - as with any good pipe - take a cut of each transaction. So while technical purists would argue that the whole notion of scraping is a hack, business people and pragmatists would recognize that Dapper's approach has all the ingredients that might just make it a successful solution to a real problem.
Will Dapper succeed? It is not obvious, and it is perhaps too early to say. There are a few things playing against it. The first is ease of use, which the company is rapidly improving. This is something that they control directly and should be able to fix. The second problem is competition. Yahoo! Pipes, Teqlo and Kapow are close enough to be a threat and to cause confusion in the market. But beyond that, is what Dapper is trying to do a good idea? It seems to me that the answer is a resounding yes.
Clearly Dapper is not an ideal scenario for exposing the world's information. But it is a top-down, unintrusive and perhaps the fastest way of turning any web site into a data service. As such, its power and potential exceed its drawbacks. We will see what happens, and in the meantime, let us know what you think about the technical, business and legal aspects of this fascinating company.
Comments
Alex, thanks for the excellent review of Dapper!
As noted, we're always working to improve Dapper's ease of use and would greatly appreciate any feedback RWW readers have.
A couple notes:
1) Dapper can be used to create "web services" as opposed to "data services." For instance, you can create a Dapp that sends a message to a friend on a social network. To do so, go to the relevant site in Dapper, go to the feature you're interested in, and indicate that the input is a variable. This will allow your Dapp to take input and will mimic whatever it is you do on the website.
2) Dapper can be, and is often, used in conjunction with tools like Yahoo! Pipes. Pipes is a great service and combined with Dapper is exceptionally powerful.
If anyone here goes off and tries Dapper, please send us your feedback (positive or negative) - it's very valuable to us.
Posted by: Jon Aizen | May 2, 2007 1:07 AM
Thanks Jon, I removed the link to the Dapp.
Good point about combo with pipes, if you emit RSS from Dapper you can send it into pipes. A question about this, is there a way to create an RSS feed of changes in a number of pages?
Alex
Posted by: Alex Iskold | May 2, 2007 5:49 AM
Jon, regarding web service vs. data service, I did not put it quite right. I meant to say that there is no way to use the Dapper to change the underlying data.
Alex
Posted by: Alex Iskold | May 2, 2007 5:53 AM
Alex, good overview. Dapper, Teqlo, Kapow, Pipes and now Astonia from Microsoft indicate pretty serious interest in these new kinds of web services. I'd also add to this general category DabbleDB, Coghead, us at SnapLogic, and to some degree even Salesforce.com's AppExchange. I know of several other unannounced projects working on some form of this problem as well.
One thing about these systems that is a little confusing to me is the target consumer. End user? Developer? Some new hybrid?
And where do these solutions fit in the overall solution stack? Physical, network, data, presentation, or application layer?
Call me old-fashioned, but whenever something traverses too many of these I'm skeptical the approach is going to really gain widespread adoption. Give me the best of each and I'll put them together.
As for screen scraping, I totally agree. There might have been a business in screen scrapers back in the 3270 days, but that was because you couldn't touch the mainframe. What company that's serious about monetizing their data couldn't provide an RSS feed?
Posted by: Chris Marino | May 2, 2007 6:42 AM
Chris, you are raising a few good points. My take:
- All these are developer tools. There is a slight chance business folks would use this.
- I do not view RSS as a structured format in its raw form. It can be used for passing structured data, but by default it's not. Dapper emits structured data.
Alex
Posted by: Alex Iskold | May 2, 2007 7:01 AM
Excellent review, quite educational as well.
In developing our service, Second Brain, we have experienced some of the problems and weaknesses of existing APIs. Basically, they don't work as well as promised and support is lacking, except for a few notable exceptions already mentioned in the article.
But, I truly believe in the internet as a big computer. The problem facing users now is services and social networking overload - and we want to help them by bringing in content from all the services they use in a simple library - kind of like a windows explorer for the internet. But there are several challenges left to make this work both on the user end and on the service end - I think Dapper has an important role in solving this.
Posted by: Lars Teigen | May 2, 2007 7:06 AM
As far as Chris' question: "is the target consumer End user? Developer? Some new hybrid?"
One of the concepts I heard talked about recently at ETech or Web 2.0 Expo (they tend to blend together) is that, in the Web 2.0 era, consumers are now becoming developers. So, the notion of a "hybrid" target market here seems a fitting point.
Posted by: Graeme Thickins | May 2, 2007 8:22 AM
The concept of UI-driven manual tagging as a method for screen-scraping has been around since (at least) the late 90s. Many similar tools were used internally by job board and e-commerce aggregators.
In essence, tagging produces a large regular expression into which the pages of the site can be fed. However, most such tools have run aground on the shoals of maintenance. Small changes to the target site require re-generation of the regex, which rapidly becomes a nightmare.
There are also subtler issues that relate to normalizing data, e.g. different pages might represent dates differently.
The smart money seems to be on domain-specific screen-scraping powered by semantic intelligence. This is way more scalable than site-specific regexes.
Posted by: Ranjit | May 2, 2007 10:01 AM
I don't think there's any question that, for 99% of the web that doesn't explicitly give permission to copy its content, using Dapper to scrape it is going to be a copyright violation. Copyright law doesn't care about attribution or driving traffic to the author - it's about permission. The only exceptions are those provided by legal precedent under "fair use", which again isn't about attribution, driving traffic to the author, or even commercial nature.
To put it simply, copying any significant part of a webpage without permission is a violation of the author's exclusive right to make copies of a work. Dapper may not be liable as it's only acting as a proxy, a service provider, like your ISP; I don't know how that'll hold up. But the end user will definitely be making unauthorized copies by displaying the scraped content on their own site without permission.
The only legal solution is, of course, to only use content from websites that give permission to copy that content.
Posted by: Dan Grossman | May 2, 2007 10:05 AM
Dan, how do we define 'substantial'? Say there is a movie review page on IMDB and we scrape movie title, director and stars. Is this substantial or not?
Alex
Posted by: Alex Iskold | May 2, 2007 11:18 AM
A movie title, director and stars are facts. They're not an original creative work. They're not protected by copyright in the first place, so there's no issue. The review someone writes, however, is protected. Fair use still dictates what you can do with that review and what constitutes fair use is pretty well established.
Posted by: Dan Grossman | May 2, 2007 11:38 AM
> The only legal solution is, of course, to only use content
> from websites that give permission to copy that content.
So Google's cache is illegal? The wayback machine is illegal? Or am I missing something about your interpretation?
Posted by: Pete Wrden | May 2, 2007 1:53 PM
That is actually a really interesting point Pete. Would any excerpt of the text be considered a violation of the copyright?
To me, a pointer back to the original content is an attribution. It's like quoting from someone, a trackback basically.
Alex
Posted by: Alex Iskold | May 2, 2007 2:05 PM
Yes, they are both violating copyright law. Even the Electronic Frontier Foundation attorney Fred Lohman admitted that when CNET ran an article about it in 2003. The only reason it happens is that copyright holders are allowing it. If you sued, you might win, but what would you win -- how much damage can you prove a Google cache did to you by making that copy?
I can imagine it being much easier for someone to create a site with Dapper that scrapes content and presents it in a better way than the original site, such that users of that site start switching to the better, and there'd be real damages to base a lawsuit on there.
Posted by: Dan Grossman | May 2, 2007 2:06 PM
@Alex: Quoting an attributed excerpt as part of journalistic reporting, review, and criticism is protected fair use under copyright law. It's completely legal. Dapper makes easy so much more that isn't legal. I'm not against Dapper at all, even conceptually, and I'm not a lawyer, although its application to the digital era is an interest for me so I spend some time following it.
Posted by: Dan Grossman | May 2, 2007 2:10 PM
It's basically a screen-scraping type service, but it's no longer needed seeing as how RSS feeds have become mainstream and are quite popular.
Posted by: Live tv | May 2, 2007 4:45 PM
My understanding of the Google Cache legality was that there are three requirements for fair use:
1. Visible statement that the document is a copy and not the original.
2. Link back to the original source, clearly labeling it as such.
3. Timestamp indicating when the copy/cache was made.
Posted by: Ranjit | May 2, 2007 6:37 PM
Dapper is a cool tool and so is Open Kapow, its cousin, but aren't these tools just making plagiarizing content easier? People who make RSS know that people will syndicate their content. The ones who don't need to keep the control of choosing not to give out information. I discuss this issue and more in my post:
http://riteshnayak.com/blog/2007/04/08/automated-portals-can-there-be-a-turing-test-for-such-sites/
Posted by: Ritesh | May 2, 2007 11:38 PM
#13 "Would any excerpt of the text considered to be a violation of the copyright?"
#14 "Yes, they are both violating copyright law."
WOW, I thought for a second there that my 10K del.icio.us tags were gonna send me to the slammer (especially since I have it rolling into an unedited blog every night for a little more domain traffic), but then I saw
#17 "Link back to the original source, clearly labeling it as such." which of course every delicious save does by default.
Posted by: BillyG | May 3, 2007 4:21 AM
I saw a one-on-one demo of this (and Kapow) at the Web 2.0 conf and was very impressed.
It will fill a need and allow better mashups. The legal side of it ... it will work itself out ... if an appl gets useful they will eventually need 2 go back 2 source and look 2 strike some deal that makes sense 2 both sides.
Posted by: Brendan Lally | May 10, 2007 9:01 PM
Lal