ReadWriteWeb

Web 3.0: When Web Sites Become Web Services

Written by Alex Iskold / March 19, 2007 12:11 PM / 45 Comments

Today's Web has terabytes of information available to humans, but hidden from computers. It is a paradox that information is stuck inside HTML pages, formatted in esoteric ways that are difficult for machines to process. The so-called Web 3.0, which is likely to be a precursor of the real semantic web, is going to change this. What we mean by 'Web 3.0' is that major web sites are going to be transformed into web services - and will effectively expose their information to the world.

The transformation will happen in one of two ways. Some web sites will follow the example of Amazon, del.icio.us and Flickr and will offer their information via a REST API. Others will try to keep their information proprietary, but it will be opened via mashups created using services like Dapper, Teqlo and Yahoo! Pipes. The net effect will be that unstructured information will give way to structured information - paving the road to more intelligent computing. In this post we will look at how this important transformation is taking place already and how it is likely to evolve.

The Amazon E-Commerce API - open access to Amazon's catalog

We have written here before about Amazon's visionary WebOS strategy. The Seattle web giant is reinventing itself by exposing its own infrastructure via a set of elegant APIs. One of the first web services opened up by Amazon was the E-Commerce service. This service opens access to the majority of items in Amazon's product catalog. The API is quite rich, allowing manipulation of users, wish lists and shopping carts. However, its essence is the ability to look up Amazon's products.

Why has Amazon offered this service completely free? Because most applications built on top of this service drive traffic back to Amazon (each item returned by the service contains the Amazon URL). In other words, with the E-Commerce service Amazon enabled others to build ways to access Amazon's inventory. As a result many companies have come up with creative ways of leveraging Amazon's information - you can read about these successes in one of our previous posts.

The rise of the API culture

The web 2.0 poster child, del.icio.us, is also famous as one of the first companies to open a subset of its web site functionality via an API. Many services followed, giving rise to a true API culture. John Musser over at programmableweb has been tirelessly cataloging APIs and Mashups that use them. This page shows almost 400 APIs organized by category, which is an impressive number. However, only a fraction of those APIs are opening up information - most focus on manipulating the service itself. This is an important distinction to understand in the context of this article.

The del.icio.us API offering today differs from Amazon's, because it does not open the del.icio.us database to the world. What it does do is allow authorized mashups to manipulate the user information stored in del.icio.us. For example, an application may add a post, or update a tag, programmatically. However, there is no way to ask del.icio.us, via the API, what URLs have been posted to it or what has been tagged with the tag web 2.0 across the entire del.icio.us database. These questions are easy to answer via the web site, but not via the current API.

Standardized URLs - the API without an API

Despite the fact that there is no direct API (into the database), many companies have managed to leverage the information stored in del.icio.us. Here are some examples... 

Delexa is an interesting and useful mashup that uses del.icio.us to categorize Alexa sites. For example, here are the popular sites tagged with the word book.

Another web site called similicio.us uses del.icio.us to recommend similar sites. For example, here are the sites that it thinks are related to Read/WriteWeb.

So how do these services get around the fact that there is no API? The answer is that they leverage standardized URLs and a technique called Web scraping. Let's understand how this works. In del.icio.us, for example, all URLs that have the tag book can be found under the URL http://del.icio.us/tag/book; all URLs tagged with the tag movie are at http://del.icio.us/tag/movie; and so on. The structure of this URL is always the same: http://del.icio.us/tag/[TAG]. So given any tag, a computer program can fetch the page that contains the list of sites tagged with it. Once the page is fetched, the program can perform the scraping - the extraction of the necessary information from the page.
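To make the idea of a standardized URL concrete, here is a minimal Python sketch that builds the listing URL for any tag from the pattern http://del.icio.us/tag/[TAG]. The function names tag_url and fetch_tag_page are hypothetical, chosen only for illustration, and the sketch assumes the URL scheme described in this post is still being served.

```python
from urllib.parse import quote
from urllib.request import urlopen

# URL pattern described in the article; assumed, not verified against a live site.
BASE_URL = "http://del.icio.us/tag/"


def tag_url(tag: str) -> str:
    """Build the listing URL for a tag, percent-encoding unsafe characters."""
    return BASE_URL + quote(tag, safe="")


def fetch_tag_page(tag: str, timeout: float = 10.0) -> str:
    """Download the HTML page for a tag (hypothetical helper, needs network access)."""
    with urlopen(tag_url(tag), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only the URLs are printed here; no request is made.
    for tag in ("book", "movie", "web 2.0"):
        print(tag_url(tag))
```

Because the URL is a pure function of the tag, a program needs nothing but the tag name to locate the page; everything after that is a matter of scraping.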

How Web Scraping Works

Web Scraping is essentially reverse engineering of HTML pages. It can also be thought of as parsing out chunks of information from a page. Web pages are coded in HTML, which uses a tree-like structure to represent the information. The actual data is mingled with layout and rendering information and is not readily available to a computer. Scrapers are the programs that "know" how to get the data back from a given HTML page. They work by learning the details of the particular markup and figuring out where the actual data is. For example, in the illustration below the scraper extracts URLs from the del.icio.us page. By applying such a scraper, it is possible to discover what URLs are tagged with any given tag.
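As a rough sketch of what such a scraper looks like, the Python example below pulls link URLs out of a small HTML fragment using only the standard library. The markup in SAMPLE_PAGE is invented for illustration (it is not the real del.icio.us page structure), and the class name PostLinkScraper is hypothetical. The point is that the scraper has to "know" that bookmarks live inside list items with class "post" and must ignore links elsewhere on the page.

```python
from html.parser import HTMLParser

# Invented markup standing in for a tag listing page (an assumption, not real del.icio.us HTML).
SAMPLE_PAGE = """
<html><body>
  <div id="nav"><a href="/help">help</a></div>
  <ul class="posts">
    <li class="post"><h4><a href="http://example.com/books">Book reviews</a></h4></li>
    <li class="post"><h4><a href="http://example.org/reading">Reading list</a></h4></li>
  </ul>
</body></html>
"""


class PostLinkScraper(HTMLParser):
    """Collect href values of links that appear inside <li class="post"> elements."""

    def __init__(self):
        super().__init__()
        self.in_post = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "li" and attributes.get("class") == "post":
            self.in_post = True
        elif tag == "a" and self.in_post and "href" in attributes:
            self.urls.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_post = False


def scrape_urls(html: str) -> list:
    """Run the scraper over a page and return the extracted URLs in document order."""
    scraper = PostLinkScraper()
    scraper.feed(html)
    return scraper.urls


if __name__ == "__main__":
    print(scrape_urls(SAMPLE_PAGE))
```

Note how tightly the code is coupled to the markup: if the site renames the "post" class or restructures the list, the scraper silently returns nothing, which is one reason scraping is considered fragile.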

Dapper, Teqlo, Yahoo! Pipes - the upcoming scraping technologies

We recently covered Yahoo! Pipes, a new app from Yahoo! focused on remixing RSS feeds. Another similar technology, Teqlo, has recently launched. It focuses on letting people create mashups and widgets from web services and RSS. Before both of these, Dapper launched a generic scraping service for any web site. Dapper is an interesting technology that facilitates the scraping of web pages, using a visual interface.

It works by letting the developer define a few sample pages and then helping her denote similar information using a marker. This looks simple, but behind the scenes Dapper uses a non-trivial tree-matching algorithm to accomplish this task. Once the user defines similar pieces of information on the page, Dapper allows the user to make it into a field. By repeating the process with other information on the page, the developer is able to effectively define a query that turns an unstructured page into a set of structured records.
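The end result, a page turned into structured records, can be sketched in a few lines of Python. This is not Dapper's tree-matching algorithm: here the field definitions are written by hand as simple paths rather than inferred from sample pages, and the markup, the field names and the extract_records function are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup standing in for a product listing page.
SAMPLE_PAGE = """
<div>
  <div class="item"><span class="title">First book</span><span class="price">10.00</span></div>
  <div class="item"><span class="title">Second book</span><span class="price">12.50</span></div>
</div>
"""

# Hand-written field definitions: field name -> path relative to each repeated row.
FIELDS = {
    "title": "span[@class='title']",
    "price": "span[@class='price']",
}


def extract_records(page: str, row_path: str, fields: dict) -> list:
    """Apply the field paths to every row and return one dict per row."""
    root = ET.fromstring(page.strip())
    records = []
    for row in root.findall(row_path):
        record = {}
        for name, path in fields.items():
            node = row.find(path)
            # A missing field becomes None instead of raising an error.
            record[name] = node.text if node is not None else None
        records.append(record)
    return records


if __name__ == "__main__":
    for record in extract_records(SAMPLE_PAGE, ".//div[@class='item']", FIELDS):
        print(record)
```

A visual tool would let the developer produce the equivalent of FIELDS by pointing at examples on the page; once the fields exist, the page can be queried like a table.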

The net effect - Web Sites become Web Services

Here is an illustration of the net effect of apps like Dapper and Teqlo:

So bringing together Open APIs (like the Amazon E-Commerce service) and scraping/mashup technologies gives us a way to treat any web site as a web service that exposes its information. The information, or to be more exact the data, becomes open. In turn, this enables software to take advantage of this information collectively. With that, the Web truly becomes a database that can be queried and remixed.

This sounds great, but is this legal?

Scraping technologies are actually fairly questionable. In a way, they can be perceived as stealing the information owned by a web site. The whole issue is complicated because it is unclear where copy/paste ends and scraping begins. It is okay for people to copy and save the information from web pages, but it might not be legal to have software do this automatically. And scraping a page and then offering a service that leverages the information without crediting the original source is unlikely to be legal.

But it does not seem that scraping is going to stop, just as legal issues with Napster did not stop people from writing peer-to-peer sharing software, and just as the more recent YouTube lawsuit is not likely to stop people from posting copyrighted videos. Information that seems to be free is perceived as being free.

The opportunities that will come after the web has been turned into a database are just too exciting to pass up. So if conversion is going to take place anyway, would it not be better to rethink how to do this in a consistent way?

Why Web Sites should offer Web Services

There are several good reasons why Web Sites (online retailers in particular) should think about offering an API. The most important reason is control. Having an API will not only make scrapers unnecessary, it will also allow tracking of who is using the data - as well as how and why. Like Amazon, sites can do this in a way that fosters affiliates and drives the traffic back to their sites.

The old perception is that closed data is a competitive advantage. The new reality is that open data is a competitive advantage. The likely solution then is to stop worrying about protecting information and instead start charging for it, by offering an API. Having a small fee per API call (think Amazon Web Services) is likely to be acceptable, since the cost for any given subscriber of the service is not going to be high. But there is a big opportunity to make money on volume. This is what Amazon is betting on with their Web Services strategy and it is probably a good bet.

Conclusion

As more and more of the Web is becoming remixable, the entire system is turning into both a platform and a database. Yet, such transformations are never smooth. For one, scalability is a big issue. And of course legal aspects are never simple.

But it is not a question of if web sites become web services, but when and how. APIs are a more controlled, cleaner and altogether preferred way of becoming a web service. However, when APIs are not available or sufficient, scraping is bound to continue and expand. As always, time will be the best judge; but in the meanwhile we turn to you for feedback and stories about how your businesses are preparing for 'web 3.0'.




Comments


  1. Scraping is terribly unreliable and inefficient. I can't believe Web 3.0 is about unreliability and inefficiency. If anything, it should be the opposite.

    Posted by: Tinus | March 19, 2007 12:55 PM



  2. Well the efficient thing is when there are full APIs available, which is what one hopes to see...

    Posted by: Richard MacManus | March 19, 2007 1:08 PM



  3. Don't forget microformats and similar things that will allow regular sites to offer up properly formatted data without significantly changing workflow. Services in Yahoo Pipes' ilk will be able to use data provided by microformats very accurately in future.

    Posted by: Peter Cooper | March 19, 2007 1:11 PM




  4. Dapper is a key technology as you say. 'Self-tagging' data was all the rage in the scientific computing community back in 1983 or so. The result of that era was a number of standards (netCDF and hdf5 particularly) that enable remote access to large data sets. This distributed paradigm is much more web-friendly than one-server data centralisation, such as Google base or the (as yet unreleased) Freebase.

    A 'data mashup' service like dapper allows retrieval of remote data sets via URLs, and can understand a handful of scientific formats. Dapper is especially strong in having a 'free text' adapter for CSV/TSV style formats -- pretty much anything you can do with Pipes. :)

    I think presenting a set of canned services with some of the nice Geo data sets out there would be a good 'Web 2.0' move -- much less waiting for Web 3.0 to materialise.

    Array languages like Yorick or even more esoterically the LBL product PDB (which integrates with Yorick) allow a quick and dirty way to munge netCDF and similar formats. Most have interesting interfaces to more standard languages, as well as, significantly, Scheme-like ones.

    Posted by: John Goodwin | March 19, 2007 1:12 PM



  5. why should scraping be illegal?
    If a company exposes itself in the public space i can use the information. If the company wants the information to be restricted they should employ some use barrier: password-protected site, entry page where one has to agree to some kind of license, restrict access by the same IP in a certain time frame,...
    When i take a photograph of Times Square should i have to double-check with all companies whose copyrighted logo happens to be on my image? i don't think so. But i bet company lawyers would like us to believe that.

    Posted by: lhe | March 19, 2007 1:12 PM



  6. Whoops -- I meant this one. The idea is worth a look though!

    http://www.opendap.org/developers/third_party_software.html

    Posted by: John Goodwin | March 19, 2007 1:16 PM



  7. @lhe - so this is a very interesting point, that is probably going to be highly debatable. I think the key issue is: how is the data being used? In the extreme case we know that if you copy your essay straight out of the textbook teachers will not like that. So where exactly is the boundary here of what is acceptable and what is not?

    Alex

    Posted by: Alex Iskold | March 19, 2007 1:18 PM



  8. I think it's important to look at web services such as Freebase, built off MetaWeb's infrastructure, and note that these guys are trying to give birth to a host of companies that build off their structured, managed approach to building schema for the "semantic web." I'm intrigued by what they've done, but am also torn -- I'm uneasy with any one company trying to be the rule-maker when it comes to this kind of stuff. On the other hand, Microformats offer a decentralized way for more structured meta information to percolate from the bottom up. The problem with something like this is, of course, adoption and agreeing upon standards.

    Posted by: Steve de Brun | March 19, 2007 1:29 PM



  9. First and foremost, thanks for mentioning Teqlo in this post.

    I think it’s a little unfortunate that scraping seems to get a lot higher profile than it deserves. Teqlo doesn’t scrape anything, we manipulate APIs to enable sequencing of services that our users specify in order to automate some task they do. In other words, it’s a mashup that goes well beyond taking search results from one service and dropping them in another. There’s a real appetite for this, our 2,000 beta users have created almost 400 mashups apps in less than a month.

    Having said all that, we do take advantage of OpenKapow to get to websites that don’t have readily accessible APIs, like Linkedin. Works great and proves the point that just because something is old technology doesn’t mean it won’t work well. BTW, Kapow is very sophisticated in terms of how it does its work, I’m reluctant to lump it in with “old technology” because it’s not.

    Posted by: Jeff Nolan | March 19, 2007 2:15 PM



  10. isn't access to APIs an essential component of any real Web 2.0 site? Google Maps, Flickr, Delicious, etc., etc., etc. If you don't have an API, I think of you as very Web 1.0. I don't think what you are talking about is the next big thing...it's happening right now.

    Posted by: adm | March 19, 2007 3:48 PM



  11. @adm: this is true, except that not all sites have api's now. thats what I am saying is bound to happen, particularly for the sites that are essentially views of databases.

    Alex

    Posted by: Alex Iskold | March 19, 2007 3:52 PM



  12. I can see sites transforming more and more into services, I guess that's the next step. I also believe that the web browser in its current form won't be here for much longer. I think specialized "browsers" will start to appear. As a matter of fact, here is a humble attempt: http://soarack.blogspot.com

    ovi

    Posted by: ovi | March 19, 2007 5:00 PM



  13. Why Literary agents and publishers have to make more money than authors/producers. £1 per copy sold. Why don't authors sell directly. Cheque www.writeout.co.uk in yahoo search. it's not ready yet but have a look

    Posted by: JoeBondini | March 19, 2007 5:04 PM



  14. This is a very complex issue and there will be many court suits before a common approach is widely agreed upon. From my experience and contrary to what this post mentions, content provider sometimes don't like scraping even if the proxy side credits the source. In the past I built an added value tool that combines postings from FatWallet and SlickDeals bargain forums, and I am still not sure if it will be there for long.

    Posted by: Yan | March 19, 2007 5:29 PM



  15. @Steve - Semantic Web technologies can be used in a non-centralized manner as well, and they carry the benefits of microformats without the limitations in what can be expressed.

    I think that SPARQL, the Semantic Web query language (or something very similar), is going to play a huge role in the next generation of data access APIs, as I've considered in a past blog entry. There's no doubt that data is extremely valuable; it is even more valuable when it's accessed through a flexible API. This doesn't necessarily mean that the data needs to be stored natively in RDF (the Semantic Web data format) - it could be in a relational database with a SPARQL layer running on top of it (there are existing open-source projects that provide this).

    Posted by: wingerz | March 19, 2007 5:49 PM



  16. You know what, I just had to click on this article and then read only one sentence and then comment. Why? Because from the start to finish of this entire process I was ticked off. Why don't people actually concentrate on making their code semantic and respecting standards instead of inventing new words and adding numbers up (referring to web 2.0, 3.0 etc...). I'm really sick of this, to me it's like talking and talking about stuff, inventing new lies and keepin on talkin instead of actually ACTING and DOING what has to be done. I propose we call everything new on the web 'WEB' and everything old on the web we would call 'old WEB' or 'before' and stop putting silly version numbers on an entity that never stops evolving and thus cannot be versioned.

    Besides, this site can't even pass a simple xhtml 1.1 validation as its doctype dictates.

    Posted by: QuickBrownFox | March 19, 2007 6:48 PM



  17. @QuickBrownFox - I don't think you should be getting angry. People are trying to get to where we need to be. There is not a single clear path since there is a difference between talking about how things should be and then executing it on the web scale.

    As far as xhtml non-compliance, you should direct this to type pad not us, we as users of the software should not have to edit the headers.

    Alex

    Posted by: Alex Iskold | March 19, 2007 7:00 PM



  18. Sorry but in my view, this is one of the most 'flawed' articles I have read on web 2.0 in a long time.

    I'm not a 'web 2.0' hater, on the other hand a supporter. But so many people miss what it has slowly come to mean to the internet. There never will be a 'web 3.0' because 'web 2.0' when you strip away all the hype and spin (ajax, style, community power) is just a model for the internet that 'works'.

    The .com boom was powered by web 1.0 and it just collapsed under its own flawed system and weight in the 2001 .com bust - its whole concept and revenue model was fatally flawed. As with any new platform it takes a while to work out its revenue model. Web 2.0 is about a 'fixed' model of the internet, largely pioneered by google, giving power to the user, and not charging them for it (the VC's of the .com boom would have rather cut their hands off than do that).

    Unless there is another .com boom and bust (which, provided this model really does work, there won't be) there never will be a web 3.0 ... the concept of 'what defines' web 2.0 will simply continue to evolve and develop as we see the internet become richer, more interactive and secure a realistic long-term business model.

    This article on the other hand seems to presume that by introducing an updated API system, ajax and javascript styles to give the user a richer-experience cross platform is worthy of the title 3.0, when in fact, those concepts date back to pre .com boom (take Viaweb for example) and have slowly been taking their natural course of progression for several years now.

    Though the physical concepts, systems and predictions in the article I agree with, I think calling this web 3.0 is simply an attention grabber. This is just the next logical step in the process.

    Posted by: Alex | March 19, 2007 7:28 PM



  19. To be crystal clear, we are not trying to 'define' web 3.0 here - we only used the term to signify that it is something between what has come to be known as 'web 2.0' and the Semantic Web. Note that there have been a lot of different attempts to 'define' web 3.0 already, but the common denominator seems to be that it's the next step from where we are now - but not quite the Semantic Web.

    In any case, I swore off defining web 2.0 a long time ago - and it is certainly not our intention to try defining web 3.0. What I suggest is that you read the whole article and then make up your mind about the concepts that Alex is presenting here. In hindsight, perhaps we should've just left off the web 3.0 bit, as it seems to be distracting from the main theme - i.e. When Web Sites Become Web Services.

    Posted by: Richard MacManus | March 19, 2007 7:51 PM



  20. Web APIs are definitely part of the web 2.0 umbrella. Nothing new here, and definitely nothing 'web 3.0'

    Jordan http://www.sumolabs.com

    Posted by: Jordan Willms | March 19, 2007 8:37 PM



  21. Check out Ruby on Rails' SimplyRESTful (released) and ActiveResource (coming soon!) to see how the future of web sites and services will look.

    Posted by: Tom Mornini | March 19, 2007 10:07 PM



  22. Alex;

    I thought this post was a tremendously valuable overview of what's happening today. I have been following the development of the programmable Web for some time, and your post is the most comprehensive discussion I have seen.

    I agree with your view that allowing access to information is good. There is unquestionably a growing school of thought that indicates the value of "transparency," which lets you tap the innovative ideas of the outside world, leads to a more powerful long-term business than businesses that cling to a model built on "not invented here."

    Here's one question: To me, the idea of the programmable Web is all about empowering the individual who does not know how to program: That's the point of all the services you mention. These services let anyone create something that does something meaningful for them. If enterprises followed the suggestions you make in this post, would it help or hurt this fundamental goal?

    Posted by: Bruce Judson | March 19, 2007 10:48 PM



  23. @Bruce Judson - It would make things substantially easier for everyone including non-techies. Personally I am somewhat skeptical that people with no technical background would be creating mashups. My skepticism is based on years of experience with technical people creating bad software :) But of course that might not translate directly.

    Alex

    Posted by: Alex Iskold | March 19, 2007 10:56 PM



  24. In addition to great APIs, I think we'll also see a rise in the use of 'Data Widgets.' Data Widgets are pieces of code that will allow everyday people to expose the data that's locked away in their web pages.

    This site is a pretty good reference: http://aboutapex.org

    Posted by: Shannon Whitley | March 19, 2007 11:59 PM



  25. What you missed out are Microformats which are a way to add structure to the HTML you are already generating, thereby reducing the need for a separate API. See microformats.org for more details, or this 2004 presentation:

    http://tantek.com/presentations/20040928sdforumws/semantic-xhtml.html

    Posted by: Kevin Marks | March 20, 2007 12:11 AM



  26. Sorry Alex but I don't really agree...

    As Kevin mentioned you missed microformats. Also I think APIs (in one form or another) are already here and in many ways make up what we think of as Web 2.0.

    The next frontier will probably be helping the rest of the media catch up with the Web - to democratize the rest of the world.

    Manual Trackback:
    http://www.touchstonelive.com/blog/2007/03/web-20-nothing-to-see-here-moving-right.html

    Posted by: Chris Saad | March 20, 2007 12:45 AM



  27. In case of del.icio.us I think you are forgetting RSS as a way of opening up the del.icio.us database. Therefore web scraping is often not required and the data is already structured.

    Posted by: Yme Bosma | March 20, 2007 1:03 AM



  28. great post.

    Posted by: michael arrington | March 20, 2007 2:14 AM



  29. Have recently started playing with Yahoo Pipes will check out Teqlo and Dapper

    Posted by: Terinea Tech Tips | March 20, 2007 3:41 AM



  30. "Web Scraping is essentially reverse engineering of HTML pages. It can also be thought of as parsing out chunks of information from a page."

    Why reverse engineer when we have rss

    swik.net/Web2.0/Read%2FWriteWeb

    then there is this

    xml.swik.net/Web2.0/Read%2FWriteWeb

    Is this web scraping? A page on swik.net is created for each
    posting.

    Posted by: Dean Fragnito | March 20, 2007 5:02 AM



  31. What you are describing is referred to by the masses as Web 2.0, not "Web 3.0". Web 2.0 is a fictional buzz word, let alone Web 3.0. It's a nice attempt at trying to generate interest, but your calling this "Web 3.0" in reality is baseless.

    Also I have to agree with QuickBrownFox. With all due respect, how can you expect to move forward to web 2.0, 3.0, 73.0 etc. etc. if you can't even write a simple valid (x)html page? :) That's pretty ironic.

    Posted by: EoN | March 20, 2007 5:27 AM



  32. Want to add some clarifying comments. First, I did not say that we are in web 3.0, my thinking is that Web 3.0 phase is when web sites turn into API. Web 2.0 started that, but it is far from being finished. So Web 3.0 is not here.

    Also, note that what I am saying will be the next web, is not the Semantic Web, but something that largely resembles what we have now - APIs. The difference is that there will be a non-linear effect from many web sites opening up.

    Regarding RSS and Microformats. RSS is often mistaken for a structured format. It actually is not. You can make it structured using technologies like Google Base, but you need conventions. This brings us to Microformats, which have a good potential. What I see needs to happen is: more standards and wider adoption.

    Alex

    Posted by: Alex Iskold | March 20, 2007 6:30 AM



  33. I find it interesting (and significant) that the term Web Services has been successfully hijacked from the SOAP/WS-* folks. I've been following this trend closely because its clear to me that the 'rise of the API' culture, as you put it, was going to succeed where the W3C failed.

    RSS took off like wildfire because it was simple and got the job done. Just like HTML before it.

    I think RESTful APIs and data interfaces are going to continue to proliferate because they're simple and get the job done. SOAP will continue to languish and will most likely turn into the CORBA of the 21st Century.

    Posted by: Chris Marino | March 20, 2007 7:06 AM



  34. As for the Amazon E-Commerce API, what an incredibly astute business move that was! But why aren't other e-tailer's following suit? Is it because they're afraid of comparison web-apps and the price wars that might follow? But hold on, those web apps are already out there - and precious few price wars have erupted. That can't be the reason.

    What does anyone else think regarding why we have yet to see widespread adoption amongst the big-boy e-tailers of open E-commerce API's like Amazon's?

    Posted by: J.D. | March 20, 2007 7:17 AM



  35. The key point to me, is your comment about monetisation - the bust of the first web boom is that very few people, except retailers, could monetise their business.

    Since then we've had the rise of an advertising funded web, that people mistake for being 'free' - it's not really a different model from commercial TV or free newspapers. Sites like mySpace deliver a particularly hard to reach set of consumers by traditional advertising, hence their really irritating habit of sending an email to force you to log in to look at a message (just embed the ad in HTML mail and have done with it!).

    Users perceive information to be 'free' (see comments above) if it's public, but as you correctly say, it is still disputable whether a web page is public domain.

    There is absolute resistance towards paying for information in a raw form, and that resistance is going to hold back the development of businesses providing raw data API services.

    It's a shame really, as Ted Nelson's Xanadu project foresaw that allowing payments (particularly micropayments) would encourage people to make information available on-line, but the technical implementation of the web has countered against it.

    Posted by: JulesLt | March 20, 2007 7:23 AM



  36. An interesting article I must say!

    I'm a librarian and there's a vast amount of information that we have to deal with on a regular basis. Two data aggregation tools that have helped me beyond appreciation are: Dapper and Feedity (www.feedity.com). Its a wonder how these services "scrape" data to add value to it.

    Posted by: RM | March 20, 2007 7:56 AM



  37. Alex, this is a highly useful post. Thanks!

    The most important point you make is the strategic value of offering open data. Amazon and eBay are good examples, but there are plenty of examples in less-known companies and sectors. Domain Tools uses the Compete.com API to provide web analytics as part of its service; in exchange, the obligatory attribution link has driven substantial new traffic to Compete's own website. In the wine industry, a great service called Wine Searcher has scraped wine retailers' websites worldwide for several years to provide the most complete and accurate database of fine and rare wine values available anywhere.

    At Mashery, we're seeing a lot of websites offering web services APIs as a means of more broadly distributing their data while being able to control how it is used. Ultimately, the API is less a technical tool than it is a means of creating a distribution channel and acting as a business development catalyst.

    If an API is well-managed, anyone using the API will agree to the vendors' terms and conditions of use before being issued an access key to the API, and usage of the API is tracked and managed by the API provider. Large providers like Amazon and eBay make their data available through APIs, but they do take these steps to manage the data's use. Legitimate users of the data are willing to follow the rules in exchange for the ease of access and flexibility offered by the API.

    Mashery has provided API management infrastructure for access control and developer community management to companies like Trulia and Compete.com, and is working with others to offer both "open data" and "service manipulation" to developers.

    Posted by: Oren Michels | March 20, 2007 9:12 AM



  38. Hi,

    This article has very useful information, it will be helpful for many Web 3.0 aspirants. Actually one of my friends first read this article and asked me to visit this page.
    It’s really amazing to read this description of this article. Thank you so much for your help and for your efforts.

    Thanks,
    Steve
    http://www.eplanetlabs.com

    Posted by: Steve | March 20, 2007 1:05 PM



  39. Please read about web hooks: http://blogrium.com/2006/08/27/web-hooks/

    More powerful than traditional API's and easier to implement. If you want to evangelize something that will create a better read-write web, evangelize web hooks.

    Posted by: Jeff Lindsay | March 20, 2007 2:02 PM



  40. I remember when posts used to be bite sized. My head hurts - I think I need to lie down.

    Posted by: aaron | March 20, 2007 2:47 PM



  41. Yes! I got through it now. I saw nothing of microformats, RDF, or SPARQL. I guess I need to read it again to see if I skimmed over them. *sigh*

    Posted by: aaron | March 20, 2007 8:19 PM



  42. Many of the newer scraping technologies like Dapper and Teqlo have started enabling simple APIs on web sites. Kapow Technologies has been around for years and has recently published openkapow, a powerful API builder that transforms any web site into an API and allows it to be exposed as REST, RSS and other formats. If you are interested in seeing a well-proven concept of easy API enabling, you should take a look.

    Posted by: madsh | March 21, 2007 5:03 AM



  43. Hey Alex ... excellent article and most interesting comments.
    I started thinking about what it would take to provide a SIMPLE API over the Web. Four hours later, I came up with this

    http://scripweb.poweredbyiis7-appliedi.net/getxml.aspx

    and

    http://scripweb.poweredbyiis7-appliedi.net/getjson.aspx

    The first provides an XML API over standard Web Sites and the second a JSON API.
    So to consume, say http://www.diggriver.com as an XML data source, all you have to do is

    http://scripweb.poweredbyiis7-appliedi.net/getxml.aspx?url=http://www.diggriver.com

    and presto, you get an XML stream of the web page. Need it in JSON format ...

    http://scripweb.poweredbyiis7-appliedi.net/getjson.aspx?url=http://www.diggriver.com


    But wait ... I don't want the entire contents of the web page, I just want the links ... no problem ...

    http://scripweb.poweredbyiis7-appliedi.net/getxml.aspx?url=http://www.diggriver.com~//a

    Note the ~//a tacked onto the url? This provides an XPATH query capability over the XML Data Stream.
    But I really don't want all the links, I just want the ones that apply to the Digg items ... no problem

    http://scripweb.poweredbyiis7-appliedi.net/getxml.aspx?url=http://www.diggriver.com~//li/a

    or how about just the title

    http://scripweb.poweredbyiis7-appliedi.net/getxml.aspx?url=http://www.diggriver.com~//title

    or how about the Google search results for your name

    http://www.google.ca/search?hl=en&q=+Alex+Iskold&btnG=Google+Search&meta=

    This refines the XPATH Query to a specific set of nodes. I'll not go into all the variations of XPATH syntax ... you can Google XPATH for that.

    I guess what this all demonstrates is that we tend to look for complex solutions sometimes when all we need is a simpler (dare I say Web 2.0) approach.

    All this took about 4 hours to create and while it's not as sophisticated as some of the other tools/technologies mentioned ...
    it sure is simpler and maybe, just maybe "good enough".

    Now ... no doubt this site will get hammered and I'll lose my ISP privileges on this beta site so ...

    Sponsors ... please step up :-)

    Posted by: Mike Parsons | March 21, 2007 12:37 PM
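
The `url~xpath` convention described in the comment above can be approximated with the Python standard library. This is a rough sketch, not the commenter's ASP.NET code: the names `split_request`, `select` and `HtmlTreeBuilder` are hypothetical, the page is passed in as a string rather than fetched, and ElementTree supports only a subset of XPath, so only relative queries such as `//a` or `//li/a` are handled.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlTreeBuilder(HTMLParser):
    """Turns reasonably well-formed HTML into an ElementTree."""

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []

    def handle_starttag(self, tag, attrs):
        attrib = {name: (value or "") for name, value in attrs}
        if self.stack:
            elem = ET.SubElement(self.stack[-1], tag, attrib)
        else:
            elem = ET.Element(tag, attrib)
            if self.root is None:
                self.root = elem
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.stack:
            return
        parent = self.stack[-1]
        if len(parent):  # text that follows a child element goes in its tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def split_request(value):
    """Split 'url~xpath' into (url, xpath); xpath is None when absent."""
    url, sep, xpath = value.partition("~")
    return url, (xpath if sep else None)


def select(html, xpath=None):
    """Return the elements matching xpath, or the whole tree's root."""
    builder = HtmlTreeBuilder()
    builder.feed(html)
    builder.close()
    if xpath is None:
        return [builder.root]
    # ElementTree needs a relative path, so '//a' becomes './/a'.
    if xpath.startswith("/"):
        xpath = "." + xpath
    return builder.root.findall(xpath)


SAMPLE = """<html><head><title>Demo page</title></head>
<body><ul><li><a href="/story/1">First</a></li>
<li><a href="/story/2">Second</a></li></ul>
<p><a href="/about">About</a></p></body></html>"""

url, xpath = split_request("http://www.diggriver.com~//li/a")
links = [a.get("href") for a in select(SAMPLE, xpath)]
```

With the sample page, `//a` selects all three links, `//li/a` selects only the two inside list items, and `//title` selects the page title, mirroring the progression of queries in the comment. A real service would also have to cope with malformed HTML, which this sketch does not attempt.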



  44. Alex,

    Nice summary. We'll look back and see that Web 2.0 was a wonderful transition time, with millions of great ideas springing forth. But there is a coming shakeout, and many individual tools will only be valuable if they become bundled together.

    The same thing happened in the music industry. Randy Newman, the songwriter and composer, described the availability of cheap recording equipment as the "great equalizer". That, and digitally distributed MP3s, unleashed a torrent of poorly produced content with no way to sift through it. Now some tools are finally emerging, and there will be a shakeout of the tools and the talent.

    The keys to Web 2.0 ideas surviving are: moving beyond fun to usefulness, moving from automated to curated via collective intelligence, and evolving from simply grabbing content to semantically parsing it. We're trying some of that here at www.crowdrules.com.

    Posted by: David Moss | March 25, 2007 1:50 PM



  45. Don't forget the "classic" software: Web scraping can be added to any application with the iMacros software from http://www.iopus.com/imacros
    Of course, you can run it on a server, too.

    Posted by: Sandra | March 29, 2007 2:36 AM


