scraping - ReadWriteWeb
http://www.readwriteweb.com/feeds/tag/scraping
Copyright 2012 Richard MacManus
Tue, 14 Feb 2012 16:15:34 -0800

Is It Time For a Web Crawling Code of Conduct?

Earlier this week, The Wall Street Journal posted an article entitled "'Scrapers' Dig Deep for Data on Web". While the article highlights some important issues surrounding the murky and potentially shady business of Web crawling, it fails to provide a comprehensive story on the uses of Web crawling. In other words, by focusing on one or two companies with spotty business practices, it casts the entire practice of data collection from the Web as something to be feared.

Guest author Shion Deysarkar (@shiondev) is responsible for overall business development at 80legs. In a previous life, he founded and ran a predictive modeling firm. He enjoys playing poker and soccer, but is only good at one of them.

Why Web Crawling Is Good

There have certainly been cases where Web crawling has gone too far. The PatientsLikeMe.com case highlighted in the article is a great example. However, I would argue that there are far more cases where Web crawling and data collection from the Web have generated real value - not only for companies, but for individuals as well.

For instance, aggregate data from the Web helps companies learn what people think about their products. Companies that can listen better can meet the needs of their customers better. Another interesting use case is discovering and analyzing potential ad channels. Ad networks crawl millions of Web pages to find content relevant to their ad inventory. Crawling also allows companies like Infochimps and Factual to build better, more structured data sets covering anything from property data to sports data. Rather than having this data scattered around the Web, it's now centralized for easy consumption and analysis.

A Web Crawling Code of Conduct

Unfortunately, and somewhat understandably, it's easier to focus on the murky underbelly of Web crawling. People gravitate more to stories about organizations doing the wrong thing than to stories about companies just running their businesses the right way. 80legs and other companies involved in legitimate Web data collection need to make sure we are not grouped in with those other organizations.

I think a great first step toward this is establishing a "Web Crawling Code of Conduct". The rules and laws surrounding Web crawling have been hazy at best and show no signs of being clarified. This is not surprising, considering that law tends to play catch-up with technology. However, after some experience in this industry, I feel that the following two rules embody the minimum necessary guidelines for proper Web crawling:

1. Only publicly available sources may be crawled. This means bots cannot log into websites, unless explicitly allowed by the website.

2. Do not overwhelm a website with crawling requests. Crawling requests should not significantly increase the amount of bandwidth needed by the server.
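
As a rough illustration of how a crawler might try to honor these two rules, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `PoliteScheduler` class, the `is_publicly_available` helper, the 5-second default gap between requests to the same host, and the choice to treat HTTP 401 and 403 responses as "not public" are hypothetical and are not described in the article or taken from 80legs.

```python
import time
from urllib.parse import urlparse


def is_publicly_available(status_code):
    """Rule 1 (assumed heuristic): treat 401/403 responses as non-public pages."""
    return status_code not in (401, 403)


class PoliteScheduler:
    """Rule 2: enforce a minimum gap between requests to the same host."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        self.min_delay_seconds = min_delay_seconds  # hypothetical default
        self._clock = clock
        self._last_request = {}  # host -> time of the most recent request

    def seconds_until_allowed(self, url):
        """Return how long the crawler should still wait before hitting this host."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is None:
            return 0.0  # this host has not been contacted yet
        elapsed = self._clock() - last
        return max(0.0, self.min_delay_seconds - elapsed)

    def record_request(self, url):
        """Remember that a request to this URL's host is being made now."""
        self._last_request[urlparse(url).netloc] = self._clock()

    def wait_for_turn(self, url, sleep=time.sleep):
        """Sleep for any remaining delay, then mark the request as made."""
        delay = self.seconds_until_allowed(url)
        if delay > 0:
            sleep(delay)
        self.record_request(url)
```

A crawler would call `wait_for_turn(url)` before each fetch and skip any page for which `is_publicly_available` returns false. A fixed delay is only one way to limit load; the right value depends on the site being crawled.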

Some readers may feel I've left out certain aspects that should be included in proper Web crawling, such as following robots.txt and other practices. While I recognize the value that those practices have, my personal opinion is that Web data sources and Web data collectors should work together to maximize the value of Web data, and that some common practices hamper that unnecessarily. Further discussion is welcome and eagerly anticipated.

While we wait for proper regulations to distinguish socially aware crawling services that follow best practices from more dubious companies with other interests, perhaps we should move toward creating a formal, independent board. Such a board could certify, whether officially or unofficially, the crawling companies that adhere to such a code and operate legitimate services.

Photo by homyox

http://www.readwriteweb.com/archives/is_it_time_for_a_web_crawling_code_of_conduct.php
Security Fri, 15 Oct 2010 11:30:00 -0800 Guest Author

The Glory, Bliss and How-to of Screen Scraping for RSS

Wired has an awesome top story today on the world of startups utilizing scraped data from big companies to offer new layers of value for their own users. It's a roughly objective piece that I highly recommend reading but it was also inspiration for me to finally record a screencast on the subject (see below).

I love RSS, probably more than anything on the web. If you're not familiar with the concept, see my very old definition of RSS and my almost-as-old post on teaching people about RSS.

Not every page on the web publishes an RSS feed, though. Thus the need for these wonderful screen scraping tools. I've written about a variety of tools you can use to create a feed for a site or page that doesn't have one. Sometimes, though, you've got to pull out the big guns. In those cases, it's time for Dapper.

Dapper is a venture-backed company founded in Israel, and it was named in the aforementioned Wired article. It is the sweetness.

Dapper will let you pull data from almost any web page and get it in a wide variety of outputs, including RSS, email, iCal, a Google Gadget, CSV and Google Maps. Is that incredible or what?
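
To give a feel for the last step of that process, turning scraped data into a feed, here is a minimal Python sketch that builds an RSS 2.0 document from a list of items. It is not how Dapper works internally; the `build_rss` function, the dictionary keys, and the example.com URLs are all hypothetical, and the scraping step itself is assumed to have already produced the titles and links.

```python
import xml.etree.ElementTree as ET


def build_rss(feed_title, feed_link, feed_description, items):
    """Build a bare-bones RSS 2.0 document from already-scraped items.

    `items` is assumed to be a list of dicts with "title" and "link" keys.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link, and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_description
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    # ElementTree escapes characters such as "&" in text nodes for us.
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Placeholder data standing in for whatever a scraper extracted.
    scraped = [
        {"title": "First result", "link": "http://example.com/1"},
        {"title": "Second result", "link": "http://example.com/2"},
    ]
    print(build_rss("Example search results", "http://example.com/",
                    "Feed built from a page without one", scraped))
```

The output can be saved to a file or served over HTTP so that a feed reader can subscribe to it.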

Let's let the video do the talking. I have an awful cold (it's almost better, Mom!) so please excuse the very rough voice. I made the following screencast using JingProject, setting up an RSS feed of search results in Del.icio.us for articles tagged from ReadWriteWeb.

Clicking on the image below will open up another window so you can view the 4-minute video full screen.

If you're as excited about Dapper as I am, you should check out DapperCamp, a free two-day conference all about Dapper coming up in early February in San Francisco. IBM and Mindtouch are sponsoring the event and Mitch Kapor is keynoting it. It looks like it's going to be a lot of fun.

Take that, Wired Mag ambivalence! Really, though, you should read that Wired article - it's a good one that discusses some issues that are going to be very big once more people figure out how exciting data portability is.

http://www.readwriteweb.com/archives/screen-scraping.php
Mon, 31 Dec 2007 20:57:24 -0800 Marshall Kirkpatrick