crawling - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/crawling en Copyright 2012 Richard MacManus readwriteweb@gmail.com Tue, 14 Feb 2012 16:29:00 -0800 http://www.sixapart.com/movabletype/?v=4.35-en http://blogs.law.harvard.edu/tech/rss

New 5 Billion Page Web Index with Page Rank Now Available for Free from Common Crawl Foundation

A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. "It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it," writes Foundation director Lisa Green on the organization's blog.

The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of "individuals, academic groups, small start-ups, big companies, governments and nonprofits." It's led by Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.

The Foundation explains the scope of the project thus:

"Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.

"Luckily for us, Amazon's EC2/S3 cloud computing infrastructure provides us with both a theoretically unlimited storage capacity coupled with localized access to an elastic compute cloud."

The organization was formed three years ago but has only now started talking about itself publicly. It believes that free access to all this information could lead to "a new wave of innovation, education and research."

Open Web Advocate James Walker agrees: "An openly accessible archive of the web - that's not owned and controlled by Google - levels the playing field pretty significantly for research and innovation."

http://www.readwriteweb.com/archives/common_crawl_foundation_announces_5_billion_page_w.php Data Services Mon, 07 Nov 2011 15:42:48 -0800 Marshall Kirkpatrick
Is It Time For a Web Crawling Code of Conduct?

Earlier this week, The Wall Street Journal posted an article entitled "'Scrapers' Dig Deep for Data on Web". While the article highlights some important issues surrounding the murky and potentially shady business of Web crawling, it fails to provide a comprehensive story on the uses of Web crawling. In other words, by focusing on one or two companies with spotty business practices, it casts the entire practice of data collection from the Web as something to be feared.

Guest author Shion Deysarkar (@shiondev) is responsible for overall business development at 80legs. In a previous life, he founded and ran a predictive modeling firm. He enjoys playing poker and soccer, but is only good at one of them.

Why Web Crawling Is Good

There have certainly been cases where Web crawling has gone too far. The PatientsLikeMe.com case highlighted in the article is a great example. However, I would argue that there are far more cases where Web crawling and data collection from the Web has generated real value - not only for companies, but for individuals as well.

For instance, aggregate data from the Web helps companies learn what people think about their products. Companies that can listen better can meet the needs of their customers better. Another interesting use-case is discovering and analyzing potential ad channels. Ad networks crawl millions of Web pages to find content relevant to their ad inventory. Crawling also allows companies like Infochimps and Factual to build better, more structured data sets with anything from property data to sports data. Rather than having this data scattered around the Web, it's now centralized for easy consumption and analysis.

A Web Crawling Code of Conduct

Unfortunately, and somewhat understandably, it's easier to focus on the murky underbelly of Web crawling. People gravitate more to stories about organizations doing the wrong thing than stories about companies just running their businesses the right way. 80legs and other companies involved in legitimate Web data collection need to make sure we are not grouped in with the other organizations.

I think a great first step toward this is establishing a "Web Crawling Code of Conduct". The rules and laws surrounding Web crawling have been hazy at best and show no signs of being clarified. This is not surprising, considering that law tends to play catch-up with technology. However, after some experience in this industry, I feel that the following two rules embody the minimum necessary guidelines for proper Web crawling:

1. Only publicly-available sources may be crawled. This means bots cannot log into websites, unless explicitly allowed by the website.

2. Do not overwhelm a website with crawling requests. Crawling requests should not significantly increase the amount of bandwidth needed by the server.
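As a rough illustration of the second rule, a crawler can keep a minimum gap between requests to the same host. The sketch below is only one way to do that; the class name `PoliteScheduler`, the one-second default delay, and the example URLs are assumptions made for illustration, not anything prescribed by 80legs or by the rules above.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks the last request time per host and enforces a minimum gap."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic):
        # The default delay is an illustrative assumption, not a standard.
        self.min_delay_seconds = min_delay_seconds
        self.clock = clock
        self._last_request = {}  # host -> time of the most recent request

    def wait_time(self, url):
        """Return how many seconds to wait before requesting this URL."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is None:
            return 0.0  # first request to this host, no wait needed
        elapsed = self.clock() - last
        return max(0.0, self.min_delay_seconds - elapsed)

    def record_request(self, url):
        """Note that a request to this URL's host was just sent."""
        self._last_request[urlparse(url).netloc] = self.clock()
```

A crawl loop would call `wait_time(url)`, sleep for that long, fetch the page, and then call `record_request(url)`. Tracking the delay per host means a slow, polite pace on one site does not hold up requests to others.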

Some readers may feel I've left out certain aspects that should be included in proper Web crawling, such as following robots.txt and other practices. While I recognize the value that those practices have, my personal opinion is that Web data sources and Web data collectors should work together to maximize the value of Web data, and that some common practices hamper that unnecessarily. Further discussion is welcome and eagerly anticipated.

Perhaps while we wait for proper regulations to help distinguish those socially aware crawling services acting with best practices in mind from the more dubious companies with other interests, we should move toward creating a more formal, independent board that can certify, whether officially or unofficially, those crawling companies adhering to such a code and operating legitimate services.

Photo by homyox

http://www.readwriteweb.com/archives/is_it_time_for_a_web_crawling_code_of_conduct.php Security Fri, 15 Oct 2010 11:30:00 -0800 Guest Author
"The Almighty API," Crawling and The Programmable Web

Today, applications increasingly depend on a rich ecosystem of APIs. Thousands of different services are variously tethered together to form new software offerings and enhance existing ones. The idea of a programmable Web is finally coming true.

While this is not trivial, I am nonetheless beginning to question the long-term effects of an API-centric worldview, a sort of blind faith in the almighty API, which has at best a difficult relationship with open data and big data concepts.

As CEO, guest author Shion Deysarkar is responsible for the overall business and development of 80legs. In a previous life, he ran a predictive modeling firm. He enjoys playing poker and soccer, but is only good at one of them.

How Do We Access Data Today?

There are two core ways to access data today – via a publisher or via a crawler. Each has a different role.

Publishers have data and choose to make it publicly available through an API so that developers can easily design products powered by a given service.

Crawlers, on the other hand, let you proactively go out and grab data yourself - scraping Web pages for whatever it is you're looking for, data that can then be used to build products and inform better product and marketing decisions.

There is something of a third option, as well: data aggregators like Factual and Infochimps and Hoovers. I'm not going to treat them as part of this post because they gain access to data like the rest of us – via APIs and/or crawlers. They facilitate the distribution of that data as part of their core business (most often using a marketplace concept or subscription), but the input mechanisms are no different.

And there is potentially even a fourth option – human curation of the kind that Factual and WolframAlpha and CrowdFlower employ to acquire new data altogether. But all of these providers offer API access to their data, so I'm still going to bucket them as such.

APIs, at least as we think of them today, have many disadvantages. And before you grab your shovels and organize a mob to come after me, please understand that I'm not calling for the discontinuation of APIs.

At 80legs, we ourselves offer a popular API, which takes a particularly hybrid approach – providing programmatic access to the data acquired via crawls.

What I really want is a natural stratification based on who is good at what, essentially. Right now, we're asking APIs to do too much.

APIs are great for the real-time web, for example – they're great for staying up to speed, whether that means trending search data or retweet velocity. APIs are great for enhancing functionality – whether that's a Klout score or geolocation. APIs are great for integrating certain pieces of non-strategic infrastructure like invite codes (Prefinery) if you're a startup in beta, or Freshbooks, if you're an accountant. They're also great for app-level integrations, like adding Facebook accounts to Tweetdeck, or sucking down content from Netflix.

But at a higher level, as all applications and services become more and more data-driven, it's important to understand the differences between these different methods for extracting data, regardless of where you net out philosophically.

This is a discussion that needs to take place, but too few people recognize the distinctions.

Control, Control, Control

Control and flexibility are the two most important elements to look at when it comes to the difference between an API and a crawl. I also spend some time at the end of this post talking about security and privacy, because I think there are big impacts in these areas for APIs and crawlers alike.

Cost might be a fourth facet to look at, but that's grounds for a different post because pricing varies so widely.

Let's start with control.

When using an API, publishers – companies like Amazon and LinkedIn, for example – control the entire process. Publishers provide you with an API account, which allows you a certain amount of calls, or requests for data per day. They also determine what kinds of content are made available, and in what context.

Publishers offer an API for many reasons. It's financially in their best interest to have products built on top of their data, which increases developer loyalty and creates a kind of dependency on their content. It's also useful as a way to accurately measure server usage and overall engagement, even if there's no money involved.

APIs can go down and become unavailable, they can go from free to paid, and their publishers can be acquired by larger companies that make all manner of changes. There's a lot of uncertainty in APIs, and many devs have learned this fact the hard way. Think back to Gnip rethinking their entire business model due to the relicensing of certain APIs.

But like moths, we so often head right back to the flame.

Crawlers act very differently. They allow much more control over the data acquisition process. This has many advantages.

For starters, the format in which content is delivered can be a lifesaver if formatted properly, or prompt hours of additional work if not. APIs supply content in one format – the format chosen by the publisher. Say you need an XML file but the company only delivers JSON through their API. You're either stuck or left spending hours re-formatting.
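To make the re-formatting step concrete, here is a minimal sketch of turning a JSON response into XML using only Python's standard library. It assumes a flat JSON object whose keys happen to be valid XML tag names; the function name `json_to_xml`, the `record` root tag, and the sample fields are hypothetical. Real API responses are usually nested and messier, which is where the hours go.

```python
import json
import xml.etree.ElementTree as ET


def json_to_xml(json_text, root_tag="record"):
    """Convert a flat JSON object into a simple XML string.

    Assumes every key is a valid XML tag name and every value is a scalar.
    """
    data = json.loads(json_text)
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


print(json_to_xml('{"name": "80legs", "pages": 5}'))
```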

Crawlers let the choice-driven developer have his cake and eat it too. Formats are just another choice to make beforehand, instead of a hindrance.

Granted, standardization can be great in some cases – for example with sites like MySpace where each profile is customized and therefore rendered in HTML differently. MySpace APIs format the content to make it uniform, meaning that what was once difficult to work with as a developer (i.e. large discrepancies in the data), is now standardized and simple to use.

But the "one size fits all" mentality fails more often than you might think, especially once you step outside of the Web's largest sites – one size fits all rarely fits anyone well.

And it's not just format – crawling offers much more control when it comes to time and timing, scope, and cost, too.


Flexibility and Availability

Data access choices are an important component of building any Web product, especially when it comes to flexibility and availability. Specs change, needs change - heck, markets change. Especially if you're a lean startup, releasing early and iterating often is a way of life.

APIs only deliver content from the publisher's site. You're locked into a single interface's content sources and structure, without flexibility by definition, which can be very limiting. You're left with acquiring stand-alone datasets to supplement your evolving needs, or mashing up with another API to fill in holes.

Now, the very best API providers are great at adapting to developers' needs and evolving alongside them. Companies like Yolink, whose API is its bread and butter, are particularly responsive. But too often an API is left unattended, having been a mere box to check, instead of a strategic commitment.

Unimaginative APIs can also limit use cases unwittingly, because some of the furthest-flung (if more promising) applications just aren't supported in the calls or code. There's a huge difference between an API that wants to be heightened and explored and an API whose scope, if anything, constrains original thinking.

Crawlers on the other hand aren't specific to any one site's data, meaning that they can access content from any number of sources and compile it in one place, mixing and matching, comparing and contrasting to your heart's delight.

Crawls can be more open-ended and investigative as well, whereas an API is more about putting a square peg in a square hole. APIs also don't offer competitive advantage; everyone has access to the same stuff. A clever crawl can help build a moat.

Finally, crawlers can reach far beyond the capabilities of an API. Millions of pieces of data are publicly available on the Web, and only a very small percentage of it is available via an API. At a certain point it's purely an issue of volume. Much of the Web is instantly crawl-able, and the amount of data available freely on the Web is growing more quickly than the number of APIs by an order of magnitude. The caveat is that you just have to know where to look.

The Elephant in the Room - Security and Privacy

Let's talk about privacy and data, because how the world evolves in this respect could have huge implications for APIs and crawlers alike.

As the recent Facebook data privacy concerns highlight, the security of people's data is a high priority, regardless of how it may or may not be acquired or sold. Further, users expect publishers to protect their data aggressively (whether they do is another matter).

This is a PR and perception issue as much as it is anything else.

Users worry that their data might get into the hands of people who will use it for malicious purposes, whether via an API or a crawler. I would argue that this is not always the case because responsible crawling companies, at least, have strict licensing agreements with their clients to ensure data is used lawfully.

But the reality is that publishers are increasingly incentivized because of public policy issues to constrain API access. And the world's biggest crawler, Google, is starting to look evil, with the ominous question, "What exactly does Google know about me?" popping up at family dinners around the country.

Some are even arguing that Facebook is bound to be federally regulated sooner or later because of its profligacy when it comes to data, and that would certainly have broad impacts.

APIs are not inherently more or less secure than crawlers, but in the current climate, especially with regards to privacy, we can expect companies large and small to make less and less data open and available (something that the linked data community has been ruing as well).

Security right now is a big X factor that is going to take some time to play out.

The nice thing about crawlers (depending on your perspective) is that they are harder to control, at least for now. But it is a reasonable thing to say that data responsibility and privacy issues are going to shape and reshape this conversation big time.

Today's Web is full of data that, if kept within an API-driven paradigm, suffers from less creative use, less flexibility, and less control (from a developer standpoint).

An endlessly crawl-able Web was in many ways what Tim Berners-Lee and the W3C intended for the Web all along. Content creators like publishers and social networks can create sites as they've always done, while data aggregators can access data in whatever format they like.

In fact, in an older but still applicable interview, Berners-Lee talks about why an open, linked data Web is by far preferable to APIs for data access.

There is a foundational, DNA-level need to share data. Without openness, you lose the full value and impede any future innovation in the process. APIs absolutely have benefits – but only when we are not beholden to them – when we can use them rationally, strategically, and carefully. And when data isn't at the crux of your site, service or application.

"We have an open API" is an overused phrase, especially as APIs are by no definition open or closed.

If you need certain attributes, like real-time or speed, certain capabilities, or certain pieces of infrastructure, there are thousands of amazing APIs out there. But if your business runs on data, crawling is the only way to go.

Top photo by Jason V. Tim Berners-Lee photo by the Knight Foundation

http://www.readwriteweb.com/archives/the_almighty_api_crawling_and_the_programmable_web.php Structured Data Wed, 04 Aug 2010 10:00:00 -0800 Guest Author
Zoetrope: New Web Crawler Allows For Searching, Analyzing The Ever-Changing Web

Does Adobe think they can out-Google Google? Perhaps. The company is involved with Zoetrope, a joint project with researchers at the University of Washington. What they're building is a tool that allows for manipulating the web over time. Instead of the snapshot of the web you see today when googling, Zoetrope will let anyone use keyword searches to discover archived web information and look for patterns in the data found.

About Zoetrope

As with the Internet Archive, the data in Zoetrope's database is a backup of the entire web, including those pages which have changed over time. But this archive won't be limited to the somewhat inconsistent periodic snapshots of the web's content like the Internet Archive offers. It will encompass everything.

Using the intuitive Zoetrope interface, a user could compare historical changes of various data through time by comparing snapshots of different pages on the web. Analyzing different, changing elements on web pages, side-by-side and over a period of time is downright difficult today - if not impossible. But Zoetrope makes it happen.

The process is done using Zoetrope "lenses" to draw boxes around elements, connect data from one site to another, and pull up charts of relevant data, all while manipulating a slider to scroll back and forth through time. That may sound hard, but if you watch this video, you'll see that it looks surprisingly easy.

For Everyone, Not Just The Computer Savvy

In a way, this project is similar to Google's new visualization API, which lets developers use historical web data to build charts, graphs, gadgets, and the like. However, where Google's tool is aimed at the technically savvy programmer, Zoetrope is for the average user. Says Dan Weld, a UW computer science and engineering professor who worked on the project, "Zoetrope is aimed at the casual researcher. It's really for anyone who has a question."

As noted in the University of Washington article on the project, example uses of Zoetrope could range from the basic, such as checking historical rankings of favorite players on a sports team, to the advanced, such as comparing daily air pollution levels in Beijing to the number of world records broken each day in the 2008 Olympics.

"Your browser is really just a window into the Web as it exists today," said Eytan Adar, University of Washington computer science and engineering doctoral student who's also a co-author of the research paper on the project.

"When you search for something online, you're only getting today's results...This is really a new way to think about storing information on the Web."

The researchers hope to offer Zoetrope for free as early as next summer.

Image credits: Color, Torley; Others, University of Washington

http://www.readwriteweb.com/archives/zoetrope_new_web_crawler_searches_analyzes_ever_changing_web.php Product Reviews Fri, 21 Nov 2008 07:47:01 -0800 Sarah Perez
Googlebot Crawls Through HTML Forms

Google will stop at nothing in its quest to index the world's information. Last year it ate through 100 exabytes of data, but there's still a lot that it can't get access to. Known as the deep web (or hidden web, or invisible web, etc.), the majority of online data is estimated to be hidden safely from Google's prying eyes -- private intranets, unlinked pages, some non-textual content and, until today, dynamic content returned via form input were all inaccessible to the search engine. Google today announced that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.

"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made," explained Jayant Madhavan and Alon Halevy in a blog post. "If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page."
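The URL-generation step described in the quote can be pictured with a small sketch. This is not Google's implementation, whose details are not public; it simply enumerates every combination of candidate values for a GET form and builds the matching query URLs. The function name `candidate_form_urls`, the form fields, and the example.com address are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_form_urls(action_url, select_options, text_candidates):
    """Yield one GET URL per combination of candidate input values.

    select_options: field name -> values listed in the form's HTML
    text_candidates: field name -> words chosen from the site's own text
    """
    fields = {**select_options, **text_candidates}
    names = list(fields)
    for combo in product(*(fields[name] for name in names)):
        # urlencode handles escaping of names and values
        yield action_url + "?" + urlencode(list(zip(names, combo)))


for url in candidate_form_urls(
    "http://example.com/search",
    {"category": ["books", "music"]},
    {"q": ["crawler"]},
):
    print(url)
```

The number of combinations grows quickly as fields are added, so a real crawler would need some way to prune them, which fits the quote's point that only results judged valid, interesting, and new are considered for the index.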

Google, which says that the crawling of dynamic form results doesn't affect the "crawling, ranking, or selection of other web pages in any significant way," also assured webmasters today that their enhanced crawl would respect robots.txt as usual. Any form forbidden in robots.txt won't be crawled.
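For readers unfamiliar with how a crawler honors robots.txt, the check can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent, and the URLs below are made up for illustration; they are not taken from Google's announcement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids crawling a site's search form results
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/search?q=crawler"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/about"))
```

Under these rules the form-result URL is refused while the ordinary page is permitted, so a webmaster who wants form results left alone only needs to disallow the form's action path.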

It is estimated that the deep web is several orders of magnitude larger than the regular, public world wide web. While there is some content that Google will never -- and should never -- get its hands on, by crawling form results Google is now peering just a little bit deeper into the Internet. As Matt Cutts points out, this is less about indexing search results (something Google has generally not liked to do) and more about finding new links that are only available via dynamically created pages.

It should be noted that Google is only crawling GET forms (i.e., forms used to retrieve dynamic content, such as search results) and not POST forms. That's mildly disappointing as we were looking forward to befriending Googlebot on MySpace...

http://www.readwriteweb.com/archives/google_crawling_html_forms.php Google Fri, 11 Apr 2008 15:14:43 -0800 Josh Catone