A freely accessible index of 5 billion web pages, their page rank, their link graphs and other metadata, hosted on Amazon EC2, was announced today by the Common Crawl Foundation. "It is crucial [in] our information-based society that Web crawl data be open and accessible to anyone who desires to utilize it," writes Foundation director Lisa Green on the organization's blog.
The Foundation is an organization dedicated to leveraging the falling costs of crawling and storage for the benefit of "individuals, academic groups, small start-ups, big companies, governments and nonprofits." It's led by Gilad Elbaz, the forefather of Google AdSense and the CEO of data platform startup Factual. Joining Elbaz on the Foundation board are internet public domain champion Carl Malamud and semantic web serial entrepreneur Nova Spivack. Director Lisa Green came to the Foundation by way of Creative Commons.
The collision of the wine websites CellarTracker and Snooth raises some interesting questions over data ownership. Snooth was accused of copying information from CellarTracker's user reviews, using an automated robot script crawling the site. While most commenters were outraged, it's not clear that there's any legal case against Snooth, even if it had crawled the data. As it turns out, the problem came from an outdated input feed, rather than its crawler, but the case highlights how many problems will arise as data flows and mixes on the Web.
Despite companies like Google making tens of billions of dollars from Web crawling, the rules governing so-called robots indexing the Web are surprisingly vague. As somebody who ran afoul of Facebook with my own crawler, I've taken a keen interest in other sites' attitudes to external access. There are some interesting stories buried in the robots.txt files that define their policies, so let me take you on a tour.
Earlier this week, The Wall Street Journal posted an article entitled "'Scrapers' Dig Deep for Data on Web". While the article highlights some important issues surrounding the murky and potentially shady business of Web crawling, it fails to provide a comprehensive story on the uses of Web crawling. In other words, by focusing on one or two companies with spotty business practices, it casts the entire practice of data collection from the Web as something to be feared.
Today, applications increasingly depend on a rich ecosystem of APIs. Thousands of different services are variously tethered together to form new software offerings and enhance existing ones. The idea of a programmable Web is finally coming true.
While this is not trivial, I am nonetheless beginning to question the long-term effects of an API-centric worldview, a sort of blind faith in the almighty API, which has at best a difficult relationship with open data and big data concepts.
Does Adobe think they can out-Google Google? Perhaps. The company is involved with Zoetrope, a joint project with researchers at the University of Washington. What they're building is a tool that allows for manipulating the web over time. Instead of the snapshot of the web you see today when googling, Zoetrope will let anyone use keyword searches to discover archived web information and look for patterns in the data found.
Google will stop at nothing in its quest to index the world's information. Last year it ate through 100 exabytes of data, but there's still a lot that it can't get access to. Known as the deep web (or hidden web, or invisible web, etc.), the majority of online data is estimated to be hidden safely from Google's prying eyes: private intranets, unlinked pages, some non-textual content and, until today, dynamic content returned via form input were all inaccessible to the search engine. Google today announced that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.