crawling - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/crawling en Copyright 2009 Richard MacManus readwriteweb@gmail.com Sun, 22 Nov 2009 12:00:55 -0800 http://www.sixapart.com/movabletype/?v=4.23-en http://blogs.law.harvard.edu/tech/rss

Zoetrope: New Web Crawler Allows For Searching, Analyzing The Ever-Changing Web

Does Adobe think it can out-Google Google? Perhaps. The company is involved with Zoetrope, a joint project with researchers at the University of Washington. What they're building is a tool that allows for manipulating the web over time. Instead of showing only the snapshot of the web you see today when googling, Zoetrope will let anyone use keyword searches to discover archived web information and look for patterns in the data.


About Zoetrope

As with the Internet Archive, the data in Zoetrope's database is a backup of the entire web, including those pages which have changed over time. But this archive won't be limited to the somewhat inconsistent periodic snapshots of the web's content that the Internet Archive offers. It will encompass everything.

Using the intuitive Zoetrope interface, a user could compare historical changes of various data through time by comparing snapshots of different pages on the web. Analyzing different, changing elements on web pages, side-by-side and over a period of time is downright difficult today - if not impossible. But Zoetrope makes it happen.

The process is done using Zoetrope "lenses" to draw boxes around elements, connect data from one site to another, and pull up charts of relevant data, all while manipulating a slider to scroll back and forth through time. That may sound hard, but if you watch this video, you'll see that it looks surprisingly easy.

For Everyone, Not Just The Computer Savvy

In a way, this project is similar to Google's new Visualization API, which lets developers use historical web data to build charts, graphs, gadgets, and the like. However, where Google's tool is aimed at the technically savvy programmer, Zoetrope is for the average user. Says Dan Weld, a UW computer science and engineering professor who worked on the project, "Zoetrope is aimed at the casual researcher. It's really for anyone who has a question."

As noted in the University of Washington article on the project, example uses of Zoetrope could range from the basic (checking historical rankings of favorite players on a sports team) to the advanced (comparing daily air pollution levels in Beijing to the number of world records broken each day at the 2008 Olympics).

"Your browser is really just a window into the Web as it exists today," said Eytan Adar, a University of Washington computer science and engineering doctoral student and a co-author of the research paper on the project.

"When you search for something online, you're only getting today's results... This is really a new way to think about storing information on the Web."

The researchers hope to offer Zoetrope for free as early as next summer.

Image credits: Color, Torley; Others, University of Washington

http://www.readwriteweb.com/archives/zoetrope_new_web_crawler_searches_analyzes_ever_changing_web.php Products Fri, 21 Nov 2008 07:47:01 -0800 Sarah Perez
Googlebot Crawls Through HTML Forms

Google will stop at nothing in its quest to index the world's information. Last year it ate through 100 exabytes of data, but there's still a lot that it can't get access to. It is estimated that the majority of online data, known as the deep web (or hidden web, or invisible web), is hidden safely from Google's prying eyes. Private intranets, unlinked pages, some non-textual content, and, until today, dynamic content returned via form input were all inaccessible to the search engine. Google today announced that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.


"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made," explained Jayant Madhavan and Alon Halevy in a blog post. "If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page."
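
The URL-generation step described in the quote can be sketched roughly as follows. This is only an illustration, not Google's implementation: the function name `candidate_urls`, the `limit` cap, the example.com addresses, and the sample field values are all assumptions made for the example, and the sketch assumes the form's action has no query string of its own.

```python
from itertools import product
from urllib.parse import urlencode, urljoin


def candidate_urls(page_url, action, fields, limit=10):
    """Build GET URLs for combinations of candidate field values.

    `fields` maps an input name to the values to try. For text boxes
    these would be words taken from the site; for select menus, check
    boxes, and radio buttons, the values listed in the HTML.
    `limit` is a hypothetical cap on how many URLs to generate.
    """
    base = urljoin(page_url, action)  # resolve a relative form action
    names = list(fields)
    urls = []
    for combo in product(*(fields[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(base + "?" + query)
        if len(urls) >= limit:
            break
    return urls


# Hypothetical form with one text box ("q") and one select menu ("category").
urls = candidate_urls(
    "http://example.com/search.html",
    "/results",
    {"q": ["crawler", "archive"], "category": ["news", "blogs"]},
)
for url in urls:
    print(url)
```

Each generated URL corresponds to one query a user could have submitted through the form. Deciding whether the resulting page is "valid, interesting, and includes content not in our index" is a separate step that this sketch does not attempt.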

Google, which says that the crawling of dynamic form results doesn't affect the "crawling, ranking, or selection of other web pages in any significant way," also assured webmasters today that their enhanced crawl would respect robots.txt as usual. Any form forbidden in robots.txt won't be crawled.
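
For webmasters who want to see how such a rule applies to form result URLs, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the example.com URLs, and the helper name `allowed_urls` are made up for illustration; how Googlebot itself evaluates robots.txt is not described in the announcement.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids the site's search form results.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def allowed_urls(candidates, robots_txt, user_agent="Googlebot"):
    """Return only the candidate URLs that robots.txt permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in candidates if parser.can_fetch(user_agent, url)]


candidates = [
    "http://example.com/search?q=crawler",       # under /search, disallowed
    "http://example.com/catalog?category=news",  # not covered by any rule
]
permitted = allowed_urls(candidates, ROBOTS_TXT)
print(permitted)
```

In this example only the `/catalog` URL survives, because the `Disallow: /search` rule covers every URL generated from the search form.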

It is estimated that the deep web is several orders of magnitude larger than the regular, public world wide web. While there is some content that Google will never -- and should never -- get its hands on, by crawling form results Google is now peering just a little bit deeper into the Internet. As Matt Cutts points out, this is less about indexing search results (something Google has generally not liked to do) and more about finding new links that are only available via dynamically created pages.

It should be noted that Google is only crawling GET forms (i.e., forms used to retrieve dynamic content, such as search results) and not POST forms. That's mildly disappointing as we were looking forward to befriending Googlebot on MySpace...
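
To make the GET/POST distinction concrete, the following sketch picks out the GET forms on a page using Python's standard `html.parser`. The class name `GetFormFinder` and the sample markup are assumptions for illustration only. A form with no `method` attribute is treated as GET, which is the HTML default.

```python
from html.parser import HTMLParser


class GetFormFinder(HTMLParser):
    """Collect the action of every form that submits via GET."""

    def __init__(self):
        super().__init__()
        self.get_actions = []

    def handle_starttag(self, tag, attrs):
        if tag != "form":
            return
        attributes = dict(attrs)
        # A missing or empty method attribute means GET in HTML.
        method = (attributes.get("method") or "get").lower()
        if method == "get":
            self.get_actions.append(attributes.get("action") or "")


# Hypothetical page with one search form (GET) and one login form (POST).
finder = GetFormFinder()
finder.feed(
    '<form action="/search"><input name="q"></form>'
    '<form action="/login" method="POST"><input name="user"></form>'
)
print(finder.get_actions)
```

Only the search form is collected; the login form is skipped because it submits via POST.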

http://www.readwriteweb.com/archives/google_crawling_html_forms.php Google Fri, 11 Apr 2008 15:14:43 -0800 Josh Catone