ReadWriteWeb

Googlebot Crawls Through HTML Forms

Written by Josh Catone / April 11, 2008 3:14 PM / 12 Comments

Google will stop at nothing in its quest to index the world's information. Last year it ate through 100 exabytes of data, but there's still a lot that it can't get access to. This is the deep web (also called the hidden web or invisible web), and it is estimated that the majority of online data sits there, hidden safely from Google's prying eyes: private intranets, unlinked pages, some non-textual content, and, until today, dynamic content returned via form input were all inaccessible to the search engine. Google today announced that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.

"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made," explained Jayant Madhavan and Alon Halevy in a blog post. "If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page."
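The URL-generation step described in that quote can be illustrated with a rough sketch. This is not Google's code: the function name, the `limit` cap, and the example form are all assumptions made for illustration. It simply pairs candidate words with the values listed in the HTML and encodes each combination as a GET query string.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, text_inputs, choice_inputs, site_words, limit=10):
    """Build GET URLs for combinations of candidate form values.

    action        -- the form's action URL (hypothetical example below)
    text_inputs   -- names of free-text fields, filled with words from the site
    choice_inputs -- mapping of field name -> values listed in the HTML
    site_words    -- words assumed to have been picked from the site's pages
    limit         -- illustrative cap on the number of URLs generated
    """
    names = list(text_inputs) + list(choice_inputs)
    value_lists = [site_words for _ in text_inputs]
    value_lists += [choice_inputs[name] for name in choice_inputs]

    urls = []
    for combo in product(*value_lists):
        # Each combination becomes the query string a user's submission would produce.
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
        if len(urls) >= limit:
            break
    return urls


urls = candidate_urls(
    "http://example.com/search",
    text_inputs=["q"],
    choice_inputs={"category": ["books", "music"]},
    site_words=["jazz", "history"],
)
for url in urls:
    print(url)
```

A real crawler would also have to decide which of the resulting pages are "valid, interesting, and includes content not in our index," which this sketch does not attempt.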

Google, which says that the crawling of dynamic form results doesn't affect the "crawling, ranking, or selection of other web pages in any significant way," also assured webmasters today that its enhanced crawl would respect robots.txt as usual. Any form forbidden in robots.txt won't be crawled.
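For webmasters who want to check how a robots.txt rule would apply to a form's result URLs, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for illustration, and Googlebot's own matching logic may differ in its details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids crawling the search form's results.
robots_txt = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def may_crawl(url, agent="Googlebot"):
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(may_crawl("http://example.com/search?q=jazz"))  # form results are blocked
print(may_crawl("http://example.com/about"))          # other pages are allowed
```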

It is estimated that the deep web is several orders of magnitude larger than the regular, public world wide web. While there is some content that Google will never -- and should never -- get its hands on, by crawling form results Google is now peering just a little bit deeper into the Internet. As Matt Cutts points out, this is less about indexing search results (something Google has generally not liked to do) and more about finding new links that are only available via dynamically created pages.

It should be noted that Google is only crawling GET forms (i.e., forms used to retrieve dynamic content, such as search results) and not POST forms. That's mildly disappointing as we were looking forward to befriending Googlebot on MySpace...
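To see which forms on a page would even be candidates under this GET-only rule, a small sketch with Python's standard `html.parser` can list them. The class name and the sample markup are assumptions for illustration. One fact it relies on: in HTML, a form with no `method` attribute defaults to GET.

```python
from html.parser import HTMLParser


class GetFormFinder(HTMLParser):
    """Collect the action URLs of forms that submit via GET."""

    def __init__(self):
        super().__init__()
        self.get_actions = []

    def handle_starttag(self, tag, attrs):
        if tag != "form":
            return
        attributes = dict(attrs)
        # A missing method attribute means GET, per the HTML specification.
        method = (attributes.get("method") or "get").lower()
        if method == "get":
            self.get_actions.append(attributes.get("action") or "")


# Hypothetical page with a search form, a contact form, and a filter form.
page = """
<form action="/search"><input name="q"></form>
<form action="/contact" method="post"><textarea name="msg"></textarea></form>
<form action="/filter" method="GET"><select name="c"></select></form>
"""

finder = GetFormFinder()
finder.feed(page)
print(finder.get_actions)
```

The POST contact form is skipped, which is the behavior the article describes.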

Comments


  • So on every page of Yahoo!, Google will select words it found in the page and fill them in the search box? That should produce fun results...

    Posted by: Dan Grossman | April 11, 2008 4:42 PM


  • I wonder how they would let users reproduce the results had they crawled POST requests? Using forms? And how would they manage caching of those pages?

    Posted by: Kishor | April 11, 2008 4:59 PM


  • Submitting a second form to view a search result would probably be a bad thing. Regardless, if they started POSTing forms they'd probably break a few thousand sites overnight, and generate a couple thousand e-mails an hour submitting contact forms.

    Posted by: Dan Grossman | April 11, 2008 5:45 PM


  • does this mean everyone with a contact form is going to get spam from google?

    Posted by: diystartupnews.com | April 11, 2008 5:45 PM


  • No, because contact forms are submitted by POST, not GET.

    Posted by: Dan Grossman | April 11, 2008 6:52 PM


  • Apart from search pages, is there any other kind of web page that uses a GET form?

    In my understanding, GET is used to retrieve information without changing state on the server side, while POST is used to change state on the server side. This is a convention, not a requirement. If a web site inappropriately uses a GET form to change server-side state, Googlebot will cause problems for it.

    Posted by: Morgan Cheng | April 11, 2008 10:49 PM


  • We've a form up on the front page, give it a try... ;-)

    Posted by: 113.com | April 12, 2008 5:11 AM


  • If googlebot crawled POST forms, what would be the difference between a spambot and a googlebot? Both would do fake registration, sign-ups ...

    Posted by: Lance | April 12, 2008 4:30 PM


  • I really doubt Google will read CAPTCHAs :)
    So ... if you use a CAPTCHA ... it will stop Googlebot. But hey ... at least it will stay longer on your website :)

    Posted by: Sava | April 13, 2008 7:20 AM


  • I guess this may also be related to the fact that for GET forms, the form information is stored in the URL. Therefore a number of pages that google will index already will contain links to the results pages of html forms e.g.:

    http://www.google.co.uk/search?q=arsenal

    Posted by: fabregas | April 13, 2008 5:30 PM


  • ?!?

    Posted by: Bob | April 14, 2008 9:07 AM


  • For the average person, this change means basically nothing. I'd suspect your site would need at least a PR5 or 6 for google to think about indexing such content. And even then you need to be using GET, rather than POST. Most searches use the latter.

    Posted by: Brant Tedeschi | April 14, 2008 10:41 AM



