<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" 
      xmlns:thr="http://purl.org/syndication/thread/1.0">
  <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php" />
  <link rel="self" type="application/atom+xml" href="http://www.readwriteweb.com/atom.xml" />
  <id>tag:,2009:/1/tag:www.readwriteweb.com,2008://1.6088-</id>
  <updated>2009-11-23T19:11:37Z</updated>
  <title>Comments for Googlebot Crawls Through HTML Forms</title>
  
  <generator uri="http://www.sixapart.com/movabletype/">Movable Type 4.23-en</generator>
  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088</id>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.readwriteweb.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=6088" title="Googlebot Crawls Through HTML Forms" />
    <published>2008-04-11T22:14:43Z</published>
    <updated>2008-04-11T22:16:06Z</updated>
    <title>Googlebot Crawls Through HTML Forms</title>
    <summary>Google will stop at nothing in its quest to index the world&apos;s information. Last year it ate through 100 exabytes of data, but there&apos;s still a lot that it can&apos;t get access to. Known as the deep web (or hidden web, or invisible web, etc.), it is estimated that the majority of online data is...</summary>
    <author>
      <name>Josh Catone</name>
      <uri>http://www.readwriteweb.com/</uri>
    </author>
    
    <category term="Google" />
    
    <content type="html" xml:lang="en" xml:base="http://www.readwriteweb.com/">
      <![CDATA[<p><img src="http://www.readwriteweb.com/images/google-logo.jpg" vspace="5" hspace="5" height="55" width="155" />Google will stop at nothing in its quest to index the world's information.  Last year it ate through 100 exabytes of data, but there's still a lot that it can't get access to.  Known as the <a href="http://en.wikipedia.org/wiki/Deep_Web">deep web</a> (or hidden web, or invisible web, etc.), it is estimated that the majority of online data is hidden safely from Google's prying eyes: private intranets, unlinked pages, some non-textual content, and, until today, dynamic content returned via form input were all inaccessible to the search engine.  Google <a href="http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html">today announced</a> that its Googlebot web crawler would begin to fill out HTML forms and crawl the results.</p>]]>
      <![CDATA[<p>"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made," explained Jayant Madhavan and Alon Halevy in a blog post. "If we ascertain that the web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page."</p>
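
<p>To make the idea concrete, here is a minimal Python sketch of that URL-generation step. It is only an illustration and not Google's actual code: the function name <code>candidate_urls</code>, the <code>limit</code> parameter, the example.com address, and the field names and values are all made up for the example.</p>

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, fields, limit=10):
    """Build GET URLs for combinations of candidate field values.

    `fields` maps each input name to a list of candidate values
    (hypothetical: words sampled from the site for text boxes,
    declared option values for selects, check boxes, and radio buttons).
    """
    names = list(fields)
    urls = []
    # Try every combination of one value per field, up to `limit` URLs.
    for combo in product(*(fields[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
        if len(urls) >= limit:
            break
    return urls


# Hypothetical form: one text box ("q") and one select menu ("category").
urls = candidate_urls(
    "http://example.com/search",
    {"q": ["crawler", "deep web"], "category": ["news", "blogs"]},
)
for url in urls:
    print(url)
```

<p>Each printed URL is what a browser would request if a user submitted the GET form with those values, which is why the results can be crawled like any other link.</p>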

<p>Google, which says that the crawling of dynamic form results doesn't affect the "crawling, ranking, or selection of other web pages in any significant way," also assured webmasters today that its enhanced crawl would respect robots.txt as usual. Any form forbidden in robots.txt won't be crawled.</p>
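
<p>A polite crawler could run that check with Python's standard <code>urllib.robotparser</code> module before requesting a generated form URL. This is a sketch under assumptions, not a description of Googlebot's internals: the robots.txt rules, the <code>allowed</code> helper, and the example.com URLs are hypothetical.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers away from the search form.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def allowed(url, user_agent="Googlebot"):
    """Return True if the rules in ROBOTS_TXT permit `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/search?q=crawler"))
print(allowed("http://example.com/about.html"))
```

<p>With these rules the form's results page is off limits while the rest of the site stays crawlable, so a webmaster who doesn't want form results indexed only needs to disallow the form's action path.</p>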

<p>It is estimated that the deep web is several orders of magnitude larger than the regular, public World Wide Web. While there is some content that Google will never -- and should never -- get its hands on, by crawling form results Google is now peering just a little bit deeper into the Internet.  As <a href="http://www.mattcutts.com/blog/solved-another-common-site-review-problem/">Matt Cutts points out</a>, this is less about indexing search results (something Google has generally not liked to do) and more about finding new links that are only available via dynamically created pages.</p>

<p>It should be noted that Google is only crawling GET forms (i.e., forms used to retrieve dynamic content, such as search results) and not POST forms.  That's mildly disappointing as we were looking forward to befriending Googlebot on MySpace...</p> ]]>
    </content>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51752</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51752" />
    <title>Comment from Dan Grossman on 2008-04-11</title>
    <author>
        <name>Dan Grossman</name>
        <uri>http://www.dangrossman.info</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://www.dangrossman.info">
        <![CDATA[<p>So on every page of Yahoo!, Google will select words it found in the page and fill them in the search box? That should produce fun results...</p>]]>
    </content>
    <published>2008-04-11T23:42:48Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51753</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51753" />
    <title>Comment from Kishor on 2008-04-11</title>
    <author>
        <name>Kishor</name>
        <uri>http://yuktya.blogspot.com/</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://yuktya.blogspot.com/">
        <![CDATA[<p>I wonder how they would let users reproduce the results had they crawled POST requests? Using forms? And how would they manage caching of those pages?</p>]]>
    </content>
    <published>2008-04-11T23:59:08Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51758</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51758" />
    <title>Comment from Dan Grossman on 2008-04-11</title>
    <author>
        <name>Dan Grossman</name>
        <uri>http://www.dangrossman.info</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://www.dangrossman.info">
        <![CDATA[<p>Submitting a second form to view a search result would probably be a bad thing. Regardless, if they started POSTing forms they'd probably break a few thousand sites overnight, and generate a couple thousand e-mails an hour submitting contact forms.</p>]]>
    </content>
    <published>2008-04-12T00:45:07Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51759</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51759" />
    <title>Comment from diystartupnews.com on 2008-04-11</title>
    <author>
        <name>diystartupnews.com</name>
        <uri>http://diystartupnews.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://diystartupnews.com">
        <![CDATA[<p>Does this mean everyone with a contact form is going to get spam from Google?</p>]]>
    </content>
    <published>2008-04-12T00:45:33Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51762</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51762" />
    <title>Comment from Dan Grossman on 2008-04-11</title>
    <author>
        <name>Dan Grossman</name>
        <uri>http://www.dangrossman.info</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://www.dangrossman.info">
        <![CDATA[<p>No, because contact forms are submitted by POST, not GET.</p>]]>
    </content>
    <published>2008-04-12T01:52:27Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51772</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51772" />
    <title>Comment from Morgan Cheng on 2008-04-11</title>
    <author>
        <name>Morgan Cheng</name>
        <uri>http://morganchengmo.spaces.live.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://morganchengmo.spaces.live.com">
        <![CDATA[<p>Other than search pages, is there any other kind of web page that uses a GET form?</p>

<p>In my understanding, GET is used to retrieve information without changing state on the server side, and POST is used to change state on the server side. This is a convention, not a requirement. If a web site inappropriately uses a GET form to change server-side state, Googlebot will cause problems for it.</p>]]>
    </content>
    <published>2008-04-12T05:49:39Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51782</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51782" />
    <title>Comment from 113.com on 2008-04-12</title>
    <author>
        <name>113.com</name>
        <uri>http://113.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://113.com">
        <![CDATA[<p>We've a form up on the front page, give it a try... ;-)</p>]]>
    </content>
    <published>2008-04-12T12:11:10Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51817</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51817" />
    <title>Comment from Lance on 2008-04-12</title>
    <author>
        <name>Lance</name>
        <uri>http://www.lancevance.org</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://www.lancevance.org">
        <![CDATA[<p>If Googlebot crawled POST forms, what would be the difference between a spambot and Googlebot? Both would do fake registrations, sign-ups ...</p>]]>
    </content>
    <published>2008-04-12T23:30:09Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51830</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51830" />
    <title>Comment from Sava on 2008-04-13</title>
    <author>
        <name>Sava</name>
        <uri>http://savasplace.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://savasplace.com">
        <![CDATA[<p>I really doubt that Google will read CAPTCHAs :)<br />
So ... if you use a CAPTCHA ... it will stop Googlebot. But hey ... at least it will stay longer on your website :)</p>]]>
    </content>
    <published>2008-04-13T14:20:25Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51853</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51853" />
    <title>Comment from fabregas on 2008-04-13</title>
    <author>
        <name>fabregas</name>
        <uri>http://repeaterstore.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://repeaterstore.com">
        <![CDATA[<p>I guess this may also be related to the fact that for GET forms, the form information is stored in the URL.  Therefore a number of pages that Google already indexes will contain links to the results pages of HTML forms, e.g.:</p>

<p><a href="http://www.google.co.uk/search?q=arsenal" rel="nofollow">http://www.google.co.uk/search?q=arsenal</a></p>]]>
    </content>
    <published>2008-04-14T00:30:44Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51924</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51924" />
    <title>Comment from Bob on 2008-04-14</title>
    <author>
        <name>Bob</name>
        <uri>http://www.bobby.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://www.bobby.com">
        <![CDATA[<p>?!?</p>]]>
    </content>
    <published>2008-04-14T16:07:48Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2008://1.6088-comment:51933</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2008://1.6088" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_crawling_html_forms.php#c51933" />
    <title>Comment from Brant Tedeschi on 2008-04-14</title>
    <author>
        <name>Brant Tedeschi</name>
        <uri>http://www.fantasy-news.net</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://www.fantasy-news.net">
        <![CDATA[<p>For the average person, this change means basically nothing.  I'd suspect your site would need at least a PR5 or 6 for Google to think about indexing such content.  And even then you need to be using GET, rather than POST.  Most searches use the latter.</p>]]>
    </content>
    <published>2008-04-14T17:41:40Z</published>
  </entry>

</feed>