<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" 
      xmlns:thr="http://purl.org/syndication/thread/1.0">
  <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php" />
  <link rel="self" type="application/atom+xml" href="http://www.readwriteweb.com/atom.xml" />
  <id>tag:,2009:/1/tag:www.readwriteweb.com,2009://1.13631-</id>
  <updated>2009-11-23T01:01:21Z</updated>
  <title>Comments for <![CDATA[Google: "We're Not Doing a Good Job with Structured Data"]]></title>
  
  <generator uri="http://www.sixapart.com/movabletype/">Movable Type 4.23-en</generator>
  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631</id>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php" />
    <link rel="service.edit" type="application/atom+xml" href="http://www.readwriteweb.com/cgi-bin/mt/mt-atom.cgi/weblog/blog_id=1/entry_id=13631" title="Google: &quot;We're Not Doing a Good Job with Structured Data&quot;" />
    <published>2009-02-02T15:32:07Z</published>
    <updated>2009-02-02T23:27:06Z</updated>
    <title>Google: &quot;We&apos;re Not Doing a Good Job with Structured Data&quot;</title>
    <summary>Google: We&apos;re Not Doing a Good Job with Structured Data</summary>
    <author>
      <name>Sarah Perez</name>
      <uri>http://www.sarahintampa.com</uri>
    </author>
    
    <category term="Features" />
    
    <category term="Google" />
    
    <category term="NYT" />
    
    <category term="Search Services" />
    
    <category term="Trends" />
    
    <content type="html" xml:lang="en" xml:base="http://www.readwriteweb.com/">
      <![CDATA[<p><img src="http://www.readwriteweb.com/images/google-logo.jpg">During a talk at the New England Database Day conference at the Massachusetts Institute of Technology, Google's <a href="http://us.rd.yahoo.com/dailynews/pcworld/tc_pcworld/storytext/googleresearchertargetswebsstructureddata/30785950/SIG=114ei8q2t/*http://alonhalevy.googlepages.com/">Alon Halevy</a> <a href="http://tech.yahoo.com/news/pcworld/20090130/tc_pcworld/googleresearchertargetswebsstructureddata">admitted</a> that the search giant has <em>"not been doing a good job"</em> presenting the structured data found on the web to its users. By "structured data," Halevy was referring to the databases of the "deep web" - those internet resources that sit behind forms and site-specific search boxes, unable to be indexed through passive means. </p>]]>
      <![CDATA[

<h2>Google's Deep Web Search</h2>

<p>Halevy, who heads the "Deep Web" search initiative at Google, described the "Shallow Web" as containing about 5 million web pages while the "Deep Web" is estimated to be 500 times the size. This hidden web is currently being indexed in part by Google's automated systems that submit queries to various databases, retrieving the content found for indexing. In addition to that aspect of the Deep Web - dubbed "<strong>vertical searching</strong>" - Halevy also referenced <strong>two other types of Deep Web Search: semantic search and product search</strong>. </p>

<p>Google also wants to be able to retrieve the data found in structured tables on the web, said Halevy, citing a table on a page listing the U.S. presidents as an example. There are 14 billion such tables on the web, and, after filtering, about 154 million of them are interesting enough to be worth indexing. </p>

<h2>Can Google Dig into the Deep Web? </h2>

<p><img src="http://farm3.static.flickr.com/2382/2264773191_51668c31bd_o.jpg" align="right">The question that remains is whether Google's current search engine technology can handle all the different types of Deep Web indexing, or whether the company will need to come up with something new. As of now, Google uses the <a href="http://en.wikipedia.org/wiki/BigTable">Bigtable database</a> and <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce framework</a> for everything search-related, notes <a href="http://data-methods.com/2009/02/new-england-database-day-2009-at-mit/">Alex Esterkin</a>, Chief Architect at <a href="http://www.infobright.com">Infobright, Inc.</a>, a company delivering open source data warehousing solutions. During the talk, Halevy listed a number of analytical database application challenges that Google is currently dealing with: schema auto-complete, synonym discovery, creating entity lists, association between instances and aspects, and data-level synonym discovery. These challenges are addressed by Infobright's technology, said Esterkin, but <em><strong>"Google will have to solve these problems the hard way."</strong></em> </p>

<p>Also mentioned during the speech was how Google plans to organize "aspects" of search queries. The company wants to be able to separate exploratory queries (e.g., "Vietnam travel") from ones where a user is in search of a particular fact ("Vietnam population"). The former query should deliver information about visa requirements, weather and tour packages, etc. In a way, this is like what the search service offered by <a href="http://www.kosmix.com/">Kosmix</a> is doing. But Google wants to go further, said Halevy. "Kosmix will give you an 'aspect,' but it's attached to an information source. In our case, all the aspects might be just Web search results, but we'd organize them differently." </p>

<h2>Yahoo Working on Similar Structured Data Retrieval</h2>

<p><img src="http://www.readwriteweb.com/images/yahoo-purple-logo.jpg" align="left">The challenges facing Google today are also being addressed by their nearest competitor in search, Yahoo. In December, <a href="http://www.readwriteweb.com/archives/yahoo_search_to_offer_abstracts_of_search_results_determine_intent.php">Yahoo announced that they were taking their SearchMonkey technology in-house</a> to automate the extraction of structured information from large classes of web sites. The results of that in-house extraction technique will allow Yahoo to augment their Yahoo Search results with key information returned alongside the URLs.</p>

<p>In this aspect of web search, it's clear that no single company dominates yet. However, even if a non-Google company surges ahead, it may not be enough to get people to switch engines. Today, "Google" has become synonymous with web search, just as "Kleenex" means a tissue, "Band-Aid" an adhesive bandage, and "Xerox" a photocopy. Once that psychological mark has been made on our collective psyche and the habit has formed, people tend to stick with what they know, regardless of who does it better. That's a bit troublesome: if better search technology for indexing the Deep Web comes into existence outside of Google, the world may not end up using it until Google either duplicates or acquires the invention. </p>

<p>Still, it's far too soon to write Google off. They clearly have a lead when it comes to search, and that came from hard work, incredibly smart people, and innovative technical achievements. No doubt they can figure out this Deep Web thing, too. (We hope.) </p>]]>
    </content>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125170</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125170" />
    <title>Comment from Govind Kabra on 2009-02-02</title>
    <author>
        <name>Govind Kabra</name>
        <uri>http://cazoodle.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://cazoodle.com">
        <![CDATA[<p>The Deep Web is really uncharted territory, and thus, as you rightly pointed out, no single company dominates yet. </p>

<p>At Cazoodle, building upon our years of research at the University of Illinois, we have developed a structured search platform for enabling Web-scale vertical search. Try out our first product, which integrates thousands of apartment rental sites to offer a unique choice for apartment search:</p>

<p><a href="http://apartments.cazoodle.com" rel="nofollow">http://apartments.cazoodle.com</a></p>

<p>Govind</p>]]>
    </content>
    <published>2009-02-02T17:36:15Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125179</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125179" />
    <title>Comment from huixing on 2009-02-02</title>
    <author>
        <name>huixing</name>
        <uri>http://friendfeed.com/huixing</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://friendfeed.com/huixing">
        <![CDATA[<p>Why doesn't Google call on those sites to open up their structured data, instead of indexing it in such a roundabout way? In the same vein, sites that possess structured data should open it up voluntarily, which would help the semantic web and indirectly make up for the semantic search that Google currently lacks.</p>]]>
    </content>
    <published>2009-02-02T17:36:52Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125180</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125180" />
    <title>Comment from todd on 2009-02-02</title>
    <author>
        <name>todd</name>
        <uri>http://friendfeed.com/toddh</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://friendfeed.com/toddh">
        <![CDATA[<p>I would be very scared of opening up database schemas to search engines.</p>]]>
    </content>
    <published>2009-02-02T17:45:00Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125182</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125182" />
    <title>Comment from Michael Friis on 2009-02-02</title>
    <author>
        <name>Michael Friis</name>
        <uri>http://friendfeed.com/friism</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://friendfeed.com/friism">
        <![CDATA[<p>They're working on it: <a href="http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html" rel="nofollow">http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html</a></p>]]>
    </content>
    <published>2009-02-02T17:57:28Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125187</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125187" />
    <title>Comment from mathew johnson on 2009-02-02</title>
    <author>
        <name>mathew johnson</name>
        <uri>http://blog.blist.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://blog.blist.com">
        <![CDATA[<p>Google's initiative to replace as many existing Google Search Appliances as possible with the newer Google Mini Search Appliance works exactly on the deep-web problem. Anyone running a Mini in their datacenter is giving up their structured data to Google. The same goes for Google Desktop Search. While I am not aware of any instances where Google is currently integrating this data into public facing search results - it's clear that your data privacy will be eroded here and there, a little bit at a time, going forward.</p>

<p>Mathew<br />
<a href="http://blog.blist.com" rel="nofollow">http://blog.blist.com</a></p>]]>
    </content>
    <published>2009-02-02T18:30:24Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125202</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125202" />
    <title>Comment from Alan Wilensky on 2009-02-02</title>
    <author>
        <name>Alan Wilensky</name>
        <uri>http://bizcast.typepad.com/clients/writings-and-portfolio-sa.html</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://bizcast.typepad.com/clients/writings-and-portfolio-sa.html">
        <![CDATA[<p>Bigtable and MapReduce are the systems that Google uses to store and compute index data. But their search method is computational and mathematical, as opposed to linguistic (semantic, ontological).</p>

<p>The computational approach to search is well worn and has entered the 'incremental return phase'; don't expect any breakthroughs from this well-worn path.</p>

<p>Almost all of the smaller players are going down the linguistic path. Probably some hybrid will emerge as the winner.</p>]]>
    </content>
    <published>2009-02-02T19:34:43Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125209</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125209" />
    <title>Comment from Tom on 2009-02-02</title>
    <author>
        <name>Tom</name>
        <uri></uri>
    </author>
    <content type="html" xml:lang="en" xml:base="">
        <![CDATA[<p>Govind, you did not have to spend years of research to develop a solution. Kapow has already done that.</p>

<p>www.kapowtech.com<br />
</p>]]>
    </content>
    <published>2009-02-02T20:12:11Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125227</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125227" />
    <title>Comment from Govind Kabra on 2009-02-02</title>
    <author>
        <name>Govind Kabra</name>
        <uri>http://cazoodle.com</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://cazoodle.com">
        <![CDATA[<p>Tom, </p>

<p>The world is simply too big! Kapow wasn't the first, and most certainly, Cazoodle won't be the last. </p>

<p>Hope to see you at Web 2.0<br />
</p>]]>
    </content>
    <published>2009-02-02T23:01:52Z</published>
  </entry>

  <entry>
    <id>tag:www.readwriteweb.com,2009://1.13631-comment:125333</id>
    <thr:in-reply-to ref="tag:www.readwriteweb.com,2009://1.13631" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php"/>
    <link rel="alternate" type="text/html" href="http://www.readwriteweb.com/archives/google_were_not_doing_a_good_job_with_structured_data.php#c125333" />
    <title>Comment from Igor Goldkind on 2009-02-03</title>
    <author>
        <name>Igor Goldkind</name>
        <uri>http://friendfeed.com/igorgold</uri>
    </author>
    <content type="html" xml:lang="en" xml:base="http://friendfeed.com/igorgold">
        <![CDATA[<p>Why doesn't Google just hand over its board to the W3C to manage? I mean, the Internet is more of a public utility than an enterprise, isn't it?</p>]]>
    </content>
    <published>2009-02-03T21:38:52Z</published>
  </entry>

</feed>