ReadWriteWeb

Google: "We're Not Doing a Good Job with Structured Data"

Written by Sarah Perez / February 2, 2009 7:32 AM / 9 Comments

During a talk at the New England Database Day conference at the Massachusetts Institute of Technology, Google's Alon Halevy admitted that the search giant has "not been doing a good job" presenting the structured data found on the web to its users. By "structured data," Halevy was referring to the databases of the "deep web" - those internet resources that sit behind forms and site-specific search boxes and cannot be indexed through passive means.

Google's Deep Web Search

Halevy, who heads the "Deep Web" search initiative at Google, described the "Shallow Web" as containing about 5 million web pages, while the "Deep Web" is estimated to be 500 times that size. This hidden web is currently being indexed in part by Google's automated systems, which submit queries to various databases and retrieve the resulting content for indexing. In addition to that aspect of the Deep Web - dubbed "vertical searching" - Halevy also referenced two other types of Deep Web search: semantic search and product search.
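The query-submission approach described above can be illustrated with a minimal sketch. It is not Google's implementation, which the talk did not detail. It only assumes a search form that accepts GET parameters, and enumerates the result URLs a crawler could fetch by combining candidate values for each form field. The function name, the example.com address, and the field names are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_form_urls(action_url, field_values, limit=100):
    """Build GET URLs for a search form from candidate values per field.

    action_url   -- the form's target URL (hypothetical in this sketch)
    field_values -- dict mapping a field name to a list of values to try
    limit        -- cap on generated URLs, so one form cannot flood the crawl
    """
    names = sorted(field_values)  # fixed field order gives reproducible URLs
    urls = []
    for combo in product(*(field_values[name] for name in names)):
        if len(urls) >= limit:
            break
        query = urlencode(dict(zip(names, combo)))
        urls.append(action_url + "?" + query)
    return urls


# Hypothetical apartment-search form with two fields.
urls = candidate_form_urls(
    "http://example.com/search",
    {"city": ["Boston", "Chicago"], "beds": ["1", "2"]},
)
for url in urls:
    print(url)
```

Each generated URL would then be fetched and its result page handed to the ordinary indexing pipeline. The cap matters because the number of combinations grows multiplicatively with every additional field.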

Google also wants to be able to retrieve the data found in structured tables on the web, said Halevy, citing a table on a page listing the U.S. presidents as an example. There are 14 billion such tables on the web, and, after filtering, about 154 million of them are interesting enough to be worth indexing.
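By those figures, only about 1.1 percent of tables survive filtering (154 million out of 14 billion). The talk did not say how the filtering works, so the sketch below is only an assumption about what such a filter might look like: it keeps tables that have a header row, at least three rows, and the same number of columns in every row, and discards layout tables. The thresholds and all names are hypothetical.

```python
from html.parser import HTMLParser


class TableStats(HTMLParser):
    """Record, for each <table>, the cell count per row and the number of <th> cells."""

    def __init__(self):
        super().__init__()
        self.tables = []  # tables whose closing tag has been seen
        self._stack = []  # tables still open (supports nested tables)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append({"rows": [], "headers": 0})
        elif self._stack:
            current = self._stack[-1]
            if tag == "tr":
                current["rows"].append(0)
            elif tag in ("td", "th") and current["rows"]:
                current["rows"][-1] += 1
                if tag == "th":
                    current["headers"] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.tables.append(self._stack.pop())


def looks_relational(table, min_rows=3, min_cols=2):
    """Assumed heuristic: header cells present, enough rows, uniform column count."""
    rows = [cells for cells in table["rows"] if cells > 0]
    if len(rows) < min_rows or table["headers"] == 0:
        return False
    return min(rows) >= min_cols and len(set(rows)) == 1


def interesting_tables(html):
    parser = TableStats()
    parser.feed(html)
    parser.close()
    return [table for table in parser.tables if looks_relational(table)]


page = """
<table><tr><td>site navigation</td></tr></table>
<table>
  <tr><th>President</th><th>Term began</th></tr>
  <tr><td>George Washington</td><td>1789</td></tr>
  <tr><td>John Adams</td><td>1797</td></tr>
</table>
"""
kept = interesting_tables(page)
print(len(kept), "of 2 tables kept")
```

A production filter would presumably weigh many more signals, but even a crude structural test like this one separates the single-cell layout table from the presidents table.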

Can Google Dig into the Deep Web?

The remaining question is whether Google's current search technology can handle all of these types of Deep Web indexing, or whether the company will need to come up with something new. As of now, Google uses the Bigtable database and the MapReduce framework for everything search related, notes Alex Esterkin, Chief Architect at Infobright, Inc., a company delivering open source data warehousing solutions. During the talk, Halevy listed a number of analytical database application challenges that Google is currently dealing with: schema auto-complete, synonym discovery, creating entity lists, association between instances and aspects, and data-level synonym discovery. These challenges are addressed by Infobright's technology, said Esterkin, but "Google will have to solve these problems the hard way."

Also mentioned during the speech was how Google plans to organize "aspects" of search queries. The company wants to be able to separate exploratory queries (e.g., "Vietnam travel") from ones where a user is in search of a particular fact ("Vietnam population"). The former query should deliver information about visa requirements, weather, tour packages and so on. In a way, this resembles what the Kosmix search service does. But Google wants to go further, said Halevy. "Kosmix will give you an 'aspect,' but it's attached to an information source. In our case, all the aspects might be just Web search results, but we'd organize them differently."

Yahoo Working on Similar Structured Data Retrieval

The challenges facing Google today are also being addressed by their nearest competitor in search, Yahoo. In December, Yahoo announced that they were taking their SearchMonkey technology in-house to automate the extraction of structured information from large classes of web sites. The results of that in-house extraction technique will allow Yahoo to augment their Yahoo Search results with key information returned alongside the URLs.

In this aspect of web search, it's clear that no single company dominates yet. However, even if a non-Google company surges ahead, it may not be enough to get people to switch engines. Today, "Google" has become synonymous with web search, just like "Kleenex" is a tissue, "Band-Aid" is an adhesive bandage, and "Xerox" is a way to make photocopies. Once that psychological mark has been made on our collective psyche and the habit formed, people tend to stick with what they know, regardless of who does it better. That's a bit troublesome - if better search technology for indexing the Deep Web comes into existence outside of Google, the world may not end up using it until Google either duplicates or acquires the invention.

Still, it's far too soon to write Google off. They clearly have a lead when it comes to search, and that came from hard work, incredibly smart people, and innovative technical achievements. No doubt they can figure out this Deep Web thing, too. (We hope.)


Comments

  1. The Deep Web is really uncharted territory, and thus, as you rightly pointed out, no single company dominates it yet.

    At Cazoodle, building upon our years of research at the University of Illinois, we have developed a structured search platform for enabling Web-scale vertical search. Try out our first product, which integrates thousands of apartment rental sites to offer a unique choice for apartment search:

    http://apartments.cazoodle.com

    Govind

    Posted by: Govind Kabra | February 2, 2009 9:36 AM



  2. Why doesn't Google call on those sites to open up their structured data, instead of indexing it in such a roundabout way? In the same vein, sites possessing structured data should open it up voluntarily. That would help the semantic web and, indirectly, make up for the semantic search that Google currently lacks.

    Posted by: huixing Posted on FriendFeed   | February 2, 2009 9:36 AM



  3. I would be very scared of opening up database schemas to search engines.

    Posted by: todd Posted on FriendFeed   | February 2, 2009 9:45 AM



  4. They're working on it: http://googlewebmastercentral.blogspot.com/2008/04/crawling-through-html-forms.html

    Posted by: Michael Friis Posted on FriendFeed   | February 2, 2009 9:57 AM



  5. Google's initiative to replace as many existing Google Search Appliances as possible with the newer Google Mini Search Appliance works exactly on the deep-web problem. Anyone running a Mini in their datacenter is giving up their structured data to Google. The same goes for Google Desktop Search. While I am not aware of any instances where Google is currently integrating this data into public facing search results - it's clear that your data privacy will be eroded here and there, a little bit at a time, going forward.

    Mathew
    http://blog.blist.com

    Posted by: mathew johnson | February 2, 2009 10:30 AM



  6. Bigtable and MapReduce are the systems that Google uses to store and compute index data. But their search method is computational and mathematical, as opposed to linguistic (semantic, ontological).

    The computational approach to search is well worn and has entered the 'incremental return phase'; don't expect any breakthroughs from this well-worn path.

    Almost all of the smaller players are going down the linguistic path. Probably some hybrid will emerge as the winner.

    Posted by: Alan Wilensky | February 2, 2009 11:34 AM



  7. Govind, you did not have to spend years of research to develop a solution. Kapow has already done that.

    www.kapowtech.com

    Posted by: Tom | February 2, 2009 12:12 PM



  8. Tom,

    The world is simply too big! Kapow wasn't the first, and most certainly, Cazoodle won't be the last.

    Hope to see you at Web 2.0

    Posted by: Govind Kabra | February 2, 2009 3:01 PM



  9. Why doesn't Google just hand over its board to the W3C consortium to manage? I mean, the Internet is more of a public utility than an enterprise, isn't it?

    Posted by: Igor Goldkind Posted on FriendFeed   | February 3, 2009 1:38 PM


