ReadWriteWeb

The Web of Data: Creating Machine-Accessible Information

Written by Alexander Korth / April 18, 2009 10:00 AM / 18 Comments

In the coming years, we will see a revolution in the ability of machines to access, process, and apply information. This revolution will emerge from three distinct areas of activity connected to the Semantic Web: the Web of Data, the Web of Services, and the Web of Identity providers. These webs aim to make semantic knowledge of data accessible, semantic services available and connectable, and semantic knowledge of individuals processable, respectively. In this post, we will look at the first of these Webs (of Data) and see how making information accessible to machines will transform how we find information.

The amount of information and services available is growing exponentially. Every day, it is getting harder to find the information we are actually looking for. Still, we have to learn how to tell machines what we want. Why can't a machine understand which website, recent tweet, Flickr photo, Facebook message, or restaurant we are currently looking for?

Because it can't. It does not understand. It has no access to most sources. It lacks the semantic understanding and common sense to build bridges between information.

It is critical that machines gain a new level of understanding. Instead of statistically computing how well a search term matches a document, a machine must literally be able to understand. Therefore, knowledge bases are needed to look things up. Examples of these knowledge bases include:

  • an encyclopedia containing knowledge to look up the semantic meaning and context of a particular term (e.g. to understand that Berlin is a city, how many people live there, and where it is),
  • Yellow Pages or a service pool to query often-changing and more complex information (e.g. a route from Berlin to Porto by car, or the current temperature of Porto in Celsius),
  • a people database to look up profile information, with user permissions, which could improve personalization and recommendations.

The Web of Data

The idea of the Web of Data originated with the Semantic Web. People tried to solve the problem of the inherent inability of machines to understand web pages. Initially, the aim of the Semantic Web was to invisibly annotate web pages with a set of meta-attributes and categories to enable machines to interpret text and put it in some kind of context. This approach did not succeed because the annotation was too complicated for people who had no technical background. Similar approaches, like microformats, simplify the markup process and thus help get past this chicken-and-egg problem.

These approaches have in common the effort to improve the machine-accessibility of knowledge on web pages that were designed to be consumed by humans. Furthermore, these sites contain a lot of information that is irrelevant to machines and that needs to be filtered. What is needed is a knowledge base for machines to look up "noiseless" information. But wait! Who said that machines and we humans need to share one web anyway?

The idea of the Web of Data came about as a result of both this limitation and the existence of countless structured data sets distributed all over the world and containing all kinds of information. These data sets are the property of companies that tend to make them accessible. Typically, a data set contains knowledge about a particular domain, like books, music, encyclopedic data, companies, you name it. If these data sets were interconnected (i.e. linked to each other like websites), a machine could traverse this independent web of noiseless, structured information to gather semantic knowledge of arbitrary entities and domains. The result would be a massive, freely accessible knowledge base forming the foundation of a new generation of applications and services.
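
To make the idea of traversal a little more concrete, here is a minimal Python sketch of a machine following links between two interconnected data sets and collecting what it finds along the way. Everything in it is an assumption made for illustration: the data set names ("encyclopedia", "geo"), the entity identifiers, the facts, and the in-memory dictionary layout are invented, and real data sets would be reached over the network rather than held in one program.

```python
from collections import deque

# Two hypothetical data sets. Every identifier and fact here is invented
# for illustration; each entry holds some facts plus links to other entries.
DATASETS = {
    "encyclopedia": {
        "Berlin": {
            "facts": {"type": "city"},
            "links": [("geo", "berlin-de")],
        },
    },
    "geo": {
        "berlin-de": {
            "facts": {"country": "Germany"},
            "links": [("encyclopedia", "Berlin")],  # links may point back
        },
    },
}


def gather(start_dataset, start_id):
    """Breadth-first walk over linked entries, merging the facts found."""
    seen = set()
    queue = deque([(start_dataset, start_id)])
    knowledge = {}
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited, avoids looping on back-links
        seen.add(node)
        dataset, entity_id = node
        entry = DATASETS.get(dataset, {}).get(entity_id)
        if entry is None:
            continue  # dangling link: skip it
        knowledge.update(entry["facts"])
        queue.extend(entry["links"])
    return knowledge


result = gather("encyclopedia", "Berlin")
print(result)
```

The point of the sketch is only that neither data set alone knows both facts about the entity; the combined picture emerges from following the links.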

Linking Open Data

One promising approach is W3C's Linking Open Data (LOD) project. The above image illustrates participating data sets. The data sets themselves are set up to re-use existing ontologies such as WordNet, FOAF, and SKOS and interconnect them.

The data sets all grant access to their knowledge bases and link to items of other data sets. The project follows basic design principles of the World Wide Web: simplicity, tolerance, modular design, and decentralization. The LOD project currently counts more than 2 billion RDF triples, which is a lot of knowledge. (A triple is a piece of information that consists of a subject, predicate, and object to express a particular subject's property or relationship to another subject.) Also, the number of participating data sets is rapidly growing. The data sets currently can be accessed in heterogeneous ways; for example, through a semantic web browser or by being crawled by a semantic search engine.
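
As a rough illustration of the triple structure described above, the following Python sketch stores a handful of triples as plain tuples and matches them against a simple pattern. The identifiers (`ex:Berlin`, `ex:locatedIn`, and so on) are made-up placeholders, not actual LOD vocabulary, and the `match` helper is a hypothetical stand-in for what a real RDF library or query language would provide.

```python
from typing import Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: a subject, a predicate, and an object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical facts; the names are placeholders for illustration only.
TRIPLES = [
    Triple("ex:Berlin", "ex:type", "ex:City"),
    Triple("ex:Berlin", "ex:locatedIn", "ex:Germany"),
    Triple("ex:Porto", "ex:type", "ex:City"),
    Triple("ex:Porto", "ex:locatedIn", "ex:Portugal"),
]


def match(
    triples,
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield the triples that agree with every part of the pattern given.

    A part left as None acts as a wildcard.
    """
    for t in triples:
        if subject is not None and t.subject != subject:
            continue
        if predicate is not None and t.predicate != predicate:
            continue
        if obj is not None and t.obj != obj:
            continue
        yield t


# "Which things are cities?"
cities = [t.subject for t in match(TRIPLES, predicate="ex:type", obj="ex:City")]
print(cities)

# "What do we know about Berlin?"
berlin_facts = list(match(TRIPLES, subject="ex:Berlin"))
print(berlin_facts)
```

Because the object of one triple can be the subject of another, a collection of such statements forms a graph, which is what allows items in one data set to point at items in another.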

To get a feeling for what this machine Web of Data is like, you may want to look up:

With every fact available on the Web of Data, more general and specific knowledge is made accessible to machines that will enable a whole new generation of services to be created. Highly sophisticated queries become machine-processable and accessible to the next generation of, say, search services.

Check out Tim Berners-Lee's talk at TED about the Web of Data. What do you think about it? Do you encounter the same issues of being overloaded by information or too much noise?

(Photo by zorro-art. Graph by the Linking Open Data project.)


Comments


  1. Alexander,

    You can experience the full effect of the Linked Data Cloud at:

    1. http://lod.openlinksw.com

    You can approach search, find, and explore data in the cloud via the following:

    1. Full Text Search (doing the Google or Yahoo type search)
    2. Lookups by Entity Labels (type in a pattern and system looks up the Cloud for associated entities)
    3. Lookup by Identifier (if you know the ID of something e.g. you just type it in)

    In all cases re. 1-3 you end up with a result set that enables the kind of filtering that simply isn't deliverable by search engine i.e., use of Entity Properties and/or Entity Types to filter and refine your query (Search+Find we call it).

    At the end of this, you simply end up with an Entity Name (Data Source Name or URI) that can be used for additional data access and meshing with other data sources.

    The Web has just become the facilitator of a distributed data network that serves as a bus to connected data spaces (units of Web Presence that expose data).

    Kingsley

    Posted by: Kingsley Idehen | April 18, 2009 11:30 AM



  2. Very timely!

    We'll have a gathering shortly at WWW09 [1] - feel free to chime in and discuss with other F2F what's going on.


    Funnily enough, I recently wrote to the editors of RWW and informed them about pushback/RDForms [2], which takes the idea of linked data a step further, aiming at creating a read/write Web of Data. Maybe this is worth a separate article?

    Cheers,
    Michael


    [1] http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/MadridGathering
    [2] http://esw.w3.org/topic/PushBackDataToLegacySources

    Posted by: Michael Hausenblas | April 18, 2009 12:06 PM



  3. Unquestionably, grasping the essence of this post suggests how the publishing and distribution of "news" is undergoing fundamental change...More likely that journalists just a year or so from now, will have the task to make their content "machine-readable" and that humans will read the content...will likely be a subset...Why? Because machine-readability for content will be for successful monetization...I think this is the point..Right?

    Posted by: Lou Sagar | April 18, 2009 2:13 PM



  4. The current web, where href links are the only bridge between web silos, has been bottlenecking the information fluidity for machines from the get-go. In my mind, the ultimate goal of the web - TBL's vision: to offer information at a breadcrumb (granular) level to the world can only be achieved by LOD (Linking Open Data) or Linked Data.

    It will not only allow machines to go granular on information, but also to access the structural version of information for humans on-the-fly. The major challenge of the LOD is to convert the gargantuan contents of the Web to a machine-understandable format, which I think is being addressed by NLP tools popping up left and right.

    So, I think it is prime time to welcome the Linking Open Data.

    Posted by: Shamod | April 18, 2009 2:31 PM



  5. Certainly!

    Key barrier in creating the web of data is to understand the semantics of the structured data already prevalent in the Web, albeit in semi-structured form.

    At Cazoodle, we are using our technology for modeling the Web pages as structured data, to build useful vertical search services. For example, Cazoodle Apartment Search is the first ever one stop search engine for apartments indexed from everywhere on the Web. Give it a try!

    http://apartments.cazoodle.com

    Posted by: Govind Kabra | April 18, 2009 3:28 PM



  6. Thanks for the comprehensive overview of why we need a semantic approach to address our state of information overload. At Jinni, we see semantic technology as key to personalization and recommendations (in the entertainment sphere).

    Posted by: Phoebe | April 19, 2009 1:19 AM



  7. Lou: no, not quite. It is about making data sets (think of databases) accessible to machines. This information can cover any domain, not only news. When you scan through http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets you'll notice that the domains are totally heterogeneous.

    Shamod, making accessible information about humans is somewhat more tricky when it comes to privacy, personae, authorization with revokeable access rights and such. Thus, I want to cover that in the forthcoming article about the Web of Identities. I do not doubt that both topics have things in common.

    Best, @alexkorth

    Posted by: Alexander Korth | April 19, 2009 2:42 AM



  8. Hi Alexander,

    Thanks for writing this great review of the Semantic Web paradigm. It's very much in line with the way we sense the problems that are facing Web users now. It would be great to have a review of the products and companies that are tackling this issue today (next post perhaps?).

    We at SemantiNet feel that our http://Headup.com Firefox addon goes a fair distance to delivering Semantic Web experiences, especially in those categories of information that have been the focus of massive UGC (user generated content) activity, namely social-networking and online music.

    Gartner's predictions for this industry (that I'm familiar with) put massive mainstream adoption 2-3 years in the future. It would be interesting to see what else is out there and how you see this field developing. Do you support Gartner's view or do you predict earlier adoption?

    Cheers,
    Mike
    "I tweet @headup"

     Posted by: Headup | April 19, 2009 6:59 AM



  9. Great post and discussion - LOD is a great initiative and the more data that becomes available the better. The semantic web angle is important indeed, but I think in many cases we're even still missing the basics - having the data available in *any* queryable form. If you have an API, even if the content doesn't follow a known ontology yet, people can get their teeth into it.

    Another interesting dimension is that while the focus here is on data, functionality is also going this way - a lot of companies now open up APIs on their functionality as well as data (e.g. image transformation, file exchange, compute resources etc.). There are W3C initiatives on Semantic Web Services which start to try to describe functionality, but even before those kick in we're now able to do some pretty cool stuff by, for example, sending Flickr images into moo.com or sucking house-price data in one side and spitting out trend data the other.

    The more this happens we'll have not only a web of data, but a web of processes.

    Posted by: Steven Willmott | April 19, 2009 9:11 AM



  10. Steven, I totally agree. As mentioned in the intro, the Web of Services, and the Web of Identity Providers will be covered in future posts. Once this all is real, I would like to discuss with all of you how these techniques *combined* will change the Web! This will be a service-enabler beyond our imagination.

    Mike: by "tackling this issue" do you mean companies exposing data sets? That is already happening. My guess is that it will take some time for service providers to learn how to make in-depth use of this knowledge.

    Posted by: Alexander Korth | April 19, 2009 9:51 AM



  11. According to Tim Berners-Lee, the whole web would have gone semantic a long time ago. But while this sounds great in theory, all the entropy will not so easily be dealt with.

    Posted by: datadirt | April 19, 2009 10:34 AM



  12. This post contains URLs where it should in fact use URIs.

    The essence of Linked Data comes down to a URI that delivers the following in one go:

    1. Named Reference (a Web Space name for something)
    2. Conduit to an address (URL) that exposes the Description of a Named Thing in a negotiated representation ( (X)HTML, RDFa, N3, Turtle, RDF/XML etc..)

    Examples (based on this post).

    1. http://dbpedia.org/resource/Berlin - URI of Berlin the Place
    2. http://dbpedia.org/resource/Tetris - URI of Tetris (*note* /resource not /page)
    3. http://linkeddata.uriburner.com/about/rdf/http://www.crunchbase.com/company/yahoo#this - a different Linked Data URI for Yahoo from Crunchbase data space (note the meshing with Calais and then DBpedia when you follow the "primarytopic" link etc..)

    Kingsley

    Posted by: Kingsley Idehen | April 19, 2009 11:51 AM



  13. Michael,
    I am really happy to - in this way - support the Linked Data on the Web (LDOW2009) workshop at WWW2009 (http://events.linkeddata.org/ldow2009), tho I am sad that I could not make it to Madrid to meet you guys F2F.

    Posted by: Alexander Korth | April 20, 2009 2:09 AM



  14. Hey there!

    @Alexander: Great way to separate the different challenges concerning true linkage. I'm looking forward to the discussion on the "people database".

    But just as I'm (again) logging in to this site in order to comment, I have to stress the most important part of all this linkage: a better user experience.

    Having facebook connect for logging in created a delay of 5 seconds (page reload), and having to remember the friendfeed remote key another 5.

    After that, this comment might end up on RWW's mod panel, in my friendfeed and be "co-posted to Facebook". But how do I, as the prosumer, follow the ensuing conversation?

    While FF-addons like headup or Juice make discovery of new content quite nice, the DISO-like dashboard seems to be missing from my browser.

    How can we create an interface (through the browser) that is as adaptable to the constant switch between read and write mode without overwhelming end users with the plethora of services, plugins and destinations?

    I think all silos (browser history, bookmarks, search history) need to be opened up as wide as possible, syndicatedly searchable and accessible through common APIs.

    Ubiquity and addons like identify will soon become common to mainstream users. If, and only if the benefits are seen earlier than the downsides of information fragmentation, attention deficits and privacy concerns.

    And if the user experience becomes more streamlined.

     Posted by: Björn | April 20, 2009 3:27 AM



  15. Hi Alex. It would be nice to get more information about the interaction of the webs and future developments: the level of machine understanding (implications for business, culture, etc.).
    But what about the security of private data? I think this is what you mean by the "web of identities"...
    Do we have to fear a new level of privacy abuse, in your opinion? Looking forward to reading something about that.

    marc

    Posted by: mcb | April 20, 2009 3:50 AM



  16. I just wrote up a blog article on the emerging web data ecosystem and how it parallels biological ecosystems; We also need to understand the sequence of its "boot" cycle in order to understand the importance of certain evolutionary stepping stones such as "rel='me'":

    http://jpatterson.floe.tv/index.php/2009/04/19/the-data-ecology/

    Another critical area is the emerging XRD and LRDD specs (done by the author of OAuth, Eran Hammer-Lahav), which will provide dynamic discovery and auto linking of data and services in future evolutions of the web.

    Posted by: Josh Patterson | April 20, 2009 7:44 AM



  17. Few observations.

    1. Web of data with its links is not web of knowledge, it is a very trivial variant of it. There are great deal of shortcomings in the link data structures as presented. It is linkable but not computable. A deeper and more complete representation of reality should be used.

    2. All this will be of sufficient value only when evolved to a stage when enterprises can RUN in this space. In other words when deep utility is derived. Search is just one small aspect of what is possible.

    3. Once this is open to everyone, there is a need for new type of technology, we call it wisdom technology (ala DIKW hierarchy) that is able to understand structures and see simplifications, analogies etc. This is needed to manage complexity that will emerge.

    Pawel Lubczonok
    ThoughtExpress

    Posted by: pawel lubczonok | April 22, 2009 12:15 AM



  18. Great article, very informative.

    -Giddens

    Posted by: Matt Giddens | September 18, 2009 9:33 AM


