ReadWriteWeb

Semantic Web: Where Are The Meaning-Enabled Authoring Tools?

Written by Guest Author / February 20, 2008 11:49 PM / 27 Comments

By Nitin Karandikar

Much has been written recently about the concepts, approaches and applications of the Semantic Web. But there's something missing. In terms of understanding, finding and displaying content, there is no doubt that the Semantic Web is slowly becoming real (e.g. there were some great demos at a recent SDForum meet). However, there is a gap emerging with Content Authoring tools, which have not yet made this paradigm shift.

On the one hand, most authors are comfortable with, and proficient in, desktop authoring tools such as Microsoft Word, FrontPage, Adobe GoLive and others. This is especially true for professionals and other experts who create technical reference content for web applications, such as legal references, accounting manuals or engineering documents. The current crop of authoring tools produces visually high-quality articles and web pages, but their XML creation capabilities are severely limited.

On the other hand, parsing Word documents or HTML web pages to extract meaningful XML out of them gives poor results; much of the semantic knowledge of the content is lost. There do not appear to be any popular tools that create Semantic content natively and yet are natural and easy for a content author to use.

Top-Down? Or Bottom-Up?

Of course, there are ways to get around this issue to some extent. Allowing authors or readers to add tags to articles or posts allows a measure of classification, but it does not capture the true semantic essence of the document. Automated Semantic Parsing (especially within a given domain) is on the way - a la Spock, Twine and Powerset - but it is currently limited in scope and needs a lot of computing power; in addition, if we could put the proper tools in the authors' hands in the first place, extracting the semantic meaning would be so much easier.

For example, imagine that you are building an online repository of content, using paid expert authors or community collaboration, to create a large number of similar records - say, a cookbook of recipes, a stack of electrical circuit designs, or something similar. Naturally, you would want to create domain-specific semantic knowledge of your stack at the same time, so that you can classify and search for content in a variety of ways, including by using intelligent queries.

Ideally, the authors would create the content as meaningful XML text, so that parsing the semantics would be much easier. A side benefit is that this content can then be easily published in a variety of ways and there would be SEO benefits as well, if search engines could understand it more easily. But tools that create such XML, and yet are natural and easy for authors to use, don't appear to be on their way; and the creation of a custom tool for each individual domain seems a difficult and expensive proposition.


Image: andrea.paiola

Car Review Example

As a more concrete example: imagine that you control a web site called New-Car-Reviews.com, a hypothetical site that reviews new cars; you pay expert authors to write reviews of new car models every year for this site. Unlike other automobile characteristics, reviews cannot be easily stored in a database and queried. Conceptually, your reviews are similar to this review for the 2008 Volvo S40 2.4i sedan on the automotive site Kelley Blue Book.

Imagine this: when your authors are originally composing this review, instead of writing the content as

    <span id="ctl00">You'll Like This Car If...</span>
        ...description_positive...
    <span id="ctl00">You May Not Like This Car If...</span>
       ...description_negative...

they could create it as

    <advantages><label>You'll Like This Car If...</label>
        <text>...description_positive...</text>
    </advantages>
    <disadvantages><label>You May Not Like This Car If...</label>
        <text>...description_negative...</text>
    </disadvantages>

In other words, you get more value out of the same exact content:

  (a) You can easily re-purpose the content in additional ways, such as for mobile devices, RSS feeds, web services APIs, mashups and so on;
  (b) As search engines start to take advantage of semantic notation, you get SEO benefits;
  (c) You can provide users with ways to query the content intelligently ("show me cars which are family-friendly AND don't roll over easily vs those that work better off-road AND seat 7"), using the recently-released SPARQL.
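
To make points (a) and (c) a little more concrete, here is a minimal Python sketch, not a real implementation. It assumes the fragment above is wrapped in hypothetical <reviews> and <review> elements with a made-up model attribute, and the description text is placeholder copy rather than anything from an actual review. SPARQL itself queries RDF data, so this sketch filters with plain Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: the <reviews>/<review> wrappers, the model attribute
# and the description text are placeholders invented for this sketch.
REVIEWS_XML = """
<reviews>
  <review model="Car A">
    <advantages><label>You'll Like This Car If...</label>
      <text>you want a family-friendly sedan</text>
    </advantages>
    <disadvantages><label>You May Not Like This Car If...</label>
      <text>you plan to drive off-road</text>
    </disadvantages>
  </review>
  <review model="Car B">
    <advantages><label>You'll Like This Car If...</label>
      <text>you want to go off-road and seat 7</text>
    </advantages>
    <disadvantages><label>You May Not Like This Car If...</label>
      <text>you need low running costs</text>
    </disadvantages>
  </review>
</reviews>
"""

SECTIONS = ("advantages", "disadvantages")


def parse_reviews(xml_text):
    """Turn the semantic XML into a list of plain dicts, one per review."""
    root = ET.fromstring(xml_text.strip())
    reviews = []
    for node in root.findall("review"):
        review = {"model": node.get("model")}
        for section in SECTIONS:
            review[section] = {
                "label": node.findtext(f"{section}/label").strip(),
                "text": node.findtext(f"{section}/text").strip(),
            }
        reviews.append(review)
    return reviews


def to_plain_text(review):
    """Re-purpose one review as a short text summary, e.g. for a feed item."""
    lines = [review["model"]]
    for section in SECTIONS:
        part = review[section]
        lines.append(f"{part['label']} {part['text']}")
    return "\n".join(lines)


def find_models(reviews, section, keyword):
    """Return the models whose given section mentions the keyword."""
    return [
        r["model"]
        for r in reviews
        if keyword.lower() in r[section]["text"].lower()
    ]


if __name__ == "__main__":
    reviews = parse_reviews(REVIEWS_XML)
    print(to_plain_text(reviews[0]))
    print(find_models(reviews, "advantages", "off-road"))
```

Because the advantages and disadvantages are separate elements, the same records can feed both the text summary and the keyword lookup without any screen-scraping of presentation markup.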

As a content publisher, you want your content to be found and used as much as possible, and making it meaning-enabled is a big step in this direction. At the same time, you cannot ask authors to use a pure XML tool such as XMLSpy; and MS Word creates unreadable XML that specifies formatting rather than semantics.

A solution for this specific example already exists: Microformats could be applied to handle the problem of annotating the advantages and disadvantages. While the Microformat solution works very well for specific types of information - such as for describing people and addresses - it is too limited to be applicable in a general way to add semantic information to web content at large.
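
As a rough illustration of the microformat-style approach, the Python sketch below pulls annotated text out of ordinary HTML by class name. The class names (review, item, advantages, disadvantages) are invented for this example and are not taken from any published microformat; the sample text is placeholder copy.

```python
from html.parser import HTMLParser

# Hypothetical markup: these class names are assumptions made for this
# sketch, not part of any published microformat.
REVIEW_HTML = """
<div class="review">
  <span class="item">Car A</span>
  <p class="advantages">you want a family-friendly sedan</p>
  <p class="disadvantages">you plan to drive off-road</p>
</div>
"""


class ClassTextExtractor(HTMLParser):
    """Collect text found inside elements that carry a wanted class name.

    Assumes well-formed markup in which every start tag has an end tag
    (no unclosed void elements such as a bare <br>).
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []  # per open element: the matched class name, or None
        self.found = {name: [] for name in wanted}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self.stack.append(match)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the innermost enclosing annotated element.
        for name in reversed(self.stack):
            if name is not None:
                self.found[name].append(text)
                break


def extract(html, wanted):
    """Return a dict mapping each wanted class name to the text found in it."""
    parser = ClassTextExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    print(extract(REVIEW_HTML, ["item", "advantages", "disadvantages"]))
```

The page still renders as normal HTML, and the annotation lives entirely in class attributes, which is why this style is easy to add little by little but hard to generalize beyond agreed-upon vocabularies.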

So the general problem must be solved if we are to see large-scale adoption of the Semantic Web. It would be a boon to expert authors everywhere, including those who create news articles for the newspaper publishing industry. But there do not seem to be any solutions on the horizon, in terms of technologies, tools or processes to promote the creation of more meaning-rich content.

Reactions: But is there a Business Case?

When I put this question to a group of prominent bloggers and industry thought-leaders in the Semantic Web space, the results were not encouraging. There does not seem to be much interest in building Semantic authoring tools. The main stumbling block is the lack of a clear business model for publishers to embrace this approach.

Jeremy Liew of Lightspeed Venture Partners has recently penned a series of articles on the Semantic Web, "Meaning = Data + Structure", covering user-generated structure, domain knowledge and user behavior; the series focuses on the problem of inferring meaning from content.

He questions the business rationale for authors to take the effort to add XML markup to their content, and points to domain-specific extraction approaches as the more likely solution:

"The challenge with getting most authors to markup in XML is not just one of tools, but also of motivation IMO. Unless and until a clear business case advantage justifies the additional effort required, and that advantage is greater than other projects offer, you won't see much semantic markup except from academics and others whose interests are more philosophically driven than business driven.

That is why I think the domain specific extraction approaches will likely be more prevalent - the business advantage of better search and structure accrues to the person doing the extraction, and because it is domain specific, the additional effort is lessened."

He's right, of course; domain-specific extraction approaches are definitely going to be popular, and are beginning to take off already. They provide significant added value for the extractor. However, extraction is difficult and expensive to do well, so the business case is somewhat dubious for the early adopters.

ReadWriteWeb's own Alex Iskold is another thought leader in this space. He has a series of fantastic articles about the Semantic Web, including the problem of annotating data, the different approaches used, and a primer for the structured web.

His comments echoed those of Liew:

"There seems to be little incentive for publishers to annotate information.

The problem is that if you go deep enough you hit RDF. The light version is Microformats. But the issue is not the format, it's the incentive."

Tim O'Reilly wrote about this issue almost a year ago: Different Approaches to the Semantic Web, in which he echoes the same sentiment:

"It seems easy enough, but why hasn't this approach taken off? Because there's no immediate benefit to the user. He or she has to be committed to the goal of building hidden structure into the data. It's an extra task, undertaken for the benefit of others. And as I've written before, one of the secrets of success in Web 2.0 is to harness self-interest, not volunteerism, in a natural 'architecture of participation.'"

Conclusion

I guess I'm a minority of one. In my view if content creators could add semantic meaning while constructing the content in the first place (which is, conceptually, only marginally more difficult for the authors), then the value of the content would increase exponentially at very low cost. That seems like a defensible business case for content publishers.

The business case for publishers to annotate existing web pages and content is certainly very weak. But for new content, if you're creating it for your site anyway, why wouldn't you add semantic markup to make it more findable and usable?

What do you think? Please leave a comment below.

Top image credit: nennett


Comments


  1. Quarkshop/Quarkrank (http://www.quarkshop.com) does some automatic extraction and annotation of features for the electronics domain. Currently, they do it for reviews from review sites.

    Posted by: Nakul Aggarwal | February 21, 2008 12:53 AM



  2. Well that's a new low point for semantic web discussion on RWW. Here's a clue - using arbitrary XML formats is actually worse than using HTML for broader reuse because it means there is a potentially unlimited number of schemas to support. That's why the Semantic Web uses RDF, not XML (judging by the blog.softwareabstractions.com site this appears to be a distinction which has escaped Nitin).

    To be clear - you can't run SPARQL on the XML example given above. To do that you'd need to convert that format to RDF, which is no easier or harder than converting the HTML example given above.

    Please RWW - consider getting someone who actually knows what they are talking about to read pieces like this before posting them. It's embarrassing seeing the falling standards at what used to be a pretty good site. Compare this article to Jon Udell's wonderful piece on basically the same topic: Overcoming data friction. There's someone who actually knows what they are talking about writing well. Note that - unlike Jon - Nitin seems not to understand that most web content isn't authored using FrontPage, etc., but comes from relational databases with web sites as front ends. That's the problem space that Astoria (see Jon Udell's article) and the Google feedserver address.

    If you do want to keep authors like this writing could I suggest that they constrain themselves to writing about Web 3.0 or something so their ignorance isn't so offensive?

    Posted by: None | February 21, 2008 1:10 AM



  3. There are three major ways of annotating:

    1. fully automatic annotation
    2. halfway automatic annotation generating suggestions
    3. manual annotation

    #1 tends to generate too many or not enough or not the intended annotations.

    #2 tends to generate too many or not enough or not the intended annotations, but allows for choosing annotations manually. This, however, is a separate task from actually writing texts / noting information.

    #3 is time-consuming and widely unacceptable if not natural language-like. And it bears the risk of introducing inconsistencies if not supported by automatic consistency checks.

    The problem, however, is not that of annotating. The problem is information denormalization. Let me explain. Across and sometimes even within documents, one tends to repeat oneself again and again. Thus we would have to annotate again and again. Why is it not enough to say something once and then refer to it (using transclusion, tuples, and sequences)? Documents have to be understood as sequences of utterances that can be re-sequenced and referred to. I think that instead of annotating denormalized and depersonalized documents, we should start building personal, integral semantic knowledge stores that also function as reference systems in publishing. Knowledge chunks (that can be serialized into documents) as Semantic Web entities are per se semantically enriched and limit the need for in-document annotation.

    Now, if a document consisted of knowledge chunks, one would avoid a further problem: Annotating a document as a whole is often not possible or leads to a description that is just too broad to become relevant (that's of course the fault of the document). Entities annotated within the document may not describe the document satisfactorily. To put it philosophically, it's simply that what is mentioned is not (at) all what is said, and what is said is often not or not only what is meant and intended.

    In my Artificial Memory project I try to develop an advanced cognitive / authoring UI allowing us to practise publishing and knowledge management in a Semantic Web without creating the idea of having to laboriously 'annotate' information, because it's all about creating and using one's memory reflections, and thus enhancing one's creativity and productivity.

    Posted by: Lars Ludwig | February 21, 2008 1:44 AM



  4. A very interesting article. However, I often wonder how the annotation scheme is decided upon. If everyone chose their own scheme, chaos would ensue.

    For example, if the above-mentioned author used and tags, whereas another car review website used and tags. Then we'd need another semantic 'layer' mapping to etc.

    Are there schemes that have been specified for certain domains? E.g. I can imagine financial news to be a good place to use such structured annotations.

    Posted by: n | February 21, 2008 2:08 AM



  5. I see that the tags from my previous comment were removed, thus making it impossible to understand. So here it is again:

    A very interesting article. However, I often wonder how the annotation scheme is decided upon. If everyone chose their own scheme, chaos would ensue.

    For example, if the above-mentioned author used "advantage" and "disadvantage" tags, whereas another car review website used "pro" and "con" tags. Then we'd need another semantic layer mapping "advantage" to "pro" etc.

    Are there schemes that have been specified for certain domains? E.g. I can imagine financial news to be a good place to use such structured annotations.

    Posted by: n | February 21, 2008 2:12 AM



  6. I think it's a great idea to develop smarter authoring tools, that assists in marking up the content in a semantically correct way. Say you spend several hours on an article, adding a few tags and possible other annotations would just take a couple of minutes, if the authoring tool was smart enough. The benefits in terms of increased discoverability and relevancy to readers are obvious, and would not be difficult to pitch to writers. Microformats and similar lighter markup methods are probably the right way to go.

    In my humble beginnings as a blogger, I have come across a few tools that help improving the semantics. For example, I am using a Wordpress theme (Sandbox), which uses POSH, hAtom, hCard and other Microformats in the markup. I am using a seo-friendly plugin (HeadSpace2), that adds some meta-tags and assists in tagging the content. Also in the blogging editor Windows Live Writer, there is a useful option to add rel-attributes when inserting a hyperlink. Blog-authoring tools are likely the front-runners in terms of semantics, improvements in other publishing and authoring tools are certainly needed.

    Posted by: Jonas Bolinder | February 21, 2008 2:38 AM



  7. Do you really think that there is "the true semantic essence" of a document? I would be afraid of a web that is built based on this assumption. I want subjectivity and different voices from different people with different viewpoints, coming from different cultures!

    Posted by: Martin | February 21, 2008 3:16 AM



  8. You can find a discussion of these issues by Barney Pell (CTO at Powerset) here.

    Also, I'm surprised you didn't mention Calais in this article.

    Posted by: Stuart Robinson | February 21, 2008 3:20 AM



  9. You can add me to your minority...

    I guess it would be valuable both for authors and publishers to agree on a tagging system that would help indexing, at least on a small number of very general, well-agreed upon, semantic categories.

    It seems that Dublin Core categories actually are gaining more and more acceptance, at least for anyone involved in "authoring" some content.

    Adam Hodgkin (Exact Editions) shows examples of what should be done about ISBNs vs. phone numbers.

    I'd love to have smart auto-completion, key binding and shortcuts to tag capitalized names and separate geographical names from person (at least physical, perhaps 'moral persons') names. This would avoid funny results in Google Book Search mash-up with Google Maps such as the link between Dom Pedro and a Brazilian city there.

    Posted by: Alain Pierrot | February 21, 2008 4:43 AM



  10. We already have one, if you could stretch the definition of Semantic authoring a little. And it's built on top of Microsoft Word!

    - Tagging of content, happens in the background, without the user taking any explicit action
    - Re-purposing is a core feature of our product. And repurposing not just to emit the content in different formats but even to aggregate, create textual and numeric summaries across content sets
    - We are restricted to equity research authoring currently, but the architecture is domain neutral

    Posted by: Mahesh CR | February 21, 2008 5:02 AM



  11. Most people are lazy by nature. That's why, for the mainstream, only automatic annotation done by either publishing tools or search engines will work.

    Posted by: Yakov | February 21, 2008 5:34 AM



  12. I agree with the problem being one of motivation. It's just hard to see what the advantage is of taking that extra effort, it's not so much a matter of laziness. Theoretically, it sounds great, but practically there are only a handful of people that can envision what those advantages would look like. It could be one of those cases where it just needs a few years until this topic becomes more urgent, enters the mainstream brains and simple annotation tools are built into major products.

    In terms of automation, I would see a semi-automatic tool work the best. Once you're done writing your content, the tool will provide you with a suggested structure/outline/etc, you make a few adjustments and you're done. I don't think we'll get to a point where every piece of content is perfectly identified across the web. It'll probably be a simple system like tagging that does a fairly good job and that people can relate to easily.

    Great article and a great topic.

    Posted by: Christoph | February 21, 2008 6:08 AM



  13. I work in the education industry and we're constantly battling with our test question authors. We need semantically described text that we can display in a variety of presentation layers but they're really only comfortable working in MS Word.

    Our general solution is either to create Forms in Word and scrape the information out into semantic text or to create Web-forms for collecting the data that are as similar to Word as possible.

    Posted by: Brent | February 21, 2008 6:23 AM



  14. The problem with XML authoring is that it's too time-consuming from a user perspective. You're basically requiring that the user fill out a detailed, unique form on every post or content node.

    What's really needed is a way for the CMS to prefill semantic data for the user, and then let the user tweak it. The prefill would have to come from contextual information (post title keywords, word frequency analysis, link text) and metadata (category, tags). In a way you have a mini-search engine index running against your own post, and giving you "search results" to let you "rank" the sub-content into a structured form. And even then, take pains to hide the XML-ness; instead of showing the user a pile of confusing <blah>blahblah</blah> xml markup, it should provide a cleaner view like

    drink: triple latte
    cost: $4.50
    opinion: sucks, overpriced

    where of course the labels (drink, cost, opinion) are mapped to the actual XML containers <drink></drink> etc. The user can edit the list easily, insert or delete labels as they choose, and then hit publish.

    To achieve this, you need good metadata. By good, I mean "rich" - it should be noted that tagging alone is actually pretty poor as far as metadata goes because it's usually only a taxonomy imposed by the author, not a true folksonomy. The advantage of the latter is that the metadata is more variable, giving any semantic algorithm more room to play with. Note that tagging as implemented in Wordpress is not a true folksonomy, though a plugin now exists to rectify that. Semantic algorithms will starve on taxonomies alone.

    Posted by: Aziz Poonawalla | February 21, 2008 6:50 AM



  15. Where are the rules for this thing? I'm a seasoned web professional and I have no idea HOW to annotate content. Also, will there be browser compatibility issues as a result? There are more questions than answers at this point.

    Posted by: chris | February 21, 2008 7:00 AM



  16. Excellent post - but the very non transparent "Written by Guest Author" makes me question the integrity of the post...also mildly creeped out by it.

    Posted by: Todd | February 21, 2008 7:32 AM



  17. Notice that quite a lot of websites now are "search engines optimized". That's because in most cases it's very good for you and your site to be search engines friendly.

    As soon as some "semantic-powered" search engine / directory / some other application becomes popular people will want to make their websites friendly to that application. Maybe Google, Yahoo! or some other major player will begin to experiment with reading semantically annotated data and those experiments will become popular. After that lots of tools will appear in different forms - CMS plugins, standalone software, online services etc. Actually I think it'll take a couple of weeks for these tools as soon as some demand appears, because adding some XML / microformat tagging to a webpage code isn't difficult at all. What people need to know is why they should bother and what is the actual format they need to use. Now it's like "that could be format X (e.g. microformats) or format Y (e.g. XML) but I'm not sure what is best and what this will give to you".

    And that's the problem. To become popular, semantic applications need some "real world" content annotated, but people don't want to annotate because semantic applications aren't popular. And even if some publishers believe in the future of the semantic web and are ready to begin semantic tagging they don't know what type of tagging to use.

    So we need to wait for some killer application. Taking into account the recent hype around microformats such applications can appear quite soon and will be based on them (because you can add microformats tagging little by little where needed and quite easily).

    Posted by: alibloomdido | February 21, 2008 7:35 AM



  18. There are a lot of benefits to gain from semantic content. Semantic content is easy to repurpose within your organisation (multiple brands, multiple media, multiple publications, etc). Semantic content is easily converted to suit the needs of external consumers (search engines, visually impaired people, etc).

    The challenge with semantic content is creating it. The biggest reason why there is so little semantic content is the authoring tools: they are either too flexible (word processors) or too technical (xml editors).

    It is the core business of Xopus Company to fill that gap. We have developed a browser based XML editor specifically targeted at non-technical users.

    We think the semantics and rigid structure should be used to guide the author while editing. This makes the author the first to benefit from the semantics. With this key stakeholder on your side, you will be far more successful creating semantic content.

    Posted by: Laurens van den Oever | February 21, 2008 9:03 AM



  19. Microsoft's Windows Live Writer (I can hear the MS moans now...) was actually designed with some of these scenarios in mind. Writer's plugin system allows for the creation of editor components that generate content that is marked up with HTML *and* extended metadata. Microformats are one type of content Writer plugins support, but the goal of the API was to allow users/developers to easily extend the editor to support any other types of content markup that may emerge over time.

    Unfortunately, Writer's publishing capabilities are still only targeted at blogging platforms, so these plugins are only useful if you are publishing to blog-API enabled platforms (Metaweblog, ATOM, etc), which your hypothetical car-reviews example could easily be.


    [disclosure: I was part of the original Writer dev team]

    Posted by: Spike | February 21, 2008 9:14 AM



  20. Have added a little more detail here
    http://maheshcr.com/blog/2008/02/22/authoring-tools-and-semantics-possibilties-that-outrun-imagination/

    in addition to an earlier comment

    Posted by: Mahesh CR | February 21, 2008 11:11 AM



  21. We design expert predictive systems, so we've created our own 'semantic' understanding of web data. This is common for those who have funds.

    Bringing detailed tags to web data might benefit the 'wisdom of the crowd', but be aware, some folks remain in power by gaming that 'collective wisdom'.

    We agree with Martin's comment above, "...afraid of a web that is built based on this assumption..."

    (R)Evolution thrives within chaos...

    Sylvi & Michael :)
    Moab

    Posted by: tweetip | February 21, 2008 12:29 PM



  22. Did anyone else notice that this guy doesn't seem to realize that arbitrary XML dialects and RDF are very different things?

    An arbitrary XML dialect is not better than HTML for machine use, because specific logic needs to be coded to understand the meaning (semantics) behind each tag.

    Posted by: Nick Lothian | February 21, 2008 2:12 PM



  23. to #2 - bravo
    to #22 - read #2

    to RWW:
    - this article is a big flop.
    - even bigger flop would be not to give your daily best comment award to #2. That would be a nice way to admit mistake you made posting this.

    Posted by: Gleb Tulukin | February 21, 2008 3:29 PM



  24. Wow, this post generated a lot of great comments!

    First, the naysayers:

    #2, 22, 23: Yes, you're right. The example confused things, by dragging XML into the picture - I was trying to keep things simple in order to make a point. If I'd turned the information into a set of RDF triples, that would have made things more obscure and difficult-to-understand in a post of this nature.

    This post is not about specific formats or technologies; the point is that there don't seem to be any good tools for publishers, that create semantic content (whether in XML or RDF) and yet are easy and natural for authors to use.

    Much as it pains us CS-majors, the reality is that a large amount of web content does NOT live in a database, or is not amenable to SQL queries if it does. If you're creating 1000 articles, or 10000, then you have to let the authors manage their content themselves; expert authors are no more likely to be familiar with data-entry tools for the database than they are with tools that generate XML (or RDF).

    [Oh, and I welcome divergent viewpoints; but I hope you can actually sign your name to your comments, as Nick has done above - it's the only way to have a conversation.]

    ---------

    For everyone else:

    I will take a look at Quarkshop, Calais (which sounds really interesting!), Xopus and Live Writer. Thank you all for the links!

    Aziz (#14) above has got the right idea - if the tool could pre-fill in semantic data seamlessly without causing additional work for authors, then that would be perfect!

    Brent (#13), I have seen a real-world solution that works exactly as you described (the first one); but it's neither easy nor elegant to use!

    Lars (#3) - fascinating! And great comment. The knowledge chunks idea sounds very interesting.

    -----------

    The point about proliferating formats for similar data types (e.g. multiple formats for "car review") across different sites, is valid. From the publisher's point-of-view, though, there is a lot of value in being able to semantically manipulate information across a large number of records within their own domain.

    Posted by: NitinK | February 21, 2008 8:38 PM



  25. To add to my earlier comment:

    It's too bad that the term Semantic Web has gotten so closely associated with a specific set of technologies: RDF, OWL, Ontologies et al.

    The concept of attaching semantic knowledge to content (and equally, of extracting it) is much more general; there may well be other solutions in the future that do the job just as well or better.

    Posted by: NitinK | February 21, 2008 8:49 PM



  26. I loved this post and comments above, so I tried to make a French translation of it. It's there on Innovablog : http://innovablog.com/analyse/web-semantique-outils-creation-contenu-riche-de-sens/

    Thanks for this work !

    Posted by: Olivier | February 25, 2008 12:53 AM



  27. very good points,

    I am just in the middle of a presentation about the semantic web as it applies to the presentation of persistent identifiers, rich metadata and XML feeds for our content.
    The content consists of literary texts, and we chose the DOI system for unique ID+meta, and I solved the problem of how to author this metadata in two ways:
    - JAXFRONT is a Java front-end form generator, which displays input form fields from an XML Schema automatically. very nice and convenient
    - Altova AUTHENTIC is a free XML forms front-end client, the only problem is that you have to purchase Altova STYLEVISION to actually create the forms, and the process is a bit tedious, but it works.

    The Altova Authentic Forms are VERY usable for non-experienced users, and create valid XML. The JAXFRONT Java elements have to be integrated in your own Java application to be practically usable, and I have not explored this method fully yet.

    DOI System
    JAXFRONT
    AUTHENTIC

    Posted by: Karl Petermichl | February 29, 2008 10:49 AM


