markup - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/markup en Copyright 2012 Richard MacManus readwriteweb@gmail.com Mon, 13 Feb 2012 16:00:00 -0800 http://www.sixapart.com/movabletype/?v=4.35-en http://blogs.law.harvard.edu/tech/rss

Does Facebook Really Want a Semantic Web?

Two weeks ago, Facebook announced a major new initiative called Facebook Open Graph. This is an attempt not only to re-imagine Facebook but, in many ways, to redefine how the Web works. We wrote in detail about the implications of this move for all interested parties.

A big part of the announcement is Facebook's vision of a consumer Semantic Web. In this new world, publishers have an incentive to annotate pages by marking up activities, events, people, movies, books, music and more. The proper markup would, in turn, lead to a much more interconnected Web - people would be connected with each other across websites and around the things they are interested in.

Directionally, this vision is both correct and important. We've been talking about a pragmatic approach to the Semantic Web for some time, and we're excited at the possibility of it finally happening. Yet two weeks after the announcement, it is becoming more and more apparent that there are gaps in Facebook's offering and intentions. A close look reveals that perhaps Facebook's intent is not to make the Web more structured, but instead to engineer a way for more data - mostly unstructured - to flow into Facebook databases.

As you will see from the rest of the post, it appears that getting semantics right has not been a big priority for Facebook, at least not prior to the announcement. Here are the issues we identify:

  1. Open Graph Protocol does not support object disambiguation
  2. Open Graph Protocol does not support multiple objects on the page
  3. Launch partners have not implemented Open Graph Protocol correctly on their sites
  4. Facebook does not have the markup on its own pages that it asks the world to adopt
  5. A growing amount of user profile data is full of duplicates and ambiguity

Concerns with Open Graph Protocol

A week ago, we complimented the Open Graph Protocol for its simplicity, but on closer inspection we see a couple of flaws. First of all, there is no way to disambiguate objects. For example, two movies that have the same name would be considered the same movie. A proper way to deal with this is to introduce secondary attributes, such as a director or a year, that help identify a specific object, but the protocol does not define secondary attributes.
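To make the disambiguation point concrete, here is a minimal Python sketch of how secondary attributes could separate two objects that share a title. The `MovieRef` type, its `year` and `director` fields and the `identity_key` helper are hypothetical names chosen for illustration; none of them is part of the Open Graph Protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MovieRef:
    """A hypothetical movie reference; the protocol itself defines no such type."""
    title: str
    year: int
    director: str


def identity_key(ref):
    """Build a comparison key from the title plus secondary attributes."""
    return (ref.title.strip().lower(), ref.year, ref.director.strip().lower())


# Two different films that share one title.
original = MovieRef("The Italian Job", 1969, "Peter Collinson")
remake = MovieRef("The Italian Job", 2003, "F. Gary Gray")

same_title = original.title == remake.title
same_object = identity_key(original) == identity_key(remake)
```

Matching on the title alone treats the two films as one object, while the key that includes year and director keeps them apart.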

The second issue is that there is no way to mark up the objects inside the page. In its current version, the protocol only supports declaring that an entire page is about a person, a news event, a musician or a movie, but there is no way to identify objects inside the page. This is a big use case for bloggers and review sites - each blog post typically mentions many entities, and it would be nice to support this use case from the start.

Both of these shortcomings are easy to correct. The nice thing is that the protocol is simple and minimalistic, so adding the bits to handle disambiguation and multiple entities is straightforward. The other things that we are going to discuss are much more troublesome.

Launch Partners - Why No Markup?

It is a truism that making the Web more structured means adding more markup. No matter how limited, having markup on the pages is always better than not having it. When Facebook announced the Open Graph Protocol, it highlighted several sites that were already using it. Among them were Yelp, IMDB and Pandora. We took a look to see how exactly these sites are marking up their pages. What we found is rather surprising - none of these sites implemented the markup correctly. We looked at the How to Train Your Dragon movie on IMDB, Brad Pitt's page on IMDB, the Muse page on Pandora and the Acquagrill page on Yelp.

This is what Facebook defines as required properties:

[Image: fb_protocol_ai.png]

And this is what we found on the actual partner pages:

[Image: fb_partners_ai.png]

So what does this mean? It means that Facebook implemented special handling for these sites. When a user likes a movie on IMDB, or when she likes a movie star, Facebook can't really tell the difference, since IMDB is not passing correct information via the protocol. The only reason it works is that IMDB is explicitly hard-coded by Facebook.
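A check of this kind is easy to automate. The Python sketch below scans a page for Open Graph `<meta>` tags and reports which required properties are absent. It assumes the four properties the Open Graph Protocol specification lists as required (`og:title`, `og:type`, `og:image`, `og:url`); the `SAMPLE` page is invented for illustration and is not a copy of any partner's actual markup.

```python
from html.parser import HTMLParser

# Properties the Open Graph Protocol specification lists as required.
REQUIRED = ("og:title", "og:type", "og:image", "og:url")


class OpenGraphParser(HTMLParser):
    """Collect og:* properties declared in <meta> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop = attr_map.get("property")
        content = attr_map.get("content")
        if prop and prop.startswith("og:") and content:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def missing_required(html):
    """Return the required properties that the page does not declare."""
    parser = OpenGraphParser()
    parser.feed(html)
    return [name for name in REQUIRED if name not in parser.properties]


# Invented sample page that declares only two of the four properties.
SAMPLE = """
<html><head>
<meta property="og:title" content="How to Train Your Dragon" />
<meta property="og:type" content="movie" />
</head><body></body></html>
"""

missing = missing_required(SAMPLE)
```

For the sample page, `missing` lists `og:image` and `og:url`, which is the kind of gap a generic consumer of the protocol would have to work around.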

The roll-out for launch was not generic, but custom, targeted more towards PR than correctness. Why Facebook would allow this, instead of having partners implement correct markup, is unclear. It is so easy to implement, and the partner pages already have all the necessary information. We conclude that enforcing correctness was simply not a priority for the launch.

Eating Your Own Dog Food

As it turns out, not only did publishers not mark up their pages - neither did Facebook. At the time of writing, none of the entity pages on Facebook.com have Open Graph markup. So much for being open - Facebook's own pages remain closed. Ironically, it might not be because the company does not want to mark up the pages, but because it can't. At least not yet.

Figuring out what is on the page is actually not a trivial problem. This is what the semantic technologies from Freebase, Powerset, Open Calais, Evri, Zemanta and GetGlue, among others, have been working on over the past several years. To be able to mark up the pages correctly, especially the ones created by users, Facebook needs to run them through some sort of semantic processing and disambiguation. This isn't a trivial matter.

Unstructured data on user profiles

[Image: fb_movies_ai.png]

All of this comes full circle to impact the users. As the Like buttons spread through the Web, so does unstructured, duplicated data spread through user profiles. The absence of semantics creates fragmented connections and noise around the Web.

Below is the listing of movies that I liked, fetched via the Facebook Open Graph API. How to Train Your Dragon shows up twice, because I liked it once on IMDB and then also on Fandango. The friends I see on the Fandango page are different from the ones I see on IMDB. And worst of all, all this uncleaned data is showing up on my profile - the movie title contains a year in one case and the originating site in the other.

So right now Facebook does not correlate things across sites. Instead, it just captures the information as is, hoping to maybe clean it up later.
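As a rough illustration of what cleaning it up later might involve, the Python sketch below groups liked titles after stripping a trailing year or site name. The exact title strings, the `SITE_SUFFIX` and `YEAR_SUFFIX` patterns and the helper names are assumptions made for this example; the real profile data may be formatted differently.

```python
import re
from collections import defaultdict

# Hypothetical suffix patterns; real profile titles may look different.
SITE_SUFFIX = re.compile(r"\s*[|\-]\s*(imdb|fandango)\s*$", re.IGNORECASE)
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")


def normalize_title(raw):
    """Strip a trailing site name and year, then lowercase and collapse spaces."""
    title = SITE_SUFFIX.sub("", raw)
    title = YEAR_SUFFIX.sub("", title)
    return " ".join(title.split()).lower()


def group_likes(likes):
    """Group raw liked titles under their normalized form."""
    groups = defaultdict(list)
    for like in likes:
        groups[normalize_title(like)].append(like)
    return dict(groups)


# Invented examples of the two title styles described above.
likes = [
    "How to Train Your Dragon (2010)",
    "How to Train Your Dragon | Fandango",
    "Avatar (2009)",
]
groups = group_likes(likes)
```

String cleanup like this is only a stopgap. It throws away the year, which is exactly the kind of secondary attribute needed to tell apart two different movies with the same name, so a sturdier fix would carry such attributes in the markup itself.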

Conclusion: A different goal?

All of these facts when added together lead to the obvious conclusion: Facebook's goal is not to create a better, more structured Web. Instead, it appears that semantics is an afterthought in the race to capture user identity and information, in exchange for sending publishers the traffic.

As more and more data flows into Facebook via the Like buttons, Facebook and publishers are getting the benefit of recycling friends through the content on sites around the Web. But at the same time, the data in user profiles is becoming more and more noisy. Since few users are paying attention yet, for now it merely looks silly on closer inspection.

But to be able to power recommendations, to make social plugins a success and to facilitate a good user experience, Facebook will literally need to clean up its act. Duplicate and dirty data will be a big turn-off for users, and the longer this problem goes on, the more difficult it is going to be for Facebook to deal with it.

We will see in the coming weeks and months how the social networking giant handles this issue. In the meantime, we'd like to turn the tables. Please tell us what you think about Facebook's semantic Web ambitions. Should they have gotten the core bits right before the launch, or is this fine and they will be able to catch up quickly?

Disclaimer: Alex Iskold is a founder and CEO of GetGlue.com, a social network for entertainment. GetGlue developed the ability to connect users across different sites through a combination of browser addons and semantic databases in the cloud.

http://www.readwriteweb.com/archives/does_facebook_really_want_a_semantic_web.php Semantic Web Thu, 06 May 2010 14:07:00 -0800 Alex Iskold
Why Google and Other Humans Don't Read Your Book Reviews

The book and media industries are going through interesting times, to put it mildly. As physical books prepare for their demise, the confusion around pricing of digital ones grows. Yet, whether physical or digital, to sell books you need marketing. People need to hear about a book before they buy it.

This is where the book review comes in. Every publicist and publisher's dream is to land a positive review with an authoritative source. A good review in The New York Times or the L.A. Times used to be a pass to big-figure sales. It sounds like it still should be, but it is not, because most book reviews are poorly formatted and cannot be recognized by Google and other software.

The Book Review That Nobody Saw

Let's take a look at this edgy review of Manhood by the L.A. Times. It is a pure joy to read - it is elegant, clever and gets to the heart of the issue. There is only one problem with it - nobody is going to read it, because Google can't find it.

Try running this very specific Google search - "Manhood" by Mels van Driel review - and you will not find the L.A. Times among the results - at least not within the first three pages that humans would care to flip through. How come, you might ask? Well, the answer is simple - there is nothing whatsoever that tells Google that this post is a book review about this particular book.

And this is not just an isolated problem with this book review from this particular newspaper. The issue is widespread across major U.S. and international media outlets. Whether due to a lack of tools or a lack of understanding of how search engines and other software work, publishers routinely fail to make their content discoverable.

A Simple Way to Please Google

So how should book reviews be tagged?

To start with, the title needs to make it clear that this is a book review. Of course, humans may find a more subtle title more enticing, but for the sake of the machine, "Book Review: Manhood by Mels van Driel" has to be present. It would be even better to mark up explicitly that this is a book review, which part is the book title and which part is the author.

Next, the post needs to be adorned with the right tags and keywords. The L.A. Times' reviews are certainly very clever, but again, Google does not get humor. Better tags would be the title of the book, the name of the author and the inconspicuous phrase "book review".

A Better Way to Please Google and Tim Berners-Lee

The tagging system described above is still error-prone. A computer might not interpret it correctly and would miss this post in the search results. This is because that kind of description is not structured. Humans enjoy a wonderful ability to deal with fuzzy things; computers simply can't do it.

For a computer to understand content, it needs to be described using a markup language. This is a broad and complex topic that has been a focus of the so-called Semantic Web and structured data.

The right way of marking up content so that it can be understood by Google, other search engines and semantic technologies is to use a structured format such as ePub, the hReview Microformat or abmeta. Using a structured format removes the ambiguity and enables a computer to "know" what the review is about.
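As one possible illustration, the Python sketch below renders a review as an HTML fragment using class names from the hReview microformat draft (`hreview`, `summary`, `item`, `fn`, `reviewer`, `description`). The `render_hreview` helper, the reviewer name and the review text are placeholders invented for this example, and the fragment is deliberately minimal rather than a complete hReview.

```python
from html import escape


def render_hreview(item_name, reviewer, summary, body):
    """Render a minimal hReview-style HTML fragment (hypothetical helper)."""
    return (
        '<div class="hreview">\n'
        f'  <h2 class="summary">{escape(summary)}</h2>\n'
        f'  <p class="item"><span class="fn">{escape(item_name)}</span></p>\n'
        f'  <p class="reviewer vcard"><span class="fn">{escape(reviewer)}</span></p>\n'
        f'  <div class="description">{escape(body)}</div>\n'
        '</div>'
    )


# Placeholder reviewer and body text; only the book title comes from the article.
markup = render_hreview(
    item_name="Manhood",
    reviewer="Staff Reviewer",
    summary="Book Review: Manhood by Mels van Driel",
    body="Placeholder text standing in for the body of the review.",
)
```

The point of the class names is that software no longer has to guess: the element marked `item` names the thing being reviewed, and the element marked `reviewer` names who reviewed it.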

Making the content discoverable by Google in turn makes it discoverable by humans.

Tagging: It's All About the Money

Could it just be that book reviewers in major newspapers would get more page views if they did a better job tagging content? And then in turn, could it also be that if more people discovered clever and elegant reviews then more books would be sold? Even if you don't think so, there is way too much risk of getting this one wrong.

Doing appropriate, standard tagging and markup of book reviews is cheap and simple and should be part of the daily publishing routine. Each media company needs to invest in standards and guidelines around content markup. This is not just a matter of being a good citizen of the Web; it is a matter of making money.

Photo credit: Ivan Petrov

http://www.readwriteweb.com/archives/why_google_and_other_humans_dont_read_your_book_reviews.php Semantic Web Sun, 14 Feb 2010 18:00:00 -0800 Alex Iskold

XBRL: Mashing up Financial Statements

Amid the dark days on Wall Street and in global markets, it seems to be up to technology to step up and deliver solid analysis and rational scrutiny. The US market regulator, the Securities and Exchange Commission (SEC), ratified a proposal on Wednesday for public companies and mutual fund companies to file their financial statements in XBRL (eXtensible Business Reporting Language). The XML-based language is also known as "Interactive Data" in financial circles and promises faster analysis with wider coverage. All things being equal, it will mitigate the poor analysis and regulation that's been contributing to stupendously bad financial decisions.

This is a guest post by Derek Abdinor.

Companies have traditionally filed on paper, in ASCII, or in HTML: all essentially lifeless formats for conducting any meaningful comparison or analysis. With XBRL, every line item is given a tag that identifies it and its role in the financial statement. Imagine that a line item -- say, "Net Income" -- is tagged like a migrating goose (which frequently happens, I'm told). That goose is part of a flight of geese, which may change their course mid-flight, fly over national borders, have babies, and even join another flight. But thanks to the vital information on the goose's tag, we never lose the original information, and we are even able to see it in the context of other information.

Financial accounts are the same. Figures get re-purposed all over the place, which leads to input errors, or worse. It's easy to cover up information or fail to notice business risks when the analysis is relegated to a footnote somewhere, and you're reading the annual report like a "Choose Your Own Adventure" book.

When financial data is tagged, it's begging to get mashed up. Take a look at this comparison of executive pay, this dynamic charting, and the SEC's own repository and viewer. Software exists that can be first used upstream with the creation of management accounts and go all the way through to taxonomy design, document tagging, and viewing. One would be able to call up income statements of two or more companies in different sectors and different countries and compare line items in seconds.

But to see XBRL simply as a means of marking up financial statements at the end of a financial reporting period is to miss the rest of the iceberg. If financial items are automatically tagged upon their creation using a system like SAP, the rich analysis can be filtered through the enterprise and to suppliers. Triggers and reports can be generated on the fly. Knowledge workers will be manipulating XBRL without knowing it by its accurate, albeit consonant-heavy, name.

XBRL, by its nature, has largely escaped the wave of Enterprise 2.0 functionality. But the openness of the data, its ability to be mashed up and displayed in previously unthought-of ways, will impress itself upon a public disillusioned with poor financial management -- management that has itself partly relied on poor data. It's time for some developer rock stars to step in and make those spreadsheets sing.

More about XBRL

Often described as being simply complex, XBRL should be approached from a technological, as well as an accounting, perspective. XBRL is simply a flavor of XML. Financial line items, totals, text, and metadata are XML elements that are mapped to a predefined schema (called a "taxonomy" in XBRL). In all cases, these taxonomies are the financial rules of accounting for that jurisdiction. Throw in XPath, XLink, and more, and you have a mature language for tagging and submitting your financials.
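To give a feel for what tagged line items make possible, here is a small Python sketch that reads two heavily simplified, XBRL-like instance documents and lines up the items they share. The `http://example.com/taxonomy` namespace, the element names and all figures are made up for illustration; real XBRL instances also define contexts and units and reference a published taxonomy, all of which this sketch leaves out.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy namespace; real filings reference a published taxonomy.
TAXONOMY_NS = "http://example.com/taxonomy"

# Invented, heavily simplified instance documents for two companies.
COMPANY_A = """<xbrl xmlns:ex="http://example.com/taxonomy">
  <ex:Revenue contextRef="FY2008" unitRef="USD">1200</ex:Revenue>
  <ex:NetIncome contextRef="FY2008" unitRef="USD">150</ex:NetIncome>
</xbrl>"""

COMPANY_B = """<xbrl xmlns:ex="http://example.com/taxonomy">
  <ex:Revenue contextRef="FY2008" unitRef="USD">900</ex:Revenue>
  <ex:NetIncome contextRef="FY2008" unitRef="USD">-40</ex:NetIncome>
</xbrl>"""


def read_facts(xml_text):
    """Map each tagged line item in the taxonomy namespace to its numeric value."""
    root = ET.fromstring(xml_text)
    prefix = "{" + TAXONOMY_NS + "}"
    facts = {}
    for element in root:
        # ElementTree exposes a namespaced tag as "{uri}LocalName".
        if element.tag.startswith(prefix):
            facts[element.tag[len(prefix):]] = float(element.text)
    return facts


def compare(first, second):
    """Pair up the line items that both filings report."""
    return {name: (first[name], second[name]) for name in sorted(first.keys() & second.keys())}


comparison = compare(read_facts(COMPANY_A), read_facts(COMPANY_B))
```

Because every figure carries its own tag, the comparison needs no knowledge of how either statement is laid out; it simply matches items by name.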

An introductory resource to begin with is Wikipedia, which links to the various regulatory bodies, IT initiatives, and current issues. The "XBRL in Plain English" video is specific to executive summaries.

This was a guest post by Derek Abdinor, a divisional director at motiv - the Investor and Branding agency of Ince, a large communications concern from South Africa.

http://www.readwriteweb.com/archives/xbrl_mashing_up_financial_statements.php Mon, 22 Dec 2008 19:00:00 -0800 Guest Author