metadata - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/metadata en Copyright 2009 Richard MacManus readwriteweb@gmail.com Sat, 21 Nov 2009 05:00:00 -0800

Common Tag Brings Standards to Metadata

Let's suppose you uploaded some pictures of a trip to New York City to an online account. Do you tag them "New York City," "NYC," "newyork," or all of the above? How do you know your content will be correctly identified and related to other content on the web? And if you come across the tag "Tesla," how do you know whether it refers to the scientist, the car company, or the band?

Common Tag is a new tagging format that creates references to concretely defined concepts with their own metadata and URLs. With Common Tag, site owners can easily create topic hubs, cross-promote content, and enrich pages with data, images, and widgets.

Currently, companies involved include AdaptiveBlue, DERI (NUI Galway), Faviki, Freebase, Yahoo!, Zemanta, and Zigtag.

According to the Common Tag website, "The Common Tag format was developed to address the current shortcomings of tagging and help everyone - including end users, publishers, and developers - get more out of Web content. With Common Tag, content is tagged with unique, well-defined concepts - everything about New York City is tagged with one concept for New York City and everything about jaguar the animal is tagged with one concept for jaguar the animal. Common Tag also provides access to useful metadata that defines each concept and describes how the concepts relate to one another. For example, metadata for the Barack Obama Common Tag indicates that he's the President of the United States and that he's married to Michelle Obama."

The project aims to make content as discoverable and connected as possible. The creators also hope to make content more engaging. When a web app can determine what a piece of content is actually about, it can offer a far better user experience. The website gives the example of a developer creating an app that uses an article about the most recent Star Trek movie and lets users purchase tickets on the same page. The site reads, "Since both the publisher and ticket service use Common Tag, the application is able to easily make the connection without having to guess at what the content of the two services is about."

Tags are expressed using RDFa, a standard format for defining data in HTML. Relevant code can be found in the Common Tag Quick Start Guide. Interested parties can learn more in the Yahoo! Common Tag group.
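To give a rough idea of what such markup carries, the sketch below embeds a Common Tag-style RDFa fragment in a Python string and pulls out each tag's label and concept URL using only the standard library. The `ctag:` attribute values follow the vocabulary as we recall it from the Quick Start Guide, and the concept URL is purely illustrative, so treat both as assumptions and check the guide for the authoritative format.

```python
from html.parser import HTMLParser

# Illustrative concept URL (an assumption, not taken from the article).
CONCEPT = "http://rdf.freebase.com/ns/en.new_york"

# Two differently worded tags that point at the same concept.
# The ctag: attribute values are assumed from memory of the Quick Start Guide.
SAMPLE = f"""
<p xmlns:ctag="http://commontag.org/ns#" rel="ctag:tagged">
  Photos from
  <a typeof="ctag:Tag" rel="ctag:means" property="ctag:label" href="{CONCEPT}">NYC</a>
  and more from
  <a typeof="ctag:Tag" rel="ctag:means" property="ctag:label" href="{CONCEPT}">New York City</a>.
</p>
"""


class CommonTagParser(HTMLParser):
    """Collects (label, concept URL) pairs; assumes tag elements hold plain text only."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._current = None  # concept URL of the tag element being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("typeof") == "ctag:Tag" and a.get("rel") == "ctag:means":
            self._current = a.get("resource") or a.get("href")
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None:
            self.tags.append(("".join(self._text).strip(), self._current))
            self._current = None


parser = CommonTagParser()
parser.feed(SAMPLE)
distinct_concepts = {url for _, url in parser.tags}
print(parser.tags)
print(len(distinct_concepts))
```

The two labels differ, but both resolve to a single concept URL, which is the disambiguation the format is after.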

In a Q&A, Andraz Tori, CTO of partner company Zemanta, said the idea for Common Tag "started in informal discussion with Peter Mika from Yahoo! and research about what would be the easiest way to let publishers get more out of their content by semantically marking it up." We've seen Common Tag as a vehicle to make Web content more discoverable, connected, and engaging.

"We then learned on previous efforts and decided that we need a full-blown ecosystem from day one. Not just academic support, but web industry support. As you can see the idea was well received."

In terms of adoption, Tori stated, "This is the first time that this number of web companies have stepped together from day one to introduce a tagging standard. We tried to build on previous academic efforts. Over that we added business incentive to participate."

http://www.readwriteweb.com/archives/common_tag_brings_standards_to_metadata.php Semantic Web Wed, 10 Jun 2009 18:10:00 -0800 Jolie O'Dell

Researchers Create YouTube Archiving Tool

A new project called ContextMiner has been created by researchers at the University of North Carolina at Chapel Hill. The tool lets anyone automate the collection of links to online videos and blogs along with their extensive metadata. Although they're calling ContextMiner a YouTube archiving tool, it doesn't actually download the videos off the site...yet. Instead, it extracts the embed code and then provides that to you along with other details like the number of views and what sites are linking to the video.

The tool, a part of the university's NDIIPP VidArch project, is designed to be a framework that collects, analyzes, and presents contextual information along with the data it archives. To get started with ContextMiner, you create a scheduled, repeated collection activity called a "campaign." For each campaign, you can enter in details like description and scope, then customize how often the campaign should run (daily, weekly, monthly), among other things. If you want to collect "in-links" - the web sites on the internet linking to the video in question - that is also an option. In addition to scouring YouTube, you can configure ContextMiner to search through the web and blogs, too.
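One way to picture a campaign is as a small configuration record. The sketch below is only a mental model: ContextMiner is configured through a web form, and the `Campaign` class, its field names, and the sample values are all hypothetical, chosen to mirror the options described above.

```python
from dataclasses import dataclass, field

# The three schedules mentioned in the article.
FREQUENCIES = ("daily", "weekly", "monthly")


@dataclass
class Campaign:
    """Hypothetical model of a ContextMiner campaign, not the tool's real schema."""

    name: str
    description: str
    scope: str
    frequency: str = "weekly"
    collect_inlinks: bool = False
    sources: list = field(default_factory=lambda: ["youtube"])

    def __post_init__(self):
        # Reject schedules the article does not mention.
        if self.frequency not in FREQUENCIES:
            raise ValueError(f"frequency must be one of {FREQUENCIES}")


campaign = Campaign(
    name="viral-politics",                      # made-up campaign name
    description="Political mashup videos",      # made-up description
    scope="videos matching a set of queries",   # made-up scope
    frequency="daily",
    collect_inlinks=True,
    sources=["youtube", "blogs", "web"],
)
print(campaign)
```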

Why Use ContextMiner?

Marketers will probably be interested in how this free tool is able to track views and links, but that's not really the purpose behind ContextMiner's creation. Instead, the tool is designed more for research than anything else. For example, one of the main reasons to use ContextMiner is its ability to document the cultural phenomenon of viral videos.

Often, when a video goes viral, very few people are aware of where it came from, what the story is behind it, who created it and why. As time goes by, finding the original video creator and source is even harder as the video spreads across the internet. But now, thanks to ContextMiner, the history behind a video's creation is no longer a mystery.

Take, for instance, Vote Different, the mashup of Hillary Clinton with the famous Apple 1984 Super Bowl ad and one of the most popular videos on YouTube. We've probably all seen this video at one point or another, but did you ever want to know who created it and why?

With ContextMiner, that information can easily be discovered. Because of its ability to pull the inbound links to a video, we can see that the original creator of the video, a user by the name of "ParkRidge47," is the subject of one of the inbound links to the video. A blog post on TechPresident titled "Who is ParkRidge47?" gives us a great history of this particular video's creation. You could also sort through the links provided to find the very first person to link to the video, which is often the creator themselves.
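Finding the first site to link to a video amounts to taking the minimum over the dates at which each in-link was first seen. A minimal sketch follows, assuming each in-link is a record with a URL and a first-seen date; the record layout, the URLs, and the dates are all made up for illustration and are not ContextMiner's export format.

```python
from datetime import date

# Made-up in-link records; a real export would look different.
inlinks = [
    {"url": "http://example.org/news/roundup", "first_seen": date(2007, 3, 20)},
    {"url": "http://example.org/creator-post", "first_seen": date(2007, 3, 5)},
    {"url": "http://example.org/blog/reaction", "first_seen": date(2007, 3, 12)},
]


def earliest_inlink(links):
    """Return the in-link record with the oldest first_seen date."""
    return min(links, key=lambda link: link["first_seen"])


def chronological(links):
    """Return in-links ordered from oldest to newest."""
    return sorted(links, key=lambda link: link["first_seen"])


first = earliest_inlink(inlinks)
print(first["url"])
```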

ContextMiner is still under development. In the future, the developers hope to offer tools and policies for exporting the videos, blog pages, and metadata. That's probably not an empty promise - there's already an option to "download Flash video from YouTube" on the campaign creation form; it's just disabled right now. When that feature becomes available, we think it would be fine to then call ContextMiner a YouTube archiving tool. Until then, since "archive" implies making a backup copy, we think ContextMiner should really just be considered a research tool. Still, we have to say, it's a pretty good one.

http://www.readwriteweb.com/archives/researchers_create_youtube_archiving_tool.php Products Tue, 16 Dec 2008 06:21:27 -0800 Sarah Perez

Making the Web Searchable: The Story of SearchMonkey

Last week at the SemTech 2008 Conference that took place in San Jose, Yahoo! Researcher Peter Mika spoke in detail about the company's new SearchMonkey search platform initiative. Mika talked broadly about his work looking at metadata on the web, and how that led to the birth of SearchMonkey. This post is based on notes from that talk.

History of Web Page Annotations

The motivating question for Mika's presentation was: How can we make web search better by leveraging web annotation? There are many kinds of annotations, but Mika focused on simple data and lightweight semantics, and began by reviewing the history and evolution of annotations to explain how we got to where we are today.

One of the first methods of annotating HTML was Simple HTML Ontology Extensions (SHOE). This method allowed for the declaration of ontologies as well as relationships between the entities on HTML pages. The problem with it was that it introduced new tags that were not part of standard HTML and were not recognized by most browsers.

In 2003 Tantek Çelik started work on Microformats - a way to embed light semantics using XHTML. Microformats are now driven by a community of developers, which evangelizes existing formats and is working on new ones. The major focus of this effort is to leverage standards, but Microformats are limited because they don't share common syntax. Every microformat looks different, and there are no ontologies or schemas.

Things get particularly complicated when you start combining different Microformats, for example, when you describe that a person wrote a review at a particular event. In addition to this, Microformats have no concept of unique identity, and for this reason are largely incompatible with other Semantic Web efforts. Yet, Microformats took off and have become somewhat widespread. So, the takeaway here is that simple things can quickly gain adoption.

Another way of providing metadata that emerged recently is tagging. As an example, Flickr uses tags for photos to enable its users to annotate and describe the content. The problem with tags is that there is no agreement on meaning, so the same tag on Flickr and del.icio.us can mean different things, and there's no way to be sure which tag means what. Tags are a much more personal way of annotating information; they are not objective.

In 2005, Ian Davis, CTO of Semantic Web infrastructure company Talis, proposed eRDF - a form of RDF that can be embedded into HTML (compatible with HTML4). There is a simple mapping from eRDF to RDF so you can use any RDF/OWL vocabulary. But eRDF is not full RDF -- it has limitations. For example, there are no data types and there are no blank nodes. Also, each page can only "talk" about itself and not about other pages.

Finally, the W3C published RDFa, the latest embedding of RDF in XHTML, which has full RDF support. RDFa adds complexity in terms of implementation, but at the moment it offers the best way to embed RDF into HTML.

How Much Metadata is Out There?

Given the increasing trend towards web annotations, the natural question is: just how much metadata is already out there? Peter Mika set out to answer this question and created a prototype called Microsearch. The idea was to look at web pages and to see how much metadata was there. Beyond that, Mika was also interested in what type of metadata was present, as well as the ratio between annotated and plain HTML pages.

With the Microsearch exercise, Mika wanted to demonstrate what could be done to enhance search with this information. For each type of metadata, Mika augmented search results with additional links and information. For example, maps, events, information from hCard, etc. are presented in an enhanced way, unlike what we're used to seeing with today's search engines.

Mika discovered a few interesting things. First, about 53% of queries have 1 page with metadata in the top 10 results. However, lots of the data Mika saw was not clean and contained information that was not well formed, and performance was pretty poor due to lack of an index. So the unfortunate conclusion that Mika came to was that RDF templating was difficult and the approach was not easily scalable. Finally, Mika realized that metadata really needs to be on the page for users to see, because otherwise there is a big opportunity for semantic spam.
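The coverage figure above is the share of queries for which at least one of the top results carries metadata. The sketch below shows how such a share could be computed; the `has_metadata` flag, the function name, and the toy result lists are assumptions for illustration and have nothing to do with Mika's actual data or code.

```python
def share_with_metadata(results_by_query, top_n=10):
    """Fraction of queries with at least one annotated page among the top_n results."""
    if not results_by_query:
        return 0.0
    hits = sum(
        1
        for results in results_by_query.values()
        if any(result["has_metadata"] for result in results[:top_n])
    )
    return hits / len(results_by_query)


plain = {"has_metadata": False}
annotated = {"has_metadata": True}

# Toy data: ranked result lists for four made-up queries.
toy_results = {
    "query one": [plain, annotated, plain],
    "query two": [plain, plain],
    "query three": [annotated],
    "query four": [plain] * 10 + [annotated],  # annotated page is ranked 11th
}

coverage = share_with_metadata(toy_results)
print(coverage)
```

In the toy data, the fourth query does not count because its only annotated page falls outside the top 10.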

The Birth of SearchMonkey

The point of any experiment is to draw the right conclusions. Looking at the facts, Mika and the Yahoo! search team realized that they could not count on enhancing search by leveraging metadata on today's web - it simply does not exist to the extent needed. At the same time, it was clear that enhancing search results and cross linking them to other pieces of information on the web is compelling and potentially disruptive. Yahoo! realized that in order to make this work, they need to incentivize and enable publishers to control search result presentation. And thus, SearchMonkey was born.

SearchMonkey is a system that motivates publishers to use semantic annotations, and is based on existing semantic standards and industry standard vocabularies. It provides tools for developers to create compelling applications that enhance search results. The main focus of these applications is on the end user experience - enhanced results contain what Yahoo! calls an "infobar" - a set of overlays to present additional information. For example, with SearchMonkey, LinkedIn is able to surface additional information from the user profile, Netflix can present a blurb about the plot and a rating for a movie, and Barnes & Noble can embed a preview of a book.

SearchMonkey's aim is to make information presentation more intelligent when it comes to search results by enabling the people who know each result best - the publishers - to define what should be presented and how.

A Better Search Experience Ahead

This first version of SearchMonkey is just the first small step towards creating a better search experience. Much more is planned, but even with this first simple version, we can clearly see the power of semantics and annotations in web pages. By creating the right incentive for publishers and putting them in control, Yahoo! is aiming to raise the bar on search results, and, who knows, maybe even start attracting converts from Google's plain-looking results.

http://www.readwriteweb.com/archives/semtech_making_the_web_searchable_searchmonkey.php Semantic Web Tue, 27 May 2008 20:29:34 -0800 Alex Iskold

How YOU Can Make the Web More Structured

We have written a lot here about the vision of building a structured layer on top of the current web. Annotating billions of HTML documents in a bottom-up way or building top-down tools that can automagically interpret the existing information are the two approaches that we discussed. Together these approaches would result in a global database that would make the web even more connected. The ability to correlate content and concepts across web sites would reduce the time necessary for searching and would enable the discovery of related information.

In previous posts we discussed the difficulties with the bottom-up approach to the Semantic Web - a sophisticated form of annotating information using tools like RDF and OWL. Among the factors that impair the web-wide adoption of these tools are complexity and the lack of clear end user benefits.

On the other hand, the top-down approach that we discussed does not place any burden on content owners and delivers instant benefits to end users. Yet, the top-down tools run into a difficulty - interpreting raw information is not that simple. Typical solutions focus on a vertical, but still suffer from imperfections.

What if there was some minimal annotation in the content to help top-down tools interpret it? In this post we look at how content owners can implement simple annotation strategies which can help the top-down tools and search engines to make the web more structured.

Annotation Basics - Headers

It is striking how many sites today do not use meta tags in the head of the document to provide the bare minimum information about a page's content. Forget building a smarter web, this is just plain bad SEO practice. The work that is being put into generating great content can be offset by lack of a succinct, meaningful description of that content. Every page on the web should have the following information filled in:

  • title - a sentence briefly describing the site/page
  • description - a paragraph about the site/page
  • keywords - a list of keywords that describe the site/page

Note that it makes sense to provide different information for the root page and subsequent pages. For example, for a newspaper or a blog, the root page should provide information about the site at large, while individual article and post pages should contain information about that specific page, not the overall site.

The New York Times' web site provides a good example of how to properly use meta tags. For example, this article on Slowdown in US Growth includes the following meta data:

  • title - U.S. Growth Slowed Drastically in 4th Quarter
  • description - The economy expanded by a weak 0.6 percent in the latest indication of a substantial slowdown and perhaps a recession.
  • keywords - United States Economy,Gross Domestic Product
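To show how a top-down tool or crawler might read these three fields, here is a minimal sketch using Python's standard `html.parser`. The `PAGE` string is a reconstruction built from the values listed above, not the Times' actual markup, and the `HeadMetadataParser` class is a name made up for this example.

```python
from html.parser import HTMLParser

# Reconstructed head section; the real page's markup is an assumption.
PAGE = """<html><head>
<title>U.S. Growth Slowed Drastically in 4th Quarter</title>
<meta name="description" content="The economy expanded by a weak 0.6 percent in the latest indication of a substantial slowdown and perhaps a recession.">
<meta name="keywords" content="United States Economy,Gross Domestic Product">
</head><body></body></html>"""

REQUIRED = ("title", "description", "keywords")


class HeadMetadataParser(HTMLParser):
    """Collects the <title> text and every <meta name=... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and a.get("name") and a.get("content"):
            self.metadata[a["name"].lower()] = a["content"]

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


parser = HeadMetadataParser()
parser.feed(PAGE)
missing = [name for name in REQUIRED if not parser.metadata.get(name)]
keywords = parser.metadata["keywords"].split(",")
print(parser.metadata["title"])
print(keywords)
print(missing)
```

The `missing` list doubles as a quick audit: a page that leaves out any of the three basic fields shows up there by name.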

The New York Times is actually a great example of taking the basics of annotation and building on top of them. Each page includes an extended set of rich meta data, including the author of the article, the date it was published, thumbnail image URL, creator, category and even ticker symbols for public companies that are mentioned in the article. Certainly, the New York Times provides a really great set of information, perhaps even wider than needed for most content, but let's focus on the ones that should be used on a wider scale.

author: Web content is produced by people and for people. With the rise of social culture we are increasingly interested in finding bits of everyone's identity around the web. If something piqued your interest enough for you to blog or to write an article, at least you can put your name on it. Having people attached to content would allow seamless navigation from one to another. There is already a standard meta tag for this, with a suggestive name: author.

thumbnail: We love pictures. Since the launch of Flickr we can't live without them. Facebook's success owes a lot to photo sharing. With bandwidth becoming cheap, we are becoming increasingly visual. We do not want text; we want pictures. So if a news article or blog post contains an image, it is simple to do what the Times did - generate a meta tag for it. There is no standard meta tag as far as I know, but any of these would do: thumb, image, picture, thumbnail, etc.

date: As we become a real-time culture, the freshness of content becomes paramount. Tagging the page with a date is an important way of helping classify the page in time. Most blog posts and articles contain dates anyway, and having a standard date header would make it simple and obvious.

location: Location is becoming increasingly important as well. With GPS and widely available Internet access we can easily let people know where we are and take advantage of local services. If an article or a post is related to a specific location, there is a conventional way of annotating it. The technical term for annotating content with location information is geotagging. It generally means attaching a pair of latitude and longitude coordinates. A more relaxed form would be specifying country/region/city and is described in detail by the Geo microformat specification. While specifying exact position coordinates may be difficult, even something as simple as the geo header New York, NY would be very helpful.
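As a sketch of what emitting these headers could look like, the function below renders author, date, and location as meta tags. The `geo.placename` and `geo.position` names follow the widely used geo meta tag convention; the `date` name is an assumption (the post itself notes there is no agreed standard), and the helper names, author, and coordinates are made up for the example.

```python
from html import escape


def meta_tag(name, content):
    """Render one <meta> element with both attribute values escaped."""
    return f'<meta name="{escape(name, quote=True)}" content="{escape(content, quote=True)}">'


def page_annotations(author, published, placename, lat=None, lon=None):
    """Build author/date/location headers; 'date' is an assumed, non-standard name."""
    tags = [
        meta_tag("author", author),
        meta_tag("date", published),
        meta_tag("geo.placename", placename),
    ]
    # Exact coordinates are optional; a place name alone is already helpful.
    if lat is not None and lon is not None:
        tags.append(meta_tag("geo.position", f"{lat};{lon}"))
    return tags


# Hypothetical author and approximate coordinates for New York.
tags = page_annotations("Jane Doe", "2008-01-30", "New York, NY", 40.7128, -74.006)
for tag in tags:
    print(tag)
```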

Tags in Blog Posts

The concept of tagging, which was popularized by services like del.icio.us and Flickr, is now commonly understood and is ubiquitous. The idea of humans tagging content to categorize it and later to find it is a simple, yet important bit of the web infrastructure. Most major blogging platforms support tags. The tags are standardized based on the rel-tag microformat. You can see the implementation on ReadWriteWeb - each post is tagged with a set of tags.

For example, one of our recent posts contains this tag:
<a href="http://www.readwriteweb.com/tag/twitter" rel="tag">twitter</a>
The tag has several benefits:

  • Readers can instantly click to find other posts with this tag
  • Search engines can better classify the content
  • Semantic tools can offer additional services such as finding related content, pictures, and video
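Because rel-tag links are ordinary anchors, a search engine or semantic tool can collect them with very little code. The sketch below wraps the tag link shown above in a made-up paragraph and extracts the tag name from the last path segment of the link, which is where the rel-tag convention places it; the `RelTagParser` class name is our own.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

# The anchor is the one from the post; the surrounding paragraph is made up.
POST = (
    '<p>Filed under '
    '<a href="http://www.readwriteweb.com/tag/twitter" rel="tag">twitter</a>'
    ' and <a href="http://www.readwriteweb.com/about">about us</a></p>'
)


class RelTagParser(HTMLParser):
    """Collects tag names from <a rel="tag"> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        rel_values = (a.get("rel") or "").split()
        if tag == "a" and "tag" in rel_values and a.get("href"):
            # The tag name is the last path segment of the link target.
            path = urlparse(a["href"]).path.rstrip("/")
            self.tags.append(unquote(path.rsplit("/", 1)[-1]))


parser = RelTagParser()
parser.feed(POST)
print(parser.tags)
```

Only the first link is picked up; the second has no `rel="tag"` and is ignored.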

Tags are similar in principle to keywords, but provide more flexibility because they are inside the post and can have richer content. In principle, it could be possible to add more information into the keywords meta tag in the head of the document, but it has existed in its current form for more than a decade and is thus unlikely to change. In any case, all modern blogging platforms make it trivial to tag content, so there should be no excuses.

Standardizing Blog Templates Across Platforms

In the nineties people created web sites. These days only companies have web sites; individuals have blogs and social network profiles. There is a great opportunity to standardize and structure the information because blogs and profiles are based on templates. Consider a common structure for each blog: one or a few sidebars and a central area for the content. In the content area, on a post page, there is a post body, date, author and tags - a minimum set of elements.

Why not standardize on a few things here?

  • <div class="post"> - a container for the post body
  • <div class="sidebar"> - a container for the sidebar
  • <div class="author"> - a container for the author
  • <div class="date"> - a container for the date
  • <div class="tags"> - a container for the tags
  • <div class="comments"> - a container for the comments

Platforms already do have very similar things in place and standardizing between them is rather simple. In no way would this be a competitive advantage or disadvantage to them, but it would be a big help towards making the web more structured. Extending on these basics, it would also be helpful if widgets were wrapped into standard enclosures. A simple widget tag can go a long way toward distinguishing widgets from the other content in the sidebar.
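If platforms did agree on class names like those listed above, a crawler could pull a post apart without any per-site rules. The sketch below assumes exactly those class names on `div` elements; the sample page, the author name, and the `BlogTemplateParser` class are made up for illustration.

```python
from html.parser import HTMLParser

# The container classes proposed in the list above.
KNOWN = {"post", "sidebar", "author", "date", "tags", "comments"}

# Made-up page that follows the proposed convention.
PAGE = """
<div class="post">Microformats are <em>catching on</em>.</div>
<div class="author">Jane Doe</div>
<div class="date">2008-01-30</div>
<div class="tags"><a href="/tag/microformats" rel="tag">microformats</a></div>
<div class="sidebar"><div>Recent posts</div></div>
"""


class BlogTemplateParser(HTMLParser):
    """Groups text by the innermost enclosing <div> that has a known class."""

    def __init__(self):
        super().__init__()
        self.sections = {}
        self._stack = []  # one entry per open <div>: its known class, or None

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            self._stack.append(next((c for c in classes if c in KNOWN), None))

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        for name in reversed(self._stack):
            if name:
                self.sections[name] = self.sections.get(name, "") + data
                break


parser = BlogTemplateParser()
parser.feed(PAGE)
sections = {name: text.strip() for name, text in parser.sections.items()}
print(sections)
```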

If blogging platforms standardized on these basic conventions, major newspapers would likely follow as well.

The situation with social network profiles is different, as the information contained in them is not public. In addition, there is a competitive advantage to Facebook in having its own proprietary structure. However, entities like the DataPortability group have been created precisely to deal with this problem and Facebook just joined. So we may yet see some progress on that front.

Beyond Basics - Microformats

The annotations that we discussed up to now are very basic and would require a minimum amount of work from newspapers, bloggers, and blogging platforms to deploy. Their advantage is that they are simple to implement but would deliver big bang for the buck. Yet, these are primitive ways to annotate content. The next step is to use bottom-up technologies like microformats, which offer a way to embed objects into HTML documents in a compact way.

Microformats have been around for a few years and have certainly caught the attention of some. Several major services are using microformats. For example, Flickr is using the geo microformat and headers to geotag photos. Eventful uses the hCalendar format to describe meta data for each event. Blogger pages contain hCards for each blogger. But the problem is that there needs to be more and better integration of microformats into the blogging platforms. For example, coming back to the Blogger hCard, right now most of them are not useful because the platform does not require people to fill in information and just generates the card based on the login. This does more harm than good, as semantic tools cannot take advantage of such cards and they do not look good to people either.

Similarly, there is not much support for geotagging photos and event microformats in the platforms. But even beyond the lack of support, the limitation of the current microformat specs is that they do not cover the basic range of things that people discuss on the web - books, music, movies, recipes, and restaurants are all noticeably absent (the existing hReview microformat does not have a way to express the type of the object or the attributes).

But it does look like, with a bit of a push on both the community behind the microformat specs and the blogging platforms, we could see microformats becoming a major way of annotating information inside blog posts. This would be a welcome development and would allow a large subset of the web - the blogosphere - to become quite structured.

Conclusion

The vision of the structured web is big and compelling and at the same time is hard to attain. At times, it is difficult to see how we can ever get there. But on some days we think that even if the web could be just a tiny bit more structured it would become so much more connected. And so in this post we considered a set of very basic bottom-up techniques that newspapers, bloggers, and blogging platforms can put in place to make the web more structured.

Putting meta information into page headers is easy and should be a must-do thing for everyone. Beyond that, providing information such as author, date, and location makes data that much more valuable. And if blogging platforms could also standardize on the key elements of the pages, crawlers and intelligent browsing tools could do a better job making sense of the content. Beyond that, microformats are the front runner in annotating the web with meta information about things, but they still need more pushing and effort.

What do you think about these basic structures? Are you going to fix up your blog after reading this post? What other things should we push to standardize on?

http://www.readwriteweb.com/archives/structured_web_microformats_tagging_meta_data.php Trends Wed, 30 Jan 2008 22:48:55 -0800 Alex Iskold