Dapper - ReadWriteWeb http://www.readwriteweb.com/feeds/tag/Dapper Copyright 2012 Richard MacManus readwriteweb@gmail.com Tue, 14 Feb 2012 12:30:00 -0800

Delicious's Data Policy is Like Setting a Museum on Fire

Yahoo! is going to shutter its social bookmarking service Delicious, the web learned today, and with it will sink an incredibly valuable source of collectively curated knowledge. You can easily export your own bookmarks (no verdict yet on where we should all meet up to import them) but what if you want to export other people's? That's at least half the value of the service: socially curated discovery.

Tonight I thought I'd go loot a little from a burning building owned by a company not interested in putting out the fire. Specifically, I went to extract the top 50 links to pages that had been tagged by users with both the words "Twitter" and "International". Where else are you going to find a reading list of the best collected written works and other multimedia about almost any given topic? Unfortunately, automated extraction is blocked by the site and the rickety, antiquated API appears focused on returning you little more than your own bookmarks. If there's a clear way to accomplish export of not just my bookmarks, but all bookmarks with one or more tags, from all users - I haven't been able to find it yet.

Update: 24 hours later, Yahoo! has issued a statement saying they would like to sell, not close, Delicious.

gabetweet.jpg
Above: Gabe Rivera, founder of Techmeme, puts Delicious in historical context.

Doing it DIY Style

Using data extraction tools like Needlebase (read all about it) or Yahoo's own Dapper (our coverage), I could have extracted every link ever tagged Twitter and International on Delicious in about 120 seconds. Then I could have pulled down all the links to articles that Delicious users have categorized as worth saving and pertaining to Portland and History. Then maybe Mapping and Gender. Then dogs and jokes or beans and flatulence.

Or whatever. But can I? Nope, Yahoo! blocks all automated extraction of data from Delicious. The company apparently is going to let this unique cross between a museum, a library and a crazy old collector's attic burn to the ground. I'd like to take a few things with me before that happens, please.

In the screenshot above, for example, I'm trying to use Needlebase to go through every page of links bookmarked with both the Twitter and International tags and pull down the titles, links, tags and number of times each link has been bookmarked. Then I would export the results as CSV, import them into Google Docs, sort by the number column and archive, say, the top 50 articles about international use of Twitter. I could do that in about 2 minutes, if Delicious would allow it. And then I'd move on to another topic and do the same thing. But can I? No! Right now at least, the site prohibits automated extraction of its data. I sure hope Yahoo! changes its mind. Otherwise, there goes an incredible body of curated knowledge. What a sick tragedy.
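The sort-and-keep-the-top-50 step described above can be sketched in a few lines of Python. This is only an illustration of the spreadsheet step, not anything Needlebase or Delicious provides: the column names (`title`, `url`, `tags`, `saves`) and the sample rows are hypothetical.

```python
import csv
import io


def top_bookmarks(csv_text, n=50):
    """Return the n most-bookmarked rows from CSV text, highest count first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # CSV values arrive as strings, so convert the count before sorting.
    rows.sort(key=lambda row: int(row["saves"]), reverse=True)
    return rows[:n]


# Hypothetical sample export; these are not real Delicious records.
SAMPLE = """title,url,tags,saves
Post A,http://example.com/a,twitter international,12
Post B,http://example.com/b,twitter international,340
Post C,http://example.com/c,twitter international,57
"""

top_two = top_bookmarks(SAMPLE, n=2)
print([row["title"] for row in top_two])
```

Sorting on the numeric count rather than the raw string matters here: as text, "57" would sort ahead of "340".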


How has RWW used Delicious? See our write-up: R.I.P. Delicious: You Were So Beautiful to Me
One community of non-profit technologists has been bookmarking links with the tag "NPTech" for years - they have 24,028 links categorized as relevant for organizations seeking to change the world and people's lives using technology. Wouldn't it be good to have that body of data, metadata and curated resources available elsewhere once Delicious is gone?

What someone probably ought to do, as Karl Long said to me on Twitter today, is scrape all the public bookmarks and data and put it on Bittorrent. That would be against the rules, though.

Please, please Yahoo! let us save some of what you've got, before it goes to waste.

http://www.readwriteweb.com/archives/deliciouss_data_policy_is_like_setting_a_museum_on.php Analysis Thu, 16 Dec 2010 19:02:58 -0800 Marshall Kirkpatrick
Awesome: DIY Data Tool Needlebase Now Available to Everyone

If you've been within shouting distance of me over the last month, you've probably heard me singing the praises of Needlebase, a great new point-and-click tool for extracting, sorting and visualizing data from across pages around the web. I've been using it for all kinds of things and now you can too.

When we first reviewed Needle here on ReadWriteWeb, it was in closed beta and new users had to request an account. Now it's open and available for all: free for personal use or by subscription for commercial use. Check out some examples of ways I've used this exciting new technology below.

Needlebase allows you to view web pages through a virtual browser, then point and click to train it to understand which fields on that page are of interest to you and how those fields relate to each other. Then the program goes and scrapes the data from all of those fields, publishes them into a table, list or map, and recommends merges of cells that appear to be mistakenly separate. It's very cool and it lets non-technical people do things with data quickly and easily that used to require the assistance of someone more technical.

For example, I've already used Needle to do the following. But first, the official Needle demo video...

And here are a few ways I've used Needlebase so far.

Investigative journalism

Last month a local newspaper reported that a big new data center had opened in Salt Lake City with a mystery anchor client. The paper believed the client was Twitter, as the company has said it was going to open its first off-site data center in Utah at an undisclosed date.

We used Needlebase to look at all the tweets from people on the Twitter list of Twitter staff members and extract the username, message body and location, if exposed. Needlebase scraped the last 1500 Tweets in less than 5 minutes. We displayed them on a map and saw that there was just one Tweet published in that time from Utah: a Twitter Site Operations Technician who had just left San Francisco to move to Salt Lake City, complaining about Qwest router problems. That wasn't quite confirmation, but it sure felt like a valuable clue and was very easy to come by thanks to Needlebase.

Data Re-Sorting

Last night I found a solution to a long-running issue I've been struggling with. I've got this list of 300 blogs around the web that cover geotechnology (that's a whole other story) and have them all run through Postrank. That service ranks them in order of most to least social media and reader engagement per blog post.

Wouldn't it be great to extract that data over time, to track it and to turn it into blog posts? I think it would. I couldn't figure out how to get all the data out that I wanted though.

Enter Needlebase. Last night I pointed Needle to my Postrank pages for geotech blogs and in minutes it pulled down all the data I wanted. I exported that data as a CSV, uploaded it to Google Docs as a spreadsheet, did a little subtraction and now have the following chart tracking the top 300 geotech blogs on the web. Now in my handy spreadsheet, I was able to set up a function to show me which blogs jumped or fell in the rankings the most over the previous week. Thanks, Needlebase!
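The "little subtraction" that shows which blogs jumped or fell the most can be sketched as follows. This is a minimal illustration in Python rather than a spreadsheet function; the blog names and ranks are made up, and the function names are hypothetical.

```python
def rank_changes(last_week, this_week):
    """Map each blog to (previous rank - current rank); positive means it rose."""
    changes = {}
    for blog, rank_now in this_week.items():
        # Blogs with no rank last week have no change to report.
        if blog in last_week:
            changes[blog] = last_week[blog] - rank_now
    return changes


def biggest_movers(changes, n=3):
    """Return the n blogs with the largest rise or fall, largest move first."""
    return sorted(changes.items(), key=lambda item: abs(item[1]), reverse=True)[:n]


# Hypothetical rankings (1 = most engagement per post).
last_week = {"geo-a": 1, "geo-b": 2, "geo-c": 3, "geo-d": 4}
this_week = {"geo-a": 4, "geo-b": 2, "geo-c": 1, "geo-d": 3, "geo-e": 5}

changes = rank_changes(last_week, this_week)
print(biggest_movers(changes, n=2))
```

Because rank 1 is the top spot, the subtraction is previous minus current, so a blog that climbs gets a positive number.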

Event Preparation

I've written here about how to use Mechanical Turk to get ready and rock an industry event. Needlebase can prove useful for that as well.

My wife Mikalina, for example, has used Needle to extract the session titles, speakers, topic tags and more information about all of the SXSW Interactive sessions that have been announced so far. The sky's the limit on what could be done using that.

Other Uses

There are all kinds of other ways that a tool like this can be used. There is a learning curve, but it's nothing compared to what it would take to learn to do this kind of work programmatically. When we first reviewed Needlebase, beta invites had to be requested by email. We got emails, which were then forwarded to the company, from a wide variety of people. A Japanese potter, a local yarn store owner, a Geocacher who wanted to organize his online geocaching information and enterprise mobile app developers, for example. We got an email from a publisher who wanted to scrape their website for place names and see what parts of the world they cover the most and least.

Needlebase was built as a side project of travel search company ITA Software. Google is currently in legal negotiations to acquire ITA (the US government isn't sure it wants Google to own travel search too).

What will Google do with Needlebase if it gets its hands on it? I'm much more interested in hearing what you are going to do with it, now that anyone can use it.

The DIY Data Hackers Toolkit

I put Needle in my mind in between two other wonderful tools. On one end of the spectrum is the now Yahoo-acquired Dapper, which anyone can use to build an RSS feed from changes made to any field on any web page. (See: The Glory and Bliss of Screen Scraping and How Yahoo's Latest Acquisition Stole and Broke My Heart)

On the other end of the spectrum is the brand-new Extractiv, a bulk web-crawling and semantic analysis tool that's also remarkably easy to use. Earlier this month I used Extractiv to search across 300 top geotech blogs for all instances of the word "ESRI," all entities mentioned in relation to ESRI and the words used to describe those relations. The service processed 125,000 pages and spit out my results in less than an hour for less than a dollar. That's incredible - it's a game changer.

Needlebase is too. It sits somewhere in between Dapper and Extractiv, I think. These tools are democratizing the ability to extract and work with data from across the web. They are to text processing what blogging was to text publishing.

I'll stop now so you can go and start learning to use Needlebase. Let me know what cool things you figure out how to use it for.

http://www.readwriteweb.com/archives/awesome_diy_data_tool_needlebase_now_available_to.php Data Services Tue, 30 Nov 2010 13:24:48 -0800 Marshall Kirkpatrick
How Yahoo's Latest Acquisition Stole & Broke My Heart

"What do you think about Dapper?" That was the question it felt like everyone asked me for weeks after I wrote up a startup called Dapper.net on TechCrunch in the Summer of 2006. "Create an API for any website!" was the company's unofficial slogan. Almost no one understood exactly what could be done with this powerful point-and-click tool, but everyone I talked to knew it was exciting.

Last week the company was acquired by Yahoo and brief press coverage of the deal called Dapper simply a semantic advertising platform. It was so much more than that, especially for me. Dapper set my imagination on fire, it powered acts of community management magic and it helped me meet Neil Young in person. We spent many long nights together. Four years after I first wrote about it, I still bring Dapper up in conversation frequently - but for a while now it's been part of a story of heartbreak and caution.

What Dapper Does

Here's how I described the core service when it launched, in August 2006:

Here's how it works. Users identify a web site they are interested in extracting data from and view it through the Dapper virtual browser. [Co-founder Jon] Aizen showed me how to do it using Digg as an example. I clicked on a story headline, on the number of diggs and the URL field. I went to another page on the same site and did the same thing so that Dapper could clearly identify the fields I was interested in.

I then went through the various tools available on the site to set certain conditions and thresholds and ended up with XML feeds I could do all kinds of things with. Like send me an email whenever there's a TechCrunch story on the front page of Digg, or when a search results page shows a TechCrunch story with more than 10 diggs. After I create an end product through the site, other users will be able (after a 24 hour period in which I can edit the project) to use my project either as is, altered to fit their needs or in the future, in combination with other projects.
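The kind of condition described above (a TechCrunch story with more than 10 diggs) amounts to a filter over extracted fields. Here is a minimal Python sketch of that idea; it is not Dapper's interface, and the `Story` type, function name and sample data are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Story:
    headline: str
    url: str
    diggs: int


def stories_to_alert(stories, domain, min_diggs):
    """Keep stories hosted on `domain` whose digg count exceeds `min_diggs`."""
    matches = []
    for story in stories:
        host = urlparse(story.url).hostname or ""
        # Accept the bare domain and any subdomain such as www.
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and story.diggs > min_diggs:
            matches.append(story)
    return matches


# Hypothetical extracted rows, standing in for the fields clicked on in the browser.
stories = [
    Story("Startup launches", "http://www.techcrunch.com/post-1", 42),
    Story("Quiet day", "http://www.techcrunch.com/post-2", 3),
    Story("Other site", "http://example.com/post-3", 500),
]

for story in stories_to_alert(stories, "techcrunch.com", 10):
    print(story.headline, story.diggs)
```

In a real workflow the matching stories would feed an email or RSS step; printing stands in for that here.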

Below, a 4 minute video demonstrating Dapper that I recorded on New Year's Eve 2007, after Wired Magazine wrote a post slamming web scraping. I had a sore throat, it was a holiday (on the next New Years I eloped to my living room and got married) but it was important that scraping be defended - a screencast had to be made! It was important.

In February, 2008 the startup held an event called DapperCamp in San Francisco. It was sponsored by IBM and MindTouch, because those and other companies were exploring ways to move data around from static websites into dynamic processes using Dapper.

The event was fabulous. I was the least technical person there, but I flew down my young cousin on my Dad's side, a developer in training, for his first experience in the Bay area web geek scene. We had a great time and worked late into the night sitting in a little bar brainstorming ideas and scraping feeds from websites.

Our best idea was this: Yahoo's service MyBlogLog tracked users who navigated to any participating website and upon visiting a site for the 3rd time, a user appeared in a field labeled "New Fans" on your site's MyBlogLog page. We used Dapper to scrape an RSS feed of the usernames of all the new people appearing as fans, people who had just made their 3rd visit to ReadWriteWeb, and we set up a workflow to email those people and welcome them to the community here. It was awesome.

We scraped a feed of the most bookmarked ReadWriteWeb pages in Delicious, a feed of RWW stories submitted to Digg and the number of Diggs they had. We monitored those feeds in a dashboard. We scraped feeds, .csv files, image slideshows and more. It was wonderful.

How Dapper Helped Me Meet Neil Young

I love Neil Young, I always have. In my early twenties I hitchhiked all around the country listening to a tape I'd recorded of Neil Young's Greatest Hits albums that I'd checked out from the public library, until the tape was worn out and unlistenable. It was my personal soundtrack for years.

Years later, I work on the internet. In my personal consulting practice, I used Dapper a lot. I used it in working with a group of accountants to scrape feeds of news updates posted to old-fashioned government agency websites that had no feeds. I once subcontracted as a consultant to a consultant to a consultancy to an analyst service serving a pharmaceutical company. (I thought that was far enough removed that I wouldn't get any on me, but none the less at my first meeting an executive said to me ominously "welcome to Big Pharma.")

It turns out the client at the end of the long pipeline of invoices sold a diet pill, and young women were complaining on MySpace and forums that the pill sometimes caused leakage from their...and I showed the next consultant in line how to use Dapper to scrape the forums for a feed monitoring said customer complaints. The check cleared and I never went back, but I still thank Dapper for making that work possible. If stranger things were ever piped through the service, I don't know what they were.

And then Dapper helped me meet Neil Young, in person.

I was working on a blog monitoring project for Sun Microsystems, building a web page that displayed the most recent and the most-talked about blog posts from around the web about 12 different Sun technologies, for use during the company's huge user conference.


As a part of that work, I was grabbing a feed from Google Blogsearch for long search queries like "Sun+Java-Indonesia...." etc. Google Blogsearch's own RSS feeds were all full of cruft, though. HTML bolding the search terms in the description field, and more. Not being a developer myself, I couldn't figure out how to strip that all out. I spent several nights pulling out my hair, worried I wouldn't be able to create something that was production-ready for this big client.
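For readers who do write code, the cruft problem described above (bold tags around search terms in a feed's description field) can be handled with Python's standard `html.parser` module. This is a sketch of one way to do it, not what was used on the project; it assumes the description has already been read out of the feed as an HTML string.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text content, discarding every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(description):
    """Return the description with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(description)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


# Hypothetical description field, in the style of highlighted search results.
raw = "New <b>Sun</b> <b>Java</b> release &amp; notes"
print(strip_markup(raw))
```

Character references such as `&amp;` are decoded by the parser, so the cleaned text is ready to display as is.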

I tried Yahoo Pipes, I tried other blog search engines, but what I ended up doing was using Dapper to scrape a new feed from the search results pages. Those feeds were nice and clean to display on the project website.

This wasn't an easy thing to figure out. I tried many different strategies before discovering that, with help from the guys at Dapper even. As the project proceeded, my contact at Sun came to me and said (paraphrasing) - "Marshall, it looks like you're going to be able to pull this off after all, but I wonder if you could add one more search query and module to the end product. It is very, very top secret though and you cannot tell anyone about it."

I said of course I could do that, what was the search query?

"Neil Young," she said.

Of course I was more than happy to do that. It turned out that the big splashy secret announcement at Sun's conference was that Neil Young was going to make a surprise appearance on stage to unveil the first ever collection of his entire life's work, including letters he'd written, scanned-in notes from studio recording sessions, video interviews and of course all his music. All those materials would be made available on Blu-ray, the media storage format whose players are all required to run Sun's Java software.

I built a long search query that would automatically deliver the best feed of search results about Neil Young's news that I knew nothing about yet, and included it in my deliverables. The project was completed days before the big conference and it was exhausting.

Just before the conference began, my Sun contact called me and said, "can we fly you down to the event for an interview with Neil Young as thanks for all your hard work?"

And that's how Dapper made it possible for me to meet Neil Young. We talked about electric cars (his new passion), about MP3 audio quality, about DRM and more. It was great.

I used Dapper for many, many different things. I still use it regularly (I used it last night, in fact) and if I could stop time and geek out for an evening with no obligations, I'd still probably spend that time playing with Dapper or the similar new tool NeedleBase.

Isn't That Just an Ad Network?

When Dapper was acquired by Yahoo last week, all the news coverage was brief and called the service a semantic advertising platform. How tragic! Co-founder Eran Shir wrote last week about the acquisition and said that the Dapper team always envisioned themselves making the display advertising world a more meaningful place. If that's true I'm disappointed. That sure wasn't what the service's earliest adopters wanted to use it for.

In February 2008 Dapper announced at its DapperCamp event that it would be launching an advertising technology. The Dapp Factory, as it was called, would no longer just be used to extract data for an undetermined purpose - it would be used to target contextual relevance for ad placement.

A mere 35,000 "Dapps" to perform extraction had been built and the company was struggling to be financially viable. It was a confusing service with a challenging interface on top of a radically new user paradigm. The only clear solution was to become an ad network. To fund the semantic indexing of text fields around the web by turning some of them into advertisements.

It's cool. I'm ad-supported. But Dapper had promised more than that. It had promised to be an easy and powerful tool that anyone, with no technical skills, could use to render any web page dynamic, to monitor particular fields in pages for updates automatically, to pull sets of data off of pages around the web. It's magic.

It was beautiful, but people didn't want it, they didn't understand it. Because people are stupid. It's maddening. If you tell people: take this tool, use it to get real-time notifications of changes to the tiniest part of any web page, use it to pull down sets of data from the web with a snap of your fingers, use it to work fast and get first movers' advantage. Scrape, then grab the fruits of that scraping, then enjoy a fast-growing career and meet your childhood musical heroes! But no, if there's an unclear step between a technology of empowerment and profit, a step that requires creativity and hard work, then the market at large throws a fit and demands that profit be instead put directly into its spoiled-child's hand. "I want an ad network!" people say, effectively, "Give me the money directly!"

Dapper as Parable

A beautiful web technology is like a little fairy, whose light shines bright for a short time and then extinguishes. Enjoy it while you can, until an uncaring market starves it to death and it turns into an ad network, for lack of viable alternatives.

Dapper still lets you scrape feeds using its legacy product. Hopefully Yahoo won't shut that down, if it allows any of the service to survive. But imagine how much more powerful (and stable) this beautiful service might have been if the company could have found a way to monetize its core feed scraping and publishing product. If that had remained the top development priority.

The same thing happens time and time again. "Your technology is too wonderful," I sometimes tell the most inspiring startups I interview before they launch. "No one will understand how to capture the incredible value you deliver. Your sales people will pound their heads against a wall for months. And then you will become an ad network."

Companies laugh uneasily. Perhaps because they know how likely it is that I'm right. (Perhaps because they think I'm a creep who ought to be perfectly happy for them if they can manage to build a viable ad network.)

I told Factery Labs that when I saw its demo. That startup provides an API that you can throw any URL at and get in response a feed of "fact-type sentences" extracted from the text behind the submitted URL. It's awesome. Twitter client Sobees, for example, uses it to offer text summary previews of any links shared by your friends on Twitter. It's great - but what are the odds that Factery is going to turn into an ad network? I think they are pretty good.

I told the company that and they said, "what's your shirt size?"

I told them, and a week later a package showed up at my door from Cafe Press. In it was a hooded sweatshirt with the Factery Labs robot logo screen printed on the back of it. Around the logo circled the words: "Factery Labs - Not an Ad Network Yet."

It's a cautionary tale - tell people that anyone can blog or Tweet, post a photo or a video, and you will change the world. Tell people that anyone can now extract text and data, process it automatically and treat web content like bowling pins, torches and knives in a capable juggler's hands. Not enough people, at least so far, will care. You will likely become an ad network.

Maybe that will change someday. Or maybe these freaky little services will remain forever like short-lived fairies, destined to be extinguished before their time.

Either way, I had a lot of great times with Dapper. I hope that technology like it will never stop being born.

http://www.readwriteweb.com/archives/when.php Analysis Fri, 15 Oct 2010 14:30:51 -0800 Marshall Kirkpatrick
Semantic Web Patterns: A Guide to Semantic Technologies

In this article, we'll analyze the trends and technologies that power the Semantic Web. We'll identify patterns that are beginning to emerge, classify the different trends, and peek into what the future holds.

In a recent interview Tim Berners-Lee pointed out that the infrastructure to power the Semantic Web is already here. ReadWriteWeb's founder, Richard MacManus, even picked it to be the number one trend in 2008. And rightly so. Not only are the bits of infrastructure now in place, but we are also seeing startups and larger corporations working hard to deliver end user value on top of this sophisticated set of technologies.

Editor's note: Looking back over 2008, there were some posts on ReadWriteWeb that did not get the attention we felt they deserved - whether because of timing, competing news stories, etc. So in this end-of-year series, called Redux, we're resurrecting some of those hidden gems. This is one of them, we hope you enjoy (re)reading it!

The Semantic Web means many things to different people, because there are a lot of pieces to it. To some, the Semantic Web is the web of data, where information is represented in RDF and OWL. Some people replace RDF with Microformats. Others think that the Semantic Web is about web services, while for many it is about artificial intelligence - computer programs solving complex optimization problems that are out of our reach. And business people always redefine the problem in terms of end user value, saying that whatever it is, it needs to have simple and tangible applications for consumers and enterprises.

The disagreement is not accidental, because the technology and concepts are broad. Much is possible and much is to be imagined.

1. Bottom-Up and Top-Down

We have written a lot about the different approaches to the Semantic Web - the classic bottom-up approach and the new top-down one. The bottom-up approach is focused on annotating information in pages, using RDF, so that it is machine readable. The top-down approach is focused on leveraging information in existing web pages, as is, to derive meaning automatically. Both approaches are making good progress.

A big win for the bottom-up approach was the recent announcement from Yahoo! that their search engine is going to support RDF and microformats. This is a win-win-win for publishers, for Yahoo!, and for customers - publishers now have an incentive to annotate information because Yahoo! Search will be taking advantage of it, and users will then see better, more precise results.

Another recent win for the bottom-up approach was the announcement of the Semantify web service from Dapper (previous coverage). This offering will enable publishers to add semantic annotations to existing web pages. The more tools like Semantify that pop up, the easier it will be for publishers to annotate pages. Automatic annotation tools combined with the incentive to annotate the pages is going to make the bottom-up approach more compelling.

But even if the tools and incentive exist, making the bottom-up approach widespread is difficult. Today, the magic of Google is that it can understand information as is, without asking people to fully comply with W3C standards or SEO optimization techniques. Similarly, top-down semantic tools are focused on dealing with imperfections in existing information. Among them are the natural language processing tools that do entity extraction - such as the Calais and TextWise APIs that recognize people, companies, places, etc. in documents; vertical search engines, like ZoomInfo and Spock, which mine the web for people; technologies like Dapper and BlueOrganizer, which recognize objects in web pages; and Yahoo! Shortcuts, Snap and SmartLinks, which recognize objects in text and links.

[Disclosure: Alex Iskold is founder and CEO of AdaptiveBlue, which makes BlueOrganizer and SmartLinks.]

Top-down technologies are racing forward despite imperfect information. And, of course, they benefit from the bottom-up annotations as well. The more annotations there are, the more precise top-down technologies will get - because they will be able to take advantage of structured information as well.

2. Annotation Technologies: RDF, Microformats, and Meta Headers

Within the bottom-up approach to annotation of data, there are several choices for annotation. They are not equally powerful, and in fact each approach is a trade off between simplicity and completeness. The most comprehensive approach is RDF - a powerful, graph-based language for declaring things, and attributes and relationships between things. In a simplistic way, one can think of RDF as the language that allows expressing truths like: Alex IS human (type expression), Alex HAS a brain (attribute expression), and Alex IS the father of Alice, Lilly, and Sofia (relationship expression). RDF is powerful, but because it is highly recursive, precise, and mathematically sound, it is also complex.
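The three kinds of statement above (type, attribute, relationship) all share one shape: subject, predicate, object. A minimal sketch in Python, using plain tuples rather than a real RDF library, shows how the same structure holds all three and can be queried. The predicate names (`is_a`, `has`, `father_of`) are invented for illustration; real RDF identifies subjects and predicates with URIs.

```python
# Each fact is a (subject, predicate, object) triple.
triples = {
    ("Alex", "is_a", "Human"),         # type expression
    ("Alex", "has", "Brain"),          # attribute expression
    ("Alex", "father_of", "Alice"),    # relationship expressions
    ("Alex", "father_of", "Lilly"),
    ("Alex", "father_of", "Sofia"),
}


def objects(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`, sorted by name."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(triples, "Alex", "father_of"))
print(objects(triples, "Alex", "is_a"))
```

Because every fact has the same shape, two collections of triples from different sources can be merged with a simple set union, which is a small-scale picture of the interoperability benefit.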

At present, most use of RDF is for interoperability. For example, the medical community uses RDF to describe genomic databases. Because the information is normalized, the databases that were previously silos can now be queried together and correlated. In general, in addition to semantic soundness, the major benefit of RDF is interoperability and standardization, particularly for enterprises, as we will discuss below.

Microformats offer a simpler approach by adding semantics to existing HTML documents using specific HTML class names. The metadata is compact and is embedded inside the actual HTML. Popular microformats are hCard, which describes personal and company contact information, hReview, which adds meta information to review pages, and hCalendar, which is used to describe events.
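To make the hCard idea concrete: its properties are carried in ordinary HTML `class` attributes such as `vcard`, `fn` (formatted name) and `org`. The sketch below reads a few of them with Python's standard `html.parser`. It is a deliberately simplified reader that assumes flat, un-nested markup, not a complete microformats parser, and the sample contact is made up.

```python
from html.parser import HTMLParser

# A small subset of hCard property names, chosen for illustration.
HCARD_PROPERTIES = {"fn", "org", "tel"}


class HCardParser(HTMLParser):
    """Collect the text of elements whose class names an hCard property."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = [c for c in classes if c in HCARD_PROPERTIES]
        self._current = matches[0] if matches else None

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


SAMPLE = (
    '<div class="vcard">'
    '<span class="fn">Jane Doe</span>, '
    '<span class="org">Example Co</span>'
    "</div>"
)

parser = HCardParser()
parser.feed(SAMPLE)
parser.close()
print(parser.card)
```

The same page still renders normally in a browser; the class names add meaning without changing what a human reader sees.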

Microformats are gaining popularity because of their simplicity, but they are still quite limiting. There is no way to describe type hierarchies, which the classic semantic community would say is critical. The other issue is that microformats are somewhat cryptic, because the focus is to keep the annotations to a minimum. This, in turn, brings up another question of whether embedding metadata into the view (HTML) is a good idea. The question is: what happens if the underlying data changes when someone makes a copy of the HTML document? Nevertheless, despite these issues, microformats are gaining popularity because they are simple. Microformats are currently used by Flickr, Eventful, and LinkedIn; and many other companies are looking to adopt microformats, particularly because of the recent Yahoo! announcement.

An even simpler approach is to put metadata into the meta headers. This approach has been around for a while and it is a shame that it has not been widely adopted. As an example, the New York Times recently launched extended annotations for its news pages. The benefit of this approach is that it works great for pages that are focused on a topic or a thing. For example, a news page can be described with a set of keywords, geo location, date, time, people, and categories. Another example would be for book pages. O'Reilly.com has been putting book information into the meta headers, describing the author, ISBN, and category of the book.
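Reading annotations out of meta headers takes very little code, which is part of the appeal. The sketch below uses Python's standard `html.parser` to collect `name`/`content` pairs from `<meta>` tags. The field names and values in the sample page are hypothetical and are not the actual names used by the New York Times or O'Reilly.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical book page header.
PAGE = """<html><head>
<title>Example Book</title>
<meta name="author" content="Jane Doe">
<meta name="isbn" content="0000000000">
<meta name="keywords" content="semantic web, metadata" />
</head><body></body></html>"""

parser = MetaParser()
parser.feed(PAGE)
parser.close()
print(parser.meta)
```

The limitation is visible in the code too: the pairs describe the page as a whole, so this works best when the page is about a single topic or thing.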

Despite the fact that all these approaches are different, they are also somewhat complementary; and each of them is helpful. The more annotations there are in web pages, the more standards are implemented, and the more discoverable and powerful the information becomes.

3. Consumer and Enterprise

Yet another dimension of the conversation about the Semantic Web is the focus on consumer and enterprise applications. In the consumer arena we have been looking for a Killer App - something that delivers tangible and simple consumer value. People simply do not care that a product is built on the Semantic Web; all they are looking for is utility and usefulness.

Up until recently, the challenge has been that the Semantic Web focused on rather academic issues - like annotating information to make it machine-readable. The promise was that once the information is annotated and the web becomes one big giant RDF database, then exciting consumer applications would come. The skeptics, however, have been pointing out that first there needs to be a compelling use case.

Some consumer applications based on the Semantic Web: generic and vertical search, contextual shortcuts and previews, personal information management systems, semantic browsing tools. All of these applications are in their early days and have a long way to go before being truly compelling for the average web user. Still, even if these applications succeed, consumers will not be interested in knowing about the underlying technology - so there is really no marketing play for the Semantic Web in the consumer space.

Enterprises are a different story for a couple of reasons. First, enterprises are much more used to techno speak. To them utilizing semantic technologies translates into being intelligent and that, in turn, is good marketing. 'Our products are better and smarter because we use the Semantic Web' sounds like a good value proposition for the enterprise.

But even above the marketing speak, RDF solves a problem of data interoperability and standards. This "Tower of Babel" situation has been in existence since the early days of software. Forget semantics; just a standard protocol, a standard way to pass around information between two programs, is hugely valuable in the enterprise.

RDF offers a way to communicate using an XML-based language, which also rests on a sound mathematical foundation that enables semantics. This sounds great, and even the complexity of RDF is not going to stop enterprises from using it. However, there is another problem that might stop it - scalability. Unlike relational databases, which have been around for ages and have been optimized and tuned, XML-based databases are still not widespread. In general, the problem is in the scale and querying capabilities. Like object-oriented database technologies of the late '90s, XML-based databases hold a lot of promise, but we have yet to see them in action in a big way.

4. Semantic APIs

With the rise of Semantic Web applications, we are also seeing the rise of Semantic APIs. In general, these web services take as an input unstructured information and find entities and relationships. One way to think of these services is mini natural language processing tools, which are only concerned with a subset of the language.

The first example is the Open Calais API from Reuters that we have covered in two articles here and here. This service accepts raw text and returns information about people, places, and companies found in the document. The output not only returns the list of found matches, but also specifies places in the document where the information is found. Behind Calais is a powerful natural language processing technology developed by Clear Forest (now owned by Reuters), which relies on algorithms and databases to extract entities out of text. According to Reuters, Calais is extensible, and it is just a matter of time before new entities will be added.
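The shape of that output (entities plus the places in the document where they occur) can be sketched in a few lines of Python. This is not the Calais API and it does no natural language processing; it is a toy lookup against a hypothetical hand-made entity list, shown only to illustrate what "matches with positions" looks like.

```python
import re

# Hypothetical gazetteer; a real service relies on NLP, not a fixed list.
KNOWN_ENTITIES = {
    "Reuters": "Company",
    "Clear Forest": "Company",
    "France": "Place",
}


def find_entities(text):
    """Return (name, type, start, end) for every known entity in text."""
    matches = []
    for name, entity_type in KNOWN_ENTITIES.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text):
            matches.append((name, entity_type, m.start(), m.end()))
    # Report matches in document order.
    return sorted(matches, key=lambda item: item[2])


sample_text = "Clear Forest is now owned by Reuters."
entities = find_entities(sample_text)
```

Each tuple carries character offsets, so a caller can highlight or link the exact span in the original text.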

Another example is the SemanticHacker API from TextWise, which is offering a one million dollar prize for the best commercial semantic web application developed on top of it. This API classifies information in documents into categories called semantic signatures. Given a document, it outputs entities or topics that the document is about. It is kind of like Calais, but also delivers a topical hierarchy, where the actual objects are leaves.

Another semantic API is offered by Dapper - a web service which facilitates the extraction of structure from unstructured HTML pages. Dapper works by enabling users to define attributes of an object based on the bits of the page. For example, a book publisher might define where the information about author, ISBN and number of pages is on a typical book page and the Dapper application would then create a recognizer for any page on the publisher site and enable access to it via REST API.
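The book publisher example can be sketched roughly as follows in Python. This is not how Dapper is implemented; it is a minimal illustration of the idea that a user-supplied description of where each attribute lives is enough to turn a page into a record. The field rules, class names, and sample page are all hypothetical, and the sketch assumes the matched elements do not nest.

```python
from html.parser import HTMLParser

# Hypothetical field definitions: field name -> (tag, class attribute).
BOOK_FIELDS = {
    "author": ("span", "author"),
    "isbn": ("span", "isbn"),
    "pages": ("span", "pages"),
}


class FieldExtractor(HTMLParser):
    """Pull the text of elements that match user-defined field rules."""

    def __init__(self, fields):
        super().__init__()
        self.fields = fields
        self.record = {}
        self._current = None  # field whose text is being collected

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class) in self.fields.items():
            if tag == want_tag and want_class in classes and field not in self.record:
                self._current = field
                self.record[field] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == self.fields[self._current][0]:
            self.record[self._current] = self.record[self._current].strip()
            self._current = None


def extract_record(html, fields=BOOK_FIELDS):
    """Apply the field rules to one page and return the extracted record."""
    extractor = FieldExtractor(fields)
    extractor.feed(html)
    return extractor.record


SAMPLE_PAGE = """
<div class="book">
  <h1>Example Book</h1>
  <span class="author">Jane Doe</span>
  <span class="isbn">0000000000</span>
  <span class="pages"> 320 </span>
</div>
"""

record = extract_record(SAMPLE_PAGE)
```

The same rules can be applied to every page that shares the template, which is what makes the result feel like an API over the site.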

While this seems backwards from an engineering point of view, Dapper's technology is remarkably useful in the real world. In a typical scenario, for websites that do not have clean APIs to access their information, even non-technical people can build an API in minutes with Dapper. This is a powerful way of quickly turning websites into web services.

5. Search Technologies

Perhaps the first significant blow to the Semantic Web has been the inability thus far to improve search. The premise that a semantic understanding of pages leads to vastly better search has yet to be validated. The two main contenders, Hakia and PowerSet, have made some progress, but not enough. The problem is that Google's algorithm, which is based on statistical analysis, deals just fine with semantic entities like people, cities, and companies. When asked What is the capital of France? Google returns a good enough answer.

There is a growing realization that marginal improvement in search might not be enough to beat Google or to declare search the killer app for the Semantic Web. Likely, understanding semantics is helpful but not sufficient to build a better search engine. A combination of semantics, innovative presentation, and memory of who the user is, will be necessary to power the next generation search experience.

Alternative approaches also attempt to overlay semantics on top of the search results. Even Google ventures into verticals by partitioning the results into different categories. The consumer can then decide which type of answer they are interested in.

Yet search is a game that is far from won and a lot of semantic companies are really trying to raise the bar. There may be another twist to the whole search play - contextual technologies, as well as semantic databases, could lead to qualitatively better results. And so we turn to these next.

6. Contextual Technologies

We are seeing an increasing number of contextual tools entering the consumer market. Contextual navigation does not just improve search, but rather shortcuts it. Applications like Snap or Yahoo! Shortcuts, and SmartLinks "understand" the objects inside text and links and bring relevant information right into the user's context. The result is that the user does not need to search at all.

Thinking about this more deeply, one realizes that contextual tools leverage semantics in a much more interesting way. Instead of trying to parse what a user types into the search box, contextual technologies rely on analyzing the content. So the meaning is derived in a much more precise way - or rather, there is less guessing. The contextual tools then offer the users relevant choices, each of which leads to a correct result. This is fundamentally different from trying to pull the right results from a myriad of possible choices resulting from a web search.

We are also seeing an increasing number of contextual technologies make their way into the browser. Top-down semantic technologies need to work without publishers doing anything; and so to infer context, contextual technologies integrate into the browser. Firefox's recommended extensions page features a number of contextual browsing solutions - Interclue, ThumbStrips, Cooliris, and BlueOrganizer (from my own company).

The common theme among these tools is the recognition of information and the creation of specific micro contexts for the users to interact with that information.

7. Semantic Databases

Semantic databases are another breed of semantic applications focused on annotating web information to be more structured. Twine, a product of Radar Networks and currently in private beta, focuses on building a personal knowledge base. Twine works by absorbing unstructured content in various forms and building a personal database of people, companies, things, locations, etc. The content is sent to Twine via a bookmarklet, via email, or manually. The technology needs to evolve more, but one can see how such databases can be useful once the kinks are worked out. One of the very powerful applications that could be built on top of Twine, for example, is personalized search - a way to filter the results of any search engine based on a particular individual.

It is worth noting that Radar Networks has spent a lot of time getting the infrastructure right. The underlying representation is RDF and is ready to be consumed by other semantic web services. But a big chunk of the core algorithms, the ones that are dealing with entity extraction, are being commoditized by Semantic Web APIs. Reuters offers this as an API call, for example, and so moving forward, Twine won't need to be concerned with how to do that.

Another big player in the semantic databases space is a company called Metaweb, which created Freebase. In its present form, Freebase is just a fancier and more structured version of Wikipedia - with RDF inside and less information in total. The overall goal of Freebase, however, is to build a Wikipedia equivalent of the world's information. Such a database would be enormously powerful because it could be queried exactly - much like relational databases. So once again the promise is to build much better search.

But the problem is, how can Freebase keep up with the world? Google indexes the Internet daily and grows together with the web. Freebase currently allows editing of information by individuals and has bootstrapped by taking in parts of Wikipedia and other databases, but in order to scale this approach, it needs to perfect the art of continuously taking in unstructured information from the world, parsing it, and updating its database.

The problem of keeping up with the world is common to all database approaches, which are effectively silos. In the case of Twine, there needs to be continuous influx of user data, and in the case of Freebase there needs to be influx of data from the web. These problems are far from trivial and need to be solved successfully in order for the databases to be useful.

Conclusion

With any new technology it is important to define and classify things. The Semantic Web is offering an exciting promise: improved information discoverability, automation of complex searches, and innovative web browsing. Yet the Semantic Web means different things to different people. Indeed, its definitions in the enterprise and consumer spaces are different, and there are different means to a common end - top-down vs. bottom-up and microformats vs. RDF. In addition to these patterns, we are observing the rise of semantic APIs and contextual browsing tools. All of these are in their early days but hold a big promise to fundamentally change the way we interact with information on the web.

What do you think about Semantic Web Patterns? What trends are you seeing and which applications are you waiting for? And if you work with semantic technologies in the enterprise, please share your experiences with us in the comments below.

http://www.readwriteweb.com/archives/semantic_web_patterns_a_guide_redux.php http://www.readwriteweb.com/archives/semantic_web_patterns_a_guide_redux.php Trends Fri, 26 Dec 2008 09:00:00 -0800 Alex Iskold
Top 10 RSS and Syndication Products of 2008

RSS and syndication are the veins that the new social web flows through. Countless products and services have been built on top of RSS in the past few years but there are always a few that stand above the rest.

As part of this year's Top 10 Products series, we offer below the Top 10 RSS and Syndication Products of 2008. These are the feed tools we and the people we know use day in and day out - we love them, we hate them, we wouldn't want to work without them.

This is the fourth in our series of top products of 2008:

  1. Top 10 Semantic Web Products of 2008
  2. Top 10 International Products of 2008
  3. Top 10 Consumer Web Apps of 2008


About the Selections

These aren't all new products from 2008. They are the products in the RSS and syndication world that we think made the biggest impact or were the most useful.

To be honest, this was not a particularly good year for innovation in the RSS space. Too many of the products listed below are incumbents, several of which drove us crazy this year. They remain on the list, however, because they are incredibly useful and nothing topped them.

Some honorable mentions are deserved as well. We talked to many people who like RSS magazine-style start page Feedly, though we found it overly constrictive and don't feel that it's made a big market splash yet. We also found the Associated Press's AP Member Marketplace very interesting. Had we gotten a chance to get to know it better, it could very well have been on this list. Finally, we love African social media aggregator Afrigator - it's a great way to learn about what's happening all over the continent and it's a great use of RSS. We named it one of the Top 10 International Products of 2008 but we think it deserves an honorable mention in this category as well.

And Now the RWW Top 10 RSS and Syndication Products of 2008

Postrank

Formerly known as AideRSS, Postrank is simply the most useful RSS-related application we've seen in a long time. Plug in any RSS feed and Postrank will rate each item in the feed on a scale of 1 to 10, by number of comments, inbound links, saves in Delicious, etc. You can then subscribe to a filtered feed of just the 10% most popular items in that feed.

We use Postrank all the time, in all kinds of contexts: from monitoring break-out stories in niche markets we don't follow closely, to finding out about the bread and butter of new blogs we discover, to running search feeds through Postrank to surface hot conversations on any topic.
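For readers who want to picture how that kind of rating and filtering could work, here is a rough sketch in Python. The scoring formula is our own assumption (an unweighted sum of engagement signals, scaled against the most engaged item in the feed); Postrank's actual algorithm is not described here and is surely more involved.

```python
def engagement(item):
    """Sum the engagement signals for one feed item (equal weights assumed)."""
    return item["comments"] + item["inbound_links"] + item["saves"]


def rate_items(items):
    """Attach a 1-10 score relative to the most engaged item in the feed."""
    top = max((engagement(i) for i in items), default=0)
    rated = []
    for item in items:
        if top == 0:
            score = 1.0
        else:
            score = 1 + 9 * engagement(item) / top
        rated.append({**item, "score": round(score, 1)})
    return rated


def top_percent(rated, percent=10):
    """Keep the highest scoring share of items, always at least one."""
    keep = max(1, (len(rated) * percent) // 100)
    return sorted(rated, key=lambda i: i["score"], reverse=True)[:keep]


# Hypothetical feed of ten items with steadily rising engagement.
items = [
    {"title": f"post {n}", "comments": n, "inbound_links": n, "saves": n}
    for n in range(10)
]

best = top_percent(rate_items(items))
```

Scoring relative to the feed itself means a quiet niche blog and a high-traffic one can both yield a "best of" feed.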

Postrank has been around for about a year and a half, but we write about it over and over again.

This year Postrank opened an API, made a bunch of deals with other companies, improved its service, raised a round of funding and just generally rocked.

FriendFeed

Social "life streaming" service FriendFeed is making syndication a more social activity than anything else has yet. The service aggregates your activity data from all around the web, lets your friends comment on it and shows you the activities of all your friends' friends when someone you know comments on something and exposes it to their network.

If RSS readers will change your life and work through their awesome usefulness, FriendFeed is a service that makes syndication fun. It's one of the first places we go on the web every morning.

We interviewed the ex-Googlers who founded FriendFeed last February and that interview is still the best place to learn how the service works under the hood.

If you'd like to connect with the ReadWriteWeb crew on FriendFeed (and we hope you will) we've posted a tour of our FriendFeed profile pages here. Please join us also in the ReadWriteWeb FriendFeed Room.

Gnip

Gnip is a social media ping server, a service that other services ask for user data updates from all around the web. There's nothing here for users, but almost every developer we talk to these days who is aggregating content in order to add value to it (and that is the name of the game) has Gnip on its radar. The company aims to make aggregation more timely, scalable and efficient than it is today.

We wrote about Gnip at length when the service launched in July.

Snackr

Snackr is a simple little RSS ticker built in Adobe AIR. Its frenetic, nonstop delivery of news is too much for many people, but the rest of us love it. It's where our eyes wander during page loads and other down times. Many of the stories you read here at ReadWriteWeb were based on things we first caught wind of through Snackr.

Snackr was built in-house at Adobe by Flex team member Narciso Jaramillo. We reviewed it in May and have been using it ever since.

Google Reader

Google Reader is the market leader in full-featured RSS readers, having pulled ahead of the troubled Bloglines in recent months. This year Google Reader made its sharing feature much more transparent, added the ability to translate any feed into a number of different languages and was recently redesigned.

It hasn't been a super exciting year for the product, and there are still basic problems like very infrequent caching of rare feeds, but Google Reader's incredible dominance in the field makes it a required part of this list.

Google Reader RSS Subscriber Count Greasemonkey Script

One of the simplest little changes we've made to our browsers lately is the addition of this Greasemonkey script that shows the number of readers in Google Reader that any page's RSS feed has. You can usually multiply that number by 2 to 4 for an estimate of how many total readers a feed has across all readers, but either way it's a great little indication of a site's popularity.

The script was written by an anonymous user named "uncv" and we'd like to thank them. We love what they've done! This was one of the 7 coolest browser tweaks from the last month that we wrote about earlier this week. It's already won a permanent place in our hearts!

Dapper

Dapper.net is a point and click interface for data extraction - a nice way to say scraping a page to produce an RSS feed. We continue to depend on Dapper for all kinds of research, and we're always finding new ways to use it around here. We love it.

Unfortunately, some sites (like Facebook, for example) don't like us having links back to them available in our RSS readers, and that really upsets us. In many cases the feeds we created ourselves are the only way we'd be drawn back to a site, so it's their loss as much as ours.

Dapper has been around since 2006, but they recently launched a semantic ad platform that we included in our list of the top 10 semantic web products of 2008.

Twitterfeed

Love it or hate it, Twitterfeed has made a big impact on the web in 2008. It's the service people use to publish an RSS feed right into Twitter.

Some people argue that Twitter is all about conversation and that publishing an RSS feed there is grating and inappropriate. We like getting our local newspaper story links on Twitter, though, and everything from disaster monitoring to traffic conditions is now available via Twitterfeed.

Feedburner

Google's RSS publishing service Feedburner hurt our ability to break news first, can't be used in many corporate environments because it gets blocked in China and only made 6 posts all year to its company blog, none since May. That's compared to 28 posts in 2007. Apparently once you get your Google money there's not much point in communicating with the people who depend on you every day.

Why would we call Feedburner one of the top 10 RSS products of the year then? Because despite how frustrating it can be, the service is still so incredibly useful that we don't know what we'd do without it. Not just for publishing and analytics for ReadWriteWeb feeds - from numbers to email delivery to FeedFlare links, Feedburner will work magic easily on any feed you work with. I've got 68 different feeds in my account and I'll probably publish several more before the year is up.

Pipes

Yahoo! Pipes is another RSS based service that is really frustrating, hasn't innovated substantially in the last year - but is still so powerfully useful that it deserves a spot as one of the top products in this market.

Splicing and filtering RSS feeds is the simplest thing to do with Pipes, but there's much more you can do with it as well. It's great for us pseudo-geeks; we can work all kinds of magic with it. We've used Pipes throughout the year to do things that we (ok I) don't have the technical chops to do otherwise. For that I thank the Pipes team a whole lot.
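For the curious, splicing and filtering looks roughly like this when written out by hand in Python. This is only a sketch of the two operations, not of how Pipes itself works; the two sample feeds and the example.com links are invented for illustration.

```python
import xml.etree.ElementTree as ET


def read_items(rss_xml):
    """Parse an RSS document and return its items as title/link dicts."""
    root = ET.fromstring(rss_xml)
    return [
        {
            "title": node.findtext("title", default=""),
            "link": node.findtext("link", default=""),
        }
        for node in root.iter("item")
    ]


def splice(*feeds):
    """Merge feeds in order, dropping items whose link was already seen."""
    seen = set()
    merged = []
    for feed in feeds:
        for item in feed:
            if item["link"] not in seen:
                seen.add(item["link"])
                merged.append(item)
    return merged


def filter_by_keyword(items, keyword):
    """Keep only items whose title mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [i for i in items if needle in i["title"].lower()]


# Two hypothetical feeds that share one item.
FEED_A = """<rss version="2.0"><channel>
<item><title>Semantic Web roundup</title><link>http://example.com/a1</link></item>
<item><title>Mobile news</title><link>http://example.com/a2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<item><title>Mobile news</title><link>http://example.com/a2</link></item>
<item><title>Semantic search startups</title><link>http://example.com/b1</link></item>
</channel></rss>"""

combined = splice(read_items(FEED_A), read_items(FEED_B))
semantic_only = filter_by_keyword(combined, "semantic")
```

Pipes lets you wire up the same two steps by dragging boxes around, which is exactly why it suits those of us without the chops to write this ourselves.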

Those Were Our Favorites This Year - How About You?

Did we miss anyone you think should have been on this list? We hope you'll share your favorites in comments below. What RSS and syndication products impacted you the most in 2008?

http://www.readwriteweb.com/archives/top_10_rsssyndication_products_of_2008.php http://www.readwriteweb.com/archives/top_10_rsssyndication_products_of_2008.php 2008 in Review Thu, 11 Dec 2008 15:30:30 -0800 Marshall Kirkpatrick
Some Web Apps Work Better Together

How many new websites can you fit in a Volkswagen Beetle? Sometimes it feels like that's what we're trying to do these days - but all these new applications and services don't have to be crammed into our heads and lives as separate things to try out and remember.

Many new technologies work best in concert; the functionality of one application can be vastly improved by using it together with another one. Here are some of our favorite examples of apps that work best together, followed by some favorite workflows from friends of ReadWriteWeb. We hope you'll share your favorite combos in comments, too, so we can all learn some new things.

Some of Our Favorites

AideRSS plus Snackr

RSS news ticker Snackr was an app that people either loved or hated when we first wrote about it here. The attractive Adobe AIR interface is even more compelling now that you can sync it with your Google Reader account (as of last week). One of the best uses we've found for this ever-flowing stream of news though has been to fill it up with "best of" feeds from AideRSS. AideRSS is an app we've written about over and over again here because it's just so darned useful and cool.


Put the two together though and you've got a stream of just the breakout hits from high traffic feeds. We enjoy and recommend reading the top stories on topics like the semantic web, mobile and recommendation technology through Snackr - but we're sure you can build your own easily.

Ma.gnolia (or Del.icio.us) plus Feed.Informer

You can do a whole lot of different things with social bookmarking tools like Ma.gnolia and Del.icio.us, probably including some things most readers here aren't familiar with. One of our favorite things though is to pick a particular tag from your account and run the RSS feed from that tag through a handy little service called Feed.informer.

You can display any amount of the feed on a web page with just a few lines of embed code, including the "notes" field for your tag as editorial or summary information. The result is a little news section for your website, powered by your social bookmarking tool. It's a great way to continue sharing found items online that don't warrant an entire blog post.

FriendFeed and MuxTape plus FluidApp

We wrote here earlier this year about a fabulous mashup of mixtape service Muxtape and single-app browser creation tool for Mac called FluidApp, but it's also really useful to combine FriendFeed and Fluid.

Most of the other standalone FriendFeed apps are hard to use (excluding the wonderful mobile app FFtoGo) but putting your friends' feeds and conversation in a standalone browser makes it easy to follow along without losing the FF tab in your browser. FriendFeed's auto-updating keeps the dedicated browser up to date and the FF favicon looks great in your dock.

Single app browsers fall into the "seems stupid until you try it" category, but put the right app in there and you'll enjoy it.

Windows users can check out Bubbles, a service that was reviewed and discussed recently at Download Squad.

Facebook plus Dapper

The RSS extraction tool Dapper is really powerful, once you figure out how and why to use it. Here's a 4 minute screencast we recorded about how to use Dapper but the sky's the limit with what you can do with this free tool.

One of the things we've done with it lately is scrape birthday notifications out of Facebook. Not everyone logs into Facebook every day, but people tend to put their real birthdays into their profiles there. It's really nice to get those birthday notifications by RSS in another setting that you spend time in more regularly. Step by step instructions for doing so are available here.

Friends of RWW

We asked around and got some input from friends about what apps they like to use together. The responses ranged from combinations aimed to increase productivity to making the most of music listening. Here are some of our favorites.

Local Portland tech blogger Rick Turoczy says he likes to use Twitter search (formerly Summize), combined with Yahoo! Pipes and RSS to SMS service Pingie. We're not sure what he does with those apps together, but the magic results in his getting a lot of industry news before mainstream media outlets do.

MicroISV consultant Bob Walsh makes the most of his fleeting thoughts by sending voice recordings through Jott over to "memory extender" EverNote and "thence to various programs on my Mac." That's the kind of thing many of us have probably envisioned doing, we're glad it's working for Bob.

Susan Kirkpatrick (no relation) is a prolific multi-media blogger. How does she do it? [I] "send a blog post with a picture attachment via email to Utterz; it posts to Flickr, WordPress, Pownce and Twitter." We haven't used it a lot ourselves, but Utterz is pretty impressive and we hear rumors that there are even more sophisticated developments being worked on behind the scenes there, too.

Virginie De Bel Air says she likes Last.fm + SonicLiving, a service that tracks your favorites on iTunes, Last.fm or Pandora and notifies you when those bands are coming to perform in your area. Utilitarian and rock and roll! We hadn't seen SonicLiving before.

Security and IT exec Greg Hughes likes to let his hair down and shout Shazam! sometimes. Specifically, Hughes says he finds himself using the Shazam music identification app to identify a song he hears and then Pandora to discover more that's related. All on the iPhone, too.

What About You?

What are your favorite apps to use together? There are so many new apps that launch every day, we can't imagine the infinite permutations that users could come up with. Putting together multiple apps usually implies though that you're fairly comfortable with one or both of them, that they are equipped to live as something other than a walled garden and that each has stood enough of a test for users to believe they are stable enough to smoosh together.

Productivity? Fun? A combination of both, perhaps? We'd love to know what your favorite apps are to run together.

Photo: "Web 2.0 Crawl Yahoo Brickhouse: Nate Westheimer of BricaBox, Dave McClure, Gabe Rivera of Techmeme" by Brian Solis. Just imagine how great it would be if these app guys worked together!

http://www.readwriteweb.com/archives/some_web_apps_work_better_together.php http://www.readwriteweb.com/archives/some_web_apps_work_better_together.php Mashups Wed, 30 Jul 2008 17:11:09 -0800 Marshall Kirkpatrick
Semantic Web Patterns: A Guide to Semantic Technologies

In this article, we'll analyze the trends and technologies that power the Semantic Web. We'll identify patterns that are beginning to emerge, classify the different trends, and peek into what the future holds.

In a recent interview Tim Berners-Lee pointed out that the infrastructure to power the Semantic Web is already here. ReadWriteWeb's founder, Richard MacManus, even picked it to be the number one trend in 2008. And rightly so. Not only are the bits of infrastructure now in place, but we are also seeing startups and larger corporations working hard to deliver end user value on top of this sophisticated set of technologies.

The Semantic Web means many things to different people, because there are a lot of pieces to it. To some, the Semantic Web is the web of data, where information is represented in RDF and OWL. Some people replace RDF with Microformats. Others think that the Semantic Web is about web services, while for many it is about artificial intelligence - computer programs solving complex optimization problems that are out of our reach. And business people always redefine the problem in terms of end user value, saying that whatever it is, it needs to have simple and tangible applications for consumers and enterprises.

The disagreement is not accidental, because the technology and concepts are broad. Much is possible and much is to be imagined.

1. Bottom-Up and Top-Down

We have written a lot about the different approaches to the Semantic Web - the classic bottom-up approach and the new top-down one. The bottom-up approach is focused on annotating information in pages, using RDF, so that it is machine readable. The top-down approach is focused on leveraging information in existing web pages, as-is, to derive meaning automatically. Both approaches are making good progress.

A big win for the bottom-up approach was the recent announcement from Yahoo! that their search engine is going to support RDF and microformats. This is a win-win-win for publishers, for Yahoo!, and for customers - publishers now have an incentive to annotate information because Yahoo! Search will be taking advantage of it, and users will then see better, more precise results.

Another recent win for the bottom-up approach was the announcement of the Semantify web service from Dapper (previous coverage). This offering will enable publishers to add semantic annotations to existing web pages. The more tools like Semantify that pop up, the easier it will be for publishers to annotate pages. Automatic annotation tools combined with the incentive to annotate the pages is going to make the bottom-up approach more compelling.

But even if the tools and incentives exist, making the bottom-up approach widespread is difficult. Today, the magic of Google is that it can understand information as is, without asking people to fully comply with W3C standards or SEO optimization techniques. Similarly, top-down semantic tools are focused on dealing with imperfections in existing information. Among them are the natural language processing tools that do entity extraction - such as the Calais and TextWise APIs that recognize people, companies, places, etc. in documents; vertical search engines, like ZoomInfo and Spock, which mine the web for people; technologies like Dapper and BlueOrganizer, which recognize objects in web pages; and Yahoo! Shortcuts, Snap and SmartLinks, which recognize objects in text and links.

[Disclosure: Alex Iskold is founder and CEO of AdaptiveBlue, which makes BlueOrganizer and SmartLinks.]

Top-down technologies are racing forward despite imperfect information. And, of course, they benefit from the bottom-up annotations as well. The more annotations there are, the more precise top-down technologies will get - because they will be able to take advantage of structured information as well.

2. Annotation Technologies: RDF, Microformats, and Meta Headers

Within the bottom-up approach to annotation of data, there are several choices for annotation. They are not equally powerful, and in fact each approach is a tradeoff between simplicity and completeness. The most comprehensive approach is RDF - a powerful, graph-based language for declaring things, and attributes and relationships between things. In a simplistic way, one can think of RDF as the language that allows expressing truths like: Alex IS human (type expression), Alex HAS a brain (attribute expression), and Alex IS the father of Alice, Lilly, and Sofia (relationship expression). RDF is powerful, but because it is highly recursive, precise, and mathematically sound, it is also complex.
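The three kinds of statements above can be written down as subject-predicate-object triples, which is the core data model of RDF. The Python sketch below is only an illustration of that model: the predicate names (is_a, has, father_of) are our own shorthand, and real RDF would identify each thing with a URI and use a standard vocabulary.

```python
# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    ("Alex", "is_a", "Human"),        # type expression
    ("Alex", "has", "Brain"),         # attribute expression
    ("Alex", "father_of", "Alice"),   # relationship expressions
    ("Alex", "father_of", "Lilly"),
    ("Alex", "father_of", "Sofia"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Ask the graph: who are Alex's children?
children = [o for (_, _, o) in match(TRIPLES, subject="Alex", predicate="father_of")]
```

Because every fact has the same three-part shape, facts from different sources can be pooled and queried together, which is where the interoperability benefit comes from.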

At present, most use of RDF is for interoperability. For example, the medical community uses RDF to describe genomic databases. Because the information is normalized, the databases that were previously silos can now be queried together and correlated. In general, in addition to semantic soundness, the major benefit of RDF is interoperability and standardization, particularly for enterprises, as we will discuss below.

Microformats offer a simpler approach by adding semantics to existing HTML documents using specific CSS class names. The metadata is compact and is embedded inside the actual HTML. Popular microformats are hCard, which describes personal and company contact information, hReview, which adds meta information to review pages, and hCalendar, which is used to describe events.

Microformats are gaining popularity because of their simplicity, but they are still quite limiting. There is no way to describe type hierarchies, which the classic semantic community would say is critical. The other issue is that microformats are somewhat cryptic, because the focus is to keep the annotations to a minimum. This, in turn, brings up another question of whether embedding metadata into the view (HTML) is a good idea. The question is: what happens if the underlying data changes when someone makes a copy of the HTML document? Nevertheless, despite these issues, microformats are gaining popularity because they are simple. Microformats are currently used by Flickr, Eventful, and LinkedIn; and many other companies are looking to adopt microformats, particularly because of the recent Yahoo! announcement.
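As an illustration, hCard marks up contact details with class names such as vcard, fn (formatted name), org, and tel. The Python sketch below reads those three properties from a small, made-up snippet; it handles only this subset and assumes the markup contains no void elements such as br or img, so treat it as a sketch rather than a conforming hCard parser.

```python
from html.parser import HTMLParser

# Subset of hCard property class names handled by this sketch.
HCARD_PROPERTIES = {"fn", "org", "tel"}


class HCardParser(HTMLParser):
    """Collect the text of elements tagged with hCard property classes."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._open = []  # one entry per open element: property name or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in HCARD_PROPERTIES), None)
        self._open.append(prop)
        if prop:
            self.card.setdefault(prop, "")

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Only text directly inside a property element is recorded.
        if self._open and self._open[-1]:
            self.card[self._open[-1]] += data


# Hypothetical contact snippet marked up as an hCard.
SAMPLE_HCARD = (
    '<div class="vcard">'
    '<span class="fn">Jane Doe</span>, '
    '<span class="org">Example Corp</span>, '
    '<span class="tel">555-0100</span>'
    '</div>'
)

hcard_parser = HCardParser()
hcard_parser.feed(SAMPLE_HCARD)
contact = hcard_parser.card
```

The page still renders as ordinary HTML for people, while the class names give a program enough to recover a structured contact record.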

An even simpler approach is to put meta data into the meta headers. This approach has been around for a while and it is a shame that it has not been widely adopted. As an example, the New York Times recently launched extended annotations for its news pages. The benefit of this approach is that it works great for pages that are focused on a topic or a thing. For example, a news page can be described with a set of keywords, geo location, date, time, people, and categories. Another example would be for book pages. O'Reilly.com has been putting book information into the meta headers, describing the author, ISBN, and category of the book.
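
A consumer of such annotations needs very little code. The sketch below, in Python with only the standard library, collects `name`/`content` pairs from meta tags. The sample page and the field names in it (`author`, `isbn`, `category`) are assumptions made for the example; they are not the actual names used by the New York Times or O'Reilly.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            self.meta[name] = content


def read_meta(html):
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


# Hypothetical book page; the values are placeholders.
page = """<html><head>
<meta name="author" content="Jane Example">
<meta name="isbn" content="0000000000">
<meta name="category" content="Programming">
</head><body>...</body></html>"""

print(read_meta(page))
```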

Despite the fact that all these approaches are different, they are also somewhat complementary; and each of them is helpful. The more annotations there are in web pages, the more standards are implemented, and the more discoverable and powerful the information becomes.

3. Consumer and Enterprise

Yet another dimension of the conversation about the Semantic Web is the focus on consumer and enterprise applications. In the consumer arena we have been looking for a Killer App - something that delivers tangible and simple consumer value. People simply do not care that a product is built on the Semantic Web; all they are looking for is utility and usefulness.

Up until recently, the challenge has been that the Semantic Web is focused on rather academic issues - like annotating information to make it machine readable. The promise was that once the information is annotated and the web becomes one big giant RDF database, then exciting consumer applications will come. The skeptics, however, have been pointing out that first there needs to be a compelling use case.

Some consumer applications based on the Semantic Web: generic and vertical search, contextual shortcuts and previews, personal information management systems, semantic browsing tools. All of these applications are in their early days and have a long way to go before being truly compelling for the average web user. Still, even if these applications succeed, consumers will not be interested in knowing about the underlying technology - so there is really no marketing play for the Semantic Web in the consumer space.

Enterprises are a different story for a couple of reasons. First, enterprises are much more used to techno speak. To them utilizing semantic technologies translates into being intelligent and that, in turn, is good marketing. 'Our products are better and smarter because we use the Semantic Web' sounds like a good value proposition for the enterprise.

But even above the marketing speak, RDF solves a problem of data interoperability and standards. This "Tower of Babel" situation has been in existence since the early days of software. Forget semantics; just a standard protocol, a standard way to pass around information between two programs, is hugely valuable in the enterprise.

RDF offers a way to communicate using an XML-based language, which on top of it has sound mathematical elements to enable semantics. This sounds great, and even the complexity of RDF is not going to stop enterprises from using it. However, there is another problem that might stop it - scalability. Unlike relational databases, which have been around for ages and have been optimized and tuned, XML-based databases are still not widespread. In general, the problem is in the scale and querying capabilities. Like object-oriented database technologies of the late nineties, XML-based databases hold a lot of promise, but we have yet to see them in action in a big way.

4. Semantic APIs

With the rise of Semantic Web applications, we are also seeing the rise of Semantic APIs. In general, these web services take as an input unstructured information and find entities and relationships. One way to think of these services is mini natural language processing tools, which are only concerned with a subset of the language.

The first example is the Open Calais API from Reuters that we have covered in two articles here and here. This service accepts raw text and returns information about people, places, and companies found in the document. The output not only returns the list of found matches, but also specifies places in the document where the information is found. Behind Calais is a powerful natural language processing technology developed by Clear Forest (now owned by Reuters), which relies on algorithms and databases to extract entities out of text. According to Reuters, Calais is extensible, and it is just a matter of time before new entities will be added.
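
The shape of such a service's output (matched entities plus their positions in the text) can be sketched in a few lines of Python. This is a toy lookup against a hand-written list, not Calais's API or its natural language processing; the `GAZETTEER` table and the `extract_entities` function are hypothetical names used only here.

```python
import re

# Hand-written lookup table standing in for a real entity database.
GAZETTEER = {
    "Reuters": "Company",
    "Paris": "City",
    "Alex Iskold": "Person",
}


def extract_entities(text):
    """Return each known entity found in `text`, with its type and offset."""
    found = []
    for name, kind in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append({"entity": name, "type": kind, "offset": match.start()})
    # Report matches in the order they appear in the document.
    return sorted(found, key=lambda e: e["offset"])


sample = "Reuters opened an office in Paris."
for entity in extract_entities(sample):
    print(entity)
```

A real service has to cope with ambiguity, misspellings, and names it has never seen, which is where the algorithms and databases mentioned above come in.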

Another example is the SemanticHacker API from TextWise, which is offering a one million dollar prize for the best commercial semantic web application developed on top of it. This API classifies information in documents into categories called semantic signatures. Given a document, it outputs entities or topics that the document is about. It is kind of like Calais, but also delivers a topical hierarchy, where the actual objects are leaves.

Another semantic API is offered by Dapper - a web service which facilitates the extraction of structure from unstructured HTML pages. Dapper works by enabling users to define attributes of an object based on the bits of the page. For example, a book publisher might define where the information about author, ISBN, and number of pages is on a typical book page, and the Dapper application would then create a recognizer for any page on the publisher site and enable access to it via a REST API.

While this seems backwards from an engineering point of view, Dapper's technology is remarkably useful in the real world. In a typical scenario, for web sites that do not have clean APIs to access their information, even non-technical people can build an API in minutes with Dapper. This is a powerful way of quickly turning web sites into web services.

5. Search Technologies

Perhaps the first significant blow to the Semantic Web has been the inability thus far to improve search. The premise that semantic understanding of pages leads to vastly better search has yet to be validated. The two main contenders, Hakia and PowerSet, have made some progress, but not enough. The problem is that Google's algorithm, which is based on statistical analysis, deals just fine with semantic entities like people, cities, and companies. When asked "What is the capital of France?" Google returns a good enough answer.

There is a growing realization that marginal improvement in search might not be enough to beat Google, and to declare search the killer app for the Semantic Web. Likely, understanding semantics is helpful but not sufficient to build a better search engine. A combination of semantics, innovative presentation, and memory of who the user is, will be necessary to power the next generation search experience.

Alternative approaches also attempt to overlay semantics on top of the search results. Even Google ventures into verticals by partitioning the results into different categories. The consumer can then decide which type of answer they are interested in.

Yet search is a game that is far from won and a lot of semantic companies are really trying to raise the bar. There may be another twist to the whole search play - contextual technologies, as well as semantic databases, could lead to qualitatively better results. And so we turn to these next.

6. Contextual Technologies

We are seeing an increasing number of contextual tools entering the consumer market. Contextual navigation does not just improve search, but rather shortcuts it. Applications like Snap or Yahoo! Shortcuts or SmartLinks "understand" the objects inside text and links and bring relevant information right into the user's context. The result is that the user does not need to search at all.

Thinking about this more deeply, one realizes that contextual tools leverage semantics in a much more interesting way. Instead of trying to parse what a user types into the search box, contextual technologies rely on analyzing the content. So the meaning is derived in a much more precise way - or rather, there is less guessing. The contextual tools then offer the users relevant choices, each of which leads to a correct result. This is fundamentally different from trying to pull the right results from a myriad of possible choices resulting from a web search.

We are also seeing an increasing number of contextual technologies make their way into the browser. Top-down semantic technologies need to work without publishers doing anything; and so to infer context, contextual technologies integrate into the browser. Firefox's recommended extensions page features a number of contextual browsing solutions - Interclue, ThumbStrips, Cooliris, and BlueOrganizer (from my own company).

The common theme among these tools is the recognition of information and the creation of specific micro contexts for the users to interact with that information.

7. Semantic Databases

Semantic databases are another breed of semantic applications focused on annotating web information to be more structured. Twine, a product of Radar Networks and currently in private beta, focuses on building a personal knowledge base. Twine works by absorbing unstructured content in various forms and building a personal database of people, companies, things, locations, etc. The content is sent to Twine via bookmarklet or via email or manually. The technology needs to evolve more, but one can see how such databases can be useful once the kinks are worked out. One of the very powerful applications that could be built on top of Twine, for example, is personalized search - a way to filter the results of any search engine based on a particular individual.

It is worth noting that Radar Networks has spent a lot of time getting the infrastructure right. The underlying representation is RDF and is ready to be consumed by other semantic web services. But a big chunk of the core algorithms, the ones that are dealing with entity extraction, are being commoditized by Semantic Web APIs. Reuters offers this as an API call, for example, and so moving forward, Twine won't need to be concerned with how to do that.

Another big player in the semantic databases space is a company called Metaweb, which created Freebase. In its present form, Freebase is just a fancier and more structured version of Wikipedia - with RDF inside and less information in total. The overall goal of Freebase, however, is to build a Wikipedia equivalent of the world's information. Such a database would be enormously powerful because it could be queried exactly - much like relational databases. So once again the promise is to build much better search.

But the problem is, how can Freebase keep up with the world? Google indexes the Internet daily and grows together with the web. Freebase currently allows editing of information by individuals and has bootstrapped by taking in parts of Wikipedia and other databases, but in order to scale this approach, it needs to perfect the art of continuously taking in unstructured information from the world, parsing it, and updating its database.

The problem of keeping up with the world is common to all database approaches, which are effectively silos. In the case of Twine, there needs to be continuous influx of user data, and in the case of Freebase there needs to be influx of data from the web. These problems are far from trivial and need to be solved successfully in order for the databases to be useful.

Conclusion

With any new technology it is important to define and classify things. The Semantic Web is offering an exciting promise: improved information discoverability, automation of complex searches, and innovative web browsing. Yet the Semantic Web means different things to different people. Indeed, its definition in the enterprise and consumer spaces is different, and there are different means to a common end - top-down vs. bottom up and microformats vs. RDF. In addition to these patterns, we are observing the rise of semantic APIs and contextual browsing tools. All of these are in their early days, but hold a big promise to fundamentally change the way we interact with information on the web.

What do you think about Semantic Web Patterns? What trends are you seeing and which applications are you waiting for? And if you work with semantic technologies in the enterprise, please share your experiences with us in the comments below.

http://www.readwriteweb.com/archives/semantic_web_patterns.php http://www.readwriteweb.com/archives/semantic_web_patterns.php Trends Tue, 25 Mar 2008 15:20:45 -0800 Alex Iskold
Semantify - Automate Your Semantic Web SEO in Five Minutes The timing couldn't be better for the release of Semantify, a new service from Israel/San Francisco's Dapper.net. One week after Yahoo! announced that it will begin indexing the semantic markup and meaning of content on the web, Semantify offers a remarkably simple way to get your website marked up semantically. Automatically, forever.

Once you learn how to use Dapper's basic interface, it can take less than five minutes to set up the Semantify service. Hello SEO, 3.0.

Just a Few Steps

Here's what it takes:

1. Identify your website and show Dapper a few different pages on it.

2. Point and click to identify particular fields on your pages, like the titles, dates and authors of articles. Sometimes this requires a few extra clicks to exclude false positives in the previewed results.

3. Name those fields according to any number of Semantic Web naming protocols. In my test of Semantify, for my personal site marshallk.com, I used the Dublin Core namespaces "title," "date," "description" and "creator" to name my fields in Dapper. I could have designated fields as the names of my friends or as particular locations. There are simple descriptions of other namespace conventions linked to from the Semantify page and this part is pretty intuitive.

4. Once you've gotten this far, in the standard method of using Dapper you'd grab an RSS feed that would deliver changes that get made to the fields you're monitoring. With Semantify, though, you get a few lines of PHP code to paste into the header of your website. See the screenshot at the bottom of this post.

And then you're done.
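
For a sense of what Dublin Core markup in a page header can look like, here is a short Python sketch that turns named fields into meta tags. It follows the common convention of `DC.`-prefixed meta names with a `schema.DC` link; whether Semantify's PHP snippet emits exactly this form is an assumption, and the `dublin_core_meta` function and the sample values are invented for the example.

```python
from html import escape


def dublin_core_meta(fields):
    """Render a dict of Dublin Core element names and values as header tags."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for name, value in fields.items():
        # Escape both parts so quotes or ampersands cannot break the markup.
        lines.append(f'<meta name="DC.{escape(name)}" content="{escape(value)}">')
    return "\n".join(lines)


# Placeholder values for the four fields named in step 3.
print(dublin_core_meta({
    "title": "An example post",
    "date": "2008-03-20",
    "description": "A short summary of the post",
    "creator": "Example Author",
}))
```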

Dapper GUI + Semantic Web vocab list + PHP embed code = automated Semantic Web markup for your site. It's like a point and click sitemap creator on the element-by-element level. It's a perpetual standards-based SEO machine. That's the incentive for publishers. For the rest of us, once the meaning of content is machine readable - there's a world of sophisticated information processing we'll be able to automate and leverage.

It's The Early Days

It's as simple as that, or at least it will be once all the little kinks are worked out. At launch the embed code is only available in PHP but the company says more options are right around the corner. The company rushed to get this service out the door and that's a little obvious right now. It's also clear that the problems are small ones that they'll be able to solve quickly. There are more sophisticated options coming (more granular control over namespaces, for example) and the user interface could always be improved over there. Nonetheless, this service could end up being very, very big.

You can go through those steps above today (I have), and whenever the Yahoo! spider hits your webpage, it will be shown a semantically marked up version of whatever content is live on your pages at the time. It will come from your domain and everyone will be happy. Wash, rinse and repeat for all your domains. Then, thank Dapper for making it so damn easy.

Historical Context

Many people have questioned the viability of the Semantic Web vision, asking who will do the markup. Yahoo! has stepped in and provided the incentive for every publisher to do so; now Dapper's Semantify is hoping to provide the service that will make it easy, too.

Once it's just a matter of course for publishers to publish semantic markup with their content, look out world. My favorite example, from our coverage of the Yahoo! announcement, is this: show me all the movie reviews written by a user's friends who live in Europe. Today, that would be hard to do. Once semantic markup is widely published and indexed - then such queries will be trivial and the only question will be what we want to do with that information.

The Semantic Web could change the world. The only things missing were incentive, which Yahoo! now provides, and ease of use, which Semantify began offering today.

Picture 2.png
http://www.readwriteweb.com/archives/semantify_automate_your_semantic_web_seo_in_five_minutes.php http://www.readwriteweb.com/archives/semantify_automate_your_semantic_web_seo_in_five_minutes.php Product Reviews Thu, 20 Mar 2008 17:07:35 -0800 Marshall Kirkpatrick
Funding the Semantic Web: Dapper's Ad Network Plan The founders of the data extraction and API creation service Dapper announced this week that their aim is to leverage Dapper in the service of ad networks and derive a semantic index of pages around the web from that activity. They will launch their ad powering product at Ad:Tech in April. Essentially, it will perform ad funded indexing of the semantic web.

Here's how it will work: Dapper lets users identify and tag particular fields on any page. It then extracts the value in that field and makes it available in XML. As a result of this advertising activity, Dapper believes a substantial quantity of pages around the web could have fields of interest delineated and tagged with relevant terms. Relationships between pages and fields and terms and tags can all be extracted and analyzed from this aggregated activity.

The company has already built a demonstration semantic search engine based on Dapper activity and its ability to parse search results by semantic meaning and detail is quite sophisticated. The potential applications of a semantic index built by Dapper are really exciting to consider.

Dapper currently has 35,000 extraction functions (Dapps) created, but they are betting that a clear profit motive will incentivize advertisers to create many, many more. Advertisers will pay to have web content delineated by field and categorized.

The company argues that advertisers see substantially increased relevance and click-through if ads can be served based on very specific fields of content on a page. Early prototypes run on top music site Pitchfork and book summary site Shvoong saw 100 to 500% increases in CTR.

While Dapper's approach would likely leave the vast majority of fields on a page unindexed, it could also rack up a whole lot of semantic knowledge by riding the profit motive to discover the semantic meaning of the most monetizable fields on a far greater number of pages than would likely be analyzed otherwise. What better way to analyze the web than to ride along with ad networks? I can't think of any better way.

I think Dapper has a shot at helping fund the semantic analysis of much of the web. What will they do with the data other than use it to contextualize ads? That's another question, but an interesting one to consider.

Dappercamp was a great event this week and the tool itself is one I highly recommend. It's in startup mode and I'll be frank - many of the output formats simply don't work and there are a number of errors throughout the site. Nonetheless, I derive significant value for my work every time I engage with it. Here's a screencast tutorial I recorded on the service. Several Dapps, Dapper-created data extractions, have become daily go-to sources of information for me - but I also recognize that only so many people are going to be as excited about this technology for research purposes. For the rest of the world, for the viability of the company, and for the potentially gigantic secondary benefit of widespread semantic indexing - I think putting Dapper in service of ad networks is a plan of simple brilliance.

http://www.readwriteweb.com/archives/dapper_funding_the_semantic_web.php http://www.readwriteweb.com/archives/dapper_funding_the_semantic_web.php Product Reviews Wed, 06 Feb 2008 09:15:24 -0800 Marshall Kirkpatrick
MindTouch Powers-Up DekiWiki with Dapper Open source wiki vendor MindTouch is releasing a series of major new features Monday and some of them are quite interesting. People used to talk about MindTouch for its outlandish stunts - like working with nutball John Gotts on the short-lived Wiki.com platform and hiring a Bono impersonator to walk the exhibit floor at DEMO. Those days seem like the distant past as now the MindTouch software gets attention on its own.

Today the company's product, called Dekiwiki, gets an application platform based on its own simple language called dekiscript and a new execution engine. Additionally, a newly organized infrastructure will now allow thousands of wikis to be run with a single multi-tenant install. This should make management of multiple wikis in one organization far easier than ever before.

Also new is easy integration of outside data scraped by Dapper.net, which can be displayed using the new Google Charts API wrapped in Dekiscript for easier mashup creation. I really like Dapper a lot - see our most recent coverage of this paradigm changing tool here. This is really what motivates me to write about this release. An open source wiki integrating the screen scraping power of Dapper and displaying the data using the Google Charts API is just plain cool.

The company claims it sees 30k free downloads each month and most public discussion of the product is very positive. This past month, the company added open source industry journalist Matt Asay to its board of advisors and released versions of Dekiwiki translated into 9 different languages.

Below is a video from the company showing off all the new features in today's release. There's a lot going on for MindTouch - the company's outlook seems to be getting brighter all the time.

http://www.readwriteweb.com/archives/mindtouch_powersup_dekiwiki.php http://www.readwriteweb.com/archives/mindtouch_powersup_dekiwiki.php Mon, 07 Jan 2008 07:52:52 -0800 Marshall Kirkpatrick
The Glory, Bliss and How-to of Screen Scraping for RSS Wired has an awesome top story today on the world of startups utilizing scraped data from big companies to offer new layers of value for their own users. It's a roughly objective piece that I highly recommend reading but it was also inspiration for me to finally record a screencast on the subject (see below).

I love RSS, probably more than anything on the web. If you're not familiar with the concept, see my very old definition of RSS and my almost-as-old post on teaching people about RSS.

Not every page on the web publishes an RSS feed, though. Thus the need for these wonderful screen scraping tools. I've written about a variety of tools you can use to create a feed for a site or page that doesn't have one. Sometimes, though, you've got to pull out the big guns. In those cases, it's time for Dapper.

Dapper is a company founded in Israel that is now venture backed and was named in the aforementioned Wired article. It is the sweetness.

Dapper will let you pull data from almost any web page and get it in a wide variety of outputs, including RSS, email, iCal, a Google Gadget, CSV and Google Maps. Is that incredible or what?
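
To show the last step of that pipeline, here is a minimal Python sketch that turns already-scraped items into an RSS 2.0 document with the standard library. It is not how Dapper works internally; the `build_rss` function, the example.com URLs, and the item titles are placeholders, and the scraping itself is assumed to have happened elsewhere.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, items):
    """Build a minimal RSS 2.0 document from a list of {title, link} dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link, and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = f"Scraped feed for {title}"
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
    return ET.tostring(rss, encoding="unicode")


# Placeholder items standing in for fields pulled off a page.
scraped = [
    {"title": "First headline", "link": "http://example.com/first"},
    {"title": "Second headline", "link": "http://example.com/second"},
]
print(build_rss("Example site", "http://example.com/", scraped))
```

Once the items exist as structured data, swapping RSS for CSV or another output is mostly a matter of changing this final serialization step.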

Let's let the video do the talking. I have an awful cold (it's almost better, Mom!) so please excuse the very rough voice. I made the following screencast using JingProject, setting up an RSS feed of search results in Del.icio.us for articles tagged from ReadWriteWeb.

Clicking on the image below will open up another window so you can view the 4 minute video full screen.

If you're as excited about Dapper as I am, you should check out DapperCamp, a two day free conference all about Dapper coming up in early February in San Francisco. IBM and MindTouch are sponsoring the event and Mitch Kapor is keynoting it. It looks like it's going to be a lot of fun.

Take that, Wired Mag ambivalence! Really, though, you should read that Wired article - it's a good one that discusses some issues that are going to be very big once more people figure out how exciting data portability is.

http://www.readwriteweb.com/archives/screen-scraping.php http://www.readwriteweb.com/archives/screen-scraping.php Mon, 31 Dec 2007 20:57:24 -0800 Marshall Kirkpatrick