Today, at the Semantic Technology Conference, Rob Larson and Evan Sandhaus of the New York Times announced together that the Times will soon be publishing its copious index as Linked Data.
The Times' data will join content from Project Gutenberg, a vast online library of text from public domain books, data from the U.S. census, and information from many other formative and vital entities in the semantic web space. Larson and his team intend to make available hundreds of thousands of tags for content dating back to 1851. This will give developers an invaluable, automatically navigable roadmap for the publication's vast directory of knowledge and will link that data to existing pages, people, and content around the web.
In his keynote address, Larson emphasized "how deeply we [at the Times] care about metadata."
"It's been fundamental to what we do for a long time. We feel we're good at it, but our content is an island... we want to announce our intention to publish our thesaurus to the community under a license that will allow you to use it and contribute your improvements... The results of this effort will in time take the shape of the Times entering this Linked Data cloud. This is wholly consistent with our open strategy... to facilitate access to slices of our data for those who want to include it in their applications."
Larson likened the Times corpus to a quarry of data. He said that the newspaper's API provided the picks and shovels to mine data, and the Linked Data initiative would be the map.
The timing, licensing, format, and other factors of the project are yet to be determined.
This announcement comes on the heels of CNET's partnership with Reuters to publish data to the Linked Data cloud. Moreover, exactly one month ago, we wrote that Linked Data was a concept "whose time has come" and gave a thorough overview of the concepts and standards it entails, for curious readers who would like to drill deeper on the subject.
In another recent interview, Sandhaus detailed the tagging process for the Times' corpus, both for print and online articles:
"There are two types of tagging that go on at the Times... Every day, indexers take the paper and go article by article and associate each article with subject keywords. Then they manually summarize it. It's like a Google list, but in dead tree form. Another type of tagging we do is... when an article goes from the newsroom to the web, it's put there by a producer who will augment the article with any number of rich features like images, multimedia... and subject keywords. Unlike the indexers, who do this completely by hand, the producers are assisted in their tagging by an automated classification system which suggests tags to be applied to the data and which are ultimately approved by the producer."
An official announcement is expected at the Times' Open blog tomorrow, with details on the project to follow.
Comments
A big day for the semantic web! Orchestr8 also announced Linked Data support in its AlchemyAPI semantic tagging solution:
http://news.prnewswire.com/DisplayReleaseContent.aspx?ACCT=104&STORY=/www/story/06-18-2009/0005046519&EDATE=
It's great to see the semantic web find its feet. I've always been skeptical about it as something that will find its way into the whole Internet (or even into most user-generated data in web applications) - but as a way to allow automated functionality around structured data like the New York Times index, it is huge.
Just as interesting as what was said at the announcement is what was not said. It seems that they are not releasing their occurrence data (their index), which would be the logical thing to do if this release were meant to drive traffic to their site. More comment from the conference at http://go-to-hellman.blogspot.com/2009/06/new-york-times-and-infrastructure-of.html