ReadWriteWeb

Amazon Exposes 1 Terabyte of Public Data to Developers

Written by Marshall Kirkpatrick / February 25, 2009 5:26 AM / 32 Comments

Amazon.com changed the retail world. In the process, the company built up so much surplus computing power that it started a dirt-cheap "computing in the cloud" business that changed the computing world. This week the company's newest project, Public Data Sets on Amazon Web Services, began offering more than 1 terabyte (1,000 GB) of fascinating public data for developers to access on the fly through Amazon's cloud computing service.

We're talking about an annotated collection of all publicly available DNA sequences, including the Human Genome, huge amounts of chemistry data, machine-readable encyclopedic entries about millions of different topics and an entire dump of Wikipedia. There's also US Census data, data from the US Department of Transportation and more. It's all accessible by web applications in no time at all. What do you think this is going to change?

The company made a blog post last night announcing the availability of four new public data sets.

This includes data from:

  • The Bureau of Transportation Statistics.

  • DBPedia Knowledge Base - which "currently describes more than 2.6 million things including 213,000 people, 328,000 places, 57,000 music albums, 36,000 films, and 20,000 companies." All in handy semantic markup.

  • The Freebase Data Dump - the giant collaboratively built semantic database on a wide variety of topics, data that high-profile startup Metaweb has spent millions of dollars assembling.

  • The entire English section of Wikipedia, dumped into a machine readable format.

  • A number of large genetic and scientific databases.

We counted up all the databases and the total passed 1 TB of available data. The company says that accessing this data is "trivial" for developers.

What are developers going to do with this data? We can't wait to find out. The prospect of mashing up, cross referencing and user interfacing with this amount of data is nearly unfathomable. Really. This data will be leveraged by all kinds of different web applications, for a long time.

You've read, or can imagine, the impact that the first Public Libraries had on human culture. Now imagine the opening up of not just this, but other libraries of data, so huge that economies of scale blast the project off beyond any analogy that could be drawn with our everyday experience or historical memories. It won't just be Amazon that offers up this kind of data - it will be relatively commonplace soon, we imagine.

It will be like a network of libraries - for robots. Robots that go to the library frequently, read very fast and make serious use of what they've learned.

Congratulations, Amazon, on passing 1 TB of public data made available. May all our robots of the future please live in peace.


Comments


  1. Are you sure it's 1 TB? Just Wikipedia by itself would make up a significant chunk of that space, and since it includes DBpedia (thus duplicating most content)... Terabytes are cheap these days. I have two TB of data just on my home desktop PC.

    Posted by: Nate | February 25, 2009 7:47 AM



  2. You apparently have to have an EC2 account and an active instance to be able to get access to the data, so for the meantime there is a cost attached to getting hold of the data.

    Posted by: Stuart Marsh | February 25, 2009 8:13 AM



  3. to be honest marshall - you might change that title - when i saw it i thought "oh my amazon just got hacked and my info is public!"

    not sure if others got the same idea

    Posted by: Allen | February 25, 2009 8:22 AM



  4. Amazon is a 'Cloud' Pioneer and I appreciate them for offering these datasets to the public. They have a potential to have a huge impact.

    I recommend that there be a public information release on how often Amazon plans to update these files.

    Posted by: Tecue | February 25, 2009 8:59 AM



  5. To be honest, I thought the title was speaking of a vulnerability as well. Had to read the article twice to make the connection.

    And I'm a user of AWS as well.

    Posted by: mtranda | February 25, 2009 9:22 AM



  6. Isn't terrabyte supposed to be spelled terabyte?

    Posted by: Bob Ohsiek | February 25, 2009 9:33 AM



  7. Bob - thank you, you are the only real friend I have!

    Allen - I would have thought the words "customer data" would have given you that impression. But I'll edit the headline.

    Posted by: Marshall Kirkpatrick | February 25, 2009 9:44 AM



  8. Same comment about the title. "Releases" or "Publishes" instead of "Exposes" would be less confusing maybe

    Posted by: Ozh | February 25, 2009 10:29 AM



  9. It's great that Amazon is making all of this data available via EC2 but the data has always been available for developers that were so inclined to use it.

    These are public data sets already distributed by the respective organizations - from what I can tell this is just clustered to add value to Amazon's Web Service offerings.

    Anyone have any sense on whether Amazon did work to mark these up better, cross reference them, or added any particular value besides exposing them from within EC2?

    Posted by: Christian | February 25, 2009 10:48 AM



  10. Please go to http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released

    It will show you which AMI you need in order to access these new datasets. Unfortunately, Amazon is a little light on the details in terms of accessing the datasets they just published.

    Posted by: Allan | February 25, 2009 11:28 AM



  11. Title is scary, but after reading it, feel much better now.

    Rex

    Posted by: Rex | Posted on FriendFeed | February 25, 2009 11:58 AM



  12. Yes, boo to the misleading page title.

    Posted by: exposer | February 25, 2009 1:30 PM



  13. This system should provide a lot of good fodder to make some interesting mashups. Heck you could probably build an entire self-contained system just using Amazon products exclusively at this point:
    -this service for the raw input
    -EC2 and S3 for crunching and storing data
    -Mechanical Turk for recognizing patterns in the output analysis

    On another note, coincidentally today we released the JumpBox for SnapLogic which is essentially an Open Source "Yahoo Pipes" system. I recorded a demo video that helps people get started with it. You don't have to be a developer anymore to make mashups:

    http://blog.jumpbox.com/2009/02/25/introducing-snaplogic-for-data-integration/

    Sean

    Posted by: Sean Tierney | February 25, 2009 1:48 PM



  14. I understood what he meant from the headline immediately... "expose" has a bad rap, apparently. Published, released, etc. would be inaccurate: Amazon is only providing easy access to info which is freely available elsewhere. "Expose" is exactly the right word for that.

    Posted by: Evan | February 25, 2009 2:17 PM



  15. Totally read the same thing. I thought for sure, and had to do a triple take, that Amazon had been hacked.

    But, no, this is VERY VERY interesting.

    And not at all frightening like the usual digital security issues that we are bombarded with each and every day.

    You ever read http://www.justaskgemalto.com? Anyway, it's not fun.

    But yes, the cloud is very interesting.

    Posted by: Janet Altman | February 25, 2009 3:31 PM



  16. Did Amazon just make Freebase irrelevant?

    Posted by: David | February 25, 2009 6:42 PM



  17. @fbase just pinged me to say:

    @daveevans not at all! We make that data available for anyone to use: http://download.freebase.com/datadumps/

    How great is Twitter?

    The data is coming, but where is the interface to make sense of it all and do useful/interesting things with it?

    For years I have had visions of a Photoshop-like canvas with palettes made up of various public and private datastreams, feeds and updates. Perhaps now is the time to build this out.

    I would love to connect with someone to collaborate on this.

    Posted by: David | February 25, 2009 7:03 PM



  18. i also thought it was some kind of user-data leak like account passwords or credit cards.

    this is a great thing, though, no doubt about it. it would seem problematic to continually update changing data like with Wikipedia but i'm sure they have it figured out.

    terabyte is the spelling, and really, that isn't much data these days, but if it's all ASCII text files, i think that's about a billion typical book pages or so. but if all those Wikipedia images are in there and any video, that could go quick!

    kudos to Amazon for sharing the wealth.

    Posted by: bruce | February 25, 2009 7:12 PM
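
As a rough check on the "billion typical book pages" estimate in the comment above, here is a back-of-the-envelope sketch. The characters-per-page figures are assumptions chosen for illustration, not numbers from the article or from Amazon, and the result only applies if the data really were plain ASCII text at one byte per character.

```python
# Back-of-the-envelope check: how many book pages fit in 1 TB of plain ASCII text?
# Assumptions (not from the article): 1 byte per ASCII character and roughly
# 1,000 to 2,000 characters on a "typical" printed page.

TERABYTE_BYTES = 1000 ** 4  # 1 TB = 1000 GB, the decimal definition the article uses


def pages_per_terabyte(chars_per_page: int) -> int:
    """Return the number of whole pages of ASCII text that fit in 1 TB."""
    return TERABYTE_BYTES // chars_per_page


for chars in (1000, 2000):
    print(f"{chars} chars/page -> {pages_per_terabyte(chars):,} pages")
```

Under these assumptions the estimate lands between half a billion and a billion pages, so "about a billion" is the right order of magnitude. Images, video or binary formats would shrink that figure quickly.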



  19. This seems like a good step forward for the 'Web of Data'.

    Posted by: Richard | Posted on FriendFeed | February 25, 2009 8:22 PM



  20. Persistent Content!

    Posted by: Marc Canter | February 25, 2009 8:33 PM



  21. I really apologize this because AWS exposes and makes redundant another bunch of data sets, like Yahoo! does in their Yahoo! Webscope Datasets Catalog.
    BUT when I get this http://aws.amazon.com/publicdatasets right, the data sets are only exposed as dumps and thus, have to be made accessible by a customer via EC2 by mounting them and writing an application to make something out of the data. This customer of course has to pay for the traffic produced on his nodes. So Amazon wants to charge for data sets that are exposed for free somewhere else? Cunning.
    What would it mean to make the data /accessible/? With an accessible data set I mean a data set that is hosted somewhere and that can be in-depth accessed meaning that a machine, e.g. deep-dives into DBPedia to access semantic knowledge about Berlin. The details found there typically link to other data points within the same data set or of a totally different and remote one. Typical use-cases are semantic web browsers or semantic search engines. This is called the Web of Data. For details you should check out the Linking Open Data project (http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData). You may also want to see Tim Berners-Lee's 2009 TED slides: http://www.w3.org/2009/Talks/0204-ted-tbl/#(1) (just click on screen to get to the next slide). Maybe you don't know yet Marc Canter's Open Mesh (http://blog.broadbandmechanics.com/2008/05/how-to-build-the-open-mesh)?
    For those still reading this lengthy comment: In my article A Trilogy of Webs For Machines (http://www.archive.org/details/ATrilogyOfWebsForMachines) I draw an even bigger picture of the Web of Data, the Web of Services and the Web of Identities playing together and making real a new generation of services. Your feedback is welcome (http://www.linkedin.com/in/akorth, http://twitter.com/alexkorth)!

    Posted by: Alexander Korth | February 26, 2009 4:23 AM



  22. So as a representative of one of the organizations that actually puts data out in the public domain, I have to say that this is just a storage device. The public use Census files are basically just scraped from the census site without even the barest amount of documentation and little or no guidance on what it is or how to use it. The only useful application of this was if you wanted to "dump" big files but really didn't care if they were usable by human beings. The storage is great ---the usability not so great. I will be enamoured with it when they actually make the files usable ---right now about 5% of the people who might be interested in these data files can actually use them.

    Posted by: Felicia LeClere | February 26, 2009 12:33 PM



  23. "Exposes" is the keyword that renders this headline confusing; I also thought it was about a hack.

    This is about amazon publishing data to developers for mashups; they're not exposing anything, they're just publishing it.

    Posted by: ResistToday.com | February 26, 2009 12:38 PM



  24. I am a user of Amazon S3, and I also thought that Amazon had a 1 TB leak of sensitive user data. I was appalled that a 2 billion dollar infrastructure would be so vulnerable to a potential breach of security. I am glad to hear that my files are safe and that Amazon is just opening up more information to the public; that is wonderful due diligence from Amazon, since they have this data readily available to them. I am wondering who will be the first to use this data in a significant, newsworthy way.

    - Scott

    Posted by: Scott Haines | February 26, 2009 12:42 PM



  27. The results from these groups gives him an insight on what words to strategically use in a campaign. Do these sound familiar? "The Change We Need and Yes We Can. Barack won the presidency using these words. All I'm saying is would he have won by saying "My Friends"? McCain didn't.

    Posted by: 花蓮民宿 | March 5, 2009 8:36 AM



  31. hmm k .very good.

    Posted by: rap dinle | August 31, 2009 7:13 AM



  32. Wouldn't it be great if all the medical record formats were also available as a public data set. S3 is the perfect place for a medical records hub and supports Obama's plan to automate medical records.

    Posted by: Multivitamins | December 5, 2009 9:41 PM


