Amazon is turning to the public for help, asking for public data sets in an attempt to create a cloud data service that provides what they describe as a "convenient way to share, access, and use public data."
Called AWS Hosted Public Data Sets, the service will enable you to use public data within your Amazon EC2 environment. Select public data sets will be hosted on AWS for free as an Amazon EBS snapshot.
While there are publicly available data sets, accessing them can be expensive and tedious. For instance, the Gutenberg Project offers its eBook files as a download, but to get a copy you can expect to wait 48 hours for the download to complete (based on a 1 Mbit/s DSL connection and a 14.5 GB zip file). If you want the MP3 files, you'll have a nine-day wait to download the 91.5 GB file.
However, as there is no indication that the Gutenberg Project will be added to AWS, we've calculated how long it would take to download and upload the 80GB UGI Virtual Conformer Library, one of the listed data sets AWS plans to host.
Using a residential cable provider in California, it would take 22 hours 36 minutes to download, and 3 days 36 minutes to upload to a server in the same state. However, if the server were in New York and we accessed it from California, it would take 3 days 42 minutes to download, and 7 days 14 hours to upload. Clearly inefficient.
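To make those figures easier to compare, here is a minimal Python sketch that works backwards from the quoted transfer times to the average throughput each one implies for the 80GB library. It assumes decimal units (1 GB = 10^9 bytes, 1 Mbit = 10^6 bits) and ignores protocol overhead. The function names are illustrative and are not part of any AWS tool.

```python
# Rough sketch: average throughput implied by the quoted transfer times.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 Mbit = 10**6 bits).

BYTES_PER_GB = 10**9
BITS_PER_MBIT = 10**6


def duration(days=0, hours=0, minutes=0):
    """Convert a days/hours/minutes duration to seconds."""
    return days * 86400 + hours * 3600 + minutes * 60


def implied_mbit_per_s(size_gb, seconds):
    """Average rate (Mbit/s) needed to move size_gb in the given time."""
    return size_gb * BYTES_PER_GB * 8 / seconds / BITS_PER_MBIT


def transfer_seconds(size_gb, mbit_per_s):
    """Idealized time (seconds) to move size_gb at a constant rate."""
    return size_gb * BYTES_PER_GB * 8 / (mbit_per_s * BITS_PER_MBIT)


# Transfer times quoted above for the 80GB data set.
quoted = {
    "same-state download": duration(hours=22, minutes=36),
    "same-state upload": duration(days=3, minutes=36),
    "cross-country download": duration(days=3, minutes=42),
    "cross-country upload": duration(days=7, hours=14),
}

for label, seconds in quoted.items():
    rate = implied_mbit_per_s(80, seconds)
    print(f"{label}: about {rate:.2f} Mbit/s on average")
```

The same `transfer_seconds` helper can be used in the other direction, to estimate how long a data set of a given size would take at a known connection speed. Treat the result as a lower bound, since real connections rarely sustain their nominal rate.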
People have been searching for better ways to access public data sets for some time, and AWS Hosted Public Data Sets may just be the answer they've been looking for, allowing anyone to do the type of computing that in the past has been limited to large organizations with lots of money.
Current data sets that Amazon is working on include: annotated Human Genome data, PubChem and UGI Virtual Conformer libraries, the U.S. Census, various labor statistics, and various economic and transportation databases.
AWS will continue to add to the collection over time, and this is where you come in.
If you have a public data set and hold the rights to distribute it, you can submit a request on the AWS Hosted Public Data Sets site to have it included.
This is huge.
Comments
http://manybooks.net is essentially a nicer interface for the Gutenberg Project (plus other open books the webmaster manages to dig up) and, from what I understand, he's caching any of the format-specific generated books to S3.
His service offers a whole slew of formats for each book, but he only generates a given format the first time someone actually requests that book in that format. He caches that format to S3 and serves it from there from then on.
Posted by: Walker Hamilton | November 24, 2008 6:34 AM
It is a wonderful new resource, especially for fans and users of EC2. If Amazon could develop a good process for submitting data sets, that would open the service to even broader use.
Posted by: michael.chelen.myopenid.com | November 24, 2008 7:44 AM
Downloading electronic copies of complete digital books through the Gutenberg Project is instantaneous (it never takes more than a few seconds) and is completely free. You can choose from several formats, including a basic text file, which makes searches within the text simple and efficient.
So obviously you are describing something much more elaborate here regarding smarter technologies to search, access and distribute data that is now not easily available in deep web databases?
In the past Amazon on-line services had limitations based on whether or not users made on-line Amazon purchases. Is this similar?
Posted by: Maureen Flynn-Burhoe | November 24, 2008 10:07 AM