ReadWriteWeb

Digital Information 250 Years From Now

Written by Josh Catone / April 12, 2008 9:29 AM / 26 Comments

The US National Archives and Records Administration (NARA) has apparently decided to end its policy of taking a "digital snapshot" of all public congressional and federal web sites after each congressional and presidential term. According to NARA, which is understandably drawing heat for the policy change, it shouldn't need to archive those web sites because federal agencies and Congress should be doing their own archiving. I came across the NARA news just after reading a very timely piece from Leland Rucker about the nature of information archiving in a totally digital world, and it got me wondering: what happens to all this content on the web 250 years in the future?

Last year Google's archives touched 100 exabytes of data from the web. To put that in perspective, that's about 107 billion gigabytes (or more than half a billion 200 GB hard drives). The entire catalog of the Library of Congress is about 136 terabytes -- which makes Google's archive the data equivalent of roughly 771,000 Libraries of Congress.
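
For anyone who wants to check the conversions in the paragraph above, here is a rough sketch in Python. It assumes binary units (1 GB = 2^30 bytes, 1 TB = 2^40, 1 EB = 2^60), which is our assumption rather than anything stated by Google or the Library of Congress; the variable names are purely illustrative.

```python
# Binary unit sizes (assumption: 1 GB = 2**30 bytes, and so on).
GB = 2**30
TB = 2**40
EB = 2**60

archive_bytes = 100 * EB   # Google archive figure quoted in the post
drive_bytes = 200 * GB     # one 200 GB hard drive
loc_bytes = 136 * TB       # Library of Congress estimate quoted in the post

# How many gigabytes, drives, and Libraries of Congress that works out to.
archive_in_gb = archive_bytes // GB
drives_needed = archive_bytes / drive_bytes
loc_equivalents = archive_bytes / loc_bytes

print(f"{archive_in_gb:,} GB")
print(f"{drives_needed:,.0f} hard drives of 200 GB each")
print(f"{loc_equivalents:,.0f} Libraries of Congress")
```

With decimal units (1 GB = 10^9 bytes) the gigabyte total would come out to an even 100 billion, so the exact numbers depend on which convention you pick.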

So clearly, there is a lot of data out there to be stored. And the vast majority of that data isn't printed -- it is being stored digitally and created on computers via email, forums, social networks, blog posts, video sharing, bookmarking, chat, etc. A lot of that data isn't necessarily something we need to save (who needs an archive of every email I send to my mom, for example?), but what of the data that we do want to keep for the future? The posts on this blog, or thoughtful debates taking place on forums, or breaking news videos published on YouTube, for example.

The Internet is very transient in nature; things often move at a breakneck pace. The main page of a blog like ReadWriteWeb might change 10-15 times in a day. The main page of CNN.com might change far more than that. How do we archive information when the technology to read it, and indeed the information itself, changes so fast?

About 200 years ago, Thomas Jefferson sold his personal library of 6,000 books to the Library of Congress. About 150 years ago, more than half were destroyed in a fire. But today, all 6,000 of them have been recovered or recreated and will go on display at the LoC. Now we're living in the so-called information age, where almost a gigabyte of new data is being created each year for every man, woman, and child on earth. But what's going to happen to it all 250 years from now? "Is digital content too ephemeral to last?" wondered Leland Rucker. Will digital information have the same lifespan as printed books?

We'd love to hear your thoughts on the matter, so please let us know in the comments what you think the future holds for the massive flood of information we're creating today.

Comments

  1. More bits to come.. better tools available also.. no worries.

    Posted by: YDRIVE | April 12, 2008 10:07 AM



  2. I've been involved in a project to build permanent institutional archives for several European libraries. I'm very interested in how we are going to archive and safeguard all the wonderful things we have online today. It was complicated enough to build an archive for a national library - I can't quite imagine how complicated it will be to build one for important content found online. The sheer amount of data is breathtaking!

    Posted by: Darko | April 12, 2008 10:38 AM



  3. You might want to revisit your history books on the Jefferson facts. 300 years ago -- 1708 -- Thomas Jefferson was still 35 years from being born. He wasn't president until 1801, didn't donate the books until 1815 and that was to replace the books that had been burned a year earlier. His own books were not burned.

    http://wiki.monticello.org/mediawiki/index.php/Library_of_Congress_Sale

    Read before you write on the web, readwriteweb.

    Posted by: Evan Wired | April 12, 2008 11:21 AM



  4. The first thought is: do we need it?
    And the answer is not for every website, but for important ones.
    However we should be archiving only incremental content. Both of my above propositions decrease the data to be archived by some percentage. However, the data to be archived is too high. Which means that we will be needing either more silicon or a new way to store data.
    (I myself will vote for storing data in silicon imported from moon..yes moon)
    But I think again, do we need it?

    Posted by: Varun Mahajan | April 12, 2008 11:50 AM



  5. We have nothing to worry about, as far as storage is concerned. Bigger and faster storage devices will always come out. It's the retrieval of information that is still up in the air. Search tools will have to be extremely quick and efficient and probably won't resemble search as we know it today.

    Posted by: Michael Vu | April 12, 2008 12:29 PM



  6. What guarantee do you have that it will be readable at all? Paper, if preserved, can still be read 250 years from now, but what guarantee do we have that the formats we have today will be around in fifty years? I have yet to see much standardization on formats. There is a real need for direction here. MSOOXML or even ODF is wildly insufficient.

    Posted by: Don Watkins | April 12, 2008 12:35 PM



  7. @Evan: That was just a math error on my part (hey, the last math class I took was in 10th grade, cut me some slack). Thanks, though. Fixed.

    Also the books he donated were burned in 1851. Click through to the Washington Post article linked in the post above. "[Jefferson's books] eventually became the foundation for the Library of Congress, although two-thirds were lost in a fire in 1851."

    Posted by: Josh Catone | April 12, 2008 1:10 PM



  8. I've been thinking about this problem and the one Don mentions above for the past couple of years and believe I have a solution for all online content that I'm working towards at the moment.

    All information is a transformation of one form or another; convergence with such vast amounts of information is the key to efficiently storing it. Peer distribution, locality-caching and management of that distributed data using content-centric networking practices is where I believe we're headed.

    Time will tell. There is a lot to factor into such a system that can itself evolve over time in a way that affords systems of the future to reach back into the past as if it were the present. I hate to draw an analogy of a time machine; however, that is the kind of digital archival we'll need if we want to maintain information context with the ability to execute applications on platforms currently in existence. It's not a matter of just storing the information anymore but the runtime and platform (hardware configuration) itself as well. Not for everything, but nonetheless as much as we deem necessary to want to maintain for future generations.

    Posted by: Craig Overend | April 12, 2008 1:15 PM



  9. Excellent article, Josh!

    This is a very complex question. There are two main issues:

    1. As Don and Craig mention above, formats change very quickly, as do storage media - quickly in the historical/archaeological sense of time, not Internet-time! For example, can you even find a reader for an 8-inch floppy today? And that's only 30-year-old technology.

    2. In the context of the Library of Congress documents and others (this may not apply to numbers or scientific equations), language itself also changes very rapidly. It's hard to read 100-year-old English texts, and 3000+-year-old Egyptian hieroglyphics are even more obscure!

    There are some very bright people working on these issues already.

    For #2, the Rosetta Project of the Long Now Foundation is creating a modern Rosetta stone (Rosetta Disk) that will allow future generations to decipher the keys of all currently known languages.

    There are many others working on these problems as well.

    Similar initiatives are taking place in the off-line world, e.g. a project to save all types of seeds at the North Pole!

    Posted by: NitinK | April 12, 2008 7:24 PM



  10. I don't know about 250 years; even 25 is hard to guess.

    What I do know is that I've met the limits of trusting all digital recently:

    - Old Works files that I can no longer open
    - Found my University thesis on a 5.25" floppy disk
    - Could not access information from fax files because they were sent from an old PC using a program I no longer have.

    The list could go on. We need to think of formats and make sure everything is in a format that will "survive" - it's not obvious and certainly won't take care of itself :-(

    Posted by: Zoli Erdos | April 12, 2008 7:52 PM



  11. There's no need to archieve anything... everything will come to pass, according to the christian bible.. should that be the case, why bother?

    Posted by: 113.com | April 12, 2008 11:22 PM



  12. >There's no need to archieve anything...
     There's no need to archive anything...

    Posted by: 113.com | April 12, 2008 11:24 PM



  13. Interesting post. I'd say the storage density is expected to go up as well:
    http://highfalcon.blogspot.com/2008/04/future-of-digital-storage.html

    Posted by: Manpreet S. | April 12, 2008 11:31 PM



  14. The question here is being dealt with when it comes to public and closed government archives. There are several standards addressing these problems. The general question of long-term preservation of digital information is high on the agenda in several Norwegian government projects, as it is in many other European countries. Look at LongRec and SERES (i.e. semantic registry for electronic interoperability - only in Norwegian, sorry!) as two examples of these.

    As to the question of how websites are going to archive their site content, I think this is a responsibility of the publisher. There are some open source projects like Fedora Commons, DSpace and EPrints that can serve as backend repositories for CMS-driven sites.

    Another question is how we as individuals take care of our IM conversations, webmail accounts and such. And if you bloggers out there demanded that Blogspot, WordPress, etc. agree on a standard archive format, it would help preserve your blogs.

    Posted by: Ståle Prestøy | April 13, 2008 4:54 AM



  15. There is a great feature by the History Channel called "Life After People" http://www.history.com/minisites/life_after_people/

    It discusses the fact that without electricity and human-maintained indoor conditions, everything mankind has published, whether digital, plastic, paper or photographic, will be destroyed within a couple of centuries. The Egyptian civilizations may be thousands of years behind us, but they have left written remains that few of our modern knowledge storage methods could outlive without constant management.

    Because sustaining information live on the net or otherwise requires constant upkeep, in a couple of decades we will see a natural selection of data. What matters today but does not matter in the future will be left behind as the important information is replicated again and again for preservation. The big question is how we keep the information we have generated alive after we have passed away.

    Maybe someone smart will one day offer a 1,000-year hosting and domain service package :) I wonder who will be the first one to write in a will, "And I dedicate $X for the upkeep of my blog domain and host for the next 200 years."

    Posted by: Diana Z. | April 13, 2008 12:14 PM



  16. Reading the article brought to mind the paper shredder, and its digital echo, the "crash." I am a writer, I generate words. My professional papers are stored at a university. I laugh at the thought of someone retrieving or using any of the information from any individual, especially one who is not a world figure. What is the purpose, the use? Shakespeare wrote, "The evil that men do lives after them; the good is oft interred with their bones." Who will be the audience in 250 years? Will there be literate sensate beings to read-view, comprehend and interpret?

    Posted by: Gwendoline Y. Fortune | April 13, 2008 12:27 PM



  17. The problem of digital preservation and digital archiving is one that many great brains are working on. A **few** of the projects and models are: the Open Archival Information System (OAIS http://nost.gsfc.nasa.gov/isoas/), preservation data grids (http://www.archives.gov/era/), and the work of the Digital Curation Centre in the UK (http://www.dcc.ac.uk/). The movie industry addressed the problem recently with a report by the Academy of Motion Picture Arts and Sciences entitled "Digital Dilemma" (http://www.variety.com/article/VR1117975368.html?categoryid=1009&cs=1).

    Outside of the problems of format obsolescence, software evolution/changes, hardware upgrades, et al, there is the problem of cost. Even the movie industry has stated that they will have to revise their "save everything" policy (useful for film) because the cost of storing "everything" was 1,100% (that is not a typo) higher than for film masters. And this from an industry that, unlike libraries and archives, has deep pockets.

    My guess is that NARA may not have the staffing and financial resources to take digital snapshots of the U.S. government at pre-assigned intervals.

    Posted by: Jewel Ward | April 13, 2008 2:53 PM



  18. "But today, all 6,000 of them have been recovered or recreated and will go on display at the LoC."

    Please check your facts. They are still missing over 300 books for this display. A simple call would have confirmed this. It just illustrates, again, that much of the information on the WEB is unreliable, unreviewed and, in many cases, simply and utterly useless, which makes it extremely dangerous.

    See #3 PLEASE

    Posted by: Factsy | April 13, 2008 6:33 PM



  19. @Factsy #18: I think you're missing the point of the post.

    But like Evan #3, I'll also refer you to the Washington Post piece, which reads, "The entire collection of more than 6,000 volumes -- some originals and some replacements -- will go on display tomorrow at the Library of Congress, looking much as it would have 200 years ago."

    This post, though, was not about the LoC exhibit. Rather it was a musing on the transient nature of digital information.

    Posted by: Josh Catone | April 13, 2008 7:08 PM



  20. Facts are facts...and the Washington Post is well known for its distortion of facts. Use a telephone...it's old technology, but quite useful in making sure your facts are indeed facts.

    Posted by: Factsy | April 13, 2008 9:34 PM



  21. @Factsy: I think you'd have a valid point if this post was at all about the Jefferson exhibit. But it wasn't. There is no reason for me to doubt the Washington Post -- one of the nation's oldest and most respected newspapers, regardless of your opinions of it -- as a sound secondary source.

    I'm not sure why anyone would expect me to call the Library of Congress and say, "The Washington Post is reporting that all the books in the Jefferson exhibit were restored or recreated, but for no apparent reason I doubt their reporting and want to know if maybe a few of the books are still missing? This has no bearing on the point of the piece I am writing, which isn't about your exhibit, by the way."

    If I was writing about the Library of Congress exhibit, then yes, there would be reason to call and get the facts straight from them. But I wasn't writing about the exhibit, and there was absolutely no reason to check the Post's reporting for such a trivial part of this post -- and especially for such a random factoid (as in, why would anyone randomly assume that 300 of the books weren't recovered unless they already knew that?).

    Sorry if this sounds at all condescending, but your nitpicking is really unwarranted in my opinion.

    Posted by: Josh Catone | April 14, 2008 5:59 AM



  22. Thanks for picking up on my thread, Josh. And after reading the comments here, I'll admit to not being very optimistic about finding real answers to the questions posed. That "great brains" are working on it doesn't make me feel optimistic. How about you?

    Posted by: Leland Rucker | April 14, 2008 3:08 PM



  23. I recently posted about some of these issues on my blog under "The sustainable web"

    It probably won't come as much of a surprise that those who have historically done a great deal of work in provenance, record keeping and archiving are ahead of the game here - libraries and archives.

    Posted by: Fiona | April 15, 2008 2:16 AM



  24. who needs an archive of every email I send to my mom, for example?

    Social history archivists.

    Honestly, you'd be surprised at what people treat as important, and rightly so.

    Posted by: Dan Eastwell | April 16, 2008 4:24 AM



  25. We read with interest your postings on this topic.

    The National Archives and Records Administration (NARA) has posted background information regarding our web harvest decision at http://www.archives.gov/records-mgmt/memos/nwm13-2008-brief.html. This background document includes links to our guidance products related to web records and the decisionmaking process we went through to arrive at our decision.

    Paul M. Wester, Jr.
    Director, Modern Records Programs
    National Archives and Records Administration

    Posted by: Paul Wester | April 16, 2008 4:37 AM



  26. As someone who has been working on digital preservation issues in an archival context for the last 6 years, I can say confidently that Leland Rucker should do a bit more research before making such uninformed generalisations. As others have said here, many people and institutions have been and are working on the issues to do with the preservation of digital objects. Leland's post at #22 is unnecessarily rude and condescending, and very ignorant in the way it dismisses out of hand a lot of exciting, valuable, and significant work that has been progressing the whole question of how to go about preserving digital content.

    Posted by: Droo | May 11, 2008 3:55 PM


