Problems with Preservation

By Nathan Mulac DeHoff

Abstract

This paper explores the various issues involved in digital preservation. First, the reasons why preservation is a problem are explored. These include difficulties with systems, storage media, databases, and standards. Next, some possible solutions are identified. The focus is primarily on migration, emulation, and encapsulation, but a few other techniques are mentioned. The conclusion considers which approach may best address the problems of digital preservation.

Introduction

Digital preservation is a significant problem in the world of digital libraries. In such libraries, data are stored electronically. At first, this might sound beneficial, as electronic data will not deteriorate or become lost or stolen, as books often do. The problem arises when digital obsolescence is taken into consideration. The world of computers and the Internet is a rapidly changing one. New technologies constantly come into being and, in a relatively short time, completely supersede the old. Works written in older word processors and saved on five-and-a-quarter-inch floppy diskettes are almost as useless to most researchers today as are books written in dead languages. In fact, they are arguably less accessible: many scholars can read Latin, but recovering digital documents created with obsolete programs requires a laborious process of converting them to newer forms, with further updates required every few years.

The Problems

Jeff Rothenberg presents digital preservation as somewhat of a paradox, since “digital documents can be copied perfectly, which is often naively taken to mean that they are eternal,” yet “most digital documents and artifacts exist only in encoded form, requiring specific software to bring their bit streams to life and make them truly usable; as these programs (or the hardware/software environments in which they run) become obsolete, the digital documents that depend on them become unreadable—held hostage to their own encoding.” The data themselves essentially last forever, but the material required to access them does not last long at all, making digital preservation a less remarkable innovation than it might otherwise appear. Both the programs and the media become obsolete: competing software companies constantly release new programs and new versions of old ones (to say nothing of the rapidly changing nature of computers themselves), and storage media are improved at an incredible rate. As Rothenberg indicates, “[t]he short lifetimes of eight-inch floppy disks, tape cartridges and reels, hard-sectored disks, and seven-track tapes, among others, demonstrate how quickly storage formats become inaccessible.” The rapidly changing nature of computer software and hardware makes it difficult to keep any information for an extended period of time.

Not only do most electronic storage media become obsolete fairly quickly, but they also tend not to last very long. Items such as floppy diskettes and compact discs are rather fragile. While they do not necessarily wear out easily, they are susceptible to many outside forces that can render them useless. Margaret Hedstrom mentions “binder degradation, magnetic particle instabilities, and substrate deformation” as problems with magnetic media, and “high humidity, rapid and extreme temperature fluctuations, and contamination from airborne particulate matter” as damaging to optical media. For these problems to be resolved, new storage media must come into existence. One type mentioned by Hedstrom is High-Density Read-Only Memory, developed at the Los Alamos National Laboratory. It is said to be “impervious to material degradation and it requires no bit stream interpreter because the technology can describe in human-readable form all of the instructions needed to interpret the data (LANL Ion Beam Storage).” No digital media are completely indestructible, but it is quite possible that technology will eventually develop beyond the need for compact discs and the like. Until then, though, the unreliability of these storage media remains a major problem for digital preservation.

The article by Reagan Moore et al. touches upon the problems associated with databases, in addition to simple documents. Not only must each and every document be converted to work with current technology, but the catalog must also be converted and updated. Moore writes, “The organization of the data into collections must also be preserved in the face of rapidly changing database technology. Thus each collection must be migrated forward in time onto new data management systems, simultaneously with the migration of the individual data objects onto new media.” This is an additional task for those updating files or those working to combat digital obsolescence in the first place. Organization and databases are essential parts of any library, digital or otherwise, and keeping old digital files readable accomplishes little if the documents are not sufficiently archived and indexed so that future researchers can find and read them. The article identifies the three aspects of a digital collection that must be kept up to date as the digital object, data collection, and presentation representations. The object representation consists of the unique data regarding the format and context of each digital item. The data collection representation is “typically a subset of the attributes associated with the digital objects.” The attributes should be stored in metadata, so that their associations can be re-created in future versions of the database. The presentation representation relates to the interface that a researcher uses to access the information in the digital objects. According to Moore, “[r]e-creation of the original view of a collection is a typical archival requirement.” Data should not be changed much from their original form, or they will lose almost as much as would a translation or reprint of a book. For this reason, preservation of a document’s original form is a significant issue in the field of digital preservation.
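
To make these three representations concrete, here is a minimal sketch in Python; all field names and values are hypothetical illustrations, as Moore's article does not prescribe any particular encoding.

```python
# Digital object representation: per-item format and context attributes.
digital_object = {
    "identifier": "doc-0001",
    "format": "text/sgml",          # hypothetical format label
    "created": "1998-06-12",
    "context": "correspondence series, box 4",
}

# Data collection representation: a subset of the object attributes,
# recorded as metadata so the collection's organization can be rebuilt
# on a new database system.
collection = {
    "name": "administrative-records",
    "indexed_attributes": ["identifier", "format", "created"],
    "members": ["doc-0001"],
}

# Presentation representation: how a researcher's original view of the
# collection is to be re-created.
presentation = {
    "interface": "finding-aid",     # hypothetical interface type
    "sort_order": "created",
    "display_fields": ["identifier", "created"],
}
```

Because all three structures are plain attribute sets, each can be migrated onto a new data management system independently of the bits of the objects themselves, which is exactly the separation Moore's article calls for.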

One major problem with nearly any possible solution to the technological obsolescence issue is that there are no national standards for preservation of digitally stored materials. In the words of Hedstrom, “digital library research has focussed on architectures and systems for information organization and retrieval, presentation and visualization, and administration of intellectual property rights (Levy and Marshall). The critical role of digital libraries and archives in ensuring the future accessibility of information with enduring value has taken a back seat to enhancing access to current and actively used materials. As a consequence, digital preservation remains largely experimental and replete with the risks associated with untested methods; and digital preservation requirements have not been factored into the architecture, resource allocation, or planning for digital libraries.” Some libraries have tried to set standards for software and storage files, such as the use of common formats for databases and images. Hedstrom declares, “The strategy rests on the assumption that software products which are either compliant with widely adopted standards or are widely dispersed in the marketplace are less volatile than the software market as a whole. Most common commercial products today provide utilities for backward compatibility and for swapping documents, databases, and more complex objects between software systems.” This strategy does not, however, eliminate the need for migration of files. They must still be transferred as standards change and newer, better programs come into existence. For this reason, standards should remain fairly constant, or at least change in ways that allow methods such as emulation and encapsulation to remain effective as programs continue to evolve.

Possible Solutions

How can the problem of digital obsolescence be solved? According to Rothenberg, one suggestion is that digital documents simply be printed out on paper. This, as Rothenberg himself agrees, seems rather ridiculous. Both paper and microfilm, however, are considerably more stable than current digital storage media. Hedstrom writes, “It seems ironic that just as libraries and archives are discovering digital conversion as a cost-effective preservation method for certain deteriorating materials, much information that begins its life in electronic form is printed on paper or microfilm for safe, secure long-term storage. Yet, high-quality acid neutral paper can last a century or longer while archival quality microfilm is projected to last 300 years or more. Paper and microfilm have the additional advantage of requiring no special hardware or software for retrieval or viewing.” Still, this is not a solution but merely a back-up procedure, and a destructive and wasteful one at that. Rothenberg’s argument against this so-called “solution” makes sense. He states that “[p]rinting any but the simplest, traditional documents results in the loss of their unique functionality (such as dynamic interaction, nonlinearity, and integration), and printing any document makes it no longer truly machine-readable, which in turn destroys its core digital attributes (perfect copying, access, distribution, and so forth).” Printing digital documents also requires a large amount of paper and would lead to all of the same problems experienced with books, journals, and other print resources (deterioration, space constraints, etc.). To restore them to a digital format, they would have to be re-entered into a computer, a remarkably time-consuming process. While print copies of the more important documents should certainly be kept, printing will not solve the problems associated with digital and technological obsolescence.

Along the same lines as making hard copies is another procedure, one that retains the digital nature of electronic documents but simplifies them. This, as Hedstrom states, is to save the files in “the simplest possible digital formats in order to minimize the requirements for sophisticated retrieval software.” The cost of this process is low, and data can easily be transferred from one program or system to another without significant loss of content. The main problem is that many digital materials, especially multimedia files, are not simply textual and numeric data, and require a higher level of programming than an ASCII text file or other simple form would supply. Hedstrom makes explicit that this solution is effective only “where retaining the content is paramount, but display, indexing, and computational characteristics are not critical.” An interesting point made by Rothenberg, which adds to the argument against this method, is that digital preservation methods cannot be limited to text. Many multimedia documents are being created today. Rothenberg’s article declares that “the generation of multimedia records has increased rapidly in recent years, to include audio recordings, graphical charts, photographic imagery, and video presentations, among others,” and that “multimedia and hypermedia records are likely to become ever more popular and may well become dominant in the near future.” To be truly useful, these multimedia and hypermedia documents must be preserved, which would require a more complex and universal solution than one that simply captures text.
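
A minimal sketch of this “simplest possible format” strategy, assuming a hypothetical marked-up source document, shows both its appeal and its limitation: the textual content survives the reduction, but everything Rothenberg worries about (display, interaction, multimedia) is discarded.

```python
import re

def to_plain_text(marked_up):
    """Strip SGML/HTML-style tags, keeping only the textual content."""
    text = re.sub(r"<[^>]+>", "", marked_up)   # drop all tags
    return " ".join(text.split())              # normalize whitespace

original = "<p>The <b>1996</b> task force report.</p>"
print(to_plain_text(original))  # -> The 1996 task force report.
```

The output needs no sophisticated retrieval software, but note that the bold emphasis, the paragraph structure, and any links or embedded media are gone for good.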

Hedstrom mentions the possibility of simply retaining old computer hardware and software, so that updating files would be rendered unnecessary. She states that this approach “would support replay of original sources and contribute to the preservation of software as a significant cultural and intellectual resource in its own right.” Hedstrom makes it clear that feasibility studies should be done before putting this into practice, as it would certainly cause problems. It could be expensive, and it would result in an enormous amount of hardware and software being kept by libraries. Maintenance would also be difficult: computer hardware and storage media tend not to last very long, and it would be hard to find anyone to repair an old computer system. Rothenberg and Granger (“Emulation”) agree; Granger notes that Rothenberg cited the cost, the need to keep building new device interfaces, and the limited lifetime of computer microchips as arguments against the idea.

Perhaps the three most significant strategies for digital preservation are migration, emulation, and encapsulation. Migration is essentially the copying of data to a new generation of computer hardware and software. One disadvantage of this technique, according to the Preserving Access to Digital Information (PADI) website, is that “[m]igration to new operating environments often means that the copy is not exactly the same as the original piece of information.” Moore’s article addresses the problem of time consumption in migration. It mentions that “[t]he concern is that when the data storage technology becomes obsolete, the time needed to migrate to new technology may exceed the lifetime of the hardware and software systems that are being used. This is exacerbated by the need to be able to retrieve information from the archived data.” Not only is copying documents time-consuming, but it is also far from foolproof. As Rothenberg states, “the copy process must avoid corrupting documents via compression, encryption, or changing data formats.” It is a difficult process, and not one that should have to be repeated every few years.

Different methods of migration include transferring data to non-digital media, using standard digital formats (a somewhat risky choice in a world of ever-changing standards), and using software that can decode data from older programs or older versions of the same program (as recent versions of Microsoft Word can do). Backward compatibility is not always possible, though. The Commission on Preservation and Access makes it clear that “copying depends either on the compatibility of present and past versions of software and generations of hardware or the ability of competing hardware and software product lines to interoperate. In respect of these factors -- backward compatibility and interoperability -- the rate of technological change exacts a serious toll on efforts to ensure the longevity of digital information.” The report goes on to mention that “it is costly and difficult for vendors to assure that their products are either ‘backwardly compatible’ with previous versions or that they can interoperate with competing products.” Despite its flaws, however, migration is probably the most feasible solution using current technology.
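
As an illustration of why careful copying matters, the following sketch performs one bit-level migration step with a fixity check, in the spirit of Rothenberg's warning that copying must not corrupt documents. The file paths are hypothetical, and format conversion, the harder half of migration, would be an additional step on top of this.

```python
import hashlib
import shutil

def migrate(src, dst):
    """Copy a file byte-for-byte to new media and verify the copy."""
    def sha256(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    before = sha256(src)
    shutil.copyfile(src, dst)       # move the bits to the new medium
    after = sha256(dst)
    if before != after:
        raise IOError("migration corrupted " + src)
    return after                    # retain the checksum as fixity metadata

# e.g. migrate("old_media/report.txt", "new_media/report.txt")
```

Even this simple refresh step must be repeated for every object in a collection each time the storage technology changes, which is where Moore's concern about migration time exceeding hardware lifetimes comes from.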

Emulation is defined by the PADI website as “the process of mimicking, in software, a piece of hardware or software so that other processes think the original equipment/function is still available in its original form.” The site cites the fact that the data themselves do not need to be changed as a major advantage of this method. Problems with emulation include cost, which might be prohibitive because of issues with intellectual property rights (Granger, “Emulation”). David Bearman also objects to emulation, suggesting that it “would not preserve electronic records as evidence even if it could be made to work and is serious overkill for most electronic documents where preserving evidence is not a requirement.” He argues that Rothenberg, a major proponent of emulation, concentrates too much on the functionality of computer systems and not enough on the actual content of records.
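
To illustrate the basic idea in miniature, the sketch below interprets programs for a wholly invented toy machine; real emulators must model actual processors, storage devices, and operating environments at far greater fidelity, which is where the cost Granger mentions comes from.

```python
def run(program):
    """Interpret toy instructions: ("PUSH", n), ("ADD",), ("PRINT",)."""
    stack, output = [], []
    for instr in program:
        op = instr[0]
        if op == "PUSH":
            stack.append(instr[1])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "PRINT":
            output.append(stack.pop())
    return output

# An "obsolete" program preserved as unaltered data, not rewritten:
print(run([("PUSH", 2), ("PUSH", 3), ("ADD",), ("PRINT",)]))  # [5]
```

The point of the exercise is that the old program itself is never modified; only the interpreter must be rewritten for each new generation of hardware.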

The PADI website identifies encapsulation as “a technique of grouping together a digital object and anything else necessary to provide access to that object.” The information in an encapsulation package should include “the representation information used to interpret the bits appropriately for access; the provenance to describe the source of the object; the context to describe how the object relates to other information outside the container; reference to provide one or more identifiers to uniquely identify the object; and fixity to provide evidence that the object has not been altered.” The presence of this information makes it much easier for later computer programs to interpret the data in their original form. Still, the technique is not foolproof, as it requires future computers to be able to interpret the data in the old package. For obvious reasons, capsules can only be based on present technology, not on whatever might be used in the future. As long as every new generation of information on standards and programs is added to the package, it can remain usable for a long time. While not as time-consuming as simply copying the documents, however, this process could still require a significant amount of extra work, as well as some extra cost.
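
The following sketch assembles a package containing the five kinds of information the PADI site lists; the field names and the use of JSON are illustrative assumptions, not a published packaging standard.

```python
import base64
import hashlib
import json

def encapsulate(data, object_id):
    """Bundle a digital object with the information needed to access it."""
    package = {
        "object": base64.b64encode(data).decode("ascii"),
        "representation": "ASCII text, LF line endings",   # how to read the bits
        "provenance": "scanned from print original, 1999", # source of the object
        "context": "part of series X; supersedes doc-0000",
        "reference": object_id,                             # unique identifier
        "fixity": hashlib.sha256(data).hexdigest(),         # tamper evidence
    }
    return json.dumps(package, indent=2)

print(encapsulate(b"Minutes of the May meeting.", "doc-0001"))
```

Note that the capsule only helps if a future system can still parse the container format itself, which is the residual weakness discussed above.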

Granger (“Emulation”) advises using a combination of these methods. For instance, if the storage media become obsolete but the document formats do not, migration would be the most sensible strategy, while emulation could be used when the software and operating systems are in danger of becoming obsolete. In this way, the advantages of each method could be utilized, while the disadvantages could be avoided.
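
Granger's combined approach can be summarized as a simple decision rule, sketched below with hypothetical condition names.

```python
def choose_strategy(media_obsolete, format_obsolete, software_obsolete):
    """Pick a preservation action based on what is actually at risk."""
    if media_obsolete and not format_obsolete:
        return "migration"   # refresh the bits onto new media
    if software_obsolete:
        return "emulation"   # re-create the environment instead of the files
    return "monitor"         # nothing is at risk yet

print(choose_strategy(True, False, False))  # migration
```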

Metadata are also an important aspect of digital preservation. As Titia van der Werf-Davelaar indicates, such data will be necessary to future researchers trying to convert previously stored information to more modern forms. The metadata should include information about the format of the document in question, the requirements for accessing the data, and how the information has changed from its earliest form. As the article states, “[t]he parts that need to be emulated need to be specified in detail (metadata) in a high-level language and the user needs to be educated to "use" the digital original -- as future generations will not know how to interact with obsolete IT-based end user environments.” Regardless of what conversion method is used, those making conversions will find metadata useful in determining exactly what needs to be done, how to do it, and what differences might exist between the new version and the original. Unfortunately, standards for metadata may be slow to emerge. According to Charles Thomas and Linda Griffin, neither businesses nor educational institutions want to implement metadata standards, because creating the metadata would be costly (with no immediate benefits visible to those supplying the money) and time-consuming. They suggest that a metadata standard will only come into existence when it becomes profitable. Metadata standards are nevertheless important for preservation. Stewart Granger (“Metadata”) writes, “We should recognise and accept that art historians, say, will have special and different requirements from, say, mathematicians, and vice versa. But what we should resist as far as possible is the situation where metadata can meet the needs for, say, resource discovery perfectly but does nothing for preservation or rights management - and vice versa.” While not every field would necessarily create and use metadata in the same manner, there should be standards for creation, discovery, and preservation.
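
As a rough illustration, the record below captures the three kinds of information just mentioned (format, access requirements, and a change history that grows with each migration); every field name is a hypothetical choice rather than an established standard.

```python
record = {
    "identifier": "doc-0001",
    "format": "WordPerfect 5.1",                  # original encoding
    "access_requirements": "DOS 3.3 and WP 5.1, or a format converter",
    "change_history": [],                         # one entry per migration
}

def log_migration(rec, new_format, note):
    """Append a migration event so later converters know what changed."""
    rec["change_history"].append({"to": new_format, "note": note})
    rec["format"] = new_format

log_migration(record, "RTF", "layout codes approximated; footnotes kept")
log_migration(record, "PDF/A", "fonts embedded; no content change")
```

A future researcher reading this record can see not only how to open the current file but also exactly how it differs from the original, which is what van der Werf-Davelaar's article asks preservation metadata to provide.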

Conclusion

What is the best solution to the problems of digital preservation? This is a difficult, if not impossible, question to answer, as every proposed solution seems to have as many disadvantages as benefits. The encapsulation method is a sound one, but it still relies on new systems being able to read data from older machines, which is not always possible given the current state of computing. A universal standard is therefore also necessary for digital libraries and other repositories of digital information. Repositories today may use any extant computer programs or storage methods to store their digital information. While making standards too strict might not be a good idea, there should be some guidelines as to which programs can be used. The file formats should all be compatible with one another, so that information can be transferred from one system to another with a minimal amount of trouble. Backward compatibility is also important: if a library installs a new computer program or system, it should be able to read or interpret files from the old programs and systems. Many software companies do not include such capabilities in their offerings, largely because they are in competition with one another and do not care whether their programs are compatible with those of rival developers. Digital libraries, however, should work toward a standard method of file storage, so that documents can be preserved for more than a few years at a time. As long as many different standards exist, techniques such as encapsulation and the use of metadata are not as useful as they could be. In the meantime, migration remains the only truly feasible solution, and it will likely retain some place in any solution, but using it in conjunction with other methods will help to overcome its shortcomings.

WORKS CITED

Bearman, David (April 1999). “Reality and Chimeras in the Preservation of Electronic Records.” D-Lib Magazine 5(4). Available: http://www.dlib.org/dlib/april99/bearman/04bearman.html

Commission on Preservation and Access, The (1 May 1996). “Preserving Digital Information.” Available: http://www.rlg.org/ArchTF/tfadi.index.htm

Granger, Stewart (October 2000). “Emulation as a Digital Preservation Strategy.” D-Lib Magazine 6(10). Available: http://www.dlib.org/dlib/october00/granger/10granger.html

Granger, Stewart. “Metadata and Digital Preservation: A Plea for Cross-Interest Collaboration.” Available: http://dspace.dial.pipex.com/stewartg/metpres.html

Hedstrom, Margaret. “Digital Preservation: A Time Bomb for Digital Libraries.” Available: http://www.uky.edu/~kiernan/DL/hedstrom.html

Moore, Reagan, et al. (March 2000). “Collection-Based Persistent Digital Archives—Part 1.” D-Lib Magazine 6(3). Available: http://www.dlib.org/dlib/march00/moore/03moore-pt1.html

“PADI—Preserving Access to Digital Information” (15 November 1999). Available: http://www.nla.gov.au/padi/

Rothenberg, Jeff (January 1998). “Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation.” Council on Library and Information Resources. Available: http://www.clir.org/pubs/reports/rothenberg/contents.html

Thomas, Charles F., and Linda S. Griffin (1998). “Who Will Create the Metadata for the Internet?” First Monday. Available: http://www.firstmonday.dk/issues/issue3_12/thomas/

Werf-Davelaar, Titia van der (September 1999). “Long-Term Preservation of Electronic Publications.” D-Lib Magazine 5(9). Available: http://www.dlib.org/dlib/september99/vanderwerf/09vanderwerf.html