Diamonds May Be Forever, But Data?

February 1, 2002

Applied Clinical Trials

Future retrieval of vital clinical data depends on having a strategy to deal with the rapid obsolescence of computer software and hardware.

Spring is just around the corner, and with its arrival comes an annual ritual for my household: spring cleaning. Although I am not generally a big fan of this activity, there is one aspect that I do enjoy. Every year I open up another few boxes from the last century: boxes from college, medical school, and other bygone days. Each one is a little time capsule of memories. Old newspapers, photographs, and other memorabilia recall events that are long since past. Each year I find that some of my memories are difficult to enjoy, because the medium on which the information was recorded is no longer readable. Unfortunately, the same issues that led to the loss of my memories can spell disaster for companies that are vitally dependent on their digital data.

It is worth taking a look at a few items that I pulled out of one of my boxes last year and considering the reasons that they are unreadable. Lots of things in the box are a bit faded, but otherwise as good as new. Black and white photographs from the early 1900s don't look all that much different from when they were new. Color photos from the 1960s may be faded, but the image is clear. The first big problem is an eight-track tape of the Mamas and the Papas. I don't think I am going to be listening to that anytime soon! I don't remember the last time I saw an eight-track tape player outside (or, for that matter, inside) a museum. Thank goodness somebody preserved the songs in a different format and had them transferred to CDs, where they will last forever. Or will they?

Next, I found an old computer chess program for my Apple IIe. Unfortunately, I threw out that computer a number of years ago during another annual spring cleaning. Even though the Apple IIe was sold until 1993, it seems hopelessly out-of-date. Perhaps I could just put the 5.25-inch floppy disk in my current computer. Not too many of those around the office these days.

Finally, I unearthed a videotape from years ago. Thankfully, it is VHS, not Betamax. I put it into the VCR and voilà: a lot of static and very, very poor images. It seems that New England conditions in my attic are not all that friendly to magnetic tapes.

By now, I am sure that most of you see the direct parallels between these stories and the data collected during pharmaceutical development. Drug development is becoming more and more data intensive as our ability to collect and record data increases. In the past, drug discovery often involved the screening of a finite number of compounds against a specific assay or disease model. Today, genomic analysis, modeling, and high-throughput screening of combinatorial libraries are generating huge amounts of data. Even clinical trials can now involve continuous monitoring through telemetry, potentially generating megabytes of information on a single subject. A small portion of this huge amount of data is extremely valuable today and will be in the future.

Data obsolescence
These real-world examples have clear parallels in the obsolescence of pharmaceutical data. First, think about those eight-tracks, Betamax tapes, and 5.25-inch floppy disks. Each is an example of a recording medium that has become obsolete. While it is theoretically possible to find someone with a player for one of these formats, you would have to look far and wide.

Data orphans. Media obsolescence can make orphans of data in the clinical world as well. Certainly, older reel-to-reel mainframe data tapes, 8-inch floppy disks, paper tapes, and punch card data would be hard to resurrect today in most settings. Extracting data from those formats would be costly and fraught with challenges. Yet, a significant amount of clinical trial data for currently approved drugs exists in just those formats.

Although these examples are somewhat obvious, more subtle examples abound. Only 10 years ago I backed up all of my academic data onto a tape-based portable hard drive for a Macintosh computer. Today, I don't own a Macintosh and don't know where to find the particular specialized hardware I would need to get access to that data.

Obsolescence also applies to the software and hardware necessary to read data files. Since I tossed my Apple IIe, I cannot run my chess program. More painful, however, is the loss of access to five years' worth of word processing files that I produced with the DOS-based Leading Edge Word Processor. Back in the early days of PCs, there were dozens of choices in word processors. In 1987, I bought an inexpensive Leading Edge PC that came with its own word processor. We all know what has happened with word processors: first WordPerfect seemed to command a lead in our industry, but now Microsoft Word has a dominant position in the market. MS Word can read some older files from other software, such as WordPerfect 5.0 or 6.0, but reading earlier versions of WordPerfect requires a special translation program that might be difficult to obtain. Even then, the results would be only fair, because much of the formatting is lost in translation.

It would probably be difficult or expensive to find translation software that would read original Leading Edge Word Processor files. As I remember, this word processor had its own proprietary file format, so that the words themselves aren't even retrievable from those files. Software obsolescence virtually destroyed my data.

A special challenge in imaging. Software and hardware obsolescence are special challenges in imaging, especially diagnostic imaging. In the early days of any technology, there are rarely any standards for recording data. With diagnostic imaging, this is a very significant issue. In order to be able to read and analyze data, it is essential to have hardware and software that can manage these images. Not too long ago I had a tour of a contract research organization that specializes in managing diagnostic images. They own just about every type of proprietary file reader that exists for those images; in at least one case, they have the last known working copy of the machine. The hardware companies that produced the data readers may be out of business, and the specifications for the readers may no longer exist. Once there is no longer a reader, there is no longer data.

Data longevity
Data can be lost even if it is written in a current format. Many of us assume that digital media are permanent. This is especially the case with CDs, which appear permanent and indestructible. Yet, those static-ridden analog VHS images are indicative of another issue with media: limitations on the longevity of data. Digital media are also susceptible to variations in temperature and humidity, oxidation, magnetic fields, and sunlight. Most of us have had the shocking experience of putting an important floppy disk into a computer and getting the message: "The floppy disk in the A: drive is not formatted. Would you like to format it now?" That doesn't mean that all the data have been wiped clean. If the data are text- or ASCII-based, a forensic computer scientist could recover most, if not all, of it. But environmental influences may have corrupted some of the data. If the damage was done to a program or to encrypted data, there is very little that can be done to recover the data.

To avoid this issue, most of us back up our data. If we store our backups in another location, under environmental control, we assume that we are safe, which is not necessarily true. For example, many individual computer users back up their data to Zip or Jaz drives. A reliable source has documented a very serious data-destroying fault with some early drives of this type, called the "Click of Death." (If you use these drives, see http://grc.com/codfaq1.htm.) Unfortunately, I have been a victim of this particular phenomenon. Corporate IT (information technology) departments typically do backups to digital tapes. It is not uncommon for some backup tapes or portions of backup tapes to fail when they are most needed: when data has been lost.
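
One way to find out that a backup has gone bad before the day it is needed is to record a checksum for every file when the backup is made and to recheck the copies on a schedule. What follows is only a minimal sketch of that idea in Python, assuming the backup is an ordinary directory tree of files. The function names (`sha256_of`, `build_manifest`, `verify_backup`) are illustrative and do not come from any product mentioned in this column.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_backup(manifest: dict[str, str], backup_root: Path) -> list[str]:
    """Return a list of problems; an empty list means every file matches."""
    problems = []
    for rel_path, expected in manifest.items():
        candidate = backup_root / rel_path
        if not candidate.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(candidate) != expected:
            problems.append(f"corrupted: {rel_path}")
    return problems
```

The manifest would be built from the original data at backup time and stored separately from the backup medium. A check like this detects missing or altered files; it cannot repair them, so it is a complement to keeping more than one copy, not a substitute.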

Most recently, recordable CDs (CD-R) have been extensively used for the backup of data. For the past several years there has been a raging debate in the computer industry about the longevity of CD-Rs. The very expensive "Gold" CDs supposedly have a usable life of over 100 years. Some argue, however, that low-end conventional CD-Rs may be reliable for only 5 to 10 years. Whether or not CD-Rs have a limited shelf life in times relevant to clinical development, it is clear that they can be easily damaged by heat and mishandling. A recent report suggests that CDs stored in humid environments can actually be eaten and destroyed by a Geotrichum fungus (www.nature.com/nsu/010628/010628-11.html). We may not notice loss of data in damaged music CDs, because the players are designed to fill in for lost data, but loss of information in clinical data CDs could be a serious problem.

Data migration. One strategy for retaining data quality is to migrate the data from an older format to a newer format. Although this strategy can keep the data fresh, it needs to be done very carefully, with a validated migration path. Without this, there is a significant risk that data will be lost as part of the transfer, or changed in subtle and possibly undetectable ways. Formatting, metadata, and footnotes to data are often lost or changed in the transfer process. Even worse, the actual data itself may be changed. In a 1998 Business Week article, the FDA reported that some data that had been migrated from one operating system (OS) to another was randomly off by up to eight digits. Once the migration has occurred, any changes to the data are permanent because the original is usually destroyed.
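
Part of a validated migration path is a check that what came out matches what went in, made before the original is destroyed. The sketch below shows one simple form of such a check in Python. It assumes both the source and the migrated data can be exported as lists of records with a unique key field; the function name `compare_migration` and the field names in the example are hypothetical.

```python
def compare_migration(source, migrated, key):
    """Compare two lists of record dicts and return a list of discrepancies.

    Each discrepancy is a (record_id, description) tuple. An empty list
    means every source record and field survived with the same value.
    """
    source_by_key = {row[key]: row for row in source}
    migrated_by_key = {row[key]: row for row in migrated}
    issues = []
    for record_id, original in source_by_key.items():
        copy = migrated_by_key.get(record_id)
        if copy is None:
            issues.append((record_id, "record missing after migration"))
            continue
        for field, value in original.items():
            if field not in copy:
                issues.append((record_id, f"field {field!r} dropped"))
            elif copy[field] != value:
                issues.append(
                    (record_id, f"field {field!r} changed: {value!r} -> {copy[field]!r}")
                )
    # Records that appear only in the migrated data are also suspicious.
    for record_id in sorted(migrated_by_key.keys() - source_by_key.keys()):
        issues.append((record_id, "unexpected record in migrated data"))
    return issues


# Hypothetical example: one value altered, one footnote-like field dropped,
# and one record lost in transfer.
source = [
    {"subject": "001", "sbp": "120", "note": "fasting"},
    {"subject": "002", "sbp": "135", "note": ""},
]
migrated = [{"subject": "001", "sbp": "128"}]
for record_id, problem in compare_migration(source, migrated, key="subject"):
    print(record_id, problem)
```

A field-by-field comparison like this catches dropped records, dropped fields, and changed values, but only for whatever the export exposes. Formatting and metadata that never make it into the exported records would need their own checks.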

Strategy
Clearly, our data is at risk. So, what strategy do we follow? The most important first step is to have a strategy at all; many companies don't think about these issues. Develop a strategy, revisit it often, and make sure that it is implemented routinely. The first consideration should be the file format for data storage. All other things being equal, it is best to choose file formats, media writers and readers, and hardware that are widely adopted and appear to be heading toward being a data standard. If that is not possible, try to go with the leading edge of the mainstream. Don't choose an older, nonstandard technology; the file format may be on the road to obscurity. Similarly, don't choose a hot, cutting-edge technology from a company that may not exist a year later. If possible, keep to a minimum the number of different formats in which data is stored.
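
A practical first step toward limiting the number of formats is simply knowing which ones are already in the archive. The short Python sketch below counts files by extension under a directory. It is an illustration only: the function name `format_inventory` is made up for this example, and a file extension is merely a hint about the real format, so the result is a starting point for review and not a definitive catalog.

```python
from collections import Counter
from pathlib import Path


def format_inventory(root: Path) -> Counter:
    """Count files under root by lowercase extension.

    Files with no extension are grouped under "(no extension)".
    """
    return Counter(
        p.suffix.lower() or "(no extension)"
        for p in root.rglob("*")
        if p.is_file()
    )
```

Calling `format_inventory(Path("archive")).most_common()` on a hypothetical `archive` directory would list the extensions from most to least frequent. The rare ones at the bottom of that list are often the first candidates for migration to a mainstream format.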
