In: O’Donnell, James J. Avatars of the Word: From Papyrus to Cyberspace. Cambridge, MA: Harvard UP, 1998. 44-49.

Hyperlink

THE INSTABILITY OF THE TEXT

The culture of print has inculcated the expectation that a given author's words may be frozen once and for all into a fixed and lasting pattern, one that readers can depend on finding whenever they find a copy of a particular book. Broadly speaking, that expectation is valid, but even in the world of print there are reservations to be had. It is surprising what variations can occur between one printed edition of the same book and another, and if the work is a classic, often printed by different houses over a long period of time, a burgeoning of variant readings can arise comparable to those in works copied in manuscript. How much effort it is worth to identify and delimit such abundance is debatable. A generation ago there was such a debate between the Modern Language Association, on the one hand, approving a series of new editions of standard texts brought to a high level of editorial accuracy, and the critic Edmund Wilson on the other hand, who found in the project a superabundance of pedantic detail and little benefit.

But no one would disagree that before the relative stability of printing, texts were often disconcertingly labile and unreliable. No modern edition of a classic widely copied in the middle ages has been possible without an attempt to construct the family tree of resemblances and disresemblances among the manuscripts that survive, and then to find a path back to the remotest ancestor we can reconstruct. There lies one caution for readers of ancient books. To see a cleanly edited, intelligently punctuated text of Cicero is to get a very inaccurate impression of how these words would have appeared in antiquity. Our standards of the orderly page, marked by consistent visual punctuation and clear wordbreaks, simply did not obtain. By comparison to the neatly weeded and pruned gardens of words in which we take our literary pleasures, the ancients made do in a wilderness of irregular scratches on a page, and made do quite well.

A linear progression from chaos to order cannot be imposed on the history of the presentation of the word, though, for we have now returned to a time of instability marked by debate over means of presentation. To enter a text in a computer means to make choices. It is possible to make the simplest possible set of choices and to allow the text to take the form of a series of Roman alphabet characters, upper and lower case, delimited by carriage returns, tabs, and a handful of standard punctuation marks. For many purposes such a text may suffice. But it must be borne in mind that this is already a highly encoded form of representation, depending on a recognized convention that joins together what the computer reads as eight separate binary digits (1s or 0s) into a unit (or byte) and allows 128 such combinations (half of the 256 that eight digits could distinguish) to stand for numbers, letters, and punctuation marks. Texts represented with these 128 characters are widely transportable from one computer system to another. If we knew how to be content with simple texts, this might be the basis of a universal orderly representation of texts.
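A small sketch may make the convention concrete. The few lines below, in the Python programming language (the sample sentence is arbitrary, and any would do), reduce an ordinary line of text to the numbers a computer actually stores, each drawn from the 128 permitted combinations:

    # A line of simple text stored under the ASCII convention:
    # one byte per character, with only 128 values assigned to
    # letters, digits, and punctuation marks.
    text = "Call me Ishmael."
    encoded = text.encode("ascii")           # one byte per character
    print(list(encoded))                     # 67, 97, 108, 108, ...
    print(all(b < 128 for b in encoded))     # True: every value falls among the 128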

But already 128 characters are insufficient to render a text in, say, French or German, where characters require accents and umlauts. A first attempt to render this environment more flexible, introduced by IBM at the time of the first personal computers, doubled the character set to 256, including a fairly wide set of marked vowels for some European languages (but notably omitting some Scandinavian requirements) and a few Greek letters of use to scientists. (The added characters are so variably and eccentrically chosen that the story is told, perhaps an urban legend, of their being selected by two IBM employees on an overnight flight to Europe with a deadline for a morning meeting. If that's not how they were chosen, it may as well have been.) Virtually every computer in use has embedded in itself awareness of the 128-character set of signs known as ASCII (for American Standard Code for Information Interchange), along with some 256-character extension of it.
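The trouble such enlarged sets bring is that a stored byte does not announce which table it was written under. Another brief Python sketch (the byte chosen is arbitrary) shows a single value yielding a Greek letter under IBM's original 256-character table and a French accented vowel under the later Latin-1 table:

    # One byte, two readings, depending on which 256-character
    # convention the reader's machine assumes.
    mystery_byte = bytes([0xE9])
    print(mystery_byte.decode("cp437"))      # Greek capital theta
    print(mystery_byte.decode("latin-1"))    # e with an acute accent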

But this way madness lies. A new universal alphabet is in the making, to replace the 256 characters now known to computers with a set running to about 35,000, embracing every distinct symbol in every writing system known to humankind. The Unicode project creating this alphabet has been a distinct intellectual adventure in its own right: When do two symbols merely provide variant ways of writing the same sign? When do two symbols used in different languages and looking superficially alike really differ and when may they be counted as a single symbol? Once coded into every computer, Unicode would allow a far more flexible representation of all the writing systems of the world. But there are already too many computers in use in the world to promise rapid replacement of what we have.
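The flavor of those questions can be suggested with the Unicode tables now built into Python (a sketch only, not a tour of the standard). The accented letter é may be written as one coded character or as a plain e followed by a combining accent, and the standard must decide whether the two are the same sign; meanwhile three capital letters that look identical on the page, Latin, Greek, and Cyrillic, are counted as three distinct symbols:

    import unicodedata

    # One sign, two spellings: a precomposed é versus e plus a
    # combining acute accent.
    single = "\u00e9"
    composed = "e\u0301"
    print(single == composed)                                # False: different codes
    print(unicodedata.normalize("NFC", composed) == single)  # True: the same sign once normalized

    # Three look-alike letters that Unicode keeps distinct.
    for letter in ["A", "\u0391", "\u0410"]:                 # Latin, Greek, Cyrillic
        print(hex(ord(letter)), unicodedata.name(letter))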

Letters in a sequence are only part of a text. The struggles of users of early word processing software to get their expensive new machines to do what they had been able to do with typewriters were a reminder of the importance of layout and typography in the communication of information. Typesetters of the last century, before heavy mechanization took over, estimated that they spent only a third to a half of their time putting characters in sequence, and the rest of their time arranging the white space on the page that surrounded the characters. We now have a myriad of ways to introduce formatting information into our computerized texts, and the results are beautiful to see on screen or on paper. But there is a cost in the loss of standardization. An abundance of word processing formats has generated another abundance of would-be standard formats. Recognition of these formats depends on users' choices of hardware and software. If, for example, I need to get tax forms from the U.S. Treasury, I can find them on the World Wide Web and print them at home in minutes. But I must have previously acquired one of (at last count) four different ways to manage text (PDF, PCL, PostScript, or SGML) in order to get those forms at all. Each of those names represents a very different conception of how text may be arranged in a computer.

Roughly, such schemes divide between those that seek to describe the visual arrangement of a text that might be printed and those that seek to describe the structure of the information that goes into a document. PDF, for example, stands for Portable Document Format, and depends on software created and distributed by Adobe. It allows readers with the right software to receive files easily and display them with all the typographic features of print on their screens. But such text is hard to edit and search because the graphic page representation inserts coding that gets in the way of the original string of characters. At the opposite extreme, a text in SGML (Standard Generalized Markup Language) is presented only in the ubiquitously available ASCII characters and is described in a way that any computer anywhere may read. But an SGML text per se is still a mass of text interspersed with codes that require special software to be read intelligibly. (The HTML on which the World Wide Web relies is a simpler markup language derived from SGML.)
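One further Python sketch may make the contrast concrete (the marked-up sentence is invented for the purpose). The text is ordinary ASCII characters throughout, but it takes a piece of software that understands the codes to read it plainly:

    from html.parser import HTMLParser

    # A passage marked up in HTML, the Web's simplification of SGML:
    # plain characters throughout, interspersed with structural codes.
    marked_up = "<p>Texts were often <em>disconcertingly</em> labile.</p>"

    # Reading it intelligibly requires software that knows the codes.
    class TextOnly(HTMLParser):
        def __init__(self):
            super().__init__()
            self.pieces = []
        def handle_data(self, data):
            self.pieces.append(data)

    parser = TextOnly()
    parser.feed(marked_up)
    print("".join(parser.pieces))    # Texts were often disconcertingly labile.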

The result of this disagreement about ways of representing electronic texts is that any given collection of texts in a computer will be a mix of usually incompatible kinds of codes requiring different software and often different hardware to interpret. Some textual communities will use one or another coding system fairly consistently (scientists are fond of a language called TeX, for example, but few humanists or commercial users have ever seen it), but we are very far from any possibility of mutual interchangeability.

The obvious inconvenience of this state of affairs—the difficulty of managing a "library" of such texts—masks a more troubling problem. Computers, and their software, change rapidly. Bodies of information created for a computer are marked by the kind of coding possible and necessary at the time of their creation. As the environment changes, it is usually necessary to make at least some small changes in the text to keep it readable. If the format is particularly idiosyncratic (say, if the text has been coded by a CD-ROM manufacturer determined to present text in a proprietary way) or simply if several generations of transformations of software and hardware intervene, a text can become impossible to read. (I still own a box of 5 1/4" diskettes and could probably find a place to read them, but how much longer will that be true? CD-ROMs may last as long as a few decades.) NASA has found this problem particularly troubling: reels and reels of tape bearing computer data from the 1960s are now, at best, a series of 1s and 0s, while the hardware and software that created them have long since been rendered obsolete and destroyed. It takes a "data archaeologist" to attempt to decipher what has been lost.

Give us another generation of proliferation and surely vast quantities of information will slip away from us this way. We will no longer be able to depend on survival of information through benign neglect. There are medieval manuscript books that may have lain unread for hundreds of years, but offered their treasures to the first reader who found and tried them. An electronic text subjected to the same degree of neglect is unlikely to survive five years.

None of this is good news for librarians. Whereas the codex book in print form has been a remarkably standardized and stable medium, subject mainly to the depredations of material aging (crumbling paper and breaking binding), the new flood of electronic texts brings with it an exponential increase in the difficulties of making information available to users and preserving it over time. Reader demand will be only moderately helpful in determining how society's institutions act, for as readers we will want both the newest and the best in everything, and permanence and reliability as well. We can't have both.