Digital Page Imaging and SGML
An Introduction to the Electronic Binding DTD (Ebind)
November 1, 2002 Update
The Ebind DTD was developed in 1996 here at the UC Berkeley Library to support a new preservation workflow which replaced traditional photocopying with digital imaging using Xerox's then-new Docutech digital photocopier. Photocopies of brittle books would replace the original book on the shelf and the digital files generated by the Docutech would be archived to tape. An Ebind-encoded SGML file would be stored with the digital images, acting as a directory to the files and preserving the most important bibliographic and structural metadata. Ebind was designed with simplicity in mind and had to be generated very cheaply by student workers processing a large volume of brittle books. Designed primarily to support the relatively uniform bitonal, Group 4 compressed TIFF output generated by the Docutech, Ebind supported none of the important technical metadata which today is considered critical for effective long-term preservation of digital files. Ebind2Html, a companion proof-of-concept cgi script, was designed to demonstrate that an Ebind-encoded document could serve not only as a simple directory to the images but also as a navigation tool for online users.
The Ebind DTD never made it beyond the proof-of-concept stage and it was never integrated into the production workflow here at Berkeley. It remained as something of a curiosity and point of discussion for other institutions considering the issues which today we know as digital preservation. Many institutions have found Ebind more useful for its primitive companion page-flipping script, Ebind2Html, than for the DTD itself.
In 1998 the Library embarked on the development of a new digital preservation XML-based format called MOA2 (for "Making of America 2", the name of the NEH-funded grant for which it was developed). MOA2 was designed to support all of the administrative, technical and structural metadata now known to be critical for effective and reliable long-term preservation of digital files. More information can be found on the MOA2 website.
In late 2001 MOA2 was taken to the national level and became METS, the Metadata Encoding & Transmission Standard. METS has been adopted by the Digital Library Federation and is overseen by representatives from several institutions; its development is currently nearing completion. METS is expected to be widely adopted, accompanied by a multitude of tools institutions can use to create and view METS-encoded documents. More information about METS can be found at the official METS website.
Ebind has passed into the great beyond. This page remains here for historical purposes only. You can find tools for creating and viewing METS documents at the official METS website, including the UC Berkeley Library's "official" METS viewer. I am also creating my own little perl cgi viewer, called Flip-o-Matic, which I will release sometime in January 2003 (you can take a sneak peek at it here). When the METS standard stabilizes a bit more I will make available tools for converting Ebind-encoded documents into METS as well as my own little toolset for creating METS files. Stay tuned!
View some Ebind-encoded documents
The Electronic Binding Project, or Ebind, is a method for binding together digital page images using an SGML document type definition (DTD) developed at UC Berkeley in 1996 by Alvin Pollock and Daniel Pitti. The Ebind SGML file records the bibliographic information associated with the document in an ebindheader, the structural hierarchy of the document (e.g., parts, chapters, sections), its native pagination, and textual transcriptions of the pages themselves, as well as optional meta-information such as controlled access points (subjects, personal, corporate, and geographic names) and abstracts, which can be provided all the way down to the level of the individual page.
This SGML file acts primarily as a non-proprietary, international standards-based (ISO 8879) control file for the multiple image files which make up a digitized book or document. But it can also serve as the basis for browsing the images in any SGML-aware software system in a natural and convenient way. One such system, a cgi program written in perl, provides a simple, easy-to-use web-based interface that remote users can connect to using a web browser such as Netscape or MS Internet Explorer. This cgi-script is freely available and may be downloaded from this site.
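As a rough illustration of the browsing idea, and not the Ebind cgi script itself, the Python sketch below treats the ordered list of page images recorded in a control file as the only input a page-turner needs. The file names and the page_links function are hypothetical.

```python
# Hypothetical ordered list of page images, as a control file might record them.
IMAGES = ["00000001.tif", "00000002.tif", "00000003.tif"]

def page_links(images, index):
    """Return the current image plus its previous/next neighbours."""
    if not 0 <= index < len(images):
        raise IndexError("page index out of range")
    return {
        "current": images[index],
        # The first page has no predecessor and the last has no successor.
        "previous": images[index - 1] if index > 0 else None,
        "next": images[index + 1] if index + 1 < len(images) else None,
    }

links = page_links(IMAGES, 1)
```

A web interface would only need to render these three values as an image and two links for each request.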
DIGITAL IMAGING PROJECTS AT UC BERKELEY USING EBIND
The Digital Photocopy Project
In 1993 the University of California Printing Dept. acquired a Xerox XDOD (Xerox Document on Demand) system consisting of a Docutech printer and a high quality Xerox scanner. Brittle books which previously had been copied using a traditional photocopier could then be copied using this state-of-the-art digital photocopier. The paper output was bound and placed back on the library shelf, and the digital image files could also be archived, both for direct patron access and for replacing damaged pages or even making more copies of the entire book at some future date. Ebind was chosen as the archival control file format because it was based on an international standard, SGML, and would migrate easily to any future standards if and when they were developed. It was demonstrated that in addition to acting as a control file, the very same SGML document could act as the basis for an on-line document navigation system. That system was the Ebind cgi script written in perl.
American Heritage Virtual Archive Project
In 1996 the American Heritage Virtual Archive Project was begun as a collaboration between UC Berkeley, Stanford, the University of Virginia, and Duke University, to encode the finding aids to their archives and repositories in SGML using the Encoded Archival Description (EAD) DTD. In a later phase of the project, selected manuscripts and other primary source material would be digitized and made available on the World Wide Web. These digitized images would be bound together using Ebind and linked to the electronic finding aid.
THE EBIND DTD
The structure of the Ebind DTD is based loosely on the Core tag set of the Text Encoding Initiative (TEI) DTDs. Like TEI, Ebind is divided into a bibliographic header, front matter, a body, and back matter. The front, body and back elements can themselves be divided into generic textual divisions called divs. A type attribute on the div element may specify the type of division more precisely, e.g., type="chapter".
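The four-part layout described above can be pictured with a small Python sketch. It is only an approximation: it builds an XML tree with the standard library rather than true SGML, the root element name "ebind" is an assumption, and the division types other than "chapter" are invented for illustration. The DTD itself remains the authority on element names and content models.

```python
import xml.etree.ElementTree as ET

def build_skeleton():
    """Build an empty header/front/body/back skeleton with typed divs."""
    # Root element name "ebind" is an assumption for illustration.
    root = ET.Element("ebind")
    ET.SubElement(root, "ebindheader")       # bibliographic header
    front = ET.SubElement(root, "front")     # front matter
    body = ET.SubElement(root, "body")       # main text
    back = ET.SubElement(root, "back")       # back matter
    # Generic divisions, made specific by the type attribute.
    ET.SubElement(front, "div", {"type": "preface"})   # hypothetical type value
    ET.SubElement(body, "div", {"type": "chapter"})
    ET.SubElement(body, "div", {"type": "chapter"})
    ET.SubElement(back, "div", {"type": "index"})      # hypothetical type value
    return root

skeleton = build_skeleton()
print(ET.tostring(skeleton, encoding="unicode"))
```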
View the Ebind DTD
View SGML tagging of some sample Ebind-encoded documents.
Two fundamental concepts separate Ebind from TEI. First, Ebind privileges the physical structure of a document while TEI privileges the intellectual structure. In Ebind, the atomic unit is the page while in TEI it can be down to the individual character. In TEI there is no element which can contain a page. The reason for this is that two distinct structural hierarchies cannot exist within the same document, at least not in current implementations of SGML. If a chapter ends in the middle of a page and a new chapter begins on that same page, one cannot explicitly describe both the hierarchy of the page and the hierarchy of the chapter. TEI favors the chapter by enclosing it within a div tag and describes the hierarchy of the page implicitly through the use of the pb (page break) empty tag, one of TEI's so-called "milestone" elements. (See TEI Guidelines section 6.9.3.) In Ebind, all pages are enclosed within a <page> element. This allows one to gather together a variety of information associated with individual pages, such as textual transcription ("raw" OCR or keyed), page abstracts, even controlled access points for individual pages if desired.
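The difference between the two approaches can be made concrete with a short Python sketch. The text fragments are invented, and the output is simplified markup rather than valid TEI or Ebind; in particular, how Ebind relates a shared page to its surrounding divs is not shown here. The point is only that one view makes the chapter the container and marks pages with empty milestones, while the other makes the page the container.

```python
from itertools import groupby

# Hypothetical fragments of running text: (chapter, page, text).
# Page 2 holds both the end of chapter 1 and the start of chapter 2.
FRAGMENTS = [
    (1, 1, "Chapter one begins."),
    (1, 2, "Chapter one ends."),
    (2, 2, "Chapter two begins."),
    (2, 3, "Chapter two continues."),
]

def chapter_first(fragments):
    """TEI-like view: chapters are containers, page breaks are empty milestones."""
    out = []
    last_page = None
    for _, group in groupby(fragments, key=lambda f: f[0]):
        out.append('<div type="chapter">')
        for _, page, text in group:
            if page != last_page:
                out.append(f'<pb n="{page}">')  # marks a point, encloses nothing
                last_page = page
            out.append(text)
        out.append("</div>")
    return out

def page_first(fragments):
    """Ebind-like view: every page is a container for whatever falls on it."""
    out = []
    by_page = sorted(fragments, key=lambda f: f[1])  # stable sort keeps text order
    for _, group in groupby(by_page, key=lambda f: f[1]):
        out.append("<page>")
        out.extend(text for _, _, text in group)
        out.append("</page>")
    return out

tei_like = chapter_first(FRAGMENTS)
ebind_like = page_first(FRAGMENTS)
```

In the chapter-first output, page 2 is never enclosed by anything: its content is split across two divs and only its starting point is marked. In the page-first output, the same page is a single unit to which a transcription or abstract could be attached.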
The second fundamental difference between TEI and Ebind is that Ebind is simpler to use. It was recognized early on that Ebind would be used in a high-volume production environment and would be applied to a wide variety of documents. The same DTD can be used to encode books, manuscripts, diaries, newspapers, or magazines. For this reason, many of the requirements imposed by TEI were "loosened up" in Ebind. The DTD is far less restrictive. Page elements can occur just about anywhere, for example. They may occur between divs and in fact needn't be enclosed in divs of any kind. This greatly simplifies the task of automated markup.
The UC Berkeley Digital Photocopy Project is a good example of how Ebind may be applied in a high-volume production environment. The Ebind SGML documents are encoded programmatically from simple, single page worksheets prepared by library staff and completed by the scanner operator.
View some sample worksheets.
Like the Ebind cgi script, the perl script which generates the SGML file from the worksheet is freely available from this site.
Download perl conversion script (ebind.pl)
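For readers without perl at hand, the general idea of generating markup from a worksheet can be sketched in Python. This is not a port of ebind.pl: the worksheet fields, the root element name "ebind", the placement of the title directly inside ebindheader, and the "image" attribute on page are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical worksheet: field names are illustrative, not the project's form.
WORKSHEET = {
    "title": "Report of the Surveyor & Engineer",
    "divisions": [
        {"type": "chapter", "pages": ["00000001.tif", "00000002.tif"]},
        {"type": "appendix", "pages": ["00000003.tif"]},
    ],
}

def worksheet_to_markup(sheet):
    """Turn a worksheet dictionary into simplified Ebind-like markup."""
    lines = [
        "<ebind>",  # assumed root element name
        "<ebindheader>" + escape(sheet["title"]) + "</ebindheader>",
        "<body>",
    ]
    for division in sheet["divisions"]:
        lines.append("<div type=%s>" % quoteattr(division["type"]))
        for image in division["pages"]:
            # The "image" attribute is assumed; consult the DTD for real names.
            lines.append("<page image=%s></page>" % quoteattr(image))
        lines.append("</div>")
    lines += ["</body>", "</ebind>"]
    return "\n".join(lines)

markup = worksheet_to_markup(WORKSHEET)
print(markup)
```

Escaping the title and quoting attribute values keeps characters such as "&" from producing malformed markup, which matters when the input is keyed by hand.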
Copyright © 1996 UC Regents. All rights reserved.
Document maintained at http://sunsite.berkeley.edu/Preservation/ by the SunSITE Manager.
Last update 10/04/02. SunSITE Manager: firstname.lastname@example.org