Once the source and the overall scope of the archive have been defined, it has to be decided which types of material are to be included in the archive. This decision has major technical implications that demand strategic consideration. For the chosen data formats, it has to be determined how they will be acquired and stored, and how they will be preserved and made accessible.
When a certain type of data is included in the archive, the infrastructure has to be such that objects of this type can be handled in all stages of their life-cycle. Therefore, the system environment has to provide the necessary means for entering digital material into the archive, managing and preserving it as a collection item, and providing access to it [BG98].
In principle, all kinds of data can be included in the archive, each demanding specific treatment. The repository could comprise the internal documents of a company, e.g., primarily word-processing formats, or the digitised pictures of an art museum; many other collections are conceivable. Newsgroups, mailing lists, and bulletin boards represent more openly accessible sources. The Internet probably has the broadest range of differing data types. Besides HTML pages, all kinds of multimedia formats such as music files or videos can be found. The handling of such a broad range of data formats is further complicated by their fast-changing nature: new types emerge and vanish in quick succession.
As long as static types are concerned, this has no major implications for the acquisition of the data. Such data objects can be acquired and managed in the archive environment without knowing their type. Only at a later stage, when a user of the archive wants to display a data object, must a means of interpreting its data type be provided. Yet, dynamic document types are on the rise. These types are needed to implement interactivity, one of the revolutionary features of the Internet. Whenever a user submits a query to an information service, a dynamically generated page is returned that holds information extracted from a database. Interactive and dynamically generated web pages are thus the interface to a hidden database. Automatic means to capture this interactivity do not exist for the time being. Even if the intention is not to extract the whole underlying database, but only to trace a typical dialogue between a user and the service, developing appropriate automatic means turns out to be a complex task. Experiments on how this can be tackled have been carried out in the course of this thesis and are introduced in Chapter 5.3. Similar problems arise with interactive sites such as on-line games, forms of art, and others.
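The observation that static objects can be acquired and managed without knowledge of their type, and interpreted only on access, can be illustrated with a minimal sketch. It is not the system described in this thesis: the class `OpaqueArchive` and its methods `ingest`, `verify`, and `access` are hypothetical names chosen for illustration, the storage is a plain in-memory dictionary, and the type is guessed from the file name extension alone, which is an assumption that a real archive would likely replace with more robust format identification.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field


@dataclass
class OpaqueArchive:
    """Hypothetical in-memory archive that stores objects as opaque bytes."""

    _objects: dict = field(default_factory=dict)  # digest -> raw bytes
    _names: dict = field(default_factory=dict)    # digest -> original name

    def ingest(self, name: str, payload: bytes) -> str:
        # Acquisition: only the raw bytes are needed, not the data type.
        digest = hashlib.sha256(payload).hexdigest()
        self._objects[digest] = payload
        self._names[digest] = name
        return digest

    def verify(self, digest: str) -> bool:
        # Management: a fixity check that is also independent of the type.
        return hashlib.sha256(self._objects[digest]).hexdigest() == digest

    def access(self, digest: str) -> tuple:
        # Access: only now is the type guessed so that a viewer can be chosen.
        media_type, _ = mimetypes.guess_type(self._names[digest])
        return media_type or "application/octet-stream", self._objects[digest]


archive = OpaqueArchive()
page_id = archive.ingest("index.html", b"<html></html>")
media_type, data = archive.access(page_id)
```

Nothing in `ingest` or `verify` depends on the format of the payload, which is why fast-changing static formats do not affect acquisition. A dynamically generated page does not fit this model, since the bytes returned depend on the query posed to the service.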
To sum up, an array of data types can be considered for inclusion in the archive: