

Finding the source, determining the scope

A primary decision concerns the source of the material the archive is to be composed of, and the scope to be drawn on that source. Given only the purpose of the initiative, this is not necessarily clear from the beginning. Yet it ultimately controls the content of the collection items and thereby the services that can be offered. The decision therefore has to follow from the concern that originated the initiative and from the needs of the clients it is meant to serve.

An archive could be dedicated to a specific topic or a certain person, such as a collection on the works of an important poet or philosopher (see footnote 1). A repository may also be a means to an end that evolves and grows as an auxiliary facility. This includes companies that take care of the professional storage and retention of their data, which represent significant information and know-how. Such archives serve a function with a clearly limited scope. Since these applications are very specific, the sole source is conceivably closed-access material to which only designated people have access, such as data available purely on a company's intranet.

Another conceivable intention behind the formation of an archive is to store the digital presence of a nation for the future. Taking national libraries as a model, their purpose is to collect digital artefacts that concern a country or were created by its inhabitants. When embarking on a national strategy, a primary source of data is the Internet. However, no institution can hope to collect all digital content, because the volume is staggering [Ino01]. In order to delimit the scope, the national web-space has to be defined, which is done along three lines of argument. First, all sites within the country's national domain (.at in the case of Austria) obviously belong to it. Second, many servers located in a country are registered under a foreign domain, most notably under domains such as .com, .edu, or .org, but also under "foreign" national domains such as .cc or .tv. While the addresses in these domains have no association with, e.g., Austria, the sites themselves might still be physically located in Austria and be operated by Austrian organisations (e.g. www.austria.com). They are thus most probably worth including in a national archive of Austria. Last, but not least, web-sites dealing with topics of interest, such as the foreign web-sites of expatriate communities or other sites dedicated to reports on Austria (so-called "Austriaca"), should possibly be collected even if they are physically located in another country.
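The first of these criteria is the only one that can be read directly off an address. The following minimal sketch illustrates this; the function name in_national_domain and the host www.example.at are hypothetical and serve only as illustration, while www.austria.com is the example given above.

```python
def in_national_domain(hostname: str, national_tld: str = "at") -> bool:
    """Return True if the hostname's last label equals the country-code TLD."""
    # Normalise case and drop a trailing dot (fully qualified form).
    labels = hostname.lower().rstrip(".").split(".")
    # A bare label such as "at" is not a host inside the domain.
    return len(labels) > 1 and labels[-1] == national_tld.lower().lstrip(".")


if __name__ == "__main__":
    # www.example.at is a hypothetical host used for illustration only.
    for host in ("www.example.at", "www.austria.com"):
        print(host, in_national_domain(host))
```

The test rejects www.austria.com, which is exactly the kind of site the second criterion is meant to capture; the domain check alone therefore under-selects.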

Given the masses of web-servers around the world, identifying suitable sites automatically would be an important asset. Sites within a country's national domain are obviously easy to recognise. Automatically selecting servers that are registered under a foreign domain yet located in the country is more difficult. Theoretically, it should be possible to identify these servers on the basis of their IP addresses, yet there is no straightforward solution at this point in time. Recognising sites of interest whose servers are located abroad, however, is not possible at all without human arbitration. Perhaps tools based on heuristics can be developed to facilitate this task. Nevertheless, the final decision whether or not a site is of "interest" will have to be taken by a human. This, in turn, demands a heavy input of personnel and restricts the scope of the collection significantly.
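The division of labour described above could be sketched roughly as follows. This is an illustration under strong assumptions, not a working selection tool: it presumes that a list of IP prefixes known to lie within the country is available (which, as noted, is not straightforward to obtain), and it uses a naive keyword match as a stand-in for a heuristic. The names Decision, NATIONAL_PREFIXES, KEYWORDS and triage, as well as all hosts and addresses, are hypothetical.

```python
import ipaddress
from enum import Enum


class Decision(Enum):
    INCLUDE = "include"            # can be decided automatically
    HUMAN_REVIEW = "human review"  # heuristic hit, a person must decide
    SKIP = "skip"                  # no indication of relevance


# Hypothetical prefix list; 192.0.2.0/24 is a documentation-only range.
NATIONAL_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]
# Hypothetical keywords standing in for a real heuristic.
KEYWORDS = ("austria", "österreich", "vienna", "wien")


def triage(hostname: str, ip: str, page_text: str, national_tld: str = "at") -> Decision:
    # 1. National domain: trivially recognised from the address.
    if hostname.lower().rstrip(".").endswith("." + national_tld):
        return Decision.INCLUDE
    # 2. Foreign domain, but server address inside an assumed national prefix.
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in NATIONAL_PREFIXES):
        return Decision.INCLUDE
    # 3. Possible site of interest abroad: only flag it, never accept it.
    text = page_text.lower()
    if any(word in text for word in KEYWORDS):
        return Decision.HUMAN_REVIEW
    return Decision.SKIP
```

Note that the third step never returns INCLUDE: in line with the argument above, the heuristic only narrows down what a human has to look at.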

Besides the World Wide Web, other Internet sources offering inherently different kinds of services can also be considered. These include mailing lists, newsgroups, Gopher, and FTP archives. Highly dynamic and interactive applications, such as on-line games, have also emerged (cf. Section 4.2). Each of these demands wholly different capture methods.

The significance of finding a suitable source and delimiting its scope must not be underestimated. This decision carries substantial structural implications, not only of a technical nature but also for the management of the organisation. If, for example, ingest is designed such that documents are deposited by specific authors who were instructed on the procedures beforehand, it is hard to change to a policy of collecting the material from the World Wide Web. Methods of acquiring the data once the source and the scope have been defined are discussed in Section 2.2.

In a nutshell, numerous sources are available for archiving. Besides closed-access material for specific projects, the Internet offers a rich source of freely available data. The decision has to be taken carefully, since changing to another policy later might require major restructuring.



Footnotes

1. Wittgenstein's Nachlass, the Bergen Electronic Edition, is an example of such an application.

Andreas Aschenbrenner