To gain insight into the material retrieved during a snapshot, we implemented a module that compiles statistics. Both runs, the one using the Nedlib crawler and the one using Combine, were incomplete. However, the latter was considerably larger and therefore yields more accurate numbers. To convey a picture of the dimensions this repository is dealing with, an excerpt of the statistics based on the Combine crawl is presented here. Even incomplete crawls provide useful insight, yet numbers about ten times as high may be expected for a complete snapshot.
Table 4.1 shows numbers for the various domains from which documents have been extracted. For each domain it lists the number of hosts that have been accessed, the number of documents that have been acquired, and the size in bytes of all the files downloaded. As expected, most documents have been collected from the .at domain. The numbers for the standardised second-level domains .ac.at, .co.at, .gv.at, and .or.at are not included in the numbers for the .at domain but are listed separately. Strikingly, these second-level domains have relatively few registered hosts, which suggests that they are not widely accepted by the general public. A comparison of .ac.at and .co.at shows that, even though the academic sector has slightly fewer hosts, it is more than four times as large as the commercial sector. Quite popular in Austria is the .cc domain of the Cocos (Keeling) Islands, an island group in the Indian Ocean. In contrast, .tv, which is the domain of Tuvalu, an island group in the South Pacific Ocean, and at the same time an abbreviation for television, was discovered only recently and is expected to grow, especially with the introduction of private television.
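The per-domain counts described for Table 4.1 can be illustrated with a minimal sketch. It is not the statistics module used in the project; the function names `domain_of` and `domain_statistics` and the input format, a list of (URL, size in bytes) pairs, are assumptions made for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Standardised second-level domains that are reported separately from .at
# (the list is taken from the text above).
SECOND_LEVEL = (".ac.at", ".co.at", ".gv.at", ".or.at")


def domain_of(host):
    """Return the reporting domain of a host name, e.g. '.ac.at' or '.cc'.

    Hypothetical helper: numeric IP addresses are not treated specially.
    """
    host = host.lower().rstrip(".")
    for sld in SECOND_LEVEL:
        if host.endswith(sld):
            return sld
    # Otherwise fall back to the top-level domain.
    return "." + host.rsplit(".", 1)[-1]


def domain_statistics(records):
    """Aggregate (url, size_in_bytes) pairs into per-domain statistics."""
    hosts = defaultdict(set)   # distinct hosts seen per domain
    docs = defaultdict(int)    # number of documents per domain
    size = defaultdict(int)    # total bytes per domain
    for url, nbytes in records:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip records without a usable host name
        dom = domain_of(host)
        hosts[dom].add(host)
        docs[dom] += 1
        size[dom] += nbytes
    return {
        dom: {"hosts": len(hosts[dom]), "documents": docs[dom], "bytes": size[dom]}
        for dom in docs
    }
```

Checking the second-level domains before falling back to the top-level domain is what keeps hosts under .ac.at or .co.at out of the .at row.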
Table 4.2 lists the extensions of the acquired files, i.e. the data format they have. For each extension the number of files and the total size of those files are listed. The first section of the table details the most prevalent extensions for the HTML data format, first each separately, then summed up. Besides the extensions .html, .htm, .shtml, and .shtm, the entry "automatic" is listed. This refers to URLs that point not directly to a file but to a directory. On such a request the web server returns a default file located in that directory. Our web server, for example, redirects http://www.ifs.tuwien.ac.at/~aola/ to the URL http://www.ifs.tuwien.ac.at/~aola/index.html.
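How a URL might be assigned to an extension row, including the "automatic" entry, is sketched below. This is an illustration under assumptions, not the module's actual logic: the names `extension_of` and `extension_statistics`, the "(none)" label, and the rule that a path ending in a slash counts as a directory are all hypothetical. A directory URL written without the trailing slash cannot be recognised from the URL alone.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit

# Rows that are summed up as HTML (taken from the text above).
HTML_ROWS = {".html", ".htm", ".shtml", ".shtm", "automatic"}


def extension_of(url):
    """Return the lower-cased extension of a URL, or 'automatic' for directories."""
    path = urlsplit(url).path
    if path == "" or path.endswith("/"):
        # Assumption: the server answers with a default file such as index.html.
        return "automatic"
    ext = posixpath.splitext(path)[1].lower()
    return ext if ext else "(none)"  # hypothetical label for extensionless files


def extension_statistics(records):
    """Aggregate (url, size_in_bytes) pairs into per-extension counts."""
    stats = defaultdict(lambda: {"files": 0, "bytes": 0})
    for url, nbytes in records:
        row = stats[extension_of(url)]
        row["files"] += 1
        row["bytes"] += nbytes
    return dict(stats)


def html_total(stats):
    """Sum the HTML rows, mirroring the summed-up line in the table."""
    rows = [stats[ext] for ext in HTML_ROWS if ext in stats]
    return {
        "files": sum(r["files"] for r in rows),
        "bytes": sum(r["bytes"] for r in rows),
    }
```

Using only the path component of the URL keeps query strings from being mistaken for part of the extension.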
Furthermore, the table clearly shows that Adobe's PDF format is more popular than PostScript. The dominance of the JPEG format over other image types is also obvious. This is due to the high compression rate JPEG offers, which is a crucial feature considering the low download rates many users have to cope with.
In addition, a large number of unusual extensions have been discovered, such as .d15 or .grv. The MIME type of a document could give information about its type, yet many files remain unrecognised [Arv01]. For those unknown formats it is difficult, perhaps impossible, to find an appropriate long-term preservation strategy. The same difficulties apply to access provision.