Evaluation of the harvested data

To gain insight into the material retrieved during a snapshot, we implemented a module that compiles statistics. Both runs, the one using the Nedlib crawler and the one using Combine, were incomplete. However, the latter was considerably larger, which yields more accurate numbers. Therefore, to convey a picture of the dimensions this repository is dealing with, an excerpt of the statistics based on the Combine crawl is presented here. Incomplete crawls still provide useful insight, yet numbers about ten times as high may be expected for a complete snapshot.

Table 4.1 shows figures for the domains from which documents were extracted. For each domain it lists the number of hosts accessed, the number of documents acquired, and the total size of the downloaded files in kilobytes. As expected, most documents were collected from the .at domain. The numbers for the standardised second-level domains .ac.at, .co.at, .gv.at, and .or.at are not included in the numbers for the .at domain but are listed separately. It is striking that they account for relatively few hosts, which suggests that they are not widely accepted by the general public. Comparing .ac.at with .co.at, it is notable that although the academic sector has somewhat fewer hosts (1.798 versus 2.091), its data volume is more than four times that of the commercial sector (21.299.944 versus 4.674.595 kilobytes). Quite popular in Austria is the .cc domain of the Cocos (Keeling) Islands, an island group in the Indian Ocean. By contrast, .tv, which is the country code of Tuvalu, an island group in the South Pacific Ocean, and at the same time an abbreviation for television, was discovered only recently and is expected to grow, especially with the introduction of private television.
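The grouping behind Table 4.1 can be sketched as follows. This is only an illustration under stated assumptions, not the statistics module itself: the function names `domain_bucket` and `domain_statistics` and the input format of `(url, size)` pairs are hypothetical. Only the rule that the four standardised second-level domains are counted separately from .at is taken from the text.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Standardised Austrian second-level domains that are listed separately
# from the plain .at domain (as described in the text).
SECOND_LEVEL = ("ac.at", "co.at", "gv.at", "or.at")


def domain_bucket(host: str) -> str:
    """Return the statistics bucket for a host name (hypothetical helper)."""
    host = host.lower().rstrip(".")
    for sld in SECOND_LEVEL:
        if host == sld or host.endswith("." + sld):
            return sld
    # Everything else is counted under its top-level domain.
    return host.rsplit(".", 1)[-1]


def domain_statistics(records):
    """Aggregate (url, size) pairs into hosts, documents and size per bucket.

    The record format is an assumption made for this sketch.
    """
    hosts = defaultdict(set)
    stats = defaultdict(lambda: {"documents": 0, "size": 0})
    for url, size in records:
        host = urlsplit(url).hostname or ""
        bucket = domain_bucket(host)
        hosts[bucket].add(host)          # distinct hosts per bucket
        stats[bucket]["documents"] += 1  # one record = one document
        stats[bucket]["size"] += size
    return {
        bucket: {"hosts": len(hosts[bucket]), **counters}
        for bucket, counters in stats.items()
    }
```

Because a host under .ac.at is matched before the fallback to the top-level domain, it never contributes to the .at row, which mirrors how the table is described.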

Table 4.2 lists the extensions of the acquired files, which indicate their data format. For each extension, the number of files and their total size are listed. The first block of the table details the most prevalent extensions for the HTML data format, first each separately, then summed up. Besides the extensions .html, .htm, .shtml, and .shtm, the entry "automatic" is listed. This refers to URLs that point to a directory rather than directly to a file. On such a request, the web server returns a default file found in that directory. Our web server, for example, redirects http://www.ifs.tuwien.ac.at/~aola/ to the URL http://www.ifs.tuwien.ac.at/~aola/index.html.
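One way to derive the extension buckets of Table 4.2 from a URL is sketched below. It rests on assumptions that are not from the source: a URL is treated as a directory request ("automatic") when its path is empty or ends with a slash, files without an extension go into a hypothetical "(none)" bucket, and the names `extension_bucket` and `is_html` are illustrative.

```python
import posixpath
from urllib.parse import urlsplit

# Extensions counted as HTML in the first block of Table 4.2.
HTML_EXTENSIONS = {"html", "htm", "shtml", "shtm"}


def extension_bucket(url: str) -> str:
    """Return the extension bucket for a URL (hypothetical helper)."""
    path = urlsplit(url).path
    # Assumption: an empty path or a trailing slash marks a directory
    # request, for which the server returns a default file.
    if path == "" or path.endswith("/"):
        return "automatic"
    _, ext = posixpath.splitext(posixpath.basename(path))
    # "(none)" is a label chosen for this sketch, not a row of the table.
    return ext[1:].lower() if ext else "(none)"


def is_html(url: str) -> bool:
    """True if the URL falls into the HTML block, including "automatic"."""
    bucket = extension_bucket(url)
    return bucket in HTML_EXTENSIONS or bucket == "automatic"
```

With this heuristic, http://www.ifs.tuwien.ac.at/~aola/ is counted as "automatic", while http://www.ifs.tuwien.ac.at/~aola/index.html is counted as html. A directory URL written without the trailing slash cannot be recognised from the URL alone and would need the server's response.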

Furthermore, the figures clearly show that Adobe's PDF format is more popular than PostScript (49.913 versus 2.757 documents). The dominance of the JPEG format over other image types is also evident. This is due to the high compression rate JPEG offers, a crucial feature considering the low download rates many users have to cope with.

In addition, numerous unusual extensions were discovered, such as .d15 or .grv. The MIME type of a document could give information about its type, yet many files remain unrecognised [Arv01]. For those unknown formats it is difficult, perhaps impossible, to find an appropriate long-term preservation strategy. The same difficulties apply to access provision.
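The idea of consulting the MIME type can be illustrated with a small sketch. It is not the procedure used for the crawl. It assumes that the MIME type reported by the server is available for each document, that `application/octet-stream` carries no useful information, and that Python's standard `mimetypes` table serves as a fallback based on the extension. The function name `recognise_format` is hypothetical.

```python
import mimetypes
from typing import Optional


def recognise_format(url_path: str, declared_type: Optional[str] = None) -> str:
    """Return a MIME type for a document, or "unrecognised" (hypothetical helper)."""
    if declared_type:
        # Drop parameters such as "; charset=..." and normalise the case.
        media_type = declared_type.split(";", 1)[0].strip().lower()
        # Assumption: the generic binary type says nothing about the format.
        if media_type and media_type != "application/octet-stream":
            return media_type
    # Fall back to the extension-based lookup of the standard library.
    guessed, _encoding = mimetypes.guess_type(url_path)
    return guessed or "unrecognised"
```

Documents that end up as "unrecognised" in such a scheme correspond to the files for which neither the server nor the extension reveals the format.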

Table 4.1: second run - statistics (excerpt) - domains

domain	#hosts	#documents	size (kilobyte)
at	38.883	2.116.940	77.191.623
ac.at	1.798	311.798	21.299.944
co.at	2.091	124.459	4.674.595
gv.at	262	54.035	3.325.528
or.at	547	61.998	2.188.627
com	797	79.553	2.165.194
edu	14	60	9.954
int	1	1.582	14.962
net	211	24.772	789.394
org	133	10.997	635.357
cc	124	56.083	1.676.642
de	104	1.310	131.809
hu	1	59	1.134
tv	2	32	217
...	...	...	...
total	45.178	2.846.544	114.183.012

Table 4.2: second run - statistics (excerpt) - extensions

extension	#documents	size (kilobyte)
html	595.848	7.903.787
htm	798.765	8.712.431
shtml	32.700	583.452
shtm	3.656	89.194
"automatic"	104.212	894.742
=> sum (html+htm+shtml+shtm+automatic)	1.535.181	18.183.606
txt	11.175	253.011
pdf	49.913	20.288.111
ps	2.757	1.694.369
wav	1.669	1.480.466
mp3	5.005	7.314.008
avi	576	1.299.784
mpg/mpeg	1.352	4.058.790
jpg/jpeg	99.423	7.872.700
gif	14.181	831.244
tif/tiff	997	1.588.893
zip	13.167	9.867.170
tgz/gz	5.273	1.925.112
exe	10.078	8.267.007
cgi	77.208	852.861
jsp	16.341	243.450
asp	289.657	4.838.417
pl	73.007	826.735
php	251.732	4.653.314
xls	1.722	262.933
doc	11.884	2.031.507
rtf	2.345	259.631
d15	4	52
di	1	25
es	1	12
fas	8	248
grv	1	9
kop	2	30
...	...	...