Spam Filtering Based on Latent Semantic Indexing

W. Gansterer, A. Janecek, R. Neumayer:
"Spam Filtering Based on Latent Semantic Indexing";
in: "Survey of Text Mining II: Clustering, Classification, and Retrieval", edited by: The University of Tennessee; Springer, Berlin-Heidelberg, 2008, ISBN: 978-1-84800-045-2, pp. 165 - 183.


In this chapter, the classification performance of Latent Semantic Indexing (LSI)
applied to the task of detecting and filtering unsolicited bulk or commercial e-mail
(UBE, UCE, commonly called “spam”) is studied. Comparisons to the simple Vector
Space Model (VSM) and to the extremely widespread, de-facto standard for spam
filtering, the SpamAssassin system, are summarized. It is shown that VSM and LSI
achieve significantly better classification results than SpamAssassin.
Obviously, the classification performance achieved in this special application
context strongly depends on the feature sets used. Consequently, the various
classification methods are also compared using two different feature sets: (i.) a set of
purely textual features of e-mail messages which are based on standard word- and
token-extraction techniques, and (ii.) a set of application-specific “meta features” of
e-mail messages as extracted by the SpamAssassin system. It is illustrated that the
latter tends to achieve consistently better classification results.
A third central aspect discussed in this chapter is the issue of problem reduction
in order to reduce the computational effort for classification, which is of particular
importance in the context of time-critical on-line spam filtering. In particular, the
effects of truncation of the SVD in LSI and of a reduction of the underlying feature
set are investigated and compared. It is shown that a surprisingly large amount of
problem reduction is often possible in the context of spam filtering without heavy
loss in classification performance.
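
To make the idea of truncating the SVD in LSI concrete, the following is a minimal sketch and not the chapter's actual implementation. It assumes a feature-by-message matrix A, keeps only the k largest singular triplets, and labels a new message by its most similar training message (cosine similarity) in the reduced space. The function names, the nearest-neighbour decision rule, and the toy data are all illustrative assumptions.

```python
import numpy as np

def lsi_fit(A, k):
    """Truncated SVD of a feature-by-message matrix A (hypothetical helper).

    Returns the k leading left singular vectors and the training
    messages expressed in the k-dimensional reduced space.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    # Rows of docs_k are the training messages projected onto U_k
    # (equivalently V_k * S_k).
    docs_k = Vt[:k, :].T * s[:k]
    return U_k, docs_k

def lsi_classify(q, U_k, docs_k, labels):
    """Label a feature vector q by its most similar training message (assumed rule)."""
    q_k = U_k.T @ q  # project the new message into the reduced space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12
    sims = (docs_k @ q_k) / norms  # cosine similarities
    return labels[int(np.argmax(sims))]

# Toy data (invented for illustration): 4 features x 4 training messages.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])
labels = ["spam", "spam", "ham", "ham"]

# Keep only k = 2 of the 4 singular triplets.
U_k, docs_k = lsi_fit(A, k=2)
result_spam = lsi_classify(np.array([1.0, 1.0, 0.0, 0.0]), U_k, docs_k, labels)
result_ham = lsi_classify(np.array([0.0, 0.0, 1.0, 1.0]), U_k, docs_k, labels)
```

In this sketch, a smaller k shrinks both the stored representation and the per-message classification cost, which is the kind of problem reduction the abstract refers to; how small k can be before accuracy degrades depends on the data.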