Spam filtering based on latent semantic indexing

W. Gansterer, A. Janecek, R. Neumayer:
"Spam filtering based on latent semantic indexing";
Talk: 2007 SIAM Conference on Data Mining, Minneapolis, MN, USA; April 28, 2007; in: "2007 SIAM Conference on Data Mining Workshop and Tutorial Proceedings" (2007), 9 pages.


This paper summarizes a study of the classification performance of a vector space model (VSM) and of latent semantic indexing (LSI) applied to spam filtering. Both methods classify e-mail messages as spam or not spam using a feature set taken from SpamAssassin, the extremely widespread, de facto standard spam filtering system. The test data come partly from the official TREC 2005 data set and were partly collected by the authors. The investigation of LSI for spam filtering evaluates the relationship between two central aspects: (i) the truncation of the SVD in LSI and (ii) the resulting classification performance in this specific application context. It is shown that a surprisingly large amount of truncation is often possible without a substantial loss in classification performance. This forms the basis for good and extremely fast approximate (pre-)classification strategies, which are very useful in practice. The approaches investigated in this paper are shown to compare favorably with two important alternatives: (i) they achieve better classification results than SpamAssassin, and (ii) they are better and more robust than a related, previously proposed LSI-based approach that uses textual features.
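
The abstract does not give the feature extraction, data, or classifier details, so the following is only a minimal sketch of the general idea of LSI-based classification with a truncated SVD, not the authors' implementation. The synthetic feature-by-message matrix, the nearest-neighbor cosine rule, and the names `lsi_fit`, `lsi_classify`, `spam_profile`, and `ham_profile` are all assumptions made for illustration.

```python
import numpy as np

def lsi_fit(A, k):
    """Rank-k truncated SVD of a feature-by-message matrix A (features x messages)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    U_k = U[:, :k]
    # Coordinates of each training message in the k-dimensional latent space
    # (equal to U_k^T A, stored with one row per message).
    docs_k = Vt[:k, :].T * s[:k]
    return U_k, docs_k

def lsi_classify(U_k, docs_k, labels, q):
    """Assign q the label of its most cosine-similar training message (assumed rule)."""
    q_k = U_k.T @ q  # project the new message into the latent space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12
    sims = (docs_k @ q_k) / norms
    return labels[int(np.argmax(sims))]

# Synthetic stand-in data (hypothetical): 6 features, 20 messages per class.
rng = np.random.default_rng(0)
n_features, n_per_class = 6, 20
spam_profile = np.array([1, 1, 1, 0, 0, 0], dtype=float)
ham_profile = np.array([0, 0, 0, 1, 1, 1], dtype=float)
spam = spam_profile[:, None] + 0.1 * rng.random((n_features, n_per_class))
ham = ham_profile[:, None] + 0.1 * rng.random((n_features, n_per_class))
A = np.hstack([spam, ham])
labels = np.array(["spam"] * n_per_class + ["ham"] * n_per_class)

# Compare a strongly truncated model (k=2) with the untruncated one (k=6).
for k in (2, 6):
    U_k, docs_k = lsi_fit(A, k)
    print(k, lsi_classify(U_k, docs_k, labels, spam_profile),
          lsi_classify(U_k, docs_k, labels, ham_profile))
```

In this toy setting the truncation level `k` is the only tuning parameter; a smaller `k` means each new message is compared in a lower-dimensional space, which is where the speed advantage of strong truncation mentioned in the abstract would come from.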