Speaker Recognition by Topics

Current speaker recognition research focusses strongly on features which capture the voice of the speaker. In the broadcast domain, however, there are also other cues that can be used to identify speakers, such as the topic of an utterance: Speakers often appear in programmes on topics for which they are considered experts. For example, a tennis player is more likely to talk about tennis and to appear in a report about sports than a politician. Vice versa, one can assume that a report about sports is more likely to contain a tennis player than a politician.

I've explored this idea in my PhD thesis and other publications. To conduct the experiments described in these papers, I've compiled an audio/video corpus from speeches and debates available on the German Parliament's Web-TV site: The Bundestag Corpus. In the experiments, several tools were very useful.

Contact: Doris Baum.

Bundestag Corpus

Most speaker recognition corpora with a large number of speakers don't lend themselves to exploring topic-based speaker recognition, because the speakers aren't free to choose the topics they talk about. Therefore, a German-language corpus consisting of politicians' speeches in the German parliament has been set up: the Bundestag Corpus. As politicians often specialise in particular areas like economy or environment, speaker topic preferences can be expected in this corpus. Recordings of the speeches are available online at the German Parliament's Web-TV page. The Bundestag's archive also has corrected manual transcripts of the parliamentary sessions.

Corpus Lists

The audio part of the corpus is described by three lists given below which detail the speeches assigned to the training, development, and test set. The list format is
lastname;firstname;date (DD.MM.YYYY);time;url (permalink),
where lastname and firstname describe the politician/speaker, date is the date of the speech in DD.MM.YYYY format, time is the time of the speech (in 24h format), and url is a permalink to the recording of the speech.
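
A minimal sketch of reading one record in this format is given below. The class and function names are hypothetical, the example record is made up, and the HH:MM time layout is an assumption (the lists only specify "24h format"), so the format string may need adjusting.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SpeechEntry:
    lastname: str
    firstname: str
    spoken_at: datetime  # date and time of the speech combined
    url: str


def parse_corpus_line(line: str) -> SpeechEntry:
    """Parse one 'lastname;firstname;date;time;url' record."""
    # Split at most 4 times so a ';' inside the URL would not break parsing.
    lastname, firstname, date, time, url = line.strip().split(";", 4)
    # Assumption: time is written as HH:MM; change the format if the lists differ.
    spoken_at = datetime.strptime(f"{date} {time}", "%d.%m.%Y %H:%M")
    return SpeechEntry(lastname, firstname, spoken_at, url)


# Made-up example record, not taken from the corpus lists.
example = parse_corpus_line(
    "Mustermann;Erika;18.10.2005;11:30;https://example.org/speech/1"
)
print(example.spoken_at.isoformat())  # prints 2005-10-18T11:30:00
```

Combining date and time into a single datetime makes it easy to sort speeches chronologically or to match them against session dates.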

If you want to download and use the recordings in your own research, you must check and accept the Bundestag's Conditions of Use (German) yourself. If you do so, you may want to use an automation tool such as Perl with WWW::Mechanize to retrieve the actual video download URLs from the permalinks. Here's an example Perl script which combines the three lists into one long list including video download URLs. This script is provided with no warranty: make sure to read the terms of use and to use it responsibly.

Corpus Characteristics

The corpus characteristics are summed up in the following table:

Number of target speakers	235
Number of target speakers with development material	120
Number of training files per speaker	5
Number of test files per speaker	2 - 5
Number of development files per speaker	2 - 5
Total number of training files	1,175
Total number of test files	994
Total number of development files	412
Number of target trials	994
Number of non-target trials	232,596
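
The trial counts above are consistent with scoring every test file against every target speaker, so that each test file yields one target trial (its true speaker) and 234 non-target trials. That reading is an assumption drawn from the numbers, not something stated explicitly here; a short arithmetic check:

```python
num_target_speakers = 235
num_test_files = 994
training_files_per_speaker = 5

# Assumption: every test file is scored against all 235 target speakers.
target_trials = num_test_files  # one true speaker per test file
non_target_trials = num_test_files * (num_target_speakers - 1)
total_training_files = num_target_speakers * training_files_per_speaker

print(target_trials, non_target_trials, total_training_files)
# prints: 994 232596 1175
```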

The next table gives information on the different sets: the number of speeches/videos, the summed duration of the video material, and the number of running words recognised by the speech recognition system used in the experiments.

Set	Number of videos	Total duration	Number of running words
Train	1,175	150 h	1,141,815
Test	994	120 h	927,444
Development	412	54 h	407,661
Total	2,581	324 h	2,476,920

Of the 235 speakers in the corpus, 71 (30%) are female and 164 (70%) are male, which roughly corresponds to the overall proportion of women amongst members of parliament (33%). You can download a list of speaker genders.
The length of the recordings ranges from 30 seconds (for a short question) to an hour (for the Chancellor's address to the newly elected parliament), but half of the speeches are between 5 minutes (first quartile) and 9 minutes (third quartile) long, the median being 6:30 minutes.
The acoustic conditions found in the recordings are usually rather good: not studio quality but not heavily degraded either. How controversial the speaker or the discussed topic is varies, and with it the amount of emotional speech, high vocal effort (shouting), and background noise. Speakers are heckled from time to time, so there is crosstalk, and the microphones are not always optimally set up, which occasionally leads to reverberation or acoustic feedback. The politicians are semi-professional to professional orators, and although speeches are not normally read, speakers often use prepared notes for guidance. Accordingly, the resulting speech is semi-spontaneous.
Also, it seems that some target speakers are easier to recognise in a test file than others. For example, the error rate of the fused idiolectal + topic-based + spectral system described in [2] varies for each target speaker. The error rate per speaker could serve as an estimate of the "difficulty" of a speaker. You can download a list here.
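
As a rough illustration of such a per-speaker difficulty estimate, the sketch below computes the fraction of misclassified trials for each target speaker at a fixed decision threshold. This is a simple stand-in: the exact error measure used in [2] is not specified here, and the function name, trial layout, threshold, and toy scores are all hypothetical.

```python
from collections import defaultdict


def per_speaker_error_rate(trials, threshold):
    """trials: iterable of (target_speaker, is_target_trial, score) tuples."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for speaker, is_target, score in trials:
        accepted = score >= threshold
        totals[speaker] += 1
        if accepted != is_target:  # either a miss or a false alarm
            errors[speaker] += 1
    return {speaker: errors[speaker] / totals[speaker] for speaker in totals}


# Toy scores, invented for illustration only.
toy_trials = [
    ("speaker_a", True, 2.1),
    ("speaker_a", False, -1.3),
    ("speaker_b", True, -0.4),   # miss
    ("speaker_b", False, 0.7),   # false alarm
    ("speaker_b", False, -2.0),
    ("speaker_b", False, -0.9),
]
rates = per_speaker_error_rate(toy_trials, threshold=0.0)
print(rates)  # prints {'speaker_a': 0.0, 'speaker_b': 0.5}
```

Sorting the resulting dictionary by value would rank target speakers from easiest to hardest under this simplified measure.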

Text Corpus

The topic-based speaker recognition approaches described in the papers use a background set of written text. The Bundestag corpus background set consists of all available transcripts from October 18th 2005 (1st session of the 16th legislative period) to December 19th 2008 (197th session of the 16th legislative period), downloadable at the Bundestag's transcript archive ("Plenarprotokoll Bundestag" 16/1 through 16/197). This amounts to approximately 10 million running words.

Scores

The scores on the Bundestag development and test sets for the spectral, idiolectal, keyword-based, and topic-model-based systems described in [1] can be downloaded as well: scores.zip (18 MB).

Publications

[1] D. Baum. Using Topic Cues for Speaker Recognition in Broadcast Multimedia Archives. PhD Thesis, Vienna University of Technology, 2013. link
[2] D. Baum. Recognising speakers from the topics they talk about. In Speech Communication, 54(10):1132-1142, 2012. link
[3] D. Baum. Topic-based speaker recognition for German parliamentary speeches. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU '09), pages 427-431. Merano, Italy, December 2009. link

Useful Software

In setting up the experiments to evaluate topic-based speaker recognition, the following tools were very useful:

Snowball stemmer
MALLET - MAchine Learning for LanguagE Toolkit
SVMlight - Thorsten Joachims's Support Vector Machine (SVM) implementation in C
Julius - Open-Source Large Vocabulary Continuous Speech Recognition Engine
ISIP speech recognition software
FoCal and FoCal Bilinear - Toolkits for Evaluation, Fusion and Calibration of statistical pattern recognizers
DET-Curve Plotting software for use with MATLAB - software for generating Detection Error Trade-off curves, from NIST