Vienna University of Technology
Institute of Software Technology and Interactive Systems
Speaker Recognition by Topics

Current speaker recognition research focusses strongly on features which capture the voice of the speaker. In the broadcast domain, however, there are also other cues that can be used to identify speakers, such as the topic of an utterance: Speakers often appear in programmes on topics for which they are considered experts. For example, a tennis player is more likely to talk about tennis and to appear in a report about sports than a politician. Vice versa, one can assume that a report about sports is more likely to contain a tennis player than a politician.

I've explored this idea in my PhD thesis and other publications. To conduct the experiments described in these papers, I've compiled an audio/video corpus from speeches and debates available on the German Parliament's Web-TV site: The Bundestag Corpus. In the experiments, several tools were very useful.

Contact: Doris Baum.

Bundestag Corpus

Most speaker recognition corpora with a large number of speakers don't lend themselves to exploring topic-based speaker recognition, because the speakers aren't free to choose the topics they talk about. Therefore, a German-language corpus consisting of politicians' speeches in the German parliament has been set up: the Bundestag Corpus. As politicians often specialise in particular areas such as the economy or the environment, speaker topic preferences can be expected in this corpus. Recordings of the speeches are available online at the German Parliament's Web-TV page. The Bundestag's archive also has corrected manual transcripts of the parliamentary sessions.

Corpus Lists

The audio part of the corpus is described by three lists given below which detail the speeches assigned to the training, development, and test set. The list format is
lastname;firstname;date (DD.MM.YYYY);time;url (permalink),
where lastname and firstname describe the politician/speaker, date is the date of the speech in DD.MM.YYYY format, time is the time of the speech (in 24h format), and url is a permalink to the recording of the speech.
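
A minimal sketch of reading one list entry in Python, assuming every line holds exactly the five semicolon-separated fields described above. The names `SpeechEntry` and `parse_entry` and the sample line (including its speaker name and URL) are made up for illustration and are not part of the corpus. The time field is kept as plain text because its exact layout is not specified here.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class SpeechEntry:
    lastname: str
    firstname: str
    day: date   # parsed from DD.MM.YYYY
    time: str   # kept as text; the exact time layout is not assumed
    url: str    # permalink to the recording


def parse_entry(line: str) -> SpeechEntry:
    """Parse one 'lastname;firstname;date;time;url' line."""
    # Split at most four times so a ';' inside the URL is preserved.
    fields = line.strip().split(";", 4)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    lastname, firstname, day_text, time_text, url = fields
    day = datetime.strptime(day_text, "%d.%m.%Y").date()
    return SpeechEntry(lastname, firstname, day, time_text, url)


# Hypothetical example line; the name and URL are invented for illustration.
example = "Mustermann;Erika;24.03.2009;14:05;https://example.org/permalink/123"
entry = parse_entry(example)
print(entry.lastname, entry.day.isoformat())
```

Lines with a missing field or a date that does not match DD.MM.YYYY raise a `ValueError`, which makes malformed entries easy to spot before any download is attempted.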

If you want to download and use the recordings in your own research, you must check and accept the Bundestag's Conditions of Use (German) yourself. If you do so, you may want to use an automation tool such as Perl with WWW::Mechanize to retrieve the actual video download URLs from the permalinks. Here's an example Perl script which combines the three lists into one long list that includes the video download URLs. This script is provided with no warranty - make sure to read the terms of use and to use it responsibly.

Corpus Characteristics

The corpus characteristics are summarised in the following table:

Number of target speakers                              235
Number of target speakers with development material    120
Number of training files per speaker                   5
Number of test files per speaker                       2 - 5
Number of development files per speaker                2 - 5
Total number of training files                         1,175
Total number of test files                             994
Total number of development files                      412
Number of target trials                                994
Number of non-target trials                            232,596
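
The trial counts in the table are consistent with scoring every test file against every target speaker, which gives one target trial and 234 non-target trials per file. The short Python check below works through that arithmetic. The all-against-all trial design is an assumption inferred from the numbers, not something stated on this page.

```python
# Figures taken from the table above.
num_speakers = 235
train_files_per_speaker = 5
num_test_files = 994

# Assumption: each test file is scored against all target speakers,
# so it yields one target trial and (num_speakers - 1) non-target trials.
total_train_files = num_speakers * train_files_per_speaker
target_trials = num_test_files
non_target_trials = num_test_files * (num_speakers - 1)

print(total_train_files)   # 1175
print(target_trials)       # 994
print(non_target_trials)   # 232596
```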

The next table gives information on the different sets: the number of speeches/videos, the summed duration of the video material, and the number of running words recognised by the speech recognition system used in the experiments.

Set           Number of videos   Total duration   Number of running words
Train         1,175              150 h            1,141,815
Test          994                120 h            927,444
Development   412                54 h             407,661
Total         2,581              324 h            2,476,920
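
As a quick consistency check, the totals row can be recomputed from the three sets, and the figures also give a rough idea of the average speech length and speaking rate. This Python sketch uses only the numbers in the table. Because the durations are rounded to whole hours, the derived averages are approximate.

```python
# (videos, hours, running words) per set, copied from the table above.
sets = {
    "Train": (1175, 150, 1_141_815),
    "Test": (994, 120, 927_444),
    "Development": (412, 54, 407_661),
}

total_videos = sum(videos for videos, _, _ in sets.values())
total_hours = sum(hours for _, hours, _ in sets.values())
total_words = sum(words for _, _, words in sets.values())

# Approximate averages; durations are only given to the nearest hour.
minutes_per_video = total_hours * 60 / total_videos
words_per_minute = total_words / (total_hours * 60)

print(total_videos, total_hours, total_words)
print(f"{minutes_per_video:.1f} min per video, {words_per_minute:.0f} words per minute")
```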

Of the 235 speakers in the corpus, 71 (30%) are female and 164 (70%) are male, which roughly matches the overall proportion of women amongst members of parliament (33%). You can download a list of speaker genders.
The length of the recordings ranges from 30 seconds (for a short question) to an hour (for the Chancellor's address to the newly elected parliament), but half of the speeches are between 5 minutes (first quartile) and 9 minutes (third quartile) long, with a median of 6 minutes 30 seconds.
The acoustic conditions found in the recordings are usually rather good: not studio quality but not heavily degraded either. How controversial the speaker or the discussed topic is varies, and with it the amount of emotional speech, high vocal effort (shouting), and background noise. Speakers are heckled from time to time, so there is crosstalk, and the microphones are not always optimally set up, which occasionally leads to reverberation or acoustic feedback. The politicians are semi-professional to professional orators, and although speeches are not normally read out, speakers often use prepared notes for guidance. Accordingly, the resulting speech is semi-spontaneous.
Also, it seems that some target speakers are easier to recognise in a test file than others. For example, the error rate of the fused idiolectal + topic-based + spectral system described in [2] varies for each target speaker. The error rate per speaker could serve as an estimate of the "difficulty" of a speaker. You can download a list here.
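
One way such a per-speaker "difficulty" estimate could be computed is sketched below in Python. It assumes each trial is available as a tuple of target speaker, whether the trial is a true target trial, and whether the system accepted it, and it counts both misses and false alarms as errors. The function name, the tuple layout, and the sample trials are hypothetical, and the error measure used in [2] may be defined differently.

```python
from collections import defaultdict


def error_rate_per_speaker(trials):
    """Return {speaker: fraction of wrongly decided trials}.

    trials: iterable of (target_speaker, is_target, accepted) tuples.
    """
    errors = defaultdict(int)
    counts = defaultdict(int)
    for speaker, is_target, accepted in trials:
        counts[speaker] += 1
        if accepted != is_target:  # miss or false alarm
            errors[speaker] += 1
    return {speaker: errors[speaker] / counts[speaker] for speaker in counts}


# Hypothetical trials, invented for illustration only.
trials = [
    ("speaker_a", True, True),
    ("speaker_a", False, False),
    ("speaker_b", True, False),   # miss
    ("speaker_b", False, False),
]
rates = error_rate_per_speaker(trials)
hardest_first = sorted(rates, key=rates.get, reverse=True)
print(hardest_first)
```

Sorting by error rate puts the speakers that the system confuses most often at the top of the list.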

Text Corpus

The topic-based speaker recognition approaches described in the papers use a background set of written text. The Bundestag corpus background set consists of all available transcripts from October 18th 2005 (1st session of the 16th legislative period) to December 19th 2008 (197th session of the 16th legislative period), downloadable at the Bundestag's transcript archive ("Plenarprotokoll Bundestag" 16/1 through 16/197). This amounts to approximately 10 million running words.

Scores

The scores on the Bundestag development and test sets for the spectral, idiolectal, keyword-based, and topic-model-based systems described in [1] can be downloaded as well: scores.zip (18 MB).

Publications

Useful Software

In setting up the experiments to evaluate topic-based speaker recognition, the following tools were very useful: