The Million Song Dataset (MSD) [1], a collection of one million western popular music pieces, has enabled large-scale research for many MIR applications. The dataset comes with a set of features extracted via The Echo Nest API, including tempo, loudness, fade-in and fade-out timings, and MFCC-like timbre features for a number of segments.
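As a rough sketch of how such per-track features can be stored and read, the snippet below builds a tiny synthetic HDF5 file loosely modelled on the MSD per-track layout (an "analysis" group with a compound "songs" table and a per-segment timbre matrix) and reads it back with h5py. The file name and exact field selection here are illustrative assumptions, not the official MSD reader.

```python
import h5py
import numpy as np

# Hypothetical layout loosely mimicking an MSD per-track HDF5 file:
# a compound "songs" table with scalar descriptors, plus a matrix of
# MFCC-like timbre vectors (one 12-dimensional vector per segment).
song_dtype = np.dtype([("tempo", "f8"), ("loudness", "f8"),
                       ("end_of_fade_in", "f8"), ("start_of_fade_out", "f8")])

with h5py.File("toy_track.h5", "w") as f:
    grp = f.create_group("analysis")
    songs = np.array([(120.0, -7.5, 0.3, 210.1)], dtype=song_dtype)
    grp.create_dataset("songs", data=songs)
    grp.create_dataset("segments_timbre",
                       data=np.zeros((500, 12), dtype="f8"))

with h5py.File("toy_track.h5", "r") as f:
    row = f["/analysis/songs"][0]
    tempo, loudness = float(row["tempo"]), float(row["loudness"])
    timbre = f["/analysis/segments_timbre"][:]

print(tempo, loudness, timbre.shape)
```

Reading the compound table row by field name keeps the access pattern close to how column-oriented analysis tables are typically consumed.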
The dataset does, however, not provide an easy way to download the corresponding audio files, so researchers are essentially limited to the features shipped with the dataset. Using a content provider, for which links with unique IDs to its internal database existed in the MSD, we downloaded audio samples, mostly in the form of 30- or 60-second snippets. From these samples we subsequently extracted a multitude of features, which we provide to enable comparisons between different feature sets.
To support popular tasks in Music Information Retrieval research, such as musical genre classification, we further provide a categorisation of a subset of the collection into genres, obtained from the All Music Guide (allmusic.com).
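To illustrate how a genre categorisation like this can be turned into a reproducible partition, the following sketch performs a simple stratified train/test split over made-up MSD-style track IDs and genre labels. The function name, the toy labels, and the 20% test ratio are all hypothetical choices for this example, not the procedure used to build the distributed splits.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=42):
    """Split track IDs into train/test sets, preserving the genre
    distribution of the full collection in both partitions.

    labels: dict mapping track_id -> genre label.
    Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_genre = defaultdict(list)
    for track_id, genre in labels.items():
        by_genre[genre].append(track_id)
    train, test = [], []
    for genre, ids in sorted(by_genre.items()):
        rng.shuffle(ids)
        n_test = max(1, int(len(ids) * test_ratio))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test

# Toy example with invented MSD-style track IDs and three genres
labels = {f"TR{i:05d}": genre
          for i, genre in enumerate(["Pop", "Rock", "Jazz"] * 10)}
train_ids, test_ids = stratified_split(labels)
print(len(train_ids), len(test_ids))
```

Fixing the random seed is what makes such a split shareable: anyone re-running the procedure on the same label file obtains the identical partition.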
We further provide a number of predefined training/test splits that ensure comparability of experiments. These are, in detail:
Information about the Million Song Dataset and the benchmark sets provided on this Web page is also available on the collaborative platform initiated by the CHORUS+ project, a joint effort to give a comprehensive overview of resources and activities in the field of audio-visual search and the broader multimedia domain.