Million Song Dataset
Domain | Music |
Media | Audio |
Size | 280 GB |
Instances | 1,000,000 |
File Format | HDF5 |
Creation Date | 2011-02-08 |
Task | Retrieval |
Copyright | |
URL | http://labrosa.ee.columbia.edu/millionsong/ |
Description
The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks.
Its purposes are:
- To encourage research on algorithms that scale to commercial sizes
- To provide a reference dataset for evaluating research
- To offer a shortcut alternative to creating a large dataset with The Echo Nest's API
- To help new researchers get started in the Music IR field
The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features.
We also provide a subset of 10,000 songs (1%, 1.8 GB compressed) for a quick taste.
Quality
The dataset is distributed in HDF5 format, accompanied by a number of SQLite files and .TXT index files.
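As an illustration of how the accompanying SQLite files might be queried, the following is a minimal sketch using Python's standard `sqlite3` module. The table name `songs`, the column names (`track_id`, `title`, `artist_name`, `year`), and the helper functions are assumptions made for this example, not a confirmed description of the distributed files; inspect the actual schema before relying on them. So that it runs without the dataset, the sketch builds a small in-memory database with made-up rows.

```python
import sqlite3


def tracks_by_artist(conn, artist_name):
    """Return (track_id, title, year) rows for one artist, ordered by year, then title.

    Assumes a table named `songs` with these columns (hypothetical schema).
    """
    cur = conn.execute(
        "SELECT track_id, title, year FROM songs "
        "WHERE artist_name = ? ORDER BY year, title",
        (artist_name,),  # parameterized to avoid SQL injection
    )
    return cur.fetchall()


def build_demo_db():
    """Create an in-memory stand-in database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE songs ("
        "track_id TEXT PRIMARY KEY, title TEXT, artist_name TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?, ?)",
        [
            ("TRDEMO0001", "First Song", "Demo Artist", 2001),
            ("TRDEMO0002", "Second Song", "Demo Artist", 1999),
            ("TRDEMO0003", "Other Song", "Another Artist", 2005),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for row in tracks_by_artist(demo, "Demo Artist"):
        print(row)
```

To run the same query against a downloaded file, replace `build_demo_db()` with `sqlite3.connect(path_to_file)` and adjust the table and column names to match the real schema.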
Source
The Million Song Dataset started as a collaborative project between The Echo Nest and LabROSA. It is supported in part by the NSF.
Ground Truth Annotation
The dataset contains both metadata (artist, album, track title, tags, etc.) and a variety of annotations produced by The Echo Nest's Analysis API (see below).
An Example Track Description showing the available fields is provided here.
Additional data has been added from other sources, e.g. lyrics from musixmatch.
Features
The dataset provides numerous features derived through audio analysis, along with additional metadata, tags, and links to further resources.
The list of fields is provided here.
Licensing / Copyright
Citation