Million Song Dataset

From Chorus
Revision as of 15:58, 18 April 2011 by Lidy (Talk | contribs)
Domain: Music
Media: Audio
Size: 280 GB
Instances: 1,000,000
File Format: HDF5
Creation Date: 2011-02-08
Task: Retrieval


The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks.

Its purposes are:

  • To encourage research on algorithms that scale to commercial sizes
  • To provide a reference dataset for evaluating research
  • To provide a shortcut alternative to creating a large dataset with The Echo Nest's API
  • To help new researchers get started in the Music IR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features.

We also provide a subset of 10,000 songs (1%, 1.8 GB compressed) for a quick taste.


The dataset is distributed as HDF5 files, accompanied by a number of SQLite files and .TXT index files.
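As an illustration of how the SQLite files might be queried, here is a minimal sketch using Python's standard sqlite3 module. The table name (songs) and the column names (track_id, title, artist_name, year) are assumptions made for this example and should be checked against the schema of the actual files; the demo rows and track IDs are invented placeholders, not real dataset entries.

```python
import sqlite3


def tracks_by_artist(conn, artist_name, limit=10):
    """Return (track_id, title) pairs for one artist, ordered by title.

    Assumes a table named `songs` with `track_id`, `title` and
    `artist_name` columns (an assumption, not a documented schema).
    """
    query = (
        "SELECT track_id, title FROM songs "
        "WHERE artist_name = ? ORDER BY title LIMIT ?"
    )
    return conn.execute(query, (artist_name, limit)).fetchall()


def build_demo_db():
    """Build a tiny in-memory stand-in for a metadata file (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE songs ("
        "track_id TEXT PRIMARY KEY, title TEXT, artist_name TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?, ?)",
        [
            ("TRDEMO0001", "Beta", "Artist A", 2001),
            ("TRDEMO0002", "Alpha", "Artist A", 1999),
            ("TRDEMO0003", "Gamma", "Artist B", 2005),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = build_demo_db()
    for track_id, title in tracks_by_artist(demo, "Artist A"):
        print(track_id, title)
```

To run the same query against a downloaded file, replace build_demo_db() with sqlite3.connect() on the path to that file, after confirming the table and column names it actually contains.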


The Million Song Dataset started as a collaborative project between The Echo Nest and LabROSA. It is supported in part by the NSF.

Ground Truth Annotation

The dataset contains both metadata (artist, album, track title, tags, etc.) and a variety of annotations produced by The Echo Nest's Analysis API (see below).

An Example Track Description showing the available fields is provided here.

Additional data has been added from other sources, e.g. lyrics from musixmatch.


Numerous features (derived through audio analysis), additional metadata, tags, and links to additional resources are available for this dataset.

The list of fields is provided here.

Licensing / Copyright


References or Publications

External Links
