Music Ngram Viewer

Here are the datasets backing the Music Ngram Viewer. These datasets were generated in April and September 2011; I will update them as the score recognition continues, and the updated versions will have distinct and persistent version identifiers (20110401 for the current set). You can also access data not yet published here via the API.

Each of the numbered links below will directly download a fragment of the given corpus. In addition, for each corpus I provide the file total counts, which records the total number of 1-grams contained in the scores that make up the corpus. This file is useful to compute the relative frequencies of n-grams.

Details of the corpus construction are abbreviated here. Of note, I report only the n-grams that appeared more than 3 times in any particular year. Therefore, the sum of the 1-gram occurrences in any given corpus is smaller than the number given in the total counts file.

File format: Each of the numbered files below is gzipped tab-separated data. Each line has the following format:

ngram TAB year TAB match_count NEWLINE

As an example, here are the 7,000,000th and 7,000,001st lines from the file of the IMSLP interval 5-grams (imslp-interval-5gram-20110401.csv.gz):

3 -2 4 -5 3   1804   94
3 -2 4 -5 3   1805   21

The first line tells us that in 1804, the interval 5-gram "3 -2 4 -5 3" occurred 94 times overall.
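
The line format described above can be read with a few lines of Python. This is only a minimal sketch: the names NgramRecord, parse_line and read_ngrams are made up for illustration, and UTF-8 text encoding is an assumption. The year is kept as text because some of the datasets store counts under the "?" year.

```python
import gzip
from typing import Iterator, NamedTuple


class NgramRecord(NamedTuple):
    ngram: str        # e.g. "3 -2 4 -5 3"
    year: str         # kept as text, since "?" marks an unknown year
    match_count: int


def parse_line(line: str) -> NgramRecord:
    # Fields are tab-separated; the ngram itself contains spaces,
    # so split on tabs only.
    ngram, year, count = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, year, int(count))


def read_ngrams(path: str) -> Iterator[NgramRecord]:
    # Stream a gzipped file line by line instead of loading it whole.
    # The encoding is an assumption, not documented on this page.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


record = parse_line("3 -2 4 -5 3\t1804\t94\n")
print(record.ngram, record.year, record.match_count)
```

Streaming matters here because the individual files can be large; read_ngrams never holds more than one line in memory.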

The format of the total counts file is identical, except that the ngram field is absent: there is only one value match_count per year.
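
As a rough sketch of how the total counts file could be used to compute relative frequencies, the following Python reads lines of the form year TAB match_count and divides an n-gram's count by the total for the same year. The function names and the totals shown are made up for illustration; real totals come from the total counts file of the corpus in question.

```python
from typing import Dict, Iterable


def load_totals(lines: Iterable[str]) -> Dict[str, int]:
    # Each line is assumed to be: year TAB match_count
    totals: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        year, count = line.split("\t")
        totals[year] = int(count)
    return totals


def relative_frequency(match_count: int, year: str, totals: Dict[str, int]) -> float:
    # Count of the n-gram divided by the total number of 1-grams
    # recorded for the same year.
    return match_count / totals[year]


# Made-up totals for illustration only.
totals = load_totals(["1804\t250000\n", "1805\t100000\n"])
print(relative_frequency(94, "1804", totals))
```

Because n-grams at or below the reporting threshold are omitted from the numbered files, the totals should be taken from the total counts file rather than summed from the 1-gram file.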

Inside each file the ngrams are sorted alphabetically and then chronologically.
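
Since all lines for one ngram are adjacent in a sorted file, the yearly counts of each ngram can be collected in a single pass. The sketch below uses itertools.groupby for this; the function name series_by_ngram is hypothetical, and the third sample line is made up for illustration.

```python
from itertools import groupby
from typing import Dict, Iterable, Iterator, Tuple


def series_by_ngram(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, int]]]:
    # Split each non-empty line into [ngram, year, match_count] on tabs.
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # groupby merges only adjacent rows, which is sufficient because
    # all rows for one ngram are contiguous in a sorted file.
    for ngram, group in groupby(rows, key=lambda row: row[0]):
        yield ngram, {year: int(count) for _, year, count in group}


sample = [
    "3 -2 4 -5 3\t1804\t94\n",
    "3 -2 4 -5 3\t1805\t21\n",
    "3 -2 4 -5 4\t1804\t7\n",  # made-up line for illustration
]
for ngram, series in series_by_ngram(sample):
    print(ngram, series)
```

This approach would silently produce several partial series for one ngram if the input were not sorted, so it relies on the ordering stated above.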

Usage: This compilation is licensed under a Creative Commons Attribution 3.0 Unported License.

Petrucci Music Library - Melodies

Version 20110401

total counts, 1-grams, 2-grams, 3-grams, 4-grams, 5-grams, 6-grams, 7-grams,
8-grams, 9-grams, 10-grams, 11-grams, 12-grams, 13-grams, 14-grams, 15-grams

Petrucci Music Library - Transposed Chord Progressions

This dataset contains chord progressions of up to four chords in length and their counts. The chords represent all simultaneously active notes over all voices of a score. This means that the notes need not have the same onset time in order to appear in the same chord.

Counts of progressions contained in scores for which no year of composition/first publication is known are stored under the "?" year.

The entries represent equivalence classes of chord sequences that are equivalent up to a pitch shift. If the first chord of a sequence consists of multiple notes, the pitch of the lowest note is not stored and the chord starts with an underscore sign "_". The following number indicates the difference in semitones between the lowest and the second lowest notes. If the first chord consists of a single note, then the ngram begins with a number indicating the difference in semitones between that single note and the lowest note of the second chord.
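
The idea of equivalence up to a pitch shift can be illustrated with a small Python sketch. It is not the encoding used in the files: it assumes chords are given as tuples of MIDI pitch numbers and simply shifts every progression so that the lowest note of its first chord becomes 0. The names normalize and same_class are made up for illustration.

```python
from typing import List, Sequence, Tuple

Chord = Tuple[int, ...]


def normalize(progression: Sequence[Sequence[int]]) -> List[Chord]:
    # Shift all pitches so the lowest note of the first chord is 0.
    # Pitches are assumed to be MIDI note numbers (semitone steps).
    base = min(progression[0])
    return [tuple(sorted(pitch - base for pitch in chord)) for chord in progression]


def same_class(a: Sequence[Sequence[int]], b: Sequence[Sequence[int]]) -> bool:
    # Two progressions are equivalent if they differ only by a constant shift.
    return normalize(a) == normalize(b)


c_major_to_g_major = [(60, 64, 67), (55, 59, 62)]
d_major_to_a_major = [(62, 66, 69), (57, 61, 64)]
print(normalize(c_major_to_g_major))
print(same_class(c_major_to_g_major, d_major_to_a_major))
```

Both example progressions move from a major triad to the major triad a fourth below, so they fall into the same class even though they start on different pitches.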

Version 20110830

total counts, 1-grams, 2-grams, 3-grams, 4-grams

Petrucci Music Library - Exact Chord Progressions

Counts of progressions contained in scores for which no year of composition/first publication is known are stored under the "?" year.

Version 20110907

total counts, 1-grams, 2-grams, 3-grams, 4-grams

Petrucci Music Library - Composer Specific Ngrams

This dataset contains the ngram/year counts for each composer in the database. This makes detailed analysis of composer styles and connections between them possible.

Version 20120614

pure rhythmic 1-5grams, no pitch information (322 MB), 6-10grams (1.34 GB)
transposed chord 1-5grams, no rhythm information (2.9 GB)
pitched chord 1-4grams with rhythmic information (2.8 GB), 5grams (2.0 GB), 6grams (2.4 GB), 7grams (2.7 GB)