Music Ngram Viewer
Here are the datasets backing the Music Ngram Viewer. These datasets were generated in April and September 2011. I will update them as the score recognition continues, and the updated versions will have distinct and persistent version identifiers (20110401 for the current set). You can also access data not yet published here via the API.
Each of the numbered links below directly downloads a fragment of the given corpus. In addition, for each corpus I provide the "total counts" file, which records the total number of 1-grams contained in the scores that make up the corpus. This file is useful for computing the relative frequencies of n-grams.
Details on the corpus construction are abbreviated here. Of note, I report only the n-grams that appeared more than 3 times in any particular year. Therefore, the sum of the 1-gram occurrences in any given corpus is smaller than the number given in the total counts file.
File format: Each of the numbered files below is gzipped tab-separated data. Each line has the following format:
ngram TAB year TAB match_count NEWLINE
As an example, here are the 7,000,000th and 7,000,001st lines from a file of the IMSLP interval 5-grams (imslp-interval-5gram-20110401.csv.gz):
3 -2 4 -5 3	1804	94
3 -2 4 -5 3	1805	21
The first line tells us that in 1804, the interval sequence "3 -2 4 -5 3" occurred 94 times overall.
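The line format above can be read with a few lines of Python. This is a minimal sketch, not a tool shipped with the datasets: the names NgramRecord, parse_ngram_lines and read_ngram_file are hypothetical, and it assumes that every line has exactly three tab-separated fields with a numeric year.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class NgramRecord(NamedTuple):
    ngram: str        # e.g. "3 -2 4 -5 3"
    year: int
    match_count: int


def parse_ngram_lines(lines: Iterable[str]) -> Iterator[NgramRecord]:
    """Yield one record per 'ngram TAB year TAB match_count' line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        ngram, year, match_count = line.split("\t")
        yield NgramRecord(ngram, int(year), int(match_count))


def read_ngram_file(path: str) -> Iterator[NgramRecord]:
    """Stream records from a gzipped fragment without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_ngram_lines(handle)


# The two example lines quoted above, with explicit tab characters.
sample = ["3 -2 4 -5 3\t1804\t94\n", "3 -2 4 -5 3\t1805\t21\n"]
records = list(parse_ngram_lines(sample))
```

For a downloaded fragment, read_ngram_file("imslp-interval-5gram-20110401.csv.gz") would yield the same kind of records one at a time, so the file never has to fit in memory.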
The format of the total counts file is identical, except that the ngram field is absent: there is only one match_count value per year.
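To turn a raw count into a relative frequency, divide it by the total for the same year. The sketch below assumes that each line of the total counts file reads "year TAB match_count"; the function names are hypothetical, and the totals used in the example are made-up values for illustration only, not figures from the actual file.

```python
def parse_total_counts(lines):
    """Map year -> total number of 1-grams, from 'year TAB match_count' lines."""
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        year, match_count = line.split("\t")
        totals[int(year)] = int(match_count)
    return totals


def relative_frequency(match_count, year, totals):
    """Share of all 1-grams of the given year that this n-gram accounts for."""
    return match_count / totals[year]


# Made-up totals, for illustration only.
totals = parse_total_counts(["1804\t1000000\n", "1805\t500000\n"])
freq_1804 = relative_frequency(94, 1804, totals)
freq_1805 = relative_frequency(21, 1805, totals)
```

Because n-grams occurring 3 times or fewer in a year are not reported, the totals should be taken from the total counts file rather than recomputed by summing the 1-gram files.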
Inside each file the ngrams are sorted alphabetically and then chronologically.
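Since all lines of one ngram are adjacent and ordered by year, a per-ngram time series can be assembled in a single streaming pass. The following is a sketch under that assumption; series_by_ngram is a hypothetical helper, and the third row of the example is invented to show a group boundary.

```python
from itertools import groupby
from operator import itemgetter


def series_by_ngram(records):
    """Group (ngram, year, match_count) tuples that are already sorted by ngram.

    Yields (ngram, [(year, match_count), ...]) without loading the whole file.
    """
    for ngram, group in groupby(records, key=itemgetter(0)):
        yield ngram, [(year, count) for _, year, count in group]


records = [
    ("3 -2 4 -5 3", 1804, 94),
    ("3 -2 4 -5 3", 1805, 21),
    ("3 -2 4 -5 4", 1804, 7),  # invented row, for illustration only
]
series = list(series_by_ngram(records))
```

groupby only merges adjacent rows, so this relies on the sort order described above; on unsorted input the same ngram would show up in several groups.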
Usage: This compilation is licensed under a Creative Commons Attribution 3.0 Unported License.
total counts, 1-grams, 2-grams, 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, 10-grams, 11-grams, 12-grams, 13-grams, 14-grams, 15-grams
This dataset contains chord progressions of up to four chords in length and their counts. The chords represent all simultaneously active notes over all voices of a score. This means that the notes need not have the same onset time in order to appear in the same chord.
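The idea of a chord as "all simultaneously active notes" can be illustrated as follows. This is only a sketch and not the extraction code used for the dataset: it assumes that notes are given as (onset, offset, pitch) triples with MIDI pitch numbers and an exclusive offset, and the function name is hypothetical.

```python
def active_note_chords(notes):
    """Return [(time, pitches)] for every point where the sounding set changes.

    notes: iterable of (onset, offset, pitch); a note sounds for onset <= t < offset.
    """
    notes = list(notes)
    # The sounding set can only change where some note starts or ends.
    times = sorted({t for onset, offset, _ in notes for t in (onset, offset)})
    chords = []
    previous = None
    for t in times:
        sounding = tuple(sorted({p for onset, offset, p in notes if onset <= t < offset}))
        if sounding != previous:
            if sounding:  # silence is tracked but not emitted as a chord
                chords.append((t, sounding))
            previous = sounding
    return chords


# C4 is held from 0 to 4, E4 enters at 2, G4 follows alone at 4.
notes = [(0, 4, 60), (2, 4, 64), (4, 6, 67)]
chords = active_note_chords(notes)
```

Here C4 and E4 end up in the same chord at time 2 although their onsets differ, which is the behaviour described above.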
Counts of progressions contained in scores for which no year of composition/first publication is known are stored under the "?" year.
The entries represent equivalence classes of chord sequences that are equivalent up to a pitch shift. If the first chord of a sequence consists of multiple notes, the pitch of the lowest note is not stored and the chord starts with an underscore sign "_". The following number indicates the difference in semitones between the lowest and the second lowest notes. If the first chord consists of a single note, then the ngram begins with a number indicating the difference in semitones between that single note and the lowest note of the second chord.
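Equivalence up to a pitch shift means that a progression and any transposition of it fall into the same entry. The sketch below shows only this invariance and does not reproduce the string encoding of the files: it assumes chords given as tuples of MIDI pitch numbers and expresses every pitch relative to the lowest note of the first chord. The function name is hypothetical.

```python
def transposition_key(chords):
    """Return a representation that is identical for all transpositions.

    chords: non-empty sequence of non-empty pitch tuples (MIDI numbers assumed).
    """
    base = min(chords[0])  # lowest note of the first chord becomes 0
    return tuple(tuple(sorted(p - base for p in chord)) for chord in chords)


# C major followed by G major, and the same progression two semitones higher.
in_c = [(60, 64, 67), (55, 59, 62)]
in_d = [(62, 66, 69), (57, 61, 64)]
key_c = transposition_key(in_c)
key_d = transposition_key(in_d)
```

Both progressions map to the same key, so their counts would be pooled. The files write the omitted lowest pitch of the first chord as "_" rather than as an explicit 0.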
This dataset contains chord progressions of up to four chords in length and their counts. The chords represent all simultaneously active notes over all voices of a score. This means that the notes need not have the same onset time in order to appear in the same chord.
Counts of progressions contained in scores for which no year of composition/first publication is known are stored under the "?" year.
This dataset contains the ngram/year counts for each composer in the database. This makes detailed analysis of composer styles and connections between them possible.