Project: AI Music Generator

To create an automatic AI music generator, we first need to understand a little music theory.

In this project, we will build a model that generates piano music from the datasets that we feed in.

Let's check some theory on Piano:
  • An octave spans 12 notes in total (the 7 natural notes plus the 5 accidental notes).
  • The natural notes are: C, D, E, F, G, A, B
  • The accidental notes are: C#, D#, F#, G#, A#
  • One such set of notes is called an octave, and a piano has n octaves depending upon its size (number of keys).
  • The left end of the piano produces low-frequency (low-pitched) notes, whereas the right end produces high-frequency (high-pitched) notes.
  • A chord is a group of notes played simultaneously.


Note: A piece of music is a sequence of notes and chords played one after another. How melodious it sounds depends largely on the order and combination of the notes and chords being played.

For this project, we'll be using the Music21 library, which makes our life simpler :D
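Just to get a feel for the library before we dive in, here is a tiny sketch (assuming music21 is installed, e.g. via pip install music21) that builds a single note and a chord:

from music21 import note, chord

n = note.Note("C#4")                 # a single note: pitch C#, octave 4
c = chord.Chord(["C4", "E4", "G4"])  # a chord: several notes sounded together
print(n.pitch, n.octave)             # C#4 4
print([str(p) for p in c.pitches])   # ['C4', 'E4', 'G4']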



Data


The dataset consists of music files in MIDI format. MIDI stands for Musical Instrument Digital Interface.
Download the dataset from below:

Imports

from music21 import converter, instrument, note, chord, stream
import glob
import pickle
import numpy as np
from keras.utils import np_utils

TASK


So, the idea is that we'll read all the MIDI files and extract their components (notes and chords) so that we can use them later for preprocessing and training the model.

1. Reading a midi file:

midi = converter.parse("midi_songs/EyesOnMePiano.mid")

music21.converter contains tools for loading music from various file formats, whether from disk, from the web, or from text, into music21.stream.Score objects (or other similar stream objects).
The most powerful and easy-to-use tool is the parse() function. Simply provide a filename, URL, or text string and, if the format is supported, a Score will be returned.

2. Playing the parsed file:
midi.show('midi')

This method loads the music and provides functionality to play and stop the music.


3. Displaying the song in the text format:

midi.show('text')

Viewing the song in text format, if we look closely we will notice that the structure is like this:

  • Container
    • Sub-Container
      • notes
      • notes
      • chords
    • Sub-Container
      • notes
      • chords
    • Sub-Container
      • chords
      • notes
      • notes

This means the main container contains multiple sub-containers, within which the notes and chords are present separately. (Note: here the container is not a list but a Score object.)
We can interpret this as lists-of-lists containing the notes and the chords.
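To make this structure concrete, here is a small sketch (reusing the EyesOnMePiano.mid file parsed above) that iterates over the sub-containers (Parts) and counts the notes and chords inside each:

from music21 import converter, note, chord

midi = converter.parse("midi_songs/EyesOnMePiano.mid")  # same file as above

# Each Part is one of the sub-containers inside the Score
for part in midi.parts:
    n_notes = len(list(part.recurse().getElementsByClass(note.Note)))
    n_chords = len(list(part.recurse().getElementsByClass(chord.Chord)))
    print(part, "->", n_notes, "notes,", n_chords, "chords")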

4. Flattening the object and checking the length:

elements_to_parse = midi.flat.notes
len(elements_to_parse)

So, we'll flatten the stream so that all the elements are present in a single sequence, since what ultimately matters to us are the notes and the chords.

5. Checking the timing of the notes and the chords played:

for e in elements_to_parse:
    print(e, e.offset)  

We can call the offset property on the elements to check the time at which they are played.
Some examples of the output are:
<music21.note.Note A> 0.0
<music21.note.Note A> 0.0
<music21.note.Note A> 0.0
<music21.note.Note A> 0.25
<music21.note.Note G> 2/3
<music21.note.Note G> 2/3
<music21.note.Note F#> 1.0
<music21.note.Note F#> 1.25
<music21.note.Note D> 1.5
<music21.note.Note D> 1.5
<music21.note.Note C#> 1.75
<music21.note.Note C#> 1.75


By now, it is clear that the iterator elements_to_parse contains only the notes and the chords. Now it's up to us to decide what to do with these elements.


6. Storing the Pitch:

Okay, so let me simplify: a note is a single tone, and if we capture its pitch we can use it later to form new music. Similarly, a chord has multiple notes, so we can extract the notes in the chord and store their pitches.
notes_demo = []
for ele in elements_to_parse:
    # If the element is a Note, then store its pitch
    if isinstance(ele, note.Note):
        notes_demo.append(str(ele.pitch))
    # If the element is a Chord, split each note of the chord and join them with '+'
    elif isinstance(ele, chord.Chord):
        notes_demo.append("+".join(str(n) for n in ele.normalOrder))

In the code above:

  • Calling pitch on a Note element returns a Pitch object for that note.
  • We convert it to a string so that mapping into integers later is straightforward when feeding the data to the LSTM.
  • In the case of chords, multiple notes (and sometimes a single note) make up the chord, so we concatenate the values using '+'.
  • normalOrder returns the list of pitch-class integers that corresponds to the chord. For example, for a chord built from C# and F#, normalOrder returns [1, 6], which we store as '1+6' (see the sketch below).
How to know if an element is a note or a chord?

  • Here we can take help from the note and chord modules of the library.
  • We check which class each element is an instance of and then handle it as specified above. A minimal sketch of both ideas follows below.
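Here is a minimal, self-contained sketch of both ideas (the pitches are chosen purely for illustration), showing how isinstance distinguishes the two classes and what normalOrder returns:

from music21 import note, chord

elements = [note.Note("A4"), chord.Chord(["C#5", "F#5"])]

for ele in elements:
    if isinstance(ele, note.Note):
        print("Note :", str(ele.pitch))                             # Note : A4
    elif isinstance(ele, chord.Chord):
        print("Chord:", "+".join(str(n) for n in ele.normalOrder))  # Chord: 1+6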
7. Preprocessing All Files:

notes = []
for file in glob.glob("midi_songs/*.mid"):
    midi = converter.parse(file)  # Convert file into a stream.Score object
    print("parsing %s" % file)
    elements_to_parse = midi.flat.notes
    for ele in elements_to_parse:
        # If the element is a Note, then store its pitch
        if isinstance(ele, note.Note):
            notes.append(str(ele.pitch))
        # If the element is a Chord, split each note of the chord and join them with '+'
        elif isinstance(ele, chord.Chord):
            notes.append("+".join(str(n) for n in ele.normalOrder))


Now, we need to preprocess all the music files and extract their notes and chords. We use glob to iterate over the files and print the name of each file as it is parsed.


At this point our list contains 60498 elements, which are either notes or chords. Not all of them are unique, so we can count the unique elements by typecasting the list into a set, and also inspect a few of the elements:

n_vocab = len(set(notes))
print("Total notes- ", len(notes))
print("Unique notes- ", n_vocab)
#output: Total notes- 60498
# Unique notes- 359
print(notes[100:200])
Output:
['1+5+9', 'G#2', '1+5+9', '1+5+9', 'F3', 'F2', 'F2', 'F2', 'F2',
'F2', '4+9', 'E5', '4+9', 'C5', '4+9', 'A5', '4+9', '5+9', 'F5',
'5+9', 'C5', '5+9', 'A5', '5+9', '4+9', 'E5', '4+9', 'C5', '4+9',
'A5', '4+9', 'F5', '5+9', 'C5', '5+9', 'E5', '5+9', 'D5', '5+9',
'E5', '4+9', 'E-5', '4+9', 'B5', '4+9', '4+9', 'A5', '5+9', '5+9',
'5+9', '5+9', '4+9', '4+9', '4+9', '4+9', '5+9', '5+9', '5+9',
'5+9', 'B4', '4+9', 'A4', '4+9', 'E5', '4+9', '4+9', 'E-5', '5+9',
'5+9', '5+9', '5+9', '4+9', '4+9', '4+9', '4+9', '5+9', '5+9', '5+9',
'5+9', 'E5', '4', 'E-5', 'C6', 'E5', '5', 'E-5', 'B5', 'E5', '6',
'E-5', 'C6', 'A5', '5', 'A4', '4', 'C5', 'E5', 'F5', 'E5', '5']



Prepare Sequential Data for LSTM


Steps to consider in this stage:

  • Let's take 100 consecutive elements as input and predict the next element as output.
  • So we'll set sequence_length to 100.
  • As we know, our LSTM is not going to work on string inputs.
  • We need to map these strings to integers.
  • So we'll create a dictionary and map each unique element to an integer value.


# How many elements the LSTM input should consider
sequence_length = 100
# All unique classes
pitchnames = sorted(set(notes))
# Mapping from element to integer value
ele_to_int = dict( (ele, num) for num, ele in enumerate(pitchnames) )
network_input = []
network_output = []
for i in range(len(notes) - sequence_length):
    seq_in = notes[i : i+sequence_length]  # contains 100 values
    seq_out = notes[i + sequence_length]
    network_input.append([ele_to_int[ch] for ch in seq_in])
    network_output.append(ele_to_int[seq_out])

Now,
  • since our current inputs are integer indices ranging from 0 to n_vocab - 1 (here, 0 to 358),
  • we'll normalize the input values so that they range between 0 and 1.
Also, we have to define the shape explicitly as desired by the LSTM: (samples, time steps, features).
# No. of examples
n_patterns = len(network_input)
print(n_patterns)
# Desired shape for LSTM
network_input = np.reshape(network_input, (n_patterns, sequence_length, 1))
print(network_input.shape)
normalised_network_input = network_input/float(n_vocab)
# Network outputs are the classes; encode them into one-hot vectors
network_output = np_utils.to_categorical(network_output)
network_output.shape #Output: (60398, 359)
print(normalised_network_input.shape) #Output: (60398, 100, 1)
print(network_output.shape) #Output: (60398, 359)



Creating Model


By now, we have created supervised training data; now we need to create and train a model.

The first layer of the model is an LSTM with 512 units and an input shape of (100, 1). Since this is not the last LSTM layer, we return the full sequence (return_sequences=True).

To reduce overfitting and improve the performance of the model, we add a dropout of 0.3 after every layer.

Further, we are going to add two more LSTM layers.

At the end of the LSTMs, we add Dense layers; the final Dense layer has n_vocab units, one for each unique note/chord class.

For the final layer, we use softmax as the activation function so that the model outputs a probability distribution over those classes.





from keras.models import Sequential, load_model
from keras.layers import *
from keras.callbacks import ModelCheckpoint, EarlyStopping
model = Sequential()
model.add( LSTM(units=512,
                input_shape = (normalised_network_input.shape[1], normalised_network_input.shape[2]),
                return_sequences = True) )
model.add( Dropout(0.3) )
model.add( LSTM(512, return_sequences=True) )
model.add( Dropout(0.3) )
model.add( LSTM(512) )
model.add( Dense(256) )
model.add( Dropout(0.3) )
model.add( Dense(n_vocab, activation="softmax") )
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
checkpoint = ModelCheckpoint("model.hdf5", monitor='loss', verbose=0, save_best_only=True, mode='min')
model_his = model.fit(normalised_network_input, network_output, epochs=100, batch_size=64, callbacks=[checkpoint])
The training of the model on Colab took nearly 10 hours.
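Since training takes that long, it is convenient to reload the best checkpoint saved by ModelCheckpoint instead of retraining every time. A minimal sketch (assuming the model.hdf5 path used above):

from keras.models import load_model

# Reload the model saved by the ModelCheckpoint callback above,
# so we can generate music without retraining from scratch.
model = load_model("model.hdf5")
model.summary()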

Predictions


To generate a music sequence, we need to give the model some input. The network input right now is a NumPy array, so we'll build it again as a plain list of lists, where each data point consists of 100 elements.

Taking the prediction:
We take any data point as the starting pattern, predict a new element, and append it to the pattern; then we discard the first element and use the remaining elements to predict the next one. This is done repeatedly.

To implement this approach, we can start with any random data point. For the output, we also need to create a dictionary mapping all the integers back to their respective string values so that the corresponding notes and chords can be played.

We also decide how many elements we want to generate. (Note: these elements will combine to form the music.)


sequence_length = 100
network_input = []
for i in range(len(notes) - sequence_length):
    seq_in = notes[i : i+sequence_length]  # contains 100 values
    network_input.append([ele_to_int[ch] for ch in seq_in])

# Any random start index
start = np.random.randint(len(network_input) - 1)

# Mapping int_to_ele
int_to_ele = dict((num, ele) for num, ele in enumerate(pitchnames))

# Initial pattern
pattern = network_input[start]
prediction_output = []

# generate 200 elements
for note_index in range(200):
    prediction_input = np.reshape(pattern, (1, len(pattern), 1))  # convert into the shape the model expects
    prediction_input = prediction_input / float(n_vocab)          # normalise
    prediction = model.predict(prediction_input, verbose=0)
    idx = np.argmax(prediction)
    result = int_to_ele[idx]
    prediction_output.append(result)
    # Remove the first value, and append the most recent value..
    # This way the input moves forward step-by-step with time..
    pattern.append(idx)
    pattern = pattern[1:]



Creating Midi File


Now, we have to create a MIDI file from the predictions that we got, so that they can be played.

We will start with a time value (offset) of 0 and gradually increase it as the notes and chords are appended.

Here, again, we have to check each pattern, whether it is a chord or a note, and process it accordingly.

If the element is a note, we create a Note object from it using the note module; if the element is a chord, we create a Chord object using the chord module, so that it is converted into a playable element.

How to check if the element is a chord or not?

If the element has a '+' in it (multiple notes), it must be a chord; if the element is just a digit (a single normal-order value), we also treat it as a chord.

Since this whole dataset is piano music, we set the storedInstrument of each note to instrument.Piano(), defining the type of instrument.

In the end, we increment the offset so that the time stamp increases linearly as the notes/chords are appended.

offset = 0  # Time
output_notes = []
for pattern in prediction_output:
    # if the pattern is a chord
    if ('+' in pattern) or pattern.isdigit():
        notes_in_chord = pattern.split('+')
        temp_notes = []
        for current_note in notes_in_chord:
            new_note = note.Note(int(current_note))  # create a Note object for each note in the chord
            new_note.storedInstrument = instrument.Piano()
            temp_notes.append(new_note)
        new_chord = chord.Chord(temp_notes)  # create the Chord from the list of notes
        new_chord.offset = offset
        output_notes.append(new_chord)
    # if the pattern is a note
    else:
        new_note = note.Note(pattern)
        new_note.offset = offset
        new_note.storedInstrument = instrument.Piano()
        output_notes.append(new_note)
    offset += 0.5
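The loop above only builds the list of Note/Chord objects. To actually get a playable file, we still have to wrap them in a music21 stream and write it out; a minimal sketch (the output filename is just an example):

# Wrap the generated notes/chords in a Stream and write it as a MIDI file
midi_stream = stream.Stream(output_notes)
midi_stream.write('midi', fp='generated_music.mid')  # example filename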


So, this is how we can train a model to generate music. That said, there are a few drawbacks in this model.

Drawbacks

  • We have taken many samples and generated a new one, but the model does not know how a song should start or end.
  • The offset is fixed; it is not variable.

Obviously, these issues can be fixed and the model can be designed more accurately:
  • Add more instruments; for this, we can create a new container (Part) per instrument and then proceed (see the sketch after this list).
  • Train for more epochs.
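As a rough sketch of the "more instruments" idea (the instrument choices and filename are just examples), each instrument can get its own Part container before everything is combined into a Score:

from music21 import stream, instrument, note

# One Part (container) per instrument
piano_part = stream.Part()
piano_part.insert(0, instrument.Piano())
piano_part.append(note.Note("C4"))

violin_part = stream.Part()
violin_part.insert(0, instrument.Violin())
violin_part.append(note.Note("E5"))

# Combine the parts into a single Score and write it out
score = stream.Score([piano_part, violin_part])
score.write('midi', fp='multi_instrument_demo.mid')  # example filename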

