Inspired by this amazing paper, which used audio signal processing to analyse 30-second clips from around 17K pop songs to understand the evolution of pop music over the last 50 years, I thought it would be interesting to see if something similar could be done with pop lyrics.
The Evolution of Pop paper explains how Latent Dirichlet Allocation (LDA) was used to describe musical topics in each song. These topics were based on the chord progressions, timbre, and harmonics in each song, as derived using audio signal processing techniques. The songs were then clustered together, according to these topics, into 13 major genres of pop music. LDA is traditionally used in text analysis, so I was interested to see whether I could classify the same songs into the 13 major genres according to the lyrics in the songs.
First I needed to get the lyrics for the songs. The authors of the paper kindly published their data here; it contains all the song titles and artists. Using this data and the ChartLyrics API, I wrote a Python script to scrape the lyrics for each song. Unfortunately I wasn’t able to find every song, but I found around 80%. This was also an automated process and I haven’t thoroughly checked the output, so I would imagine there are a few incorrect songs in there. The code I used and the data file it produced can be found on my github. I learnt a couple of things in this process: the API had a 20 second rate limit, so I used the time.sleep() function to insert delays into the script. This meant it took days to run and my mac kept going to sleep mid-process. To stop this, I discovered the caffeinate app, which stops your mac from dozing off.
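The scraping loop can be sketched roughly like this. Here fetch_lyrics is a placeholder standing in for the actual ChartLyrics API call, and the 20-second delay is the rate limit mentioned above; the real script (on my github) handles the API details.

```python
import time

API_DELAY_SECONDS = 20  # the API allowed roughly one hit every 20 seconds

def scrape_lyrics(songs, fetch_lyrics, delay=API_DELAY_SECONDS):
    """Fetch lyrics for (artist, title) pairs, pausing between API hits.

    fetch_lyrics is a stand-in for the real ChartLyrics call; any failure
    (song not found, network error) is recorded as None.
    """
    results = {}
    for artist, title in songs:
        try:
            results[(artist, title)] = fetch_lyrics(artist, title)
        except Exception:
            results[(artist, title)] = None  # ~20% of songs weren't found
        time.sleep(delay)  # respect the API's rate limit
    return results
```

With days of sleeps in the loop, this is where caffeinate earns its keep.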
After scraping the lyrics, I then built a Python script to process them into a useful format. Again this can be found on my github. This involved clearing out punctuation, rogue characters etc., then splitting the lyrics into words and removing the stop words. Stop words are common words like at, the, if etc. that are not relevant to the topics of the song. The Python stop-words package provides a good list. I also experimented with stemming, which takes a word back to its root (stemming becomes stem, etc.). In the end I preferred the results without it, but the stem package was very easy to use.
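The cleaning step can be sketched like this. The stop word list here is a tiny illustrative subset; the actual script used the Python stop-words package for a proper list.

```python
import re

# Tiny illustrative subset; the real script used the stop-words package.
STOP_WORDS = {"at", "the", "if", "a", "an", "and", "to", "of", "in", "is", "it"}

def clean_lyrics(raw):
    """Strip punctuation and rogue characters, lowercase, split into words,
    and drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", raw.lower())  # keep letters and spaces only
    return [w for w in text.split() if w not in STOP_WORDS]
```

Stemming, had I kept it, would slot in as one extra mapping over the returned words.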
Next I needed to get my lyrics into a structure ready for LDA modelling. LDA treats each word individually in a “bag of words” style approach. This makes processing the data much easier as you don’t have to consider the sentence structure and other words surrounding each word. For each document (song) I needed to count the frequency of every unique word in my entire dataset (the corpus); the set of unique words is known as the vocabulary. Initially I turned my corpus into a pandas data frame and attempted to do this using the stack() and get_dummies() functions. However, trying to do all of this in one hit required too much RAM. Instead, I created my vocabulary for the corpus, then went through each document counting the frequency of each vocabulary word in that document. This is where Python is really powerful, and it was satisfyingly quick to process. Again the code is available on my github.
Now I needed to build the LDA model. I used the lda package. The biggest decisions I needed to make were how many iterations to run and how many topics to find. I ran the process with a variety of iteration counts and found that over 1500 gave pretty consistent results. To decide how many topics, I took a subjective, brute-force approach: I ran the model for 1 to 40 topics and settled on 25, as the new topics formed after that were just subtle variations on the topics already there. It was during this process I also discovered that the real art of LDA is in the stop words. I found myself adding many more to the list, particularly lyric-based ‘noises’ rather than words: things like oohhh, ahhh, doo deee, dah etc. These weren’t adding any value to my topics.
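The fitting step looks roughly like this. I've sketched it with scikit-learn's LatentDirichletAllocation as a stand-in for the lda package the post actually uses (the interfaces are similar: both take a document-term count matrix).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_topics(doc_term_matrix, n_topics=25, n_iter=1500):
    """Fit an LDA topic model on a songs x vocabulary count matrix.

    Defaults mirror the choices in the post: 25 topics, 1500 iterations.
    """
    model = LatentDirichletAllocation(
        n_components=n_topics, max_iter=n_iter, random_state=0
    )
    doc_topics = model.fit_transform(doc_term_matrix)  # songs x topics mixture
    return model, doc_topics
```

Each row of doc_topics is a song's distribution over the 25 topics, and model.components_ holds each topic's weights over the vocabulary, which is what feeds the word clouds later.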
Once I had my topics I needed to try to classify my songs. I used the proportion of lyrics in each topic, along with two other metrics (song vocabulary and number of words), as my predictors, and the genre classifications from the paper as my target variable. There were a number of classification techniques I could have used (Random Forests, Logistic Regression, Neural Nets etc.), but I liked the idea of using Linear Discriminant Analysis, which also abbreviates to LDA, for the symmetry. As always I took a machine learning approach, with a train and test set to find out how accurate the model is.
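The classification step can be sketched like this; the feature matrix here would be the topic proportions plus the two extra metrics, and the labels the paper's genres (the split size and helper name are my own choices).

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

def genre_accuracy(features, genres, test_size=0.25, seed=0):
    """Train LDA (the discriminant kind) on a train split and report
    accuracy on the held-out test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, genres, test_size=test_size, random_state=seed
    )
    clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
    return clf.score(X_test, y_test)  # fraction of test songs classed correctly
```

Swapping in a Random Forest or Logistic Regression is a one-line change to the classifier, which is handy given the results below.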
I used a couple of d3.js examples to help visualise the results. I combined this word cloud example with a drop-down selection list to show the important words in each topic. I used this stream graph implementation to show the topics over time, and this heat chart to produce a cross tab. The code to construct the JSON files can be found on my github.
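Building the word-cloud JSON amounts to ranking each topic's words by weight and serialising the top few. A sketch, with field names that are my own assumption rather than the post's actual schema:

```python
import json

def topic_wordcloud_json(topic_word_weights, vocab, top_n=30):
    """For each topic, pair the top-weighted vocabulary words with their
    weights, in a shape a d3 word cloud can consume."""
    clouds = {}
    for t, weights in enumerate(topic_word_weights):
        ranked = sorted(zip(vocab, weights), key=lambda p: -p[1])[:top_n]
        clouds["topic_{}".format(t)] = [
            {"text": word, "size": weight} for word, weight in ranked
        ]
    return json.dumps(clouds)
```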
The word cloud below shows the most popular lyrics outside of my stop words. As you would expect, love comes out as a strong theme.
The word cloud can switch to show the top words in each of the 25 topics from the LDA. The word cloud also shows the songs most associated with each topic. Some nice themes emerge from the topics.
There are a few love-related topics: topic 4 is classic baby love, topic 17 focuses on true love, and topic 21 is more about being crazy in love with your boy or girl. Topic 13 covers breaking up and how much it hurts. Topic 24 looks like a useful feature for identifying gangsta’ rap. Topic 6 covers Jesus, the lord and angels. Topic 14 is about war and fighting. And so on…you can draw your own interpretations.
Assigning a main topic to each song, you can see how this changes over time.
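The "main topic" assignment is simply the argmax of each song's topic mixture, sketched as:

```python
import numpy as np

def main_topics(doc_topic_matrix):
    """Assign each song its single highest-proportion topic."""
    return np.argmax(doc_topic_matrix, axis=1)
```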
It shows topic 24 (gangsta rap) becoming more popular from 1990 onwards, and topic 17 (true love) being squeezed out.
Sadly the results for predicting the musical genre purely from the lyrics aren’t good: the model was only correct for 16% of songs. However, some genres of music were much easier to predict than others. The genre labelled in the paper ‘hiphop/rap/gansta rap/old school’ had 60% prediction accuracy. The model had a tendency to predict most songs as ‘rock/classic rock/country/singer songwriter’, which itself had 40% accuracy.
The heat chart below shows a cross tab of the main lyric topic (LT) against the musical genre (MG). It’s a pretty even spread, so it wouldn’t matter which classification model was used: these features aren’t going to predict genres.
I may return to this dataset and try some different language processing techniques to try and create some different features and hopefully improve on the model accuracy…