Up to now I’ve mostly analysed metadata about music, and when I have looked at the track content I’ve focused on the lyrics. Now I want to look at analysing the sound itself. In this post I will demonstrate how to extract some useful information from an audio file using Python.
It’s been a while since I did something with Spark. It’s one of Apache’s most contributed-to open source projects, so I was keen to see what had been developed since I last had a play. One of the big changes has been the development of a side project, Zeppelin. Zeppelin is an interactive development environment (IDE) for Spark, much like Hue is to Hive. My aim here was to have a play with Zeppelin and see if I could use it to develop a machine learning process. I needed some data, and the obvious dataset would be something to do with Led Zeppelin. So I used the Spotify API to download Echo Nest audio features for the songs on all of Led Zeppelin’s studio albums. My plan was to do some unsupervised clustering to group songs with similar audio features together.
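To give a flavour of what grouping songs by audio features means, here is a minimal sketch of k-means clustering in plain Python. It is not the Spark code from the post: the `kmeans` function, the track names and the two feature values per track are all made up for illustration, and the real features would come from the Spotify API.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns a cluster index for each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

# Made-up (energy, acousticness) values, purely illustrative, not real Spotify data.
songs = {
    "loud_track_a": (0.90, 0.05),
    "loud_track_b": (0.85, 0.10),
    "quiet_track_a": (0.20, 0.90),
    "quiet_track_b": (0.25, 0.80),
}

labels = kmeans(list(songs.values()), k=2)
clusters = dict(zip(songs, labels))
print(clusters)
```

With features this well separated, the two loud tracks should land in one cluster and the two quiet tracks in the other. Which cluster gets index 0 or 1 depends on the random starting points.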
In the past I’ve built apps with R Shiny, and I’ve also developed a few data visualisations with d3.js. Given that R Shiny is an R-based back-end server that renders a front end in JavaScript, it seemed like it would be possible to integrate a d3.js visualisation into an R Shiny app. After some quick research, it turns out that it is possible. This blog explains how to do it, and here is an example (please note this is hosted on Shiny.io and sometimes runs out of free hours each month).
I was inspired by this amazing paper, which used audio signal processing to analyse 30-second clips from around 17K pop songs to understand the evolution of pop music over the last 50 years. I thought it would be interesting to see if something similar could be done with pop lyrics.
The Evolution of Pop paper explains how Latent Dirichlet Allocation (LDA) was used to describe musical topics in each song. These topics were based on the chord progressions, timbre, and harmonics in the song, as derived using audio signal processing techniques. The songs were then clustered together, according to these topics, to give 13 major genres of pop music. LDA is traditionally used in text analysis. I was interested to see if I could classify the same songs into the 13 major genres according to the lyrics in the songs.
Having taught myself a bit of Python, I was keen to start using Spark. Spark is an open source project from Apache that builds on the ideas of MapReduce. It still uses HDFS, but where Hadoop processes on disk, Spark runs things in memory, which can dramatically increase processing speeds. Spark also has a great machine learning library you can use with it. Where Hive on Hadoop limits you to data processing and manipulation, Spark allows you to build models at scale. Spark can be called with its native Scala, or there are APIs so you can use Python or Java (release 1.4 also includes an R API).
I decided to take the lastfm dataset I have been playing with recently and generate some recommendations for the users based on their listening habits.
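As a rough illustration of recommending from listening habits, here is a small user-based collaborative filtering sketch in plain Python. It is an assumption about one way to do it, not necessarily the method used in the post. The `plays` data, user names and artist names are invented, and the `cosine` and `recommend` helpers are hypothetical.

```python
from math import sqrt

# Made-up play counts per user, purely illustrative (not the real lastfm extract).
plays = {
    "user_a": {"artist_1": 10, "artist_2": 5},
    "user_b": {"artist_1": 8, "artist_2": 4, "artist_3": 7},
    "user_c": {"artist_4": 9},
}

def cosine(u, v):
    """Cosine similarity between two sparse play-count dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[a] * v[a] for a in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def recommend(user, plays, top_n=3):
    """Rank artists the user has not played, weighted by listener similarity."""
    scores = {}
    for other, counts in plays.items():
        if other == user:
            continue
        sim = cosine(plays[user], counts)
        if sim == 0:
            continue  # no overlap in taste, so this listener tells us nothing
        for artist, n in counts.items():
            if artist not in plays[user]:
                scores[artist] = scores.get(artist, 0.0) + sim * n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

recommendations = recommend("user_a", plays)
print(recommendations)
```

Here `user_a` shares two artists with `user_b` and none with `user_c`, so the only candidate is the artist that `user_b` plays and `user_a` does not. On a dataset with hundreds of thousands of users, a pairwise loop like this would be far too slow, which is where a distributed library comes in.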
Continuing on my mission to get better at Python I’ve been learning about the Pandas and sklearn libraries. I was looking for a challenge to use these libraries on and I had recently come across a nice lastfm data extract. The data contains around 360K users and a little demographic data on them (gender, age, country), and a count of listens they have had for each artist. I decided to try and build a model that could predict someone’s gender based on what they have been listening to.
A while back I created an R package to pull data out of the Spotify API and turn it into a d3.js visualisation. Here is the blog post. I’ve started to teach myself Python and I’ve now rebuilt this process with it. The exciting part is that, as it’s in Python, I can use the Google App Engine to create an app that hosts the code online. That means anyone can generate a related artists visualisation. Hurray! Have a go yourself by following this link.
To find out more about how it’s done, read on…
Following on from this blog post on how to access the million song dataset, I decided to do some analysis of the data. I focussed in particular on the length of songs. I performed the analysis using R, Hive and Hadoop through AWS. This first gist uses R to construct some Hadoop code to import all the data into the HDFS. Once the data was in the HDFS I ran the following Hive code to mine the data. Finally, I used more R code to analyse and visualise the results. For this I mainly used the ggplot2 package, which is great for producing good-looking graphs.
The million song dataset was created a few years ago to help encourage research on algorithms for analysing music-related data. There was also a Kaggle competition and a hackathon using it a couple of years ago. It’s freely available through Amazon Web Services (AWS) as a public dataset and also in an S3 bucket. I use AWS at work but access it using a nice front end, and I had been wanting to learn what was going on behind the scenes for a while. Putting these two together, I gave myself the challenge of accessing the million song dataset using Hive and Hadoop through AWS.