Following on from this blog post on how to access the million song dataset I decided to do some analysis of the data. I focussed in particular on the length of songs. I performed the analysis using R and Hive and Hadoop through AWS. This first gist uses R to construct some hadoop code to import all the data into the HDFS. Once the data was in the HDFS I ran the following Hive code to mine the data. Then finally, more R code to analyse and visualise the results. For this I mainly used the ggplot2 package which is great for producing good looking graphs.
A basic aggregation of song length shows the most frequent song length is between 3.5 to 4 minutes, with the average duration being 249 secs (just over 4 minutes). The distribution has a long tail which I’ve grouped up here into songs over 7 minutes long. As a fan of longer songs I was interested to know more about them.
I ran an analysis of how song length changes over time by calculating the average song length per year, which is plotted below. The interesting feature from this graph is the jump in average song length that occurs in the late 60s. My guess is this is to do with the growing popularity of the 12″ vinyl format. Prior to 1970 that the data is quite noisy.
When you look into the noise it becomes clear this is due to the number of songs in the database for each of those years. The graph below shows the number of songs in the data increases over time, so the noise maybe down to having a smaller sample for those years.
As I was interested in songs over 7 minutes long I calculated the proportion of songs by song length each year. I then plotted this from 1970 onwards; as that’s where the data is more stable.
I was happy with how it looked but its hard to spot whats going on. So then I just plotted the proportion of songs over 7 minutes long.
The results show that the proportion of songs over 7 minutes long halves from a peak of 8% in 1970 to 4% by 1980. My guess is this is linked with the decline of prog rock and jazz fusion during that time. To my surprise though it picked back up again during the 90s reaching 7% by 2000. I wondered what was driving this. So I looked in the different genre tags available on each song.
The following shows the top 20 genres by average song length for genres tagged against at least 10000 tracks. Jazz fusion and post bop are both there, but there is no sign of progressive rock. It is in fact down in 115th. The top genres relate to Techno, Trance and House which are probably those driving the increase in song length in the late 90s into the 00s.