Following on from this blog post on how to access the Million Song Dataset, I decided to analyse the data, focussing in particular on the length of songs. I performed the analysis using R, Hive and Hadoop through AWS. This first gist uses R to construct the Hadoop commands that import all the data into HDFS. Once the data was in HDFS, I ran the following Hive code to mine it. Finally, I used more R code to analyse and visualise the results, mainly with the ggplot2 package, which is great for producing good-looking graphs.
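To give a feel for the first step, here is a minimal sketch of generating the import commands programmatically. It is written in Python rather than the R used in the post, and it is not the original gist. The bucket prefix, the HDFS directory and the A.tsv to Z.tsv file layout are all hypothetical placeholders; only `hadoop distcp` itself is a real Hadoop command.

```python
import string

# Hypothetical locations: replace with the real bucket and HDFS directory.
S3_PREFIX = "s3://example-bucket/million-song"  # assumed, not from the post
HDFS_DIR = "/user/hadoop/million_song"          # assumed, not from the post


def build_copy_commands(s3_prefix=S3_PREFIX, hdfs_dir=HDFS_DIR,
                        shards=string.ascii_uppercase):
    """Return one `hadoop distcp` command per shard file.

    Assumes (hypothetically) that the data is split into files named
    A.tsv ... Z.tsv under `s3_prefix`.
    """
    return [
        f"hadoop distcp {s3_prefix}/{shard}.tsv {hdfs_dir}/"
        for shard in shards
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed or saved as a shell script.
    for command in build_copy_commands():
        print(command)
```

Printing the commands instead of running them directly makes it easy to check them by eye before executing anything on the cluster.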
The Million Song Dataset was created a few years ago to encourage research on algorithms for analysing music-related data. There was also a Kaggle competition and a hackathon using it a couple of years ago. It’s freely available through Amazon Web Services (AWS) as a public dataset and also in an S3 bucket. I use AWS at work, but access it through a nice front end, and I had been wanting to learn what was going on behind the scenes for a while. Putting these two together, I set myself the challenge of accessing the Million Song Dataset using Hive and Hadoop through AWS.