Following on from this blog post on how to access the million song dataset I decided to do some analysis of the data. I focussed in particular on the length of songs. I performed the analysis using R and Hive and Hadoop through AWS. This first gist uses R to construct some hadoop code to import all the data into the HDFS. Once the data was in the HDFS I ran the following Hive code to mine the data. Then finally, more R code to analyse and visualise the results. For this I mainly used the ggplot2 package which is great for producing good looking graphs.