It’s been a while since I did anything with Spark. It’s one of Apache’s most contributed-to open source projects, so I was keen to see what had been developed since I last had a play. One of the big changes has been the development of a side project, Zeppelin. Zeppelin is an interactive, notebook-style development environment for Spark, much like Hue is for Hive. My aim here was to have a play with Zeppelin and see if I could use it to develop a machine learning process. I needed some data, and the obvious dataset would be something to do with Led Zeppelin. So I used the Spotify API to download Echo Nest audio features for the songs on all of Led Zeppelin’s studio albums. My plan was to do some unsupervised clustering to group songs with similar audio features together.
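To give a feel for what "grouping songs with similar audio features" means, here is a minimal sketch of k-means clustering in plain Python. It is an illustration only, not the Spark code from this project: in Zeppelin you would more likely reach for Spark MLlib's KMeans. The song names and the feature values (energy, acousticness) are made-up placeholders, not real Spotify data, and the choice of two clusters is an assumption.

```python
import math
import random

# Placeholder data: (energy, acousticness) per song. These names and
# numbers are invented for illustration, not real audio features.
songs = {
    "song_a": (0.90, 0.05),
    "song_b": (0.85, 0.10),
    "song_c": (0.80, 0.02),
    "song_d": (0.20, 0.85),
    "song_e": (0.25, 0.90),
    "song_f": (0.30, 0.80),
}


def kmeans(points, k, max_iter=50, seed=0):
    """Basic k-means (Lloyd's algorithm). Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels


names = list(songs)
points = [songs[name] for name in names]
centroids, labels = kmeans(points, k=2)

for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

With real data the features would first need scaling to a common range, since something like tempo is on a very different scale from energy, and the number of clusters would have to be chosen by experiment.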
Following on from this blog post on how to access the Million Song Dataset, I decided to do some analysis of the data. I focussed in particular on the length of songs. I performed the analysis using R, Hive and Hadoop on AWS. This first gist uses R to construct some Hadoop code to import all the data into HDFS. Once the data was in HDFS, I ran the following Hive code to mine the data. Finally, I used more R code to analyse and visualise the results. For this I mainly used the ggplot2 package, which is great for producing good-looking graphs.