It’s been a while since I did something with Spark. It’s one of Apache’s most contributed-to open source projects, so I was keen to see what had been developed since I last had a play. One of the big changes has been the development of a side project, Zeppelin. Zeppelin is a notebook-style interactive development environment for Spark, much as Hue is for Hive. My aim here was to have a play with Zeppelin and see if I could use it to develop a machine learning process. I needed some data, and the obvious dataset would be something to do with Led Zeppelin. So I used the Spotify API to download Echo Nest audio features for the songs on all of Led Zeppelin’s studio albums. My plan was to do some unsupervised clustering to group songs with similar audio features together.
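To make the clustering idea concrete, here is a minimal sketch in plain Python. It assumes k-means as the clustering method, which is my choice for illustration rather than anything settled above, and the track names and feature values are made-up placeholders, not real Spotify or Echo Nest data. In Spark itself this would more likely be done with MLlib's KMeans rather than hand-rolled code.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means on a list of equal-length feature tuples."""
    # Deterministic start: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Hypothetical placeholder values, NOT real Spotify/Echo Nest data.
# Columns: (energy, acousticness), both assumed to be on a 0-1 scale.
songs = {
    "track_a": (0.90, 0.05),
    "track_b": (0.85, 0.10),
    "track_c": (0.20, 0.80),
    "track_d": (0.25, 0.90),
}

labels, centroids = kmeans(list(songs.values()), k=2)
clusters = dict(zip(songs, labels))  # track name -> cluster index
```

With these toy values the two loud tracks end up in one cluster and the two acoustic tracks in the other. Real audio features would need scaling first if they sit on different ranges, since k-means relies on Euclidean distance.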
It now includes getArtistsAlbums, which takes the output from a getArtists search, finds the albums by that artist and outputs a data.frame. This can be followed up with getAlbumsTracks, which finds all the tracks from those albums and creates a data.frame. Finally, I’ve added visDiscography. This uses both of those functions, along with the get artist function, to create an interactive visualisation of an artist’s discography.