It’s been a while since I did anything with Spark. It’s one of Apache’s most actively contributed-to open source projects, so I was keen to see what had been developed since I last had a play. One of the big changes has been the arrival of a companion project, Zeppelin. Zeppelin is a web-based notebook environment for working interactively with Spark, much as Hue is for Hive. My aim here was to have a play with Zeppelin and see if I could use it to develop a machine learning process. I needed some data, and the obvious choice was something to do with Led Zeppelin, so I used the Spotify API to download the Echo Nest audio features for the songs on all of Led Zeppelin’s studio albums. My plan was to use unsupervised clustering to group together songs with similar audio features.
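To give a flavour of what that clustering step involves, here is a minimal sketch of k-means in plain Python rather than Spark. It is only an illustration of the idea: the song names and the two feature values (labelled energy and acousticness) are made up for the example, and k-means is just one of several unsupervised methods that could be used here.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain k-means. Returns (centroids, labels)."""
    # Deterministic start: use the first k points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical (energy, acousticness) values, invented for illustration only.
songs = {
    "song_a": (0.90, 0.10),
    "song_b": (0.20, 0.80),
    "song_c": (0.85, 0.15),
    "song_d": (0.25, 0.90),
}
centroids, labels = kmeans(list(songs.values()), k=2)
clusters = dict(zip(songs, labels))
print(clusters)
```

With real audio features you would normally scale each feature first so that no single one dominates the distance calculation.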
Having taught myself a bit of Python, I was keen to start using Spark. Spark is an open source project from Apache that builds on the ideas of MapReduce. It can still read from and write to HDFS, but where Hadoop MapReduce writes intermediate results to disk, Spark keeps them in memory where it can, which can dramatically speed up processing. Spark also comes with a great machine learning library, MLlib. Where Hive on Hadoop limits you to data processing and manipulation, Spark lets you build models at scale. Spark can be driven from its native Scala, or through its APIs for Python and Java (release 1.4 also adds an R API).
I decided to take the Last.fm dataset I have been playing with recently and generate some recommendations for its users based on their listening habits.
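As a rough idea of what "recommendations from listening habits" means in practice, here is a small sketch of user-based collaborative filtering in plain Python. It is not necessarily the approach used at scale (Spark's MLlib offers ALS-based collaborative filtering for that), and the user names, artist names and play counts are all invented for the example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two {artist: play_count} dicts."""
    shared = set(u) & set(v)
    dot = sum(u[a] * v[a] for a in shared)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def recommend(plays, user, top_n=3):
    """Rank artists the user has not played, weighted by similar users' plays."""
    scores = {}
    for other, history in plays.items():
        if other == user:
            continue
        sim = cosine(plays[user], history)
        if sim <= 0:
            continue  # ignore users with no overlap in taste
        for artist, count in history.items():
            if artist not in plays[user]:
                scores[artist] = scores.get(artist, 0.0) + sim * count
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# Hypothetical play counts, invented for illustration only.
plays = {
    "user_1": {"artist_a": 10, "artist_b": 5},
    "user_2": {"artist_a": 8, "artist_b": 4, "artist_c": 6},
    "user_3": {"artist_d": 9},
}
recs = recommend(plays, "user_1")
print(recs)
```

Here user_2 listens to the same artists as user_1, so the one artist only user_2 has played is suggested, while user_3 shares nothing with user_1 and contributes no suggestions.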