Its been a while since I did something with Spark. Its one of Apache’s most contributed to open source projects, so I was keen to see what had been developed since I last had a play. One of the big changes has been the development of a side project Zeppelin. Zeppelin is an interactive development environment (IDE) for Spark, much like Hue is to Hive. My aim here was to try have a play with Zeppelin and see if I could use it to develop a machine learning process. I needed some data, and the obvious dataset would be something to do with Led Zeppelin. So I used the Spotify API to download Echo Nest audio features for the songs on all Led Zeppelin’s studio albums. My plan was to do some unsupervised clustering to group songs with similar audio features together.
Firstly I grabbed the data. I created a quick R process to do loop through and get the features from the API using the httr package. The script is here. Then I uploaded the data onto s3 as a csv ready to use.
Now I needed to get an EMR cluster running with Spark and Zeppelin installed. This blog provided some helpful instructions. I decided to use Python and pyspark again. Although SparkR using R is now available, there seems to be a stronger online community around pyspark, with more of the Q&As on stackoverflow about pyspark. One of the first changes in Spark I noticed was that the MLlib was no longer using “RDD” as its input data format, it was now using “dataframes”. Dataframes are the next step on from RDDs and take the distributed dataset of an RDD and organize them into a structure with named columns. Anyone familiar with R or pandas in python will easily pick up the concept. However getting my csv file to a Dataframe was a little tricky. Firstly I had to read the file in and create an RDD, then I could convert the RDD to a dataframe. It all seemed a bit clunky. However there are a couple of projects being worked on to help simplify this.
For example to show columns 1 and 3 from the dataframe
Or to show all columns where col1>1
To calculate the average of col3 for each value of col1
To join dataframe1 to dataframe2
Alternatively you could also do all of the above using the sql driver that comes with pyspark
newdataframe = spark.sql("SELECT * FROM dataframe where col1>1")
One of the great things about Zeppelin is it lets you switch language.
- So if you are running some pyspark you just need to declare %pyspark
- Equally you could declare %sql. Then you can just work in native SQL
- %md if you want to add some markdown text
- %python if you want python
- %sh for shell command
This makes it a really nice dev environment for your multilingual data scientist. One of the nice things about using %sql is the output becomes easily graph-able with a few clicks, so its quick and easy to interrogate the results, or download them as a csv. As you can see here In Through the Out Door is Zeps most danceable album.
One drawback I noticed is when you submitted a coding mistake, rather than return an error, it had a tendency to just hang while trying to run your incorrect query.
Once I’d spent some time exploring Zeppelin and exploring the dataframe, next I wanted to do the clustering. Sparks machine learning library MLlib offers three different clustering methods. Classic K-means. A Gaussian mixed model – which is like k means but provides a probabilistic estimate for belonging to each cluster according to Gaussian density distribution. Whats nice about this method is it means that your clusters don’t have to be spherical like with Kmeans. Finally a bisecting Kmeans which is more like a hierarchical clustering in reverse. Starting with all you data in one cluster and bisecting the data down to the level of number of clusters required.
In order to run the models you need to standardize the features and then store them as a dense vector. Spark offers a variety of scaling options including Normalizer(), MinMaxScaler(), MaxAbsScaler(). I chose to use the StandardScaler() to scale by Mean and Standard Deviation. Then I used the VectorAssembler() to put the features I wanted into a dense vector ready for modelling. I ran all three models to see how the solutions differed KMeans() ,BisectingKMeans(), GaussianMixture(). My Spark code can be found here.
For each method I chose 9 clusters as there were 9 albums. Looking at the three solutions I preferred the K-Means – though this was subjective. It also produced the solution with the smallest average distances from each song to its cluster centre. Working with this solution I produced the following plot in R using plotly in R, here is the code. The colours show the cluster each song belongs to, by album, over time, the size represents the length of the song.
It highlights a real diversity of styles on Physical Graffiti, whereas Coda is limited to just two styles of songs. You can also see how their style changed over time as blue style songs replace purple song after Houses of the Holy.