Sparkling Song Recommendations


Having taught myself a bit of Python, I was keen to start using Spark. Spark is an open source Apache project that builds on the ideas of MapReduce. It can still read from HDFS, but where Hadoop MapReduce writes intermediate results to disk, Spark keeps them in memory, which can dramatically speed up processing. Spark also comes with a great machine learning library, MLlib. Where Hive on Hadoop limits you to data processing and manipulation, Spark lets you build models at scale. Spark can be used from its native Scala, or through APIs for Python and Java (release 1.4 also includes an R API).

I decided to take the Last.fm dataset I have been playing with recently and generate some recommendations for users based on their listening habits.
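To give a flavour of what recommending from listening habits can look like, here is a minimal plain-Python sketch. It is not the Spark code from this post: the `plays` rows, the `recommend` function and the overlap-weighted scoring are all made up for illustration, and a real run at scale would more likely lean on Spark's machine learning library.

```python
from collections import Counter, defaultdict

# Hypothetical (user, artist, play_count) rows, invented for this sketch.
plays = [
    ("alice", "Radiohead", 40),
    ("alice", "Portishead", 25),
    ("bob", "Radiohead", 10),
    ("bob", "Massive Attack", 30),
    ("carol", "Portishead", 5),
    ("carol", "Massive Attack", 12),
    ("carol", "Tricky", 20),
]


def recommend(plays, user, top_n=3):
    """Suggest artists the user has not played, weighted by how much
    their listening overlaps with each other user's (an assumed scoring)."""
    # Build a per-user dictionary of artist -> total play count.
    by_user = defaultdict(dict)
    for u, artist, count in plays:
        by_user[u][artist] = by_user[u].get(artist, 0) + count

    mine = by_user.get(user, {})
    scores = Counter()
    for other, theirs in by_user.items():
        if other == user:
            continue
        # Similarity: shared plays on the artists both users listen to.
        shared = mine.keys() & theirs.keys()
        overlap = sum(min(mine[a], theirs[a]) for a in shared)
        if overlap == 0:
            continue
        # Credit each unheard artist in proportion to the similarity.
        for artist, count in theirs.items():
            if artist not in mine:
                scores[artist] += overlap * count

    return [artist for artist, _ in scores.most_common(top_n)]


print(recommend(plays, "alice"))
```

Users who share more plays with you get more say in what is suggested, which is the basic idea behind collaborative filtering.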


Accessing a million songs with Hive and Hadoop on AWS


The Million Song Dataset was created a few years ago to encourage research on algorithms for analysing music-related data. A Kaggle competition and a hackathon also used it a couple of years ago. It’s freely available through Amazon Web Services (AWS) as a public dataset and also in an S3 bucket. I use AWS at work, but access it through a nice front end, and I had been wanting to learn what goes on behind the scenes for a while. Putting the two together, I set myself the challenge of accessing the Million Song Dataset using Hive and Hadoop on AWS.
