The million song dataset was created a few years ago to help encourage research on algorithms for analysing music related data. There was also a Kaggle competition and a Hackathon using it a couple of years ago. It’s freely available through Amazon Web Services (AWS) as a public dataset and also in an S3 bucket. I use AWS at work but access it using a nice front end. I had been wanting to learn what was going on behind the scenes for a while. Putting these two together I gave myself a challenge of accessing the Million Song Database using Hive and Hadoop through AWS.