Accessing a million songs with Hive and Hadoop on AWS


The Million Song Dataset was created a few years ago to encourage research on algorithms for analysing music-related data. There was also a Kaggle competition and a hackathon using it a couple of years ago. It’s freely available through Amazon Web Services (AWS) as a public dataset, held in an S3 bucket. I use AWS at work, but only through a nice front end, and I’d been wanting to learn what goes on behind the scenes for a while. Putting these two together, I set myself the challenge of accessing the Million Song Dataset using Hive and Hadoop through AWS.

Before I go into the details I thought I’d pre-empt a whole load of AWS jargon:


————————————————————————————-
Amazon Web Services (AWS) is a cloud computing platform offering storage, computing and analytics services.

Elastic Compute Cloud (EC2) instances are cheap, rentable cloud computers

The Simple Storage Service (S3) is cheap cloud storage for storing any files

Elastic MapReduce (EMR) is a web service that creates a cloud computing cluster from a number of EC2 instances, installs Hive and Hadoop on it, and provides access to S3

Hive is a data warehousing system, installed on an EMR cluster, that generates Hadoop jobs from HiveQL queries

HiveQL is a SQL-like language used by analysts to query data stored on S3

A file stored on S3 can be pulled into the Hadoop Distributed File System (HDFS) by Hadoop ready for processing

The HDFS stores data across a series of EC2 machines in a Hadoop cluster

Hadoop is a processing system installed on an EMR that ‘Maps’ parallel tasks across a number of EC2 nodes, before ‘Reducing’ the results of the parallel processing into a final result. It uses the MapReduce paradigm

Hadoop can write files to S3
————————————————————————————-

OK. So here’s how I did it.

First you need to sign up for AWS. If you have an Amazon shopping account you can use that.

Next you need to set up a key pair between your home computer and AWS. This is a security measure that allows your computer at home to access the cloud-based machines you hire from AWS.
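If you prefer the command line to the web console, the AWS CLI can generate the key pair for you. This is a sketch assuming you have the `aws` CLI installed and configured; `MillionSongKeyPair` is just an example name:

```shell
# Create a key pair and save the private key locally (example name)
aws ec2 create-key-pair --key-name MillionSongKeyPair \
    --query 'KeyMaterial' --output text > MillionSongKeyPair.pem

# ssh refuses private keys that are readable by other users
chmod 400 MillionSongKeyPair.pem
```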

Now you can create an EMR cluster. Note that in step 8 you don’t need to enter a Hive program; I submit some HiveQL to the cluster later on.

While that is being created for you (it takes 5–10 minutes), go to S3 and create a bucket to store your own data in.
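Both of these steps can also be done from the AWS CLI. The commands below are a sketch: the bucket name, cluster name, release label and instance type are all example values you would substitute with your own (and check EMR pricing before launching):

```shell
# Create an S3 bucket for your own output (name must be globally unique)
aws s3 mb s3://my-msd-output

# Launch a small EMR cluster with Hive installed
aws emr create-cluster --name "MillionSongCluster" \
    --release-label emr-5.36.0 \
    --applications Name=Hive \
    --ec2-attributes KeyName=MillionSongKeyPair \
    --instance-type m4.large --instance-count 3 \
    --use-default-roles
```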

NB: I use a mac at home and am unable to confirm whether the following instructions would work on a PC.

The next step is to access your cluster using a secure shell. Wait for the cluster’s status to show as running, then enter the following in the terminal, substituting the Public DNS of your EMR cluster and the location of your key pair .pem file.

ssh hadoop@ec2-xx-xxx-xx-xx.compute-1.amazonaws.com -i /Key_pair/location_folder/KEY_PAIR.pem

You should now have access to the cluster.

First, let’s have a look at the Million Song data files on S3.

hadoop fs -ls s3://tbmmsd/

Now to copy them into HDFS. This code creates a folder on HDFS to put the files in and then copies files across. You’ll need to repeat the copy step for every file to get all the data.

hadoop fs -mkdir /data
hadoop fs -mkdir /data/files

hadoop fs -cp s3://tbmmsd/A.tsv.a /data/files/A.tsv.a
hadoop fs -cp s3://tbmmsd/A.tsv.b /data/files/A.tsv.b
hadoop fs -cp s3://tbmmsd/A.tsv.c /data/files/A.tsv.c
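Copying files one at a time gets tedious. `hadoop fs -cp` accepts glob patterns, so (assuming the bucket layout shown by the `-ls` above) you can pull all parts of a file in one command:

```shell
# Copy every part of the 'A' file in one go (quote the glob so the
# local shell doesn't try to expand it)
hadoop fs -cp 's3://tbmmsd/A.tsv.*' /data/files/
```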

Now we can process the data using HiveQL. First type “hive” into the terminal to open the Hive shell. Then enter the following to define a table over the data and count the number of records.

create external table if not exists MSD
(
analysis_sample_rate float ,
artist_7digitalid int ,
artist_familiarity float ,
artist_hotttnesss float ,
artist_id string ,
artist_latitude float ,
artist_location string ,
artist_longitude float ,
artist_mbid string ,
artist_mbtags string ,
artist_mbtags_count int ,
artist_name string ,
artist_playmeid int ,
artist_terms string ,
artist_terms_freq float ,
artist_terms_weight float ,
audio_md5 string ,
bars_confidence float ,
bars_start float ,
beats_confidence float ,
beats_start float ,
danceability float ,
duration float ,
end_of_fade_in float ,
energy float ,
key int ,
key_confidence float ,
loudness float ,
mode int ,
mode_confidence float ,
release string ,
release_7digitalid int ,
sections_confidence float ,
sections_start float ,
segments_confidence float ,
segments_loudness_max float ,
segments_loudness_max_time float ,
segments_loudness_max_start float ,
segments_pitches float ,
segments_start float ,
segments_timbre float ,
similar_artists string ,
song_hotttnesss float ,
song_id string ,
start_of_fade_out float ,
tatums_confidence float ,
tatums_start float ,
tempo float ,
time_signature int ,
time_signature_confidence float ,
title string ,
track_id string ,
track_7digitalid int ,
year int
) row format delimited fields terminated by '\t' stored as textfile
location '/data/files';

select count(1) from MSD;
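With the table defined, you can run ordinary aggregate queries over it. As an illustrative sketch, this query counts tracks per year, skipping rows where the year is missing (stored as 0 in the dataset):

```sql
select year, count(*) as n_tracks
from MSD
where year > 0
group by year
order by year;
```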

The next part creates a second table, backed by one of my S3 folders, and writes the first 100 rows into it.


create external table if not exists MSD_sample
(
analysis_sample_rate float ,
artist_7digitalid int ,
artist_familiarity float ,
artist_hotttnesss float ,
artist_id string ,
artist_latitude float ,
artist_location string ,
artist_longitude float ,
artist_mbid string ,
artist_mbtags string ,
artist_mbtags_count int ,
artist_name string ,
artist_playmeid int ,
artist_terms string ,
artist_terms_freq float ,
artist_terms_weight float ,
audio_md5 string ,
bars_confidence float ,
bars_start float ,
beats_confidence float ,
beats_start float ,
danceability float ,
duration float ,
end_of_fade_in float ,
energy float ,
key int ,
key_confidence float ,
loudness float ,
mode int ,
mode_confidence float ,
release string ,
release_7digitalid int ,
sections_confidence float ,
sections_start float ,
segments_confidence float ,
segments_loudness_max float ,
segments_loudness_max_time float ,
segments_loudness_max_start float ,
segments_pitches float ,
segments_start float ,
segments_timbre float ,
similar_artists string ,
song_hotttnesss float ,
song_id string ,
start_of_fade_out float ,
tatums_confidence float ,
tatums_start float ,
tempo float ,
time_signature int ,
time_signature_confidence float ,
title string ,
track_id string ,
track_7digitalid int ,
year int
) row format delimited fields terminated by '\t' stored as textfile
location 's3n://folder/project/file/filename';

insert overwrite table MSD_sample
select * from MSD limit 100;
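To confirm the export landed, leave the Hive shell (type `exit;`) and list the S3 location from the `location` clause above (substituting your own bucket path for this placeholder):

```shell
# List the files Hive wrote to the S3 location
hadoop fs -ls 's3n://folder/project/file/'
```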

That’s it. I’ll definitely be doing some analysis on this data in the future. Many thanks to Simon Gauld for his assistance with AWS.
