Continuing on my mission to get better at Python, I’ve been learning about the pandas and sklearn libraries. I was looking for a challenge to use these libraries on, and I had recently come across a nice lastfm data extract. The data covers around 360K users, with a little demographic data on each (gender, age, country) and a count of listens they have had for each artist. I decided to try to build a model that could predict someone’s gender based on what they have been listening to.
Firstly I imported the data using the read_table function. I liked its usecols option, which lets you pull in only the relevant columns. I then dropped any users who didn’t have a gender and converted the gender field into a binary dummy variable. Next I needed to construct some music listening features for the model. To do this I kept only artists who had been listened to by at least 1% of the users, because I didn’t want my dataset to be too sparse in my next step. I then created a wide dataframe with users as rows and artists as columns, with number of listens as the values. Next I used principal components analysis to reduce this data down to users and the principal components that represent their listening habits. These principal components would be the features that fed into my machine learning process.
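The feature-building steps above could be sketched roughly like this. It is a minimal sketch on a tiny made-up table, not the code from my gist: the column names (`user`, `artist`, `plays`, `gender`), the "m"/"f" coding, the 50% threshold and the choice of two components are all assumptions picked so the toy example runs.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data standing in for the lastfm extract. A real file
# could be loaded with pd.read_table(path, usecols=[...]) instead.
plays = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4", "u4"],
    "artist": ["a", "b", "a", "c", "a", "b", "b", "c", "d"],
    "plays":  [10, 3, 5, 8, 2, 7, 4, 6, 1],
})
users = pd.DataFrame({
    "user":   ["u1", "u2", "u3", "u4", "u5"],
    "gender": ["m", "f", "f", "m", None],
})

# Drop users with no gender and encode gender as a 0/1 dummy column.
users = users.dropna(subset=["gender"]).assign(
    is_male=lambda d: (d["gender"] == "m").astype(int)
)
plays = plays[plays["user"].isin(users["user"])]

# Keep artists heard by at least min_share of users (the post used 1%;
# 50% is used here only because the toy table is so small).
min_share = 0.5
n_users = plays["user"].nunique()
listeners = plays.groupby("artist")["user"].nunique()
popular = listeners[listeners / n_users >= min_share].index
plays = plays[plays["artist"].isin(popular)]

# Wide table: one row per user, one column per artist, listens as values.
wide = plays.pivot_table(
    index="user", columns="artist", values="plays", aggfunc="sum", fill_value=0
)

# Reduce the artist columns to a small number of principal components.
pca = PCA(n_components=2)
features = pca.fit_transform(wide)
print(wide.shape, features.shape)
```

The `features` array is what would be passed to the classifiers, with `is_male` as the target.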
To build the model I took a machine learning approach. I split my data three ways: a hold-out test set, a train set and a cross validation set. This was really easy using the train_test_split function sklearn provides. Using my train set I then applied three different algorithms: KNN, decision trees and random forests. For each algorithm I looped through a sequence of values for its main complexity parameter. In each case I recorded the accuracy, precision and MSE against the cross validation set. I then plotted this data using matplotlib and identified the best algorithm and optimum complexity parameter. As you can see, the random forest with a maximum depth of 10 seems to give the best solution. The KNN had yet to reach its optimum value, so it may have done better had I taken it further.
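A rough sketch of that selection loop is below. It runs on synthetic data from `make_classification` rather than the lastfm features, and the 60/20/20 split, the parameter range of 1 to 10, the 50 trees per forest and the random seeds are all assumptions for illustration, not the settings from my gist.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error, precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the principal component features and gender labels.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Two calls to train_test_split give train / validation / test sets (60/20/20).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

# Each entry builds a model from one complexity parameter value c.
candidates = {
    "knn": lambda c: KNeighborsClassifier(n_neighbors=c),
    "tree": lambda c: DecisionTreeClassifier(max_depth=c, random_state=0),
    "forest": lambda c: RandomForestClassifier(
        max_depth=c, n_estimators=50, random_state=0
    ),
}

# Fit on the train set and score every (algorithm, c) pair on the validation set.
results = []
for name, make_model in candidates.items():
    for c in range(1, 11):
        model = make_model(c).fit(X_train, y_train)
        pred = model.predict(X_val)
        results.append({
            "algorithm": name,
            "param": c,
            "accuracy": accuracy_score(y_val, pred),
            "precision": precision_score(y_val, pred),
            "mse": mean_squared_error(y_val, pred),
        })

best = max(results, key=lambda r: r["accuracy"])
print(best)
```

With 0/1 labels and hard predictions, the MSE is just one minus the accuracy, so the two curves carry the same information. The `results` list can be turned into a dataframe and plotted per algorithm with matplotlib.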
Finally I built a model using the optimum algorithm and complexity parameter and got a final estimate for my accuracy, precision and MSE against my test set. The accuracy of the model is 0.75, which suggests that, based purely on people’s musical preferences, the model can predict their gender correctly 75% of the time. With a bit more time and computing power I could have included the number of artists and the number of principal components in my decision loop to optimise those.
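One way the number of principal components might be folded into the search is sketched below. This is not what I ran: it swaps my single cross validation set for sklearn's `GridSearchCV` with 3-fold cross-validation, uses synthetic data, and the grid values are assumptions chosen to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the wide user-by-artist table and gender labels.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Chain PCA and the forest so both can be tuned together.
pipeline = Pipeline([
    ("pca", PCA()),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Hypothetical grid: number of components and maximum tree depth.
param_grid = {
    "pca__n_components": [2, 5, 10],
    "forest__max_depth": [3, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# The held-out test set is only touched once, for the final estimate.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

The number of artists kept could be tuned in the same way, but it would need its own filtering step ahead of the PCA.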
My code is stored on this gist