A Simple Movie Recommender System

Posted on Sun 20 January 2019 in Data Science

A few days ago, while sifting through movies on Netflix, I decided to have some fun building my own movie recommender system. With the number of choices for movies, books, restaurants and news items, as well as people to follow or befriend on the Internet, increasing every day, recommender systems have become an intrinsic part of people’s lives. They existed 20-30 years ago too, but the choices were far fewer then, and recommendations were based solely on overall popularity, such as top-10 lists of best sellers or blockbusters. Today, recommender systems cater to the unique tastes of individuals and are not restricted to the top-rated or most-voted items. They are everywhere, whether we are watching movies and TV shows on Netflix, shopping on Amazon, or following and making connections on LinkedIn, Twitter and Facebook. The aim of any such system is to filter through the noise and suggest what we will like.

There are two main kinds of recommender systems: collaborative filtering and content-based. As the name indicates, collaborative filtering pools information from a group of users whose likes and dislikes are similar to those of a customer \(X\), and then recommends to \(X\) items that these similar users liked and that \(X\) has not yet watched or purchased. A content-based recommender system, on the other hand, relies solely on the history of items liked by \(X\): it looks at the content of the items that \(X\) liked in the past and recommends items with similar content. The content of an item is its description; for a book, for example, it can comprise the author, the plot and the genre. Content-based recommender systems are particularly useful for recommending new and not-very-popular items, where collaborative filtering fails because few users have rated them yet. Additionally, a content-based recommender needs data only about the customer to whom it is recommending items, not data about ‘similar users’. For these reasons, I chose to build a content-based recommender system for movies.

Data Collection

I used IMDb to gather data about movies released in the last 10 years. The dataset covers 48,158 English-language movies from 2009-2018 and includes movie titles, directors, actors, genres, ratings, votes, metascore (a score from Metacritic, a review aggregator), year of release, revenue generated, duration and certificate. However, I built my recommender system using only the directors, actors and genres of the movies.

Data Cleaning and Preprocessing

Here, I first dealt with the missing values. Next, I picked a few actors for each movie, since many movies list many actors and not all of them are in a leading role. Because it is mostly the leading actors who determine whether we will like a movie, I kept only the first three. In addition, I converted all words to lowercase and combined first and last names into a single word. This prevents the algorithm from wrongly matching on the shared ‘James’ in James Cameron and James Wan.
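The name-cleaning step can be sketched in a few lines of plain Python. This is only a sketch under my own assumptions: the helper `clean_names` and its `keep` parameter are hypothetical names, and the input lists are illustrative rather than rows from the actual dataset.

```python
def clean_names(names, keep=None):
    """Lowercase each name and drop its spaces so it becomes a single token.

    `keep` (hypothetical parameter) limits the result to the first `keep` names.
    """
    cleaned = ["".join(name.lower().split()) for name in names]
    return cleaned if keep is None else cleaned[:keep]


# Illustrative input: two directors who share a first name.
directors = clean_names(["James Cameron", "James Wan"])

# Illustrative cast list; only the first three names are kept.
actors = clean_names(
    ["Sam Worthington", "Zoe Saldana", "Sigourney Weaver", "Stephen Lang"],
    keep=3,
)

print(directors)  # ['jamescameron', 'jameswan']
print(actors)     # ['samworthington', 'zoesaldana', 'sigourneyweaver']
```

After this step the two directors no longer share the token ‘james’, so they cannot be matched by accident.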

Next, I made a bag of words, which is literally a bag (a collection with no order among its members) containing the names of the directors and actors and the genres of each movie. Because directors seem more important to me than the genre and the stars (at least that is my preference), I gave the directors a higher weight than the other two by including their names twice for each movie.
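As a rough illustration of the weighting idea, the bag for one movie could be assembled as below. The function `build_bag` and its `director_weight` parameter are hypothetical names chosen for this sketch, and the tokens are illustrative, already-cleaned values.

```python
def build_bag(directors, actors, genres, director_weight=2):
    """Join the cleaned tokens of one movie into a space-separated bag.

    Directors are repeated `director_weight` times (twice by default)
    so that they count more than actors and genres.
    """
    tokens = list(directors) * director_weight + list(actors) + list(genres)
    return " ".join(tokens)


# Illustrative, already-cleaned tokens for a single movie.
bag = build_bag(
    directors=["jamescameron"],
    actors=["samworthington", "zoesaldana", "sigourneyweaver"],
    genres=["action", "adventure", "fantasy"],
)
print(bag)
```

Repeating a token is a simple way to weight it: any count-based vectorizer will later give the director a count of 2 while every other token gets 1.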

Building the Model

In a bag-of-words model, each item is first split into tokens, and a weight is then assigned to each token depending on how often it occurs in the item or in the collection of all items. The result is a matrix in which each row is an item and each column is a token. Each row of this matrix is an item vector, also called an item profile. In our case, the items are the movies and the tokens are the genres and the names of the actors and directors. In other words, each movie is a vector holding a weighted value for each attribute (a director’s name, a star’s name or a genre). CountVectorizer from scikit-learn can be used directly for this purpose. It produces a matrix whose columns are the words and whose rows are the movies, with each weight equal to the number of times the word occurs in the movie’s attributes.
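To make the matrix concrete, here is a minimal sketch that runs CountVectorizer with its default settings on two hand-written bags. The bags are illustrative and not taken from the dataset, and the variable name `movies_matrix` is simply chosen to match the function shown later in this post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative bags: one string of space-separated tokens per movie,
# with the director's name appearing twice.
bags = [
    "jamescameron jamescameron samworthington zoesaldana sigourneyweaver "
    "action adventure fantasy",
    "jameswan jameswan patrickwilson verafarmiga horror mystery thriller",
]

vectorizer = CountVectorizer()
movies_matrix = vectorizer.fit_transform(bags)  # sparse matrix: movies x tokens

# vocabulary_ maps each token to its column index in the matrix.
cameron_col = vectorizer.vocabulary_["jamescameron"]
action_col = vectorizer.vocabulary_["action"]

print(movies_matrix.shape)
print(movies_matrix[0, cameron_col], movies_matrix[0, action_col])
```

The first movie has a count of 2 in the ‘jamescameron’ column and 1 in the ‘action’ column, which is exactly the director weighting described above. Note that the default tokenizer splits on punctuation, so a hyphenated token would need to be cleaned beforehand or handled with a custom `token_pattern`.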

With item profiles in hand, we are ready to find similar movies. Item profiles are vectors in a high-dimensional space. A good way to measure the similarity between two item vectors is the angle between them, which we can capture using cosine similarity:

$$ \cos \theta = \frac{A \cdot B}{\lVert A \rVert \lVert B \rVert} $$

where \(A\cdot B\) represents the dot product between \(A\) and \(B\) and \(\lVert \cdot \rVert\) is the Euclidean norm. As the angle \(\theta\) decreases, \(\cos \theta\) increases. Because the count vectors have no negative entries, \(\cos \theta\) lies between 0 (no tokens in common) and 1 (same direction). Thus, the cosine gives us a measure of the similarity between any two movies.
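A small worked example may help here. The sketch below implements the formula directly in plain Python on three made-up count vectors; the column meanings in the comment are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical columns: director_x, director_y, actor_p, action, drama
movie_a = [2, 0, 1, 1, 0]  # director_x (counted twice), actor_p, action
movie_b = [2, 0, 0, 1, 1]  # director_x (counted twice), action, drama
movie_c = [0, 2, 0, 0, 1]  # director_y (counted twice), drama

print(round(cosine(movie_a, movie_b), 3))  # 0.833 -> same director and genre
print(round(cosine(movie_a, movie_c), 3))  # 0.0   -> nothing in common
```

For the first pair the dot product is \(2\cdot 2 + 1\cdot 1 = 5\) and both norms are \(\sqrt{6}\), so \(\cos\theta = 5/6\). The shared director contributes 4 of those 5 points, which shows how doubling the director’s name increases its influence.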

There is one glitch in the above similarity metric. Suppose a movie has an ‘action’ genre but no information about its actors and directors is available. Such a movie will match other ‘action’ movies quite well and appear in the results for those movies, even though there is not really enough information about it to justify recommending it. So, I additionally penalize the similarity score when the bag is too small by subtracting the inverse of the bag size.
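The effect of the penalty can be seen with two made-up candidates. In this sketch the function name `penalized_score` and the example numbers are my own; the small constant mirrors the `.0001` used in the function later in this post and guards against division by zero for an empty bag.

```python
def penalized_score(similarity, bag_len, eps=1e-4):
    """Cosine similarity minus the inverse of the candidate's bag size."""
    return similarity - 1.0 / (bag_len + eps)


# Hypothetical candidates for the same query movie:
sparse_movie = penalized_score(similarity=0.5, bag_len=1)  # only 'action' known
rich_movie = penalized_score(similarity=0.4, bag_len=8)    # full set of tokens

print(round(sparse_movie, 3), round(rich_movie, 3))
```

Although the sparse movie starts with the higher raw similarity, the penalty of roughly 1 pushes it far below the well-described movie, whose penalty is only about 0.125.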

I made a function that takes a movie title and gives 10 movies that are similar to it. This function encapsulates my recommender system. It is worth listing the function code here. I believe the comments and the variable names give sufficient context to follow what is going on.

from sklearn.metrics.pairwise import cosine_similarity

def recommendations(title):

    # get index of the movie in the dataframe
    idx = indices[title]

    # get the bag-of-words row vector for the selected movie
    chosen_movie_matrix = movies_matrix[idx]

    # get cosine similarity of the selected movie with all the movies
    cos_sim = cosine_similarity(chosen_movie_matrix, movies_matrix)

    # get a list of tuples (movie index, penalised similarity score), skipping the selected
    # movie itself. The score is penalised by 1/bag_len so that movies with very small bags rank lower.
    scores = [(i, sim - 1/(movies.iloc[i]['bag_len']+.0001))
              for i, sim in enumerate(cos_sim[0]) if i != idx]

    # sort all the movies by similarity score in descending order (most similar first)
    scores.sort(key=lambda x: x[1], reverse=True)

    # get the indices of the top 10 similar movies
    movie_indices = [i for i, _ in scores[:10]]

    # get the names of the most similar movies
    return movies.iloc[movie_indices]['movie']

Testing and Using the Recommender System

Now my recommender system was ready to recommend movies. But before using it, I wanted to check that it was working properly. For this purpose, I gave it a movie that I like, ‘Inception’, for which I knew to a certain extent what the recommendations should be. The first 3 recommendations were from the same director, Christopher Nolan, and had some overlap with the genre of ‘Inception’. The others in the list matched on genre.

recommendations('Inception')
22440                      Interstellar
32080             The Dark Knight Rises
6513                            Dunkirk
4024                     The Star Kings
24037       Star Wars: The New Republic
28336                           Don Jon
599                    Ready Player One
695                           Bumblebee
1314     Jurassic World: Fallen Kingdom
1373                            Rampage

I had actually watched some of these movies and liked them. So, for my movie night, I gave my recommender system ‘Boyhood’, and out of the 10 recommendations I decided to watch ‘Before Midnight’. Needless to say, I was not disappointed.

If interested, you can find the entire source code on GitHub.