
Introduction to Data Analysis: Movie Recommendations

The main component of our movie recommendation system relies on a learning concept called collaborative filtering. Collaborative filtering bases its suggestions only on users’ past data and preferences, mostly in the form of reviews (albeit there are other methods of gathering user preferences). To understand this, I will illustrate an example of user-user collaborative filtering which utilizes the nearest neighborhood algorithm. Say I enjoyed watching movie A, movie B, and movie C, and my friend enjoyed watching movie A, movie B, and movie D. The collaborative filtering algorithm will most likely suggest that I will enjoy watching movie D, and my friend will enjoy watching movie C, based on our previous positive preferences.

It makes sense for me and my friend to enjoy our movie recommendations because we share similar preferences. Of course, this isn’t always the case, but the odds become increasingly in our favor when we start analyzing larger datasets of user reviews.
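To make that example concrete, here is a small, self-contained sketch of user-user similarity on a toy version of the movie A/B/C/D scenario. The toy_ratings table and the Pearson-based scoring are illustrative assumptions, not code from the guide itself.

import pandas as pd

# Toy user-item rating table mirroring the example: I rated A, B, and C highly;
# my friend rated A, B, and D highly. NaN means "has not seen it yet".
toy_ratings = pd.DataFrame(
    {"movie A": [5.0, 5.0], "movie B": [4.0, 4.5], "movie C": [5.0, None], "movie D": [None, 5.0]},
    index=["me", "friend"],
)

# Pearson correlation over the movies we both rated measures how similar our tastes are.
similarity = toy_ratings.loc["me"].corr(toy_ratings.loc["friend"])
print(f"similarity between me and my friend: {similarity:.2f}")

# A nearest-neighbor recommender then suggests the movies my most similar
# neighbor liked but I have not seen yet.
unseen_by_me = [m for m in toy_ratings.columns if pd.isna(toy_ratings.loc["me", m])]
print("suggested for me:", unseen_by_me)  # -> ['movie D']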
After you import movies.csv and ratings.csv, it should look like this.

Next, we need to import the pandas library. We do this by writing the line of code below onto the first line of your .ipynb file, giving us access to all of the functions and methods found in pandas. Usually, programmers will import NumPy along with pandas; however, we won’t use any methods from NumPy in this guide, so I will choose not to import NumPy. Our algorithm will still run smoothly with or without NumPy.

import pandas as pd
After importing pandas, we can read the data that we downloaded from MovieLens. Using the read_csv() function from pandas, we are able to set specific variables to represent our imported datasets. I will be using titles and data to represent my datasets.

titles = pd.read_csv("movies.csv")
data = pd.read_csv("ratings.csv")

Let’s look at how our datasets are organized. We can do this with the following head() function:

titles.head(10)
data.head(10)

Titles and data should look like the images below, respectively.
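The screenshots of titles and data are not reproduced in this excerpt. As a stand-in, here is a quick check you can run instead; the column names in the comments assume the MovieLens 25M release of movies.csv and ratings.csv.

# Columns in the MovieLens 25M files (assumed here):
#   movies.csv  (titles) -> movieId, title, genres
#   ratings.csv (data)   -> userId, movieId, rating, timestamp
print(titles.columns.tolist())
print(data.columns.tolist())
print(len(data))  # roughly 25 million rows of ratings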

All of the parts in step four revolve around the corrwith() function in pandas. The corrwith() function is extremely useful for us since it can calculate the correlation between the columns of two DataFrames, assuming the DataFrames are in the correct format. To start, we must first make a DataFrame which has titles as its columns and userId as its rows, with the values of the DataFrame being the ratings of each viewer. The reason why we do this is mainly due to the nature of the corrwith() function. As stated above, using Pearson correlation coefficients, the corrwith() function can calculate the similarity between two DataFrame columns. Thus, we want the movie titles as columns so that we can take one column (which represents one movie) and compare it with the others. We also need to have user ratings as the values of our DataFrame because our final goal is to compare the similarity between how each user rated the different movies, and then use the highest similarities to suggest movie recommendations.
The code below will create such a DataFrame. This line of code may take a while for the computer to execute, as we are taking all of the 25 million movie reviews along with all of the movie titles in our dataset and combining them to create a whole new DataFrame (of course, this depends on what computer you have and what dataset you’re working with).

movies = pd.crosstab(data['userId'], data['title'], values=data['rating'], aggfunc='sum')

For me, it took around two minutes maximum for my first time making this DataFrame.
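One caveat about that line: in ratings.csv the movies are identified only by movieId, while the movies DataFrame described here uses full titles such as Inception (2010) as its column names, which suggests the ratings were joined with movies.csv at some earlier point not shown in this excerpt. Under that assumption, a minimal sketch of the whole construction could look like this (the merge step is my addition, not code from the guide):

# Assumed step: bring the movie titles into the ratings table so they can become column names.
data = pd.merge(data, titles[['movieId', 'title']], on='movieId')

# Rows = users, columns = movie titles, values = that user's rating for that movie.
movies = pd.crosstab(
    index=data['userId'],
    columns=data['title'],
    values=data['rating'],
    aggfunc='sum',
)
movies.head()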
If you’re still having trouble running this code, a trick I’ve noticed is that once you import your MovieLens data to JupyterLab, immediately run all of the code that you currently have (including the crosstab line above). After a minute or so, you should have your movies DataFrame generated.

Now, we can input three of our most favorite movies to start generating some correlations between movies. Technically, you can choose whatever number of movies you want for your most favorite movies. However, I would recommend only choosing a maximum of five movies that have similar genres so the algorithm can perform optimally. I used an array of strings, which keeps my code to a minimum as opposed to setting three of my favorite movies to three separate strings.
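The embedded snippet with the actual list is not included in this excerpt. Based on the description above and the three titles named below, the array of strings might look something like this (the variable name is my assumption):

# Hypothetical reconstruction of the favorites list; each entry must match a column
# name in the movies DataFrame exactly, release year included.
my_favorite_movies = ["Inception (2010)", "Interstellar (2014)", "Arrival (2016)"]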

Inception (2010), Interstellar (2014), and Arrival (2016) were my top three most favorite movies of all time. Notice that we need to put the release date of the movie at the end. Essentially, we’re mimicking the format of the movie names (or column names) in our movies DataFrame so that the computer can accurately find the right movie column based on our inputted name.

Let’s start calculating some correlations using the corrwith() function.
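The code that actually computes these correlations comes after this point in the guide and is not part of this excerpt. As a rough sketch of the idea described above (one column per movie, compared against the rest), and reusing the assumed my_favorite_movies list from the earlier sketch, it could look something like this:

# Correlate each favorite movie's column of user ratings with every other
# movie's column; corrwith() uses Pearson correlation by default.
similarity_scores = pd.DataFrame()
for title in my_favorite_movies:
    similarity_scores[title] = movies.corrwith(movies[title])

# Combine the scores and drop the favorites themselves (each correlates 1.0 with itself).
recommendations = (
    similarity_scores.sum(axis=1)
    .drop(labels=my_favorite_movies, errors="ignore")
    .sort_values(ascending=False)
)
print(recommendations.head(10))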
