Analysing ‘Superhero’ Tagged MovieLens Data Using Python

Published September 29, 2018, 2:21 pm

I recently came across a film data set called MovieLens and decided to look at how users have rated the films tagged ‘Marvel’ and ‘DC’ within it. This is, therefore, another Marvel v DC article.

MovieLens

The MovieLens website offers a number of data sets to download free of charge. The one I downloaded for this article (190 MB) contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users, and includes tag genome data with 12 million relevance scores across 1,100 tags. It was last updated in 10/2015. There are also two data sets recommended for education and development: a small data set of 100,000 ratings, and the full data set, which contains 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users and includes tag genome data with 12 million relevance scores across 1,100 tags. The full data set was last updated in 8/2017.

Once I had downloaded the data set, I was ready to start the Python code for analysis.

Installing the Libraries

I needed a number of libraries for the analysis to be both effective and timely. I used pip, the Python package installer, to install Pandas, NumPy and Matplotlib for Python 2.7.

To install, I opened the CMD prompt (Windows) and used the CD command to navigate to the Python directory. The installation line used was:

python -m pip install <packagename>

Using Python and the data sets provided by MovieLens, I was then able to run a check on the data.

Has the rating of films improved?

My first task was to look at the movie release year and the average rating (mean).
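As a minimal sketch of this step, the snippet below averages ratings by release year. It uses tiny made-up frames in place of movie.csv and rating.csv, and it assumes the release year appears in brackets at the end of each title; the variable names are illustrative rather than taken from the original analysis.

```python
import pandas as pd

# Tiny stand-ins for movie.csv and rating.csv (values are made up for illustration).
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Dark Knight, The (2008)"],
})
ratings = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3],
    "rating": [4.0, 5.0, 3.0, 4.5, 4.0],
})

# Pull the four-digit release year out of the title, e.g. "Heat (1995)" -> 1995.
movies["year"] = pd.to_numeric(
    movies["title"].str.extract(r"\((\d{4})\)", expand=False)
)

# Attach the year to every rating, then compute the mean and count per year.
rated = ratings.merge(movies[["movieId", "year"]], on="movieId", how="inner")
yearly = rated.groupby("year")["rating"].agg(["mean", "count"])
print(yearly)
```

Keeping the count next to the mean makes it easier to see which years are backed by only a handful of ratings.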

The resulting plot of release year against mean rating shows a significant rise, from the 1990s onwards, in the number of released films that have a user rating. This could be explored in more detail: how do users rate? Have they changed their rating ideals over time? Do users prefer certain genres? How does this affect the ratings? This is something I may look at in the future. However, for the purpose of this article, I wanted to look at Superhero tagged films.

Firstly, let’s look at the mean rating of the movies that carry the user tag “superhero”.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

# Load the CSV files from the data set
movies_pd = pd.read_csv('movie.csv')
rates_pd = pd.read_csv('rating.csv')
links_pd = pd.read_csv('link.csv')
tags_pd = pd.read_csv('tag.csv')
genome_scores_pd = pd.read_csv('genome_scores.csv')
genome_tags_pd = pd.read_csv('genome_tags.csv')

# Drop tag rows with missing values so the string match below returns only True/False
tags_pd = tags_pd.dropna()

# Boolean mask: tags that start with 'superhero' (case-insensitive)
hero_movies = tags_pd['tag'].str.match('superhero', case=False)

# Join the superhero tags to the movie titles, then to every rating of those movies
moviedata = tags_pd[hero_movies].merge(movies_pd, on='movieId', how='inner')
moviedata = moviedata.merge(rates_pd, on='movieId', how='inner')
# Extract the four-digit release year from the title, e.g. "Iron Man (2008)"
moviedata['year'] = moviedata['title'].str.extract(r'\((\d{4})\)', expand=False)

# Mean rating per title, highest first
avg_rates_titles = moviedata.groupby('title')[['rating']].mean()
dataframe = avg_rates_titles.sort_values(by=['rating'], ascending=False)
print(dataframe)

The output shows the highest rated superhero films, with The Dark Knight (2008) the highest rated in the data set at 4.22. This does not apply any weighting for the number of reviews each film received.
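One possible way to add such a weighting, which was not part of the original analysis, is a damped mean that pulls films with few ratings towards the overall average. The sketch below is only an illustration under that assumption: the data is made up, and the threshold min_votes is a hypothetical parameter that would need tuning.

```python
import pandas as pd

# Made-up ratings: film A has a single perfect score, film B has many good ones,
# and film C has several poor ones.
ratings = pd.DataFrame({
    "title": ["A"] + ["B"] * 8 + ["C"] * 6,
    "rating": [5.0] + [4.5] * 8 + [2.0] * 6,
})

stats = ratings.groupby("title")["rating"].agg(["mean", "count"])
global_mean = ratings["rating"].mean()  # average over every rating
min_votes = 5                           # hypothetical damping threshold

# Damped mean: (v * R + m * C) / (v + m), where v is the film's rating count,
# R its mean rating, m the threshold and C the global mean.
v = stats["count"]
stats["weighted"] = (v * stats["mean"] + min_votes * global_mean) / (v + min_votes)
print(stats.sort_values("weighted", ascending=False))
```

With this toy data the plain mean ranks film A first, while the damped mean ranks film B first because its score rests on more ratings.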

I then repeated the process above to compare the Superhero movie ratings against the average rating. This shows, as expected, an increase in the number of Superhero films released post-2000, which is likely a response to the arrival of the MCU and the pre-DCU films.

Superhero Tagged Films

One important consideration is the sample size of the data, so let’s next take a look at the number of reviews. The figure shows that, although there is a range of films to sample from, a significant number of the rating scores are based on fewer than 10 user reviews. This will have an impact on any result and needs to be taken into account.

The figure is a scatter plot of the ratings count for each film across all users. As you can see, some of the movies have not received a large number of ratings.
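The sample-size check described above can be sketched as follows. The titles and ratings are invented for illustration, and the name min_ratings is hypothetical; only the threshold of 10 comes from the text.

```python
import pandas as pd

# Made-up ratings: one film with 12 ratings and one with only 3.
ratings = pd.DataFrame({
    "title": ["Film X"] * 12 + ["Film Y"] * 3,
    "rating": [4.0] * 12 + [5.0] * 3,
})

# Mean rating and number of ratings per film.
summary = ratings.groupby("title")["rating"].agg(["mean", "count"])

# Flag films whose mean is based on fewer than 10 ratings.
min_ratings = 10
summary["small_sample"] = summary["count"] < min_ratings

# Keep only the films with enough ratings to compare.
reliable = summary[~summary["small_sample"]]
print(summary)
```

Film Y has the higher mean here, but it rests on three ratings, so it is dropped from the reliable set.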

DC v Marvel – Up to 2015

Let’s take a look at the visualisation of DC v Marvel films, up to and including 10/2015, when the data set was last updated. The figure shows the DC (red) and Marvel (blue) distribution.

The data therefore does not include the following films:

Marvel – Avengers: Age of Ultron (2015), Ant-Man (2015), Captain America: Civil War (2016), Doctor Strange (2016), Guardians of the Galaxy Vol. 2 (2017), Spider-Man: Homecoming (2017), Thor: Ragnarok (2017), Black Panther (2018), Avengers: Infinity War (2018), Ant-Man and the Wasp (2018).

DC – Batman v Superman: Dawn of Justice (2016), Suicide Squad (2016), Wonder Woman (2017), Justice League (2017).

It is expected that including the films listed above would show a significant increase in the mean ratings for the Marvel tagged films. It would also show significantly more Marvel film releases than DC ones. Note that the tags include animated films and spin-offs that sit outside the shared universes, for example the animated DC film Justice League: The Flashpoint Paradox.

Let’s next take a look at how many films each label has released. Remember that these counts are based on user tags, so they include non-universe films.

This shows that films tagged ‘marvel’ have continued to be released in significantly greater numbers than those tagged ‘DC’.
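A count like this can be sketched as below. The frames are small made-up stand-ins for tag.csv and movie.csv, and the exact, case-insensitive match on the tags 'marvel' and 'dc' is an assumption about how the labels were selected.

```python
import pandas as pd

# Made-up stand-ins for tag.csv and movie.csv.
tags = pd.DataFrame({
    "movieId": [1, 1, 2, 3, 3, 4],
    "tag": ["Marvel", "marvel", "DC", "marvel", "superhero", "dc"],
})
movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Iron Man (2008)", "Dark Knight, The (2008)",
              "Avengers, The (2012)", "Man of Steel (2013)"],
})

# Normalise the case, keep only the two labels, and count each film once per
# label even if several users applied the same tag.
tags["label"] = tags["tag"].str.lower()
labelled = tags[tags["label"].isin(["marvel", "dc"])]
labelled = labelled.drop_duplicates(subset=["movieId", "label"])

# Add the release year taken from the bracketed part of the title.
movies["year"] = pd.to_numeric(
    movies["title"].str.extract(r"\((\d{4})\)", expand=False)
)
labelled = labelled.merge(movies[["movieId", "year"]], on="movieId", how="inner")

# Number of distinct films per label, overall and per release year.
films_per_label = labelled.groupby("label")["movieId"].nunique()
films_per_year = (
    labelled.groupby(["year", "label"])["movieId"].nunique().unstack(fill_value=0)
)
print(films_per_label)
print(films_per_year)
```

Dropping duplicate tag applications matters here, because otherwise a film tagged ‘marvel’ by many users would be counted many times.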

 


This post was written by noxford
