IMDB vs Metascore Distributions

Published October 14, 2018 3:28 pm

It’s a Sunday afternoon and I want to escape and watch a film. Which one do I watch? There are a lot of factors to consider, such as the talent connected to the film, the budget, a review, a trailer or the ratings attached to the movie.

If you’ve not watched the film, you tend to avoid spoilers, and do you really want to spend time reading about a film you are just about to watch? Whether a review is negative or positive, it tends to become ingrained in your memory and is difficult to forget.

Therefore, a large proportion of people tend to favour a numeric rating system. However, I couldn’t help but notice that the ratings for the same film often differ considerably between sites. In this article, I will compare the distributions of IMDB and Metascore movie ratings to see if there’s anything interesting.

Scraping the Data

The first step is to scrape the data from IMDB. You’ll notice that the IMDB listing pages, organised by year, contain the title, the IMDB rating and also the Metascore rating.

As you can see, the film ‘Venom’ shows a distinct difference between the IMDB rating (7.1 out of 10, or 71 on a 100-point scale) and the Metascore of 35.
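To compare the two scores directly, the IMDB rating (out of 10) can be put on the Metascore’s 100-point scale by multiplying it by 10. Here is a minimal sketch of that step; the function names are my own illustrative choices, and the Venom figures are the ones quoted above.

```python
def imdb_to_100(imdb_rating):
    """Rescale an IMDB rating (0-10) to the 0-100 Metascore scale."""
    return round(imdb_rating * 10)


def rating_gap(imdb_rating, metascore):
    """Points by which the rescaled IMDB rating sits above the Metascore."""
    return imdb_to_100(imdb_rating) - metascore


# Venom, using the figures quoted in this article
print(imdb_to_100(7.1))     # 71
print(rating_gap(7.1, 35))  # 36
```

A positive gap means the IMDB audience rated the film higher than the critics did.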

Distribution Values

Firstly, let’s take a look at how the IMDB and Metascore values are distributed. We would expect to see a normal distribution, whereby most of the values sit in the middle and a few sit at the extremes. See the normal (Gaussian) distribution below:

Movie ratings should reflect movie quality. From my own experience, some of the films I have seen are outstanding and I would re-watch them; a few are really bad and I instantly regretted watching them; and the rest are average, to the point where I may struggle to remember what happened.

I have collected data for 5,146 movies released between 1927 and 2018. I only kept a film if it had both a Metascore and an IMDB rating.

Let’s take a look:

Firstly, it can be seen that the Metascore histogram shows a classic normal distribution pattern. The IMDB histogram also shows the majority of ratings in the average area, but note how the ratings are skewed towards the higher end of the graph. Further to this, look at how the lower end of the ratings is empty.
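One way to put a number on this kind of asymmetry is the skewness of the ratings. The sketch below is only an illustration: the `skewness` helper and the two toy samples are hypothetical, and I have not run it on the scraped data.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores.

    Close to 0 for a symmetric sample, negative when the long tail
    points towards low values, positive when it points towards high values.
    """
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Hypothetical toy samples, NOT the scraped ratings
symmetric = [30, 40, 50, 50, 50, 60, 70]
bunched_high = [35, 60, 65, 70, 70, 75, 75]

print(skewness(symmetric))     # approximately 0
print(skewness(bunched_high))  # negative: values bunch high, tail runs low
```

Applied to the two rating columns, a value near zero would support the bell-shaped reading of a histogram, while a clearly non-zero value would indicate a lopsided one.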

Let’s take a look at a scatter graph comparing each film’s IMDB rating with its Metascore.

At the start of this article, I would have expected a straight-line relationship between the two data sets. However, as we now know that the IMDB data set is skewed towards the upper values, we wouldn’t expect this to be the case.

Pearson’s r between the IMDB and Metascore ratings is +0.63. A value of +1.0 means a perfect positive correlation and -1.0 means a perfect negative correlation, so +0.63 indicates a positive but far from perfect relationship.
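For reference, Pearson’s r can be computed from scratch in a few lines. This is a minimal sketch with a hypothetical helper name (`pearson_r`) and made-up toy values, not the code or data behind the +0.63 figure.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values, NOT the scraped ratings
print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfect positive relationship
print(pearson_r([1, 2, 3], [6, 4, 2]))  # perfect negative relationship
```

With the ratings already in a pandas DataFrame, `movie_ratings['imdb'].corr(movie_ratings['metascore'])` should give the same statistic, since pandas uses Pearson’s method by default.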

Let’s take a look:

Taking this information into account, I would recommend using Metascore when looking for a movie rating.

The Metascore is a weighted average of scores from reputed critics. You can read more about how their system works here. To finish, there are a number of points that need to be taken into account when using Metascore. Firstly, Metascore was created in 1999 and therefore any film before this may not have a score or rating attached.

Python 3 code:

from time import sleep, time
from random import randint
from warnings import warn

from requests import get
from bs4 import BeautifulSoup
from IPython.display import clear_output

# Result pages to fetch per year, and the release years to cover
pages = [str(i) for i in range(1,5)]
years_url = [str(i) for i in range(1900,2019)]

# Ask for English-language pages
headers = {"Accept-Language": "en-US, en;q=0.5"}

# Redeclaring the lists to store data in
names = []
years = []
imdb_ratings = []
metascores = []
votes = []
certificates = []
genres = []

# Preparing the monitoring of the loop
start_time = time()
requests = 0

# For every year in the interval 1900-2018
for year_url in years_url:

    # For every page in the interval 1-4
    for page in pages:

        # Make a get request
        response = get('http://www.imdb.com/search/title?release_date=' + year_url +
                       '&sort=num_votes,desc&page=' + page, headers = headers)

        # Pause the loop
        sleep(randint(8,15))

        # Monitor the requests
        requests += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait = True)

        # Throw a warning for non-200 status codes
        #if response.status_code != 200:
        #    warn('Request: {}; Status code: {}'.format(requests, response.status_code))

        # Break the loop if the number of requests is greater than expected
        #if requests > 72:
        #    warn('Number of requests was greater than expected.')
        #    break

        # Parse the content of the request with BeautifulSoup
        page_html = BeautifulSoup(response.text, 'html.parser')

        # Select all the 50 movie containers from a single page
        mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

        # For every movie of these 50
        for container in mv_containers:
            # If the movie has a Metascore, then:
            if container.find('div', class_ = 'ratings-metascore') is not None:

                # Scrape the name
                name = container.h3.a.text
                names.append(name)

                # Scrape the year
                year = container.h3.find('span', class_ = 'lister-item-year').text
                years.append(year)

                # Scrape the cert
                #certificate = container.h3.find('span', class_ = 'certificate').text
                #certificates.append(certificate)

                # Scrape the genre
                #genre = container.h3.find('span', class_ = 'genre').text
                #genres.append(genre)

                # Scrape the IMDB rating
                imdb = float(container.strong.text)
                imdb_ratings.append(imdb)

                # Scrape the Metascore
                m_score = container.find('span', class_ = 'metascore').text
                metascores.append(int(m_score))

                # Scrape the number of votes
                vote = container.find('span', attrs = {'name':'nv'})['data-value']
                votes.append(int(vote))

import pandas as pd
movie_ratings = pd.DataFrame({'movie': names,
                              'year': years,
                              'imdb': imdb_ratings,
                              'metascore': metascores,
                              'votes': votes})

movie_ratings = movie_ratings[['movie', 'year', 'imdb', 'metascore', 'votes']]
# Put the IMDB rating on the same 0-100 scale as the Metascore
movie_ratings['n_imdb'] = movie_ratings['imdb'] * 10

movie_ratings.to_csv('movie_ratings.csv')
#print(movie_ratings.head())

import matplotlib.pyplot as plt
#%matplotlib inline

fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (16,4))
ax1, ax2, ax3 = fig.axes

ax1.hist(movie_ratings['imdb'], bins = 10, range = (0,10)) # bin width = 1
ax1.set_title('IMDB rating')

ax2.hist(movie_ratings['metascore'], bins = 10, range = (0,100)) # bin width = 10
ax2.set_title('Metascore')

# Each histogram needs a label so that the legend has entries to show
ax3.hist(movie_ratings['n_imdb'], bins = 10, range = (0,100), histtype = 'step', label = 'IMDB (x10)')
ax3.hist(movie_ratings['metascore'], bins = 10, range = (0,100), histtype = 'step', label = 'Metascore')
ax3.legend(loc = 'upper left')
ax3.set_title('The Two Normalized Distributions')

# Hide the top and right borders on every plot
for ax in fig.axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Needed when running as a script rather than in a notebook
plt.show()


This post was written by noxford
