The Movie Review Dataset is a set of film reviews gathered by Bo Pang and Lillian Lee in the early 2000s. They released the data from their research as v2.0 in 2004; it contains 1,000 positive and 1,000 negative film reviews that originated from IMDB.com.
The data has been cleaned so that it contains only English reviews, the text has been converted to lowercase, white space has been added around punctuation, and the text has been split so that each line holds one sentence.
This article will cover a natural language processing model called the bag-of-words model.
According to Wikipedia:
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.
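To make the definition concrete, here is a minimal sketch of the representation using two made-up toy documents (not taken from the dataset): each document becomes a vector of word counts over a shared vocabulary, discarding grammar and word order but keeping multiplicity.

```python
from collections import Counter

# Two toy "documents"; the vocabulary is the union of their words.
docs = ["the film was a good film", "the movie was bad"]
vocab = sorted(set(word for doc in docs for word in doc.split()))

# Each document becomes a vector of word counts over the shared vocabulary,
# ignoring grammar and word order but keeping multiplicity.
def bag_of_words(doc, vocab):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Note how "film" appears twice in the first document, so its slot in that vector holds a 2, while word order is lost entirely.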
Using Python (code below), we can identify the vocabulary within the dataset. The dataset contains 46,557 unique words across all reviews.
The top three words are: 'film' (8,860), 'movie' (5,540), and 'one' (5,521).
It could be said that the least popular words contribute little to any predictive model, and the same could be said of the most popular ones. Taking this into account, I filtered out the least used words, setting a minimum occurrence of 5, which reduced the number of unique words from 46,557 to 14,803.
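The filtering step above is a one-line comprehension over the vocabulary counter. A small sketch with illustrative (made-up) counts standing in for the full review vocabulary:

```python
from collections import Counter

# Toy word counts standing in for the full 46,557-word vocabulary
# (the counts here are illustrative, not taken from the dataset).
vocab = Counter({'film': 8860, 'movie': 5540, 'one': 5521, 'zestful': 2, 'zany': 1})

# Keep only words that occur at least min_occurrence times.
min_occurrence = 5
tokens = [word for word, count in vocab.items() if count >= min_occurrence]
```

On the real dataset the same comprehension is what shrinks the vocabulary from 46,557 words to 14,803.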
The top 15 words within this list are:
So, what could be done with this now?
N-grams could also be examined: sequences of n consecutive words, for example 'Justice League' (a 2-gram, or bigram). These could then be used to model words in context.
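As a sketch, n-grams can be extracted by sliding a window of length n over a token list. The token list below is a made-up example, not drawn from the dataset:

```python
# Toy token list for illustration only.
tokens = ['justice', 'league', 'was', 'released']

def ngrams(tokens, n):
    # slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# n=2 gives the bigrams, pairing each word with the one that follows it
bigrams = ngrams(tokens, 2)
```

With n=2 this keeps 'justice league' together as a single feature, which a plain bag of words would split apart.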
Python code used in this article:
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
    # load doc
    doc = load_doc(filename)
    # clean doc
    tokens = clean_doc(doc)
    # update counts
    vocab.update(tokens)

# load all docs in a directory
def process_docs(directory, vocab):
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip files that do not have the right extension
        if not filename.endswith('.txt'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # add doc to vocab
        add_doc_to_vocab(path, vocab)

# save list to file
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# define vocab
vocab = Counter()
# add all docs to vocab (the v2.0 archive extracts to txt_sentoken/neg and txt_sentoken/pos)
process_docs('txt_sentoken/neg', vocab)
process_docs('txt_sentoken/pos', vocab)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
# keep tokens that occur at least 5 times
min_occurrence = 5
tokens = [k for k, c in vocab.items() if c >= min_occurrence]
print(len(tokens))
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')
This post was written by noxford