How The Green Mile is like American History X

I built a film recommendation engine, and now you can too

Posted by RDJ on January 16, 2016

Header image is Birdseye Photography of City Buildings Near Trees and Mountains by Felix Mittermeier, used under the CC0 license

People who know me know that I love movies. I always have. Whether as a child in the downstairs living room, in a packed theatre with friends, at a drive-in in the back of a pick-up truck, or entirely relaxed on my own, I can watch them in all kinds of places with all kinds of crowds. There’s something for every mood.

The Beauty of Film

And the types of movies! Over the years I’ve had to branch further out, and I’m so glad for that. Stylized heist films, love served with loss, the wonder that is Denmark, dramas with personality, creepy and darkly funny fantasies…I’m down for anything original with story, pace, and attitude. Not to forget, there’s a special place in my heart for the cops and bad guys, war heroes, and macho westerns on which I was raised.

Why do we like movies? There’s a social aspect, yet many of us will gladly watch them on our own. It’s because we love a good story. Those of us who appreciate film the most see it as an art form of incredible complexity and beauty. The best films have been cherished and nurtured by their creators: passion poured into their stories, depth and purpose imbued in their characters and rendered skillfully by their actors, and every frame made a painting by the cinematographer’s deliberate framing and lighting. When it works, a great film is a piece of art composed by hundreds of creative and talented collaborators.

Great stories are where we find great heroes

So why film lists? I’m talking about things such as the IMDb Top 250 and Roger Ebert’s Great Movies. A list of films is just a grouping, which might have a ranking but often does not. For many like myself, they’re a way of figuring out what to watch through association with other films. It’s possible this incarnation will become obsolete as personalized recommendation engines fulfill the purpose better and better. However, that wasn’t the case in the mid-2000s, when I first became interested in these lists.

Lists are a good thing for people who are a bit obsessive. When I first made a serious hobby out of movies and found the IMDb Top 250, it felt like there was so much out there that I had missed. At some point I wrote a little program, imdb.pl, to help me track my progress (for the curious, I’ve been around 190 for a long time now). Some time later the website icheckmovies.com, or ICM, emerged, doing the same thing far more thoroughly and far better, and imdb.pl was history. That’s creative destruction in action.

Getting back into the list game

Following up on some curiosities, I decided to get my hands on data again. iCheckMovies, the site that killed imdb.pl, is now the very thing that enabled me to do it again, but better. I scraped all of the lists on ICM and put them into a local database. This time I’m sharing all of my code on GitHub, so readers can follow along easily.

Here are the pieces:

- the SQLite database schema
- the scrape-and-store scripts that pull each list from ICM
- the SQL queries and Jupyter/IPython notebooks used for the analysis below

Let’s talk briefly about how some of it works, since this is a pretty common and powerful analysis paradigm. First I plan out how I want to store my data (the DB schema). Planning out my data requires clarity about what questions I want to answer, what data I need to answer them, and the strongest way to represent that data. I chose SQLite because this is a single-user application with minimal data requirements and no real-time needs, so storing the database as a file is appropriate. Then I gather the data through what I call scrape-and-store, which more generically could be described as an ETL pipeline. I use Beautiful Soup to parse out the data, since it can use CSS selectors to isolate information and iterate through items programmatically.
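
To make that concrete, here’s a minimal sketch of the scrape-and-store step. The table layout and the CSS selector are my own illustration of the pattern, not necessarily what the actual repo uses:

```python
import sqlite3

import requests
from bs4 import BeautifulSoup

# A single-file SQLite database: films, lists, and a join table between them.
conn = sqlite3.connect("icm.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS films (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE IF NOT EXISTS lists (id INTEGER PRIMARY KEY, url TEXT UNIQUE, name TEXT);
CREATE TABLE IF NOT EXISTS film_lists (
    film_id INTEGER REFERENCES films(id),
    list_id INTEGER REFERENCES lists(id),
    PRIMARY KEY (film_id, list_id)
);
""")

def scrape_list(url):
    """Yield (title, link) for each film on one list page.

    The CSS selector below is a placeholder; the real markup may differ.
    """
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for anchor in soup.select("ol li h2 a"):
        yield anchor.get_text(strip=True), anchor["href"]
```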

After acquiring the data, I switched gears (and tools) to analyze it. I used a simple SQL client to perform some basic analysis and start to explore more complicated ideas. Then I moved to Jupyter/IPython. Switching from writing Python programs to writing Python snippets comes naturally here. The programs had defined purposes: specific outputs they were intended to create. But analysis is open-ended, so we need tools that let us evaluate hypotheses quickly and store the results. Jupyter lets us interlace code and output, so that we can tell the story of what we learned and how we progressed through the analysis.
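
As an example of the kind of snippet that lives comfortably in a SQL client or a Jupyter cell, here’s the aggregate behind the “films in the most lists” table further down. It assumes the hypothetical schema sketched above:

```python
import sqlite3

conn = sqlite3.connect("icm.db")
query = """
SELECT f.title, f.year, COUNT(*) AS n_lists
FROM films f
JOIN film_lists fl ON fl.film_id = f.id
GROUP BY f.id
ORDER BY n_lists DESC
LIMIT 20;
"""
for title, year, n_lists in conn.execute(query):
    print(title, year, n_lists)
```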

Some things I learned

Want to see an ugly error message? Try putting an extra slash in any iCheckMovies URL.

Usually when something makes you think your code has a bug, you’re right. But it’s nice to be surprised! There’s a movie that ICM thinks is named <—> (Yes, like that, with the HTML ampersand character entities). The link doesn’t work for it, so all I know is the two lists it is on and that it was released in 1969.

1969, known for the Apollo 11 moon landing, and the film <—>

Image from NASA

Most common tags for lists:

Tag          #
country     41
imdb        38
website     34
award       21
institute   17
critics     14
misc         5
director     4

Top 10 most common tags for films:

Tag            #
drama       7693
comedy      3040
crime       1780
romance     1751
short       1593
action      1337
documentary 1192
horror      1120
thriller    1118
adventure   1059

Films in the most lists. It sure seems like it takes a while to get on a lot of lists:

Film                              Year   #
2001: A Space Odyssey             1968  40
Lawrence of Arabia                1962  39
Vertigo                           1958  38
The Godfather                     1972  36
Citizen Kane                      1941  35
Ladri di biciclette               1948  35
Psycho                            1960  35
Star Wars                         1977  35
8½                                1963  34
M                                 1931  34
Sunrise: A Song of Two Humans     1927  34
A Clockwork Orange                1971  33
Casablanca                        1942  33
Apocalypse Now                    1979  32
On the Waterfront                 1954  32
One Flew Over the Cuckoo’s Nest   1975  32
Pulp Fiction                      1994  32
Sen to Chihiro no kamikakushi     2001  32
Singin’ in the Rain               1952  32
Sunset Blvd.                      1950  32

We can check whether that conclusion holds. Here are the years with the most films listed. It seems like the early 70’s and the 2000’s produced a lot of listable films:

Year    #
1972  226
1969  224
2004  223
1968  222
1971  222
2002  222
2000  211
2001  210
1970  209
2003  208
2006  206
2007  201
1967  199
1986  199
1965  196
1989  196
1990  196
2005  196
1988  194
2008  192

Here’s a chart of progression by year. You can see that the last few years drop off, so we can expect their counts to climb as those films have more time to make lists. The late 60’s were peak list popularity.

Films over time
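
A chart like that takes only a few lines. Here’s a sketch, again against the hypothetical schema and reusing the conn from the earlier snippets:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Count of listed films per year, straight from the database.
years = pd.read_sql(
    "SELECT year, COUNT(*) AS n FROM films GROUP BY year ORDER BY year", conn
)

plt.plot(years["year"], years["n"])
plt.xlabel("Year")
plt.ylabel("Films on lists")
plt.title("Films over time")
plt.show()
```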

What happens if we weight the years by how many lists each film is in? The 60’s are looking even more popular.

Year    #
1971  594
2000  592
1968  579
2001  579
1972  573
2002  566
1967  561
1964  560
1973  535
1966  527
1969  526
1960  525
1962  525
2004  522
1988  521
1955  511
1959  511
1982  508
2003  508
1957  506

You know what they say, the first list is the hardest.
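
The weighted tally is a small twist on the per-year count: instead of counting each film once, count one row per (film, list) membership. A sketch against the same hypothetical schema:

```python
# Unweighted would be COUNT(DISTINCT f.id); weighted counts one row per
# film-list pairing, so films on many lists count more.
weighted_query = """
SELECT f.year, COUNT(*) AS weight
FROM films f
JOIN film_lists fl ON fl.film_id = f.id
GROUP BY f.year
ORDER BY weight DESC
LIMIT 20;
"""
for year, weight in conn.execute(weighted_query):
    print(year, weight)
```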

Power of film

Pairwise evaluation of films

Ranking the popularity of films and years is fun and easy, but we can reach more powerful insights by analyzing the connections more deeply. Every time two films appear on the same list, that creates a connection between them, bound together by a classification or a critic’s seal of approval.

Evaluating these pairwise relationships is harder, because it involves a calculation for every pair of films or lists, which does not scale nicely with size. In this case, with 14,159 films, there are 100,231,561 such pairs to evaluate.

It took me a while to count that high

Image from Nate Steiner, in the public domain

That count comes from the fact that evaluating N films gives us (N*N - N)/2 comparisons. N*N is the size of the full comparison matrix; we subtract N because we exclude the diagonal, where each film is paired with itself; and we divide by two because order doesn’t matter in a pair and we don’t want to double-count.
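
A quick sanity check on that arithmetic:

```python
n = 14159
print((n * n - n) // 2)  # 100231561
```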

One way to evaluate this is with cluster analysis: treat each list or each film as an observation, and the films on the list, or the lists the film is on, as the attributes. Given the number of single-list films, this is a fairly sparse matrix. Because of the sparsity, I chose to look at a subset of films, selecting those that appear on at least 10 lists. This is just 586 of the 14,159 total films, yet those 586 films appear on 157 of the 174 lists. This drastically reduces the dimensionality of our data without much reducing the connections we want to evaluate.
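
Here’s one way to build that film-by-list matrix and apply the 10-list cutoff, pulling membership rows from the hypothetical schema above:

```python
import pandas as pd

# One row per (film, list) membership, reusing the conn from earlier.
memberships = pd.read_sql("SELECT film_id, list_id FROM film_lists", conn)

# Binary film-by-list matrix: 1 if the film appears on the list, else 0.
matrix = pd.crosstab(memberships["film_id"], memberships["list_id"]).clip(upper=1)

# Keep films on at least 10 lists, then drop lists none of them touch.
popular = matrix[matrix.sum(axis=1) >= 10]
popular = popular.loc[:, popular.sum(axis=0) > 0]
print(popular.shape)  # roughly (586, 157), per the numbers above
```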

If we perform some simple k-means clustering on the films, we get some interesting clusters. I did it with 10 clusters (the cluster count is a parameter to k-means). More nuanced cluster analysis is possible, but I didn’t feel it was worth going into much more detail without gathering more information about the films.
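
A minimal sketch of that clustering with scikit-learn, reusing the filtered matrix from the previous snippet (the cluster numbering in the table below is arbitrary, as k-means labels always are):

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=10, random_state=0)  # 10 clusters, fixed seed for repeatability
labels = km.fit_predict(popular.values)

# Cluster sizes, for a table like the one below.
for cluster_id in range(10):
    print(cluster_id + 1, (labels == cluster_id).sum())
```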

Cluster  Size  Arbitrary Description
1          25  I’m not sure!
2          50  Science fiction and fantasy
3         146  Classics
4         155  Foreign
5          26  Classics
6          95  90’s-now and highest rated
7          17  Italian
8          24  Animated
9           8  Documentary
10         40  Very old

Some of those are surprisingly clean-cut, while clusters 3 and 5 seem close together to me. My running hypothesis is that cluster 3 is predominantly 30’s and 40’s films, while cluster 5 is heavy on the 50’s and 60’s. This would fit my impression (although validating your own biases is a dangerous thing to do…) that many of these lists are grouped more by time period than by anything else.

We can also perform a similar analysis across lists rather than films. Here I use the correlation coefficient (for binary variables like these, this is the phi coefficient) to evaluate similarity across all lists. I’m not surprised that They Shoot Zombies, Don’t They? and Top 500 Horror Movies have a coefficient of 0.63, that Animation and Family have 0.58, or that 100 Korean Films and 100 Korean Masterpieces have 0.56. But I am surprised by the one pairing much higher on the list than all of those: the Sight and Sound 2012 Combined List might as well just adopt They Shoot Pictures, Don’t They?, since the two have a ridiculous correlation of 0.79.
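
Since list membership is a 0/1 variable, the phi coefficient is just the Pearson correlation computed on those binary columns, so numpy’s corrcoef does the work. A sketch using the full film-by-list matrix built earlier:

```python
import numpy as np

# Columns are lists; Pearson correlation on binary columns is phi.
phi = np.corrcoef(matrix.values, rowvar=False)

# Report the ten most-correlated list pairs (upper triangle avoids
# self-pairs and duplicates).
rows, cols = np.triu_indices_from(phi, k=1)
top = np.argsort(phi[rows, cols])[::-1][:10]
for idx in top:
    i, j = rows[idx], cols[idx]
    print(matrix.columns[i], matrix.columns[j], round(phi[i, j], 3))
```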

It seems the IMDb Top 250 is something to imitate: the Reddit Top 250, Fok! Top 250, and FilmTotaal Top 250 have correlations of 0.61, 0.60, and 0.54 with it, respectively.

List 1 List 2 phi abs_phi
/lists/sight+and+sound+2012+-+combined+list/ /lists/they+shoot+pictures+dont+they/ 0.78909 0.78909
/lists/they+shoot+zombies+dont+theyquestion/ /lists/top+500+horror+movies/ 0.631314 0.631314
/lists/filmtotaal+top+250/ /lists/fokexclamation+top+250/ 0.625386 0.625386
/lists/reddit+top+250/ /lists/top+250/ 0.61445 0.61445
/lists/silent+but+not+forgotten+the+best+silen… /lists/the+top+300+silent+era+films/ 0.605662 0.605662
/lists/fokexclamation+top+250/ /lists/top+250/ 0.600954 0.600954
/lists/fokexclamation+top+250/ /lists/reddit+top+250/ 0.59405 0.59405
/lists/1001+movies+you+must+see+before+you+die/ /lists/they+shoot+pictures+dont+they/ 0.58028 0.58028
/lists/animation/ /lists/family/ 0.578512 0.578512
/lists/filmtotaal+top+250/ /lists/reddit+top+250/ 0.573651 0.573651

It’s perhaps worth noting that very few lists have a negative correlation with each other. For this reason, Doubling the Canon is either a good list to use or a good list to avoid: it’ll get you films that are less common on other lists, for better or worse.

The next thing I looked at was more direct pairwise comparison of films. This is probably more useful, since finding lists is not a likely end goal for many people, but finding films is a common task.

I go about this by computing the chi-squared kernel using scikit-learn. This gives us a number between 0 and 1 for every pairwise comparison, with 1 indicating a perfect match. For my features I’m using which lists a film is on; I’m not even bothering to include the film tags, list tags, or year.
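
A sketch of that computation. I’m applying it to the filtered matrix from earlier to keep the pair count manageable; whether the original run used all films or the subset, this shows the shape of the idea:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

# Pairwise film similarities from binary list-membership features.
# Identical rows score exactly exp(0) = 1; values fall in (0, 1].
K = chi2_kernel(popular.values, gamma=1.0)

# Each film's most similar other film (zero the diagonal first).
np.fill_diagonal(K, 0.0)
most_similar = K.argmax(axis=1)
```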

For the second time today, I’ve been given the small blessing of thinking my code had a bug when it (surprisingly) did not. There were two films on the exact same 11 lists: The Olympiad and, surprise surprise, The Olympiad Part 2. I haven’t seen either, but I have a feeling that if I watch the first I’ll be compelled to watch the second.

Conclusions

And this, finally, brings us to our title question: how is The Green Mile like American History X? If you evaluate by lists, each is the other’s most similar film. The Green Mile is on 13 lists and American History X is on 11, with 9 lists of overlap! They might not seem stylistically connected or share the same themes, but they have been strongly grouped together.

I’ve covered a lot in this piece, from my love of movies to how to do pairwise correlations using film list data. We can get meaningful insight out of basic analysis with SQL, and more if we venture into more powerful tools. We have to stay flexible and adjust our techniques to the kind of data we have available.

Even without gathering any real data about the films themselves, not even their genres or years, we can connect them based on how others have grouped them. What I’ve shown here is the basis of a very simple recommendation engine. This is a powerful application, one used to great effect by organizations like Netflix and Amazon. Instead of a finite set of published lists, we could treat each user’s preferences as a list of its own, giving us a dynamic recommendation engine that changes with the audience. It could then be personalized to each user by weighting other users according to their similarity with the user in question. With immense data, the tools to analyze it, and time, we can create great work.
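
As a closing sketch of that user-as-list idea: treat a user’s checked films as one more binary “list” column, then rank films by kernel similarity to the films the user has already checked. Everything here is hypothetical naming layered on the earlier film-by-list matrix, not a production recommender:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

def recommend(user_film_ids, matrix, top_n=10):
    """Rank unseen films by average kernel similarity to the user's films.

    `matrix` is the binary film-by-list DataFrame from earlier; the user's
    preferences become one extra binary column, as described above.
    """
    user_col = matrix.index.isin(user_film_ids).astype(int)
    features = np.column_stack([matrix.values, user_col])
    # Similarity of every film to each of the user's films, averaged.
    scores = chi2_kernel(features[user_col == 1], features).mean(axis=0)
    ranked = np.argsort(scores)[::-1]
    return [matrix.index[i] for i in ranked
            if matrix.index[i] not in user_film_ids][:top_n]
```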