Udacity Project — Investigate a Dataset
As part of my goal to change careers to enter the data field, I’m going back to school at Western Governors University for their Data Management/Data Analytics Bachelor’s Degree. The later classes in the program take you through Udacity’s Data Analyst Nanodegree, broken up into 5 classes. This is the project for class 1 of 5, the Intro to Data Analysis class.
For the project, you are given the choice of 5 data sets to look at: TMDb Movie Data, No-show appointments, Gapminder World, Soccer Database, and FBI Gun Data. I chose the TMDb Movie Data to look into. Who doesn’t like movies, right?
The csv file contains data on movies released from 1960 to the middle of 2015. For each film it includes popularity, budget, revenue, cast, director, runtime, genres, and release date, along with inflation-adjusted numbers for budget and revenue.
After importing all the usual suspects into Jupyter (along with a little trick to get numbers to show as floats instead of scientific notation), I loaded the csv into a dataframe. From there I started looking at the data to see what I was working with and what needed to be cleaned up. First, I took a look at the size of the dataset and the summary statistics for some of the columns.
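That first look might be sketched roughly like this. The csv itself isn't reproduced here, so a tiny made-up dataframe stands in for it, and the column names (release_year, budget, revenue, runtime) are assumptions about how the file is laid out.

```python
import pandas as pd

# Show floats with two decimals instead of scientific notation.
pd.set_option("display.float_format", "{:.2f}".format)

# Hypothetical stand-in for the TMDb csv. With the real file this would be
# something like pd.read_csv(<path to the csv>) instead.
df = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film C", "Film D"],
    "release_year": [1961, 1985, 2009, 2015],
    "budget": [0, 15000000, 237000000, 0],
    "revenue": [0, 80000000, 2500000000, 1200000],
    "runtime": [120, 0, 162, 2],
})

# Size of the dataset as (rows, columns).
print(df.shape)

# Count, mean, std, min, quartiles and max for each numeric column.
summary = df.describe()
print(summary)

# Numeric columns whose minimum is 0 are candidates for disguised missing data.
zero_min_cols = summary.columns[summary.loc["min"] == 0].tolist()
print(zero_min_cols)
```

Listing the columns with a minimum of 0 is just one quick way to spot values that are probably placeholders rather than real measurements.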
Here we see there are 10,866 rows and 21 columns. Some columns have a minimum value of 0, which suggests we have some missing data. Specifically, there are some films where the budget, revenue, and runtime are 0. I dropped the budget columns. For the runtime, I decided to replace the entries of 0 with the average for the column.
I see there is still a movie with a runtime of 2 minutes. I left this as someone could have made a short animated film. This could have been looked into further to see why the movie was so short. It could be another issue to clean up, but for now, I’ll assume that it is an accurate runtime.
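A minimal sketch of that cleanup, on made-up data. The column names budget and budget_adj are assumptions, and so is the choice to average only the non-zero runtimes; averaging the whole column, zeros included, would pull the fill value down.

```python
import pandas as pd

# Hypothetical sample; the real dataframe has many more rows and columns.
movies = pd.DataFrame({
    "budget": [0, 15000000, 237000000],
    "budget_adj": [0.0, 30000000.0, 240000000.0],
    "runtime": [0, 100, 140],
})

# Drop the budget columns (names assumed).
movies = movies.drop(columns=["budget", "budget_adj"])

# Average of the recorded runtimes, ignoring the 0 placeholders (assumption).
mean_runtime = movies.loc[movies["runtime"] > 0, "runtime"].mean()

# Replace the 0 entries with that average; cast to float first so the
# column can hold a non-integer mean.
movies["runtime"] = movies["runtime"].astype(float).replace(0, mean_runtime)

print(movies)
```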
Exploratory Data Analysis
My first thought was to look at how many movies are released each year. Looking at the information below, the early 1960’s started out with around 30–40 movies released each year. By the end of the dataset, we’re looking at 600–700 movies released each year once we get into the 2010's.
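Counting releases per year is a one-line groupby in pandas. The sketch below uses a handful of made-up rows and assumes the year lives in a release_year column; the name movie_by_year matches the variable mentioned in the conclusions, though how it was actually built is a guess.

```python
import pandas as pd

# Hypothetical sample with one row per movie.
df = pd.DataFrame({
    "release_year": [1960, 1960, 1961, 2014, 2014, 2014, 2015],
})

# Number of movies released in each year, indexed by year in ascending order.
movie_by_year = df.groupby("release_year").size()

print(movie_by_year.head())  # earliest years
print(movie_by_year.tail())  # latest years

# A line chart of the same series: movie_by_year.plot(kind="line")
```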
What does this look like in chart form? I’ll show you:
This shows us that the first year with over 100 films released came in the early 1980’s, the first with over 200 in the mid 1990’s, and the first with over 500 in the late 2000's.
I also wanted to see how average runtimes have changed over the years.
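The average runtime per year follows the same groupby pattern, taking the mean of a column instead of a count. Again, the data and the column names here are made up for illustration.

```python
import pandas as pd

# Hypothetical sample; runtimes are in minutes.
df = pd.DataFrame({
    "release_year": [1960, 1960, 2015, 2015],
    "runtime": [130, 110, 100, 94],
})

# Mean runtime of the movies released in each year.
avg_runtime_by_year = df.groupby("release_year")["runtime"].mean()

print(avg_runtime_by_year)
```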
Again in chart form:
This shows us that movies in the 1960’s started around the 2 hour mark on average, and slowly went down (apart from a few years where the average spiked up or down) to around a 97 minute average runtime in 2015. Must be getting harder to hold people’s attention as the years go by.
Next, I wanted to see what the relationship between revenue and release year looked like. I made a function to use for this chart and the one after it, then created a scatter plot.
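A reusable scatter helper could look something like the sketch below. The function name scatter_plot, its parameters, and the sample data are all hypothetical; the original function isn't shown, so this is only one way to write it. Here it plots revenue against release year, the quantity discussed next.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_plot(data, x_col, y_col, title):
    """Draw a scatter plot of two dataframe columns and return the axes."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.scatter(data[x_col], data[y_col], alpha=0.5)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title(title)
    return ax


# Hypothetical sample; revenue is in unadjusted dollars.
df = pd.DataFrame({
    "release_year": [1977, 1997, 2009, 2015],
    "revenue": [775000000, 1800000000, 2700000000, 1500000000],
    "runtime": [121, 194, 162, 124],
})

ax = scatter_plot(df, "release_year", "revenue", "Revenue by release year")
```

Because the columns are passed in by name, the same helper can be reused for any other pair, for example scatter_plot(df, "runtime", "revenue", "Revenue by runtime").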
This is the actual revenue, not the inflation-adjusted number. I wanted to see how revenue rose over the years without adjusting for inflation at all. The late 1970’s saw the first movie to make more than $500 million, the mid 1990’s saw the first $1 billion movie, and the late 2000's saw the first $2 billion movie.
Finally, I wanted to see what the relationship between runtime and revenue looked like. Using the function I made earlier, I created a new scatter plot.
This shows that the movies that make the most money are between 90 minutes and 3 hours.
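One way to back up a reading like this with numbers, rather than by eye, is to pull out the top earners and look at the range of their runtimes. This is only a sketch on made-up rows; the cutoff of three movies is arbitrary and the column names are assumptions.

```python
import pandas as pd

# Hypothetical sample; runtime in minutes, revenue in unadjusted dollars.
df = pd.DataFrame({
    "original_title": ["Short", "Hit A", "Hit B", "Hit C", "Flop"],
    "runtime": [2, 95, 162, 178, 88],
    "revenue": [0, 500000000, 2500000000, 1800000000, 1000000],
})

# The three highest-grossing movies in the sample.
top_earners = df.nlargest(3, "revenue")

# Shortest and longest runtime among those top earners.
runtime_range = (top_earners["runtime"].min(), top_earners["runtime"].max())

print(top_earners[["original_title", "runtime", "revenue"]])
print(runtime_range)
```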
Conclusions
For the movies per year, we can see that in the last 4 decades, and especially since 2005, the number of releases per year is generally on the upswing. It could be that as the population grows and the studios make more and more money, the pool of everyone involved (writers, directors, producers, actors) gets bigger and bigger, and studios have more money to put towards movies. More movies are then made to create more opportunities to make more money, which keeps the cycle going. This would need further research to confirm.
The drop-off in runtimes in 2005 coincides with the explosion of movie releases in that same year. Perhaps studios are rushing to get more movies out and spending less time on each one, which could include big edits to content that affect runtime. Or maybe the movies are getting shorter so the special effects departments have less CGI to put into them. It could also be an attention-span issue with audiences. With rare exceptions, no one really likes seeing 2–3 hour long movies anymore, so studios are catering to that. This would be an interesting thing to look into further as well.
The dataset has a total of 10,865 movies. Because some columns were removed as part of the data cleanup process, the current dataset doesn’t have any issues with missing data. The complete dataset does have some missing data, so this would need to be addressed if other analysis is done.
The data set also has a good amount of data to look at, but the data isn’t spread evenly across the years. For example, compare the output of movie_by_year.head() and movie_by_year.tail(). The first shows 30-40 movies a year in the early 1960's, and the second shows 540 all the way up to 700 movies a year in the early 2010's. The early 2010's account for roughly 15-20 times more movies per year than the early 1960's, so the data is much more detailed as the years go on. The comparison across years might be more reliable if there were approximately the same number of movies per year for all years in the data set. To be even more accurate, it would be best to get a dataset that has all the movies made for every year in the dataset.