Udacity R Programming
As part of my goal to change careers to enter the data field, I’m going back to school at Western Governors University for their Data Management/Data Analytics Bachelor’s Degree. The later classes in the program take you through Udacity’s Data Analyst Nanodegree, broken up into 5 classes. This is the project for class 3 of 5, the R programming class.
For this project, we were given 3 CSV files with bike share data from different cities, and we used R to analyze the data. After loading ggplot2, I read in the CSV files. We were instructed to come up with 3 different questions to answer.
library(ggplot2)  # plotting library used for the charts in this project
ny = read.csv('new_york_city.csv')
wash = read.csv('washington.csv')
chi = read.csv('chicago.csv')
The first question I had was “what do the trip durations in Washington look like?” The trip durations were in seconds, so I divided the duration by 60. After setting the binwidth, limits, and labels, this is what I came up with:
I cut the duration at 120 minutes since there weren’t that many trips longer than that, and extending the duration to include everything wouldn’t give us a good look at the bulk of the data. Most trips are under 20 minutes, and the trip duration drops off sharply after about 30 minutes.
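As a rough sketch of the steps described above (convert seconds to minutes, set a bin width, limit the x-axis to 120 minutes, add labels), something like the following would work. The tiny wash_demo data frame is made up so the example runs on its own, and the Trip.Duration column name, the 5-minute bin width, and the label text are all assumptions rather than the exact values from my project.

```r
library(ggplot2)

# Made-up stand-in for washington.csv so the sketch is self-contained.
# The Trip.Duration column name (in seconds) is an assumption.
wash_demo <- data.frame(
  Trip.Duration = c(300, 540, 780, 1020, 1500, 2400, 9000)
)

# Convert trip duration from seconds to minutes.
wash_demo$duration_min <- wash_demo$Trip.Duration / 60

# Histogram with 5-minute bins (assumed width), zoomed to 0-120 minutes.
p_duration <- ggplot(wash_demo, aes(x = duration_min)) +
  geom_histogram(binwidth = 5, boundary = 0) +
  coord_cartesian(xlim = c(0, 120)) +
  labs(
    title = "Trip durations in Washington",
    x = "Trip duration (minutes)",
    y = "Number of trips"
  )

print(p_duration)
```

Here coord_cartesian() only zooms the view, so trips longer than 120 minutes stay in the data but fall outside the plot. Setting limits through scale_x_continuous() instead would drop those rows before binning and print a warning.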
The 2nd question I had was “are there more male or female subscribers in Chicago?” The gender column had a few blank entries. I changed those to NA with chi[chi == ""] <- NA, then excluded the NA entries. Outside of the blank/NA entries, I didn’t need to limit the data in any other way. I used facet_wrap() to group the results by gender, and here is the chart:
Chicago has about 3x as many male subscribers as female. Female subscribers number around 1,700, while male subscribers number over 5,000.
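A minimal sketch of the blank-to-NA cleanup and the faceted chart might look like this. The chi_demo data frame is invented so the code runs without the real file, and the Gender column name, the choice of geom_bar(), and the labels are assumptions for illustration.

```r
library(ggplot2)

# Made-up stand-in for chicago.csv; the Gender column name is an assumption.
chi_demo <- data.frame(
  Gender = c("Male", "Female", "", "Male", "Male", "Female", ""),
  stringsAsFactors = FALSE
)

# Turn blank strings into real missing values, then drop those rows.
chi_demo[chi_demo == ""] <- NA
chi_gender <- subset(chi_demo, !is.na(Gender))

# One panel per gender, each holding a single bar with that gender's count.
p_gender <- ggplot(chi_gender, aes(x = Gender)) +
  geom_bar() +
  facet_wrap(~ Gender, scales = "free_x") +
  labs(
    title = "Chicago riders by gender",
    x = "Gender",
    y = "Number of riders"
  )

print(p_gender)
```

With the toy data this leaves 3 male and 2 female rows. On the real file the same steps would give the counts discussed above.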
Finally, I wanted to know “what does birth year by gender look like in New York?” For this, I used the same bit of code that I used for Chicago to change the blank entries to NA. I used a scatterplot for this one, adding in some jitter and making the jittered bands wider to give a better idea of how the birth years are grouped for each gender. I changed colors too because I like green. This is the result:
I cut the birth year at 1935. The rental data is from 2017, so that would be an 82-year-old riding a bike. That isn’t too out of the ordinary, but the number of people riding rental bikes past that age is going to be relatively small. In the results, we see most female riders were born in the 1980s to mid-1990s. Male riders are mostly in that range too, but their birth years also spread well back into the 1950s.
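The jittered scatterplot could be sketched roughly as below. The ny_demo data is made up so the example is self-contained, and the Gender and Birth.Year column names, the jitter width, the transparency, and the exact shade of green are assumptions, not the settings from my actual chart.

```r
library(ggplot2)

# Made-up stand-in for new_york_city.csv; column names are assumptions.
ny_demo <- data.frame(
  Gender = c("Male", "Female", "", "Male", "Female", "Male"),
  Birth.Year = c(1955, 1988, 1990, 1991, 1994, 1920),
  stringsAsFactors = FALSE
)

# Same cleanup as for Chicago: blanks become NA, then drop them.
# Also keep only riders born in 1935 or later.
ny_demo[ny_demo == ""] <- NA
ny_plot <- subset(ny_demo, !is.na(Gender) & Birth.Year >= 1935)

# Jitter points sideways only, so the birth years themselves are not shifted.
p_birth <- ggplot(ny_plot, aes(x = Gender, y = Birth.Year)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.5, color = "darkgreen") +
  labs(
    title = "Birth year by gender in New York",
    x = "Gender",
    y = "Birth year"
  )

print(p_birth)
```

Setting height = 0 keeps the jitter horizontal, so each point still sits at its true birth year while the width spreads overlapping points into a readable band.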