Background and Objectives:
In the past few years we have witnessed a new economic modus operandi emerge, referred to as the “sharing economy.” The salient feature of this economy is, as the name suggests, the sharing of resources rather than the maintenance of exclusive use rights. One area that has now entered mainstream use is bike sharing systems, adopted as transportation solutions in many cities worldwide. (In NY this system goes under the name of CitiBike and has been championed by the Bloomberg administration, resulting in the creation of a multitude of docking stations that now litter the city streets.) The fundamental concept of a bike sharing system is to provide short (last mile) transportation solutions that complement traditional public transport and are agile, green, and cost effective. Running an effective bike share system is a complex undertaking that involves a mix of strategic decisions (determining the location and sizing of bike docks) and tactical ones (how frequently to load balance the bikes). The bike share system operator keeps track of usage data in order to better understand patterns of travel, station utilization, and inventory levels throughout the day.
In this dataset you will have the opportunity to explore some of the above characteristics, determine their statistical significance, and assess how they affect counts of bike rentals. To that end, the dataset provides the count of total bike rentals for 731 days along with various characteristics for each day. (The dataset is based on publicly available data from a major bike share operator over a two-year horizon.) Your goal is to perform exploratory data analysis leading up to regressions that will help elucidate the factors that influence usage, and to provide any pertinent recommendations, both qualitative and quantitative. (Think of this as building a decision support diagnostic tool to help the bike share operator in day-to-day operations.)
Below we provide some details concerning the nomenclature of the characteristics given in the Excel file:
- cnt: count of total rental bikes including both casual and registered users
- season : season (1:spring, 2:summer, 3:fall, 4:winter).
- yr : year (0: 2011, 1:2012).
- mnth : month ( 1 to 12).
- hr : hour (0 to 23).
- holiday : whether the given day is a holiday or not.
- weekday : day of the week.
- workingday : if day is neither a weekend nor holiday then 1, otherwise 0.
- weathersit :
  - 1: Clear, Few clouds, Partly cloudy.
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist.
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds.
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog.
- temp : Normalized temperature in Celsius.
- atemp: Normalized real-feel temperature in Celsius.
- hum: Normalized humidity.
- windspeed: Normalized wind speed.
- casual: count of casual users.
- registered: count of registered users.
- Broken bike: the number of broken bikes detected/reported per given day.
- Broken docks: the number of broken docks detected/reported per given day.
- Number street: total number of street closures per given day.
- Frac_trains_time: fraction of trains running on time on given day.
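The integer codes in the list above are easier to read in plots and regression output once they are converted to labelled factors. The following is a minimal sketch of that recoding on a small made-up data frame, not the actual Excel file. The data frame `toy` and the short `weathersit` labels are assumptions chosen for illustration; the season and year labels follow the code book.

```r
# Toy stand-in for a few rows of the bike-share data (values are made up).
toy <- data.frame(
  season     = c(1, 2, 3, 4, 1),
  yr         = c(0, 0, 1, 1, 1),
  weathersit = c(1, 2, 1, 3, 2)
)

# Map the integer codes to labelled factors, following the code book above.
toy$season <- factor(toy$season, levels = 1:4,
                     labels = c("Spring", "Summer", "Fall", "Winter"))
toy$yr <- factor(toy$yr, levels = 0:1, labels = c("2011", "2012"))

# Listing all four weather codes keeps level 4 even though no toy row uses it.
# The labels here are hypothetical shorthand for the four categories.
toy$weathersit <- factor(toy$weathersit, levels = 1:4,
                         labels = c("Clear", "Mist", "Light precipitation",
                                    "Heavy precipitation"))

str(toy)
table(toy$season)
```

Declaring the full set of levels up front means an unused category (such as the heaviest weather class) still appears in tables with a count of zero instead of silently disappearing.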
First Look at The Data – The Structure
# Load an R data frame.
library(readxl)
bike_shar <- read_excel("~/R/bike-sharing-data-bdvxdndh.xlsx")
View(bike_shar)
str(bike_shar)
tibble [731 x 24] (S3: tbl_df/tbl/data.frame)
$ cnt : num [1:731] XXXXXXXXXX1600 ...
$ season : Factor w/ 4 levels "Spring","Summer",..: XXXXXXXXXX ...
$ yr : Factor w/ 2 levels "2011","2012": XXXXXXXXXX ...
$ Broken Docks : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ mnth : num [1:731] XXXXXXXXXX ...
$ holiday : Factor w/ 2 levels "Working Day",..: XXXXXXXXXX ...
$ Broken bikes : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ weekday : num [1:731] XXXXXXXXXX ...
$ Number street closures : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ workingday : num [1:731] XXXXXXXXXX ...
$ fraction of trains running on time: num [1:731] XXXXXXXXXX XXXXXXXXXX89 0.67 ...
$ weathersit : Factor w/ 4 levels "Good:Sunny","Moderate:Cloudy",..: XXXXXXXXXX
$ temp : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ atemp : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ hum : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ windspeed : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ casual : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ registered : num [1:731] XXXXXXXXXX1518 ...
$ dteday : POSIXct[1:731], format: " XXXXXXXXXX" " XXXXXXXXXX" " XXXXXXXXXX" ...
$ actual_temp : num [1:731] XXXXXXXXXX 9.31 ...
$ actual_feel_temp : num [1:731] XXXXXXXXXX XXXXXXXXXX ...
$ actual_windspeed : num [1:731] XXXXXXXXXX 12.5 ...
$ actual_humidity : num [1:731] XXXXXXXXXX43.7 ...
$ mean_acttemp_feeltemp : num [1:731] XXXXXXXXXX4 10.38 ...
library(dplyr)
library(corrplot)
library(ggplot2)
library(stats)
Second Look at The Data – Data Quality
We will now assess data quality using a few simple descriptive statistics (minimum, mean, and maximum for the numeric variables) and counts for the categorical ones.
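As a complement to the descriptive statistics used in this section, a few explicit checks can flag problems that a summary table may hide. The sketch below runs on a small made-up data frame with one missing value and one inconsistent row planted on purpose; it is not the actual data. The name `toy` and the expectation that normalized variables lie in [0, 1] are assumptions, while the rule that cnt equals casual plus registered follows from the variable definitions above.

```r
# Toy stand-in for the daily data (values are made up; one NA and one
# inconsistent row are planted on purpose).
toy <- data.frame(
  cnt        = c(100, 200, 300, 400),
  casual     = c(30, 50, 100, 150),
  registered = c(70, 150, 200, 240),
  hum        = c(0.81, NA, 0.44, 0.59)
)

# Missing values per column.
na_counts <- colSums(is.na(toy))

# cnt should equal casual + registered on every row.
mismatch_rows <- which(toy$cnt != toy$casual + toy$registered)

# Assumption: normalized variables are expected to lie in [0, 1].
# which() drops the NA comparison, so missing values are not flagged here.
out_of_range <- which(toy$hum < 0 | toy$hum > 1)

print(na_counts)
print(mismatch_rows)
print(out_of_range)
```

Keeping the missing-value count separate from the range check avoids conflating "no value recorded" with "value recorded but implausible".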
summary(bike_shar)
cnt XXXXXXXXXXseason yr Broken Docks XXXXXXXXXXmnth XXXXXXXXXXholiday Broken bikes
Min. : 22 Spring: XXXXXXXXXX:365 Min. : XXXXXXXXXXMin. : XXXXXXXXXXWorking Day:710 Min. :150.0
1st Qu.:3152 Summer: XXXXXXXXXX:366 1st Qu.: XXXXXXXXXX1st Qu.: XXXXXXXXXXHoliday : 21 1st Qu.:175.0
Median :4548 Fall : XXXXXXXXXXMedian : XXXXXXXXXXMedian : 7.00 XXXXXXXXXXMedian :201.0
Mean :4504 Winter: XXXXXXXXXXMean : XXXXXXXXXXMean : 6.52 XXXXXXXXXXMean :200.8
3rd Qu.:5956