MTA Data Challenge¶

A. Data analysis:¶

Which station has the most number of units?
What is the total number of entries & exits across the subway system for February 1, 2013?
Let’s define the busy-ness as sum of entry & exit count. What station was the busiest on February 1, 2013? What turnstile was the busiest on that date?
What stations have seen the most usage growth/decline in 2013?
What dates are the least busy? Could you identify days on which stations were not operating at full capacity or closed entirely?
Bonus: What hour is the busiest for station CANAL ST in Q1 2013?

B. Visualization:¶

Plot the daily row counts for data files in Q1 2013.
Plot the daily total number of entries & exits across the system for Q1 2013.
Plot the mean and standard deviation of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.
Plot 25/50/75 percentile of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.
Plot the daily number of closed stations and number of stations that were not operating at full capacity in Q1 2013.

Notes/Caveats on data definition:

Date Range: Year 2013 for questions except question 1. (May 19, 2018 for question 1)
Turnstile is defined by the author as control_area, unit, and scp combination.
Only ‘Regular’ in description variable is considered for analysis.

# Read HTML module
from pyquery import PyQuery as pq

# common modules
import os
import re
import pandas as pd
import numpy as np
from datetime import datetime
import statsmodels.stats.api as sms

# plot modules
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')

# random seed
seed = 2

A.1. Which station has the most number of units?¶

I assume this question asks as of May 19 2018, which station has the most number of units. So for this question we use this data: http://web.mta.info/developers/data/nyct/turnstile/turnstile_180519.txt.

Field description can be found from http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt.

# download the data
path = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_180519.txt'
col_name_new = ['C/A','UNIT','SCP','STATION','LINENAME','DIVISION','DATE','TIME','DESC','ENTRIES','EXITS']
MTA_data_20180519w = pd.read_csv(path, sep=",", header=0, names = col_name_new)

MTA_data_20180519w.head()

Calculate the number of unique Unit for each Station on 05/18/2018 and select the Station having the most number of Unit.

# MTA_data_20180518 = MTA_data_20180519w.loc[MTA_data_20180519w.DATE == '05/18/2018',['STATION','UNIT']]
MTA_data_20180519w[MTA_data_20180519w.DATE == '05/18/2018'].groupby(
    ['STATION'],sort = False).nunique().UNIT.sort_values().tail(1)

STATION
23 ST    6
Name: UNIT, dtype: int64

Answer:¶

23 ST station having 6 units is the station has the most number of units.

For following questions, I first download data for entire year 2013, convert, clean and wrangle the data, and then do the required analysis.

Concretely, below steps are used on the raw data:

Download the year 2013 datasets from MTA webset. Append different datasets into a single dataframe.
Since the original data is formated in a way that multiple data points included in each row, I convert the wide form DF to long form DF where there is only one data point per row.
Remove the irregular data, by filtering "DESC" variable
Add a new column "HOURLY_ENTRIES" representing the incremental number of entries since the last recording time.
Add a new column "HOURLY_EXITS" representing the incremental number of exits since the last recording time.
Extract hour from "TIME" variable, extract "MONTH" and "YEAR" from "DATE" variable.
Identify and impute outliers in "HOURLY_ENTRIES" and "HOURLY_EXITS".
Compute "BUSYNESS" as "HOURLY_ENTRIES" plus "HOURLY_EXITS".
Merge the "Station" information to the dataset.

Field description for data prior to 10/18/14 can be found in http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt

Station, Unit, Control Area information can be found in http://web.mta.info/developers/resources/nyct/turnstile/Remote-Booth-Station.xls

# download dataset from MTA webset

def download_data(url,start,stop,log_name):
    """
    This function download selected datasets(prior 10/18/14) from MTA website, by finding the selected datasets by HTML elements.
    Then consolidate all the downloaded datasets to one large dataframe.
    Also, log the downloaded files information, such as file date, file_link, number of rows, and number of columns.
    """
    
    col_name = ['C/A','UNIT','SCP','DATE1','TIME1','DESC1','ENTRIES1','EXITS1','DATE2','TIME2','DESC2','ENTRIES2','EXITS2',
            'DATE3','TIME3','DESC3','ENTRIES3','EXITS3','DATE4','TIME4','DESC4','ENTRIES4','EXITS4',
            'DATE5','TIME5','DESC5','ENTRIES5','EXITS5','DATE6','TIME6','DESC6','ENTRIES6','EXITS6',
            'DATE7','TIME7','DESC7','ENTRIES7','EXITS7','DATE8','TIME8','DESC8','ENTRIES8','EXITS8']
    df = pd.DataFrame(columns = col_name)

    # Log of downloading
    logFile  = open(log_name, 'w')  # 'Download_Log.txt'
    logFile.write('file_date;file_link;number_of_rows;number_of_columns')
    
    # get HTML code
    jpy = pq(url) 

    for i in range(start, stop, 2):
        # retrieve file path
        item = jpy('#contentbox > div > div > a:nth-child({})'.format(i))
        fileDate = item.text()
        filePath = 'http://web.mta.info/developers/' + item.attr('href')

        # get data from txt, save to csv, and append to datafram
        data = pd.read_csv(filePath, sep=",", header=None, names = col_name)
        df = df.append(data, ignore_index=True)
        # data.to_csv('./raw_data/MTA_data_{}.csv'.format(re.sub(r',','',fileDate)))

        # log the downloaded file information
        record = str(fileDate) + ';' + str(filePath) + ';' + str(data.shape[0]) + ';' + str(data.shape[1])
        logFile.write('\n' + record)
    
    logFile.close()
    
    return df

# convert wide form dataset to long form dataset

def convert_wide_to_long(df):
    """
    In original data, there are multiple data points included in each row. 
    This function is to convert the wide form DF to long form DF where there is only one data point per row.
    Also, set the 'ENTRIES' and 'EXITS' variables type to float.
    """
    
    col_name_long = ['C/A','UNIT','SCP','DATE','TIME','DESC','ENTRIES','EXITS']
    df_long = pd.DataFrame(columns = col_name_long)

    for i in range(0,8):
        ind = list(range(0,3)) + [5*i+3, 5*i+4, 5*i+5, 5*i+6, 5*i+7]
        temp = df.iloc[:,ind]
        temp.columns = col_name_long
        df_long = df_long.append(temp, ignore_index = True)
        df_long = df_long.sort_values(['C/A','UNIT','SCP','DATE','TIME'])
        df_long[['ENTRIES','EXITS']] = df_long[['ENTRIES','EXITS']].apply(pd.to_numeric)
        
    return df_long.reset_index(drop = True).dropna(axis=0)

def remove_irregular_event(df):
    """
    In the original data, there are factors that may impact the data. (ie. Hardware failure, "IRREGULAR" audit event)
    Clean the data to filter out 'IRREGULAR' audit event.
    """
    # Remove records where DESC (audit event) != REGULAR
    df = df[df.DESC == 'REGULAR']

    return df

def add_hourly_entries(df):
    """
    The 'ENTRIES' variable recorded in the MTA data are cumulative entries of the turnstile per row. 
    Considering the data for a single turnstile machine (unique SCP, C/A, and UNIT),
    we want to add a new column symbolizing the incremental number of entries since the last recording time.
    
    This function is to add a new column, calculate the difference between ENTRIES in the current row and the
    previous row, and assign the difference to the new column. When there is NaN, fill it with 0.
    """
    
    HOURLY_ENTRIES = df.ENTRIES - df.ENTRIES.shift(1) 
    df['HOURLY_ENTRIES'] = HOURLY_ENTRIES.fillna(0)
    return df


def add_hourly_exits(df):
    """
    The 'EXITS' variable recorded in the MTA data are cumulative exits of the turnstile per row. 
    Considering the data for a single turnstile machine (unique SCP, C/A, and UNIT),
    we want to add a new column symbolizing the incremental number of exits since the last recording time.
    
    This function is to add a new column, calculate the difference between EXITS in the current row and the
    previous row, and assign the difference to the new column. When there is NaN, fill it with 0.
    """
    
    HOURLY_EXITS = df.EXITS - df.EXITS.shift(1) 
    df['HOURLY_EXITS'] = HOURLY_EXITS.fillna(0)
    return df

def add_busyness(df):
    """
    Define busyness as sum of ebtries and exits. Add a new column and assign busyness to it.
    """
    
    BUSYNESS = df.HOURLY_ENTRIES + df.HOURLY_EXITS 
    df['BUSYNESS'] = BUSYNESS
    return df

def time_to_hour(time):
    """
    Input 00:00:00 (hour:minute:second).
    Extract and return the hour from input.
    """
    # return pd.to_datetime(time).hour
    return int(time.split(':')[0])

def date_to_month(date):
    """
    Input mm-dd-yy (month-day-year).
    Extract and return the month from input date.
    """
    # return pd.to_datetime(date).month
    return int(date.split('-')[0])
               
def date_to_year(date):
    """
    Input sting mm-dd-yy (month-day-year).
    Extract and return the year from input date.
    """
    # return pd.to_datetime(date)
    return 2000 + int(date.split('-')[2])

def wrangle_MTA_data(url,start,stop,log_name,date_tag):
    
    # download data
    df = download_data(url,start,stop,log_name)
    df.to_csv('MTA_data_{}.csv'.format(date_tag))
    
    # convert wide form data to long form
    df_long=convert_wide_to_long(df)
    df_long.to_csv('MTA_data_long_{}.csv'.format(date_tag))
    
    # filter illegitimate data
    df_long = remove_irregular_event(df_long)
    df_long.to_csv('MTA_data_regular_{}.csv'.format(date_tag))
    
    # add hourly incremental entries
    df_long = df_long.groupby(['C/A','UNIT','SCP']).apply(
        add_hourly_entries)
    
    # add hourly incremental exits
    df_long = df_long.groupby(['C/A','UNIT','SCP']).apply(
        add_hourly_exits)

    # add a 'HOUR', 'MONTH' and 'YEAR' column 
    df_long['HOUR'] = df_long['TIME'].map(time_to_hour)
    df_long['MONTH'] = df_long['DATE'].map(date_to_month)
    df_long['YEAR'] = df_long['DATE'].map(date_to_year)
    
    df_long.to_csv('MTA_data_hour_{}.csv'.format(date_tag))
    
    return df_long

url = "http://web.mta.info/developers/turnstile.html"
# as of May 25, the start=459 is the December 28, 2013 data , and stop = 563 is the January 05, 2013 data 
# change start and stop parameters if necessary when using this code at other times
start = 459 
stop = 564 
log_name = 'Download_Log.txt'
date_tag = 'y2013' # change here accordingly
df_long = wrangle_MTA_data(url,start,stop,log_name,date_tag)

print(df_long.shape)
df_long.head()

(10134532, 13)

def clean_data(df):
    """
    HOURLY_ENTRIES and HOURLY_EXITS contains negative value and abnormally large value (ie. the max is 931476882).
    Assuming it takes 1 second for 1 people enter the turnstile, 
    there can be at max 14,400 people entering turnstile in 4 hours. 
    So in theory, considering a buffer, any HOURLY_ENTRIES (or HOURLY_EXITS) greater than 20000 is not possible. 
    Also HOURLY_ENTRIES and HOURLY_EXITS obviously cannot be negative.
    
    This function replace the negative and greater than 20000 HOURLY_ENTRIES by the mean of the group(ie.SCP,MONTH) that they are in.
    Then calculate the "BUSYNESS".
    """

    # clean 'HOURLY_ENTRIES'
    df['HOURLY_ENTRIES'] = df.groupby(['SCP','MONTH']).HOURLY_ENTRIES.transform(
        lambda x: np.where((x<0)|(x>20000),x.mask((x<0)|(x>20000)).mean(),x))
    
    # clean 'HOURLY_EXITS'
    df['HOURLY_EXITS'] = df.groupby(['SCP','MONTH']).HOURLY_EXITS.transform(
        lambda x: np.where((x<0)|(x>20000),x.mask((x<0)|(x>20000)).mean(),x))
    
    # add busyness
    df['BUSYNESS'] = df.HOURLY_ENTRIES + df.HOURLY_EXITS 
    
    return df

df_final = clean_data(df_long)

Now, we could load the Station, Unit, Control Area mapping information from a csv file, which can be downloaded from MTA website as well.

Fields in this data:

Remote (= UNIT variable in MTA dataset)
Booth (= C/A variable in MTA dataset)
Station
Line Name
Division

# get the Station, Unit, Control Area mapping
mapping = pd.read_csv('Remote-Booth-Station.csv',header = 0,names = ['UNIT','C/A','Station','Line Name','Division'])

# Merge station information to the long form dataset
# for turnstiles that don't exist in mapping file, tag their station information as 'Unknown'

df_master = pd.merge(df_final, mapping, how='left', on=['C/A','UNIT'], sort=True, copy=True, indicator=True)
df_master.loc[df_master._merge =='left_only',['Station','Line Name','Division']] = 'Unknown' 

# save to csv
df_master.to_csv('MTA_data_master_{}.csv'.format('y2013'))

A.2. What is the total number of entries & exits across the subway system for February 1, 2013?¶

df_master[df_master.DATE == '02-01-13'].groupby('DATE')['HOURLY_ENTRIES','HOURLY_EXITS'].sum().round(2).rename(
    columns = {'HOURLY_ENTRIES': "TOTAL_ENTRIES",'HOURLY_EXITS': 'TOTAL_EXITS'})

Answer:¶

The total number of entries & exits across the subway system for February 1, 2013 are 5,808,765 and 4,497,579.

A.3. Let’s define the busy-ness as sum of entry & exit count. What station was the busiest on February 1, 2013? What turnstile was the busiest on that date?¶

# select Feb 1, 2013 data
df_master_020113 = df_master[df_master.DATE == '02-01-13']

# The busiest station
df_master_020113.groupby('Station').BUSYNESS.sum().sort_values(ascending=False)[:3]

Station
34 ST-PENN STA     360435.0
42 ST-GRD CNTRL    327422.0
86 ST              208602.0
Name: BUSYNESS, dtype: float64

# the busiest turnstile
idx = df_master_020113['BUSYNESS'].idxmax()
df_master_020113.loc[idx,]

C/A                         J021
UNIT                        R434
SCP                     00-00-02
DATE                    02-01-13
TIME                    00:00:00
DESC                     REGULAR
ENTRIES              4.43972e+06
EXITS                3.51653e+06
HOURLY_ENTRIES              3645
HOURLY_EXITS                3693
HOUR                           0
MONTH                          2
YEAR                        2013
BUSYNESS                    7338
Station           VAN SICLEN AVE
Line Name                     JZ
Division                     BMT
_merge                      both
Name: 1905847, dtype: object

Answer:¶

The busiest station on February 1, 2013 is 34 ST-PENN STATION.

The busiest turnstile is C/A J021, UNIT R434, SCP 00-00-02, at VAN SICLEN AVE station.

A.4. What stations have seen the most usage growth/decline in 2013?¶

Define the change of usage of a station as difference between the average monthly usage in second half year and the average monthly usage in first half year.

So the most usage growth station is the station with the largest change; the most usage decline station is the station with the smallest change. Also do a t-test to observe the test interval of the mean difference.

# calculate the stations' total usage in 2013
df_master_totalusage_2013 = df_master[df_master.YEAR == 2013].groupby(['Station','MONTH'])['BUSYNESS'].sum().reset_index().rename(
    columns = {"BUSYNESS":"TOTAL_USAGE"})

# list of stations
stations = list(set(df_master['Station']))

def test_usage_change(df,stations):
    res = pd.DataFrame(columns = ['Station','mean_diff','confidence_interval'])
    row = 0
    df_firsth = df[df.MONTH.isin([1,2,3,4,5,6])]
    df_lasth = df[df.MONTH.isin([7,8,9,10,11,12])]

    for station in stations:
        firstHalf = list(df_firsth[df_firsth.Station == station].TOTAL_USAGE)
        secondHalf = list(df_lasth[df_lasth.Station == station].TOTAL_USAGE)
        cm = sms.CompareMeans(sms.DescrStatsW(secondHalf),sms.DescrStatsW(firstHalf))
        confit_int = cm.tconfint_diff(usevar='unequal')
        mean_diff = np.mean(secondHalf) - np.mean(firstHalf)

        res.loc[row,'Station'] = station
        res.loc[row,'mean_diff'] = mean_diff
        res.loc[row,'confidence_interval'] = confit_int
        row += 1
    return res

res = test_usage_change(df_master_totalusage_2013,stations)

print(res.sort_values(by="mean_diff")[:3])

            Station mean_diff                         confidence_interval
272    WHITEHALL ST   -317063  (-504388.28706794145, -129737.83452309942)
98   JOURNAL SQUARE   -196321   (-416372.77975003584, 23730.684120606835)
54   HOWARD BCH-JFK   -120662  (-215232.01122646796, -26091.565284261975)

print(res.sort_values(by="mean_diff",ascending = False)[:3])

            Station mean_diff                       confidence_interval
63          Unknown    640587  (165202.74038649764, 1115971.2803469985)
10    BOWLING GREEN    299289   (154954.82184405794, 443622.8789167648)
337  42 ST-TIMES SQ    257084  (-186355.38486461504, 700522.6609060073)

Answer:¶

WHITEHALL ST stations have seen the most usage decline in 2013.

BOWLING GREEN stations have seen the most usage growth in 2013.

A.5. What dates are the least busy? Could you identify days on which stations were not operating at full capacity or closed entirely?¶

Assume this question is asking what dates are the least busy in year 2013.

# select year 2013 data
df_master_2013 = df_master[df_master.YEAR == 2013]

def least_busy_day(df):
    """
    calculate the daily sum of BUSYNESS across the subway system and find the 3 least busy days.
    """
    print('The top 3 least busy days:')
    print(df.groupby(['DATE'])['BUSYNESS'].sum().round(2).sort_values()[:3])

least_busy_day(df_master_2013)

The top 3 least busy days:
DATE
12-25-13    3444355.21
01-01-13    3775090.44
11-28-13    4279270.57
Name: BUSYNESS, dtype: float64

def days_with_closed_stations(df):
    """
    Define the station is closed as that station has 0 BUSYNESS on a day. 
    Firstly, for each station calculate the daily busyness.
    Then, find out the days on which any station has 0 daily busyness, also the name count of the closed stations.
    """
    
    daily_sum = df.groupby(['Station','DATE'])['BUSYNESS'].sum(axis = 0).reset_index()

    res = pd.DataFrame(columns = ['DATE','CLOSED_STATION','COUNT_CLOSED_STATION'])
    row = 0
    for name, group in daily_sum[daily_sum.BUSYNESS == 0].groupby('DATE'): 
        res.loc[row,'DATE'] = name
        res.loc[row,'CLOSED_STATION'] = list(group.Station)
        res.loc[row,'COUNT_CLOSED_STATION'] = group.count().Station
        row += 1
    return res

res_closed_stations_2013 = days_with_closed_stations(df_master_2013)
print('First 5 days in 2013 on which stations are entirely closed.')
res_closed_stations_2013.head()

First 5 days in 2013 on which stations are entirely closed.

# print(list(res_closed_stations_2013.DATE))

def days_with_not_fully_operat_station(df):
    """
    In MTA dataset, 'C/A' variable is the indicator of 'booth'. A station may have multiple booths, 
    and each booth consist of multiple devices (ie. SCP). 
    Define the station is not fully operating as that at least 1 booth in that station has 0 busyness over a day. 
    Firstly, for each booth(ie.unique 'Station','C/A') calculate the daily busyness.
    Then, find out the days on which any station is not fully operated, also the name and count of those stations.
    """
    
    daily_booth_sum = df.groupby(['Station','C/A','DATE'])['BUSYNESS'].sum().reset_index()
    daily_countclosebooth = daily_booth_sum.groupby(['Station','DATE']).apply(
        lambda column: (column == 0).sum()).BUSYNESS.reset_index().rename(columns={'BUSYNESS':'CNT_CLOSE_BOOTH'})
    
    res = pd.DataFrame(columns = ['DATE','NOT_FULLY_OPERATE_STATION','CNT_NOT_FULLY_OPERATE_STATION'])
    row = 0
    for name, group in daily_countclosebooth[daily_countclosebooth.CNT_CLOSE_BOOTH > 0].groupby('DATE'): 
        res.loc[row,'DATE'] = name
        res.loc[row,'NOT_FULLY_OPERATE_STATION'] = list(group.Station)
        res.loc[row,'CNT_NOT_FULLY_OPERATE_STATION'] = group.count().Station
        row += 1
    return res

res_not_fully_operate_station_2013 = days_with_not_fully_operat_station(df_master_2013)
print('First 5 days in 2013 on which stations are not fully operated.')
res_not_fully_operate_station_2013.head()

First 5 days in 2013 on which stations are not fully operated.

# print(list(res_not_fully_operate_station_2013.DATE))

Answer:¶

The least busy day in 2013 is December 25, 2013.

The days on which stations were not operating at full capacity or closed entirely could be find in resulting datasets(top 5 rows showed as example).

A.Bonus. What hour is the busiest for station CANAL ST in Q1 2013?¶

# select data for station CANAL ST in Q1 2013
df_CANAL_2013Q1 = df_master_2013[(df_master_2013.MONTH.isin([1,2,3]))&(df_master_2013.Station == 'CANAL ST')]

plot_hourly_busyness = df_CANAL_2013Q1.groupby(['HOUR']).BUSYNESS.mean().round(2).sort_values(ascending = False)
plot_hourly_busyness[:3]

HOUR
21    676.02
17    672.86
20    672.17
Name: BUSYNESS, dtype: float64

Answer:¶

HOUR 21 is the busiest hour for staion CANAL ST in Q1 2013

B.1. Plot the daily row counts for data files in Q1 2013.¶

In original data, there are multiple (8) data points included in each row.

So I firstly converted the original data to long form data set where there is only one data point per row, assuming we more care about how many data points in Q1 2013. Then, after filtering irregular data points from the long form data, I count the rows for each day, which is, in other word, the count of regular data points.

Just in case, we also want the row counts for the original wide form data files. That number should be around 1/8 of the count of data points. (Also, in our downloading log, we recorded the number of rows of the original datafile).

# select data for 2013Q1
df_master_2013q1 = df_master_2013[df_master_2013.MONTH.isin([1,2,3])]

plot_daily_count = df_master_2013q1.groupby(['DATE']).count().iloc[:,0].reset_index().rename(
    columns={'C/A':'row_counts'})

plot_daily_count.tail()

fig, ax = plt.subplots(figsize=(10, 30))
y_pos = np.arange(plot_daily_count.shape[0])

ax.barh(y_pos, plot_daily_count.row_counts, align='center',
        color='coral', ecolor='black')

# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_daily_count.DATE)
ax.set_ylim((-1, plot_daily_count.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Row Counts')
ax.set_ylabel('Date')
ax.set_title('Daily Row Counts for Q1 2013')

# add value label for each bar
for i in ax.patches:
    ax.text(i.get_width()+.3, i.get_y()+.38,'{0:.2f}k'.format(i.get_width()/1000) , fontsize=10,color='black')

B.2. Plot the daily total number of entries & exits across the system for Q1 2013.¶

plot_daily_entries_exits = \
df_master_2013q1.groupby(['DATE']).sum()[['HOURLY_ENTRIES','HOURLY_EXITS']].rename(
    columns = {'HOURLY_ENTRIES':'daily_entries','HOURLY_EXITS':'daily_exits'}).reset_index()

plot_daily_entries_exits.head()

fig, ax = plt.subplots(figsize=(10, 40))

ind = np.arange(plot_daily_entries_exits.shape[0])  # the y locations for the groups
width = 0.4 # the width of the bars

rects1 = ax.barh(ind, plot_daily_entries_exits.daily_entries, width, color='coral')
rects2 = ax.barh(ind + width, plot_daily_entries_exits.daily_exits, width, color='tan')

# add some text for labels, title and axes ticks
ax.set_xlabel('Count')
ax.set_title('Daily Total Number of Entries and Exits')
ax.set_yticks(ind + width / 2)
ax.set_yticklabels(plot_daily_count.DATE)
ax.set_ylim((-1, plot_daily_count.shape[0]))
ax.invert_yaxis()

# add value label for each bar    
def add_value_label(rects):
    for i in rects:
        height = i.get_width()
        ax.text(i.get_width()+.3, i.get_y()+.38,
                '{0:.2f}m'.format(height/1000000),fontsize=10,color='black')
        
add_value_label(rects1)
add_value_label(rects2)

Note: The chart shows daily entries always greater than daily exits. This actually make sense on a second thought, since many people may use emergency exit door when they get out of the station, hence they are not counted by any turnstile.

B.3. Plot the mean and standard deviation of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.¶

# get data from 34ST-PENN STA in Q1 2013
df_34penn_2013q1 = df_master_2013q1[df_master_2013q1.Station == '34 ST-PENN STA']

# calculate the daily entries and exits for Q1 2013
daily_entries_exits_34penn_2013q1 = df_34penn_2013q1.groupby(['MONTH','DATE'])[['HOURLY_ENTRIES','HOURLY_EXITS']]\
    .sum().rename(columns = {'HOURLY_ENTRIES':'daily_entries','HOURLY_EXITS':'daily_exits'}).reset_index()

# calculate the mean and standard deviation of the daily total number of entries & exits for each month
plot_mean_err = daily_entries_exits_34penn_2013q1.groupby('MONTH')['daily_entries','daily_exits'].agg(
    [np.mean,np.std]).reset_index()

plot_mean_err

fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10, 5),sharey = True)

ind = list(plot_mean_err.MONTH)

ax1.errorbar(x=ind,y=list(plot_mean_err.daily_entries['mean']), \
             yerr=list(plot_mean_err.daily_entries['std']),fmt='o',capsize = 5)
ax1.set_ylabel('Daily Entries')
ax1.set_xlabel('MONTH')
ax1.set_xticks(ind)

ax2.errorbar(x=ind,y=list(plot_mean_err.daily_exits['mean']),\
             yerr=list(plot_mean_err.daily_exits['std']),fmt='o',capsize = 5)
ax2.set_ylabel('Daily Exits')
ax2.set_xlabel('MONTH')
ax2.set_xticks(ind)

fig.suptitle('Mean and Standard Deviation of the daily total number of entries & exits')

Text(0.5,0.98,'Mean and Standard Deviation of the daily total number of entries & exits')

B.4. Plot 25/50/75 percentile of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.

plot_entries_stat = daily_entries_exits_34penn_2013q1.groupby('MONTH').daily_entries.describe(percentiles =[0.25,0.75])
plot_entries_stat

def plot_25_50_75_perc(summary,var):
    fig, ax = plt.subplots(figsize=(5, 5))
    err = np.vstack([list(summary['25%']),list(summary['75%'])])
    ax.errorbar(x=ind,y=list(summary['50%']), \
                yerr=err, fmt='o',capsize = 5)
    
    ax.set_title('Plot of 25/50/75 percentile of the daily total {} for each month in Q1 2013'.format(var))
    ax.set_xlabel('MONTH')
    ax.set_xticks(ind)

plot_25_50_75_perc(plot_entries_stat,'ENTRIES')

plot_exits_stat = daily_entries_exits_34penn_2013q1.groupby('MONTH').daily_exits.describe(percentiles =[0.25,0.75])
plot_exits_stat

plot_25_50_75_perc(plot_exits_stat,'EXITS')

B.5 Plot the daily number of closed stations and number of stations that were not operating at full capacity in Q1 2013.¶

Let's use the two result datasets from part A question 4: res_closed_stations_2013 & res_not_fully_operate_station_2013.

# select data for Q1 2013
df_closed_station_2013q1 = res_closed_stations_2013[res_closed_stations_2013.DATE.apply(
    lambda x:pd.to_datetime(x)) < '2013-04-01']

# convert DATE variable to date format
df_closed_station_2013q1['DATE'] = df_closed_station_2013q1['DATE'].map(
    lambda x: pd.to_datetime(x))

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys

# fill in missing date to the data
start = datetime(2013,1,1)
end = datetime(2013,3,31)
ts = pd.DataFrame(pd.date_range(start,end, freq='D'),columns= ["DATE"])

df_time = pd.merge(ts, plot_closed_station_2013q1[['DATE','COUNT_CLOSED_STATION']], how='left', on=['DATE']).fillna(0)

# plot the daily number of closed station
fig, ax = plt.subplots(figsize=(10, 20))
y_pos = np.arange(plot_daily_count.shape[0])

ax.barh(y_pos, df_time.COUNT_CLOSED_STATION, align='center',
        color='coral', ecolor='black')

# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(df_time.DATE.apply(lambda x: datetime.strftime(x, '%m-%d-%y')))
ax.set_ylim((-1, df_time.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Count of closed station')
ax.set_ylabel('Date')
ax.set_title('Daily count of closed station for Q1 2013')

# add value label for each bar
for i in ax.patches:
    ax.text(i.get_width()+.1, i.get_y()+.38,'{}'.format(i.get_width()) , fontsize=10,color='black')

# select data for Q1 2013
df_notfullyoper_station_2013q1 = res_not_fully_operate_station_2013[res_not_fully_operate_station_2013.DATE.apply(
    lambda x:pd.to_datetime(x)) < '2013-04-01']

# convert DATE variable to date format
df_notfullyoper_station_2013q1['DATE'] = df_notfullyoper_station_2013q1['DATE'].map(
    lambda x: pd.to_datetime(x))

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys

df_notfullyoper_station_2013q1.head()

# fill in missing date to the data
df_time_notfullyoper = pd.merge(ts, df_notfullyoper_station_2013q1[['DATE','CNT_NOT_FULLY_OPERATE_STATION']], how='left', on=['DATE']).fillna(0)

# plot the daily number of closed station
fig, ax = plt.subplots(figsize=(10, 20))
y_pos = np.arange(df_time_notfullyoper.shape[0])

ax.barh(y_pos, df_time_notfullyoper.CNT_NOT_FULLY_OPERATE_STATION, align='center',
        color='coral', ecolor='black')

# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(df_time_notfullyoper.DATE.apply(lambda x: datetime.strftime(x, '%m-%d-%y')))
ax.set_ylim((-1, df_time.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Count of not fully operated station')
ax.set_ylabel('Date')
ax.set_title('Daily count of not fully operated  station for Q1 2013')

# add value label for each bar
for i in ax.patches:
    ax.text(i.get_width()+.1, i.get_y()+.38,'{}'.format(i.get_width()) , fontsize=10,color='black')

	C/A	UNIT	SCP	STATION	LINENAME	DIVISION	DATE	TIME	DESC	ENTRIES	EXITS
0	A002	R051	02-00-00	59 ST	NQR456W	BMT	05/12/2018	00:00:00	REGULAR	6616743	2242108
1	A002	R051	02-00-00	59 ST	NQR456W	BMT	05/12/2018	04:00:00	REGULAR	6616762	2242113
2	A002	R051	02-00-00	59 ST	NQR456W	BMT	05/12/2018	08:00:00	REGULAR	6616781	2242144
3	A002	R051	02-00-00	59 ST	NQR456W	BMT	05/12/2018	12:00:00	REGULAR	6616886	2242229
4	A002	R051	02-00-00	59 ST	NQR456W	BMT	05/12/2018	16:00:00	REGULAR	6617105	2242278

	C/A	UNIT	SCP	DATE	TIME	DESC	ENTRIES	EXITS	HOURLY_ENTRIES	HOURLY_EXITS	HOUR	MONTH	YEAR
0	A002	R051	02-00-00	01-01-13	03:00:00	REGULAR	3932284.0	1355714.0	0.0	0.0	3	1	2013
1	A002	R051	02-00-00	01-01-13	07:00:00	REGULAR	3932299.0	1355721.0	15.0	7.0	7	1	2013
2	A002	R051	02-00-00	01-01-13	11:00:00	REGULAR	3932327.0	1355774.0	28.0	53.0	11	1	2013
3	A002	R051	02-00-00	01-01-13	15:00:00	REGULAR	3932427.0	1355811.0	100.0	37.0	15	1	2013
4	A002	R051	02-00-00	01-01-13	19:00:00	REGULAR	3932662.0	1355849.0	235.0	38.0	19	1	2013

	DATE	CLOSED_STATION	COUNT_CLOSED_STATION
0	01-05-13	[BEACH 90 ST]	1
1	01-06-13	[14TH STREET, 9TH STREET, KINGSTON AVE, TWENTY...	4
2	01-08-13	[BEACH 90 ST]	1
3	01-09-13	[BEACH 90 ST]	1
4	01-10-13	[BEACH 90 ST]	1

	DATE	NOT_FULLY_OPERATE_STATION	CNT_NOT_FULLY_OPERATE_STATION
0	01-01-13	[161 ST-YANKEE, 28 ST, 42 ST-GRD CNTRL, 51 ST,...	8
1	01-02-13	[Unknown]	1
2	01-03-13	[Unknown]	1
3	01-04-13	[Unknown]	1
4	01-05-13	[161 ST-YANKEE, 34 ST-PENN STA, 42 ST-GRD CNTR...	9

	DATE	row_counts
85	03-27-13	27213
86	03-28-13	27528
87	03-29-13	26786
88	03-30-13	26944
89	03-31-13	26858

	DATE	daily_entries	daily_exits
0	01-01-13	2.063555e+06	1.711535e+06
1	01-02-13	6.073735e+06	4.848887e+06
2	01-03-13	6.248716e+06	4.995638e+06
3	01-04-13	6.180395e+06	4.962462e+06
4	01-05-13	3.189770e+06	2.593668e+06

	MONTH	daily_entries		daily_exits
		mean	std	mean	std
0	1	142796.123376	46750.193938	125120.611422	38360.966903
1	2	145029.237484	48820.965770	126102.077118	40653.041929
2	3	147582.452106	43576.112025	130827.580553	35225.521156

	count	mean	std	min	25%	50%	75%	max
MONTH
1	31.0	142796.123376	46750.193938	61601.00000	86212.5	170646.000000	173842.000000	188766.702635
2	28.0	145029.237484	48820.965770	53684.00000	91983.0	176177.324778	178376.500000	190943.000000
3	31.0	147582.452106	43576.112025	66498.11801	96723.5	173828.000000	177832.448637	191379.000000

	DATE	NOT_FULLY_OPERATE_STATION	CNT_NOT_FULLY_OPERATE_STATION
0	2013-01-01	[161 ST-YANKEE, 28 ST, 42 ST-GRD CNTRL, 51 ST,...	8
1	2013-01-02	[Unknown]	1
2	2013-01-03	[Unknown]	1
3	2013-01-04	[Unknown]	1
4	2013-01-05	[161 ST-YANKEE, 34 ST-PENN STA, 42 ST-GRD CNTR...	9

	TOTAL_ENTRIES	TOTAL_EXITS
DATE
02-01-13	5808765.13	4497579.28