MTA Data Challenge

A. Data analysis:

  1. Which station has the most number of units?

  2. What is the total number of entries & exits across the subway system for February 1, 2013?

  3. Let’s define the busy-ness as sum of entry & exit count. What station was the busiest on February 1, 2013? What turnstile was the busiest on that date?

  4. What stations have seen the most usage growth/decline in 2013?

  5. What dates are the least busy? Could you identify days on which stations were not operating at full capacity or closed entirely?

  6. Bonus: What hour is the busiest for station CANAL ST in Q1 2013?

B. Visualization:

  1. Plot the daily row counts for data files in Q1 2013.

  2. Plot the daily total number of entries & exits across the system for Q1 2013.

  3. Plot the mean and standard deviation of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.

  4. Plot 25/50/75 percentile of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.

  5. Plot the daily number of closed stations and number of stations that were not operating at full capacity in Q1 2013.

Notes/Caveats on data definition:

  • Date Range: Year 2013 for questions except question 1. (May 19, 2018 for question 1)
  • Turnstile is defined by the author as control_area, unit, and scp combination.
  • Only ‘Regular’ in description variable is considered for analysis.
In [70]:
# Read HTML module
from pyquery import PyQuery as pq

# common modules
import os
import re
import pandas as pd
import numpy as np
from datetime import datetime
import statsmodels.stats.api as sms

# plot modules
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')

# random seed
seed = 2

A.1. Which station has the most number of units?

I assume this question asks as of May 19 2018, which station has the most number of units. So for this question we use this data: http://web.mta.info/developers/data/nyct/turnstile/turnstile_180519.txt.

Field description can be found from http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt.

In [35]:
# download the data
path = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_180519.txt'
col_name_new = ['C/A','UNIT','SCP','STATION','LINENAME','DIVISION','DATE','TIME','DESC','ENTRIES','EXITS']
MTA_data_20180519w = pd.read_csv(path, sep=",", header=0, names = col_name_new)
In [36]:
MTA_data_20180519w.head()
Out[36]:
C/A UNIT SCP STATION LINENAME DIVISION DATE TIME DESC ENTRIES EXITS
0 A002 R051 02-00-00 59 ST NQR456W BMT 05/12/2018 00:00:00 REGULAR 6616743 2242108
1 A002 R051 02-00-00 59 ST NQR456W BMT 05/12/2018 04:00:00 REGULAR 6616762 2242113
2 A002 R051 02-00-00 59 ST NQR456W BMT 05/12/2018 08:00:00 REGULAR 6616781 2242144
3 A002 R051 02-00-00 59 ST NQR456W BMT 05/12/2018 12:00:00 REGULAR 6616886 2242229
4 A002 R051 02-00-00 59 ST NQR456W BMT 05/12/2018 16:00:00 REGULAR 6617105 2242278

Calculate the number of unique Unit for each Station on 05/18/2018 and select the Station having the most number of Unit.

In [37]:
# MTA_data_20180518 = MTA_data_20180519w.loc[MTA_data_20180519w.DATE == '05/18/2018',['STATION','UNIT']]
MTA_data_20180519w[MTA_data_20180519w.DATE == '05/18/2018'].groupby(
    ['STATION'],sort = False).nunique().UNIT.sort_values().tail(1)
Out[37]:
STATION
23 ST    6
Name: UNIT, dtype: int64

Answer:

23 ST station having 6 units is the station has the most number of units.

For following questions, I first download data for entire year 2013, convert, clean and wrangle the data, and then do the required analysis.

Concretely, below steps are used on the raw data:

  1. Download the year 2013 datasets from MTA webset. Append different datasets into a single dataframe.
  2. Since the original data is formated in a way that multiple data points included in each row, I convert the wide form DF to long form DF where there is only one data point per row.
  3. Remove the irregular data, by filtering "DESC" variable
  4. Add a new column "HOURLY_ENTRIES" representing the incremental number of entries since the last recording time.
  5. Add a new column "HOURLY_EXITS" representing the incremental number of exits since the last recording time.
  6. Extract hour from "TIME" variable, extract "MONTH" and "YEAR" from "DATE" variable.
  7. Identify and impute outliers in "HOURLY_ENTRIES" and "HOURLY_EXITS".
  8. Compute "BUSYNESS" as "HOURLY_ENTRIES" plus "HOURLY_EXITS".
  9. Merge the "Station" information to the dataset.

Field description for data prior to 10/18/14 can be found in http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt

Station, Unit, Control Area information can be found in http://web.mta.info/developers/resources/nyct/turnstile/Remote-Booth-Station.xls

In [71]:
# download dataset from MTA webset

def download_data(url,start,stop,log_name):
    """
    This function download selected datasets(prior 10/18/14) from MTA website, by finding the selected datasets by HTML elements.
    Then consolidate all the downloaded datasets to one large dataframe.
    Also, log the downloaded files information, such as file date, file_link, number of rows, and number of columns.
    """
    
    col_name = ['C/A','UNIT','SCP','DATE1','TIME1','DESC1','ENTRIES1','EXITS1','DATE2','TIME2','DESC2','ENTRIES2','EXITS2',
            'DATE3','TIME3','DESC3','ENTRIES3','EXITS3','DATE4','TIME4','DESC4','ENTRIES4','EXITS4',
            'DATE5','TIME5','DESC5','ENTRIES5','EXITS5','DATE6','TIME6','DESC6','ENTRIES6','EXITS6',
            'DATE7','TIME7','DESC7','ENTRIES7','EXITS7','DATE8','TIME8','DESC8','ENTRIES8','EXITS8']
    df = pd.DataFrame(columns = col_name)

    # Log of downloading
    logFile  = open(log_name, 'w')  # 'Download_Log.txt'
    logFile.write('file_date;file_link;number_of_rows;number_of_columns')
    
    # get HTML code
    jpy = pq(url) 

    for i in range(start, stop, 2):
        # retrieve file path
        item = jpy('#contentbox > div > div > a:nth-child({})'.format(i))
        fileDate = item.text()
        filePath = 'http://web.mta.info/developers/' + item.attr('href')

        # get data from txt, save to csv, and append to datafram
        data = pd.read_csv(filePath, sep=",", header=None, names = col_name)
        df = df.append(data, ignore_index=True)
        # data.to_csv('./raw_data/MTA_data_{}.csv'.format(re.sub(r',','',fileDate)))

        # log the downloaded file information
        record = str(fileDate) + ';' + str(filePath) + ';' + str(data.shape[0]) + ';' + str(data.shape[1])
        logFile.write('\n' + record)
    
    logFile.close()
    
    return df
In [14]:
# convert wide form dataset to long form dataset

def convert_wide_to_long(df):
    """
    In original data, there are multiple data points included in each row. 
    This function is to convert the wide form DF to long form DF where there is only one data point per row.
    Also, set the 'ENTRIES' and 'EXITS' variables type to float.
    """
    
    col_name_long = ['C/A','UNIT','SCP','DATE','TIME','DESC','ENTRIES','EXITS']
    df_long = pd.DataFrame(columns = col_name_long)

    for i in range(0,8):
        ind = list(range(0,3)) + [5*i+3, 5*i+4, 5*i+5, 5*i+6, 5*i+7]
        temp = df.iloc[:,ind]
        temp.columns = col_name_long
        df_long = df_long.append(temp, ignore_index = True)
        df_long = df_long.sort_values(['C/A','UNIT','SCP','DATE','TIME'])
        df_long[['ENTRIES','EXITS']] = df_long[['ENTRIES','EXITS']].apply(pd.to_numeric)
        
    return df_long.reset_index(drop = True).dropna(axis=0)
In [15]:
def remove_irregular_event(df):
    """
    In the original data, there are factors that may impact the data. (ie. Hardware failure, "IRREGULAR" audit event)
    Clean the data to filter out 'IRREGULAR' audit event.
    """
    # Remove records where DESC (audit event) != REGULAR
    df = df[df.DESC == 'REGULAR']

    return df
In [16]:
def add_hourly_entries(df):
    """
    The 'ENTRIES' variable recorded in the MTA data are cumulative entries of the turnstile per row. 
    Considering the data for a single turnstile machine (unique SCP, C/A, and UNIT),
    we want to add a new column symbolizing the incremental number of entries since the last recording time.
    
    This function is to add a new column, calculate the difference between ENTRIES in the current row and the
    previous row, and assign the difference to the new column. When there is NaN, fill it with 0.
    """
    
    HOURLY_ENTRIES = df.ENTRIES - df.ENTRIES.shift(1) 
    df['HOURLY_ENTRIES'] = HOURLY_ENTRIES.fillna(0)
    return df


def add_hourly_exits(df):
    """
    The 'EXITS' variable recorded in the MTA data are cumulative exits of the turnstile per row. 
    Considering the data for a single turnstile machine (unique SCP, C/A, and UNIT),
    we want to add a new column symbolizing the incremental number of exits since the last recording time.
    
    This function is to add a new column, calculate the difference between EXITS in the current row and the
    previous row, and assign the difference to the new column. When there is NaN, fill it with 0.
    """
    
    HOURLY_EXITS = df.EXITS - df.EXITS.shift(1) 
    df['HOURLY_EXITS'] = HOURLY_EXITS.fillna(0)
    return df

def add_busyness(df):
    """
    Define busyness as sum of ebtries and exits. Add a new column and assign busyness to it.
    """
    
    BUSYNESS = df.HOURLY_ENTRIES + df.HOURLY_EXITS 
    df['BUSYNESS'] = BUSYNESS
    return df
In [17]:
def time_to_hour(time):
    """
    Input 00:00:00 (hour:minute:second).
    Extract and return the hour from input.
    """
    # return pd.to_datetime(time).hour
    return int(time.split(':')[0])

def date_to_month(date):
    """
    Input mm-dd-yy (month-day-year).
    Extract and return the month from input date.
    """
    # return pd.to_datetime(date).month
    return int(date.split('-')[0])
               
def date_to_year(date):
    """
    Input sting mm-dd-yy (month-day-year).
    Extract and return the year from input date.
    """
    # return pd.to_datetime(date)
    return 2000 + int(date.split('-')[2])       
In [18]:
def wrangle_MTA_data(url,start,stop,log_name,date_tag):
    
    # download data
    df = download_data(url,start,stop,log_name)
    df.to_csv('MTA_data_{}.csv'.format(date_tag))
    
    # convert wide form data to long form
    df_long=convert_wide_to_long(df)
    df_long.to_csv('MTA_data_long_{}.csv'.format(date_tag))
    
    # filter illegitimate data
    df_long = remove_irregular_event(df_long)
    df_long.to_csv('MTA_data_regular_{}.csv'.format(date_tag))
    
    # add hourly incremental entries
    df_long = df_long.groupby(['C/A','UNIT','SCP']).apply(
        add_hourly_entries)
    
    # add hourly incremental exits
    df_long = df_long.groupby(['C/A','UNIT','SCP']).apply(
        add_hourly_exits)

    # add a 'HOUR', 'MONTH' and 'YEAR' column 
    df_long['HOUR'] = df_long['TIME'].map(time_to_hour)
    df_long['MONTH'] = df_long['DATE'].map(date_to_month)
    df_long['YEAR'] = df_long['DATE'].map(date_to_year)
    
    df_long.to_csv('MTA_data_hour_{}.csv'.format(date_tag))
    
    return df_long
In [9]:
url = "http://web.mta.info/developers/turnstile.html"
# as of May 25, the start=459 is the December 28, 2013 data , and stop = 563 is the January 05, 2013 data 
# change start and stop parameters if necessary when using this code at other times
start = 459 
stop = 564 
log_name = 'Download_Log.txt'
date_tag = 'y2013' # change here accordingly
df_long = wrangle_MTA_data(url,start,stop,log_name,date_tag)
In [10]:
print(df_long.shape)
df_long.head()
(10134532, 13)
Out[10]:
C/A UNIT SCP DATE TIME DESC ENTRIES EXITS HOURLY_ENTRIES HOURLY_EXITS HOUR MONTH YEAR
0 A002 R051 02-00-00 01-01-13 03:00:00 REGULAR 3932284.0 1355714.0 0.0 0.0 3 1 2013
1 A002 R051 02-00-00 01-01-13 07:00:00 REGULAR 3932299.0 1355721.0 15.0 7.0 7 1 2013
2 A002 R051 02-00-00 01-01-13 11:00:00 REGULAR 3932327.0 1355774.0 28.0 53.0 11 1 2013
3 A002 R051 02-00-00 01-01-13 15:00:00 REGULAR 3932427.0 1355811.0 100.0 37.0 15 1 2013
4 A002 R051 02-00-00 01-01-13 19:00:00 REGULAR 3932662.0 1355849.0 235.0 38.0 19 1 2013
In [19]:
def clean_data(df):
    """
    HOURLY_ENTRIES and HOURLY_EXITS contains negative value and abnormally large value (ie. the max is 931476882).
    Assuming it takes 1 second for 1 people enter the turnstile, 
    there can be at max 14,400 people entering turnstile in 4 hours. 
    So in theory, considering a buffer, any HOURLY_ENTRIES (or HOURLY_EXITS) greater than 20000 is not possible. 
    Also HOURLY_ENTRIES and HOURLY_EXITS obviously cannot be negative.
    
    This function replace the negative and greater than 20000 HOURLY_ENTRIES by the mean of the group(ie.SCP,MONTH) that they are in.
    Then calculate the "BUSYNESS".
    """

    # clean 'HOURLY_ENTRIES'
    df['HOURLY_ENTRIES'] = df.groupby(['SCP','MONTH']).HOURLY_ENTRIES.transform(
        lambda x: np.where((x<0)|(x>20000),x.mask((x<0)|(x>20000)).mean(),x))
    
    # clean 'HOURLY_EXITS'
    df['HOURLY_EXITS'] = df.groupby(['SCP','MONTH']).HOURLY_EXITS.transform(
        lambda x: np.where((x<0)|(x>20000),x.mask((x<0)|(x>20000)).mean(),x))
    
    # add busyness
    df['BUSYNESS'] = df.HOURLY_ENTRIES + df.HOURLY_EXITS 
    
    return df
In [12]:
df_final = clean_data(df_long)

Now, we could load the Station, Unit, Control Area mapping information from a csv file, which can be downloaded from MTA website as well.

Fields in this data:

  • Remote (= UNIT variable in MTA dataset)
  • Booth (= C/A variable in MTA dataset)
  • Station
  • Line Name
  • Division
In [13]:
# get the Station, Unit, Control Area mapping
mapping = pd.read_csv('Remote-Booth-Station.csv',header = 0,names = ['UNIT','C/A','Station','Line Name','Division'])
In [36]:
# Merge station information to the long form dataset
# for turnstiles that don't exist in mapping file, tag their station information as 'Unknown'

df_master = pd.merge(df_final, mapping, how='left', on=['C/A','UNIT'], sort=True, copy=True, indicator=True)
df_master.loc[df_master._merge =='left_only',['Station','Line Name','Division']] = 'Unknown' 

# save to csv
df_master.to_csv('MTA_data_master_{}.csv'.format('y2013'))

A.2. What is the total number of entries & exits across the subway system for February 1, 2013?

In [18]:
df_master[df_master.DATE == '02-01-13'].groupby('DATE')['HOURLY_ENTRIES','HOURLY_EXITS'].sum().round(2).rename(
    columns = {'HOURLY_ENTRIES': "TOTAL_ENTRIES",'HOURLY_EXITS': 'TOTAL_EXITS'})
Out[18]:
TOTAL_ENTRIES TOTAL_EXITS
DATE
02-01-13 5808765.13 4497579.28

Answer:

The total number of entries & exits across the subway system for February 1, 2013 are 5,808,765 and 4,497,579.

A.3. Let’s define the busy-ness as sum of entry & exit count. What station was the busiest on February 1, 2013? What turnstile was the busiest on that date?

In [19]:
# select Feb 1, 2013 data
df_master_020113 = df_master[df_master.DATE == '02-01-13']
In [21]:
# The busiest station
df_master_020113.groupby('Station').BUSYNESS.sum().sort_values(ascending=False)[:3]
Out[21]:
Station
34 ST-PENN STA     360435.0
42 ST-GRD CNTRL    327422.0
86 ST              208602.0
Name: BUSYNESS, dtype: float64
In [22]:
# the busiest turnstile
idx = df_master_020113['BUSYNESS'].idxmax()
df_master_020113.loc[idx,]
Out[22]:
C/A                         J021
UNIT                        R434
SCP                     00-00-02
DATE                    02-01-13
TIME                    00:00:00
DESC                     REGULAR
ENTRIES              4.43972e+06
EXITS                3.51653e+06
HOURLY_ENTRIES              3645
HOURLY_EXITS                3693
HOUR                           0
MONTH                          2
YEAR                        2013
BUSYNESS                    7338
Station           VAN SICLEN AVE
Line Name                     JZ
Division                     BMT
_merge                      both
Name: 1905847, dtype: object

Answer:

The busiest station on February 1, 2013 is 34 ST-PENN STATION.

The busiest turnstile is C/A J021, UNIT R434, SCP 00-00-02, at VAN SICLEN AVE station.

A.4. What stations have seen the most usage growth/decline in 2013?

Define the change of usage of a station as difference between the average monthly usage in second half year and the average monthly usage in first half year.

So the most usage growth station is the station with the largest change; the most usage decline station is the station with the smallest change. Also do a t-test to observe the test interval of the mean difference.

In [30]:
# calculate the stations' total usage in 2013
df_master_totalusage_2013 = df_master[df_master.YEAR == 2013].groupby(['Station','MONTH'])['BUSYNESS'].sum().reset_index().rename(
    columns = {"BUSYNESS":"TOTAL_USAGE"})

# list of stations
stations = list(set(df_master['Station']))
In [65]:
def test_usage_change(df,stations):
    res = pd.DataFrame(columns = ['Station','mean_diff','confidence_interval'])
    row = 0
    df_firsth = df[df.MONTH.isin([1,2,3,4,5,6])]
    df_lasth = df[df.MONTH.isin([7,8,9,10,11,12])]

    for station in stations:
        firstHalf = list(df_firsth[df_firsth.Station == station].TOTAL_USAGE)
        secondHalf = list(df_lasth[df_lasth.Station == station].TOTAL_USAGE)
        cm = sms.CompareMeans(sms.DescrStatsW(secondHalf),sms.DescrStatsW(firstHalf))
        confit_int = cm.tconfint_diff(usevar='unequal')
        mean_diff = np.mean(secondHalf) - np.mean(firstHalf)

        res.loc[row,'Station'] = station
        res.loc[row,'mean_diff'] = mean_diff
        res.loc[row,'confidence_interval'] = confit_int
        row += 1
    return res
In [66]:
res = test_usage_change(df_master_totalusage_2013,stations)
In [67]:
print(res.sort_values(by="mean_diff")[:3])
            Station mean_diff                         confidence_interval
272    WHITEHALL ST   -317063  (-504388.28706794145, -129737.83452309942)
98   JOURNAL SQUARE   -196321   (-416372.77975003584, 23730.684120606835)
54   HOWARD BCH-JFK   -120662  (-215232.01122646796, -26091.565284261975)
In [68]:
print(res.sort_values(by="mean_diff",ascending = False)[:3])
            Station mean_diff                       confidence_interval
63          Unknown    640587  (165202.74038649764, 1115971.2803469985)
10    BOWLING GREEN    299289   (154954.82184405794, 443622.8789167648)
337  42 ST-TIMES SQ    257084  (-186355.38486461504, 700522.6609060073)

Answer:

WHITEHALL ST stations have seen the most usage decline in 2013.

BOWLING GREEN stations have seen the most usage growth in 2013.

A.5. What dates are the least busy? Could you identify days on which stations were not operating at full capacity or closed entirely?

Assume this question is asking what dates are the least busy in year 2013.

In [23]:
# select year 2013 data
df_master_2013 = df_master[df_master.YEAR == 2013]
In [26]:
def least_busy_day(df):
    """
    calculate the daily sum of BUSYNESS across the subway system and find the 3 least busy days.
    """
    print('The top 3 least busy days:')
    print(df.groupby(['DATE'])['BUSYNESS'].sum().round(2).sort_values()[:3])
In [27]:
least_busy_day(df_master_2013)
The top 3 least busy days:
DATE
12-25-13    3444355.21
01-01-13    3775090.44
11-28-13    4279270.57
Name: BUSYNESS, dtype: float64
In [138]:
def days_with_closed_stations(df):
    """
    Define the station is closed as that station has 0 BUSYNESS on a day. 
    Firstly, for each station calculate the daily busyness.
    Then, find out the days on which any station has 0 daily busyness, also the name count of the closed stations.
    """
    
    daily_sum = df.groupby(['Station','DATE'])['BUSYNESS'].sum(axis = 0).reset_index()

    res = pd.DataFrame(columns = ['DATE','CLOSED_STATION','COUNT_CLOSED_STATION'])
    row = 0
    for name, group in daily_sum[daily_sum.BUSYNESS == 0].groupby('DATE'): 
        res.loc[row,'DATE'] = name
        res.loc[row,'CLOSED_STATION'] = list(group.Station)
        res.loc[row,'COUNT_CLOSED_STATION'] = group.count().Station
        row += 1
    return res
In [139]:
res_closed_stations_2013 = days_with_closed_stations(df_master_2013)
print('First 5 days in 2013 on which stations are entirely closed.')
res_closed_stations_2013.head()
First 5 days in 2013 on which stations are entirely closed.
Out[139]:
DATE CLOSED_STATION COUNT_CLOSED_STATION
0 01-05-13 [BEACH 90 ST] 1
1 01-06-13 [14TH STREET, 9TH STREET, KINGSTON AVE, TWENTY... 4
2 01-08-13 [BEACH 90 ST] 1
3 01-09-13 [BEACH 90 ST] 1
4 01-10-13 [BEACH 90 ST] 1
In [ ]:
# print(list(res_closed_stations_2013.DATE))
In [30]:
def days_with_not_fully_operat_station(df):
    """
    In MTA dataset, 'C/A' variable is the indicator of 'booth'. A station may have multiple booths, 
    and each booth consist of multiple devices (ie. SCP). 
    Define the station is not fully operating as that at least 1 booth in that station has 0 busyness over a day. 
    Firstly, for each booth(ie.unique 'Station','C/A') calculate the daily busyness.
    Then, find out the days on which any station is not fully operated, also the name and count of those stations.
    """
    
    daily_booth_sum = df.groupby(['Station','C/A','DATE'])['BUSYNESS'].sum().reset_index()
    daily_countclosebooth = daily_booth_sum.groupby(['Station','DATE']).apply(
        lambda column: (column == 0).sum()).BUSYNESS.reset_index().rename(columns={'BUSYNESS':'CNT_CLOSE_BOOTH'})
    
    res = pd.DataFrame(columns = ['DATE','NOT_FULLY_OPERATE_STATION','CNT_NOT_FULLY_OPERATE_STATION'])
    row = 0
    for name, group in daily_countclosebooth[daily_countclosebooth.CNT_CLOSE_BOOTH > 0].groupby('DATE'): 
        res.loc[row,'DATE'] = name
        res.loc[row,'NOT_FULLY_OPERATE_STATION'] = list(group.Station)
        res.loc[row,'CNT_NOT_FULLY_OPERATE_STATION'] = group.count().Station
        row += 1
    return res
In [32]:
res_not_fully_operate_station_2013 = days_with_not_fully_operat_station(df_master_2013)
print('First 5 days in 2013 on which stations are not fully operated.')
res_not_fully_operate_station_2013.head()
First 5 days in 2013 on which stations are not fully operated.
Out[32]:
DATE NOT_FULLY_OPERATE_STATION CNT_NOT_FULLY_OPERATE_STATION
0 01-01-13 [161 ST-YANKEE, 28 ST, 42 ST-GRD CNTRL, 51 ST,... 8
1 01-02-13 [Unknown] 1
2 01-03-13 [Unknown] 1
3 01-04-13 [Unknown] 1
4 01-05-13 [161 ST-YANKEE, 34 ST-PENN STA, 42 ST-GRD CNTR... 9
In [ ]:
# print(list(res_not_fully_operate_station_2013.DATE))

Answer:

The least busy day in 2013 is December 25, 2013.

The days on which stations were not operating at full capacity or closed entirely could be find in resulting datasets(top 5 rows showed as example).

A.Bonus. What hour is the busiest for station CANAL ST in Q1 2013?

In [40]:
# select data for station CANAL ST in Q1 2013
df_CANAL_2013Q1 = df_master_2013[(df_master_2013.MONTH.isin([1,2,3]))&(df_master_2013.Station == 'CANAL ST')]
In [72]:
plot_hourly_busyness = df_CANAL_2013Q1.groupby(['HOUR']).BUSYNESS.mean().round(2).sort_values(ascending = False)
plot_hourly_busyness[:3]
Out[72]:
HOUR
21    676.02
17    672.86
20    672.17
Name: BUSYNESS, dtype: float64

Answer:

HOUR 21 is the busiest hour for staion CANAL ST in Q1 2013

B.1. Plot the daily row counts for data files in Q1 2013.

In original data, there are multiple (8) data points included in each row.

So I firstly converted the original data to long form data set where there is only one data point per row, assuming we more care about how many data points in Q1 2013. Then, after filtering irregular data points from the long form data, I count the rows for each day, which is, in other word, the count of regular data points.

Just in case, we also want the row counts for the original wide form data files. That number should be around 1/8 of the count of data points. (Also, in our downloading log, we recorded the number of rows of the original datafile).

In [82]:
# select data for 2013Q1
df_master_2013q1 = df_master_2013[df_master_2013.MONTH.isin([1,2,3])]
In [84]:
plot_daily_count = df_master_2013q1.groupby(['DATE']).count().iloc[:,0].reset_index().rename(
    columns={'C/A':'row_counts'})
In [85]:
plot_daily_count.tail()
Out[85]:
DATE row_counts
85 03-27-13 27213
86 03-28-13 27528
87 03-29-13 26786
88 03-30-13 26944
89 03-31-13 26858
In [106]:
fig, ax = plt.subplots(figsize=(10, 30))
y_pos = np.arange(plot_daily_count.shape[0])

ax.barh(y_pos, plot_daily_count.row_counts, align='center',
        color='coral', ecolor='black')

# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_daily_count.DATE)
ax.set_ylim((-1, plot_daily_count.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Row Counts')
ax.set_ylabel('Date')
ax.set_title('Daily Row Counts for Q1 2013')

# add value label for each bar
for i in ax.patches:
    ax.text(i.get_width()+.3, i.get_y()+.38,'{0:.2f}k'.format(i.get_width()/1000) , fontsize=10,color='black')
    

B.2. Plot the daily total number of entries & exits across the system for Q1 2013.

In [96]:
plot_daily_entries_exits = \
df_master_2013q1.groupby(['DATE']).sum()[['HOURLY_ENTRIES','HOURLY_EXITS']].rename(
    columns = {'HOURLY_ENTRIES':'daily_entries','HOURLY_EXITS':'daily_exits'}).reset_index()
In [97]:
plot_daily_entries_exits.head()
Out[97]:
DATE daily_entries daily_exits
0 01-01-13 2.063555e+06 1.711535e+06
1 01-02-13 6.073735e+06 4.848887e+06
2 01-03-13 6.248716e+06 4.995638e+06
3 01-04-13 6.180395e+06 4.962462e+06
4 01-05-13 3.189770e+06 2.593668e+06
In [105]:
fig, ax = plt.subplots(figsize=(10, 40))

ind = np.arange(plot_daily_entries_exits.shape[0])  # the y locations for the groups
width = 0.4 # the width of the bars

rects1 = ax.barh(ind, plot_daily_entries_exits.daily_entries, width, color='coral')
rects2 = ax.barh(ind + width, plot_daily_entries_exits.daily_exits, width, color='tan')

# add some text for labels, title and axes ticks
ax.set_xlabel('Count')
ax.set_title('Daily Total Number of Entries and Exits')
ax.set_yticks(ind + width / 2)
ax.set_yticklabels(plot_daily_count.DATE)
ax.set_ylim((-1, plot_daily_count.shape[0]))
ax.invert_yaxis()

# add value label for each bar    
def add_value_label(rects):
    for i in rects:
        height = i.get_width()
        ax.text(i.get_width()+.3, i.get_y()+.38,
                '{0:.2f}m'.format(height/1000000),fontsize=10,color='black')
        
add_value_label(rects1)
add_value_label(rects2)

Note: The chart shows daily entries always greater than daily exits. This actually make sense on a second thought, since many people may use emergency exit door when they get out of the station, hence they are not counted by any turnstile.

B.3. Plot the mean and standard deviation of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.

In [107]:
# get data from 34ST-PENN STA in Q1 2013
df_34penn_2013q1 = df_master_2013q1[df_master_2013q1.Station == '34 ST-PENN STA']

# calculate the daily entries and exits for Q1 2013
daily_entries_exits_34penn_2013q1 = df_34penn_2013q1.groupby(['MONTH','DATE'])[['HOURLY_ENTRIES','HOURLY_EXITS']]\
    .sum().rename(columns = {'HOURLY_ENTRIES':'daily_entries','HOURLY_EXITS':'daily_exits'}).reset_index()
In [108]:
# calculate the mean and standard deviation of the daily total number of entries & exits for each month
plot_mean_err = daily_entries_exits_34penn_2013q1.groupby('MONTH')['daily_entries','daily_exits'].agg(
    [np.mean,np.std]).reset_index()
In [109]:
plot_mean_err
Out[109]:
MONTH daily_entries daily_exits
mean std mean std
0 1 142796.123376 46750.193938 125120.611422 38360.966903
1 2 145029.237484 48820.965770 126102.077118 40653.041929
2 3 147582.452106 43576.112025 130827.580553 35225.521156
In [121]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10, 5),sharey = True)

ind = list(plot_mean_err.MONTH)

ax1.errorbar(x=ind,y=list(plot_mean_err.daily_entries['mean']), \
             yerr=list(plot_mean_err.daily_entries['std']),fmt='o',capsize = 5)
ax1.set_ylabel('Daily Entries')
ax1.set_xlabel('MONTH')
ax1.set_xticks(ind)

ax2.errorbar(x=ind,y=list(plot_mean_err.daily_exits['mean']),\
             yerr=list(plot_mean_err.daily_exits['std']),fmt='o',capsize = 5)
ax2.set_ylabel('Daily Exits')
ax2.set_xlabel('MONTH')
ax2.set_xticks(ind)

fig.suptitle('Mean and Standard Deviation of the daily total number of entries & exits')
Out[121]:
Text(0.5,0.98,'Mean and Standard Deviation of the daily total number of entries & exits')

B.4. Plot 25/50/75 percentile of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.

In [111]:
plot_entries_stat = daily_entries_exits_34penn_2013q1.groupby('MONTH').daily_entries.describe(percentiles =[0.25,0.75])
plot_entries_stat
Out[111]:
count mean std min 25% 50% 75% max
MONTH
1 31.0 142796.123376 46750.193938 61601.00000 86212.5 170646.000000 173842.000000 188766.702635
2 28.0 145029.237484 48820.965770 53684.00000 91983.0 176177.324778 178376.500000 190943.000000
3 31.0 147582.452106 43576.112025 66498.11801 96723.5 173828.000000 177832.448637 191379.000000
In [117]:
def plot_25_50_75_perc(summary,var):
    fig, ax = plt.subplots(figsize=(5, 5))
    err = np.vstack([list(summary['25%']),list(summary['75%'])])
    ax.errorbar(x=ind,y=list(summary['50%']), \
                yerr=err, fmt='o',capsize = 5)
    
    ax.set_title('Plot of 25/50/75 percentile of the daily total {} for each month in Q1 2013'.format(var))
    ax.set_xlabel('MONTH')
    ax.set_xticks(ind)
In [118]:
plot_25_50_75_perc(plot_entries_stat,'ENTRIES')
In [119]:
plot_exits_stat = daily_entries_exits_34penn_2013q1.groupby('MONTH').daily_exits.describe(percentiles =[0.25,0.75])
plot_exits_stat
Out[119]:
count mean std min 25% 50% 75% max
MONTH
1 31.0 125120.611422 38360.966903 59263.000000 80206.0 145841.0 151670.50 165758.651362
2 28.0 126102.077118 40653.041929 50916.000000 84468.0 149998.0 154088.25 169492.000000
3 31.0 130827.580553 35225.521156 61100.901727 94321.0 148672.0 155684.50 167867.000000
In [120]:
plot_25_50_75_perc(plot_exits_stat,'EXITS')

B.5 Plot the daily number of closed stations and number of stations that were not operating at full capacity in Q1 2013.

Let's use the two result datasets from part A question 4: res_closed_stations_2013 & res_not_fully_operate_station_2013.

In [186]:
# select data for Q1 2013
df_closed_station_2013q1 = res_closed_stations_2013[res_closed_stations_2013.DATE.apply(
    lambda x:pd.to_datetime(x)) < '2013-04-01']

# convert DATE variable to date format
df_closed_station_2013q1['DATE'] = df_closed_station_2013q1['DATE'].map(
    lambda x: pd.to_datetime(x))
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
In [187]:
# fill in missing date to the data
start = datetime(2013,1,1)
end = datetime(2013,3,31)
ts = pd.DataFrame(pd.date_range(start,end, freq='D'),columns= ["DATE"])

df_time = pd.merge(ts, plot_closed_station_2013q1[['DATE','COUNT_CLOSED_STATION']], how='left', on=['DATE']).fillna(0)
In [189]:
# plot the daily number of closed station
fig, ax = plt.subplots(figsize=(10, 20))
y_pos = np.arange(plot_daily_count.shape[0])

ax.barh(y_pos, df_time.COUNT_CLOSED_STATION, align='center',
        color='coral', ecolor='black')

# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(df_time.DATE.apply(lambda x: datetime.strftime(x, '%m-%d-%y')))
ax.set_ylim((-1, df_time.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Count of closed station')
ax.set_ylabel('Date')
ax.set_title('Daily count of closed station for Q1 2013')

# add value label for each bar
for i in ax.patches:
    ax.text(i.get_width()+.1, i.get_y()+.38,'{}'.format(i.get_width()) , fontsize=10,color='black')
In [190]:
# select data for Q1 2013
df_notfullyoper_station_2013q1 = res_not_fully_operate_station_2013[res_not_fully_operate_station_2013.DATE.apply(
    lambda x:pd.to_datetime(x)) < '2013-04-01']

# convert DATE variable to date format
df_notfullyoper_station_2013q1['DATE'] = df_notfullyoper_station_2013q1['DATE'].map(
    lambda x: pd.to_datetime(x))
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/ipykernel_launcher.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
In [192]:
df_notfullyoper_station_2013q1.head()
Out[192]:
DATE NOT_FULLY_OPERATE_STATION CNT_NOT_FULLY_OPERATE_STATION
0 2013-01-01 [161 ST-YANKEE, 28 ST, 42 ST-GRD CNTRL, 51 ST,... 8
1 2013-01-02 [Unknown] 1
2 2013-01-03 [Unknown] 1
3 2013-01-04 [Unknown] 1
4 2013-01-05 [161 ST-YANKEE, 34 ST-PENN STA, 42 ST-GRD CNTR... 9
In [193]:
# fill in missing date to the data
df_time_notfullyoper = pd.merge(ts, df_notfullyoper_station_2013q1[['DATE','CNT_NOT_FULLY_OPERATE_STATION']], how='left', on=['DATE']).fillna(0)
In [196]:
# plot the daily number of closed station
fig, ax = plt.subplots(figsize=(10, 20))
y_pos = np.arange(df_time_notfullyoper.shape[0])

ax.barh(y_pos, df_time_notfullyoper.CNT_NOT_FULLY_OPERATE_STATION, align='center',
        color='coral', ecolor='black')

# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(df_time_notfullyoper.DATE.apply(lambda x: datetime.strftime(x, '%m-%d-%y')))
ax.set_ylim((-1, df_time.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Count of not fully operated station')
ax.set_ylabel('Date')
ax.set_title('Daily count of not fully operated  station for Q1 2013')

# add value label for each bar
for i in ax.patches:
    ax.text(i.get_width()+.1, i.get_y()+.38,'{}'.format(i.get_width()) , fontsize=10,color='black')