Which station has the most number of units?
What is the total number of entries & exits across the subway system for February 1, 2013?
Let’s define the busy-ness as sum of entry & exit count. What station was the busiest on February 1, 2013? What turnstile was the busiest on that date?
What stations have seen the most usage growth/decline in 2013?
What dates are the least busy? Could you identify days on which stations were not operating at full capacity or closed entirely?
Bonus: What hour is the busiest for station CANAL ST in Q1 2013?
Plot the daily row counts for data files in Q1 2013.
Plot the daily total number of entries & exits across the system for Q1 2013.
Plot the mean and standard deviation of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.
Plot 25/50/75 percentile of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.
Plot the daily number of closed stations and number of stations that were not operating at full capacity in Q1 2013.
Notes/Caveats on data definition:
# Read HTML module
from pyquery import PyQuery as pq
# common modules
import os
import re
import pandas as pd
import numpy as np
from datetime import datetime
import statsmodels.stats.api as sms
# plot modules
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')
# random seed
seed = 2
I assume this question asks as of May 19 2018, which station has the most number of units. So for this question we use this data: http://web.mta.info/developers/data/nyct/turnstile/turnstile_180519.txt.
Field description can be found from http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt.
# download the data
path = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_180519.txt'
col_name_new = ['C/A','UNIT','SCP','STATION','LINENAME','DIVISION','DATE','TIME','DESC','ENTRIES','EXITS']
MTA_data_20180519w = pd.read_csv(path, sep=",", header=0, names = col_name_new)
MTA_data_20180519w.head()
Calculate the number of unique Unit for each Station on 05/18/2018 and select the Station having the most number of Unit.
# MTA_data_20180518 = MTA_data_20180519w.loc[MTA_data_20180519w.DATE == '05/18/2018',['STATION','UNIT']]
MTA_data_20180519w[MTA_data_20180519w.DATE == '05/18/2018'].groupby(
['STATION'],sort = False).nunique().UNIT.sort_values().tail(1)
23 ST station having 6 units is the station has the most number of units.
For following questions, I first download data for entire year 2013, convert, clean and wrangle the data, and then do the required analysis.
Concretely, below steps are used on the raw data:
Field description for data prior to 10/18/14 can be found in http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt
Station, Unit, Control Area information can be found in http://web.mta.info/developers/resources/nyct/turnstile/Remote-Booth-Station.xls
# download dataset from MTA webset
def download_data(url,start,stop,log_name):
"""
This function download selected datasets(prior 10/18/14) from MTA website, by finding the selected datasets by HTML elements.
Then consolidate all the downloaded datasets to one large dataframe.
Also, log the downloaded files information, such as file date, file_link, number of rows, and number of columns.
"""
col_name = ['C/A','UNIT','SCP','DATE1','TIME1','DESC1','ENTRIES1','EXITS1','DATE2','TIME2','DESC2','ENTRIES2','EXITS2',
'DATE3','TIME3','DESC3','ENTRIES3','EXITS3','DATE4','TIME4','DESC4','ENTRIES4','EXITS4',
'DATE5','TIME5','DESC5','ENTRIES5','EXITS5','DATE6','TIME6','DESC6','ENTRIES6','EXITS6',
'DATE7','TIME7','DESC7','ENTRIES7','EXITS7','DATE8','TIME8','DESC8','ENTRIES8','EXITS8']
df = pd.DataFrame(columns = col_name)
# Log of downloading
logFile = open(log_name, 'w') # 'Download_Log.txt'
logFile.write('file_date;file_link;number_of_rows;number_of_columns')
# get HTML code
jpy = pq(url)
for i in range(start, stop, 2):
# retrieve file path
item = jpy('#contentbox > div > div > a:nth-child({})'.format(i))
fileDate = item.text()
filePath = 'http://web.mta.info/developers/' + item.attr('href')
# get data from txt, save to csv, and append to datafram
data = pd.read_csv(filePath, sep=",", header=None, names = col_name)
df = df.append(data, ignore_index=True)
# data.to_csv('./raw_data/MTA_data_{}.csv'.format(re.sub(r',','',fileDate)))
# log the downloaded file information
record = str(fileDate) + ';' + str(filePath) + ';' + str(data.shape[0]) + ';' + str(data.shape[1])
logFile.write('\n' + record)
logFile.close()
return df
# convert wide form dataset to long form dataset
def convert_wide_to_long(df):
"""
In original data, there are multiple data points included in each row.
This function is to convert the wide form DF to long form DF where there is only one data point per row.
Also, set the 'ENTRIES' and 'EXITS' variables type to float.
"""
col_name_long = ['C/A','UNIT','SCP','DATE','TIME','DESC','ENTRIES','EXITS']
df_long = pd.DataFrame(columns = col_name_long)
for i in range(0,8):
ind = list(range(0,3)) + [5*i+3, 5*i+4, 5*i+5, 5*i+6, 5*i+7]
temp = df.iloc[:,ind]
temp.columns = col_name_long
df_long = df_long.append(temp, ignore_index = True)
df_long = df_long.sort_values(['C/A','UNIT','SCP','DATE','TIME'])
df_long[['ENTRIES','EXITS']] = df_long[['ENTRIES','EXITS']].apply(pd.to_numeric)
return df_long.reset_index(drop = True).dropna(axis=0)
def remove_irregular_event(df):
"""
In the original data, there are factors that may impact the data. (ie. Hardware failure, "IRREGULAR" audit event)
Clean the data to filter out 'IRREGULAR' audit event.
"""
# Remove records where DESC (audit event) != REGULAR
df = df[df.DESC == 'REGULAR']
return df
def add_hourly_entries(df):
"""
The 'ENTRIES' variable recorded in the MTA data are cumulative entries of the turnstile per row.
Considering the data for a single turnstile machine (unique SCP, C/A, and UNIT),
we want to add a new column symbolizing the incremental number of entries since the last recording time.
This function is to add a new column, calculate the difference between ENTRIES in the current row and the
previous row, and assign the difference to the new column. When there is NaN, fill it with 0.
"""
HOURLY_ENTRIES = df.ENTRIES - df.ENTRIES.shift(1)
df['HOURLY_ENTRIES'] = HOURLY_ENTRIES.fillna(0)
return df
def add_hourly_exits(df):
"""
The 'EXITS' variable recorded in the MTA data are cumulative exits of the turnstile per row.
Considering the data for a single turnstile machine (unique SCP, C/A, and UNIT),
we want to add a new column symbolizing the incremental number of exits since the last recording time.
This function is to add a new column, calculate the difference between EXITS in the current row and the
previous row, and assign the difference to the new column. When there is NaN, fill it with 0.
"""
HOURLY_EXITS = df.EXITS - df.EXITS.shift(1)
df['HOURLY_EXITS'] = HOURLY_EXITS.fillna(0)
return df
def add_busyness(df):
"""
Define busyness as sum of ebtries and exits. Add a new column and assign busyness to it.
"""
BUSYNESS = df.HOURLY_ENTRIES + df.HOURLY_EXITS
df['BUSYNESS'] = BUSYNESS
return df
def time_to_hour(time):
"""
Input 00:00:00 (hour:minute:second).
Extract and return the hour from input.
"""
# return pd.to_datetime(time).hour
return int(time.split(':')[0])
def date_to_month(date):
"""
Input mm-dd-yy (month-day-year).
Extract and return the month from input date.
"""
# return pd.to_datetime(date).month
return int(date.split('-')[0])
def date_to_year(date):
"""
Input sting mm-dd-yy (month-day-year).
Extract and return the year from input date.
"""
# return pd.to_datetime(date)
return 2000 + int(date.split('-')[2])
def wrangle_MTA_data(url,start,stop,log_name,date_tag):
# download data
df = download_data(url,start,stop,log_name)
df.to_csv('MTA_data_{}.csv'.format(date_tag))
# convert wide form data to long form
df_long=convert_wide_to_long(df)
df_long.to_csv('MTA_data_long_{}.csv'.format(date_tag))
# filter illegitimate data
df_long = remove_irregular_event(df_long)
df_long.to_csv('MTA_data_regular_{}.csv'.format(date_tag))
# add hourly incremental entries
df_long = df_long.groupby(['C/A','UNIT','SCP']).apply(
add_hourly_entries)
# add hourly incremental exits
df_long = df_long.groupby(['C/A','UNIT','SCP']).apply(
add_hourly_exits)
# add a 'HOUR', 'MONTH' and 'YEAR' column
df_long['HOUR'] = df_long['TIME'].map(time_to_hour)
df_long['MONTH'] = df_long['DATE'].map(date_to_month)
df_long['YEAR'] = df_long['DATE'].map(date_to_year)
df_long.to_csv('MTA_data_hour_{}.csv'.format(date_tag))
return df_long
url = "http://web.mta.info/developers/turnstile.html"
# as of May 25, the start=459 is the December 28, 2013 data , and stop = 563 is the January 05, 2013 data
# change start and stop parameters if necessary when using this code at other times
start = 459
stop = 564
log_name = 'Download_Log.txt'
date_tag = 'y2013' # change here accordingly
df_long = wrangle_MTA_data(url,start,stop,log_name,date_tag)
print(df_long.shape)
df_long.head()
def clean_data(df):
"""
HOURLY_ENTRIES and HOURLY_EXITS contains negative value and abnormally large value (ie. the max is 931476882).
Assuming it takes 1 second for 1 people enter the turnstile,
there can be at max 14,400 people entering turnstile in 4 hours.
So in theory, considering a buffer, any HOURLY_ENTRIES (or HOURLY_EXITS) greater than 20000 is not possible.
Also HOURLY_ENTRIES and HOURLY_EXITS obviously cannot be negative.
This function replace the negative and greater than 20000 HOURLY_ENTRIES by the mean of the group(ie.SCP,MONTH) that they are in.
Then calculate the "BUSYNESS".
"""
# clean 'HOURLY_ENTRIES'
df['HOURLY_ENTRIES'] = df.groupby(['SCP','MONTH']).HOURLY_ENTRIES.transform(
lambda x: np.where((x<0)|(x>20000),x.mask((x<0)|(x>20000)).mean(),x))
# clean 'HOURLY_EXITS'
df['HOURLY_EXITS'] = df.groupby(['SCP','MONTH']).HOURLY_EXITS.transform(
lambda x: np.where((x<0)|(x>20000),x.mask((x<0)|(x>20000)).mean(),x))
# add busyness
df['BUSYNESS'] = df.HOURLY_ENTRIES + df.HOURLY_EXITS
return df
df_final = clean_data(df_long)
Now, we could load the Station, Unit, Control Area mapping information from a csv file, which can be downloaded from MTA website as well.
Fields in this data:
# get the Station, Unit, Control Area mapping
mapping = pd.read_csv('Remote-Booth-Station.csv',header = 0,names = ['UNIT','C/A','Station','Line Name','Division'])
# Merge station information to the long form dataset
# for turnstiles that don't exist in mapping file, tag their station information as 'Unknown'
df_master = pd.merge(df_final, mapping, how='left', on=['C/A','UNIT'], sort=True, copy=True, indicator=True)
df_master.loc[df_master._merge =='left_only',['Station','Line Name','Division']] = 'Unknown'
# save to csv
df_master.to_csv('MTA_data_master_{}.csv'.format('y2013'))
df_master[df_master.DATE == '02-01-13'].groupby('DATE')['HOURLY_ENTRIES','HOURLY_EXITS'].sum().round(2).rename(
columns = {'HOURLY_ENTRIES': "TOTAL_ENTRIES",'HOURLY_EXITS': 'TOTAL_EXITS'})
The total number of entries & exits across the subway system for February 1, 2013 are 5,808,765 and 4,497,579.
# select Feb 1, 2013 data
df_master_020113 = df_master[df_master.DATE == '02-01-13']
# The busiest station
df_master_020113.groupby('Station').BUSYNESS.sum().sort_values(ascending=False)[:3]
# the busiest turnstile
idx = df_master_020113['BUSYNESS'].idxmax()
df_master_020113.loc[idx,]
The busiest station on February 1, 2013 is 34 ST-PENN STATION.
The busiest turnstile is C/A J021, UNIT R434, SCP 00-00-02, at VAN SICLEN AVE station.
Define the change of usage of a station as difference between the average monthly usage in second half year and the average monthly usage in first half year.
So the most usage growth station is the station with the largest change; the most usage decline station is the station with the smallest change. Also do a t-test to observe the test interval of the mean difference.
# calculate the stations' total usage in 2013
df_master_totalusage_2013 = df_master[df_master.YEAR == 2013].groupby(['Station','MONTH'])['BUSYNESS'].sum().reset_index().rename(
columns = {"BUSYNESS":"TOTAL_USAGE"})
# list of stations
stations = list(set(df_master['Station']))
def test_usage_change(df,stations):
res = pd.DataFrame(columns = ['Station','mean_diff','confidence_interval'])
row = 0
df_firsth = df[df.MONTH.isin([1,2,3,4,5,6])]
df_lasth = df[df.MONTH.isin([7,8,9,10,11,12])]
for station in stations:
firstHalf = list(df_firsth[df_firsth.Station == station].TOTAL_USAGE)
secondHalf = list(df_lasth[df_lasth.Station == station].TOTAL_USAGE)
cm = sms.CompareMeans(sms.DescrStatsW(secondHalf),sms.DescrStatsW(firstHalf))
confit_int = cm.tconfint_diff(usevar='unequal')
mean_diff = np.mean(secondHalf) - np.mean(firstHalf)
res.loc[row,'Station'] = station
res.loc[row,'mean_diff'] = mean_diff
res.loc[row,'confidence_interval'] = confit_int
row += 1
return res
res = test_usage_change(df_master_totalusage_2013,stations)
print(res.sort_values(by="mean_diff")[:3])
print(res.sort_values(by="mean_diff",ascending = False)[:3])
WHITEHALL ST stations have seen the most usage decline in 2013.
BOWLING GREEN stations have seen the most usage growth in 2013.
Assume this question is asking what dates are the least busy in year 2013.
# select year 2013 data
df_master_2013 = df_master[df_master.YEAR == 2013]
def least_busy_day(df):
"""
calculate the daily sum of BUSYNESS across the subway system and find the 3 least busy days.
"""
print('The top 3 least busy days:')
print(df.groupby(['DATE'])['BUSYNESS'].sum().round(2).sort_values()[:3])
least_busy_day(df_master_2013)
def days_with_closed_stations(df):
"""
Define the station is closed as that station has 0 BUSYNESS on a day.
Firstly, for each station calculate the daily busyness.
Then, find out the days on which any station has 0 daily busyness, also the name count of the closed stations.
"""
daily_sum = df.groupby(['Station','DATE'])['BUSYNESS'].sum(axis = 0).reset_index()
res = pd.DataFrame(columns = ['DATE','CLOSED_STATION','COUNT_CLOSED_STATION'])
row = 0
for name, group in daily_sum[daily_sum.BUSYNESS == 0].groupby('DATE'):
res.loc[row,'DATE'] = name
res.loc[row,'CLOSED_STATION'] = list(group.Station)
res.loc[row,'COUNT_CLOSED_STATION'] = group.count().Station
row += 1
return res
res_closed_stations_2013 = days_with_closed_stations(df_master_2013)
print('First 5 days in 2013 on which stations are entirely closed.')
res_closed_stations_2013.head()
# print(list(res_closed_stations_2013.DATE))
def days_with_not_fully_operat_station(df):
"""
In MTA dataset, 'C/A' variable is the indicator of 'booth'. A station may have multiple booths,
and each booth consist of multiple devices (ie. SCP).
Define the station is not fully operating as that at least 1 booth in that station has 0 busyness over a day.
Firstly, for each booth(ie.unique 'Station','C/A') calculate the daily busyness.
Then, find out the days on which any station is not fully operated, also the name and count of those stations.
"""
daily_booth_sum = df.groupby(['Station','C/A','DATE'])['BUSYNESS'].sum().reset_index()
daily_countclosebooth = daily_booth_sum.groupby(['Station','DATE']).apply(
lambda column: (column == 0).sum()).BUSYNESS.reset_index().rename(columns={'BUSYNESS':'CNT_CLOSE_BOOTH'})
res = pd.DataFrame(columns = ['DATE','NOT_FULLY_OPERATE_STATION','CNT_NOT_FULLY_OPERATE_STATION'])
row = 0
for name, group in daily_countclosebooth[daily_countclosebooth.CNT_CLOSE_BOOTH > 0].groupby('DATE'):
res.loc[row,'DATE'] = name
res.loc[row,'NOT_FULLY_OPERATE_STATION'] = list(group.Station)
res.loc[row,'CNT_NOT_FULLY_OPERATE_STATION'] = group.count().Station
row += 1
return res
res_not_fully_operate_station_2013 = days_with_not_fully_operat_station(df_master_2013)
print('First 5 days in 2013 on which stations are not fully operated.')
res_not_fully_operate_station_2013.head()
# print(list(res_not_fully_operate_station_2013.DATE))
The least busy day in 2013 is December 25, 2013.
The days on which stations were not operating at full capacity or closed entirely could be find in resulting datasets(top 5 rows showed as example).
# select data for station CANAL ST in Q1 2013
df_CANAL_2013Q1 = df_master_2013[(df_master_2013.MONTH.isin([1,2,3]))&(df_master_2013.Station == 'CANAL ST')]
plot_hourly_busyness = df_CANAL_2013Q1.groupby(['HOUR']).BUSYNESS.mean().round(2).sort_values(ascending = False)
plot_hourly_busyness[:3]
HOUR 21 is the busiest hour for staion CANAL ST in Q1 2013
In original data, there are multiple (8) data points included in each row.
So I firstly converted the original data to long form data set where there is only one data point per row, assuming we more care about how many data points in Q1 2013. Then, after filtering irregular data points from the long form data, I count the rows for each day, which is, in other word, the count of regular data points.
Just in case, we also want the row counts for the original wide form data files. That number should be around 1/8 of the count of data points. (Also, in our downloading log, we recorded the number of rows of the original datafile).
# select data for 2013Q1
df_master_2013q1 = df_master_2013[df_master_2013.MONTH.isin([1,2,3])]
plot_daily_count = df_master_2013q1.groupby(['DATE']).count().iloc[:,0].reset_index().rename(
columns={'C/A':'row_counts'})
plot_daily_count.tail()
fig, ax = plt.subplots(figsize=(10, 30))
y_pos = np.arange(plot_daily_count.shape[0])
ax.barh(y_pos, plot_daily_count.row_counts, align='center',
color='coral', ecolor='black')
# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(plot_daily_count.DATE)
ax.set_ylim((-1, plot_daily_count.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Row Counts')
ax.set_ylabel('Date')
ax.set_title('Daily Row Counts for Q1 2013')
# add value label for each bar
for i in ax.patches:
ax.text(i.get_width()+.3, i.get_y()+.38,'{0:.2f}k'.format(i.get_width()/1000) , fontsize=10,color='black')
plot_daily_entries_exits = \
df_master_2013q1.groupby(['DATE']).sum()[['HOURLY_ENTRIES','HOURLY_EXITS']].rename(
columns = {'HOURLY_ENTRIES':'daily_entries','HOURLY_EXITS':'daily_exits'}).reset_index()
plot_daily_entries_exits.head()
fig, ax = plt.subplots(figsize=(10, 40))
ind = np.arange(plot_daily_entries_exits.shape[0]) # the y locations for the groups
width = 0.4 # the width of the bars
rects1 = ax.barh(ind, plot_daily_entries_exits.daily_entries, width, color='coral')
rects2 = ax.barh(ind + width, plot_daily_entries_exits.daily_exits, width, color='tan')
# add some text for labels, title and axes ticks
ax.set_xlabel('Count')
ax.set_title('Daily Total Number of Entries and Exits')
ax.set_yticks(ind + width / 2)
ax.set_yticklabels(plot_daily_count.DATE)
ax.set_ylim((-1, plot_daily_count.shape[0]))
ax.invert_yaxis()
# add value label for each bar
def add_value_label(rects):
for i in rects:
height = i.get_width()
ax.text(i.get_width()+.3, i.get_y()+.38,
'{0:.2f}m'.format(height/1000000),fontsize=10,color='black')
add_value_label(rects1)
add_value_label(rects2)
Note: The chart shows daily entries always greater than daily exits. This actually make sense on a second thought, since many people may use emergency exit door when they get out of the station, hence they are not counted by any turnstile.
# get data from 34ST-PENN STA in Q1 2013
df_34penn_2013q1 = df_master_2013q1[df_master_2013q1.Station == '34 ST-PENN STA']
# calculate the daily entries and exits for Q1 2013
daily_entries_exits_34penn_2013q1 = df_34penn_2013q1.groupby(['MONTH','DATE'])[['HOURLY_ENTRIES','HOURLY_EXITS']]\
.sum().rename(columns = {'HOURLY_ENTRIES':'daily_entries','HOURLY_EXITS':'daily_exits'}).reset_index()
# calculate the mean and standard deviation of the daily total number of entries & exits for each month
plot_mean_err = daily_entries_exits_34penn_2013q1.groupby('MONTH')['daily_entries','daily_exits'].agg(
[np.mean,np.std]).reset_index()
plot_mean_err
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10, 5),sharey = True)
ind = list(plot_mean_err.MONTH)
ax1.errorbar(x=ind,y=list(plot_mean_err.daily_entries['mean']), \
yerr=list(plot_mean_err.daily_entries['std']),fmt='o',capsize = 5)
ax1.set_ylabel('Daily Entries')
ax1.set_xlabel('MONTH')
ax1.set_xticks(ind)
ax2.errorbar(x=ind,y=list(plot_mean_err.daily_exits['mean']),\
yerr=list(plot_mean_err.daily_exits['std']),fmt='o',capsize = 5)
ax2.set_ylabel('Daily Exits')
ax2.set_xlabel('MONTH')
ax2.set_xticks(ind)
fig.suptitle('Mean and Standard Deviation of the daily total number of entries & exits')
B.4. Plot 25/50/75 percentile of the daily total number of entries & exits for each month in Q1 2013 for station 34 ST-PENN STA.
plot_entries_stat = daily_entries_exits_34penn_2013q1.groupby('MONTH').daily_entries.describe(percentiles =[0.25,0.75])
plot_entries_stat
def plot_25_50_75_perc(summary,var):
fig, ax = plt.subplots(figsize=(5, 5))
err = np.vstack([list(summary['25%']),list(summary['75%'])])
ax.errorbar(x=ind,y=list(summary['50%']), \
yerr=err, fmt='o',capsize = 5)
ax.set_title('Plot of 25/50/75 percentile of the daily total {} for each month in Q1 2013'.format(var))
ax.set_xlabel('MONTH')
ax.set_xticks(ind)
plot_25_50_75_perc(plot_entries_stat,'ENTRIES')
plot_exits_stat = daily_entries_exits_34penn_2013q1.groupby('MONTH').daily_exits.describe(percentiles =[0.25,0.75])
plot_exits_stat
plot_25_50_75_perc(plot_exits_stat,'EXITS')
Let's use the two result datasets from part A question 4: res_closed_stations_2013 & res_not_fully_operate_station_2013.
# select data for Q1 2013
df_closed_station_2013q1 = res_closed_stations_2013[res_closed_stations_2013.DATE.apply(
lambda x:pd.to_datetime(x)) < '2013-04-01']
# convert DATE variable to date format
df_closed_station_2013q1['DATE'] = df_closed_station_2013q1['DATE'].map(
lambda x: pd.to_datetime(x))
# fill in missing date to the data
start = datetime(2013,1,1)
end = datetime(2013,3,31)
ts = pd.DataFrame(pd.date_range(start,end, freq='D'),columns= ["DATE"])
df_time = pd.merge(ts, plot_closed_station_2013q1[['DATE','COUNT_CLOSED_STATION']], how='left', on=['DATE']).fillna(0)
# plot the daily number of closed station
fig, ax = plt.subplots(figsize=(10, 20))
y_pos = np.arange(plot_daily_count.shape[0])
ax.barh(y_pos, df_time.COUNT_CLOSED_STATION, align='center',
color='coral', ecolor='black')
# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(df_time.DATE.apply(lambda x: datetime.strftime(x, '%m-%d-%y')))
ax.set_ylim((-1, df_time.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Count of closed station')
ax.set_ylabel('Date')
ax.set_title('Daily count of closed station for Q1 2013')
# add value label for each bar
for i in ax.patches:
ax.text(i.get_width()+.1, i.get_y()+.38,'{}'.format(i.get_width()) , fontsize=10,color='black')
# select data for Q1 2013
df_notfullyoper_station_2013q1 = res_not_fully_operate_station_2013[res_not_fully_operate_station_2013.DATE.apply(
lambda x:pd.to_datetime(x)) < '2013-04-01']
# convert DATE variable to date format
df_notfullyoper_station_2013q1['DATE'] = df_notfullyoper_station_2013q1['DATE'].map(
lambda x: pd.to_datetime(x))
df_notfullyoper_station_2013q1.head()
# fill in missing date to the data
df_time_notfullyoper = pd.merge(ts, df_notfullyoper_station_2013q1[['DATE','CNT_NOT_FULLY_OPERATE_STATION']], how='left', on=['DATE']).fillna(0)
# plot the daily number of closed station
fig, ax = plt.subplots(figsize=(10, 20))
y_pos = np.arange(df_time_notfullyoper.shape[0])
ax.barh(y_pos, df_time_notfullyoper.CNT_NOT_FULLY_OPERATE_STATION, align='center',
color='coral', ecolor='black')
# add some text for labels, title and axes ticks
ax.set_yticks(y_pos)
ax.set_yticklabels(df_time_notfullyoper.DATE.apply(lambda x: datetime.strftime(x, '%m-%d-%y')))
ax.set_ylim((-1, df_time.shape[0]))
ax.invert_yaxis()
ax.set_xlabel('Count of not fully operated station')
ax.set_ylabel('Date')
ax.set_title('Daily count of not fully operated station for Q1 2013')
# add value label for each bar
for i in ax.patches:
ax.text(i.get_width()+.1, i.get_y()+.38,'{}'.format(i.get_width()) , fontsize=10,color='black')