Chapter 2 Data sources

This project explores recent data from the NYC Open Data Motor Vehicle Collisions dataset.

The dataset can also be accessed and queried through its Google BigQuery location.

## [1] 1612178      32

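The dataset consists of about 1.6 million reported incidents (1,612,178 rows). The table has 32 columns: roughly 29 features, or descriptors, for each data point, plus the DAY, MONTH, and YEAR date components listed at the end. The names of these columns and their data types are displayed below.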

##                               [,1]       
## ACCIDENT DATE                 "Date"     
## ACCIDENT TIME                 Character,2
## BOROUGH                       "character"
## ZIP CODE                      "numeric"  
## LATITUDE                      "numeric"  
## LONGITUDE                     "numeric"  
## LOCATION                      "character"
## ON STREET NAME                "character"
## CROSS STREET NAME             "character"
## OFF STREET NAME               "character"
## NUMBER OF PERSONS INJURED     "numeric"  
## NUMBER OF PERSONS KILLED      "numeric"  
## NUMBER OF PEDESTRIANS INJURED "numeric"  
## NUMBER OF PEDESTRIANS KILLED  "numeric"  
## NUMBER OF CYCLIST INJURED     "numeric"  
## NUMBER OF CYCLIST KILLED      "numeric"  
## NUMBER OF MOTORIST INJURED    "numeric"  
## NUMBER OF MOTORIST KILLED     "numeric"  
## CONTRIBUTING FACTOR VEHICLE 1 "character"
## CONTRIBUTING FACTOR VEHICLE 2 "character"
## CONTRIBUTING FACTOR VEHICLE 3 "character"
## CONTRIBUTING FACTOR VEHICLE 4 "character"
## CONTRIBUTING FACTOR VEHICLE 5 "character"
## COLLISION_ID                  "numeric"  
## VEHICLE TYPE CODE 1           "character"
## VEHICLE TYPE CODE 2           "character"
## VEHICLE TYPE CODE 3           "character"
## VEHICLE TYPE CODE 4           "character"
## VEHICLE TYPE CODE 5           "character"
## DAY                           "numeric"  
## MONTH                         "numeric"  
## YEAR                          "numeric"

The names are largely self-explanatory, but a few notes are worth highlighting:

- Depending on the number of vehicles involved in the crash, the columns “CONTRIBUTING FACTOR VEHICLE 2”, “CONTRIBUTING FACTOR VEHICLE 3”, “CONTRIBUTING FACTOR VEHICLE 4” and “CONTRIBUTING FACTOR VEHICLE 5” may be missing. Similarly, “VEHICLE TYPE CODE 2”, “VEHICLE TYPE CODE 3”, “VEHICLE TYPE CODE 4” and “VEHICLE TYPE CODE 5” could be missing for smaller crashes.
- The columns “CROSS STREET NAME”, “OFF STREET NAME”, “ON STREET NAME”, “ZIP CODE” and “LOCATION” are redundant as well as less accurate, and are therefore not used for location-based analysis. We use the latitudes and longitudes instead. Note that the dataset does not report the exact coordinates of the collision; it reports the coordinates of the nearest intersection.
- “NUMBER OF PERSONS INJURED/KILLED” is the aggregation of the columns “NUMBER OF CYCLIST INJURED/KILLED”, “NUMBER OF MOTORIST INJURED/KILLED” and “NUMBER OF PEDESTRIANS INJURED/KILLED”.
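The notes on missing vehicle columns and on the aggregated totals can be checked row by row. The following is a minimal Python sketch rather than the code used in this project: the helper names `totals_consistent` and `vehicles_recorded` are hypothetical, and the sample record contains made-up values keyed by the column names listed above.

```python
# Hypothetical sample record; the values are invented for illustration only.
record = {
    "NUMBER OF PERSONS INJURED": 3,
    "NUMBER OF PEDESTRIANS INJURED": 1,
    "NUMBER OF CYCLIST INJURED": 0,
    "NUMBER OF MOTORIST INJURED": 2,
    "VEHICLE TYPE CODE 1": "SEDAN",
    "VEHICLE TYPE CODE 2": "TAXI",
    "VEHICLE TYPE CODE 3": None,  # missing: only two vehicles recorded
    "VEHICLE TYPE CODE 4": None,
    "VEHICLE TYPE CODE 5": None,
}


def totals_consistent(row, outcome="INJURED"):
    """Return True if the PERSONS total equals the sum of its three components."""
    groups = ("PEDESTRIANS", "CYCLIST", "MOTORIST")
    parts = sum(row[f"NUMBER OF {group} {outcome}"] for group in groups)
    return row[f"NUMBER OF PERSONS {outcome}"] == parts


def vehicles_recorded(row):
    """Count the VEHICLE TYPE CODE 1..5 columns that are not missing or empty."""
    return sum(1 for i in range(1, 6) if row.get(f"VEHICLE TYPE CODE {i}"))


print(totals_consistent(record))
print(vehicles_recorded(record))
```

Applied to every row, a check like `totals_consistent` would flag any record where the stated aggregation does not hold, and `vehicles_recorded` gives a rough count of the vehicles involved without relying on the sparsely populated columns being present.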

We also use the Google Static Maps API for plotting the spatial data. Compared with simply overlaying a map image in the background, this gives us more control over the exploration.
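To illustrate what a Static Maps request looks like, the sketch below builds a request URL in Python using only the standard library. It is an assumption-laden illustration, not the plotting code used in this project: the function name `static_map_url`, the default zoom and size, the example coordinates, and the `YOUR_API_KEY` placeholder are all hypothetical. The `center`, `zoom`, `size`, `maptype` and `key` query parameters are those of the Google Maps Static API.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

STATIC_MAPS_ENDPOINT = "https://maps.googleapis.com/maps/api/staticmap"


def static_map_url(lat, lon, zoom=11, size=(640, 640),
                   maptype="roadmap", api_key="YOUR_API_KEY"):
    """Build a Static Maps request URL centred on (lat, lon).

    api_key is a placeholder; a real key is needed for the request to succeed.
    """
    params = {
        "center": f"{lat},{lon}",
        "zoom": zoom,
        "size": f"{size[0]}x{size[1]}",
        "maptype": maptype,
        "key": api_key,
    }
    return STATIC_MAPS_ENDPOINT + "?" + urlencode(params)


# Example coordinates (assumed): a point in lower Manhattan.
url = static_map_url(40.7128, -74.0060)
print(url)

# Decode the query string again to confirm what was requested.
print(parse_qs(urlsplit(url).query)["center"])
```

Fetching the base map as an image in this way leaves the choice of centre, zoom level and map type under the analyst's control, so the collision coordinates can be drawn on top of whichever extent is being explored.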