Fatal Crashes in Austin, Texas
Introduction & Background
A major interstate cuts through the middle of downtown Austin, Texas. IH-35 holds potent historical significance for the City and sits at the center of a current debate about its future. Its impacts on the local community are substantial and multi-faceted, so any attempt to measure the highway's effect on the surrounding population implies a selection among available topics. In this blog, the dimension I examine is fatal vehicle collisions. This metric focuses attention because it's more relatable than a diffuse impact like pollution or economic development. After peaking during the pandemic, traffic deaths have declined only marginally; the Department of Transportation estimates that 9,330 people died nationwide as a result of vehicle collisions in the first quarter of 2023. I intend to analyze other dimensions of community impact and perform some statistical work, but first I will simply count events in a defined area.
The nature of transportation data means that I get the opportunity to refresh some GIS skills. Even after making many maps in past projects, I gained new knowledge about troubleshooting GIS irregularities and got to practice design techniques. It's important to keep in mind that transportation data is inherently behavioral information. Counting vehicle collisions measures stationary incidents, but human travel happens over non-uniform spaces at different speeds and times. This idea will become more important later when building a statistical model, but for now we can just "flatten the world" and take a raw count. I've started initial research into distance-decay models and how to build one in R for myself. The deliverable for this current project phase is the interactive map below; I'm still learning leaflet and iterating on the final design.
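As a rough preview of where that distance-decay research is pointed (a toy sketch of my own, not the eventual model), the core idea is just a weight that discounts an event's contribution the farther it sits from a reference feature:

# toy exponential distance-decay weight (illustrative only, not part of
# the current pipeline); lambda is a bandwidth in the same units as distance
decay_weight <- function(distance, lambda = 500) {
  exp(-distance / lambda)
}

# a crash 250 units away counts ~0.61 as much as one on the feature itself
decay_weight(c(0, 250, 500, 1000))
#> [1] 1.0000000 0.6065307 0.3678794 0.1353353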
Data Description
The Texas Department of Transportation (TxDOT) hosts a repository of vehicle collision reports in its Crash Records Information System (CRIS); however, this database is restricted to official use and requires a login to gain access. Thankfully, the City of Austin (CoA) publishes a subset of local crash records on its open data portal. Each crash record contains lat-lon coordinates among many other variables, like time of day and number of injuries. With just this dataset and a few lines of code from the sf package, we can generate a map of crashes by plotting the coordinates onto x-y chart axes.
Code
# manipulate data and transform to geometry
atx_crash_pts <- atx_crash_raw %>%
  # keep only death count greater than 0
  filter(death_cnt != "0") %>%
  # add geometry as points
  st_as_sf(coords = c("longitude", "latitude"),
           # conform CRS to county map
           # loaded earlier in hidden code chunk
           crs = st_crs(county_poly)) %>%
  # remove rows without geometry
  filter(!is.na(geometry))

# generate map
atx_crash_pts %>%
  ggplot() +
  geom_sf(color = "firebrick",
          size = 1,
          show.legend = FALSE) +
  scale_x_continuous(expand = c(0.5, 0)) +
  theme_minimal()
Even lacking context, we can observe a spatial relationship and a concentration of crashes in a meaningful pattern. Anyone familiar with Austin geography would instantly recognize the source of the vertical cluster as IH-35. Each crash report contains a variety of rich information, including a street name. The CoA data portal also contains a shapefile of local roadways stored as linestrings. Below are maps of all the local roadways and of IH-35 in isolation. We can generate a roadway map using the same technique, this time dropping lat-lon coordinates from the display. The easiest way I found to isolate IH-35 was to apply dplyr::filter() to the prefix_typ variable to extract the entries that contain "IH" (interstate highway).
Code
# all roadways
road_lines %>%
  ggplot() +
  geom_sf()

# just IH-35
road_lines %>%
  filter(prefix_typ == "IH") %>%
  ggplot() +
  geom_sf()
To estimate crash density around IH-35, I need to orient it in space by placing it within a boundary. If I compared crash density in and around the highway against the rest of the land mass of Texas, for instance, the scale difference between the two areas wouldn't allow any kind of meaningful interpretation. I decided to use Travis County as my boundary because it contains all of Austin and it's a familiar shape to local residents. I may consider a different shape in the future because a smaller version of the disproportionate-scale issue is reproduced within Travis County's large empty spaces; as a proof of concept, though, it works well enough. I originally sourced this boundary from the Travis County open data portal, then switched to a CoA source for reasons I'll explain below.
Code
# isolate Travis County boundary polygon
travis_boundary <- county_poly %>%
  dplyr::filter(county_nam == "TRAVIS") %>%
  select(county_nam, geometry)

# isolate IH-35 roadway line
IH_35 <- road_lines %>%
  filter(prefix_typ == "IH")

IH_35 %>%
  ggplot() +
  # geom_sf layers ordered to achieve visual layering
  geom_sf(data = travis_boundary) +
  geom_sf() +
  geom_sf(data = atx_crash_pts,
          color = "firebrick",
          size = 1,
          show.legend = FALSE)
The map above combines crash point data and a subset of the roadway lines by overlaying both on a polygon representing Travis County. Again, the trend is obvious enough visually, but a layered image doesn't give us a programmatic way of defining and measuring crash density on and around the highway. Apart from street name labels in the crash points data, we don't have a way to attribute crashes to a specific roadway.
Spatial Correlation
The empirical challenge at this stage concerns co-location. Orienting multiple objects in space allows us to infer something about their interaction by measuring their proximity. There are many different approaches to estimating spatial correlation, some of which I aim to explore in future stages. For now, I will keep it simple by abstracting the study area into a two-dimensional plane without considering urban geography or specific human behavior. Another aspect of “flattening the world” is restricting and parceling space by imposing a geometric overlay. In the case of the interactive map above, that means looking only at Travis County and converting the polygon representing its boundary into a hexagonal grid.
The Austin metro area is just as complex as any urban environment, and the road network profoundly affects how people move around. Vehicle trips may include multiple stops, long or short distances, or weird patterns, like when you remember that you left the oven on and turn back home; thankfully it was off, but you see a package you meant to take and add another stop at the post office. The road network causes people to move in ways that are more complicated than transitioning smoothly between hexagonal zones. So when looking at the interactive map, remember that it's flattening time and behavior for the sake of an artifice that bluntly measures spatial correlation.
Reducing complexity in spatial data does make co-location simpler, but it comes with tradeoffs. One upside is that it makes exploratory data analysis (EDA) much easier. With non-spatial data, EDA usually involves traditional charts like histograms and bar charts:
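As a hedged illustration (the column names here come from my assumptions about the raw CoA extract rather than from the chunks shown above), a fatal-crashes-per-year bar chart might look something like this:

# sketch of a non-spatial EDA chart: fatal crashes per year
# assumes the raw data carries a crash date column, here called crash_date
library(dplyr)
library(ggplot2)

atx_crash_raw %>%
  filter(crash_fatal_fl == "Y") %>%
  mutate(crash_year = format(as.Date(crash_date), "%Y")) %>%
  count(crash_year) %>%
  ggplot(aes(x = crash_year, y = n)) +
  geom_col(fill = "firebrick") +
  labs(x = NULL, y = "Fatal crashes") +
  theme_minimal()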
In the same way we can group and measure crashes per year, we can produce counts per sub-region within an area, provided the sub-regions are discretely separated and labeled. Travis County contains many sub-region groupings, such as Census tracts, City Council districts, and zip codes, that attach simple labels to parcels of land. However, I want to split the Travis County polygon into even geometric parcels because it helps me measure distance more effectively than uneven administrative boundaries. Uniformly shaped hexagons also give me a means to "pixelate" the study area into varying intensities. One other advantage of a hex-grid is that I can adjust the sub-regions' size by changing the number of hexes in the grid. This way, I can tightly control the area around IH-35 with a shape that is cleanly transected by the roadway line.
I thought I would have to design a custom method for sub-dividing a polygon, but thanks to the sf package this can be accomplished easily with st_make_grid().
Code
travis_grid <- travis_boundary %>%
  # 40 hexes evenly divides the polygon
  # with minimum thickness to contain IH-35
  st_make_grid(n = 40, square = FALSE, flat_topped = TRUE) %>%
  st_intersection(travis_boundary) %>%
  st_as_sf()

travis_grid %>%
  ggplot() +
  geom_sf()
My investigation into fatal crash data in Austin is at an exploratory stage, but it employs some tools that differ from typical EDA. In building an artifice to solve co-location, I made choices about the data environment. It's also a little deceiving because the project deliverable looks pretty: familiar shapes and cartographic conventions presented with a data overlay can feel persuasive while eliding key facts or choices about how the data was assembled. All of which is to say (breathes in…) I'm going out of my way to belabor the point that this is just one method of presenting raw counts of events in a static graphical format, not yet a statistical inference.
GIS Troubleshooting
EDA is useful both for getting a feel for your data through quick visualizations and for testing your assumptions. Since I retrieved my data from official sources, I didn't even consider overlapping geometries to be an assumption at first, but I suppose that is illustrative of the sneaky nature of assumptions. Usually, when two geometries don't merge properly, it's because their coordinate reference systems (CRS) are different. Misdirected by an assumption about CRSs, I got caught in a tailspin of troubleshooting in completely the wrong direction for nearly a week. If you've ever been lost in the fog of endlessly googling to refine the phrasing of a problem only to hit a succession of brick walls, you know that feeling of elation when you finally break through and everything works.
To cut a long story short, I figured out (after many deep dives into documentation and tests of every parameter I could think of) that the shapefile pulled from Travis County had the wrong bounding box. Or rather, the coordinates were stored in an unconventional manner. Lat-lon coordinates are usually stored as decimal degrees, but the Travis County data portal had them stored as large integers, in a different order, and without negative longitude values.
| | x-min | y-min | x-max | y-max |
|---|---|---|---|---|
| Travis County | 2977872 | 9982003 | 3230641 | 10200112 |
| City of Austin | -98.17306 | 30.02327 | -97.36971 | 30.62796 |
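In hindsight, the mismatch could have been spotted with a couple of one-liners. Here is a sketch of that check; travis_raw is a stand-in name for the shapefile originally pulled from the Travis County portal, which doesn't appear in the chunks above:

# compare what each layer claims about itself
st_crs(travis_raw)      # what CRS, if any, does the county portal file declare?
st_crs(county_poly)     # CRS of the CoA-sourced county map

st_bbox(travis_raw)     # xmin on the order of 2,977,872: clearly not decimal degrees
st_bbox(county_poly)    # xmin around -98.17: plausible longitude for Austin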
The Travis County boundary data and City of Austin data merged without an error, but they failed to overlay when displayed together. After a few days of tailspinning, I decided to stop troubleshooting through code and just look at the maps generated by each shapefile. When I loaded Travis County in leaflet and zoomed out, I realized that it was floating over the South Pacific. I decided at that point to see whether CoA had a shapefile for Travis County and to use that instead. To my immense relief, the polygons, lines, and points all lined up as demonstrated above. Along the way, I learned how to hard-code a bounding box, convert shapefiles to different geometry types, and, most importantly, gained the wisdom of discovering a blind spot. Despite the frustration, it wasn't a completely wasted week.
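For the record, the general shape of those fixes looks roughly like the sketch below. The object names and the EPSG code are illustrative guesses, not the exact code I ran:

# declare the CRS a layer should have had (metadata only, no reprojection)
travis_raw <- st_set_crs(travis_raw, 2277)  # e.g. a Texas State Plane code

# reproject into the CRS shared by the other layers
travis_fixed <- st_transform(travis_raw, st_crs(county_poly))

# hard-code a bounding box and crop a layer to it
aus_bbox <- st_bbox(c(xmin = -98.2, ymin = 30.0, xmax = -97.4, ymax = 30.7),
                    crs = st_crs(county_poly))
st_crop(travis_fixed, aus_bbox)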
Merging GIS Datasets
To determine which hexes contain the IH-35 roadway, I use st_intersects(), which performs as its name indicates. st_intersects() combined with if_else() gives me a way to construct a logical variable indicating IH-35's presence within a given hex.
Code
# identify which hexes contain IH-35
travis_grid <- travis_grid %>%
  mutate(
    # add ids to hexes
    hex_id = row_number(),
    # add IH_35 T/F label
    IH_35_hex = if_else(
      as.character(st_intersects(
        # call piped dataset within nested function
        .,
        # recall IH_35 extract
        IH_35)) != "integer(0)", TRUE, FALSE)
  )

travis_grid %>%
  ggplot() +
  geom_sf(aes(fill = IH_35_hex, alpha = 0.7)) +
  scale_fill_manual(values = c("#d9d9d9", "#ff1693")) +
  theme(legend.position = "none")
Although this hex classification isn't necessary to produce the end product for this stage, it points toward future analysis and hints at other applications opened up by the hex-grid. With the study area geometrically parceled, we can locate crash points within hexes, then group and count by hex.
Code
# spatial join to correlate pts to hex
pts_to_hex <- st_join(atx_crash_pts, travis_grid)

# drop geometry to perform calculations
st_geometry(pts_to_hex) <- NULL

# aggregate crashes by hex
f.crsh_per_hex <- pts_to_hex %>%
  mutate(crash_fatal_fl = if_else(
    crash_fatal_fl == "Y", 1, 0)
  ) %>%
  # sum crash and death counts by hex
  group_by(hex_id) %>%
  summarize(crash_cnt = sum(crash_fatal_fl),
            death_cnt = sum(death_cnt))

# rejoin to travis grid geometry
f.crsh_aggregate <- travis_grid %>%
  left_join(f.crsh_per_hex) %>%
  # replace NA values with zero
  mutate_if(is.numeric, ~replace(., is.na(.), 0))
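As a quick illustration of what the IH_35_hex flag from earlier makes possible (a sketch of my own, not part of the original pipeline), we could compare the aggregated counts inside and outside the corridor hexes:

# fatal crashes and deaths in hexes crossed by IH-35 vs. all other hexes
f.crsh_aggregate %>%
  st_drop_geometry() %>%
  group_by(IH_35_hex) %>%
  summarize(hexes = n(),
            fatal_crashes = sum(crash_cnt),
            deaths = sum(death_cnt))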
Finally, now that we have all our data assembled, we can generate a standard choropleth map to visually represent crash density.
Code
f.crsh_aggregate %>%
  ggplot() +
  geom_sf(aes(fill = crash_cnt),
          color = "black",
          alpha = 0.7) +
  # apply color scale, legend title
  scale_fill_viridis_c("Fatal Crashes",
                       option = "A") +
  # adjust legend settings
  theme(legend.position = "left",
        legend.key.size = unit(1.5, 'cm'),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14))
Interactive Map
Depending on the delivery medium, a static image may be the best option for rendering a map. Whenever possible, though, I like to generate interactive HTML maps because they get my audience engaging with the data directly. My preferred tool is leaflet because it permits a wide variety of aesthetic controls and integrates well with Shiny apps, which I'm planning to feature in the next development stage.
Code
# generate labels for popup
labels <- sprintf(as.character(f.crsh_aggregate$crash_cnt))

# specify magma palette
pal <- colorNumeric(palette = "magma",
                    domain = f.crsh_aggregate$crash_cnt)

# leaflet map
f.crsh_aggregate %>%
  leaflet(options = leafletOptions(zoomControl = FALSE)) %>%
  # cartographic tiles
  addProviderTiles(providers$CartoDB) %>%
  setView(lng = -97.78181,
          lat = 30.33422,
          zoom = 10) %>%
  # overlay crash data and calculations
  addPolygons(label = labels,
              stroke = FALSE,
              color = "grey",
              smoothFactor = .5,
              opacity = 1,
              fillOpacity = 0.3,
              # apply palette to crash count variable
              fillColor = ~pal(crash_cnt),
              highlightOptions =
                highlightOptions(weight = 2,
                                 fillOpacity = 0.5,
                                 color = "white",
                                 opacity = 1,
                                 bringToFront = TRUE)) %>%
  addLegend(pal = pal,
            values = ~crash_cnt,
            title = "Fatal Crashes",
            position = "topleft")
Thank you for reading and following along with the development process. The GitHub repo is a bit of a mess right now because this project involved a lot of trial and error to get working properly. I know that the interactive maps generated in the blog don't exactly match the results of the full-screen version that I generated in a separate script with the same code. I'll be rewriting most of this script in the next stages anyhow, so for now I'm happy with this as a methodological foundation and proof of concept.
In the future, I want to explore other variables in the CoA data and pull in traffic statistics from TxDOT and possibly other administrative datasets. My next major goal is to design a distance-decay model. I'm also thinking about how I might incorporate it into a machine learning model. This traffic data and the spending trend data from my previous blog are two distinct preprocessed datasets, each interesting from a machine learning perspective in different ways. So, basically, I have some reading to do.