Fatal Crashes in Austin, Texas
Introduction & Background
A major interstate cuts through the middle of downtown Austin, Texas. IH-35 holds potent historical significance for the City and sits at the center of a current debate about its future. Its impacts on the local community are substantial and multi-faceted, so any attempt to measure the highway's effect on the surrounding population implies a selection among available topics. In this blog, the dimension I examine is fatal vehicle collisions. This metric focuses attention because it's more relatable than a diffuse impact like pollution or economic development. After peaking during the pandemic, traffic deaths have declined only marginally; the Department of Transportation estimates that 9,330 people died nationwide as a result of vehicle collisions in the first quarter of 2023. I intend to analyze other dimensions of community impact and perform some statistical work, but first I will simply count events in a defined area.
The nature of transportation data means that I get the opportunity to refresh some GIS skills. Even after making many maps in past projects, I gained new knowledge about troubleshooting GIS irregularities and got to practice design techniques. It's important to keep in mind that transportation data is inherently behavioral information. Counting vehicle collisions measures stationary incidents, but human travel happens over non-uniform spaces at different speeds and times. This idea will become more important later when building a statistical model, but for now we can just "flatten the world" and take a raw count. I've started initial research into distance-decay models and how to build one in R for myself. The deliverable for this current project phase is the interactive map below; I'm still learning leaflet and iterating on the final design.
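As a rough preview of where that distance-decay research is pointed (a toy sketch of my own, not the eventual model), the core idea is just a weight that discounts an event's contribution the farther it sits from a reference feature:

# toy exponential distance-decay weight (illustrative only, not part of
# the current pipeline); lambda is a bandwidth in the same units as distance
decay_weight <- function(distance, lambda = 500) {
  exp(-distance / lambda)
}

# a crash 250 units away counts ~0.61 as much as one on the feature itself
decay_weight(c(0, 250, 500, 1000))
#> [1] 1.0000000 0.6065307 0.3678794 0.1353353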
Data Description
The Texas Department of Transportation (TxDOT) hosts a repository of vehicle collision reports in its Crash Records Information System (CRIS); however, this database is restricted to official use and requires a login to gain access. Thankfully, the City of Austin (CoA) publishes a subset of local crash records on its open data portal. Each crash record contains lat-lon coordinates among many other variables, like time of day and number of injuries. With just this dataset and a few lines of code from the sf package, we can generate a map of crashes by plotting the coordinates onto x-y chart axes.
Code
# manipulate data and transform to geometry
atx_crash_pts <- atx_crash_raw %>%
  # keep only death count greater than 0
  filter(death_cnt != "0") %>%
  # add geometry as points
  st_as_sf(coords = c("longitude", "latitude"),
           # conform CRS to county map
           # loaded earlier in hidden code chunk
           crs = st_crs(county_poly)) %>%
  # remove rows without geometry
  filter(!is.na(geometry))

# generate map
atx_crash_pts %>%
  ggplot() +
  geom_sf(color = "firebrick",
          size = 1,
          show.legend = FALSE) +
  scale_x_continuous(expand = c(0.5, 0)) +
  theme_minimal()
Even lacking context, we can observe a spatial relationship and a concentration of crashes in a meaningful pattern. Anyone familiar with Austin geography would instantly recognize the source of the vertical cluster as IH-35. Each crash report contains a variety of rich information, including a street name. The CoA data portal also contains a shapefile of local roadways stored as linestrings. Below are maps of all the local roadways and of IH-35 in isolation. We can generate a roadway map using the same technique, this time dropping lat-lon coordinates from the display. The easiest way I found to isolate IH-35 was to apply dplyr::filter() to the prefix_typ variable to extract the entries that contain "IH" (interstate highway).
Code
# all roadways
road_lines %>%
  ggplot() +
  geom_sf()

# just IH-35
road_lines %>%
  filter(prefix_typ == "IH") %>%
  ggplot() +
  geom_sf()
To estimate crash density around IH-35, I need to orient it in space by placing it within a boundary. If I compared crash density in and around the highway against the rest of the land mass of Texas, for instance, the scale difference between the two areas wouldn't allow any kind of meaningful interpretation. I decided to use Travis County as my boundary because it contains all of Austin and it's a familiar shape to local residents. I may consider a different shape in the future because a smaller version of the disproportionate-scale issue is reproduced within Travis County's large empty spaces; as a proof of concept, though, it works well enough. I originally sourced this boundary from the Travis County open data portal, then switched to a CoA source for reasons I'll explain below.
Code
# isolate Travis County boundary polygon
travis_boundary <- county_poly %>%
  dplyr::filter(county_nam == "TRAVIS") %>%
  select(county_nam, geometry)

# isolate IH-35 roadway line
IH_35 <- road_lines %>%
  filter(prefix_typ == "IH")

IH_35 %>%
  ggplot() +
  # geom_sf layers ordered to achieve visual layering
  geom_sf(data = travis_boundary) +
  geom_sf() +
  geom_sf(data = atx_crash_pts,
          color = "firebrick",
          size = 1,
          show.legend = FALSE)
The map above combines crash point data and a subset of the roadway lines by overlaying both on a polygon representing Travis County. Again, the trend is obvious enough visually, but a layered image doesn't give us a programmatic way of defining and measuring crash density on and around the highway. Apart from street name labels in the crash points data, we don't have a way to attribute crashes to a specific roadway.
Spatial Correlation
The empirical challenge at this stage concerns co-location. Orienting multiple objects in space allows us to infer something about their interaction by measuring their proximity. There are many different approaches to estimating spatial correlation, some of which I aim to explore in future stages. For now, I will keep it simple by abstracting the study area into a two-dimensional plane without considering urban geography or specific human behavior. Another aspect of “flattening the world” is restricting and parceling space by imposing a geometric overlay. In the case of the interactive map above, that means looking only at Travis County and converting the polygon representing its boundary into a hexagonal grid.
The Austin metro area is just as complex as any urban environment, and the road network profoundly affects how people move around. Vehicle trips may include multiple stops, long or short distances, or weird patterns, like when you remember that you left the oven on and turn back home; thankfully it was off, but you see a package you meant to take and add another stop at the post office. The road network causes people to move in ways that are more complicated than transitioning smoothly between hexagonal zones. So when looking at the interactive map, remember that it's flattening time and behavior for the sake of an artifice that bluntly measures spatial correlation.
Reducing complexity in spatial data does make co-location simpler, but it comes with tradeoffs. One upside is that it makes exploratory data analysis (EDA) much easier. With non-spatial data, EDA usually involves traditional charts like histograms and bar charts:
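As a hedged illustration (the column names here come from my assumptions about the raw CoA extract rather than from the chunks shown above), a fatal-crashes-per-year bar chart might look something like this:

# sketch of a non-spatial EDA chart: fatal crashes per year
# assumes the raw data carries a crash date column, here called crash_date
library(dplyr)
library(ggplot2)

atx_crash_raw %>%
  filter(crash_fatal_fl == "Y") %>%
  mutate(crash_year = format(as.Date(crash_date), "%Y")) %>%
  count(crash_year) %>%
  ggplot(aes(x = crash_year, y = n)) +
  geom_col(fill = "firebrick") +
  labs(x = NULL, y = "Fatal crashes") +
  theme_minimal()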
In the same way we can group and measure crashes per year, we can produce counts per sub-region within an area, provided the sub-regions are discretely separated and labeled. Travis County contains many sub-region groupings, such as Census tracts, City Council districts, and zip codes, that attach simple labels to parcels of land. However, I want to split the Travis County polygon into even geometric parcels because it helps me measure distance more effectively than uneven administrative boundaries. Uniformly shaped hexagons also give me a means to "pixelate" the study area into varying intensities. One other advantage of a hex-grid is that I can adjust the sub-regions' size by changing the number of hexes in the grid. This way, I can tightly control the area around IH-35 with a shape that is cleanly transected by the roadway line.
I thought I would have to design a custom method for sub-dividing a polygon, but thanks to the sf package this can be accomplished easily with st_make_grid().
Code
travis_grid <- travis_boundary %>%
  # 40 hexes evenly divides the polygon
  # with minimum thickness to contain IH-35
  st_make_grid(n = 40, square = FALSE, flat_topped = TRUE) %>%
  st_intersection(travis_boundary) %>%
  st_as_sf()

travis_grid %>%
  ggplot() +
  geom_sf()
My investigation into fatal crash data in Austin is at an exploratory stage, but it employs some tools that differ from typical EDA. In building an artifice to solve co-location, I made choices about the data environment. It's also a little deceiving because the project deliverable looks pretty: familiar shapes and cartographic conventions presented with a data overlay can feel persuasive while eliding key facts or choices about how the data was assembled. All of which is to say (breathes in…) I'm going out of my way to belabor the point that this is just one method of presenting raw counts of events in a static graphical format, not yet a statistical inference.
GIS Troubleshooting
EDA is useful both for getting a feel for your data through quick visualizations and for testing your assumptions. Since I retrieved my data from official sources, I didn't even consider overlapping geometries to be an assumption at first, but I suppose that is illustrative of the sneaky nature of assumptions. Usually, when two geometries don't merge properly, it's because their coordinate reference systems (CRS) are different. Misdirected by an assumption about CRSs, I got caught in a tailspin of troubleshooting in completely the wrong direction for nearly a week. If you've ever been lost in the fog of endlessly googling to refine the phrasing of a problem only to hit a succession of brick walls, you know that feeling of elation when you finally break through and everything works.
To cut a long story short, I figured out (after many deep dives into documentation and tests of every parameter I could think of) that the shapefile pulled from Travis County had the wrong bounding box. Or rather, the coordinates were stored in an unconventional manner. Lat-lon coordinates are usually stored as decimal degrees, but the Travis County data portal had them stored as large integers, in a different order, and without negative longitude values.
| | x-min | y-min | x-max | y-max |
|---|---|---|---|---|
| Travis County | 2977872 | 9982003 | 3230641 | 10200112 |
| City of Austin | -98.17306 | 30.02327 | -97.36971 | 30.62796 |
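In hindsight, the mismatch could have been spotted with a couple of one-liners. Here is a sketch of that check; travis_raw is a stand-in name for the shapefile originally pulled from the Travis County portal, which doesn't appear in the chunks above:

# compare what each layer claims about itself
st_crs(travis_raw)      # what CRS, if any, does the county portal file declare?
st_crs(county_poly)     # CRS of the CoA-sourced county map

st_bbox(travis_raw)     # xmin on the order of 2,977,872: clearly not decimal degrees
st_bbox(county_poly)    # xmin around -98.17: plausible longitude for Austin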
The Travis County boundary data and City of Austin data merged without an error, but they failed to overlay when displayed together. After a few days of tailspinning, I decided to stop troubleshooting through code and just look at the maps generated by each shapefile. When I loaded Travis County in leaflet and zoomed out, I realized that it was floating over the South Pacific. I decided at that point to see whether CoA had a shapefile for Travis County and to use that instead. To my immense relief, the polygons, lines, and points all lined up as demonstrated above. Along the way, I learned how to hard-code a bounding box, convert shapefiles to different geometry types, and, most importantly, gained the wisdom of discovering a blind spot. Despite the frustration, it wasn't a completely wasted week.
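For the record, the general shape of those fixes looks roughly like the sketch below. The object names and the EPSG code are illustrative guesses, not the exact code I ran:

# declare the CRS a layer should have had (metadata only, no reprojection)
travis_raw <- st_set_crs(travis_raw, 2277)  # e.g. a Texas State Plane code

# reproject into the CRS shared by the other layers
travis_fixed <- st_transform(travis_raw, st_crs(county_poly))

# hard-code a bounding box and crop a layer to it
aus_bbox <- st_bbox(c(xmin = -98.2, ymin = 30.0, xmax = -97.4, ymax = 30.7),
                    crs = st_crs(county_poly))
st_crop(travis_fixed, aus_bbox)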
Merging GIS Datasets
To determine which hexes contain the IH-35 roadway, I use st_intersects(), which performs as its name indicates. st_intersects() combined with if_else() gives me a way to construct a logical variable indicating IH-35's presence within a given hex.
Code
# identify which hexes contain IH-35
travis_grid <- travis_grid %>%
  mutate(
    # add ids to hexes
    hex_id = row_number(),
    # add IH_35 T/F label
    IH_35_hex = if_else(
      as.character(st_intersects(
        # call piped dataset within nested function
        .,
        # recall IH_35 extract
        IH_35)) != "integer(0)", TRUE, FALSE)
  )

travis_grid %>%
  ggplot() +
  geom_sf(aes(fill = IH_35_hex, alpha = 0.7)) +
  scale_fill_manual(values = c("#d9d9d9", "#ff1693")) +
  theme(legend.position = "none")
Although this hex classification isn't necessary to produce the end product for this stage, it points toward future analysis and hints at other applications opened up by the hex-grid. With the study area geometrically parceled, we can locate crash points within hexes, then group and count by hex.
Code
# spatial join to correlate pts to hex
pts_to_hex <- st_join(atx_crash_pts, travis_grid)

# drop geometry to perform calculations
st_geometry(pts_to_hex) <- NULL

# aggregate crashes by hex
f.crsh_per_hex <- pts_to_hex %>%
  mutate(crash_fatal_fl = if_else(
    crash_fatal_fl == "Y", 1, 0)
  ) %>%
  # sum crash and death counts by hex
  group_by(hex_id) %>%
  summarize(crash_cnt = sum(crash_fatal_fl),
            death_cnt = sum(death_cnt))

# rejoin to travis grid geometry
f.crsh_aggregate <- travis_grid %>%
  left_join(f.crsh_per_hex) %>%
  # replace NA values with zero
  mutate_if(is.numeric, ~replace(., is.na(.), 0))
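As a quick illustration of what the IH_35_hex flag from earlier makes possible (a sketch of my own, not part of the original pipeline), we could compare the aggregated counts inside and outside the corridor hexes:

# fatal crashes and deaths in hexes crossed by IH-35 vs. all other hexes
f.crsh_aggregate %>%
  st_drop_geometry() %>%
  group_by(IH_35_hex) %>%
  summarize(hexes = n(),
            fatal_crashes = sum(crash_cnt),
            deaths = sum(death_cnt))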
Finally, now that we have all our data assembled, we can generate a standard choropleth map to visually represent crash density.
Code
f.crsh_aggregate %>%
  ggplot() +
  geom_sf(aes(fill = crash_cnt),
          color = "black",
          alpha = 0.7) +
  # apply color scale, legend title
  scale_fill_viridis_c("Fatal Crashes",
                       option = "A") +
  # adjust legend settings
  theme(legend.position = "left",
        legend.key.size = unit(1.5, 'cm'),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 14))
Interactive Map
Depending on the delivery medium, a static image may be the best option for rendering a map. Whenever possible, though, I like to generate interactive HTML maps because they get my audience engaging with the data directly. My preferred tool is leaflet because it permits a wide variety of aesthetic controls and integrates well with Shiny apps, which I'm planning to feature in the next development stage.
Code
# generate labels for popup
labels <- sprintf(as.character(f.crsh_aggregate$crash_cnt))

# specify magma palette
pal <- colorNumeric(palette = "magma",
                    domain = f.crsh_aggregate$crash_cnt)

# leaflet map
f.crsh_aggregate %>%
  leaflet(options = leafletOptions(zoomControl = FALSE)) %>%
  # cartographic tiles
  addProviderTiles(providers$CartoDB) %>%
  setView(lng = -97.78181,
          lat = 30.33422,
          zoom = 10) %>%
  # overlay crash data and calculations
  addPolygons(label = labels,
              stroke = FALSE,
              color = "grey",
              smoothFactor = .5,
              opacity = 1,
              fillOpacity = 0.3,
              # apply palette to crash count variable
              fillColor = ~pal(crash_cnt),
              highlightOptions =
                highlightOptions(weight = 2,
                                 fillOpacity = 0.5,
                                 color = "white",
                                 opacity = 1,
                                 bringToFront = TRUE)) %>%
  addLegend(pal = pal,
            values = ~crash_cnt,
            title = "Fatal Crashes",
            position = "topleft")
Thank you for reading and following along with the development process. The GitHub repo is a bit of a mess right now because this project involved a lot of trial and error to get working properly. I know that the interactive maps generated in the blog don't exactly match the results of the full-screen version that I generated in a separate script with the same code. I'll be rewriting most of this script in the next stages anyhow, so for now I'm happy with this as a methodological foundation and proof of concept.
In the future, I want to explore other variables in the CoA data and pull in traffic statistics from TxDOT and possibly other administrative datasets. My next major goal is to design a distance-decay model. I'm also thinking about how I might incorporate it into a machine learning model. This traffic data and the spending trend data from my previous blog are two distinct preprocessed datasets, each interesting from a machine learning perspective in different ways. So, basically, I have some reading to do.