I don’t have much reason to dip into big data processing tools in my current working capacity. Luckily for me, I have a (self-indulgent) blog now - the perfect place to scratch these sorts of itches, and then share the resultant mishaps with the whole world.
I’d recently been working on a Shiny app for increasing visibility and engagement with police violence data. It just so happened that I stumbled across Stanford’s Open Policing Project, with some hefty open data ripe for exploring with big data tools, right after the app’s first build was finished.1 I figured introducing these things together should make for a smooth read.
Stanford’s Open Policing Project
Stanford is, slowly but surely, compiling and standardizing data on vehicle and pedestrian stops from law enforcement departments across the U.S.
This information is being made freely available - not just the data, but everything needed to reproduce the analysis (repo in here). The data files for d/l have been heavily cleaned (sounds like some clean-up job, too), but the strong-stomached can contact the project team for access to the raw stuff.
All the ingredients for a juicy pet project are in place, then, until I reach this sentence in the project overview:
“We’ve already gathered 130 million records from 31 state police agencies…”
For those of us rocking modest MacBooks est. 2k13, introducing our machines to this magnitude of data may trigger a ripple in space-time. Without any further ado, I’ll move on to my next trick.
sparklyr is RStudio’s interface to Apache Spark, a pre-eminent open-source cluster-computing framework. The pièce de résistance of sparklyr is its complete dplyr back-end, meaning there isn’t too much unfamiliar code to see here.
The what-does-Spark-do TL;DR: Spark speeds up data processing by splitting the work into pieces and running them in parallel across a cluster of machines (or, locally, across cores). A useful analogy - normally, you’re sending one person into a house to find something, whereas a distributed system sends someone into each room of the house and they communicate progress to each other.
What happens now? First off, the usual CRAN install/load one-two:
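It boils down to something like the following. This is a minimal sketch; I’m assuming dplyr gets loaded at this point too, since it’s used further down.

```r
# Install sparklyr from CRAN (only needed once per machine)
install.packages("sparklyr")

# Load sparklyr, plus dplyr for the data manipulation later on
library(sparklyr)
library(dplyr)
```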
Spark also needs to be installed locally. This is as simple as:
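A minimal sketch of that step, using sparklyr’s spark_install helper. I’m assuming the default Spark version is fine here.

```r
library(sparklyr)

# Download and install a local copy of Spark.
# With no arguments sparklyr picks a default version; a specific one
# can be requested through the `version` argument.
spark_install()
```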
Deploying Spark locally will be as far as this post goes. When you need it, you can find out more about cluster deployment here.
Now, there are all kinds of Spark configuration options at hand to change the behaviour of sparklyr and the cluster itself. I won’t go into extensive detail on this (Spark does, here) - for now, I’ll just initialize a spark_config object with only the amount of memory used for the driver process set.
config <- spark_config()
config[["sparklyr.shell.driver-memory"]] <- "2G"
The final step of the sparklyr setup is to establish a connection to the Spark instance:
sc <- spark_connect(master = "local", config = config)
sc is now acting as a remote dplyr data source to the Spark cluster. In the latest RStudio IDE, you can check things are looking normal, sparklyr-wise:
Nice, but, there’s not much going on over there at the moment. Enter the data.
Back to the Data
Stanford have made the police stops data available in a manageable way, as a series of data files (one for each state). I decided to take a look at Washington, which had all common fields available and relatively few data quality issues.2 There are two main methods for getting data up into a Spark instance.
- Read data into R as normal, and then use sparklyr’s copy_to function to copy the data over
- Read data directly into Spark DataFrames using spark_read_csv (or another of sparklyr’s ‘read’ functions)
I will be adopting the latter approach. The dataset in question is a not-to-be-sniffed-at 8+ million rows, so I don’t want to be doing much with it in-memory in R. So, here we go…
spark_read_csv(sc, name = "wa_stops", path = "../data/WA-clean.csv")
…and, ~five minutes later, it’s showtime. The data has been copied into the Spark cluster, and I promise there’s no more session prep work.3
“Hello dplyr, my old friend”
We can use all of your favourite dplyr moves to manipulate the data4, and these computations will take place over in the cluster.
Here’s an example of a dplyr transformation that returns a summary of the data grouped on several demographic fields.
# initiate spark data source
demog_stats <- tbl(sc, "wa_stops") %>%
  # filter 2011-2015 date range
  filter(year(stop_date) >= 2011, year(stop_date) <= 2015) %>%
  # make search/hits boolean fields and a month field
  mutate(search = if_else(search_conducted == "TRUE", 1, 0),
         hits = if_else(contraband_found == "TRUE", 1, 0),
         month = month(stop_date)) %>%
  # group data by desired fields
  group_by(driver_race, driver_gender, driver_age, county_name, month) %>%
  # summary stats for stops, searches and hits
  summarise(n_stops = n(),
            n_searches = sum(search),
            n_hits = sum(hits)) %>%
  # remove grouping
  ungroup() %>%
  # drag data from spark to R
  collect()
Most of this will be familiar to the dplyr-literate. The function you might not be familiar with is collect, which gets sparklyr to drop the query result into our R environment. It’s generally advisable not to chain too many steps together with the pipe (%>%) at once, so that the code stays easy to debug. You can use compute to store a query result in Spark as a temporary table, and then build on it in a subsequent query.
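As a rough sketch of that pattern, the query above could be split in two, with compute storing the filtered data as a temporary Spark table first. This assumes the sc connection and wa_stops table set up earlier are still around; the table name wa_stops_2011_2015 is made up for illustration.

```r
library(sparklyr)
library(dplyr)

# Step 1: filter in Spark and store the result as a temporary table
# ("wa_stops_2011_2015" is a hypothetical name)
stops_2011_2015 <- tbl(sc, "wa_stops") %>%
  filter(year(stop_date) >= 2011, year(stop_date) <= 2015) %>%
  compute(name = "wa_stops_2011_2015")

# Step 2: run a follow-up query against the stored table, then
# bring the (small) summary back into R
stops_by_race <- stops_2011_2015 %>%
  group_by(driver_race) %>%
  summarise(n_stops = n()) %>%
  collect()
```

Each step can then be inspected on its own before moving on to the next.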
Once necessary data transformations have been done in Spark and collected in R’s environment, you can close the connection down, and it’s back to R for the analysis work.
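Closing the connection is a one-liner with sparklyr’s spark_disconnect, assuming sc is the connection object created earlier:

```r
library(sparklyr)

# Close the connection opened earlier with spark_connect()
spark_disconnect(sc)
```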
Monitoring Washington’s Police Stops
First, a look at the trends of police stops, searches and ‘hits’ (searches finding contraband) over time. For comparative purposes, I’ve used January 2011 as a baseline figure and measured percentage change from this baseline for each of these metrics.
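The baseline calculation looks roughly like this. It’s a sketch only: the monthly totals below are made-up numbers for illustration, and the column names are my own choice.

```r
library(dplyr)

# Made-up monthly totals, purely for illustration
monthly <- tibble(
  month_start = as.Date(c("2011-01-01", "2011-02-01", "2011-03-01")),
  n_stops     = c(1000, 1100, 950),
  n_searches  = c(50, 45, 40),
  n_hits      = c(20, 15, 10)
)

# Percentage change relative to the first (baseline) value
pct_change <- function(x) 100 * (x - x[1]) / x[1]

baseline_change <- monthly %>%
  # make sure the baseline month comes first
  arrange(month_start) %>%
  mutate(
    stops_pct    = pct_change(n_stops),
    searches_pct = pct_change(n_searches),
    hits_pct     = pct_change(n_hits)
  )
```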
While stops have remained pretty stable over time, there’s been a drop-off in searches and (an even bigger one) in hits. That is, apart from the immediate period following legalisation of marijuana possession for adults 21 and over at the end of February 2015 in D.C. - I’m hypothesizing that the police were extra vigilant during this period to make sure such a controversial law change was being followed to the letter.
How has the police force itself fared over time? With the officer ID field, a proxy of the number of active officers on patrol can be established - I’ve gone with the number of unique officer IDs making at least one stop in a month.
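In dplyr terms, that proxy is a distinct count per month. A minimal sketch on toy data, assuming the officer ID column is called officer_id and that a stop_month field has already been derived (both names are assumptions on my part):

```r
library(dplyr)

# Toy stop-level data, one row per stop (values are made up)
stops <- tibble(
  stop_month = c("2011-01", "2011-01", "2011-01", "2011-02", "2011-02"),
  officer_id = c("A1", "A1", "B2", "A1", "A1")
)

# Number of unique officers making at least one stop in each month
active_officers <- stops %>%
  group_by(stop_month) %>%
  summarise(n_officers = n_distinct(officer_id)) %>%
  ungroup()
```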
The force seems to be getting smaller. Notice that there’s also a clear seasonal component to the number of officers on patrol - maybe officers don’t really want to be out for too long in those winter months.
There are fields in the data that can be used to get at racial disparities and possible bias in police behaviour. For example, the search/hits metrics from earlier can be taken a bit further and considered as rates based on stops and searches respectively. This study of outcomes may indicate discrimination.
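To make the two rates concrete, here is a small sketch. The groups and counts are invented for illustration and are not taken from the Washington data.

```r
library(dplyr)

# Invented totals for two hypothetical groups of drivers
group_totals <- tibble(
  driver_group = c("Group A", "Group B"),
  n_stops      = c(1000, 500),
  n_searches   = c(20, 25),
  n_hits       = c(8, 5)
)

rates <- group_totals %>%
  mutate(
    # share of stops that led to a search
    search_rate = n_searches / n_stops,
    # share of searches that found contraband
    hit_rate    = n_hits / n_searches
  )
```

In this toy example Group B is searched more often (5% of stops versus 2%) but with a lower hit rate (20% versus 40%), which is the kind of pattern to look out for.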
While Black/Hispanic people are searched more often than Whites when stopped, the hit rate (% of searches with contraband found) is lower for these minority groups, which points to the discrimination they face in this area.
Is this phenomenon seen across county forces? Using the geographic data fields, it’s possible to identify regional trends.
The rate of searches is consistently higher amongst Black and Hispanic people, compared to Whites. The fact that there are a number of instances where the ‘hit rate’ is close to zero amongst searches of minority groups suggests that the threshold for searches, or standard of evidence needed to initiate a search, is lower for these groups than for Whites.
Quick word of warning - hit rates can be misleading. While they’re a good indicator of possible discrimination, they’re not quite enough to infer racial bias. For example, suppose there are just two types of white drivers, with either a 5% or 75% likelihood of carrying contraband. Suppose there are also just two types of black drivers: some black drivers have a 5% chance of carrying contraband, and the others have a 50% chance of carrying contraband. If a fair police officer only searches drivers with at least a 10% chance of carrying something illegal, the white hit rate would be 75% and the black hit rate would be 50%. The officer used the same standard to search each driver, and so did not discriminate, even though the hit rates differ.
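The arithmetic in that example can be checked in a few lines of base R. This is just the hypothetical scenario above written out, not real data:

```r
# The four hypothetical driver types from the example above
drivers <- data.frame(
  race         = c("white", "white", "black", "black"),
  p_contraband = c(0.05, 0.75, 0.05, 0.50)
)

# The fair officer applies one threshold to everybody
threshold <- 0.10
searched <- drivers[drivers$p_contraband >= threshold, ]

# Expected hit rate per group = mean contraband probability among
# the drivers who actually get searched
hit_rate <- tapply(searched$p_contraband, searched$race, mean)
```

Only the 75% and 50% driver types clear the threshold, so the expected hit rates come out at 0.75 for white drivers and 0.50 for black drivers, despite the identical search standard.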
This was a taste of how Spark can be utilized to power large-scale analyses of police behaviour and profiling - I hope to revisit the data as more states join in (and perhaps do a comparison with our forces across the pond, someday).
Introducing polMonitor (and Mapping Police Violence)
polMonitor is a related pet project that has seen the light of day thanks to a herculean effort by Mapping Police Violence to compile data about police killings from several disparate sources (namely, FatalEncounters.org, the U.S. Police Shootings Database and KilledbyPolice.net).
With all of this work going into data collection, I was inspired to do whatever I could to make the data accessible, get more people engaging with it, and push for more accountable policing. To that end, I developed a Shiny app (now hosted over at shinyapps), an interactive space for folks to immerse themselves in this data. Go see for yourself, and check the repo here.
I’m not going to delve into the ins and outs of Shiny development, but I wanted to quickly mention a couple of helping hands I found with this one.
- The hidden function, which lets you hide bits of the app from the user when they’re not particularly useful.
- The GIS dream-team of sf and tidycensus: Boy did this package combo come through for me when it was a choropleth map situation.
I set out to learn a bit more about Spark and Shiny, and put a spotlight on critical policing issues in the process. The trickle down of new technologies can be slow to reach these kinds of spaces, but it’s as important as ever that there continues to be a pragmatic effort to expose discrimination and biases in systems and encourage open, accountable policing through good data science. Don’t hesitate to get stuck in.
I hope to find time to incorporate Stanford’s police stops data into my app. At the time of writing, there are still a lot of states missing from their data.↩
On the d/l page, a table explains which fields are present in each data file. The GitHub repo’s readme has detailed data quality information about each state, which you should review before getting stuck in.↩
I found that the extended tidyverse family of packages could be used with mixed results. For example, lubridate’s year function worked OK, but not wday. Therefore, some transformations may still need to be done in R. Keep up with updates over at the sparklyr GitHub repo.↩