November 9, 2017

Hacking homelessness & PDF prisons

Shout-out to the Monday Morning Data Science mailout from Johns Hopkins. Making my way through this week’s (#41) edition, I hit upon quite the eulogy concerning a new R package:

“The most compelling data visualization libraries I’ve seen in awhile…you have to see it to believe it”

You had me at compelling (I’m not sure who had me exactly, this was the only entry w/o a name attached to it 🤔).

Hex Democratization

Enter geogrid and a solution to the perennial information communication problem of plotting polygons of different sizes, that’s equal parts functional and beautiful. In the link above the author, Joseph Bailey, eloquently explains the motivations behind this package for automatic transformations of geospatial polygons into regular and hexagonal grids. In short, a grid system ensures a fairer representation of spatial data (while retaining geography) and an automated implementation means this is suitable to use in less commonly studied places. On both counts, this is a wicked example of building tools that democratize good data science. ✋

Next, I just needed a good reason to wheel it out (aside from sharing with y’all).1

Full Metal Data

Combined Homelessness and Information Network (CHAIN) is the foremost project collecting data about homelessness in London. Direct access to the database is strictly limited to those working directly with rough sleepers. However, CHAIN reports are published on the Greater London Authority’s (GLA) London Datastore. The full annual report does contain data on the spatial distribution of rough sleepers, but it’s only available in PDF format. 💀

This xkcd cartoon is in good humour, but there’s a lot of truth in this - PDFs have risen to represent a kind of absolute truth, embodied by their relative impenetrability from outside influence (this includes pesky data scientists). This all-but-closed system becomes a big problem when we want to extract the ‘open’ data holed up inside. (Enter next R package to save the day, stage left).

Introducing tabulizer AKA Governments, Lock Up Your PDFs

Actual footage of tabulizer, an rOpenSci project dreamed up by Thomas Leeper, encountering impervious PDF #478:

As the rOpenSci release explains, tabulizer gives us R mortals the power of the tabula-java library for extracting tables from PDF files (which, in turn, powers Tabula, which you might have come across). You should dig into that release blog (and associated GitHub repo) for the juicy inner workings, but I’ll show it off very shortly.

I should say that I hit some snags trying to get this package fully operational (again, detailed in the package release), owing to the dependency on Java. I will just say that the help contained within the below links was able to get me on my way (as a Mac user) AND it was worth it:

Where the Magic Happens

With new friends in tow, it’s about to go down on this PDF. Here’s an idea of what we’re working with (Eeyore’s sad at PDFs courtesy of magick, if you were wondering):

To get the data in this table out of its prison, we use extract_tables():


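# packages used in the code shown in this post
library(tabulizer)
library(dplyr)
library(stringr)
library(geogrid)
library(sf)
library(ggplot2)

homeless_spatial_tab <- extract_tables(file = "", pages = 19)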

We simply fed this two arguments - the file path of the report and the page number with our table of interest - and it returned a list object of length one, containing a character matrix. The resultant object needs some light wrangling to get it just right:

# specify table list item
homeless_spatial_tab <- homeless_spatial_tab[[1]]

# remove unneeded rows and cols
homeless_spatial_tab <- homeless_spatial_tab[c(-2, -15, -37), -6:-7]

# get col names
col_names <- homeless_spatial_tab[1, ]

# create dataframe
homeless_spatial_tab <- data.frame(homeless_spatial_tab[-1, ])

# set colnames
colnames(homeless_spatial_tab) <- col_names

# set col types (via character first, so factor columns give their values, not level codes)
homeless_spatial_tab <- mutate_at(homeless_spatial_tab, c("2013/14", "2014/15", "2015/16", "2016/17"), as.character)
homeless_spatial_tab <- mutate_at(homeless_spatial_tab, c("2013/14", "2014/15", "2015/16", "2016/17"), as.numeric)

##          Borough 2013/14 2014/15 2015/16 2016/17
## 1    Westminster    2197    2570    2857    2767
## 2         Camden     501     563     641     702
## 3  Tower Hamlets     324     377     395     445
## 4         Newham     202     221     260     396
## 5 City of London     317     373     440     379
## 6        Lambeth     427     468     445     355
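A quick aside on why the column types get converted twice above. When data.frame() turns text columns into factors (the default in R before version 4.0), calling as.numeric() directly on a factor returns its internal level codes rather than the printed numbers. Here’s a minimal sketch with a made-up matrix; the values are invented for illustration and are not taken from the CHAIN report:

```r
# Toy stand-in for the character matrix that extract_tables() returns
# (invented values, not from the report)
toy <- matrix(c("Borough", "2013/14",
                "A",       "10",
                "B",       "200"),
              ncol = 2, byrow = TRUE)

# promote the first row to column names, as in the post;
# stringsAsFactors = TRUE reproduces the pre-4.0 default explicitly
toy_df <- data.frame(toy[-1, ], stringsAsFactors = TRUE)
colnames(toy_df) <- toy[1, ]

# straight to numeric: returns the factor's level codes
codes <- as.numeric(toy_df$`2013/14`)

# via character first: recovers the values as printed
values <- as.numeric(as.character(toy_df$`2013/14`))
```

Here codes holds the level positions while values holds the numbers we actually want, which is why the character step comes first.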

Eeyore is joyful at this sight, mark my words. How effortless was that? This data is ripe for mapping. Speaking of which…

Bring in the Maps

Back to geogrid stuff. For starters, let’s load in the shapefiles we need (note: the GitHub repo’s example is actually on London, too, so I won’t dwell on every similar detail):


# get spatial polygons
input_file <- system.file('extdata', 'london_LA.json', package='geogrid')
original_shapes <- read_polygons(input_file)

# get polygon details
original_details <- get_shape_details(original_shapes)

In the case of spatial joins, I’ve found sf (simple features) incredibly intuitive and so I’ll employ this tactic now to merge the distinct datasets together. I did notice some differences in the names given to certain London Boroughs (mainly the use of ‘&’/‘and’) which will need addressing as well.

# rename borough strings in the homeless data that don't match the geogrid shapefile
homeless_spatial_tab$Borough <- str_replace(homeless_spatial_tab$Borough,
                                            pattern = "Richmond",
                                            replacement = "Richmond upon Thames")
homeless_spatial_tab$Borough <- str_replace(homeless_spatial_tab$Borough,
                                            pattern = "&",
                                            replacement = "and")


# turn geogrid shapefile into sf format
original_shapes_sf <- st_as_sf(original_shapes)

# join homeless data
original_shapes_sf <- left_join(original_shapes_sf, homeless_spatial_tab, by=c("NAME"="Borough"))
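If you want to spot name mismatches like these before joining, a set difference does the job. The sketch below uses only base R and a handful of hypothetical borough strings standing in for the two sources (they are not read from the actual files), and it anchors the Richmond pattern so the fix can safely be run twice:

```r
# Hypothetical names standing in for the shapefile and the PDF table
shape_names <- c("Barking and Dagenham", "Richmond upon Thames", "Camden")
table_names <- c("Barking & Dagenham", "Richmond", "Camden")

# names in the table with no counterpart in the shapefile
unmatched_before <- setdiff(table_names, shape_names)

# the same two fixes as above, written with base R's sub()
fixed <- sub("&", "and", table_names, fixed = TRUE)
fixed <- sub("^Richmond$", "Richmond upon Thames", fixed)

# anything left over would end up as NA after a left join
unmatched_after <- setdiff(fixed, shape_names)
```

An empty unmatched_after means every row in the table will find its polygon.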

We’re about ready to put plot to paper. First, let’s take a look at a traditional, real space assignment of the area.

# get coords
coords <- original_shapes_sf %>%
  # find polygon centroids (sf points object)
  st_centroid() %>%
  # extract the coordinates of these points as a matrix
  st_coordinates()

# insert centroid long and lat fields as attributes of polygons
original_shapes_sf$long <- coords[,1]
original_shapes_sf$lat <- coords[,2]

# get percentage change from baseline
original_shapes_sf$pct_chg <- (original_shapes_sf$`2016/17` - original_shapes_sf$`2013/14`) / original_shapes_sf$`2013/14` * 100

# traditional map
ggplot(original_shapes_sf) +
  geom_sf(aes(fill = pct_chg)) +
  geom_text(aes(long, lat, label=str_sub(NAME, 1, 4)), 
            alpha = 0.75, size = 2.5, color = 'white') +
  coord_sf() +
  scale_fill_viridis_c()

The metric used here represents % change in 2016/17 rough sleepers compared to 2013/14. Barking is highest (14 up to 49), but the eye and mind may be more drawn to some of its bigger neighbours (e.g. Havering).
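To make the metric concrete, here is the arithmetic for two boroughs using figures quoted in this post. The helper name pct_change is my own, purely for illustration:

```r
# percentage change from a baseline value
pct_change <- function(new, old) (new - old) / old * 100

# Barking: 14 people seen in 2013/14, 49 in 2016/17
barking <- pct_change(new = 49, old = 14)

# Westminster: 2197 in 2013/14, 2767 in 2016/17
westminster <- pct_change(new = 2767, old = 2197)
```

Barking comes out at a 250% increase from a small base, while Westminster, with by far the largest counts, rises by roughly 26%. That is worth keeping in mind when reading the colour scale.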

Algorithmic tessellation is used to generate possible grid layouts for the data (as explained in the GitHub repo - I’ll just be demo-ing the hexes, but the same can be done with regular grids):

I’m quite partial to #3. Maneuvering from real space geography to this grid involves an implementation of the Hungarian algorithm. All you have to do is run calculate_grid() with the grid type and seed of choice, then pass the result to assign_polygons() along with the shapefile containing the original geography. From there, the steps to visualisation are identical to previous ones:

new_cells_hex <-  calculate_grid(original_shapes, 0.03, 'hexagonal', 3)
resulthex <- assign_polygons(original_shapes, new_cells_hex)

# turn geogrid shapefile into sf format
resulthex_sf <- st_as_sf(resulthex)

# join homeless data
resulthex_sf <- left_join(resulthex_sf, homeless_spatial_tab, by=c("NAME"="Borough"))

# get coords
coords <- resulthex_sf %>%
  # find polygon centroids (sf points object)
  st_centroid() %>%
  # extract the coordinates of these points as a matrix
  st_coordinates()

# insert centroid long and lat fields as attributes of polygons
resulthex_sf$long <- coords[,1]
resulthex_sf$lat <- coords[,2]

# get percentage change from baseline
resulthex_sf$pct_chg <- (resulthex_sf$`2016/17`- resulthex_sf$`2013/14`) / resulthex_sf$`2013/14`

# hex plot
ggplot(resulthex_sf) +
  geom_sf( aes(fill = pct_chg)) +
  geom_text(aes(long, lat, label=str_sub(NAME, 1, 4)), size = 2.5, color = 'white') +
  scale_fill_viridis_c(labels = scales::percent) +
  coord_sf() +
  labs(title="Where are homeless sightings becoming more frequent in London?",
       subtitle="People seen rough sleeping, % change 2013/14 to 2016/17 by borough",
       caption="@ewen_") +
  theme_void() +
  theme(
    text = element_text(size = 9),
    plot.title = element_text(size = 12, face = "bold"),
    plot.subtitle = element_text(size = 9),
    axis.ticks = element_blank(),
    legend.direction = "vertical",
    legend.position = "right",
    plot.margin = margin(1, 1, 1, 1, 'cm'),
    legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm")
  )
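For the curious, the assignment step mentioned earlier is an instance of the classic assignment problem: match each original polygon to exactly one grid cell so that the total distance moved is as small as possible. The toy sketch below is not geogrid’s implementation. It brute-forces three invented centroids in base R, assuming plain Euclidean distance as the cost, to show what is being minimised; the Hungarian algorithm finds the same optimum without trying every permutation.

```r
# Toy centroids of three original polygons and three grid cells
# (coordinates are invented for illustration)
centroids <- rbind(a = c(0, 0), b = c(2, 0), c = c(1, 2))
cells     <- rbind(c1 = c(1.1, 1.8), c2 = c(0.2, 0.1), c3 = c(1.9, 0.2))

# cost[i, j] = Euclidean distance from polygon i to grid cell j
cost <- as.matrix(dist(rbind(centroids, cells)))[1:3, 4:6]

# all 3! = 6 one-to-one assignments of polygons to cells
perms <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
               c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# total distance of each assignment; keep the cheapest
totals <- apply(perms, 1, function(p) sum(cost[cbind(1:3, p)]))
best <- perms[which.min(totals), ]

# named vector: polygon -> assigned cell
assignment <- setNames(colnames(cost)[best], rownames(cost))
```

With 33 boroughs there are far too many permutations to enumerate, which is where a polynomial-time method like the Hungarian algorithm earns its keep.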

From PDF prison to 🔥 hex map with ease. May this inspire you to solve your own #otherpeoplesdata horror stories.

On a More Serious Note

I also hope that the seriousness of the subject matter was not forgotten. The rise in homelessness sightings is really bad. A major reason why getting hold of this homelessness data was so difficult is that organisations like St Mungo’s suffer from chronic under-funding, and so machine-readable data is understandably not their top priority. They are amazing so please consider donating to them if you can, and others like them below:

  1. To keep the post concise I don’t show all of the code, especially code that generates figures. But you can find the full code here.