Tuesday, 19 May 2015

Retail Outlets on OpenStreetMap: Cartograms, and Patchwork Quilts

I enjoyed the process of creating a cartogram from OpenStreetMap data a couple of years back, even if it was somewhat tedious. However two things stopped me from taking it further: the QGIS plugin I was using does not work with later editions, and I really wanted something a little more refined.

Pub Cartogram
Cartogram of Local Authority areas in Great Britain based on numbers of pubs on OpenStreetMap
Created using ScapeToad, this is a simple, and naive, cartogram.

Cartograms (again)

Nearly a year ago I attended a London Geomob session at Google Campus. I particularly wanted to hear what Benjamin Hennig had to say about cartograms. In the end the crush in the pub was so big I never got to talk to him, instead I managed to have a long and interesting conversation with another one of the speakers, Jason Davies (known to some for his work on d3js). Much of our discussion was about visualisation techniques (of which more below), but I did manage to get some advice about producing cartograms.

Cartogram using Scapetoad example Data
Swiss cantons, cartogram by population
(this is the demo data set).
In particular Jason recommended ScapeToad, which I downloaded more or less immediately. For a variety of reasons I did not actually try it out until earlier this year.

As an aside, I should mention that the first time I came across the idea that simple distortions of a grid could do interesting things was in the work of the Scottish biologist D'Arcy Thompson. He wrote a magnificent book On Growth and Form, published around 1916. This in turn both inspired scientists & mathematicians, such as Alan Turing, Lewis Wolpert and Brian Goodwin, to investigate pattern formation and morphogenesis long before the actual molecular mechanisms were known. In turn Thompson was inspired by renaissance drawings by Dürer (see below: Thompson's drawings are still in copyright).

Durer face transforms
Durer's sketches of the relative proportions of the human head.

Global Retail POIs from OSM

In the meantime I have, from time-to-time, been trying to create a worldwide file of retail POIs from OpenStreetMap. The idea is that the data in such a file should both be enriched and standardised. For the moment I am concerned with very simple enrichments: adding country codes, and grouping retail outlets into a small(ish) number of categories.

The basic idea is simple: extract a small number of amenity tag values and all shop tag values from OSM planet files, and then using a point-in-polygon method assign each element to a given country. In practice I have used Geofabrik's continent-wide extracts rather than the whole planet, and all elements have been converted to centroids using osmconvert. Two problems arose in this approach: a) using osmfilter gave inconsistent results; and b) point-in-polygon requires detailed, not generalised country polygons.

I still don't quite know why a given filter spec with osmfilter sometimes works as expected and other times does not, but I have been able to get consistent results using simple filters and using them in sequence.

When I started I deliberately did not use OSM country polygons: mainly because I thought there was a considerable risk that they would be broken. Instead I started with Natural Earth 1:10 million scale shape files. However, these proved to be far too generalised to work: particularly as so many densely populated places lie near a coastline.

My first approach was to buffer these polygons, but it's more complicated to eliminate overlaps. I then buffered NE coastline shape files and combined them with the country files. This fixed coastlines, but then numerous other problems arose where populated places in two countries are close together (e.g., Rhine Valley, Vatican City, etc). At this point I put the work to one side. It should be noted that this experience shows that Natural Earth polygons are not suitable for analytical rather than cartographic applications. I would hope that in the future they might consider maintaining an additional set of vector data for this purpose

Very recently I was made aware of the great work of a group of German mappers who now maintain boundary polygons on OSM. This means that it is now entirely possible to work using the more detailed polygons from OSM for point-in-polygon queries.I was able to extract country boundaries from the Geofabrik continent files, although one or two larger boundaries had to be extracted from OSM directly (presumably because they weren't complete in a single extract file).

Next I gridded the country boundaries so as to create smaller areas for resolving the st_within query. I used a grid of 7.5 degrees square, which works for me. Even with the smaller areas PostGIS queries were taking a long time to run. Rather than simplifying the boundaries I resorted to a cruder technique.

Instead of assigning POIs to a country within the grid, I just assigned them to a latlon grid square. This latter because it is a regular bounding box is a very efficient computation. I then created a cartesian product between the assigned grid square and all country polygons within the grid square. In most cases only a few countries fall within a given latlon square, so the number of comparisons is reasonable. Another way to speed the queries up was to have both the point and candidate polygon geometries in the same row. In other words I used lots of disk space, transient tables, and in-row comparisons to avoid heavy processing of geometry columns. For 3 million POIs this runs in a few minutes, which is fine for my purposes. The end result is a table of the POIs assigned to country.

Cartograms of Retail POIs

This can be used for many different kinds of visualisation. The first kind I tried was, of course, a cartogram of pubs. For this I could use Natural Earth (NE) boundaries, I just created shape files with a count of pubs by country linked to the NE data

Cartogram of Pubs on OpenStreetMap
Cartogram of countries based on number of pubs mapped on OpenStreetMap
Now this is quite fun, but not very informative. We already know that pubs are pretty much of British origin, and that Germany is the country with the most active mapping community. So to a large degree it tells us where people are mapping things. We can improve this a little by correcting for population and using the density mapping option in Scapetoad. Now we get more intriguing results:

Cartogram of Pubs on OSM
Cartogram rejigged to use pub/unit population and a density function to generate the cartogram.
Countries are coloured according to pub density.
This is a bit more like it. We can still see that places which were part of the British Empire (notably Australia) have a higher density of pubs compared with other places. It is also obvious that certain countries just don't have people mapping pubs. These include the expected places, such as most of Africa, but also the USA which shrinks significantly in this cartogram. A less obvious feature is how the southern part of Latin America curves away from the Falkland Islands/Islas Malvinas. This is because the islands are unusually well-mapped for pubs, with 14 for a population of just over 3000. I inadvertently used the wrong option when I first made this cartogram, the result was a map of the world with a distinct bias to the British Empire and its residual bits and pieces.

Cartogram using a density value but mass algorithm
As above, but cartogram generated using mass rather than density method.
Just goes to show that one has to be careful choosing parameters even with a tool such as ScapeToad which has tried to keep everything simple.

Working with pubs allowed me to work out the basic methodology. I then ran this through with a number of other subsets of POIs: Banks, Supermarkets, Restaurants, Post Offices & Pharmacies.

Supermarket Cartogram (density by country)
Restaurants Cartogram (density by country)
Banks Cartogram (density by country)
All tell a familiar tale of where things are mapped on OSM: over-large Europe, tiny Africa, Middle East, Africa and China. The USA is usually smaller than would be expected, with one exception: post offices. And post offices in the US are an easily understood anomaly. Many amenity=post_office nodes in the US were imported from GNIS, and like many other GNIS data sets are of dubious value. Even if these post offices are reasonably accurately located (which is doubtful) over 6000 are labelled with "(historic)" after the name of the post office (run this overpass query over anywhere in the US).

So far all this tells us is that OpenStreetMap has a long way to go before basic amenities are mapped reasonably comprehensively anywhere, let alone in China, India, and Africa. All the cartograms are maps of mapper activity not maps of things.

Marimekko (Mosaic) Plots

Another way I have wanted to look at this data is using a Marimekko plot.

I wanted to do this before I gave a talk at SotM-Baltics, but there were too many complications. I therefore created a very rough approximation using values from country-specific instances of taginfo and Excel:

Stacked-bar charts showing distribution of various retail categories by country
with city of Nottingham (ng) at bottom.
I'd chatted to Jason about this type of plot, but great though d3js is, I really wanted something involving less programming. So I turned to R. I found a GIST for a marimekko plot, but in contemplating it it didn't do quite what I wanted and still required computing the boundaries of the plot. Now I first created a little app for doing Marimekkos around 1993 using a product called Metaphor on a "grey Workstation". The firm I worked for was building an app (in VB3) for an international drinks firm who used Marimekko plots to display market share of their brands in various categories (whisky, rum, gin, vodka, etc). Basically I did exactly what the GIST and other examples do: precompute the corners of each rectangle in the plot before plotting. In Metaphor I was able to do this non-programmatically because one could just connect very simple operations in a chain (somewhat like using UNIX pipes, but using a GUI interface). I'd hoped for something similar in R, but couldn't find it.

Instead I used SQL windowing functions to calculate the lower & upper bounds of the boxes and extracted that data into R. After a fair bit of experimentation with ggplot2, I found the geom_rect method most appropriate. Like a lot of R packages this is extremely rich in functionality some of which takes some effort to work out how to exploit. The examples below are a work in progress.

Retail Outlets on OSM in Africa

Retail Outlets on OSM in Europe

Retail Outlets on OSM in South America

There are masses of improvements to this visualisation: better labelling (particularly of numeric values), showing smaller countries as 'other' or otherwise aggregated together. The main thing I did was to fiddle with the colour scheme in the hope that it is more readable by those who have colour-blindness. I used a suggested palette for colour-blind safe values and changed the brightness in steps for similar retail categories. I haven't yet had a chance to get feedback on this, but it did take time to evolve a custom palette for the values I used. I haven't really started using the plots for any kind of in-depth analysis (really I hope to be able to produce something like the Excel plot at the top on a more systematic basis), so this post is more about the realities of trying to create a (simple) global data set from OSM and then create visualisations with it.

I'll leave looking at the retail stuff in detail for a later post: the cartograms & plots are part of the toolset for exploring this data. I probably really want to look at retail POIs at something like city rather than country scale. This will require a bit more data wrangling (and probably some more gotchas to explore) before I'm able to do that.

I also would like to further tidy up the base dataset & get it so that I can update it from daily planet diffs. At that point it may become useful for others.


  1. Could the legend from the 3 last pics be vertically inverted?
    The black values are on top of the graphic but they are at the bottom of the legend, for example.

  2. I wasnt quite sure what you meant when I first read this. It was only when Harry asked if it could be picture of the week that I instantly realised you wanted the legend to be in the same order as the main plot. Fortunately R allows a simple reverse=TRUE options for legends, so here it is:.