Wednesday 3 April 2013

Maps for Dogs, or Lamp Posts in chains

Wet Lamp Post, photo by Don3rdSE (all rights reserved)
Not more stuff on street lights, 'fraid so.

In my earlier post I mentioned that it would be nice to see if it was possible to find the location of streets simply by linking together all the lamp posts on the same street. This is a (technical) account of how I achieved this using PostGIS.

City by Night: polygons representing streets captured from the street light open data.
Colours depend on the number of lamp posts; the area is here.
My first, and deliberately naive, attempt was to group all lamp posts sharing a street name and to 'shrinkwrap' them with a concave hull. This is an out-of-the-box operation in PostGIS; the lamp posts are grouped into a single geometry first:
select street_name
       -- ST_ConcaveHull needs a target_percent; 0.99 (a fairly tight hull) is
       -- shown here as the value I actually used was not recorded
     , ST_ConcaveHull(ST_Union(lamp_post_geom), 0.99) as street_hull
  from lamp_posts
 group by street_name;
Great, except that it didn't work. The reason: there are several streets in the city which share a name, and these are often on opposite sides of town (e.g., Vernon Avenue, Basford and Vernon Avenue, Wilford). In that case, and probably others, the duplication dates from expansion of the city: Wilford only became part of the city around 1950. So duplicate names have arisen through city growth rather than through some absence of mind on the part of the authorities.
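A hedged sketch of how this failure mode can be spotted in advance is to look for street names whose lamp posts are spread over an implausibly large distance (table and column names follow the query above; the 1 km threshold and the assumption of a metre-based projection are mine):

-- street names returned here almost certainly cover two or more unrelated streets
-- (1 km threshold is a guess; geometry assumed to be in metres)
select street_name
     , count(*) as lamp_posts
     , ST_MaxDistance(ST_Union(lamp_post_geom), ST_Union(lamp_post_geom)) as spread_m
  from lamp_posts
 group by street_name
having ST_MaxDistance(ST_Union(lamp_post_geom), ST_Union(lamp_post_geom)) > 1000;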

Therefore I decided to start at the base level of the data and work upwards from the lamp post positions themselves. It is easy to find each lamp post's nearest neighbour which shares the same street name. I simply placed a decent-sized buffer (50-60 metres looks about right; the average nearest-neighbour distance is 29.54 m ± 16.18 m) around each light's geometry and then found the lights within that buffer sharing the street name:
where ST_Within(a.lamp_post_geom, ST_Buffer(b.lamp_post_geom, DISTANCE))
  and a.street_name = b.street_name
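For reference, the quoted spacing statistics can be reproduced with a nearest-neighbour query along these lines (a sketch only; I'm assuming each lamp post has a unique geo_id, as in the later queries):

-- mean and standard deviation of the distance from each lamp post to its
-- nearest neighbour on the same street
select round(avg(nn_dist)::numeric, 2)    as mean_m
     , round(stddev(nn_dist)::numeric, 2) as sd_m
  from (select a.geo_id
             , min(ST_Distance(a.lamp_post_geom, b.lamp_post_geom)) as nn_dist
          from lamp_posts a
          join lamp_posts b
            on  a.street_name = b.street_name
           and  a.geo_id <> b.geo_id
         group by a.geo_id) nn;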
Then one selects the lamp post closest in distance, using PostgreSQL's neat DISTINCT ON clause. This still has a problem: each pair of lamp posts will be found twice, once in each direction. Therefore another predicate is needed so that, for each lamp post, only one of its adjacent posts is consistently chosen. Again PostGIS offers geometry operators: we can test whether the bounding box of one lamp post is above, below, left (<<) or right (>>) of another. It doesn't matter which is used; in this case I used <<. However, I got no results when I added such an operator directly to the where clause above, and had to nest the query as a set of sub-queries as follows:
create table streetlight_arcs as
select distinct on (a.geo_id)
       a.geo_id                    as left_geo_id
     , b.geo_id                    as right_geo_id
     , a.street_name
     , st_distance(a.geom, b.geom) as short_length
  from streetlights a
     , streetlights b
 where 1 = 1 -- a place-holder predicate to make editing easier!
   and a.street_name = b.street_name
   and ST_Within(a.geom, ST_Buffer(b.geom, 75))
   and a.geom << b.geom
 order by a.geo_id
        , st_distance(a.geom, b.geom);
The result is that we have the arcs of many graphs (streets) described as pairs of nodes (lamp posts). All we have to do is traverse these arcs to find all the member nodes of each graph (i.e., all lamp posts in a street). Of course many of the streets will be a simple chain of arcs, but longer streets will have a (slightly) more complex topology. I suspect that the use of the geometry operator also means that all the graphs are DAGs, but I haven't attempted to verify this.
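One quick sanity check before traversal is to list the lamp posts that never made it into any arc; they will end up as single-node graphs (a sketch against the tables created above):

-- lamp posts that appear in no arc: either the only light on their street
-- or further than the buffer distance (75 m above) from any same-named neighbour
select s.geo_id, s.street_name
  from streetlights s
 where not exists (select 1
                     from streetlight_arcs a
                    where s.geo_id in (a.left_geo_id, a.right_geo_id));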

I've been interested in efficient ways of working with graphs in SQL databases for years. In the late 1990s I worked with a large (million+ record) database of customers which had a very rich set of linkages between individual customer ids. This was for a data mining project, so we needed to group the individual customer ids into something which represented the real-world customer more coherently (for instance, companies all controlled by the same director). As the project was a trial, I had to find something which worked fairly quickly. It also had to be efficient even though the data was on an MPP machine. Irritatingly, I can always remember the basic principles of the algorithm I developed, but never the nitty-gritty details.

As I wanted to do something similar with the lamp posts, I've had a go at re-writing this as a Postgres stored procedure. What I have has none of the performance optimisation from my past implementations, but it does serve the basic purpose of traversing a graph and producing output with each graph uniquely identified and a list of all of its member vertices. Input is a table containing arcs, with each node of an arc identified by a unique integer (other columns may be present, but aren't used). Output is a table with two columns, parent_id (i.e., a unique id for the graph) and child_id (the graph members). The graph identifier is merely the lowest node identifier in its set of members.

A single work table is created and one column is updated on each pass. Two queries are used per pass, one for the left vertex, the other for the right vertex. When no more rows are updated, the data are summarised into the output table.
CREATE OR REPLACE FUNCTION gt(
  input_table character varying
, left_column character varying
, right_column character varying
, output_table character varying)
  RETURNS integer AS
$BODY$
DECLARE
  iter integer;                                  -- mainly used for debug purposes
  level integer;
  row_count integer := 0;
  sql_text text;
  work_table character varying := 'work_table';
BEGIN
-- Input table requires a list of arcs of a graph
-- stored as pairs of integers identifying the nodes
-- only pairs with a distance of one are required,
-- nodes must be uniquely identified by an integer

EXECUTE 'DROP TABLE IF EXISTS work_table';
 
-- First populate the work table with self references to all nodes
sql_text := 'CREATE TABLE work_table as
      select distinct ' || left_column || ' as ' || left_column || '
           , ' || left_column  || ' as left_uid
           , ' || left_column  || ' as right_uid
           , ' || left_column  || ' as orig_min_uid
           , ' || left_column  || ' as min_uid
           , 0::integer lvl
           , 0::integer iter
        from (select ' || left_column  || ' from ' || input_table || '
              union
              select ' || right_column || ' from ' || input_table || ') a1
      union
      select ' || left_column  || '
           , ' || left_column  || '
           , ' || right_column || '
           , least(' || left_column || ', ' || right_column || ')
           , least(' || left_column || ', ' || right_column || ')
           , 1::integer lvl
           , 0::integer iter
        from ' || input_table || ' as a2
      union
      select ' || right_column || '
           , ' || right_column || '
           , ' || left_column  || '
           , least(' || left_column || ', ' || right_column || ')
           , least(' || left_column || ', ' || right_column || ')
           , 1::integer lvl
           , 0::integer iter
        from ' || input_table || ' as a3';

-- run the query

EXECUTE sql_text;
GET DIAGNOSTICS row_count = ROW_COUNT;

-- now do the main iteration
    iter := 1;
WHILE row_count <> 0 -- exit loop when no more updates
  LOOP
    -- first find minimum vertex id for left vertices
    sql_text :=
      'UPDATE work_table
         SET min_uid = a.min_uid
           , iter = iter+1
         FROM (
            SELECT
              a1.left_uid left_uid
            , min(least(a1.min_uid,a2.min_uid)) min_uid
            FROM work_table a1
               , work_table a2
            WHERE a1.right_uid = a2.left_uid
              AND a1.min_uid <> a2.min_uid
            GROUP BY a1.left_uid ) a
          WHERE a.left_uid = work_table.left_uid
            AND work_table.min_uid <> a.min_uid';
  
    EXECUTE sql_text;
    GET DIAGNOSTICS row_count = ROW_COUNT;
    -- then minimum vertex id for right vertices
    sql_text :=
       'UPDATE work_table
           SET min_uid = a.min_uid
             , iter = iter+1
          FROM (
            SELECT
                 a1.right_uid right_uid
               , min(least(a1.min_uid,a2.min_uid)) min_uid
              FROM work_table a1
                 , work_table a2
             WHERE a1.right_uid = a2.left_uid
               AND a1.min_uid <> a2.min_uid
            GROUP BY a1.right_uid ) a
          WHERE a.right_uid = work_table.right_uid
            AND work_table.min_uid <> a.min_uid';
         
     EXECUTE sql_text;
     GET DIAGNOSTICS row_count = ROW_COUNT;
            
     iter := iter + 1;

   END LOOP;

 sql_text := 'CREATE TABLE ' || output_table || ' as
          SELECT DISTINCT min_uid as parent_id
                        , left_uid as child_id
          FROM work_table ';
          RAISE NOTICE '%', sql_text;
          EXECUTE sql_text;         
          GET DIAGNOSTICS row_count = ROW_COUNT;
          RETURN row_count;

END; -- gt
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100;

The algorithm works by comparing the identifiers of two or more linked nodes and always selecting the lowest, which is then assigned as the parent identifier of those nodes. The process is iterated as the graph is traversed. I'm sure that at the moment it is O(n) when it should be O(log n), where n is the number of hops a label has to propagate across a graph (roughly its diameter). It also scans the whole table on each pass, so it needs a lot of refinement.
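For completeness, this is roughly how the function is applied to the arc table built earlier. The output table name street_groups and the join back to the lamp posts are my own illustration rather than the exact commands I ran:

-- label each lamp post with the lowest geo_id of its connected group
select gt('streetlight_arcs', 'left_geo_id', 'right_geo_id', 'street_groups');

-- one row per connected group, i.e. per individual street
create table street_groups_geom as
select g.parent_id
     , max(s.street_name)  as street_name
     , count(*)            as lamp_posts
     , ST_Collect(s.geom)  as lamp_post_geoms
  from street_groups g
  join streetlights s on s.geo_id = g.child_id
 group by g.parent_id;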

Nottingham Street Lights used for Routing
Comparison of OSM highways with those derived from Streetlight data
The image above shows an overview of the city mapped just using these chains of street lights. For comparison I have added the current OSM highway network as an underlay. The process has created a pretty decent representation of the city's road network. Closer up (below) one can see the quirks of the ways produced: zigzag ways on roads where lamp posts alternate regularly from one side of the road to the other; a ladder effect on major roads where lamp posts are arranged in pairs on either side; and oddities generated by using the PostGIS geometry operator. All in all, this is another example of how much one can derive from an apparently uninformative dataset.

The first abstract image above was obtained by putting a buffered envelope around each isolated street and colour coding it according to the number of lamp posts within the buffer.
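Something along these lines will produce those coloured shapes, assuming the grouped table from the sketch above; the 25 m buffer width is a guess at the value actually used:

-- one buffered envelope per isolated street, carrying the lamp post count
-- used for the colour coding
create table street_envelopes as
select parent_id
     , street_name
     , lamp_posts
     , ST_Buffer(ST_Envelope(lamp_post_geoms), 25) as outline
  from street_groups_geom;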

The obvious use case for all this is to create a navigation app for dogs (or possibly drunks). Here is a Garmin displaying an IMG file created with mkgmap, just to show that this is possible!

Routable map on a Garmin etrex built from Street Light Open Data
I think, however, that there is a more serious side.
  • Firstly, the technique described above (converting geodata into a graph, analysing it, and then generating additional geodata) is very useful. This is particularly so if one can convert very computationally intensive operations (PostGIS geometry functions and operators) into simple ones (integer comparisons). I have used a similar approach to assemble land polygons from the OS OpenData Meridian coastlines, which are only available as lines, not areas. It is also very easy to convert various sorts of OSM data, such as highway ways, into this kind of graph within Postgres; for instance, this could be used to find isolated subgraphs (see the sketch after this list).
  • Secondly, by enriching the data I have created some useful polygons suitable for independent verification of OSM data. This is nice because a significant proportion of street data in the UK is derived from a single source: Ordnance Survey Open Data. Whilst many names have been checked by survey, it's very nice to have another comprehensive source.
  • Thirdly, it enables identification of missing residential ways: in particular named ways which are only footpaths.
  • Fourthly, private roads not owned or maintained by the council are more easily picked out. I have been surprised how many new developments do not have council-provided street lights (for instance Stanhope Avenue and Langdon Place). Such roads may have distinct access restrictions and other unusual features.
  • Fifthly, the same sort of approach of enrichment & derivation will work with sets of data from OSM or other sources, provided that they are comprehensive.
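As an illustration of the first point, here is a hedged sketch of how OSM highway ways could be fed into the same machinery, assuming an osm2pgsql slim-mode planet_osm_ways table (id, nodes bigint[], tags) that has already been filtered to highways:

-- break each way into consecutive node pairs; gt() then labels the connected
-- sub-networks, and isolated subgraphs show up as small groups
create table highway_arcs as
select w.id           as way_id
     , w.nodes[i]     as left_node_id
     , w.nodes[i + 1] as right_node_id
  from planet_osm_ways w
 cross join lateral generate_subscripts(w.nodes, 1) as i
 where i < array_length(w.nodes, 1);

select gt('highway_arcs', 'left_node_id', 'right_node_id', 'highway_groups');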
Of course, I actually just did this 'because it was there'.
