Saturday, 5 February 2011

Crime hotspots pinpoint data problems

You can't have missed the news earlier this week that UK crime information was being made available online, linked to maps and searchable by postcode. The story made the news for possibly the wrong reasons: the site was unreachable and the results seemed nonsensical in some cases.

The problem with access to the site seems to have been resolved, and can be put down to intense public interest coupled with widespread publicity. Building a web system that can cope with thousands of hits a minute isn't easy and isn't cheap, especially if the base load predicted for the system is much lower. But that's not why I'm writing this here.

I am interested in data, and have been for a long time. Data rarely means much in isolation: it usually needs interpreting, and quite often you need to know how it was collected before you can understand it completely. There were a couple of newsworthy elements to the crime data, one of which relates to the meaning of location.

The crime reports seem to be displayed based on postcode centroids (essentially the geometrical middle of the postcode), which tends to shift them to the middle of roads (meaning the middle of the length of a road, not the white line). This may be misleading if the criminal activity tends to occur at road junctions, since centroids are rarely located there. Activity spread across a postcode will also tend to be pushed to the centre, as if it were falling downhill towards the centroid. But as the web site display says: "To protect privacy, crimes are mapped to points on or near the road where they occurred." That's a data output issue.
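To make the centroid effect concrete, here is a minimal sketch in Python. It assumes the centroid is simply the mean of the address points in a postcode, which may not be how the crime site actually works it out, and every coordinate in it is invented for illustration.

```python
from statistics import mean


def centroid(points):
    """Return the arithmetic mean of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))


# Hypothetical address points along one straight road in a single postcode
# (distances in metres from one end of the road).
addresses = [(0.0, 0.0), (50.0, 0.0), (100.0, 0.0), (150.0, 0.0), (200.0, 0.0)]

# Hypothetical incidents, all at the junctions at either end of the road.
incidents = [(0.0, 0.0), (0.0, 0.0), (200.0, 0.0)]

# Assumption: every incident is displayed at the postcode centroid.
snap_point = centroid(addresses)
mapped = [snap_point for _ in incidents]

print("Incidents occurred at:", incidents)
print("Incidents displayed at:", mapped)
```

In this made-up case all three incidents happened at junctions, yet every one of them is displayed halfway along the road, where nothing happened at all.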

On the other hand, data input can be misleading too. A call centre was logging nuisance calls to its own location when it didn't have a real location for them. That meant that the call centre itself was listed as a crime hotspot! That's a data input problem, which could be ameliorated if a 'confidence' ranking were given to logged locations.
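A 'confidence' ranking might look something like the sketch below. The `Incident` record, the 0-to-1 `confidence` score and the `hotspot_counts` helper are all hypothetical names of my own, not anything the police systems are known to use, and the log entries are invented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    location: str      # where the incident was logged
    confidence: float  # hypothetical score: 1.0 = location given by caller,
                       # 0.0 = fallback location chosen by the logging system


def hotspot_counts(incidents, min_confidence=0.5):
    """Count incidents per location, ignoring low-confidence locations."""
    return Counter(
        i.location for i in incidents if i.confidence >= min_confidence
    )


# Invented log: three nuisance calls with no known location were logged
# against the call centre's own address.
log = [
    Incident("High Street", 1.0),
    Incident("Call Centre", 0.0),
    Incident("Call Centre", 0.0),
    Incident("Call Centre", 0.0),
    Incident("Station Road", 1.0),
]

naive = hotspot_counts(log, min_confidence=0.0)  # counts everything
filtered = hotspot_counts(log)                   # drops fallback locations

print("Without confidence filter:", naive.most_common(1))
print("With confidence filter:", dict(filtered))
```

Without the filter the call centre tops the list; with it, the fallback entries drop out of the hotspot count while staying in the log for anyone who wants to count them separately.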

One final thought concerns data correlation. With access to this crime data it is likely that someone will start to compare these maps with other data. This is unwise unless you really know what you are doing and what the data represents. To give an example, there may be an apparent correlation between the distribution of red admiral butterflies and car crime on Merseyside. But if you thought you saw this in a map you wouldn't take it seriously, would you? I know it's a somewhat ridiculous example, but I'm sure you get my point.

Those butterflies, eh?