Thursday, 29 January 2015

Accuracy and meaning

My eye was caught this morning by two unrelated news items which have in common the ideas of accuracy and meaning. How can a seemingly small change be significant?

In some ways my first isn't a small change: the BBC Arabic Service have said they will not describe an act as terrorism or a person as a terrorist. Instead the terminology will be more specific, such as bomber or attacker or gunman. There's an interesting analysis of this by Memphis Barker in the Independent. Apparently this stance is already reflected in the BBC's editorial guidelines which say that the BBC "does not ban the use of the word. However, we do ask that careful thought is given to its use by a BBC voice."

The word itself is interesting in that it derives from the French terrorisme which specifically referred to the then French government's reign of terror.

That said, use of the word, and by extension any emotive word, needs to be carefully considered, especially if it has connotations beyond its literal meaning. Such risk can be exacerbated when working out pithy and attractive headings for web pages (and stories in newspapers), and avoidance of such problems is part of the skill of the newspaper sub-editor. If you're writing for a blog or web site then you will also be taking on that role. If you're an organisation like the BBC then communicating with 'your voice' is also a factor. Are your clients big enough to think this way too?

My second example is something to strike fear into the hearts of anyone running databases: can a small error be catastrophic?

In recording data about companies that had been wound up, the UK companies registrar, Companies House, accidentally failed to notice a letter 'S' in a company name that should not have been there. Taylor & Sons Ltd had not gone into liquidation, it was Taylor & Son Ltd. As this Guardian piece explains, that single letter cost Taylor & Sons dear ... it really did go out of business ... and now, even though they corrected the mistake after three days, Companies House have to carry the can to the tune of what is likely to be several million pounds.

This kind of error can be introduced during data preparation, when the data is input, or during processing or retrieval. From your company's point of view, it would probably be covered by professional indemnity insurance, should there be a financial liability. Sometimes, however, it might just be embarrassing. In the BBC Domesday Project, an inadvertent error made the UK seem to be highly radioactive. Fortunately it was noticed before publication and fixed by a software engineer doing the data equivalent of a high wire act to correct a single byte of data.
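
As an aside, this is exactly the kind of slip that a simple sanity check can catch. The sketch below is my own illustration in Python, not anything Companies House actually runs, and the register of names is made up: it flags a near-identical company name before a wind-up is recorded, so that a one-letter difference gets a human check.

from difflib import SequenceMatcher

# A made-up register of company names; the real register is obviously far larger.
REGISTER = ["Taylor & Son Ltd", "Taylor & Sons Ltd", "Smith Holdings plc"]

def similar_names(target, threshold=0.9):
    """Return registered names that are suspiciously close to the target."""
    return [name for name in REGISTER
            if name != target
            and SequenceMatcher(None, name.lower(), target.lower()).ratio() >= threshold]

def mark_wound_up(target):
    lookalikes = similar_names(target)
    if lookalikes:
        # Don't apply the change automatically; ask a person to confirm first.
        print(f"Check before proceeding: {target!r} is very close to {lookalikes}")
    else:
        print(f"Marking {target!r} as wound up")

mark_wound_up("Taylor & Son Ltd")   # flags the near-duplicate "Taylor & Sons Ltd"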

Friday, 4 May 2012

Databases: relational instances or NoSQL

Rather like that man who would occasionally emerge from a shed on the Fast Show, these days I am mostly working on databases.

So far they have been relational databases, based on a model devised in 1970 by a man named Edgar Codd, who worked for IBM in California. In this model data is split into tables, which are interlinked using keys (usually numbers). A table is basically a flat file, like a spreadsheet: it has a number of records (eg people), each with fields (such as their name). If you want to record the companies they work for, you could have a field in the people table with the company name but, more likely, you will have another table of companies which contains the information about each company. The relational bit is that each company has a reference number (an ID), and it is this ID that is included, in a 'company' field, in the people table to link the person to the company. In this way several people can work for the same company. Incidentally, the process of separating out data so that repeated stuff (like the company data) is in another table and is cross-referenced is known as normalisation.
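
To make that concrete, here is a minimal sketch of the people-and-companies example using Python's built-in SQLite module; the table and column names are my own invention.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE company (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE person (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        company_id INTEGER REFERENCES company(id)   -- the key that links the tables
    );
""")
db.execute("INSERT INTO company (id, name) VALUES (1, 'Acme Ltd')")
db.executemany("INSERT INTO person (name, company_id) VALUES (?, ?)",
               [("Alice", 1), ("Bob", 1)])          # two people, one shared company row

# Join the tables back together: each person alongside their company's name.
for row in db.execute("SELECT person.name, company.name FROM person "
                      "JOIN company ON person.company_id = company.id"):
    print(row)

The JOIN at the end stitches the two tables back together, which is the price you pay for not repeating the company details against every person.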

This model starts to fall down, by becoming more complex, if the relationships become multi-connected. Let me give you an example.

You have a database of CDs. A CD has, as what we call its attributes, such things as a title, an artist, a release date and a list of songs. Two things are problematic here: artist names can vary subtly between albums, and each album has a different number of songs on it. Because of this you can't have a simple relationship between the record for an artist (in the artist table) and the CD, or between a song (in the song table) and the CD: the artist has more than one name, and a song may appear on more than one CD.

To deal with this you can use instances of the links between the tables and all these instances go into a separate table. For the artist there is one instance for one name by which the act is known (eg Prince) and one for another (the Artist Formerly Known as Prince) and the album links to whichever instance is appropriate (and gets the artist's name from it) and the instance then links to the main artist information. Similarly with songs, although usually ... but not always ... the song's name remains the same for each instance.
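
Here is a rough sketch of what those instance (or link) tables might look like, again in SQLite; the table names are mine, and a real product would differ in the details.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE artist (
        id INTEGER PRIMARY KEY
    );
    CREATE TABLE artist_name (      -- one 'instance' per name the act has used
        id        INTEGER PRIMARY KEY,
        artist_id INTEGER REFERENCES artist(id),
        name      TEXT NOT NULL
    );
    CREATE TABLE album (
        id             INTEGER PRIMARY KEY,
        title          TEXT NOT NULL,
        artist_name_id INTEGER REFERENCES artist_name(id)   -- the name on the sleeve
    );
    CREATE TABLE song (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE track (            -- one 'instance' per appearance of a song on an album
        album_id INTEGER REFERENCES album(id),
        song_id  INTEGER REFERENCES song(id),
        position INTEGER
    );
""")

An album now reaches the artist's main record via whichever name instance it credits, and a song can appear on as many albums as it likes via the track table.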

Doing this adds versatility to the database but it increases the number of tables and can also reduce performance.

Such complexity is one of the reasons that the 21st Century approach to databases is widening beyond the relational model (scalability is an even more crucial one), and although many of the big players on the web, such as Facebook and Twitter, still use relational databases, there are alternatives, Google's Bigtable being perhaps the best known. These collectively seem to be known as NoSQL (because they don't use the SQL query language), although if you read the discussion page associated with the previous Wikipedia link it does seem to be a controversial topic. I should add that Wikipedia's Talk pages are always worth checking.
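
By way of contrast, a document-style NoSQL store would often keep each CD as one self-contained record, with the artist name as credited and the track list embedded, rather than split across linked tables. This is a hand-written illustration of the idea rather than the format of any particular product.

import json

cd = {
    "title": "Purple Rain",
    "artist": "Prince and the Revolution",   # the name as credited on this album
    "released": 1984,
    "tracks": ["Let's Go Crazy", "When Doves Cry", "Purple Rain"],
}
print(json.dumps(cd, indent=2))

The duplication that normalisation removes comes back (the artist's name is repeated in every album document), which is part of the trade-off: simpler reads and easier scaling at the cost of redundancy.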

I don't think relational databases are quite dead yet, and not all of us have Google's voracious appetite for data ... but it's clearly something to keep an eye on. What do you think?

Saturday, 5 February 2011

Crime hotspots pinpoint data problems

You can't have missed the news earlier this week that UK crime information was being made available online, linked to maps and searchable by postcode. The story made the news for possibly the wrong reasons: the site was unreachable and the results seemed nonsensical in some cases.

The problem with access to the site, www.police.uk, seems to have been resolved and can be put down to intense consumer interest coupled with widespread publicity. Building a web system that can cope with thousands of hits a minute isn't easy and isn't cheap, especially if the base load predicted for the system is much less. But that's not why I'm writing this here.

I am interested in data, and have been for a long time. Data isn't really something in isolation: it usually needs interpreting, and quite often you need some background on how it was collected to understand it completely. There were a couple of newsworthy elements to the crime data, one of which relates to the meaning of location.

The crime reports seem to be displayed at postcode centroids (essentially the geometric middle of the postcode), which tends to shift them to the middle of roads (meaning halfway along the length of a road, not onto the white line). This can be misleading if the criminal activity tends to occur at road junctions, since centroids are rarely located there, and activity spread across a postcode will tend to be pushed to the centre, as if it were falling downhill towards the centroid. But as the web site itself says: "To protect privacy, crimes are mapped to points on or near the road where they occurred." That's a data output issue.
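
To see the effect, here is a toy illustration with made-up coordinates: incidents scattered along one postcode's street all end up displayed at the single centroid point.

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Incidents scattered along one postcode's street (invented coordinates)
incidents = [(0.0, 0.0), (0.2, 0.0), (0.9, 0.1), (1.0, 0.0)]

print(centroid(incidents))   # every incident gets displayed at this one point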

On the other hand, data input can be misleading. A call centre was logging nuisance calls against its own location when it didn't have a real location for them. That meant that the call centre itself was listed as a crime hotspot! That's a data input problem, which could be ameliorated if a 'confidence' ranking were given to logged locations.

One final thought concerns data correlation. With access to this crime data it is likely that someone will start to compare these maps with other data. This is unwise unless you really know what you are doing and what the data represents. To give an example, there may be an apparent correlation between the distribution of red admiral butterflies and car crime on Merseyside. But if you thought you saw this in a map you wouldn't take it seriously would you? I know it's a somewhat ridiculous example, but I'm sure you get my point.

Those butterflies eh?
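
If you want to convince yourself how easily an impressive-looking correlation can appear by chance, here is a toy demonstration: two completely independent random walks, standing in for the butterflies and the car crime, will quite often track each other surprisingly well over a short series.

import random

def random_walk(n, seed):
    rng = random.Random(seed)
    walk, value = [], 0.0
    for _ in range(n):
        value += rng.gauss(0, 1)
        walk.append(value)
    return walk

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

butterflies = random_walk(50, seed=1)   # not real butterfly data
car_crime = random_walk(50, seed=2)     # ...or real crime data
print(round(pearson(butterflies, car_crime), 2))   # often surprisingly far from zero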

Sunday, 31 October 2010

You never know who's listening

I suppose it comes into the category of a story that will run and run: it has legs, as they say, even though the problem was caused on wheels. What am I on about?

It looks as if Google Street View could be in breach of some laws in the UK, after having had similar problems in other countries and been blocked from the odd village for 'snooping'. It even shows our neighbour, frozen forever in the act of reading a book in the conservatory in front of his house while we all wait for our bins to be collected. I was in but Elaine was out, according to the cars in front of the house. Do I care?

Not about the photos, but if I thought that Google had recorded a snippet of my WiFi traffic then I might be. That seems to be the nub of it: incidental and, apparently, inadvertent recording of data. Data that might contain part of a confidential email exchange or even a password sent unencrypted to an FTP server. The interception was, as I understand it, done to match a WiFi router's MAC address to its physical location. This would enable, say, a mobile phone to check its location by looking to see what transmitters of any kind were in range. Those of you with iPhones will have seen that blue dot dance that occurs as the Google Maps application refines your location, from a combination of cell tower information and WiFi, until it can, finally, use GPS to give you the real location. The story goes that some extra code from another project got into the Google car system and instead of just recording the WiFi router's location it also recorded some of the traffic.
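
For what it's worth, here is a very rough sketch of the idea, with invented MAC addresses and coordinates: if you already know where certain routers live, a device that can see some of them can estimate its own position, here as a crude signal-strength-weighted average.

# Invented MAC addresses mapped to invented coordinates (lat, lon)
KNOWN_ROUTERS = {
    "aa:bb:cc:00:00:01": (51.501, -0.142),
    "aa:bb:cc:00:00:02": (51.502, -0.140),
    "aa:bb:cc:00:00:03": (51.500, -0.139),
}

def estimate_position(scan):
    """scan maps MAC address -> rough signal strength between 0 and 1."""
    lat_sum = lon_sum = total = 0.0
    for mac, strength in scan.items():
        if mac in KNOWN_ROUTERS:
            lat, lon = KNOWN_ROUTERS[mac]
            lat_sum += lat * strength
            lon_sum += lon * strength
            total += strength
    return (lat_sum / total, lon_sum / total) if total else None

print(estimate_position({"aa:bb:cc:00:00:01": 0.9, "aa:bb:cc:00:00:03": 0.3}))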

There is a lesson for us all here, which is the danger of amalgamating code snippets without fully understanding what they do. The 'snooping' code was presumably attached to something less contentious but both were incorporated in the street view system. On the one hand it's good coding practice to efficiently reuse your legacy code ... to not reinvent a software wheel ... but it is vital to look in detail at what that code does. In turn that comes down to documentation and code comments. It also comes down to making sure that any code put in a routine for testing purposes is removed or disabled in the release version.
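
One simple discipline, sketched below in generic form (this is an illustration of mine, not anyone's actual code), is to put test-only behaviour behind an explicit flag that defaults to off, so it cannot quietly ride along into a release build.

import os

CAPTURE_RAW_FRAMES = os.environ.get("DEBUG_CAPTURE") == "1"   # off unless explicitly set

def store(record):
    print("stored:", record)

def handle_frame(frame):
    # The intended behaviour: record only the router's identity and where it was seen.
    store({"mac": frame["mac"], "location": frame["location"]})
    if CAPTURE_RAW_FRAMES:
        # Test-only path: never enabled in a release configuration.
        store({"raw_payload": frame["payload"]})

handle_frame({"mac": "aa:bb:cc:00:00:01", "location": (51.5, -0.14), "payload": b"..."})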

There is also a great temptation to cut corners with code for internal use; but you never know when things will get out into the wild. In radio they tell you never to swear in front of a microphone because you never know when it might be live. Treat code the same way.