Friday, 4 May 2012

Databases: relational instances or NoSQL

Rather like that man who would occasionally emerge from a shed on the Fast Show, these days I am mostly working on databases.

So far they have been relational databases, based on a plan devised in 1970 by a man named Edgar Codd who worked for IBM in California. In this model data is split into tables and they are interlinked by using keys (usually numbers). A table is basically a flat file, like a spreadsheet and has a number of records (eg people) with fields (such as their name). If you want to record the companies they work for you could have a field in the people table with the company name but, more likely, you will have another table of companies which contains the information about each company. The relational bit is that there would be a reference number (an ID) for each company and it is this ID that would be included, in a 'companies' field, in the people table in order to link the person to the company. In this way several people can work for the same company. Incidentally, the process of separating out data so that repeated stuff (like the company data) is in another table and is cross-referenced is known as normalisation.

This model starts to fall down, by becoming more complex, if the relationships become multi-connected. Let me give you an example.

You have a database of CDs. A CD has, as what we call its attributes, such things as a title, an artist, a release data and a list of songs. Two things are problematic here: artist names can vary subtly between albums and each album has a different number of songs on it. Because of this you can't have a simple relationship between the record for an artist (in the artist table) and the CD or between a song (in the song table) and the CD. This is because the artist has more than one name and a song may be on more than one CD.

To deal with this you can use instances of the links between the tables and all these instances go into a separate table. For the artist there is one instance for one name by which the act is known (eg Prince) and one for another (the Artist Formerly Known as Prince) and the album links to whichever instance is appropriate (and gets the artist's name from it) and the instance then links to the main artist information. Similarly with songs, although usually ... but not always ... the song's name remains the same for each instance.

Doing this adds versatility to the database but it increases the number of tables and can also reduce performance.

Such complexity is one of the reasons that the 21st Century approach to databases is widening beyond the relational model (scalability is an even more crucial one) and although many of the big players on the web, such as Facebook and Twitter, use such a database there are others. Google's Big Table being perhaps the best known. These collectively seem to be known as NoSQL (because they don't use the SQL query language), although if you read the discussion page associated with the previous Wikipedia link, it does seem to be a controversial topic. I should add that Wikipedia's Talk pages are always worth checking.

I don't think relational databases are quite dead yet, and not all of us have Google's voracious appetite for data ... but it's clearly something to keep an eye on. What do you think?