Thursday, 6 September 2012

Assessing risks in iMedia projects and lessons learnt

We've met some thorny problems recently over the maintenance of a long-standing project that has been ticking over happily for six years. The design had been done by one company, the server hosting by another, and the database processing back-end by us. The client managed the three companies in their separate roles. All has not gone smoothly just recently, and it provides a timely prompt for a few points to bear in mind if you have long-running projects that just ... work.

Lesson 1 Don't get complacent. Re-assess the risks even on stable projects. In hindsight there were several signs of risk. Key players left two of the companies, and the tacit knowledge they had built up about the nuances of the project left with them. The new staff struggled to catch up. On our side we will be providing detailed briefing documents to help new hosting staff understand how the database fits together, as it isn't obvious. We are also improving our 'disaster recovery' at our client's insistence (quite right too).

Lesson 2 Check the level of experience of the newcomers, and point out to the companies that it is their responsibility to manage the changeover of staff successfully, with enough back-up from people experienced enough to act as mentors.

It is hard to meddle in other companies, of course, but in the end you can voice your concerns about the risks to the client, and it is up to them to manage the companies; they pay them, after all. But you should also be aware that you can help cover this risk yourself by briefing people fully on how things work.

Lesson 3 The unforeseen consequences of a tiny bit extra.

When you get that niggling feeling, it probably means there's a risk that needs to be covered! We had that niggling feeling when the design company needed to insert a tiny extra database to log accesses to the public end of the system, so that the amount of data people could download from the web site was limited. A sound bit of security for the whole database system, we'd all agree. So, what happened?

The system as a whole gets around a million hits a week, so the tiny new database was being updated continuously. Server logging increased dramatically, and when this was coupled with the logging done as the main database was rebuilt, the server simply ran out of disc space, very suddenly. Knowing what could be deleted without affecting the ongoing rebuild was not straightforward.
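As an aside, a little routine housekeeping on that sort of logging database goes a long way. Here is a minimal sketch (in Python, with hypothetical table and path names, not our actual system) of the kind of pruning and disc-space check that would have kept the growth in view:

```python
# Housekeeping sketch for a small access-logging database (hypothetical names).
# Prunes rows older than the rate-limiting window and warns when disc space is low.
import shutil
import sqlite3
import time

DB_PATH = "access_log.db"          # hypothetical path to the logging database
RETENTION_SECONDS = 7 * 24 * 3600  # keep only the window needed for download limiting
MIN_FREE_BYTES = 5 * 1024 ** 3     # warn when less than 5 GB is free


def prune_old_entries():
    """Delete access-log rows that are older than the limiting window."""
    cutoff = time.time() - RETENTION_SECONDS
    conn = sqlite3.connect(DB_PATH)
    try:
        with conn:  # commits the delete when the block exits
            conn.execute("DELETE FROM access_log WHERE logged_at < ?", (cutoff,))
        conn.execute("VACUUM")  # reclaim the freed pages; must run outside a transaction
    finally:
        conn.close()


def check_disc_space(path="/"):
    """Print a warning if free space on the given volume drops below the threshold."""
    free = shutil.disk_usage(path).free
    if free < MIN_FREE_BYTES:
        print("WARNING: only %.1f GB free on %s" % (free / 1024 ** 3, path))


if __name__ == "__main__":
    prune_old_entries()
    check_disc_space()
```

Nothing clever, just something that runs regularly and shouts before the disc fills rather than after.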

Lesson 4 Even back-up systems fail.

Yes, although everyone thought the emergency back-up system would take over if the database failed, the back-up failed too! There was the strangest set of circumstances, naturally. The server company physically moved the servers, which led to a cascading set of failures. (See Andy's blog of two weeks ago, Escape to the country, 27th August.) The emergency failover back-up wasn't quite isolated enough: its database was automatically following changes from the server it was mirroring, so it mirrored the failure too.
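One way to reduce that risk is to keep dated, point-in-time copies alongside the live mirror, so a problem on the primary cannot instantly propagate into every copy. A minimal sketch, again in Python with hypothetical file names and an SQLite-style database standing in for the real thing:

```python
# Point-in-time snapshot sketch (hypothetical paths): dated copies kept alongside
# live mirroring, so the mirror is not the only fallback when the primary goes bad.
import datetime
import pathlib
import sqlite3

SOURCE_DB = "primary.db"                  # hypothetical live database file
SNAPSHOT_DIR = pathlib.Path("snapshots")  # where dated copies are kept
KEEP_SNAPSHOTS = 14                       # roughly two weeks of daily snapshots


def take_snapshot():
    """Copy the live database into a dated snapshot file using SQLite's online backup."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    stamp = datetime.date.today().isoformat()
    target_path = SNAPSHOT_DIR / ("primary-%s.db" % stamp)
    source = sqlite3.connect(SOURCE_DB)
    target = sqlite3.connect(str(target_path))
    try:
        source.backup(target)  # consistent copy of a live database (Python 3.7+)
    finally:
        source.close()
        target.close()
    return target_path


def prune_snapshots():
    """Delete the oldest snapshots so the snapshot directory doesn't fill the disc."""
    snapshots = sorted(SNAPSHOT_DIR.glob("primary-*.db"))
    for old in snapshots[:-KEEP_SNAPSHOTS]:
        old.unlink()


if __name__ == "__main__":
    take_snapshot()
    prune_snapshots()
```

The point is not the particular tool, it's that a mirror that follows every change is not a back-up on its own; something has to stand a little apart from the live system.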

Lesson 5 If one thing goes wrong, other things are more likely to go wrong as well! It's that tip of Murphy's (Law) iceberg. These things in isolation wouldn't have been so bad, but they all happened in quick succession. Maybe, too, that was because everything had run so smoothly for years.

A long- and smooth-running project should be praised constantly for its stability, and the people concerned need praise too. We are all guilty of only noticing the errors and moaning about them. Give credit to the old, forgotten but smooth-running projects. But keep an eye on them.