I mentioned in a post earlier this morning that I was having problems accessing wordpress.com blogs – wordpress.com is a hosted multi-user version of the blog software I use, WordPress. The site is now available again but suffered a “major disk failure” according to a message on the wordpress.com Dashboard.
The data loss is presumably because the drive which failed was not in a RAID array and the last backup of the site was a couple of days ago!
This is unforgivable. No matter how small a hosting organisation you are (and WordPress.com couldn’t be considered small), your users’ data is sacrosanct. Users will tolerate occasional downtime but not loss of data.
Matt and the rest of the WordPress.com team, you need to try to resurrect as much of your users’ data as possible (if you haven’t already done this), put the site on a RAID array, put a disaster recovery plan in place which ensures no data can ever be lost again and then try very hard to rebuild your now shattered reputation.
MacManX alerted me, in the comments of this post, to the fact that Matt has put up a post about this issue. In the post, Matt explains what happened, how the WordPress.com team responded and the fact that no data was lost:
Donncha was on the ball and switched all the traffic to a recent backup so most things would work while we investigated the hardware failure. This means that an old version of your site was shown for a few hours.
A few minutes ago we restored the up-to-date database and we’re currently syncing it to the backup to get back any posts you might have made during the semi-downtime. Even though we were able to recover everything, we’re looking at ways to make things even more redundant, so if this ever happens again the problems will be measure in seconds or minutes
It is lucky for the WordPress.com team that no data was lost; this will help people’s confidence in the platform. However, they need to get a RAID solution in place for the database (preferably with multiple RAID containers – 1 for OS, 1 for db and 1 for transaction logs) and a live backup db server in case of a logic board failure on the db server. Only at this level of redundancy will they be able to sleep at night and, hand on heart, promise data integrity to WordPress.com users.
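To show why the choice of RAID level matters for the db server, here is a rough Python sketch of which disk failures a few common levels can survive. It is a toy model and says nothing about how WordPress.com is actually set up: the function names, the level labels and the pairing of disks in the RAID 10 case are all my own assumptions for illustration.

```python
def survives(level, disks, failed):
    """Return True if the array still holds a complete copy of the data.

    level  -- "raid0", "raid1" or "raid10" (labels chosen for this sketch)
    disks  -- total number of disks in the array
    failed -- set of indices of disks that have failed
    """
    if level == "raid0":
        # Striping only, no redundancy: any failed disk loses data.
        return len(failed) == 0
    if level == "raid1":
        # Every disk is a full mirror: one surviving disk is enough.
        return len(failed) < disks
    if level == "raid10":
        # Assumed layout: mirror pairs (0,1), (2,3), ... striped together.
        # Data is lost only if both disks of the same pair fail.
        pairs = [(i, i + 1) for i in range(0, disks, 2)]
        return all(not (a in failed and b in failed) for a, b in pairs)
    raise ValueError("unknown level: " + level)


def single_failure_safe(level, disks):
    """True if the array survives the loss of any one disk."""
    return all(survives(level, disks, {d}) for d in range(disks))
```

For example, `single_failure_safe("raid0", 4)` is false while `single_failure_safe("raid10", 4)` is true. Even the redundant levels only protect against disk failure, which is why a live backup db server is still needed for anything else that can go wrong on the box.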
4 thoughts on “WordPress.com "major drive failure"”
Matt has finally made a statement about the issue.
Tom, I think you’re completely underestimating the size/potential size of WordPress.com if you’re talking about RAID solutions on a single server. With WordPress.com we’re going to be talking about literally 100s of machines.
Why exactly is it ‘lucky’ no data was lost? There’s a load of ways to back up data besides RAID. Besides the fact that RAID doesn’t even necessarily imply mirroring.
Dave, tbh I was talking about the db server – that’s the one wordpress.com reported they had problems with. I understand (in fact I hope) there is more than one physical server acting as the db server, but it is still a db server no matter how many physical servers are involved.
When I said it is lucky no data was lost, I meant that a loss of data would undermine wordpress.com users’ confidence far more than a brief outage – that comment had nothing to do with the architecture.
Yes, yes, and yes. We do have RAID 1+0 on the DB servers, a live backup running now, and have done a few other things to ensure this won’t happen again. The timing was unfortunate because things were still in transition because of the datacenter move.