Problems this morning
At approximately 4am PST, two separate database servers (db1 and db16) had RAID failures that caused file system corruption. The servers kept trying to process traffic, but Linux had switched part of the file system to read-only, so no traffic data was actually being written to the hard drives. This problem lasted from approximately 4am to 7am PST. Unfortunately, this traffic data is gone and unrecoverable.
We have alert systems set up so that when a significant event occurs, such as a server going offline or a RAID failure, we are alerted immediately. Unfortunately, the RAID notifications on a few servers were recently disabled while we were performing some maintenance, and wouldn't you know it, db1 and db16 were among those servers. Because of this, we weren't notified of the problem and didn't discover it until we woke up to a flood of emails in our inbox this morning.
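For readers curious what an independent check for this failure mode might look like, here is a minimal sketch. It is not the alerting described above. It assumes a Linux host whose mount table is readable at /proc/mounts; the function names and the list of ignored file system types are hypothetical choices made for illustration.

```python
# Sketch: flag file systems that Linux has mounted (or remounted) read-only.
# Each /proc/mounts line has the form:
#   device mount_point fstype options dump pass

# File system types that are not backed by a data disk (illustrative list).
IGNORED_FSTYPES = ("proc", "sysfs", "tmpfs", "devpts", "iso9660")


def read_only_mounts(mounts_text, ignored_fstypes=IGNORED_FSTYPES):
    """Return the mount points whose option list contains the 'ro' flag."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in ignored_fstypes:
            continue
        # Compare whole options so 'errors=remount-ro' does not count as 'ro'.
        if "ro" in options.split(","):
            flagged.append(mount_point)
    return flagged


def check_proc_mounts(path="/proc/mounts"):
    """Read the live mount table and return any read-only mount points."""
    with open(path) as handle:
        return read_only_mounts(handle.read())
```

Running check_proc_mounts() from a scheduled job and sending a message whenever it returns a non-empty list would give a second signal that does not depend on the RAID controller's own notifications.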
There were no problems on other servers that we could find, but if you have a site on a server other than db1 or db16 and it's experiencing issues, please leave a comment here explaining what's happening. Be sure to include the site ID.
We apologize for this issue, which we take very seriously. The RAID notifications are all back online, and we will be sure to always re-enable them immediately after this kind of maintenance in the future. Leaving them disabled was just an honest mistake.
One final note: these RAID failures occurred at the exact same time on two different servers. This happened once before as well, although it was three servers instead of two, and it didn't cause any corruption last time. This seems like very strange behavior to us, and we're not sure what could possibly cause such a thing to happen to separate servers (that don't talk to each other) at the exact same time. If any sysadmins out there have any ideas, please share.
19 comments | Sep 02 2009 8:44am