      Problems this morning

      posted by clicky 5:32 PM
      Monday, April 5, 2010

      At approximately 4am PST, two separate database servers (db1 and db16) had RAID failures that caused file system corruption. Both servers kept trying to process traffic, but Linux had switched part of the file system to read-only, so no traffic data was actually being written to the hard drives. This problem lasted from approximately 4am to 7am PST. Unfortunately, the traffic data from that window is gone and unrecoverable.
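For readers who want to catch this failure mode themselves, here is a minimal sketch (not the check we actually run) of how a read-only remount can be spotted. It assumes mount information in the format Linux exposes in /proc/mounts; the device names and mount points in the sample are made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains the bare 'ro' flag.

    mounts_text is expected to look like the contents of /proc/mounts:
    device, mount point, fs type, comma-separated options, two numbers.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # 'errors=remount-ro' is a policy setting, not the 'ro' flag itself,
        # so compare whole options rather than searching for a substring.
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample: the data partition has been remounted read-only.
SAMPLE = """\
/dev/sda1 / ext3 rw,errors=remount-ro 0 0
/dev/sdb1 /var/lib/mysql ext3 ro,errors=remount-ro 0 0
proc /proc proc rw 0 0
"""

print(read_only_mounts(SAMPLE))
```

On a live Linux host, the same function could be fed the text read from /proc/mounts, and any unexpected entry in the result could be treated as an alert condition independent of the RAID controller's own notifications.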

      We have alert systems set up so that when a significant event occurs, such as a server going offline or a RAID failure, we are alerted immediately. Unfortunately, the RAID notifications on a few servers were recently disabled while we were performing some maintenance, and wouldn’t you know it, db1 and db16 were among those servers. Because of this, we weren’t notified of the problem, and didn’t discover it until we woke up to a flood of emails in our inbox this morning.

      There were no problems on other servers that we could find, but if you have a site on a server other than db1 or db16 and it’s experiencing issues, please leave a comment here explaining what’s happening. Be sure to include the site ID.

      We apologize for this issue, which we take very seriously. The RAID notifications are all back online, and we will be sure to always re-enable them immediately after this kind of maintenance in the future. Leaving them disabled was just an honest mistake.
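One way to make this kind of mistake harder to repeat is to give every maintenance silence an expiry time, so notifications come back on their own even if nobody remembers to re-enable them. The sketch below is only an illustration of that idea, not a description of our alerting setup; the class name, method names, and times are all hypothetical.

```python
from datetime import datetime, timedelta


class MaintenanceSilence:
    """Tracks per-server notification silences that expire automatically."""

    def __init__(self):
        self._until = {}  # server name -> time at which the silence ends

    def silence(self, server, now, duration):
        """Suppress notifications for `server` until now + duration."""
        self._until[server] = now + duration

    def should_notify(self, server, now):
        """True if `server` has no silence or its silence has expired."""
        expiry = self._until.get(server)
        return expiry is None or now >= expiry


# Hypothetical usage: silence db1 for a two-hour maintenance window.
silences = MaintenanceSilence()
start = datetime(2010, 4, 4, 22, 0)
silences.silence("db1", start, timedelta(hours=2))

print(silences.should_notify("db1", start + timedelta(hours=1)))  # during the window
print(silences.should_notify("db1", start + timedelta(hours=6)))  # after it expires
```

The design choice here is that a silence is stored as an end time rather than an on/off switch, so there is no "off" state that can be left behind by accident.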

      One final note: these RAID failures occurred at the exact same time on two different servers. This has happened once before, although it was three servers instead of two, and it didn’t cause any corruption that time. This seems like very strange behavior to us, and we’re not sure what could cause such a thing to happen to separate servers (that don’t talk to each other) at the exact same time. If any sysadmins out there have any ideas, please share.


