Thursday, July 22, 2010

RAID is not *Backup*

Well, I've been hearing this for a few years now. RAID can never protect you from human error, e.g. a mistakenly issued DROP DATABASE command. :)

Some people believe RAID 1 saves them from the risk of data loss due to hardware failure, since they have two copies of everything. That's only partially true. RAID 1 can save you from data loss due to 'a hard drive failure', not from any hardware failure. A RAID 1 system includes a minimum of two hard drives (possibly three, with a hot spare) and a RAID controller card. Everything on the primary HDD is mirrored onto the secondary one. So even if the primary HDD fails, we can continue our business, since every bit of data still exists on the secondary HDD. Both HDDs failing at the same time is very rare, roughly a 0.01% chance. Looks promising, right?
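
One way to arrive at a figure like 0.01% is to assume each drive has a 1% chance of failing within some window and that the two failures are independent, which gives 1% x 1% = 0.01%. Here is a minimal sketch of that arithmetic in Python. The 1% value and the function name `both_fail_probability` are assumptions for illustration only, not measured numbers. Drives in the same box share a controller, a power supply and often a production batch, so real failures are not truly independent.

```python
def both_fail_probability(p_single: float, drives: int = 2) -> float:
    """Chance that every mirrored drive fails in the same window.

    Assumes failures are independent, which is optimistic for drives
    that share a controller and a power supply.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return p_single ** drives


if __name__ == "__main__":
    # Hypothetical 1% per-drive failure chance, two-drive mirror.
    print(f"{both_fail_probability(0.01):.4%}")
```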

But as I said above, a RAID controller is involved in every RAID system, and it plays the main role. In most cases, when a RAID controller fails, it may issue improper I/O to the hard drives, which can leave your file system corrupted. The chance of a controller failing is somewhat higher with integrated RAID cards. Dedicated RAID controllers are much better in this respect.


I had this problem once, a year ago, on a Windows system running on RAID 5. The system hung and gave a BSOD on the next reboot (still a BSOD even on other known-good machines). That time I was lucky enough to copy out all the data using a Linux live CD, and it was a development server, so it was not too critical.

Just a few days ago, one of our clients called us and said their payment gateway system was not accessible. What a generic report from a user: 'not accessible', with no specific details. I could browse to their admin web UI, ssh in and look around. But when I tried to create a file using 'touch', I got an error. Well, this was it. Initially I was totally calm, as I knew this system was running on two mirrored HDDs. If an HDD fails, I can just swap it with a new one and bingo. That is what I had in mind while travelling down to the data center.
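
The 'touch' test can be scripted as a quick write probe. This is a minimal sketch, assuming Python is available on the machine; `can_write` and the probe file prefix are hypothetical names, not part of any tool mentioned here.

```python
import os
import tempfile


def can_write(directory: str) -> bool:
    """Return True if a small file can be created, synced and removed."""
    try:
        fd, path = tempfile.mkstemp(dir=directory, prefix=".write_probe_")
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # push the data to the disk, not just the page cache
        finally:
            os.close(fd)
            os.remove(path)
    except OSError:
        # Read-only remount, I/O error, missing directory, no permission, ...
        return False
    return True


if __name__ == "__main__":
    print(can_write(tempfile.gettempdir()))
```

A False result only says that writing failed. It does not say whether the cause is a disk, a controller or a permission problem.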

But I was not that fortunate. I was surprised to see both HDDs having the same I/O issue. The magic of fsck could not help me this time. I tried those HDDs in another identical spare machine: kernel panic. I whispered, "That's good. You give me some interesting bullsh*t for today". Running ServeRAID on the problematic machine, I could see that it could not detect the HDDs all the time, not even the new HDD I had just brought from the office. The only thing I could do was back up the databases and some data from Linux rescue mode. Yes, some lib directories were not even browsable at this point.

I was left with no choice but to prepare a new server as a replacement. The existing system had a self-compiled Apache and some modules, including OpenSSL and some libraries that support TLS. It's not difficult, but it takes time. It was already 11pm when I started installing the new server. Well, that was an almost overnight working day.


When a RAID controller is gone, there is a chance that all your data is gone or corrupted too. So backup is a 'must', even if you are running on RAID 50. Be planned, be prepared.
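
As a starting point, a backup can be as simple as a dated archive copied somewhere that does not hang off the same RAID controller. Here is a minimal sketch in Python; the function names `backup_directory` and `sha256_of` and the archive naming scheme are my own assumptions for illustration. A real setup would also need database dumps, rotation and an occasional test restore.

```python
import hashlib
import tarfile
import time
from pathlib import Path


def backup_directory(source: str, dest_dir: str) -> Path:
    """Pack source into a timestamped .tar.gz under dest_dir and return its path."""
    src = Path(source)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{src.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)
    # Reopen and list the members to confirm the archive is at least readable.
    with tarfile.open(archive, "r:gz") as tar:
        tar.getmembers()
    return archive


def sha256_of(path: Path) -> str:
    """Checksum to store next to the archive and compare after copying it off-site."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The destination should be on another machine or at least another disk and controller. An archive sitting on the same array fails together with it.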



Zephyr said...

I also hate the payment gateway ...
Though the developer is quite smart and put the credit info in plain text :)

Zephyr said...

Someone phoned me that night ...
Indeed, you should also mention that if someone gets 70000 SGD for this small project, they shouldn't use such a lousy 500 SGD server :).

Just to be fair to the technicians and engineers, they should buy a better and more reliable one :).
On a new server, this issue should occur maybe only after 20 years, or at least 7 years ... :)

Btw, I cannot watch a movie today or tomorrow .. :)

Will watch on Sunday :)