Spent all of last week fighting fires. We have a production server that was suffering from both software failures (application bugs) and hardware failures. It was crashing left and right, and I got very little sleep responding to my pager and going online to restart the app and/or the server.
The app (a Java app) was using too much memory, and the server just can’t take any more (we already have 32GB in it). So we decided to throw more hardware at it. Both the app and PostgreSQL were running on that one box (yes, I know, bad, bad, bad design; my excuse is that I was not the one who set this up, I joined later). Anyway, I brought up a new, faster server (Dell R410) and moved the Java app over to it, leaving PostgreSQL on the old server. The plan was that if we ran into problems, it would be easy to move right back to the old server. It’s also easier and quicker this way: no downtime to take down the DB, copy the data over, etc. Besides, the DB is currently over 65GB and would take a while to copy over.
Well, guess what… the new R410 started having hardware problems! I have RAID 10 set up on the 4 drives. Drives 1 & 2 (one from each RAID 1 element) faulted. CRAP! Swapped drives. Still faulting. The kernel messages (dmesg) showed that it kept having to rescan the SAS bus because the drives kept dropping out. (Running CentOS 5.2, 64-bit.)
Talked with Dell support… ah, what a pain in the rear they are. They insisted that it was a firmware issue!!!! Google for “Dell RAID controllers rejecting non-DELL drives”. We paid for same-day support and we want support now! After a couple of hours on the phone, we got them to agree to swap the motherboard and RAID controller the next day.
In the meantime, we have another R410 sitting in the same rack (but in use). The apps on it can be moved to another server, though. So I spent a couple of hours at the data center moving the drives from the failing R410 over to the other one. I was afraid there might be a problem because the RAID was in a degraded state (2 drives in the RAID 10 had faulted and were still syncing). But it worked like a charm. Shut down both systems, swapped drives (two at a time: drive 0, drive 1, drive 2, drive 3, so I don’t mess up). Brought up the good R410…
It came up fine. The controller saw the new RAID drives and asked if I wanted to import the foreign config. I said yes, and pressed Ctrl-R anyway so I could check; the RAID controller saw the RAID 10 and told me that the two drives were syncing. Great. Exited out and rebooted.
Then I noticed that this system only has 16GB RAM… aw CRAP! Shut it down, pulled them both off the rack, opened the cases, swapped DIMMs. Put them both back in, booted up the good one… held my breath… and YES, it came up, 32GB, saw the RAID drives…
Once I got the login: prompt, I logged in and checked around, making sure everything was there. Realized that the network was not up. Spent a couple of panic-stricken minutes checking cables, switch ports, etc. Then I remembered that with RedHat (and CentOS) the ifcfg-ethN scripts are updated at boot and use the MAC address. Since I had moved the drives to another server, the MAC addresses changed; RH/CentOS noticed that the MAC address in the existing ifcfg-ethN files did not match the current one and updated those files. Luckily it renamed the existing ones to ifcfg-ethN.old.
I fired up vi, updated the old ifcfg-ethN.old files with the new MAC addresses, and renamed them back to ifcfg-ethN (eth0 and eth1). Brought the interfaces down and back up (ifdown eth0, then ifup eth0) and the network was up.
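For what it's worth, the vi edit could be scripted. Below is a minimal Python sketch of just the text-editing part, assuming the MAC address lives on a HWADDR= line in the ifcfg file. The function name set_hwaddr and the MAC addresses are made up for illustration; this is not what I ran that night, and reading the .old file, writing the result back, and renaming it are still left to you.

```python
import re

# Six colon-separated hex pairs, e.g. 00:11:22:33:44:55
MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")
# Assumption: the MAC lives on a line of the form HWADDR=...
HWADDR_LINE = re.compile(r"^HWADDR=.*$", re.MULTILINE)


def set_hwaddr(ifcfg_text, new_mac):
    """Return the ifcfg file contents with the HWADDR line set to new_mac.

    Hypothetical helper: only transforms text, never touches files.
    """
    if not MAC_RE.fullmatch(new_mac):
        raise ValueError("not a MAC address: %r" % new_mac)
    line = "HWADDR=" + new_mac.upper()
    if HWADDR_LINE.search(ifcfg_text):
        # Replace the stale address left over from the old server.
        return HWADDR_LINE.sub(lambda m: line, ifcfg_text)
    # No HWADDR line at all: append one.
    if ifcfg_text and not ifcfg_text.endswith("\n"):
        ifcfg_text += "\n"
    return ifcfg_text + line + "\n"


# Placeholder contents and addresses, not the real ones from this server.
old_cfg = "DEVICE=eth0\nHWADDR=00:11:22:33:44:55\nONBOOT=yes\n"
print(set_hwaddr(old_cfg, "aa:bb:cc:dd:ee:ff"))
```

Every other line of the file is left untouched, so settings like the IP address and ONBOOT survive the move.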
Rebooted the server just to be sure that everything worked, logged in and started up the app. Checked from an external address (ssh to my home server, point my browser at squid at home) that the app was running and accessible from the outside world.
I’ve done this before, i.e. moved an entire RAID (it was RAID 1 and RAID 5) from one Dell server to another Dell server with identical hardware. So I knew it works. The only difference this time was the degraded state of the RAID, but I am glad that it worked fine too.