Moving RAID 10 from one Dell R410 to another

Spent all of last week fighting fires.  We have a production server that was suffering both software (application bugs) and hardware failures.  It was crashing left and right, and I got very little sleep responding to my pager and going online to restart the app and/or server.

The app (a Java app) was using too much memory, and the server just can’t take any more (we already have 32GB in it).  So we decided to throw more hw at it.  The app and Postgresql were both running on that one box (yes, I know, bad, bad, bad design — my excuse is that it was not me that set this up, I joined later).  Anyway, we brought up a new, faster server (a Dell R410) and moved the Java app over onto it, leaving Postgresql on the old server.  The plan is that if we run into problems, it’s easy to move right back to the old server.  It’s also easier and quicker this way: no downtime to take down the DB, copy data over, etc.   Besides which, the DB is currently over 65GB and will take a while to copy over.

Well, guess what…. the new R410 started experiencing hw problems!  I have RAID 10 set up on the 4 drives.  Drives 1 & 2 (one from each RAID1 element) faulted, CRAP!  Swapped drives.  Still faulting.  I got messages from the kernel (dmesg) that it kept having to rescan the SAS bus as the drives kept dropping out.  (Running CentOS 5.2 64-bit.)
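
If you need to check this kind of thing yourself, something along these lines will show it (the MegaCli path below is just the usual default for a PERC/LSI megaraid_sas controller and is an assumption on my part, adjust as needed):

    # Watch the kernel log for the SAS rescans / drives dropping out
    dmesg | grep -iE 'sas|scsi|sd[a-d]' | tail -50

    # If MegaCli is installed, list each physical drive's state --
    # anything other than "Online" is bad news
    /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -E 'Slot|Firmware state'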

Talked with Dell support…. ah, what a pain in the rear they are.  They insisted that it was a firmware issue!!!!  Google for “Dell RAID controllers rejecting non-DELL drives”.  We paid for same-day support and we wanted support now!  After a couple of hours on the phone, we got them to agree to swap the motherboard and RAID controller the next day.

In the meantime, we have another R410 sitting in the same rack (but in use).  The apps on it can be moved to another server though.  So I spent a couple of hours at the data center moving the drives from the failing R410 over to the other one.  I was afraid there might be a problem because the current state of the RAID was degraded (2 drives in the RAID10 had faulted and were still syncing).  But it worked like a charm.  Shut down both systems, swapped drives (two at a time: drive 0, drive 1, drive 2, drive 3, so I don’t mess them up).  Brought up the good R410….

It came up fine.  It saw the new RAID drives and asked if I wanted to import the foreign config.  Said yes, and pressed Ctrl-R anyway so I could check; the RAID controller saw the RAID10.  It told me that the two drives were syncing.  Great, exited out and rebooted.
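
For reference, the same foreign-config import and the rebuild check can also be done from the OS with MegaCli instead of the Ctrl-R utility.  This is just a sketch; the path, adapter number and the [enclosure:slot] value are placeholders:

    /opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Scan -a0     # list any foreign configs
    /opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Import -a0   # import them
    # Rebuild progress on one of the replaced drives ([32:1] is a placeholder)
    /opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv [32:1] -a0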

Then I noticed that this system only has 16GB RAM…. aw CRAP!  Shut it down, pulled them both off the rack, opened the cases, swapped DIMMs.  Put them both back in, booted up the good one…. held my breath…..  and YES, it came up, 32GB, saw the RAID drives…
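
A couple of quick sanity checks on the memory, nothing fancy:

    free -m                              # total should now be around 32000 MB
    dmidecode -t memory | grep -i size   # per-DIMM sizes as reported by the BIOS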

Once I got the login: prompt, I logged in and checked around, making sure everything was there.  Then I realized that the network was not up.  Spent a couple of panic-stricken minutes checking cables, switch ports, etc.  Then I remembered that with RedHat (and CentOS) the ifcfg-ethN scripts are updated at boot based on the MAC address.  Since I moved the drives to another server, the MACs changed; RH/CentOS noticed that the MAC address in the existing ifcfg-ethN did not match the current MAC and updated those files.  Luckily it renamed the existing ones to ifcfg-ethN.old.
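
For context, the piece that trips this up is the HWADDR line in /etc/sysconfig/network-scripts/ifcfg-ethN.  A minimal example (every value below is a placeholder, not the real config):

    # /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    BOOTPROTO=static
    HWADDR=00:1E:C9:AA:BB:CC    # must match the NIC in the chassis it boots in
    IPADDR=192.0.2.10
    NETMASK=255.255.255.0
    ONBOOT=yes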

I fired up vi and updated the old ifcfg-ethN.old files with the new MAC addresses, renamed them back to ifcfg-ethN (eth0 and eth1), brought the interfaces down and back up (ifdown eth0, then ifup eth0), and the network was up.
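
Putting it together, the whole fix boils down to something like this (assuming the same eth0/eth1 layout and the .old files described above):

    cd /etc/sysconfig/network-scripts
    ip link show eth0                # read the new MAC (or use ifconfig -a)
    mv ifcfg-eth0.old ifcfg-eth0     # restore the original configs...
    mv ifcfg-eth1.old ifcfg-eth1
    vi ifcfg-eth0 ifcfg-eth1         # ...and update their HWADDR= lines to the new MACs
    ifdown eth0 && ifup eth0
    ifdown eth1 && ifup eth1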

Rebooted the server just to be sure that everything works, logged in, and started up the app.  Checked from an external address (ssh to my home server, pointed my browser at squid at home) that the app was running and accessible from the outside world.
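
One way to do that kind of outside-in check, if squid at home is not reachable directly, is to tunnel to it over ssh (the host name and the 3128 port below are assumptions, 3128 being squid's usual default):

    # Forward local port 3128 to squid running on the home box
    ssh -L 3128:localhost:3128 me@home.example.com
    # Then point the browser's HTTP proxy at localhost:3128 and load the app's
    # public URL -- the request now comes in from the home connection rather
    # than from inside the data center.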

I’ve done this before, e.g. moving an entire RAID (it was RAID1 and RAID5) from one Dell server to another with identical hw, so I knew it works.  The only difference this time was the degraded state of the RAID, but I am glad that it worked fine too.

11 thoughts on “Moving RAID 10 from one Dell R410 to another”

  1. Here is another option. I would not use it on servers, but I am surprised at how well it runs: VirtualBox.

    Funnily enough, for more than a year now I’ve been using a MacBook Pro. Didn’t want Parallels or VMware; heard about VirtualBox and tried it. It’s been surprisingly usable. I can exchange vmdk’s back and forth with VMware.

    I know Sun sells it on their servers as an alternative to VMWare and Xen.

  2. Yes, KVM is interesting. I’d like to see some seasoning on it though :-). I’ll probably put it up on a non-essential service for a while and see how it behaves.

    The thing about automatic init scripts is… well, they’re automatic scripts and things can go wrong. I suppose for true production servers you can have multiple failovers and load balancing, so you can take one down and work on it without affecting the rest.

    Let me know if you try out KVM. I am moving things to EC2; time to roll up my sleeves and write a bunch of automation scripts. What they have there…. needs work.

  3. My init scripts are derived from http://www.tuxyturvy.com/blog/index.php?/archives/48-Automating-VMware-modules-reinstall-after-Linux-kernel-upgrades.html with some small tweaks to make them regular init scripts:

    For the host:

    #!/bin/bash
    #
    # config-vmware    Reconfigures the VMware modules after kernel upgrades as needed
    #
    # chkconfig: - 09 09
    # description: Reconfigures the VMware modules after a kernel upgrade as needed

    # Reinstall the VMware modules if this kernel has not been configured yet
    if [ ! -e /lib/modules/`uname -r`/misc/.vmware_installed ]; then
        /usr/bin/vmware-config.pl --default EULA_AGREED=yes
        touch /lib/modules/`uname -r`/misc/.vmware_installed
    fi

    For the VMs themselves:

    #!/bin/bash
    #
    # config-vmware-tools    Reconfigures the VMware Tools modules after kernel upgrades
    #
    # chkconfig: - 09 09
    # description: Reconfigures the VMware Tools modules after a kernel upgrade

    # The following lines auto-recompile VMware Tools when the kernel is updated
    VMToolsCheckFile="/lib/modules/`uname -r`/misc/.vmware_installed"
    VMToolsVersion=`vmware-config-tools.pl --help 2>&1 | awk '$0 ~ /^VMware Tools [0-9]/ { print $3,$4 }'`

    printf "\nCurrent VM Tools version: $VMToolsVersion\n\n"

    # Rebuild if this kernel has never been configured, or if the recorded
    # Tools version no longer matches the installed one
    if [[ ! -e $VMToolsCheckFile || `grep -c "$VMToolsVersion" $VMToolsCheckFile` -eq 0 ]]; then
        [ -x /usr/bin/vmware-config-tools.pl ] && \
            printf "Automatically compiling new build of VMware Tools\n\n" && \
            /usr/bin/vmware-config-tools.pl --default && \
            printf "$VMToolsVersion" > $VMToolsCheckFile && \
            rmmod pcnet32
        # Reload the paravirtual NIC driver so the freshly built module is used
        rmmod vmxnet
        depmod -a
        modprobe vmxnet
    fi
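
    To register them as regular init scripts (the names here just mirror the chkconfig headers above; adjust to whatever you save the files as):

    install -m 755 config-vmware /etc/init.d/config-vmware
    chkconfig --add config-vmware
    chkconfig config-vmware on    # same steps for config-vmware-tools inside the VMs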

  4. I tried Xen a bit over a year ago, but it had some issues with bonded ethernet interfaces. And it actually caused a kernel panic for the host a couple of times.

    I fixed the vmware-config issues here by writing some init scripts to run it automatically as needed.

    I’m leaning towards KVM over Xen now because it is clearly where RH is going with their support.

  5. @tin

    VMware Server 2.

    I’m looking at KVM for the longer term. VMServer2 has unfixed problems such as creeping CPU load (requiring the VMs to be stopped completely and restarted periodically) and glibc version incompatibilities. VMware seems to have unofficially orphaned VMServer2 in favor of ESXi.

    I looked at ESXi, but it is too finicky about supported hardware and is really just a loss leader for them to get you to buy the expensive management tools.

    1. You should take a look at Xen. I am using VMware now, but I’m not too happy with it. As far back as two years ago, I was able to get Xen to boot and run pretty much anything I threw at it: Linux, BSD and Win2K8 (even Vista).

      The thing I like most about Xen is that kernel updates come with compiled drivers, so I don’t have to run vmware-config every time. I can’t count the times my SAs have rebooted a system and forgotten to re-run vmware-config, and then the VMs are down for hours.

      KVM is good, but the important thing to look at is the supporting tools.

  6. @Benjamin Franz

    Yeah, I have so much Dell hw here… legacy and all that. I am experimenting with Supermicro’s (Silicon Mechanics branded) servers. Their 4-in-1 2U systems are sweet. Essentially miniblade servers.

    Heh, all my servers are CentOS 5.4, except for some legacies that are still at 4.x.

  7. “Dell RAID controllers rejecting non-DELL drives”

    I saw that a while ago. Another reason I’m not a fan of Dell servers. What’s up with the CentOS 5.2? That implies the servers are *way* out of date patch-wise.
