Update September 2006

One of the RAID drives in the system described on the previous page failed. The good news is that I had mdadm monitoring the RAID devices, so I knew almost immediately: I received an email message when it happened. The other good news is that, because of the redundancy of the RAID, I didn't lose any data. The bad news, of course, is that I had to replace the failed drive.

Not one to be satisfied with just a simple replacement, I decided to upgrade from this:

 
               /dev/hde  /dev/hdg  /dev/hdi  /dev/hdl
Size           120GB     120GB     80GB      80GB
SWAP           256KB     256KB     256KB     256KB
RAID5 (240GB)  80GB      80GB      80GB      80GB
RAID1 (40GB)   40GB      40GB
 
 

to the following configuration. I would replace the two 80GB drives with 250GB drives and reconfigure the RAID devices.

 
               /dev/hdf  /dev/hdh  /dev/hdi  /dev/hdk
Size           250GB     250GB     120GB     120GB
SWAP           1GB       1GB
RAID5 (240GB)  120GB     120GB     120GB     120GB (spare)
RAID1 (120GB)  120GB     120GB
 
 

The benefit of the new configuration was a spare drive in addition to the increased room. The spare would work fine if the other 120GB drive failed. If one of the new 250GB drives failed, the spare could stand in for that drive's partition in one RAID device or the other, but not both. However, since both RAID devices already had redundancy built in, any single drive failure would still be tolerated without loss of data. In my experience, failures had been bad sectors rather than an entire drive dying. In that case the spare would work, as it could replace any one partition.

I gained a little bit of space. The RAID5 is still 240GB (3x120 - 120). The RAID1 is now 120GB rather than 40GB, so I have an extra 80GB of redundant storage.
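The capacity figures above can be checked with a few lines of shell arithmetic. This is only a sketch: the function names are mine, and it assumes the usual rules that RAID5 gives up one member's worth of space to parity and that RAID1 holds a full copy on every member.

```shell
# Usable capacity in GB for the two RAID levels used on this page.
# raid5_capacity MEMBERS SIZE_GB: one member's worth of space goes to parity.
raid5_capacity() {
    echo $(( ($1 - 1) * $2 ))
}

# raid1_capacity SIZE_GB: every member holds a full copy of the data.
raid1_capacity() {
    echo "$1"
}

old_raid5=$(raid5_capacity 4 80)     # four 80GB partitions
new_raid5=$(raid5_capacity 3 120)    # three 120GB partitions, spare not counted
old_raid1=$(raid1_capacity 40)
new_raid1=$(raid1_capacity 120)

echo "RAID5: ${old_raid5}GB -> ${new_raid5}GB"
echo "RAID1: ${old_raid1}GB -> ${new_raid1}GB (gain $(( new_raid1 - old_raid1 ))GB)"
```

The spare does not add capacity; it only sits idle until a member fails, which is why the RAID5 stays at 240GB even though it now involves four partitions.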

I decided on a spare because I had a double failure on my Windows RAID5 on a Promise RAID controller. The drives didn't fail at the same time, but when the second drive failed I had a crippled system. I missed the first failure, probably because of a poor configuration. The Promise RAID controller will report problems by email, but I didn't have it set up. The good news is that the Promise controller refused to mark the second drive as failed. It limped along until I noticed the problem when I couldn't read a file over the network. I salvaged most of the data (all of the valuable stuff) by copying it from the crippled RAID drive to my Linux system and getting the rest from backups. I took the opportunity to replace all the drives on that system and rebuild it. I went with a three drive RAID5 with a spare. But I digress.

Step 1:    Replace the drive.

I used methods similar to those I had used previously to replace the drives and reconfigure the system. I started by replacing the failed drive with one of the new 250GB drives. I formatted it as a single ext3 Linux partition and copied my /home (/dev/md5) file system to the new drive.

[root@wysenburg /]#mkdir /mnt/newhome
[root@wysenburg /]#mount /dev/hdk1 /mnt/newhome
[root@wysenburg /]#cp -ax /home/* /mnt/newhome

I then unmounted /home (/dev/md5) and mounted the new drive (/dev/hdk1) as /home. Nobody was the wiser. Of course, I don't have a lot of users. I then replaced the second 80GB drive with the other new 250GB drive. The replacements required shutdowns, as my IDE ATA drives aren't hot pluggable. If they had been SATA (as opposed to normal IDE ATA) drives, they could have been replaced without shutting down the system. Just remove the drive from the RAID,

[root@wysenburg /]#mdadm /dev/md5 -f /dev/hdi1 -r /dev/hdi1

replace the SATA drive with the new one and add it back in:

[root@wysenburg /]#mdadm /dev/md5 -a /dev/hdi1

Don't forget to partition the drive and set the partition type to Linux RAID (fd) before you add it to the RAID device. SATA drives are made to be hot pluggable, so you can safely unplug them while your system is still running. I've done that on my workstation. The drivers for the SATA drive have to be able to recognize the new drive when you plug it back in. The Windows nVIDIA drivers for my ASUS A8N motherboard do just that. In fact, there is an icon in the system tray that allows me to stop the drive before I remove it. I haven't tried this on a Linux system, so I don't know if it will work. But I digress again.

Step 2:    Reconfigure the RAID devices

I was completely reconfiguring the RAID devices rather than just replacing the drive, so it took a little more work. Once I had the /home file system copied, I shut down the system and replaced the other 80GB drive with the second new 250GB drive. Linux had a problem, as the RAID5 device that I had mounted as /home was no longer a valid device. It was missing two drives, so it failed to start. Linux complained at boot and dropped me into a maintenance mode. I edited the /etc/fstab to mount /dev/hdk1 as /home.

# This file is edited by fstab-sync - see 'man fstab-sync' for details
/dev/md1 / ext3 defaults 1 1
none /dev/pts devpts gid=5,mode=620 0 0
/dev/hdk1 /home ext3 defaults 1 2
none /proc proc defaults 0 0
none /dev/shm tmpfs defaults 0 0
/dev/hdf1 swap swap defaults,pri=1 0 0
/dev/hdh1 swap swap defaults,pri=1 0 0
#/dev/hdi1 swap swap defaults,pri=1 0 0
#/dev/hdl1 swap swap defaults,pri=1 0 0

I was losing a couple of the swap partitions, so I commented those out of the /etc/fstab as well. When I rebooted, the system came up just fine. I had enough spare drives to rebuild the RAID devices as shown in the chart above. My system now looked like this:

          Size   SWAP   RAID5  RAID1
/dev/hde  120GB  256KB  80GB   40GB
/dev/hdg  120GB  256KB  80GB   40GB
/dev/hdi  250GB  250GB (not used)
/dev/hdk  250GB  250GB (/home)

It was still running as the root device (RAID1) was valid and I had moved /home to /dev/hdk1.

The new 250GB drives had 30401 cylinders, compared to 14596 for the old 120GB drives. I settled on using 132 cylinders for swap and 2*14596 cylinders for the RAID partitions on the 250GB drives, and a single 14596 cylinder partition on the 120GB drives. This still left 1077 cylinders, or about 8GB (1077 cyl * 255 heads * 63 sectors * 512 bytes), of leftover space on each of the 250GB drives. However, if I was going to use one of the 120GB drives as a spare, the partitions on the 250GB drives could be no larger than that drive, since a spare has to be at least as large as the partition it replaces. There was no other solution. The two 8GB leftover partitions came in handy later. I partitioned /dev/hdi with one swap and two Linux RAID partitions:
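The cylinder arithmetic is easy to get wrong by hand, so here is a small sketch of it in shell. The geometry (255 heads, 63 sectors, 512 byte sectors) and cylinder counts are the ones quoted above; the variable names are only for illustration, and sizes are in decimal megabytes.

```shell
# Drive geometry as reported for these drives.
heads=255
sectors=63
sector_bytes=512
cyl_bytes=$(( heads * sectors * sector_bytes ))    # bytes per cylinder

total_cyl=30401    # cylinders on a 250GB drive
swap_cyl=132       # swap partition
raid_cyl=14596     # one RAID partition = a whole 120GB drive

# Whatever is not swap or RAID is left over at the end of the drive.
left_cyl=$(( total_cyl - swap_cyl - 2 * raid_cyl ))

swap_mb=$(( swap_cyl * cyl_bytes / 1000000 ))
raid_mb=$(( raid_cyl * cyl_bytes / 1000000 ))
left_mb=$(( left_cyl * cyl_bytes / 1000000 ))

echo "swap:     ${swap_cyl} cyl = ${swap_mb} MB"
echo "raid:     ${raid_cyl} cyl = ${raid_mb} MB each"
echo "leftover: ${left_cyl} cyl = ${left_mb} MB"
```

Sizing the RAID partitions in whole cylinders to match the 120GB drive exactly is what lets that drive act as a spare for any of them.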

          Size   SWAP   RAID5  RAID1
/dev/hdi  250GB  1GB    120GB  120GB

As mentioned, the RAID5 device no longer automatically assembled as it lost two drives. I removed the mirror of the RAID1

[root@wysenburg /]#mdadm /dev/md1 -f /dev/hdg3 -r /dev/hdg3

deleted the partitions on /dev/hdg and created a single new Linux RAID partition. I now had enough partitions to create the two new RAID devices, each with a missing part.

          Size   SWAP   RAID5  RAID1
/dev/hde  120GB  256KB  80GB   40GB
/dev/hdg  120GB         120GB
/dev/hdi  250GB  1GB    120GB  120GB
/dev/hdk  250GB  250GB (/home)

I created the new RAID devices with the following commands:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hdi3 missing
mdadm --create /dev/md5 --level=5 --raid-devices=3 /dev/hdg1 /dev/hdi2 missing

Previously the device number reflected the RAID level, but I still had the root running on /dev/md1, so I had to name the new RAID1 device /dev/md0. I think I have seen documentation showing that RAID devices can be named anything you want, but I stuck with the conventional naming. The devices started in degraded mode as soon as I created them. I formatted them as ext3, mounted them, and copied the root from /dev/md1 to /dev/md0 and /home from /dev/hdk1 to /dev/md5 with the commands:

[root@wysenburg /]#mkdir /mnt/newhome
[root@wysenburg /]#mount /dev/md5 /mnt/newhome
[root@wysenburg /]#cp -ax /home/* /mnt/newhome

[root@wysenburg /]#mkdir /mnt/newroot
[root@wysenburg /]#mount /dev/md0 /mnt/newroot
[root@wysenburg /]#cp -ax / /mnt/newroot
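Arrays created with a "missing" member run degraded until the last partition is added, so it is worth knowing which arrays are in that state before relying on them. The kernel reports this in /proc/mdstat, where a status such as [UU_] or [U_] marks a missing member with an underscore. The helper below is a minimal sketch of reading that file; the name degraded_arrays is my own, not part of mdadm.

```shell
# degraded_arrays: read /proc/mdstat-style text on stdin and print the name
# of every array whose status field (e.g. [UU_]) shows a missing member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/      { name = $1; next }   # array header, e.g. "md5 : active raid5 ..."
        /\[[U_]*_[U_]*\]/  { print name }        # an underscore marks a missing member
    '
}

# On a live Linux system, list the degraded arrays (prints nothing if all are healthy).
if [ -r /proc/mdstat ]; then
    degraded_arrays < /proc/mdstat
fi
```

Right after the two --create commands above, this would be expected to list both md5 and md0, and to list nothing once the remaining partitions have been added and the resync has finished.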

I then modified /etc/fstab on the new root:

# This file is edited by fstab-sync - see 'man fstab-sync' for details
/dev/md0 / ext3 defaults 1 1
none /dev/pts devpts gid=5,mode=620 0 0
/dev/md5 /home ext3 defaults 1 2
none /proc proc defaults 0 0
none /dev/shm tmpfs defaults 0 0
/dev/hdi1 swap swap defaults,pri=1 0 0
/dev/hdk1 swap swap defaults,pri=1 0 0

Step 3:    Sinking into Disaster

The swap partition wasn't created on /dev/hdk yet and I hadn't formatted the new swap on /dev/hdi1 but that didn't stop the system from rebooting properly - or not. I ran into a lot of problems here. First, I hadn't changed /boot/grub/grub.conf to find and boot from the new root partition (/dev/md0). I also moved my drives around as I wanted the larger drives as the first drives. I was thinking of using the little spare part at the end of the drive as a boot partition. Of course the system didn't boot as the boot drive was now the third drive.

First things first. The motherboard is an old Intel server board, so it had a lot of fancy things like switching the order of the drives, even though the drives were connected to Promise Ultra100 TX2 IDE PCI cards. I had two of the Promise cards, each with two channels, so that the drives were each on a separate channel. The new drives were jumpered to CS (cable select) while the old drives were jumpered to M (master). The 120GB drives (farthest from the cards) were on the end of the ribbon cables. They were recognized as masters (jumpered). The two new drives were on the first connector of the ribbon cable, so they were recognized as the slave drives of that channel - CS (cable select). The Promise cards reported the drives as D1, D2, etc., with the first card (order of PCI slots) reporting the first drive. I had D1, D3, D4 and D6, which Linux recognized as shown in the tables above.

The odd drive is the one original 80GB drive that was recognized as /dev/hdl (first table on this page). I had a faulty cable (pulled too hard on it once) and it eventually fried the master channel on the Promise TX2 card. The slave channel worked so I jumpered that drive as a slave. I had since replaced the card and cable but left the drive as a slave.

As mentioned, I wanted the larger drives as the first drives, but I also wanted to keep some order in the way they were stacked in the case, so I switched the cards around and switched the cables. That completely reversed the numbering of the drives from what it was before. So basically, after switching the cards and cables around, I had total confusion. I knew the order of the drives as I kept track (by marking the cables), but I also used the BIOS setting to make the third drive (originally the first drive) the first drive so that I could boot off it. Well, it turns out that GRUB reads the drive order from the BIOS, but Linux finds the drives in the order they are connected. It only took me until 4AM to find that out. Just a few long hours before, I had this thing running well enough to create RAID devices and copy things around. One good thing is that I now had two copies of everything, or so I thought.

To complicate things further, the motherboard was faulty and one processor wouldn't boot so I switched motherboards - a couple of times. It turns out that the other motherboard (also a dual PIII) didn't have the capacity to boot from the Promise cards. Normally setting the BIOS to boot from SCSI solves this problem. That didn't work, but then I couldn't get either board to boot so I didn't know that at the time. 

Step 4:    Recovering from Disaster

I solved the problem by adding a fifth drive and installing FC5 (Fedora Core 5) on it. Even though I had mixed up the drives, the RAID devices still assembled and I had /dev/md0, /dev/md1 and /dev/md5. The data was all intact. I then added an entry in /boot/grub/grub.conf to boot my system from /dev/md0. That worked, but I had a dual processor board and forgot to boot to the SMP image. It crashed on boot and I went to bed at 6AM, only to get up two minutes later when I figured that out. At 6:05AM the system was running again, but it was just a pile of stuff in the case. I left it like that and went for some sleep.

The next (same) day I got up and fiddled with it some more, this time with single deliberate moves towards the configuration I wanted with the stuff I had. Switching the drives worked fine. I was booting off /dev/hda, and the entry in /boot/grub/grub.conf referenced the RAID device /dev/md0. The order of the drives on the Promise cards didn't matter, as the RAID devices all assembled properly. I repartitioned the second 250GB drive with a swap and two Linux RAID partitions and added these to /dev/md0 and /dev/md5.

[root@wysenburg /]#mdadm /dev/md5 -a /dev/hdf2
[root@wysenburg /]#mdadm /dev/md0 -a /dev/hdf3

          Size   SWAP   RAID5  RAID1
/dev/hdf  250GB  1GB    120GB  120GB
/dev/hdh  250GB  1GB    120GB  120GB
/dev/hdi  120GB         120GB
/dev/hdk  120GB  256KB  80GB   40GB

I also formatted the swap partitions, modified the /etc/fstab and started them.

[root@wysenburg /]#mkswap -c /dev/hdf1
[root@wysenburg /]#mkswap -c /dev/hdh1
[root@wysenburg /]#swapon -a

The next step was to get the system to boot from the Promise controller. I tried to get GRUB to install on the new drive:

[root@wysenburg /]#grub
grub>root (hd0,2)
grub>setup (hd0)
grub>quit

It gave me all the feedback, in that it found the partition and stage1, stage 1.5, etc. and said that it installed properly. But it didn't and wouldn't. After some experimenting I found it also couldn't find the last two drives, or any drive on the second controller. That didn't really matter, as I wanted to install on the first drive. It wouldn't boot, so I switched motherboards - again, but still no luck. I was convinced that I was doing something wrong, so I installed FC5 to the 8GB partition at the end of the 250GB drives. I also let it create a separate boot partition, something I thought I would use the leftover space for anyhow. This is when I discovered that I couldn't get that motherboard to boot from the Promise cards. FC5 installed just fine but just wouldn't boot.

Another system I have has an ABIT BP6 dual Celeron board with HighPoint controllers for the ATA66 IDE connectors. I put in a Compaq SMART2P RAID controller. The only way that thing boots from the RAID controller is if I leave the floppy controller active even though it doesn't have a drive. I tried all sorts of things with the Gigabyte GA-6BXDS but it wouldn't boot from the Promise cards. I put back the Intel board and while it only booted with one processor active, it booted just fine to the FC5 installation. I added an entry for the Linux image on /dev/md0 to /boot/grub/grub.conf and we were up and running with the configuration I wanted.

 

 




©1998-2004 Digital Mapping Systems
Maintained by: WebMaster@DigitalMapping