Unlucky RAID 1

The scenario…

A Dell 2800 with 2x36Gig (RAID 1 – system boot) and 3x146Gig (RAID 5 – data), acting as primary file server and primary AD controller.

At 12PM yesterday Dell Server Manager began reporting a fault on disc 1 (36Gig). The first step in this scenario is always to reseat the disc to rule out any connection issues.

This did indeed bring both drives up again, only to be followed 5 minutes later by a failure of both drives in the set.

Rebooting the server into the RAID BIOS revealed that the faults were on disc 0, not disc 1 (as reported by Dell Server Manager).

Lesson 1 – never underestimate the power of the hot spare.

At this point the company was without file, print and primary AD services.

We decided to remove disc 0 (which the BIOS said was faulty) and bring disc 1 (which the BIOS said had no faults) back online.

Unfortunately a reboot in this scenario forced a Windows chkdsk, which found faults on the drive. Although the system booted, AD did not come up and the system was effectively broken.

We had no spare chassis available to drop the data disks into, so we were in a fast reinstall situation.

The first step, however, was to seize the FSMO roles with NTDSUtil onto one of our backup AD controllers, and clean up all references to the old primary AD controller.
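For anyone who has not done a seize before, ntdsutil will accept its menu commands as command-line arguments, so the whole sequence can be written down in advance. The Python below is only a rough sketch of how that argument list might be put together, not the exact commands we ran: the helper name, the role keys and the DC name "DC2" are all made up for illustration, and the wording of the naming master command depends on the Windows version (older ntdsutil builds use "seize domain naming master"), so check it against your own server before relying on it.

```python
# Hypothetical helper: builds the argument list for a non-interactive
# ntdsutil FSMO seizure. Nothing here runs ntdsutil; it only assembles text.

# Role keys are invented for this sketch; the values are ntdsutil menu commands.
ROLE_COMMANDS = {
    "schema": "seize schema master",
    "naming": "seize naming master",
    "infrastructure": "seize infrastructure master",
    "rid": "seize RID master",
    "pdc": "seize PDC",
}


def build_seize_args(target_dc, roles=tuple(ROLE_COMMANDS), legacy_naming=False):
    """Return the ntdsutil argument list that seizes `roles` onto `target_dc`.

    legacy_naming=True swaps in the older "seize domain naming master"
    wording (assumption: needed on Windows 2000/2003-era ntdsutil).
    """
    # Enter the roles menu, connect to the DC that will take the roles,
    # then leave the connections submenu.
    args = ["ntdsutil", "roles", "connections",
            f"connect to server {target_dc}", "quit"]
    for role in roles:
        command = ROLE_COMMANDS[role]
        if role == "naming" and legacy_naming:
            command = "seize domain naming master"
        args.append(command)
    # Leave the roles menu, then ntdsutil itself.
    args += ["quit", "quit"]
    return args


if __name__ == "__main__":
    # "DC2" is a placeholder name for the surviving domain controller.
    for arg in build_seize_args("DC2", legacy_naming=True):
        print(arg)
```

Cleaning up the references to the dead controller is a separate ntdsutil menu (metadata cleanup) and is not covered by this sketch.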

While these changes were propagating through AD we began the reinstall process. As we had no spare disks we were forced to reinstall to the one disk we thought was good.

All went well, and we were smart and didn’t promote the server back to an AD controller yet.

Lesson 2 – only bring a server up as an AD controller when it’s known to be good.

Unfortunately later that day the server failed again. Fortunately we were able to bring the disk online again without a problem.

The next day the first new disk arrived. We took the decision not to install it yet, as we felt mirroring might force the remaining dodgy disk to fail…

… which it dutifully did.

We then took the decision to install the spare and let the RAID BIOS resync the drives without booting the OS. We felt we could have done it on the fly with the OS running, but that would have given us a greater chance of the mirroring failing.

It took about 40 minutes for the mirror to complete.

We then removed the dodgy disk from the mirror, leaving us running on 1 disk until a further warranty replacement arrived.

Lesson 3 – always get disks of different ages, batches and brands if possible; two disks from the same batch can fail at the same time.

Sigh…

Author: Ben King

My name is Ben King, I am a director of an Internet solutions company called bit10 ltd. My ultimate responsibility is to bring in the work that bit10 delivers. However I also do a myriad of other things, for example system design, and administration. Outside work I go out, I drink, I socialise, I cook, I have fun, oh and I play a little bit too much World of Warcraft!
