Busy Sunday
Jun. 27th, 2010 11:12 pm
I slept in this morning. While it felt utterly fantastic, it didn't exactly help me have a productive day. Ah, well... I consider it an investment in myself.
Last Friday morning, I learned one of our blades had a failed hard drive. OK, no problem: swapping a dead disk should be an easy operation. It turned into more of an educational field trip, giving me the opportunity to check into some supporting details.
For starters, I found two hard drives in my inventory. Were they set to the side as spare disks for replacement or were they sitting to the side as unreliable or dead disks? I placed both into an unused blade and ran a simple "dd" command to scan each disk to see if any read errors would be detected: there were none, but the full scan took 25 minutes for a 74GB disk.
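For anyone who would rather script that kind of read scan than eyeball "dd" output, here is a minimal Python sketch of the same idea: read the device from start to end and note any offsets where a read fails. This is not the command I ran. The function names, the 1 MiB chunk size and the device path in the usage note are assumptions for illustration only.

```python
import os

CHUNK = 1024 * 1024  # read 1 MiB at a time (arbitrary choice)


def scan_for_read_errors(path, chunk=CHUNK):
    """Read `path` sequentially; return (size_in_bytes, list_of_bad_offsets)."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size of a regular file or block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        os.lseek(fd, 0, os.SEEK_SET)
        offset = 0
        while offset < size:
            try:
                data = os.read(fd, min(chunk, size - offset))
            except OSError:
                # Record where the read failed, then skip past this chunk.
                bad_offsets.append(offset)
                offset += chunk
                os.lseek(fd, min(offset, size), os.SEEK_SET)
                continue
            if not data:
                break  # unexpected early end of device
            offset += len(data)
    finally:
        os.close(fd)
    return size, bad_offsets


def throughput_mb_per_s(total_bytes, seconds):
    """Average throughput in decimal megabytes per second."""
    return total_bytes / seconds / 1_000_000
```

Calling `scan_for_read_errors("/dev/sdb")` as root (the path is hypothetical) would return the disk size and an empty list for a clean disk. As a sanity check on the timing, `throughput_mb_per_s(74e9, 25 * 60)` works out to roughly 49 MB/s, assuming the 74GB figure is in decimal gigabytes.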
Opening up the failing blade, I was surprised to find it had two disks in it already. The RAID controller was configured to mirror the disks by default. So which disk was generating the disk errors? The obvious thing would be to remove one disk and see if the remaining disk generated errors. If not, swap the disks and retest to confirm the identity of the failing disk.
BUT! The command to break the disk mirroring generates a warning that data would be lost. Losing data is a Very Bad Thing. Does it mean that my data would be lost or just that the mirroring data would be lost and therefore a full resync would be required later? Again, I used an available blade with two disks to create a mirror (a full sync took nearly one hour), broke the mirror and then tried booting from each of the drives individually to ensure their contents were truly intact.
The next obvious question was how long a full mirror resync would take, and whether the operating system could tell if the resync had completed or was still in progress. The RAID controller itself only had a "synced"/"not in sync" boolean status with no progress meter. Still, I found Linux was logging messages to /var/log/messages which indicated the activity of the RAID controller, so I was able to note when the syncing started and, by the absence of further log messages, approximately when it ended.
Did I mention that because of the nature of the applications on the problem blade, it had to be in full multi-user mode at the top of every hour for at least ten minutes?
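The log-watching trick can be sketched in a few lines of Python: pick out the lines that mention the controller, parse the traditional syslog timestamp at the start of each, and take the gap between the first and last as a rough resync duration. The keyword to match and the function name are assumptions, since I am not reproducing the controller's actual messages here, and the year has to be supplied because classic syslog timestamps omit it.

```python
from datetime import datetime


def sync_window(lines, keyword, year):
    """Return (first, last, duration) for lines containing `keyword`, or None.

    Assumes each line starts with a classic syslog timestamp such as
    'Jun 27 11:20:01'. `keyword` is whatever string identifies the RAID
    controller's messages on your system (hypothetical here).
    """
    stamps = []
    for line in lines:
        if keyword not in line:
            continue
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        stamps.append(stamp)
    if not stamps:
        return None
    first, last = min(stamps), max(stamps)
    return first, last, last - first
```

Feeding it the lines of /var/log/messages with a suitable keyword gives the start time, the time of the last controller message, and the difference between them. It only estimates the end of the sync from when the messages stop, which is the same approximation I made by eye.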
By the time I reached this point of my experimentation, it was nearly 3am Saturday morning so I left the entire project for later in the weekend.
Today, around 11am, I headed to the data center for some remaining work. The experimentation was already done so the rest was easy. As soon as I arrived, I booted to the RAID controller menu, broke the mirror, removed the problem disk0, moved disk1 into the disk0 slot, booted to confirm all was kosher, then placed a previously checked new disk into the disk1 slot, forced mirror synchronization, booted the blade into multi-user mode and started the applications. An hour later, after the mirroring was complete, I rebooted the machine one more time to confirm all was fine.
It isn't clear to me (yet) how the IBM blade RAID disk controller operates. It presents a single device interface to the Linux operating system (/dev/sda) but has two disks mirrored to each other. The RAID controller identifies disk0 as the primary and disk1 as the mirror. So, is it actually spreading reads/writes across both disks, alternating reads between them or abusing the primary disk for reads and mirroring writes to the secondary disk? For operations, it doesn't really matter as it is all invisible to the operating system and data but I'd like to know.
I still need to open a support case with IBM for the disk replacement.
I was home again around 2pm and promptly flaked out for a short while.
kent4str and I spent nearly an hour digging out some details to help the DC Diamond Circulate treasurer figure out some allocations for the final convention report due in Chicago later this week. Her computer suffered a disk crash a month ago and while she's been able to recover much of the data, there were still some chunks left in limbo. Witness the importance of backups, people!
This evening, we went into downtown DC to collect bibliocub and then skipped over into Arlington to pick up Jeff M. for a Mexican dinner in Del Ray, VA. I keep telling myself I have to stop stuffing my face with the complimentary chips & salsa so I'll have room for my entrée, but I blow it every time.
After dinner, we walked a few blocks south to Dairy Godmother for ice cream. I had the chocolate brownie sundae. Again, I ate too much. I really need to cut back.
We're home again and I still need to iron shirts so I'd better get busy!