Hi guys. It was a long night, so I'll keep this relatively brief! Firstly, we are up and running with no data loss (not that you should expect any less). It turns out the boot drive in the main server decided to lose a few bits which were quite important to booting. Read on for more technical details. I will be taking measures to ensure a faster recovery should this happen again. Thanks for your patience.
Please take this moment to follow/bookmark @osustatus. I use it to get information out when the osu! website is not available.
Note that replay data will not be available for another hour or two while it is checked for consistency.
---- technical stuff below ----
The problem was blatantly obvious: grub refused to recognise the boot partition. Usually this would be a very simple fix -- pop in a rescue CD, run fsck on the damaged partition, then reboot. Unfortunately, due to the remote location of the server, there is quite a delay in communication and in the actions of the remote hands. It took around 9 hours of communication to get KVM access (keyboard/video/mouse, i.e. control at a very low level) to the server, and then a while longer to get a CD drive attached with a rescue CD.
Unfortunately, they decided to give me a dated CentOS 5 rescue disk, which didn't have the reiserfs repair tools on it (note for the future: use ext3/4 for boot partitions, just so they are standard). Installing an rpm is near impossible on a rescue CD due to dependencies, so I decided to try my luck at compiling the required binary on my home server and transferring it across. This worked wonders.
The actual repair process (from the point of having access to a rescue CD) took around 26 minutes, for what it's worth.
It is interesting to note that the reiser node tree was in perfect shape. Replaying the transaction log was enough to fix the problem. Kind of makes you wonder...