Then there are the days when you don’t *want* to be the systems administrator of a film company. Take yesterday, for example. At around 4pm, one of the 5 main servers, the one that carries over 4 terabytes of information, decides that it’s not cool to work anymore, burns up all the fans in the process, and refuses to boot up again. Panic sets in. What went wrong? How do I fix it? How quickly can I get it back online? These are just some of the questions running through my mind.
Haul the server out of the server rack, and take a looksie. Repair the fans that are needed to cool it, hoping that this is all that went wrong. 6:30pm, fans fixed. Fire up the server again… Nothing. Panic some more.
Let’s freeze there for a few seconds… Why panic in a situation like this, you ask? Well, a couple of reasons. The server that has died has a RAID array of eight 500GB hard drives. These drives are all striped together into two big arrays, making them look like 1TB each, along with a mirror. This means that this server, now dead, is the only machine that knows how these drives are lined up and, subsequently, the only machine that can actually read the data on them. Plug those same 8 drives into any other machine and the contents would just look like garbage. It’s freaky. Things like this are very freaky. They are made even more freaky by the fact that the information (whilst I cannot explain exactly what is on the drives on a public web page) is incredibly important. And on top of that, although backups are made of the drives, the last backup was a week ago, and in that week we’ve done more work on that server than in the previous 2 months. This means huge data loss. Project goes on hold for a week whilst we scramble for the data… who knows. Well, I know, and that’s why I was stressing so much.
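For the curious, here is a rough sketch of why striped drives read as garbage anywhere else. It is a toy model in Python, not what my RAID controller actually does: the function names, the 4-byte chunk size and the 4-drive layout are all made up for illustration, and the mirror is left out entirely. The point is only that the data comes back intact when you know both the drive order and the chunk size, and not otherwise.

```python
from typing import List


def stripe(data: bytes, n_drives: int, chunk: int) -> List[bytes]:
    """Deal chunk-sized pieces of data round-robin across n_drives (RAID 0 style)."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), chunk):
        drives[(i // chunk) % n_drives] += data[i:i + chunk]
    return [bytes(d) for d in drives]


def unstripe(drives: List[bytes], chunk: int) -> bytes:
    """Rebuild the data by reading one chunk from each drive in turn."""
    out = bytearray()
    offset = 0
    while any(offset < len(d) for d in drives):
        for d in drives:
            out += d[offset:offset + chunk]
        offset += chunk
    return bytes(out)


# Hypothetical layout: 4 drives, 4-byte chunks (real arrays use far larger chunks).
message = b"The quick brown fox jumps over the lazy dog"
drives = stripe(message, n_drives=4, chunk=4)

# Right drive order and right chunk size: the data comes back intact.
restored = unstripe(drives, chunk=4)

# Same drives plugged in the wrong order, or read with the wrong chunk size:
# every byte is still there, but the result is scrambled.
wrong_order = unstripe(list(reversed(drives)), chunk=4)
wrong_chunk = unstripe(drives, chunk=2)
```

In this toy, `restored` matches `message` while `wrong_order` and `wrong_chunk` do not. The knowledge of the order and the chunk size is exactly what lived on the dead server’s controller.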
Back to the story… So, after 7 hours of overhaul, I managed to get the exact same RAID array working on another PC, by transferring the entire RAID controller from the broken machine into one of my other servers, and then replicating the exact same arrangement to trick the controller into thinking nothing went wrong. It works! Hooray! But that’s not where the story ends… You now have 4TB of hard drives hanging off the side of a server that is also used for critical information… that server is now offline whilst you work on restoring and temporarily moving the information off of it. There’s no room for any other hard drives in any of the other servers because they’re all pretty much running at capacity. It’s tight, but I manage to sort it out, moving all of the offline server’s information onto another hard disk so that I didn’t have to move the 4TB of data anywhere.
So, at midnight, I walked out of here, everything pretty much sorted, but so incredibly not keen for today. Today is the day that everything has to be re-thunk. Thunking sucks.
1 Comment so far
Sounds like system administrator karma. This post illustrates the effect. A previous post illustrates the cause. Just a theory.
Good luck with that thunking though.
Comment by halfhaggis July 6, 2007 @ 4:58 pm