
The Great TGTAP Downtime 2013


Chris


We're back!

First off, we're sorry about the lengthy downtime we suffered during the past week. For the most part it was completely beyond our control. We did unfortunately suffer some data loss, some of which was simply due to corruption. Now if you continue reading I'll explain what happened, or just skip to the end if you don't care and just want to know wtf is up with the site right now.

Back in February we suffered some downtime, over 24 hours in fact. Scanning the system logs, we were able to pinpoint this to an issue with one of the hard drives, but it was difficult to determine which one (we have 4 in a RAID 10 array) when the tools we had available were reporting them all as healthy! Fast-forward to last week, and one of the hard drives decided to completely fail on us. Those of you who know about RAID will know that a single hard drive in the array failing is not a problem, so we simply asked the server techs to replace it for us. No big deal. As I was submitting the support ticket, I was running tests on the other 3 hard drives to check they were OK. Turns out they weren't. A drive in the second pair was also failing, and as the test finished running, it did fail. This brought the server into a read-only state, so we rebooted to allow the techs to replace the first bad disk. This was done, but we had to wait practically a whole day for it to finish filesystem checks and rebuild itself into the array.
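For anyone unfamiliar with RAID 10, here is a rough Python sketch of which whole-disk failures a four-disk array can tolerate in principle. The disk names and the pair layout are assumptions made up for illustration, not the actual layout of our server. It is also a simplified model: a disk is either fully working or fully dead, so it says nothing about failing-but-not-dead drives, the read-only state, or the corruption described in this post.

```python
from itertools import combinations

# Hypothetical layout: four disks as two mirrored pairs, striped together.
MIRROR_PAIRS = [("disk0", "disk1"), ("disk2", "disk3")]


def array_survives(failed, pairs=MIRROR_PAIRS):
    """RAID 10 keeps its data while every mirror pair has a working disk."""
    failed = set(failed)
    return all(any(disk not in failed for disk in pair) for pair in pairs)


def surviving_fraction(n_failed, pairs=MIRROR_PAIRS):
    """Return (survivable combinations, total combinations) for n_failed dead disks."""
    disks = [disk for pair in pairs for disk in pair]
    combos = list(combinations(disks, n_failed))
    survivable = sum(array_survives(combo, pairs) for combo in combos)
    return survivable, len(combos)


if __name__ == "__main__":
    for n in range(1, 4):
        ok, total = surviving_fraction(n)
        print(f"{n} failed disk(s): {ok} of {total} combinations survivable")
```

With this layout, `surviving_fraction(2)` gives `(4, 6)`: losing one disk from each pair is survivable on paper, while losing both disks of the same pair is not.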

The bad news got worse after that and long story short, the OS had become corrupted and we were unable to get the server to boot up. Combine this problem with incredibly slow support staff and you see where this is heading...

In the days that followed, they eventually replaced the second failed disk for us, and within another couple of days they finally got it into rescue mode and were able to get the server back online (today).

Unfortunately this was when I discovered quite a lot of file integrity loss, corrupted files everywhere. After working all night to get the server back into a workable state, we realised that unfortunately our database system was completely fucked, for lack of a better phrase. Worse still, our on-site nightly backups were mostly lost. The most recent off-site backups we had were over 2 weeks old, but these had to do.

What I've done is restore a database backup from 13th March. Anything that happened since then has been lost. As for files and uploads, we believe most of this is OK, but chances are there are some missing files we aren't aware of. Please let me know (in this topic) if you are experiencing errors or other weirdness on the forums or anywhere else on the website.

While no one is to blame for what happened (except myself for not having more recent backups available), we feel the support staff made us wait unreasonably long both to replace the failed hardware and to recover the system for us. Downtime was almost inevitable with these kinds of failures, but it certainly should not have been this long. For this reason, we will be transferring TGTAP (and our other sites) to a new server within the next couple of weeks. We don't expect there to be any downtime while this happens, though the forums will be taken offline for approx 15 mins to ensure we have successfully migrated the data.

That's all.


Hard drive failures are about the most critical thing that can happen to any computer anywhere, so yeah, I sympathize. Backing up whenever and wherever possible is the way to go about it. Cloud storage in this modern age is certainly worthy of consideration, but for a business it will wind up costing far more than the free basic amounts, I'm sure, if it's even possible.

 

I've dealt with RAM failures as well, and most recently with modern lithium-ion batteries; they just stop working all of a sudden, unlike the zinc batteries of old.

 

RAM failures in particular are just too odd to nail down quickly; they show up in different ways as erratic operation.


Don't think I've ever had a hard drive break. Five graphics cards, three CPUs and god knows how many memory cards, but never a hard drive. So yeah...

 

I think there are people who specialize in data recovery in the UK. Data can be recovered from both memory cards and hard drives, unless you raged and stepped on 'em. There aren't any here cause East Europe.

Edited by TUN3R

For a forum, a HDD recovery service for recent posts is really going to extremes. Our own forum was sabotaged, and at that moment I restored from an older archive than I would have liked, so I regret that. But it nearly mirrors TGTAP in this case: we reset from a prior backup point... kinda like Windows Restore in Win Millennium! YEAH!


  • 2 weeks later...
  • 4 weeks later...

You think a site as big as this can fit on a VPS? You're severely underestimating our size. A dedicated server is a minimum requirement, and we've been on several high-end ones since 2007.

 

Also, Cloudflare is good but it wouldn't have helped in this case. We were not under attack so their DDoS mitigation wouldn't have been of any use. The downtime was simply due to two consecutive hard drive failures. As for their caching methods, Cloudflare only shows cached pages in the event of an unreachable server, and for a limited time. Since the site was down for a whole week this would only have been effective in the first day.

 

Anyway, we actually silently migrated to a brand new server a couple of weeks ago and everything has been running smoothly since then. I've also taken extra measures to ensure we have nightly backups stored off-site in two separate locations. In the event of any future downtime due to hard drive failure, the most we'd be at risk of losing is 24 hours of data. Unless this coincided with a big news/content update, this won't even be much of an issue. Point is, while we can't guarantee against hardware failure, we can now be extremely optimistic about recovery should anything untoward ever happen again, and we certainly don't expect another downtime as long as that one.
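To put the 24-hour figure in concrete terms, here is a minimal Python sketch of how the worst-case loss window follows from the backup schedule, plus a simple staleness check across backup locations. This is not the actual tooling on the server: the function names, location names and dates are all made up for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Time between the newest backup taken before the failure and the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup predates the failure")
    return failure_time - max(usable)


def stale_locations(latest_by_location, now, max_age=timedelta(hours=24)):
    """Names of backup locations whose newest copy is older than max_age."""
    return sorted(
        name for name, taken in latest_by_location.items() if now - taken > max_age
    )


if __name__ == "__main__":
    # Hypothetical nightly backups at 03:00 on three consecutive days.
    nightly = [datetime(2024, 1, day, 3, 0) for day in (1, 2, 3)]
    # A failure just before the third night's backup loses almost a full day.
    print(data_loss_window(nightly, datetime(2024, 1, 3, 2, 0)))

    # Hypothetical off-site locations; "site_b" has not received a recent copy.
    latest = {
        "site_a": datetime(2024, 1, 2, 3, 0),
        "site_b": datetime(2023, 12, 30, 3, 0),
    }
    print(stale_locations(latest, now=datetime(2024, 1, 2, 12, 0)))
```

With nightly backups the window can never exceed 24 hours as long as the newest copy is actually usable, which is why checking each location for stale copies matters as much as the schedule itself.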


We're at RapidSwitch now. And yeah, I'm familiar with Hostgator. I got into the hosting game back in 2005, running it alongside TGTAP, but I didn't have any budget outside of the advertising revenue brought in by this site, which was only enough to cover the cost of a cheap server back then. I struggled with that and couldn't dedicate enough time to take it seriously, so it was never something I pursued, but I still gained useful knowledge through doing it. Over the years I've managed a total of 11 servers (8 dedicated, 3 VPS) at a total of 8 different hosts around the world. I don't claim to be an expert on server management, I'm far from it, but I know everything that I need to know to get a server optimised, secured and stable, and to fix any problems that might crop up.

 

And yeah a 600K post forum shouldn't have too much of a problem running off a decent VPS. But this forum is one of the least intensive parts of this site as it's not particularly active right now, it's our downloads database that sees a massive amount of traffic compared to the rest of the site. When you're transferring that much data so quickly you need decent hardware and bandwidth. Also, bear in mind I also host GrandTheftWiki on the server, as well as a couple of other projects I have unrelated to GTA. So the server gets decent usage even if the whole thing isn't being used by TGTAP - it's not wasted if that's what you're thinking, and we've plenty of room to grow and expand :)


We've been talking about backup files of our own site, since the Webmaster himself can't seem to log in, and has to create a new account to come in and try and remedy things. That's a PITA as well! ARGH!!!

 

I've been too lazy to set up any home RAID systems, but the idea should apply online as well: a HDD is mirrored with the same data, so if one dies for any reason, the other should keep operating and retain the most recent data. I'm no expert either, though. I opted to learn a bit more once I had a major trojan attack and lost all access to the data on my HDD. NEVER AGAIN!!

Edited by BlackListedB

  • 2 months later...
This topic is now closed to further replies.