We're back!
First off, we're sorry about the lengthy downtime we suffered during the past week. For the most part it was completely beyond our control. We did, unfortunately, suffer some data loss, some of it simply due to corruption. If you continue reading, I'll explain what happened - or just skip to the end if you don't care and only want to know wtf is up with the site right now.
Back in February we suffered some downtime, over 24 hours in fact. Scanning the system logs, we were able to pinpoint it to an issue with one of the hard drives, but it was difficult to determine which one (we have 4 in a RAID 10 array) when the tools we had available were reporting them all as healthy!

Fast-forward to last week, when one of the hard drives decided to fail on us completely. Those of you who know about RAID will know that a single hard drive in the array failing is not a problem, so we simply asked the server techs to replace it for us. No big deal. As I was submitting the support ticket, I was running tests on the other 3 hard drives to check they were OK - turns out they weren't. A drive in the second pair was also failing, and as the test finished running, it did fail. This brought the server into a read-only state, so we rebooted to allow the techs to replace the first bad disk. That was done, but we then had to wait practically a whole day for the filesystem checks to finish and for the new disk to rebuild into the array.
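For anyone wondering why one dead drive was fine but the second one caused so much grief: a 4-disk RAID 10 is two mirrored pairs striped together, so the array keeps your data as long as each pair still has at least one healthy disk. Losing one disk from each pair (roughly our situation) leaves the data intact but with zero redundancy left; losing both disks of the same pair means the data is gone. Here's a rough sketch of that logic - the disk names and pairing are purely illustrative, not our actual layout:

```python
# Rough sketch of RAID 10 failure tolerance with 4 disks.
# Disk names and pairing here are illustrative, not our actual layout.

# RAID 10 = two mirrored pairs, striped together.
MIRROR_PAIRS = [("disk0", "disk1"), ("disk2", "disk3")]

def array_survives(failed_disks: set) -> bool:
    """The array keeps its data as long as every mirror pair
    still contains at least one healthy disk."""
    return all(
        any(disk not in failed_disks for disk in pair)
        for pair in MIRROR_PAIRS
    )

# Any single drive failing is survivable.
assert array_survives({"disk0"})

# Losing one disk from *each* pair (roughly our situation) still
# preserves the data, but with no redundancy left at all.
assert array_survives({"disk0", "disk2"})

# Losing both disks of the *same* pair destroys the array.
assert not array_survives({"disk0", "disk1"})
```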
The news got worse after that: long story short, the OS had become corrupted and we were unable to get the server to boot at all. Combine that problem with incredibly slow support staff and you can see where this is heading...
In the days that followed, they eventually replaced the second failed disk for us, and within another couple of days they finally got it into rescue mode and were able to get the server back online (today).
Unfortunately, this was when I discovered quite a lot of file integrity loss - corrupted files everywhere. After working all night to get the server back into a workable state, we realised our database system was, for lack of a better phrase, completely fucked. Worse still, our on-site nightly backups were mostly lost, and the most recent off-site backups we had were over two weeks old - but they had to do.
What I've done is restore a database backup from 13th March. Anything that happened since then has been lost. As for files and uploads, we believe most of these are OK, but chances are there are some missing files we aren't aware of. Please let me know (in this topic) if you are experiencing errors or other weirdness on the forums or anywhere else on the website.
While no one is to blame for what happened (except myself, for not having more recent backups available), we feel the support staff made us wait unreasonably long both to replace the failed hardware and to recover the system for us. Downtime was almost inevitable with these kinds of failures, but it certainly should not have lasted this long. For this reason, we will be transferring TGTAP (and our other sites) to a new server within the next couple of weeks. We don't expect any noticeable downtime while this happens, though the forums will be taken offline for approximately 15 minutes to ensure we have successfully migrated the data.
That's all.