SOE Billing Error


  • #16
    Originally posted by Moraganth
    But if the servers that are running Drinal (for example) are at a third or less of the capacity they can run at (before the merge), and the merge doubles that load... it's still only two-thirds of what they can put on those machines.
    This is true, but the flip side is that if the machines were over 50% utilization before the merger, they would be very stressed indeed after the merger. (This may happen for very busy zones like PoK or the bazaar.)
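
    A quick back-of-the-envelope version of that argument, in Python just for concreteness (the utilization figures are made up; nobody outside Sony knows the real numbers):

    # Hypothetical merge arithmetic: after a merge, one machine carries the load
    # that two servers used to carry.
    def merged_utilization(util_a, util_b):
        return util_a + util_b

    print(round(merged_utilization(1/3, 1/3), 2))    # 0.67 -> still has headroom (Moraganth's case)
    print(round(merged_utilization(0.55, 0.55), 2))  # 1.1  -> over 100%: the box is overloaded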

    At any rate, the discussion is a bit academic. We have no real proof one way or the other, outside of what Sony tells us. The last I heard was them saying that the only machines they retired were outdated ones, and the rest were pressed back into service.
    Sir KyrosKrane Sylvanblade
    Master Artisan (300 + GM Trophy in all) of Luclin (Veeshan)
    Master Fisherman (200) and possibly Drunk (2xx + 20%), not sober enough to tell!
    Lightbringer, Redeemer, and Valiant servant of Erollisi Marr



    • #17
      Oh, one other note.
      Originally posted by Moraganth
      You do attempt to repair them, because if you can put a server up with 10 minutes of work, it's better than 4+ weeks on the rma list...
      If your vendor or support contractor can't ship you a replacement server within 24 hours, you need a new vendor. (As I said, this is for mission critical stuff; any unplanned downtime is completely and totally unacceptable.)

      This reminds me a bit of a story from my last job. One of the folks I worked with was in charge of the server farm (such as it was) at our business; it was essentially a dozen or so servers at the time, loosely connected. Downtime would be bad, as these machines essentially ran the business. The company had a policy that in case of emergency, downtime of up to 24 hours would be acceptable, after which the servers must be available again with no loss of data beyond that 24 hour time frame.

      The IT manager set up the systems in such a way that the data would be backed up to two servers so it wouldn't be lost, and he arranged a contract with the supplier who'd sold us the machines that guaranteed a hot replacement server would be delivered to us within four hours of declaring an emergency, 24 hours a day, 365 days a year. (A hot replacement server is an identical machine to an existing server, lacking only software and data. You just restore those from your backups and switch the cords from the dead machine to the new one, and you're up and running.)

      One day, feeling a bit bored, the manager decided to do a surprise test of the emergency recovery plan. He walked into the IT room, reached behind one of the servers, and yanked the power plug. When the rest of the IT staff ran into the room to see what the problem was, he simply pointed at the server and said, "This machine just died. Get the backup plan rolling."

      Within eight hours, the vendor had delivered the hot backup, the software and data from the last backups had been loaded, and the machine was up and running.

      I would expect that for an operation the size of EQ, nothing less than this standard, and probably something much, much tougher, would be in effect for all the machines.
      Sir KyrosKrane Sylvanblade
      Master Artisan (300 + GM Trophy in all) of Luclin (Veeshan)
      Master Fisherman (200) and possibly Drunk (2xx + 20%), not sober enough to tell!
      Lightbringer, Redeemer, and Valiant servant of Erollisi Marr



      • #18
        Originally posted by KyrosKrane
        Oh, one other note.

        If your vendor or support contractor can't ship you a replacement server within 24 hours, you need a new vendor. (As I said, this is for mission critical stuff; any unplanned downtime is completely and totally unacceptable.)
        Yup, gotta agree. Depending on the company, even four hours is bad. Downtime is unacceptable; some places lose millions an hour when they're down.

        Now for server downtime, I assume a dead box is replaced within 15 minutes for a company the size of Sony, probably less. For optimal running, what they probably do is keep extra servers lying around ready to be deployed, along with servers that sit there idle, waiting to take on the load of a downed server.
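
        That's just a guess at how they run it, but the hot-spare idea looks roughly like this sketch in Python (the server names, IPs, health check, and one-minute interval are all invented for illustration, not anything Sony has documented):

        import time

        active = {"drinal-zone-01": "10.0.0.1", "drinal-zone-02": "10.0.0.2"}
        standby = ["10.0.0.50", "10.0.0.51"]   # idle spares, already racked and powered

        def is_alive(ip):
            # placeholder health check; a real farm would use proper monitoring
            return True

        while True:
            for name, ip in list(active.items()):
                if not is_alive(ip) and standby:
                    spare = standby.pop(0)     # promote an idle spare to take the load
                    active[name] = spare
                    print(name, "went down; load moved to hot spare", spare)
            time.sleep(60)                     # check every minute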

        As for upgrading servers, it's generally not a good idea. There are a few things you can upgrade, usually memory and hard drive space, but you normally max out the server when you buy it, or buy it with the amount you want. Disk space shouldn't be a factor anyway; if you're running a good farm, you'll have a separate FC-AL array for your storage. The reasoning is close to why Dell has done so well mass-producing computers: since all the parts are the same, it's easy to shuffle machines around for restores, not to mention testing software, because you can test on one box and know the result for all of them. For example, you can install an OS on a hard drive, clone it 50 times, and know that every copy will boot.

        Also, reducing the number of servers is generally not a good idea where redundancy is concerned. Say you have 100 servers running PoK at 50% load each and you cut 30 of them: the load on the remaining 70 jumps to roughly 71% apiece. Every further server that goes down pushes the rest up by about another point, so by the time 20 of those 70 have failed (assuming no failover servers) you're at 100%. Not the best example, but this is why you want to keep per-server loads somewhat smallish.
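
        Working the numbers through in Python, using the same hypothetical 100-server / 50%-load starting point:

        # 100 servers at 50% load = 50 servers' worth of actual work to spread around.
        total_work = 100 * 0.50

        for remaining in (100, 70, 60, 50):
            print(remaining, "servers ->", round(100 * total_work / remaining), "% load each")

        # 100 servers -> 50 % load each
        # 70 servers  -> 71 % load each   (after cutting 30 machines)
        # 60 servers  -> 83 % load each   (10 more failures)
        # 50 servers  -> 100 % load each  (no headroom left, assuming no failover spares)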

