Quick question.

Paul Jakma paul at clubi.ie
Sun Aug 1 17:06:03 UTC 2004


On Sun, 1 Aug 2004, Michel Py wrote:

> True; this would be like raid-0 arrays, the more disks the greater 
> the chance of failure.

This holds true for most RAID-x levels.
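To put a rough number on "the more disks, the greater the chance of failure", here is a minimal sketch. It assumes disks fail independently with the same probability p over some period; both the independence and the value of p are illustrative assumptions, not measured figures. For RAID-0 the result is the chance of losing the array. For redundant levels it is only the chance of having to deal with at least one failed member.

```python
def p_any_failure(p_disk: float, n_disks: int) -> float:
    """Probability that at least one of n independent disks fails.

    p_disk is the (assumed) per-disk failure probability over the
    period of interest, e.g. one year.
    """
    return 1.0 - (1.0 - p_disk) ** n_disks


if __name__ == "__main__":
    # 5% per disk is an arbitrary illustrative value.
    for n in (1, 2, 4, 8):
        print(n, "disks:", round(p_any_failure(0.05, n), 4))
```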

> In other words, I don't really care if the second processor reduces 
> the MTBF from 200k hours to 60k hours, but I do care if the second 
> processor reduces the time to restore service from 24 hours to 20 
> minutes (7.5 minutes for SNMP to fail the query twice, 1.5 minute 
> for the tech to find out that either it's frozen or there's a BSOD, 
> 6 minutes to have someone go there and reset, 5 minutes to reboot).
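For reference, the arithmetic behind the quoted figures can be sketched as below, using the usual steady-state formula availability = MTBF / (MTBF + MTTR). The numbers are taken at face value from the quoted text (7.5 + 1.5 + 6 + 5 = 20 minutes); whether a box with a failed CPU really comes back after a reset is a separate question. The function names are just illustrative.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def nines(avail: float) -> int:
    """Number of whole leading nines, e.g. 0.9995 -> 3."""
    return math.floor(-math.log10(1.0 - avail))


# Figures from the quoted text.
restore_minutes = 7.5 + 1.5 + 6 + 5           # 20 minutes in total
single = availability(200_000, 24)            # MTBF 200k h, 24 h to restore
dual = availability(60_000, restore_minutes / 60)

if __name__ == "__main__":
    print("single CPU:", single, "nines:", nines(single))
    print("dual CPU:  ", dual, "nines:", nines(dual))
```

Under those assumptions the single-CPU case comes out at three nines and the dual-CPU case at five, which is the point being argued in the quote. The result depends entirely on the 20-minute restore time holding.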

If a CPU dies, it's unlikely to come back up without removing the bad 
CPU, especially if the CPU has become unreliable rather than dying 
completely. Even if CPU 0 is good and the BIOS has no problems 
booting the OS, the SMP-aware OS will quite probably hit problems 
with the bad CPU.

If you really want to guard against CPU failures, you need a machine 
designed for fault-tolerance, not a "cheap" SMP box; those are just 
*less* reliable.[1]

> The dead processor still has to be replaced, but this is scheduled 
> maintenance, not outage. A little extra ammo when you have to hunt 
> five or six nines.

Just tape a spare CPU to the inside of the box if time-to-repair is 
important. Even better, just have a second system on standby.

> Unsignificant in my experience, and does not balance what Alexei 
> mentioned yesterday:

Alexei is talking about something else.

> a duallie will keep the system up when a faulty process hogs 100% 
> CPU, because the second one is still available. That also increases 
> availability ratio.

This is a resource problem, not an availability problem. A spinning 
application is not going to take down the machine on any modern OS[2], 
and in any case it can be handled with resource limits, SMP or not, 
presuming your OS supports resource limits.
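As an illustration of the resource-limit point, here is a minimal sketch for a Unix-like system using Python's standard `resource` module to cap the CPU time of a child process. The helper name `run_with_cpu_limit` and the spinning `SPIN_CMD` workload are made up for the example. It assumes the kernel enforces RLIMIT_CPU by sending SIGXCPU once the soft limit is reached.

```python
import resource
import signal
import subprocess
import sys

# Hypothetical runaway workload: a child that spins forever.
SPIN_CMD = [sys.executable, "-c", "while True: pass"]


def run_with_cpu_limit(cmd, cpu_seconds, wall_timeout=30):
    """Run cmd with a soft CPU-time limit; return its exit status.

    A negative return value -N means the child was killed by signal N.
    """
    def limit_cpu():
        # Runs in the child just before exec: lower only the soft limit
        # and leave the hard limit as inherited.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, hard))

    proc = subprocess.run(cmd, preexec_fn=limit_cpu, timeout=wall_timeout)
    return proc.returncode


if __name__ == "__main__":
    status = run_with_cpu_limit(SPIN_CMD, 1)
    print("killed by SIGXCPU:", status == -signal.SIGXCPU)
```

The same cap can be applied from a shell with `ulimit -t` before starting the process; either way the spinning process is stopped after its CPU budget, on one processor or several.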

The real problem with SMP is kernel complexity. Drivers that are rock 
solid in single-processor can have bugs that are only triggered under 
SMP. Threaded applications can also become unreliable on SMP systems.

The extra power of an SMP system might be a bonus, but trying to 
argue for SMP on the basis of reliability is misguided.

> Michel.

1. Now, they may still be very reliable, and more than reliable 
enough for your needs, but they are still not as reliable as the 
exact same machine with terminators in all CPU sockets/slots bar one 
;) The fault-tolerant systems are outrageously expensive.

2. Unless you're running MacOS 9 or Windows 3.11 on your server... I 
don't think either supports SMP, though ;).

regards,
-- 
Paul Jakma	paul at clubi.ie	paul at jakma.org	Key ID: 64A2FF6A
Fortune:
A Linux machine! because a 486 is a terrible thing to waste!
(By jjs at wintermute.ucr.edu, Joe Sloan)


