PAIX Outages

Richard A Steenbergen ras at e-gerbil.net
Thu Apr 28 17:51:54 UTC 2005


On Wed, Apr 27, 2005 at 10:45:15AM -0400, Jay Patel wrote:
> 
> I have heard rumors that S&D has been having persistent switch
> problems with their switches at PAIX (Palo Alto), and I was kind of
> wondering if anyone actually cared?

Personally I tend to suspect the general lack of uproar is a rather 
unfortunate (for them) sign that PAIX is no longer relevant when it comes 
to critical backbone infrastructures.

It looks like different folks have been seeing different levels of outages 
depending upon which switch/card they are connected to, but I haven't been 
able to find anyone who has seen fewer than 30 hits between April 16th and 
the two this morning. Our ports have seen just under 28 hours of total 
downtime so far this month, while some lucky people have only seen around 
6 hours.

I'm not sure if anyone at S&D or Extreme actually has any real idea what 
the problem is with these current switches, but given this amount of 
downtime, they should have replaced every last component by now. If Extreme 
can't fix them, there should be a pile of Black Diamonds sitting on the 
curb waiting for trash day. In fact, 9/10ths of the way through writing 
this e-mail, I got a call from S&D stating that they are doing exactly 
that. :)

In the meantime, here are some of the more interesting snippets of what 
has been tried on the current switches:

16 Apr 2005 20:19:53 GMT
We are currently experiencing some problems with 2 network cards in our 
Palo Alto peering switch. This might be causing possible service
degradations. Switch Engineers are expecting new cards to replace the 2
suspected faulty network cards. These cards should be arriving in or
around 1 hour. Right after the cards arrive, we will be scheduling an
emergency maintenance window to get these cards replaced.

19 Apr 2005 14:16:07 GMT
The Purpose of this Emergency Maintenance window is for Switch Engineers 
to replace a faulty processor module card affecting the Bay Area Peering
customers. The estimated down time will be 15 minutes.
(Actual downtime several hours)

19 Apr 2005 19:27:49 GMT
This is the final update regarding the problems experienced today with the
peering fabric. Our Switch Engineers corrected the problems during the
emergency maintenance window by replacing two line cards and 2 processor
cards in the Palo Alto switch. All peering sessions should be restored at
this time.

22 Apr 2005 21:56:15 GMT
The purpose of this emergency maintenance window is for engineers to 
replace defective power supply units on the Paix Switch. No impact to your
services is expected.

24 Apr 2005 21:25:48 GMT
Our Switch Engineers will be conducting and emergency processor cards 
replacement at the Palo Alto site. The expected downtime while this
maintenance is being conducting will be 2 hours.

24 Apr 2005 21:36:18 GMT
Our Switch Engineers will be conducting and emergency chassis replacement 
at the Palo Alto site. The expected downtime while this maintenance is
being conducting will be 3 hours.

25 Apr 2005 19:17:41 GMT
Our engineers have escalated the problems with the peering switch in Palo
Alto to 3rd level support at Extreme, the switch vendor. More details will
follow as they become available.

26 Apr 2005 03:00:34 GMT
Our Switch Engineers have advised us that the switch has been migrated to 
a different power bus to rule out any power variables. Power is being 
monitored for the next 24 hours.

28 Apr 2005 13:33:05 GMT
At approximately 6:05 AM local time, the peering switch rebooted itself. 
Our switch engineers are investigating this issue and believe all sessions
are back to normal at this time. More details will be provided as they
become available.

When I see a stable switching platform going forward, and some service 
credits for the massive outages we've all endured so far, I'll probably be 
a lot less cranky about the entire situation. Until then I have to say, if 
they keep this up they are going to need to change their name to "Switch 
or Data".

Oh well, at least this didn't happen during the S&D sponsored NANOG. :)

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)


