Details on CareGroup incident

Tue Dec 3 20:25:49 UTC 2002

To quote Dr. Halamka:
> I hope that my approval enables [Cisco] to share more with the press
> so that everyone understands that our issues were purely architectural
> and not related to Cisco products or Cisco engineers.

So, now that I have it in writing, here's what I'll share:

The Topology

There are three campuses, each with several physical sites, connected to
each other with 8 different FEC and OC3 LANE links -- all L2.  Each campus
had several core switches depending on the number of buildings, again
connected redundantly within the campus with FEC and/or OC3 LANE L2 links,
plus dozens of access switches with redundant L2 uplinks to various core
switches.  Many protocols were in use, including non-routable protocols
like NetBEUI and SNA.

How did this mess occur?  "The CareGroup network grew organically due to the
BIDMC merger, PACS installation, East to West movement of clinical services,
Libby sale and changing CareGroup environment."  In short, the network was
growing very quickly due to business changes and was never redesigned to
cope with the much larger scale and new application requirements.  This
resulted in the previously-mentioned 10+ hop STP, which spanned multiple
sites and multiple administrative zones.

CareGroup had recently gone through a network audit by a third party which
listed most/all of the potential problems in this design and suggested
solutions.  The outage occurred before CareGroup had an opportunity to act
on these recommendations.

The Outage

The trigger for the outage was the addition of a new high-bandwidth
application which somehow interfered with the proper operation of STP.  Even
after this application was removed, STP could not reconverge because the
topology exceeded the default 7-hop limit.
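
As a rough illustration of the hop-limit problem, here is a minimal Python
sketch that counts how many bridge hops each switch sits from the root and
flags anything beyond seven.  The switch names and the link list are
hypothetical, not CareGroup's actual topology, and a plain hop count from the
root is a simplification: real STP enforces the limit through its timers, not
a literal hop counter.

```python
from collections import deque

DEFAULT_MAX_DIAMETER = 7  # the default limit cited above


def hops_from_root(links, root):
    """Breadth-first search: bridge hops from the root to every reachable switch."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    hops = {root: 0}
    queue = deque([root])
    while queue:
        sw = queue.popleft()
        for nxt in sorted(neighbors.get(sw, ())):
            if nxt not in hops:
                hops[nxt] = hops[sw] + 1
                queue.append(nxt)
    return hops


def too_far(links, root, limit=DEFAULT_MAX_DIAMETER):
    """Switches whose shortest path to the root exceeds the limit."""
    return sorted(sw for sw, h in hops_from_root(links, root).items() if h > limit)


# Hypothetical chain of 11 switches: sw0 (root) - sw1 - ... - sw10
chain = [(f"sw{i}", f"sw{i + 1}") for i in range(10)]
print(too_far(chain, "sw0"))
```

In this made-up chain, the switches eight or more hops from the root are the
ones reported.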

During troubleshooting, Cisco inserted L3 hops at logical boundaries,
breaking the STP into more manageable chunks.  Most of these smaller chunks
were still unstable, so Cisco removed all redundant links in each area,
verified stability, and then reintroduced redundant links one-by-one to
troubleshoot potential problems.  As you can imagine, this took a long
time, and it was done with Cisco staff on site travelling to each building
and wiring closet to make the necessary changes.
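
The strip-down-and-reintroduce procedure above can be sketched in Python.
This is only a model of the logic, under assumptions: the link list is
hypothetical, and is_stable() is a stand-in for whatever manual verification
was actually done on site after each change.

```python
def split_links(links, root):
    """Return (tree, redundant): a loop-free set of links reaching every
    switch from root, and the leftover links that would create loops."""
    neighbors = {}
    for i, (a, b) in enumerate(links):
        neighbors.setdefault(a, []).append((b, i))
        neighbors.setdefault(b, []).append((a, i))
    seen = {root}
    tree_idx = set()
    stack = [root]
    while stack:
        sw = stack.pop()
        for nxt, i in neighbors.get(sw, []):
            if nxt not in seen:
                seen.add(nxt)
                tree_idx.add(i)
                stack.append(nxt)
    tree = [links[i] for i in sorted(tree_idx)]
    redundant = [link for i, link in enumerate(links) if i not in tree_idx]
    return tree, redundant


def reintroduce(tree, redundant, is_stable):
    """Add redundant links back one at a time; keep a link only if the
    area stays stable with it, otherwise set it aside as a suspect."""
    active = list(tree)
    suspects = []
    for link in redundant:
        if is_stable(active + [link]):
            active.append(link)
        else:
            suspects.append(link)
    return active, suspects


# Hypothetical area: two core switches, two access switches, dual uplinks.
links = [("core1", "core2"), ("core1", "acc1"), ("core2", "acc1"),
         ("core2", "acc2"), ("core1", "acc2")]
tree, redundant = split_links(links, "core1")
# Pretend the core2-acc2 uplink is the one that destabilizes the area.
active, suspects = reintroduce(
    tree, redundant, lambda current: ("core2", "acc2") not in current)
print(suspects)
```

Each call to is_stable() corresponds to a site visit, which is why the real
process was so slow.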

Going Forward

The new design calls for one pair of distribution L3 switches per building,
with L3 GE links in a full mesh between all distribution switches in a
campus, and between specific switches at each campus.  STP will be confined
to the access L2 switches.  All non-IP traffic will be removed or isolated.
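
To get a feel for what a full mesh within a campus costs in links, here is a
small Python sketch.  The building names and the switch naming scheme are
invented for illustration, and it covers only the intra-campus mesh, not the
links between campuses.

```python
from itertools import combinations


def campus_full_mesh(buildings):
    """One pair of distribution switches per building, with a link
    between every pair of distribution switches in the campus."""
    switches = [f"{b}-dist{i}" for b in buildings for i in (1, 2)]
    return list(combinations(switches, 2))


mesh = campus_full_mesh(["bldgA", "bldgB", "bldgC"])
print(len(mesh))  # 15 links for three buildings (6 switches)
```

With n distribution switches the count is n(n-1)/2, so it grows quickly as
buildings are added.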

In the meantime, CareGroup will be performing small-scale hardware and link
upgrades to improve resiliency within each campus, and will install
computers with dial-up connections to the datacenter in key locations so
that they will have minimal connectivity if another LAN event occurs.
Strict change control will be implemented with Cisco approvals.

Conclusion

As I alluded to before, this is not a story of equipment failure or STP's
inherent scaling problems.  It's a story of a business which didn't or
couldn't adapt the network to the business needs -- even though everyone saw
the train coming at them.  The lesson here isn't to review your network
design, but to review how your business handles growth and/or known
problems.

S