few big monolithic PEs vs many small PEs

James Bensley jwbensley+nanog at gmail.com
Thu Jun 27 12:48:25 UTC 2019


On Thu, 27 Jun 2019 at 12:46, <adamv0025 at netconsultings.com> wrote:
>
> > From: James Bensley <jwbensley at gmail.com>
> > Sent: Thursday, June 27, 2019 9:56 AM
> >
> > One experience I have had is that when there is an outage on a large PE,
> > even when it still has spare capacity, the business impact can be too
> > much to handle (the support desk is overwhelmed, customers become irate
> > if you can't quickly tell them what all the impacted services are and
> > when service will be restored, the NMS has so many alarms it's not clear
> > what the problem is or where it's coming from, etc.).
> >
> I see what you mean. My hope is to address these challenges with a "single source of truth" provisioning system that will include, among other things, a HW-to-customer/service mapping, so the Ops team will be able to say that if a particular LC X fails, then customers/services X, Y, Z will be affected.
> But yes, I agree that with smaller PEs the fallout from any failure is proportionally smaller.

Hi Adam,

My experience is that it is much more complex than that (although it
also depends on what sort of service you're offering). One can't
easily model the inter-dependencies between multiple physical assets
(links, interfaces, line cards, racks, DCs, etc.) and logical
services such as VRFs/L3VPNs, cloud-hosted proxies, and the P&T edge.

Consider this, in my opinion, relatively simple example:
Three PEs in a triangle. Customer is dual-homed to PE1 and PE2 and
their link to PE1 is their primary/active link. Transit is dual-homed
to PE2 and PE3 and your hosted filtering service cluster is also
dual-homed to PE2 and PE3 to be near the Internet connectivity.

How will you record the inter-dependency whereby an outage on PE3
impacts Customer? When that Customer sends traffic to PE1
(let's say all their operations are hosted in a public cloud
provider), and PE1 has learned the shortest path to 0/0 or ::0/0 from
PE2, the Internet traffic is sent from PE1 to PE2, and from PE2 into
your filtering cluster. When the traffic comes back into PE2 after
passing through the filters, it is then sent to PE3, because the
transit provider attached to PE3 has a better route to Customer's
destination (AWS/Azure/GCP/whatever) than the one directly attached
to PE2.
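To make that concrete, here is a minimal sketch of the idea, a static
record of the service's actual forwarding path that you can query for
impact. It is plain Python, the names (PE1, "filter-cluster",
"Transit-B", etc.) are the hypothetical devices from the scenario
above, and it is not a real implementation of anyone's tooling:

    # Record the *actual* forwarding path of each service, then ask
    # which services a failed node would impact. All device names are
    # hypothetical, taken from the scenario above.
    SERVICE_PATHS = {
        "Customer-Internet": [
            "Customer-CPE", "PE1",           # active access link
            "PE2", "filter-cluster", "PE2",  # hairpin through hosted filtering
            "PE3", "Transit-B",              # PE3's transit wins the return path
        ],
    }

    def impacted_services(failed_node):
        """Return every service whose forwarding path crosses failed_node."""
        return [svc for svc, path in SERVICE_PATHS.items()
                if failed_node in path]

    # A PE3 outage hits Customer despite Customer having no port on PE3:
    print(impacted_services("PE3"))  # -> ['Customer-Internet']

The catch is that the path above is hand-coded, whereas in reality it
is a function of live routing state (which PE holds the best 0/0,
which transit has the better return route), so a static record like
this goes stale the moment best paths change.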

That to me is a simple scenario, and it can be mapped with a
dependency tree. But in my experience, and maybe it's just me, things
are usually a lot more complicated than this. The root cause is
probably a bad design introducing too much complexity, which is
another vote for smaller PEs from me. With more service-dedicated PEs
one can reduce or remove the temptation to pile multiple services,
and with them more complexity, onto the same PE(s).
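Continuing the sketch above (same hypothetical names), such a
dependency tree is just the inversion of the recorded paths, i.e. a
mapping of asset -> set of services that depend on it:

    from collections import defaultdict

    def dependency_map(service_paths):
        """Invert service paths into asset -> dependent services."""
        deps = defaultdict(set)
        for svc, path in service_paths.items():
            for node in path:
                deps[node].add(svc)
        return dict(deps)

    # dependency_map(SERVICE_PATHS)["PE3"] -> {'Customer-Internet'}

Building that map is trivial; keeping the recorded paths truthful as
routing state changes is where the real complexity lives.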

Most places I've seen (managed service providers) simply can't map the
complex inter-dependencies they have between physical and logical
infrastructure without building some highly bespoke, and itself
complex, asset management / CMDB / CI system.

Cheers,
James.


