few big monolithic PEs vs many small PEs

Thu Jun 20 06:03:43 UTC 2019

On Wed, 19 Jun 2019 at 23:25, <adamv0025 at netconsultings.com> wrote:

> The conclusion I came to was that currently the best approach would be to
> use several medium to small (fixed) PEs to replace a big monolithic
> chassis-based system.

For availability I think the best approach is many small edge
devices, because software is terrible and will always be terrible.
People are bad at operating the devices and always will be. Hardware is
something we think about a lot when we think about redundancy, but it's
not that common a reason for an outage.
With more, smaller boxes the inevitable human cockup and software
defects will affect fewer customers. I believe this to be true
because the events are sufficiently rare, and once they happen we
find a solution, or at the very least a workaround, rather fast. With
full inaction you could argue that having A3 and B1+B2 is the same
amount of aggregate outage: while an outage in B affects fewer
customers, there are two B nodes, each with equal probability of
outage. But I argue that the events are not independent, they are
dependent, so the probability calculation isn't straightforward. Once
we get some rare software defect or operator mistake on B1, we usually
solve it before it triggers on B2, making the aggregate downtime of the
entire system lower.
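
To put rough numbers on that argument, here is a minimal sketch in
Python. The model is an assumption made purely for illustration: one
defect that, left alone, would trigger once on every box, a fixed
outage length, and a probability p_fixed_before_repeat that a fix or
workaround lands before the defect triggers on the next box. The
function name, the customer count and the outage duration are all
hypothetical, not measured data.

```python
def expected_customer_hours(customers, boxes, outage_hours,
                            p_fixed_before_repeat):
    """Expected customer-hours lost to one defect (illustrative model).

    Assumes customers are spread evenly over the boxes and that the
    defect would, with no intervention, hit each box exactly once.
    """
    per_box = customers / boxes
    # The first box always takes the hit. Each remaining box is hit
    # only if the fix/workaround did NOT arrive in time.
    expected_boxes_hit = 1 + (boxes - 1) * (1 - p_fixed_before_repeat)
    return per_box * outage_hours * expected_boxes_hit


# Hypothetical figures: 10,000 customers, 2-hour outage per incident.
big = expected_customer_hours(10_000, 1, 2.0, 0.0)
small_inaction = expected_customer_hours(10_000, 2, 2.0, 0.0)
small_fixed = expected_customer_hours(10_000, 2, 2.0, 0.9)

print(f"one big box:              {big:.0f} customer-hours")
print(f"two boxes, full inaction: {small_inaction:.0f} customer-hours")
print(f"two boxes, 90% fixed:     {small_fixed:.0f} customer-hours")
```

Under these assumptions, full inaction (probability 0) gives the two
small boxes the same aggregate outage as the one big box, and any
probability above 0 of fixing the problem before it repeats makes the
split design come out ahead.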

> Yes it will cost a bit more (router is more expensive than a LC)

Several of my employers have paid only for the LCs. I don't think the
CAPEX difference is meaningful, but operating two separate devices may
have significant OPEX implications in electricity, rack space,
provisioning, maintenance, etc.

> And yes there is the "node-slicing" approach from Juniper where one can
> offload CP onto multiple x86 servers and assign LCs to each server (virtual
> node) - which would solve my chassis full problem -but honestly how many of
> you are running such setup? Exactly. And that's why I'd be hesitant to
> deploy this solution in production just yet. I don't know of any other
> vendor solution like this one, but who knows maybe in 5 years this is going
> to be the new standard. Anyways I need a solution/strategy for the next 3-5
> years.

Node slicing indeed seems like it can be a sufficient compromise here
between OPEX and availability. I believe (but do not know) that the
shared software risks are meaningfully reduced and that bringing down
the whole system is sufficiently rare to allow an availability upside
compared to a single large box.

-- 
  ++ytti