Whats so difficult about ISSU

Saku Ytti saku at ytti.fi
Fri Nov 9 07:36:21 UTC 2012


On (2012-11-09 01:22 +0200), Kasper Adel wrote:

> We've been hearing about ISSU for so many years and i didnt hear that any
> vendor was able to achieve it yet.
> 
> What is the technical reason behind that?

I'd say code quality in routers is generally really, really bad, and I'm
not sure why this is.
I think one problem is that we start from the premise that code will be
written correctly. When we start from that premise, we can do silly things
like write run-to-completion operating systems like IOS and JunOS (rpd).
Which means a single guy makes one bad judgement call, and the whole OS is
bad.

Of course run-to-completion is the most efficient way to execute code if
your code is flawless, but that ship has sailed. Possibly when IOS started,
CPU time was at a premium and it was cheaper to throw code review money at
the problem.
But today it clearly is cheaper to add power to the control plane and have
levels of abstraction in the control plane which save the system from bad
code, i.e. design your control plane assuming the code you deliver isn't
good.
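
To make the contrast concrete, here is a minimal toy sketch in Python. It is
not how IOS or rpd are implemented; parse_update, run_to_completion and
run_isolated are hypothetical names invented for illustration, and the
try/except boundary only stands in for a real process boundary.

```python
# Toy model only: hypothetical names, not any vendor's implementation.

def parse_update(msg):
    """Pretend protocol handler; the message "bad" trips a bug."""
    if msg == "bad":
        raise ValueError("malformed UPDATE")
    return msg.upper()


def run_to_completion(queue):
    """One shared loop with no boundary between tasks.

    Each handler runs until it returns, so a single unhandled bug
    ends the whole loop and the remaining work is never done.
    """
    done = []
    for msg in queue:
        done.append(parse_update(msg))
    return done


def run_isolated(queue):
    """Same work, but each handler sits behind a fault boundary.

    The try/except stands in for a separate, restartable process:
    a failure is recorded and the rest of the queue still runs.
    """
    done, failed = [], []
    for msg in queue:
        try:
            done.append(parse_update(msg))
        except Exception as exc:
            failed.append((msg, str(exc)))
    return done, failed
```

With the queue ["a", "bad", "b"], run_to_completion raises on the second
message and never reaches "b", while run_isolated still processes "a" and
"b" and only reports "bad" as failed.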

Take a page from the Erlang team on design principles. I think Arista is
walking the right path. They have a (hopefully) stable and simple
state-storage process, from which separate processes can download their
state when they crash, which can make crashing virtually transparent to
the operator.
However, I think Arista is still running a single BGPd etc. I think you
should at least run iBGP and eBGP, or maybe even peer groups, in different
daemons, so when you get a bad UPDATE, it'll crash your eBGP daemon or one
peer group instead of all neighbours. Or of course if you keep TCP state
and the various BGP RIBs in a separate location, you won't need to tear
down the TCP session just because you crash.
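
A rough sketch of that idea in Python, under loud assumptions: StateStore,
BgpWorker and Supervisor are hypothetical names, this is not Arista's actual
design, and plain objects stand in for what would be separate OS processes.
A next hop of None is just a stand-in for a bad UPDATE that trips a bug.

```python
# Toy model only: hypothetical classes, in-process objects standing in
# for separate daemons and a separate state-storage process.

class StateStore:
    """Deliberately small store that outlives any single worker."""

    def __init__(self):
        self._state = {}

    def save(self, owner, key, value):
        self._state.setdefault(owner, {})[key] = value

    def load(self, owner):
        # Return a copy so a worker cannot corrupt the stored state.
        return dict(self._state.get(owner, {}))


class BgpWorker:
    """One worker per peer group (e.g. "ibgp", "ebgp")."""

    def __init__(self, group, store):
        self.group = group
        self.store = store
        self.rib = store.load(group)  # recover state on (re)start

    def handle_update(self, prefix, next_hop):
        if next_hop is None:  # stand-in for a bad UPDATE hitting a bug
            raise ValueError("bad UPDATE")
        self.rib[prefix] = next_hop
        self.store.save(self.group, prefix, next_hop)


class Supervisor:
    """Restarts only the worker that crashed; the others are untouched."""

    def __init__(self, groups):
        self.store = StateStore()
        self.workers = {g: BgpWorker(g, self.store) for g in groups}
        self.restarts = {g: 0 for g in groups}

    def deliver(self, group, prefix, next_hop):
        try:
            self.workers[group].handle_update(prefix, next_hop)
        except Exception:
            # Replace the crashed worker; it reloads its state from the store.
            self.workers[group] = BgpWorker(group, self.store)
            self.restarts[group] += 1


sup = Supervisor(["ibgp", "ebgp"])
sup.deliver("ibgp", "10.0.0.0/8", "192.0.2.1")
sup.deliver("ebgp", "198.51.100.0/24", "203.0.113.1")
sup.deliver("ebgp", "203.0.113.0/24", None)  # crashes only the eBGP worker
```

After the last call the eBGP worker has been restarted once and has its
earlier route back from the store, while the iBGP worker never noticed.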

Someone might argue the overhead is too large, but is it though? MX routers
ship with a 4-core RP, out of which you're using 1 core. The overhead isn't
that high.

Some people have written positive things about ISSU in reply. The only box
where I've seen it work reliably is the CAT4500 switch. I've not seen it
working in routers. On MX960 my personal hit/miss ratio is something like
4/5 of ISSUs working and 1/5 failing catastrophically, e.g. the PFE suddenly
dropping packets as if a FW filter were applied, while none is. So we've
stopped using ISSU.
The point of ISSU is that you're not sending change management notices to
your customers, so it positively has to work, or you're in breach of
contract.

-- 
  ++ytti




More information about the NANOG mailing list