Global Akamai Outage

heasley heas at shrubbery.net
Tue Jul 27 19:02:24 UTC 2021


Mon, Jul 26, 2021 at 07:04:41PM +0200, Lukas Tribus:
> Hello!
> 
> On Mon, 26 Jul 2021 at 17:50, heasley <heas at shrubbery.net> wrote:
> >
> > Mon, Jul 26, 2021 at 02:20:39PM +0200, Lukas Tribus:
> > > rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
> > > possible for RTR servers to stop considering outdated VRPs:
> > > https://github.com/rpki-client/rpki-client-openbsd/commit/9e48b3b6ad416f40ac3b5b265351ae0bb13ca925
> >
> > Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
> > server "stop considering" something that does not exist from its PoV?
> 
> rpki-client can only remove outdated VRPs if it a) actually runs and
> b) if it successfully completes a validation cycle. It also needs to
> do this BEFORE the RTR server distributes data.
> 
> If rpki-client for whatever reason doesn't complete a validation cycle
> [doesn't start, crashes, cannot write to the file], it will not be able
> to update the file, which stayrtr reads and distributes.
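
As a rough illustration of the mechanism being described: a consumer of
the validator's JSON export could drop entries whose "expires" timestamp
has already passed before serving them.  The "roas"/"expires" field names
and the file path in this sketch are my assumptions about that export,
not a verified spec:

#!/usr/bin/env python3
# Rough illustration only: read a validator's JSON export and keep just
# the VRPs whose "expires" timestamp (seconds since the epoch) is still
# in the future.  Field names and path are assumptions, not a spec.
import json
import time

def load_unexpired(path):
    with open(path) as f:
        data = json.load(f)
    now = time.time()
    return [vrp for vrp in data.get("roas", [])
            if vrp.get("expires", 0) > now]

if __name__ == "__main__":
    vrps = load_unexpired("/var/db/rpki-client/json")  # assumed path
    print("unexpired VRPs:", len(vrps))
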
> 
> If your VM went down with both rpki-client and stayrtr, and it stays
> down for 2 days (maybe a nasty storage or virtualization problem or
> maybe this is just a PSU failure in a SPOF server), when the VM comes
> back up, stayrtr will read and distribute 2-day-old data - after all -
> rpki-client is a periodic cronjob while stayrtr will start
> immediately, so there will be plenty of time to distribute obsolete
> VRPs. Just because you have another validator and RTR server in
> another region that was always available, doesn't mean that the
> erroneous and obsolete data served by this server will be ignored.
> 
> There are more reasons and failure scenarios why this 2 piece setup
> (periodic RPKI validation, separate RTR daemon) can become a "split
> brain". As you implement more complicated setups (a single global RPKI
> validation result is distributed to regional RTR servers - the
> cloudflare approach), things get even more complicated. Generally I
> prefer the all in one approach for these reasons (FORT validator).
> 
> At least if it crashes, it takes down the RTR server with it:
> 
> https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163
> 
> 
> But I have to emphasize that all those are just examples. Unknown bugs
> or corner cases can lead to similar behavior in "all in one" daemons
> like Fort and Routinator. That's why specific improvements absolutely
> do not mean we don't have to monitor the RTR servers.

I am not convinced that I want the RTR server to be any smarter than
necessary, and I think expiration handling is too smart.  I want it to
load the VRPs provided and serve them, no more.

Leave expiration to the validator and monitoring of both to the NMS and
other means.  The delegations should not be changing quickly[1] enough
for me to prefer expiration over the grace period to correct a validator
problem.  That does not prevent an operator from using other means to
share fate; e.g., if the validator fails completely for 2 hours, stop
the RTR server.
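
Something as simple as the following cron-able sketch would do; the
validator output path and the service name here are placeholders
(assumptions on my part), not defaults of any particular software:

#!/usr/bin/env python3
# Sketch of external fate-sharing: if the validator's output file has
# not been rewritten within the grace period, stop the RTR daemon so
# routers fall back to their other RTR sessions.
import os
import subprocess
import sys
import time

VRP_FILE = "/var/db/rpki-client/json"   # assumed validator output location
GRACE = 2 * 3600                        # two hours, per the example above

def main():
    try:
        age = time.time() - os.path.getmtime(VRP_FILE)
    except OSError:
        age = GRACE + 1                  # missing file counts as stale
    if age > GRACE:
        # stop the RTR server; adjust for rc.d/systemd/whatever is local
        subprocess.run(["systemctl", "stop", "stayrtr"], check=False)
        sys.exit(1)

if __name__ == "__main__":
    main()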

I perceive this to be choosing stability in the RTR sessions over
timeliness of updates.  And, if a 15-30 minute polling interval is
reasonable, why isn't 8-24 hours?

I too prefer an approach where the validator and RTR are separate but
co-located, but this naturally increases the possibility that the two
might serve different data due to reachability, validator run-time, etc.
To what extent differences occur, I have not measured.
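
A rough way to start measuring would be to diff the (prefix, maxlen,
origin) sets from two exports; the field names in this sketch are
assumed, not taken from any particular validator's documentation:

#!/usr/bin/env python3
# Sketch for measuring how two VRP exports differ: compare the
# (prefix, maxLength, asn) tuples from two JSON dumps, e.g. the local
# validator's file versus a dump taken from another co-located instance.
import json
import sys

def vrp_set(path):
    with open(path) as f:
        data = json.load(f)
    return {(r["prefix"], r.get("maxLength"), r["asn"])
            for r in data.get("roas", [])}

if __name__ == "__main__":
    a = vrp_set(sys.argv[1])
    b = vrp_set(sys.argv[2])
    print("only in A:", len(a - b))
    print("only in B:", len(b - a))
    print("in both:  ", len(a & b))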


[1] The NIST ROA graph confirms the rate of change is low, as I would
expect.  But I have no statistics for ROA stability, considering only
the prefix and origin.

