Global Akamai Outage

Lukas Tribus lukas at ltri.eu
Tue Jul 27 21:23:09 UTC 2021


Hello,


On Tue, 27 Jul 2021 at 21:02, heasley <heas at shrubbery.net> wrote:
> > But I have to emphasize that all those are just examples. Unknown bugs
> > or corner cases can lead to similar behavior in "all in one" daemons
> > like Fort and Routinator. That's why specific improvements absolutely
> > do not mean we don't have to monitor the RTR servers.
>
> I am not convinced that I want the RTR server to be any smarter than
> necessary, and I think expiration handling is too smart.  I want it to
> load the VRPs provided and serve them, no more.
>
> Leave expiration to the validator and monitoring of both to the NMS and
> other means.

While I'm all for KISS, the expiration feature makes sure that the
cryptographic validity in the ROAs is respected not only on the
validator, but also on the RTR server. This is necessary because
there is nothing in the RTR protocol that indicates expiration, and
this change brings it at least into the JSON exchange between the
validator and the RTR server.
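
Roughly speaking (the field names below are an assumption on my part,
in the style of rpki-client's JSON export, not a spec), an RTR server
honouring that information could simply drop lapsed VRPs when it
loads the validator output:

    import json, time

    # Sketch only: assumes a validator export with a "roas" list where
    # each entry carries an "expires" Unix timestamp; the filename and
    # field names are illustrative, not taken from any particular tool.
    with open("vrps.json") as f:
        export = json.load(f)

    now = time.time()
    fresh = [v for v in export["roas"] if v.get("expires", now + 1) > now]
    print("serving %d of %d VRPs" % (len(fresh), len(export["roas"])))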

It's like TTL in DNS, and it's about respecting the wishes of the
authority (CA and ROA resource holder).


> The delegations should not be changing quickly[1] enough

How do you come to this conclusion? If I decide I'd like to originate
a /24 out of my aggregate, for DDoS mitigation purposes, why shouldn't
I be able to update my ROA and expect quasi-complete convergence in 1
or 2 hours?


> for me to prefer expiration over the grace period to correct a validator
> problem.  That does not prevent an operator from using other means to
> share fate; eg: if the validator fails completely for 2 hours, stop
> the RTR server.
>
> I perceive this to be choosing stability in the RTR sessions over
> timeliness of updates.  And, if a 15 - 30 minute polling interval is
> reasonable, why isn't 8 - 24 hours?

Well, for one, I'd like my ROAs to propagate in 1 or 2 hours. If I
need to wait 24 hours, this could cause operational issues for me
(the DDoS mitigation case above, for example, or just any other
normal routing change).

The entire RPKI system is designed to fail gracefully, so if you have
multiple failures and *all* your RTR servers go down, the worst case
is that the routes on the BGP routers turn NotFound, so you'd lose the
benefit of RPKI validation. It's *way* *way* more harmful to have
obsolete VRPs on your routers. If it's just a few hours, then the
impact will probably not be catastrophic. But what if it's 36 hours,
72 hours? What if the RPKI validation started failing 2 weeks ago,
when Jerry from IT ("the Linux guy") started his vacation?

On the other hand, if only one (of multiple) validator/RTR instances
has a problem and its number of VRPs slowly goes down, nothing will
happen at all on your routers, as they just use the union of the VRPs
from all configured RTR endpoints, and the VRPs from the broken RTR
server will slowly be withdrawn. Your routers will keep using the
healthy RTR servers, as opposed to considering erroneous data from a
poisoned RTR server.
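
As a toy illustration of that union behaviour (made-up prefixes and
ASNs, and a deliberately simplified model of what a router does):

    # Sketch only: a route stays covered as long as ANY configured RTR
    # cache still serves a matching VRP, i.e. the router validates
    # against the union of all caches.
    healthy_cache = {("192.0.2.0/24", 24, 65001),
                     ("198.51.100.0/24", 24, 65002)}
    broken_cache = {("192.0.2.0/24", 24, 65001)}  # stale, shrinking

    effective_vrps = healthy_cache | broken_cache
    print(("198.51.100.0/24", 24, 65002) in effective_vrps)  # still True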

I define stability not as "RTR session uptime and VRP count", but as
whether my BGP routers are making correct or incorrect decisions.


> I too prefer an approach where the validator and RTR are separate but
> co-habitated, but this naturally increases the possibility that the two
> might serve different data due to reachability, validator run-time, ....
> To what extent differences occur, I have not measured.
>
>
> [1] The NIST ROA graph confirms the rate of change is low, as I would
> expect.  But, I have no statistic for ROA stability, considering only
> the prefix and origin.

I don't see how the rate of global ROA changes is in any way related
to this issue. The operational issue a hung RTR endpoint creates for
other people's networks can't be measured with this.


lukas

