Global Akamai Outage

Lukas Tribus lukas at
Mon Jul 26 12:20:39 UTC 2021


On Mon, 26 Jul 2021 at 11:40, Mark Tinka <mark at> wrote:
> I can count, on my hands, the number of RPKI-related outages that we
> have experienced, and all of them have turned out to be a
> misunderstanding of how ROA's work, either by customers or some other
> network on the Internet. The good news is that all of those cases were
> resolved within a few hours of notifying the affected party.

That's good, but the understanding of operational issues in RPKI
systems in the wild is underwhelming; we are bound to repeat the
mistakes of DNS all over again.

Yes, in theory a complete failure of an RTR server does not have
big negative effects on networks. But a failure of RPKI validation
behind a separate RTR server can leave outdated VRPs on the routers,
just as RTR server bugs can, which is why monitoring not only
availability but also whether the data is actually up to date is
*very* necessary.
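As a sketch of what such a freshness check can look like on the validator side (assuming the validator writes its export as JSON with a generation timestamp; the "generated" field name here is an assumption, adjust it to whatever your validator actually emits):

```python
import time

# Maximum acceptable age of the validator export before we alert,
# e.g. a few refresh cycles.
MAX_AGE_SECONDS = 4 * 3600

def check_freshness(metadata: dict, now: float) -> bool:
    """Return True if the validator export is fresh enough.

    `metadata` is assumed to carry a Unix timestamp under the key
    "generated" (the field name is an assumption; adjust it to your
    validator's JSON output).  A missing timestamp is treated as
    stale, since that is itself an alert condition.
    """
    generated = metadata.get("generated")
    if generated is None:
        return False
    return (now - generated) <= MAX_AGE_SECONDS

# Example: an export generated 10 minutes ago is fine,
# one generated 5 hours ago is stale.
fresh = {"generated": time.time() - 600}
stale = {"generated": time.time() - 5 * 3600}
```

This only covers the "data is outdated" failure mode; availability of the RTR endpoint itself still needs a separate check.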

Here are some examples (both from an operator's POV and actual failure scenarios):

> we are at fault for not deploying the validation service in a redundant
> setup and for failing at monitoring the service. But we did so because
> we thought it not to be too important, because a failed validation
> service should simply lead to no validation, not a crashed router.

In this case an RTR client bug crashed the router. But the point is
that it is not widely understood that setting up RPKI validators and
RTR servers is a serious endeavor, and that monitoring them is not
optional.

> we noticed that one of the ROAs was wrong. When I pulled output.json
> from octorpki (/output.json), it had the correct value. However when
> I ran rtrdump, it had a different ASN value for the prefix. Restarting
> the gortr process did fix it. Sending SIGHUP did not.

> yesterday we saw an unexpected ROA propagation delay.
> After updating a ROA in the RIPE lirportal, NTT, Telia and Cogent
> saw the update within an hour, but a specific rpki validator
> 3.1-2020. in a third-party network did not converge
> for more than 4 hours.

I wrote a naive Nagios script to check for stalled serials on an RTR server:

and talked about it in this blog post (shameless plug):

This is on the validation/network side. On the CA side, similar issues apply.

I believe it will still take a few high-profile outages caused by
insufficient reliability in RPKI stacks before people start taking
this seriously.

Some specific failure scenarios are currently being addressed, but
this doesn't make monitoring optional:

rpki-client 7.1 emits a new per-VRP attribute, "expires", which makes
it possible for RTR servers to stop serving outdated VRPs:

stayrtr (a gortr fork) will consider this attribute in the future:
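As a sketch of how an RTR server or monitoring tool could use that attribute to drop expired data (the "expires" field follows rpki-client's JSON output, a per-VRP Unix timestamp; keeping VRPs that lack the attribute is a policy choice made here for illustration):

```python
import time

def live_vrps(vrps, now=None):
    """Drop VRPs whose "expires" timestamp has passed.

    rpki-client 7.1+ emits "expires" as a Unix timestamp per VRP,
    derived from the underlying certificate/CRL/manifest validity.
    VRPs without the attribute are kept here, which is a sketch-level
    policy choice, not mandated behavior.
    """
    now = time.time() if now is None else now
    return [v for v in vrps if v.get("expires", now + 1) > now]
```

With this in place, a stalled validation pipeline degrades to "VRPs age out" instead of "routers keep acting on stale data indefinitely", but it still does not replace the monitoring described above.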


More information about the NANOG mailing list