plea for comcast/sprint handoff debug help

Alex Band alex at nlnetlabs.nl
Sat Oct 31 12:02:04 UTC 2020


Hi Tony,

I realise there are quite a few moving parts, so I'll try to summarise our design choices and reasoning as clearly as possible.

Rsync was the original transport for RPKI and is still mandatory to implement. RRDP (which uses HTTPS) was introduced to overcome some of the shortcomings of rsync. Right now, all five RIRs make their Trust Anchors available over HTTPS, all but two RPKI repositories support RRDP, and all but one relying party software package supports RRDP. There is currently an IETF draft to deprecate the use of rsync.

As a result, the bulk of RPKI traffic is currently transported over RRDP and only a small amount relies on rsync. For example, our RPKI repository is configured accordingly: rrdp.rpki.nlnetlabs.nl is served by a CDN and rsync.rpki.nlnetlabs.nl runs rsyncd on a simple, small VM to deal with the remaining traffic. When operators deploying our Krill Delegated RPKI software ask us what to expect and how to provision their services, this is how we explain the current state of affairs.
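
For what it's worth, the rsync side of such a setup needs very little. A minimal sketch of an rsyncd.conf for this purpose could look like the following; the module name, path and limits are illustrative placeholders, not our actual configuration:

  uid = nobody
  gid = nogroup
  max connections = 50

  # The module that the repository's rsync URIs point at,
  # e.g. rsync://rsync.example.net/repo/
  [repo]
      path = /var/lib/rpki/repository
      comment = RPKI repository (rsync fallback)
      read only = yes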

With this in mind, Routinator currently uses the following fetching strategy (a rough sketch in code follows the list):

1. It starts by connecting to the Trust Anchors of the RIRs over HTTPS if possible, and otherwise uses rsync.
2. It then follows the certificate tree, encountering several pointers to publication servers along the way. A publication point can offer an rsync pointer only, or two pointers: one to rsync and one to RRDP.
3. If an RRDP pointer is found, Routinator will try to connect to the service and verify that it presents a valid TLS certificate and that data can be fetched successfully. If so, the server is marked as usable and Routinator will prefer it. If the initial check fails, Routinator will use rsync, but will check again whether RRDP works on the next validation run.
4. If RRDP worked before but is unavailable for any reason, Routinator will use cached data and try again on the next run instead of immediately falling back to rsync.
5. If the RPKI publication server operator removes the pointer to RRDP, to indicate they no longer offer this protocol, Routinator will use rsync.
6. If Routinator's cache is cleared, the process starts fresh.
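
To make that a bit more concrete, here is a minimal, self-contained sketch of the transport selection described in steps 3 to 5. It is an illustration only, not Routinator's actual code, and all names in it are made up for the example:

  // Sketch of the RRDP/rsync selection described above; illustrative only,
  // not Routinator's actual implementation.
  enum Transport {
      Rrdp,          // fetch fresh data over RRDP (HTTPS)
      UseCachedData, // RRDP preferred but down: reuse cache, retry next run
      Rsync,         // fall back to, or stay on, rsync
  }

  fn select_transport(
      rrdp_advertised: bool,    // does the publication point have an RRDP pointer?
      rrdp_reachable_now: bool, // did this run's connection/TLS/fetch check succeed?
      rrdp_worked_before: bool, // was the server marked usable on an earlier run?
  ) -> Transport {
      if !rrdp_advertised {
          // Step 5: no (or no longer an) RRDP pointer, so use rsync.
          return Transport::Rsync;
      }
      if rrdp_reachable_now {
          // Step 3: RRDP checks out; mark it usable and prefer it.
          return Transport::Rrdp;
      }
      if rrdp_worked_before {
          // Step 4: treat the outage as transient; reuse cached data and
          // try RRDP again on the next validation run.
          return Transport::UseCachedData;
      }
      // Step 3, failure case: RRDP has never worked here; fall back to rsync
      // and check RRDP again on the next run.
      Transport::Rsync
  }

  fn main() {
      // Example: RRDP is advertised and worked before, but is down right now.
      match select_transport(true, false, true) {
          Transport::Rrdp => println!("fetch over RRDP"),
          Transport::UseCachedData => println!("reuse cached data, retry RRDP next run"),
          Transport::Rsync => println!("fall back to rsync"),
      }
  }

The rrdp_worked_before branch is where the "transient outage" assumption discussed below comes into play.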

This strategy was implemented with repository server provisioning in mind. We assume that if you actively indicate that you offer RRDP, you actually provide a monitored service there. As such, an outage is assumed to be transient in nature. Routinator could fall back immediately, of course. But our thinking was that if the RRDP service had a small hiccup, the 1,000+ Routinator instances currently deployed would all be hammering a possibly under-provisioned rsync server, perhaps causing even more problems for the operator.

"Transient" is currently the focus. In Randy's experiment, he is actively advertising he offers RRDP, but doesn't offer a service there for weeks at a time. As I write this, ca.rg.net. cb.rg.net and cc.rg.net have been returning a 404 on their RRDP endpoint several weeks and counting. cc.rg.net was unavailable over rsync for several days this week as well. 

I would assume this is not how operators would normally run their RPKI publication server. Not having an RRDP service for weeks while advertising that you do is fine for an experiment, but constitutes pretty bad operational practice for a production network. If a service becomes unavailable, the operator would swiftly be contacted and the issue would be resolved, as Randy and I have done in happier times:

https://twitter.com/alexander_band/status/1209365918624755712
https://twitter.com/enoclue/status/1209933106720829440

On a personal note, I realise the situation has a dumpster-fire feel to it. I contacted Randy about his outages months ago, not knowing they were a research project, and never got a reply. Instead of discussing his research and the observed effects, it feels like a 'gotcha' to present the findings in this way. It could even be considered irresponsible, if the fallout is as bad as he claims. The notion that using our software is, quote, "a disaster waiting to happen", is disingenuous at best:

https://www.ripe.net/ripe/mail/archives/members-discuss/2020-September/004239.html

Routinator's design tries to deal with outages in a manner that is responsible towards all actors involved. Again, of course we can change our strategy as a result of this discussion, which I'm happy we're now actually having. In that case I would advise operators who offer an RPKI publication server to provision their rsyncd service so that it can handle all of the traffic that their RRDP service normally handles, in case RRDP has a glitch. And even if operators do scale their rsync service accordingly, they will only find out whether it actually copes in a time of crisis.

Kind regards,

-Alex

> On 31 Oct 2020, at 07:17, Tony Tauber <ttauber at 1-4-5.net> wrote:
> 
> As I've pointed out to Randy and others and I'll share here.
> We planned, but hadn't yet upgraded our Routinator RP (Relying Party) software to the latest v0.8 which I knew had some improvements.
> I assumed the problems we were seeing would be fixed by the upgrade.
> Indeed, when I pulled down the new SW to a test machine, loaded and ran it, I could get both Randy's ROAs.
> I figured I was good to go.  
> Then we upgraded the prod machine to the new version and the problem persisted.
> An hour or two of analysis made me realize that the "stickiness" of a particular PP (Publication Point) is encoded in the cache filesystem.
> Routinator seems to build entries in its cache directory under either rsync, rrdp, or http and the rg.net PPs weren’t showing under rsync but moving the cache directory aside and forcing it to rebuild fixed the issue.
> 
> A couple of points seem to follow:
> 	• Randy says: "finding the fort rp to be pretty solid!"  I'll say that if you loaded a fresh Fort and fresh Routinator install, they would both have your ROAs.
> 	• The sense of "stickiness" is local only; hence to my mind the protection against "downgrade" attack is somewhat illusory. A fresh install knows nothing of history.
> Tony
> 
> On Fri, Oct 30, 2020 at 11:57 PM Randy Bush <randy at psg.com> wrote:
> > If there is a covering less specific ROA issued by a parent, this will
> > then result in RPKI invalid routes.
> 
> i.e. the upstream kills the customer.  not a wise business model.
> 
> > The fall-back may help in cases where there is an accidental outage of
> > the RRDP server (for as long as the rsync servers can deal with the
> > load)
> 
> folk try different software, try different configurations, realize that
> having their CA gooey exposed because they wanted to serve rrdp and
> block, ...
> 
> randy, finding the fort rp to be pretty solid!


