BGP route hijack by AS10990
baldur.norddahl at gmail.com
Mon Aug 3 15:09:07 UTC 2020
On Mon, Aug 3, 2020 at 3:54 PM Job Snijders <job at ntt.net> wrote:
> On Mon, Aug 03, 2020 at 02:36:25PM +0200, Alex Band wrote:
> > According to the information I received from the community, you
> > should read PR1461602 and PR1309944 before deploying.
> >  https://rpki.readthedocs.io/en/latest/rpki/router-support.html
> My take on PR1461602 is that it can be ignored, as it appears to only
> manifest itself in a mostly cosmetic way: initial RTR session
> establishment takes multiple minutes, but once RTR sessions are up
> things work smoothly.
> Under no circumstances should you enable RPKI ROV functionality on boxes
> that suffer from PR1309944. That one is a real showstopper.
We suffered a series of crashes that led JTAC to recommend disabling
RPKI. We had a core dump that matches PR1332626, which is confidential, so
I have no idea what it is about. Apparently the server running the RPKI
validation service rebooted, and the service was not configured to restart
automatically. We also had no redundancy and no monitoring for it. So we
had no working RPKI validation server, and that apparently caused the MX204
to become unstable in various ways. It might run for a day, but it would
suffer packet loss and delays and generally be "strange". The first crash
took down BGP, ssh and subscriber management, while LDP, OSPF and SNMP
stayed up. It became a black hole we could not log in to: the worst
possible kind of crash for a router. We had to go onsite and pull the power.
The router appears to run fine after disabling RPKI. I suppose starting the
validation service might also fix the issue, but I am not going to go there
until I know what is in that PR. I also feel the RPKI function needs to be
failsafe before we can use it. I know we are at fault for not deploying
the validation service in a redundant setup and for failing to monitor
it. But we did so because we did not think it was that important: a failed
validation service should simply lead to no validation, not a crashed
router.
This is on JUNOS 20.1R1.11.
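
For anyone wanting the kind of basic monitoring that was missing here, below
is a minimal sketch of a reachability check for RTR servers. It only tests
that the TCP port accepts connections; it does not speak the RTR protocol or
verify that the validator is serving current data. The function names and the
addresses in the usage note are hypothetical, and the port depends on how the
validator is configured.

```python
import socket


def rtr_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    This is only a liveness probe for the listening socket. It does not
    exchange any RTR PDUs with the server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False


def check_validators(validators, timeout: float = 3.0) -> dict:
    """Map each (host, port) pair to its reachability result."""
    return {
        (host, port): rtr_port_open(host, port, timeout)
        for host, port in validators
    }
```

Calling check_validators([("192.0.2.10", 3323), ("192.0.2.11", 3323)]) from
cron or an existing monitoring system, and alerting when any value is False,
would have flagged the stopped service. The addresses and port in that call
are placeholders, not our setup.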