CloudFlare issues?

Jared Mauch jared at puck.nether.net
Tue Jun 25 00:57:29 UTC 2019



> On Jun 24, 2019, at 8:03 PM, Tom Beecher <beecher at beecher.cc> wrote:
> 
> Disclaimer : I am a Verizon employee via the Yahoo acquisition. I do not work on 701.  My comments are my own opinions only. 
> 
> Respectfully, I believe Cloudflare’s public comments today have been a real disservice. This blog post, and your CEO on Twitter today, took every opportunity to say “DAMN THOSE MORONS AT 701!”. They’re not. 

I presume that seeing a CF blog post isn’t a regular occurrence for you. :-)  Please read on.

> You are 100% right that 701 should have had some sort of protection mechanism in place to prevent this. But do we know they didn’t? Do we know it was there and just set up wrong? Did another change at another time break what was there? I used 701 many jobs ago and they absolutely had filtering in place; it saved my bacon when I screwed up once and started readvertising a full table from a 2nd provider. They smacked my session down and I got a nice call about it. 
> 
> You guys have repeatedly accused them of being dumb without even speaking to anyone yet from the sounds of it. Shouldn’t we be working on facts? 
> 
> Should they have been easier to reach once an issue was detected? Probably. They’re certainly not the first vendor to have a slow response time though. Seems like when an APAC carrier takes 18 hours to get back to us, we write it off as the cost of doing business. 
> 
> It also would have been nice, in my opinion, to take a harder stance on the BGP optimizer that generated the bogus routes, and the steel company that failed BGP 101 and just gladly reannounced one upstream to another. 701 is culpable for their mistakes, but there doesn’t seem to be much appetite to shame the other contributors. 
> 
> You’re right to use this as a lever to push for proper filtering, RPKI, best practices. I’m 100% behind that. We can all be a hell of a lot better at what we do. This stuff happens more than it should, but less than it could. 
> 
> But this industry is one big ass glass house. What’s that thing about stones again? 

I’m careful not to talk about the people impacted.  There were a lot of them: roughly 3-4% of the IP space was affected today, and I personally heard from more providers than can be counted on one hand about their impact.

Not everyone is going to write about their business impact in public.  I’m not authorized to speak for my employer about any impacts that we may have had (for example), but if 3-4% of the IP space was affected, statistically speaking there’s always a chance someone was impacted.

I do agree about the glass house thing.  There’s a lot of blame to go around, and today I’ve been telling people “go read _Normal Accidents_.”  That’s because sufficiently complex systems tend to have complex failures, where numerous safety systems or controls get bypassed.  Those of us with more than a few days of experience likely know what some of them are; we also don’t know whether those safety systems were disabled as part of debugging by one or more parties.  Who hasn’t dropped an ACL to debug why something isn’t working, or to see if that fixed the problem?

I don’t know what happened, but I sure know the symptoms and sets of fixes that the industry should apply and enforce.  I have been communicating some of them in public and many of them in private today, including offering help to other operators with how to implement some of the fixes.

It’s a bad day when someone changes your /16 into two /17s and sends them out regardless of whether the packets flow through or not.  These things aren’t new, nor do I expect things to be significantly better tomorrow.  I know people at VZ and suspect that once they woke up they did something about it.  I also know how hard it is to contact someone you don’t have a business relationship with.  A number of the larger providers have no way for a non-customer to phone, message, or open a ticket online about problems they may have.  Who knows, their ticket system may be in the cloud and may have been impacted as well.

What I do know is that if 3-4% of homes/structures were flooded or temporarily unusable because of some form of disaster or evacuation, people would be proposing better engineering methods or inspection techniques for those structures.

If you are a small network and just point default, there is nothing for you to see here and nothing you can do.  If you speak BGP with your upstream, you can filter out some of the bad routes.  You perhaps know that 1239, 3356 and others should only be seen directly from a network like 701, and can apply filters of this sort to avoid accepting those more-specifics (a rough sketch of that check follows).  I don’t believe 174 was the only network the routes went to, but they were one of the networks aside from 701 where I saw paths today.
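As an illustration of that AS_PATH check (my own sketch, not anything these networks actually run), here is what the logic looks like in Python.  The looks_like_leak() helper and the set of “big network” ASNs are names I picked for the example, and pulling the (prefix, AS path) pairs out of your router or collector is left out:

# ASNs of very large networks that you should only see directly behind a
# transit like 701, never buried deeper in a path learned from it.
BIG_NETWORKS = {1239, 3356, 2914, 3257}

def looks_like_leak(as_path, neighbor_asn=701):
    """Flag paths where a big network sits more than one hop behind the
    neighbor we learned the route from -- a common signature of a leak."""
    if not as_path or as_path[0] != neighbor_asn:
        return False            # only apply this check to routes from 701
    for position, asn in enumerate(as_path[1:], start=1):
        if asn in BIG_NETWORKS and position > 1:
            return True         # e.g. 701 <small AS> <small AS> 3356 ...
    return False

# Hypothetical paths (64496/64500 are documentation ASNs):
print(looks_like_leak([701, 64500, 64496, 3356]))   # True  -> likely a leak
print(looks_like_leak([701, 3356, 64496]))          # False -> normal peering path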

(Now the part where you as a 3rd party to this event can help!)

If you peer, build some pre-flight and post-flight scripts to check how many routes you are sending (see the sketch below).  Most router vendors support on-box scripting, or you can do a “show | display xml”, JSON, or some other structured output you can automate against.  AS_PATH filters are simple, low cost, and can help mitigate problems.  Consider monitoring your routes with a BMP server (pmacct has a great one!).  Set max-prefix (and monitor if you near the thresholds!).  Configure automatic restarts if you won’t be around to fix it.
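As an example of the pre-flight/post-flight idea, here is a minimal Python sketch.  How you collect the per-peer advertised-prefix counts is vendor specific (on-box scripting, NETCONF, “show ... | display json”, etc.); this assumes you have already dumped them into small JSON files of the form {"peer-ip": count, ...}, which is a file format I made up for the example:

#!/usr/bin/env python3
# Compare advertised-prefix counts per peer before and after a change and
# complain if any of them grew suspiciously or dropped to zero.
import json
import sys

GROWTH_LIMIT = 1.2   # refuse changes that grow any peer's count by >20%

def load_counts(path):
    with open(path) as fh:
        return json.load(fh)   # {"peer-ip": prefix_count, ...}

def main(pre_file, post_file):
    pre, post = load_counts(pre_file), load_counts(post_file)
    bad = False
    for peer, before in pre.items():
        after = post.get(peer, 0)
        if before and after > before * GROWTH_LIMIT:
            print(f"WARNING: {peer} went from {before} to {after} advertised prefixes")
            bad = True
        elif before and after == 0:
            print(f"WARNING: {peer} is no longer being sent any prefixes")
            bad = True
    sys.exit(1 if bad else 0)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])

Wire something like this into your maintenance procedure so a non-zero exit blocks or rolls back the change.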

I hate to say “automate all the things”, but at least start with monitoring so you know when things go bad.  Slack and other tools have great APIs, and you can have alerts sent to your systems telling you about problems.  Try hard to automate your debugging.  Monitor for announcements of your space.  The new RIS Live API lets you do this, and it’s super easy to spin something up (a sketch follows).
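To show how little code that takes, here is a sketch that subscribes to RIS Live for one prefix and pushes an alert to a Slack incoming webhook when the origin isn’t what you expect.  The prefix, ASN, and webhook URL are placeholders, and the field names are based on the published RIS Live message format, so check them against the current docs before relying on it:

#!/usr/bin/env python3
# Watch RIS Live for announcements of our space and alert via Slack.
import json
import requests                           # pip install requests
from websocket import create_connection   # pip install websocket-client

MY_PREFIX = "192.0.2.0/24"                # placeholder: your prefix
MY_ASN = 64500                            # placeholder: your origin ASN
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder

def alert(text):
    # Slack incoming webhooks accept a simple {"text": ...} JSON payload.
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10)

def main():
    ws = create_connection("wss://ris-live.ripe.net/v1/ws/?client=route-watch")
    # Ask for updates covering our prefix and any more-specifics of it.
    ws.send(json.dumps({
        "type": "ris_subscribe",
        "data": {"prefix": MY_PREFIX, "moreSpecific": True},
    }))
    while True:
        msg = json.loads(ws.recv())
        if msg.get("type") != "ris_message":
            continue
        data = msg["data"]
        path = data.get("path") or []
        origin = path[-1] if path else None   # note: can be an AS set (a list)
        if origin == MY_ASN:
            continue
        for ann in data.get("announcements", []):
            for prefix in ann.get("prefixes", []):
                alert(f"{prefix} seen with origin {origin}, path {path}, "
                      f"via peer AS{data.get('peer_asn')}")

if __name__ == "__main__":
    main()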

Hold your suppliers accountable as well.  If you are a customer of a network that was impacted or accepted these routes, ask for a formal RFO (reason for outage) and what the corrective actions are.  Don’t let them off the hook, as it will happen again.

If you are using route optimization technology, make double certain it’s not possible to leak routes.  Cisco IOS and Noction are two products that I either know, or have been told, don’t have safe settings enabled by default.  I learned early on, in the ’90s, the perils of having “everything on, unprotected” by default.  There were great bugs in software that allowed devices to be compromised at scale, creating cleanup problems comparable to what we’ve seen in recent years with IoT and other technologies.  Tell your vendors you want them to be secure by default, and vote with your personal and corporate wallet when you can.

It won’t always work; some vendors will not be able or willing to clean up their acts.  But unless we act together as an industry to clean up the glass inside our own homes, expect someone from the outside to come along at some point who can force it.  It may not even make sense (ask anyone who deals with security audit checklists), but you will be required to do it.

Please take action within your power at your company.  Stand up for what is right for everyone with this shared risk and threat.  You may not enjoy who the messenger is (or who is the loudest), but set that aside for the industry.

</soapbox>

- Jared

PS. We often call ourselves network engineers or architects.  If we are truly that, we should be using industry standards as building blocks to ensure a solid foundation.  Make sure your foundation is stable.  Learn from others’ mistakes to design and operate the best network feasible.

