2006.06.05 NANOG-NOTES BGP tools BOF notes

Matthew Petach mpetach at netflight.com
Tue Jun 6 11:15:46 UTC 2006


(ok, last set of notes for tonight, and then it's off to bed for 90
minutes of sleep before heading back to the convention center.  ^_^;  --MNP)


2006.06.05 Welcome to the 4th BGP Tools BOF!
[slides are at
http://www.nanog.org/mtg-0606/pdf/lixia-zhang.pdf]

Nick Feamster, Georgia Tech
Dan Massey, CSU
Mohit Lad and Lixia Zhang, UCLA

The Goal
sharing some tools developed from our research
efforts.
hopefully will be useful for operations community.
Also to collect input on new tools we would like
to see so they can develop them.

Routing Configuration Checker
Nick Feamster

O-BGP data organization tool
Dan Massey
[slides are at
http://www.nanog.org/mtg-0606/pdf/dan-massey.pdf]

The Datapository by Nick Feamster
[I'm sorry, that just sounds *far* too much like something
you do *NOT* want your bedside nurse administering...--MNP]

Visualizing BGP dynamics using Link-Rank by
Mohit Lad

Open discussions and demos

Nick Feamster
Network Troubleshooting: rcc and beyond

rcc: router configuration checker
proactive routing configuration analysis
idea: analyze configs before deployment
many faults can be detected with static analysis.

rcc implementation.
http://nms.csail.mit.edu/rcc/

preprocessor -> parser -> relational database (MySQL),
 constraints <-> verifier <-> faults

verifier is a template checker and set of constraints
your configs are checked against.

He's looking for GUI developers.
very bare-bones command line right now.

Parsing configurations--shows some output.

He shows examples of the Abilene configs, which
are non-anonymized.
show all routers peering with a given AS, can look
at route maps in each direction, etc.

After running rcc on it, you get a web output
which shows relationships--oh, pictures don't matter,
with some more grease could be a reasonable representation
of your network.

Q: Randy Bush asks if it could show which peering
sessions are missing?
A: Not yet, but it could be added, thank you!

Shows processing and errors;
you get a page that summarizes the things RCC thinks
are errors.

Signalling partition?  that's a missing iBGP session;
he needs some better lingo in places.
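
A minimal sketch of the "missing iBGP session" check, to make the
idea concrete.  This is not rcc's code; it assumes a plain full-mesh
iBGP design (no route reflectors or confederations), and the router
names and session list are made up for illustration.

```python
from itertools import combinations

def missing_ibgp_sessions(routers, sessions):
    """Return router pairs with no iBGP session, assuming a full mesh is intended."""
    # Sessions are undirected, so store each pair as a frozenset.
    configured = {frozenset(pair) for pair in sessions}
    return [
        (a, b)
        for a, b in combinations(sorted(routers), 2)
        if frozenset((a, b)) not in configured
    ]

# Hypothetical routers and configured sessions (one pair left out on purpose).
routers = ["atla", "chin", "dnvr", "hstn"]
sessions = [
    ("atla", "chin"), ("atla", "dnvr"), ("atla", "hstn"),
    ("chin", "dnvr"), ("dnvr", "hstn"),
]

print(missing_ibgp_sessions(routers, sessions))
```

A real checker would have to parse the neighbor statements out of the
configs first, which is the part the rcc parser and database handle.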

Also shows anomalous imports, could be intended for
traffic engineering; that's "inconsistent policy"
in ISP speak.

Some of the names will get fixed to make Randy Bush
happy.

Yes, but surprises happen!
link failures
node failures
traffic volumes shift
network devices "wedged"
...

two problems
 detection
 localization

Need to marry static config analysis with dynamic
information (route is configured but isn't in the
dynamic table)

he skips a closer look, just some jargon.

Detection: analyze routing dynamics;
drill down on interesting operational issues.
idea: routers exhibit correlated behaviour
blips across signals may be more operationally
interesting than any spike in one signalling system.
How do you spot things in the churn?

Detection: three types of events
 single-router bursts
 correlated bursts
 multi-router bursts <---common; and commonly missed
                         using simple thresholds
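
A rough sketch of why a simple per-router threshold misses the
multi-router case.  The talk did not give the actual detection
method; the thresholds, the time bins, and the two category names
below are assumptions chosen only to show the effect.

```python
def classify_bins(counts, per_router_threshold, aggregate_threshold):
    """counts maps time_bin -> {router: number of updates seen in that bin}."""
    events = {}
    for time_bin, per_router in sorted(counts.items()):
        loud = [r for r, n in per_router.items() if n >= per_router_threshold]
        total = sum(per_router.values())
        if loud:
            # At least one router crossed its own threshold.
            events[time_bin] = "per-router"
        elif total >= aggregate_threshold:
            # No single router looks unusual, but the sum across routers does.
            events[time_bin] = "network-wide only"
    return events

# Hypothetical update counts for three routers over three time bins.
counts = {
    0: {"r1": 5, "r2": 4, "r3": 6},
    1: {"r1": 120, "r2": 3, "r3": 5},
    2: {"r1": 40, "r2": 45, "r3": 38},
}

print(classify_bins(counts, per_router_threshold=100, aggregate_threshold=100))
```

Bin 2 is the interesting one: every router stays under its own
threshold, so only the sum across routers flags it.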

Localization: joint dynamic/static
which routers are "border routers" for that burst
topological properties of routers in the burst.

proactive analysis -> deployment -> dynamic ->
  reactive detection -> diagnosis/correction -> static ->

By going back to the configs, lets you see if it's
something happening inside the network, or on the edge.

Specific Focus: firewall configuration
difficult to understand and audit configs

subject to continual modifications
  roughly 1-2 touches per day

federated policy, distributed dependencies
 each department has independent policies
 local changes may affect global behaviour

(These are pulled from Georgia Tech; 130 firewall
configs.  Builds static connectivity matrix.)

Reactive monitoring...use probes from subnets to
verify reachability/connectivity.

(immediate) open issues
reachability and reliability of controller
service-level probes
 diagnostic tools != service-level happiness
policy conformance.

Q: can it give suggested remediation, or provide
config templates for new routers being added?
A:  Good idea!


OK, over to next presenter.  Helps with understanding
BGP data.

BGP data collection and organization (OBGP) Tool
Colorado State University/University of Arizona/UCLA

BGP data collection
takes lots of BGP data, from RIPE RIS, etc.
ISP BGP peer router -> update oreg -> rib+update ->

feeds into gigabytes of data, different formats,
potential errors enter in, and severe lack of metadata.

Other tools can use it, LinkRank, BGP-Inspect, and a
bunch of people cite it in reports and research.

OBGP motivation
Large Volume of Data
 data from many sources (RIPE, RV, private data)
 Long time scales and very recent (real-time?) data
Slightly different formats
 RIPE/RV use different naming conventions
 different dump intervals
 different timezones for older data
Lack of MetaData
 would like to only see desired peers and desired update
  types
Possible errors in the data
 are updates missing due to log errors?
 what is lost due to session failures?

So, OBGP is the "thing" in the middle.
A simple perl script called oBGP that simplifies
data.

Features:
Uniform data organization
 consistent and easy to use for scripts
consistent view of multiple monitoring points
annotations/labels
 can be stripped, help locate useful data easily
table transfer detection
 distinguish updates from data collection peering
Data inconsistency detection and correction
 understand and fix possible data errors

Uniform data organization
Uniform naming and organization conventions for all
  monitoring points
RIB and update data split by peer
One rib and update file per peer per day,
 dumped at beginning of the day.

Labels and Annotations are more interesting
Existing format labels update as
 announce (A) or Withdraw (W)
 also includes some STATE messages
OBGP enhances the labels
 Adds a status message
 Adds an update type
 More STATE messages
  route table dump
  table transfers

A:INC:DPATH
(shows it's an announcement, it's incremental, and
it's updating the destination path)

OBGP Added labels
|<original update type>:<status info>:<OBGP update type>|

<original update type>
 add E for error correction
<status info>
 INC incremental update
 TT table transfer update
 RIB: correction update
<OBGP update type>
 new announcement
 duplicate announcement
 change in AS path (DPATH)
 change in other attribute (not ASpath)
 withdraw
 duplicate withdraw

If you don't need this, it's just a few extra characters
in your log; but could be useful.
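
A small sketch of splitting such a label into its three fields.
It assumes the label is exactly three colon-separated fields,
optionally wrapped in "|" as on the slide; the class and function
names are invented here and are not part of OBGP.

```python
from typing import NamedTuple

class ObgpLabel(NamedTuple):
    original_type: str   # e.g. A or W from the original log
    status: str          # e.g. INC or TT
    update_type: str     # e.g. DPATH

def parse_label(label):
    """Split 'A:INC:DPATH' (optionally wrapped in '|') into its three fields."""
    parts = label.strip().strip("|").split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three colon-separated fields, got {label!r}")
    return ObgpLabel(*parts)

print(parse_label("A:INC:DPATH"))
```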

Using Labels to filter data
example: find suballocation hijacks
Only need new announcements and withdraws
 so 83% of the update data can be ignored.
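
A sketch of that kind of filter.  The notes only give DPATH as a
concrete update-type string, so "NEW", "WD" and "DUP" below are
placeholder names, and the position of the label as the first
"|"-separated field on each line is also an assumption.

```python
def filter_by_update_type(lines, wanted):
    """Keep only lines whose label's third field is in `wanted`."""
    kept = []
    for line in lines:
        label = line.split("|", 1)[0]      # assumed: label is the first field
        fields = label.split(":")
        if len(fields) == 3 and fields[2] in wanted:
            kept.append(line)
    return kept

# Hypothetical log lines; only the label layout follows the slide.
sample = [
    "A:INC:NEW|192.0.2.0/24",
    "A:INC:DPATH|198.51.100.0/24",
    "W:INC:WD|192.0.2.0/24",
    "A:TT:DUP|203.0.113.0/24",
]

kept = filter_by_update_type(sample, wanted={"NEW", "WD"})
print(len(kept), "of", len(sample), "lines kept")
```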

Is the collected data accurate?
May lose updates due to data collection errors
 start with an accurate RIB
 apply updates in log
 should match the next RIB dumped by the router
  modulo some race conditions near dump time
 does this clearly work with RouteViews?
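
The check described above can be sketched as a replay plus a diff.
This is an illustration, not OBGP's implementation; the data layout
(prefix -> AS path, updates as (kind, prefix, path) tuples) is
assumed, and it ignores the race conditions near dump time.

```python
def replay(rib, updates):
    """Apply announce (A) / withdraw (W) updates to a copy of the RIB."""
    table = dict(rib)
    for kind, prefix, as_path in updates:
        if kind == "A":
            table[prefix] = as_path
        elif kind == "W":
            table.pop(prefix, None)
    return table

def inconsistencies(expected, dumped):
    """Prefixes whose replayed state differs from the next RIB dump."""
    return sorted(
        p for p in expected.keys() | dumped.keys()
        if expected.get(p) != dumped.get(p)
    )

# Hypothetical data using documentation prefixes and private AS numbers.
rib = {"192.0.2.0/24": (65001, 65002), "198.51.100.0/24": (65001,)}
updates = [("A", "203.0.113.0/24", (65003,)), ("W", "198.51.100.0/24", None)]
next_rib = {"192.0.2.0/24": (65001, 65004), "203.0.113.0/24": (65003,)}

print(inconsistencies(replay(rib, updates), next_rib))
```

Here the path for 192.0.2.0/24 changed without a logged update, which
is the kind of gap that would show up as an inconsistency.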

85 of 111 peers from RV suffered inconsistencies in
2006 May
About 25 were rock solid right on.

One peer had 378,998 inconsistencies in one day.

Is this evenly distributed?  Not really.

Inconsistencies and session failures
session down: RIB-IN drops to empty
session up: table transfer
(failure to recognize a session dropping)

look for table transfer, can estimate where
sessions went down and came back up.
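
One crude way to look for a table transfer, as a sketch only.  The
notes do not say how OBGP detects them; the heuristic here (a fixed
time bin in which most of the table gets announced) and its window
and fraction parameters are assumptions.

```python
from collections import defaultdict

def find_table_transfers(announcements, rib_size, window=300, fraction=0.9):
    """announcements: iterable of (timestamp_seconds, prefix).

    Returns start times of bins in which at least `fraction` of
    `rib_size` distinct prefixes were announced.
    """
    seen = defaultdict(set)
    for ts, prefix in announcements:
        seen[ts // window].add(prefix)
    return [
        bin_id * window
        for bin_id, prefixes in sorted(seen.items())
        if len(prefixes) >= fraction * rib_size
    ]

# Hypothetical feed: a 4-prefix table re-announced around t=700.
announcements = [
    (10, "192.0.2.0/24"),
    (700, "192.0.2.0/24"), (710, "198.51.100.0/24"),
    (720, "203.0.113.0/24"), (730, "10.0.0.0/8"),
    (1500, "192.0.2.0/24"),
]

print(find_table_transfers(announcements, rib_size=4))
```

The start of a flagged bin is then a rough estimate of when the
session came back up.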

How long does an error persist?
Lifetime of correction updates can last 43 days!
If you miss an update, you can have bad data for
a long, LONG time!!

Correction updates added by OBGP
E:RIB updates; figure a change in RIB had to happen
due to a routing update that was missed.

Summary:
consistent format
adds label to easily sort and limit
adds additional state messages
identifies and corrects update error messages

http://netsec.cs.colostate.edu/tools.html
[NOTE URL at end of slide deck is WRONG --MNP]

If you're using RouteViews or RIPE RIS, consider
using this tool, and give feedback!

Randy is using it to check propagation of his
prefixes, and for research.

RIPE NCC--performance of these tools?  With
multiple collectors, perl didn't scale.

Perl is mainly demonstration.
He pulls data from RIPE and has it stored, hopes
to make it public some day.
he has stacks of disks with text format data for
easy search; considering binary format for it.

Randy--on that subject: Matt Rowan, he's spent
half his life getting the data out of the system;
making it into a funny format and sticking it back in.
Disk is cheap!  Look at raw data.  With binary data,
what tools are there?  Hard enough to look at router
configs.
One tool to look at binary data, lots of tools to
look at text.

Q: Matt asks how much space it takes to store data
A: Takes about 1TB to store all the RIS data.

Q: Are they planning to make it available to the public?
A: Well, he'd like to host it at route-views or ripe,
rather than create a new site.
How long does it take to process the RIPE data?
Need a fast CPU, will take a couple of days to
process the data.

Q: can it deal with live updates?  It can keep up
with route-views and RIPE, but that's not live;
there is a lag; route-views is every 15 minutes.
The update files sometimes take 8 hour lags to show
up on the site.

The Datapository
Nick Feamster and David Anderson?
Architecture:
raw data -> compute engines-> storage and DB plus
archival storage ->analysis.

Very alpha right now
datapository.net
NOT realtime!  inserting data in greedy approach;
when he needs it, he inserts it, and starts running
queries.
You can see a list of feeds, he has abilene but not
route-views yet.

Can restrict it, look at neighbor ASes, etc.
see it in graphical form, or list form
can diagnose issues,
has an XML query engine and output for programmatically
accessing it.

If you use matlab, could be interesting to throw this
into a multidimensional time series.

Randy Bush notes all his tools take MRT output.

Oh, he can spit out sparse matrices

He could spit out MRT format; he has python that
speaks MRT format.

he'll look at adding that.

Do spammers hijack BGP routes?
Theory:
 1 announce BGP route for mail server
 2 send lots of spam
 3 withdraw route, becoming invisible
reality?  let's check!
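
A sketch of how one might look for that announce/withdraw pattern in
update data.  This is not the Datapository query; the event layout
and the one-hour cutoff are assumptions, and a short-lived route on
its own proves nothing without matching spam logs.

```python
def short_lived_routes(events, max_lifetime):
    """events: (timestamp, kind, prefix) tuples sorted by time, kind 'A' or 'W'.

    Returns (prefix, lifetime) for routes withdrawn within max_lifetime
    seconds of their first announcement.
    """
    announced_at = {}
    hits = []
    for ts, kind, prefix in events:
        if kind == "A":
            # Keep the first announcement time; later re-announcements are ignored.
            announced_at.setdefault(prefix, ts)
        elif kind == "W" and prefix in announced_at:
            lifetime = ts - announced_at.pop(prefix)
            if lifetime <= max_lifetime:
                hits.append((prefix, lifetime))
    return hits

# Hypothetical events (seconds since start of log).
events = [
    (0, "A", "192.0.2.0/24"),
    (100, "A", "198.51.100.0/24"),
    (600, "W", "198.51.100.0/24"),
    (90000, "W", "192.0.2.0/24"),
]

print(short_lived_routes(events, max_lifetime=3600))
```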

export formats
Web interface
XML/RPC
 text-based output
 programmatic interface
 output to matlab

and per Randy Bush, MRT format would be good too!

BGP-Inspect vs this tool?
this has additional datasets beside BGP, like active
probes, traffic, etc.  This has a better collection
setup as well; unified formats.

Mohit will do the last one; Link-Rank shows the dynamics

Visualizing BGP dynamics with Link-Rank
constructing rank-change graphs
closest to BGPlay.
weight is number of prefixes reached across that
link.
weight changes are on specific links, can do easy
root-cause analysis.
Activity bar--routing activity across time.
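
A sketch of the weight and rank-change idea as described above: the
weight of a link is the number of prefixes whose AS path crosses it,
and the rank change is the difference between two snapshots.  This is
not the Link-Rank code; the data layout and AS numbers are made up.

```python
from collections import Counter

def link_weights(routes):
    """routes: prefix -> AS path tuple (observer first).  Weight = prefixes per link."""
    weights = Counter()
    for as_path in routes.values():
        # set() so a link is counted once per prefix; skip prepending (A, A).
        for left, right in set(zip(as_path, as_path[1:])):
            if left != right:
                weights[(left, right)] += 1
    return weights

def rank_change(before, after):
    """Per-link gain (positive) or loss (negative) between two snapshots."""
    wb, wa = link_weights(before), link_weights(after)
    return {
        link: wa[link] - wb[link]
        for link in wb.keys() | wa.keys()
        if wa[link] != wb[link]
    }

# Hypothetical snapshots: one prefix moves from the 64501 path to the 64502 path.
before = {"192.0.2.0/24": (64500, 64501, 64510),
          "198.51.100.0/24": (64500, 64501, 64510)}
after = {"192.0.2.0/24": (64500, 64502, 64510),
         "198.51.100.0/24": (64500, 64501, 64510)}

for link, delta in sorted(rank_change(before, after).items()):
    print(link, delta)
```

Positive deltas correspond to the green (gaining) links and negative
deltas to the red (losing) links in the visualization.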

http://linkrank.cs.ucla.edu/
green shows gains, red shows losses,
sums all gains, sums all losses.

visualization graph of where prefixes gained and lost.
again, green are links that gain, red are links that lost.
other observation points highlighted in orange

May 23rd, instability
293 flapping from 1239 to 3356

for multiple observation points, dashes are lost,
solids are gains.

highlight sources and sinks
cutting one link explains most of the errors
3561 to 4134 link issue

case II, one node that sucks in all the flows,
no single link, 3356

Ongoing work,
automated root cause identification
 min-cut scheme
characterizing


Can look at destination link-rank graphs, see how the
rest of the internet changes going to you.
connectivity issues to 7018

should show your prefix hijackings moving from one
link to another.

Could simplify this to BGPlay if you wanted.

open source
http://linkrank.cs.ucla.edu/
current version
 download client, configure...

Future version
 work with any BGP data

Q: Matt asks when we'll be able to use our own BGP data;
A: about 4-5 months, hopefully!
Haven't looked at netflow yet.

Q: Randy Bush.  Common problem we all face.  I'm at 42
peering points; my neighbors are X.  I have route views
dumps, I have my BGP dumps.  I have my netflow data.
Want a whatifatron that shows what happens to my
traffic if I depeer someone, or add someone, or
peer with SingTel in singapore, or stop peering
with Joe in SF.
That's a question many operators ask every day.
A: Matt notes that if they can solve that question/write
something that does all that, they'll have Arbor and
others beating on their door.  ^_^

Panel wraps up at 1728 hours Pacific time.


