monitoring software design, was Re: [cacti-announce] Cacti 0.8.6j Released (fwd)

Travis H. travis+ml-nanog at subspacefield.org
Thu Jan 25 23:51:50 UTC 2007


On Wed, Jan 24, 2007 at 08:05:24PM +0000, Paul Vixie wrote:
> glibly said, sir.  but i disastrously underestimated the amount of time
> and money it would take to build BIND9.

While I can't question your credentials in building serious network
infrastructure, I wonder about the comparison between BIND9 and the
kind of network monitoring framework that _I_ envision.  I can think
of a couple of requirement handicaps that BIND9 had which a new tool
wouldn't.

Specifically, you have to ensure compatibility with the RFCs, which
locks you into a fairly complicated parser for the least-writable data
format (the zone file) that I have ever had the displeasure of
editing.  While it gets easier over time, it seems remarkably
difficult to get right the first time.  Mostly, people forget to
update the serial number, but other problems are common too.
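
Just to illustrate the serial-number trap, here is a made-up SOA
fragment (names and numbers are placeholders, not a real zone):

    ; hypothetical example zone, for illustration only
    $TTL 86400
    example.com.  IN  SOA  ns1.example.com. hostmaster.example.com. (
                      2007012501 ; serial -- must be bumped on every edit
                      3600       ; refresh
                      900        ; retry
                      604800     ; expire
                      86400 )    ; negative-cache TTL
    example.com.  IN  NS   ns1.example.com.

Forget to bump that serial and the secondaries silently keep serving
the old data, which is exactly the kind of mistake a parser can't
catch for you.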

I imagine you also wanted to maintain the overall structure of
the config file, but I don't see this as particularly problematic;
it seems straightforward enough to me.

Furthermore, there is the monolithic issue; while I find it very
convenient to have two name servers instead of four on my home
network, it seems that BIND9 is serving too many masters (pun not
intended).  If recursive queries and queries to authoritative name
servers used different ports, there would be little reason to have
both in the same package.  I can solve the split easily right now with
IP aliases, which I consider a kludge, but the package I would use for
it doesn't support some things that would be nice, like dynamic
updates; I suppose those too could be split off fairly easily.
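
To sketch what the IP-alias kludge could look like (the addresses and
the exact option set here are hypothetical; two separate named
instances, each bound to its own alias):

    // hypothetical addresses -- authoritative-only instance
    options {
        listen-on { 192.0.2.1; };
        recursion no;
    };

    // recursive-only instance, separate process and config
    options {
        listen-on { 192.0.2.2; };
        recursion yes;
        allow-recursion { 192.0.2.0/24; };
    };

It works, but you end up running the same monolithic package twice
just to keep the two roles apart.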

Everybody I know who would have a use for a scalable monitoring system
is capable of scripting, and most are capable of programming to extend
the framework.  I suspect an attempt to anticipate every possible need
and solve them all at once with one tool would grow to unmanageable
complexity far too quickly.

A framework is the easy part.  At the URL in my signature you can find
the dynamic firewall daemon, a framework for dynamically adjusting
firewall rules.  It has an asynchronous I/O core, so one thread, one
program, one firewall, many clients.  There is a Python version for
netfilter/Linux (which is very alpha and needs a new maintainer) and
one for BSD (pf, of course).  It supports fixed-size rule queues,
rules that expire at a particular time (which can be relative to the
current time), rule sets that can be enabled or disabled by commands,
variable substitution (where "variable" means modifiable by external
programs), and so on, without requiring chains, tables, or lists in
the firewall syntax.
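
The core idea is small enough to sketch.  Here's a toy version in
modern Python, purely for illustration: the asyncio plumbing, port,
and command format are placeholders rather than the real dfd
interface, and the "apply" step just prints instead of talking to pf
or netfilter:

    # toy sketch, not the real dfd: port and command format are made up
    import asyncio, time

    MAX_RULES = 100                  # fixed-size rule queue
    rules = []                       # (expiry_timestamp, rule_text) pairs

    def apply_rules():
        # placeholder: a real daemon would render these into pf/netfilter
        active = [r for t, r in rules if t > time.time()]
        print("active rules:", active)

    async def handle_client(reader, writer):
        # many clients, one thread: each connection is just a coroutine
        while line := await reader.readline():
            cmd, _, arg = line.decode().strip().partition(" ")
            if cmd == "add":         # e.g. "add 300 block from 203.0.113.7"
                ttl, _, rule = arg.partition(" ")
                rules.append((time.time() + float(ttl), rule))
                del rules[:-MAX_RULES]    # oldest entries fall off the queue
                apply_rules()
            writer.write(b"ok\n")
            await writer.drain()

    async def main():
        server = await asyncio.start_server(handle_client, "127.0.0.1", 9099)
        async with server:
            await server.serve_forever()

    asyncio.run(main())

The real thing adds rule sets, variable substitution and per-queue
limits, but the single-threaded, many-client shape is the same.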

Although I spent a lot of time on design in my head, writing the code
was the easy part; it took about a thousand lines.  I could probably
do it in less than 40 hours, though not all at once.  The real work
appears to be turning the problem over and over and letting your
subconscious chew on it until you're pretty sure the answer you
converged on consciously is the right one.

The hard part, I have found, is getting people to contribute to it
(or generating awareness, which may be a precondition).  I'm thinking
about writing up a paper on it for submission to USENIX ;login:
magazine; keep an eye open for it if you are interested.  If you are
interested in Python and netfilter/iptables, and have some free time,
then definitely send me an email; if you know anyone who would like to
be an author of a cutting-edge network security system, let them know
about it.

> and talk to devices that will never go to an snmp connectathon,

Here we hit a scoping problem.  If I started with this goal, I'd be
stuck in analysis paralysis forever.  I'd rather start with SNMP and
get a usable product that could be extended.  The complexity of the
task goes up with roughly the square of the number of things to
consider, so I think it's absolutely essential to start with limited
objectives and generalize where appropriate in subsequent generations.
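
A first generation really can be that small.  As a rough sketch (the
hosts, community string and OIDs are placeholders, and it shells out
to Net-SNMP's snmpget rather than speaking SNMP natively):

    # placeholders throughout: hosts, community and OIDs are examples only
    import subprocess

    HOSTS = ["192.0.2.10", "192.0.2.11"]
    OIDS = ["IF-MIB::ifInOctets.1", "IF-MIB::ifOutOctets.1"]

    def poll(host, community="public"):
        # -Ovq prints bare values, one per line, in OID order
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", community, "-Ovq", host] + OIDS,
            capture_output=True, text=True, timeout=5)
        return dict(zip(OIDS, out.stdout.split()))

    for host in HOSTS:
        print(host, poll(host))

Everything after that -- storage, graphing, the odd devices -- can be
layered on once this much works.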

It seems to me the scalability problem (where most of the data is
never read, and one box has to do everything) is really a problem of
not being able to have the clients provide some resources without also
needing a complicated remote interface.  Computers are very fast and
only getting faster (though disk I/O bandwidth is not keeping pace
with CPU or network bandwidth).  I'm not convinced it would take more
than what Python or another very expressive language could provide if
the load were properly distributed, and that alone would reduce the
time spent writing code by a factor of 10-100, excluding the time
spent coming up with a simple (and secure) way of distributing the
load.
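
As a thought experiment only (the port, the JSON format and the
complete absence of authentication are all placeholders), each
monitored host could keep its own samples and hand them over only when
someone actually asks:

    # thought-experiment agent: port, format and (absent) auth are made up
    import json, time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SAMPLES = []    # (timestamp, value) pairs kept on the monitored host

    def collect():
        # placeholder metric; a real agent would read interface counters here
        SAMPLES.append((time.time(), len(SAMPLES)))

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # the central box pulls history only on demand
            body = json.dumps(SAMPLES[-100:]).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    collect()
    HTTPServer(("0.0.0.0", 8650), Handler).serve_forever()

That way the data nobody ever reads never has to leave the box that
generated it.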

I would be most interested in hearing what NANOG people would like
to see in a monitoring tool.  I think this is an excellent forum
for hashing out what it should really do, and how.
-- 
``Unthinking respect for authority is the greatest enemy of truth.''
-- Albert Einstein -><- <URL:http://www.subspacefield.org/~travis/>