Pinging a Device Every Second

Richard Holbo holbor at sonss.net
Sun Dec 16 22:09:02 UTC 2018


YMMV... but most of the CPE routers I've seen lately have ICMP turned
off by default, so you'll be messing with settings in the customer
router.  Do you provide the router? Also agree with Baldur: 2
minutes is more than likely the customer router rebooting itself or
something like that.  If they support SNMP at ALL, uptime is a VERY
useful OID.  I've finally given up and started to provide the customer
CPE. Since we're going to get the blame anyway, we might as well be
able to monitor it in a fashion that we can choose and charge another
$10 a month for a managed router.
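
A minimal sketch of why uptime is so useful: if a poller records the router's uptime counter (SNMP sysUpTime, in hundredths of a second) on each poll, a reading lower than the previous one means the router restarted in between, and the new value says roughly when. The function name and the sample format are illustrative assumptions, and no SNMP library is called here; the samples are assumed to be collected already. The 32-bit counter also wraps after about 497 days, which this sketch treats as a wrap (not a reboot) when the previous value was within a day of the maximum.

```python
from typing import List, Tuple

TICKS_PER_SECOND = 100          # sysUpTime counts hundredths of a second
WRAP = 2 ** 32                  # TimeTicks is a 32-bit counter


def find_reboots(samples: List[Tuple[float, int]],
                 wrap_margin_ticks: int = 24 * 3600 * TICKS_PER_SECOND
                 ) -> List[float]:
    """Return estimated boot times (epoch seconds), one per detected reboot.

    samples: (poll_time_epoch_seconds, sysuptime_ticks) pairs, oldest first.
    """
    reboots = []
    for (_, prev_ticks), (now, ticks) in zip(samples, samples[1:]):
        if ticks >= prev_ticks:
            continue            # counter still climbing: no restart
        if prev_ticks > WRAP - wrap_margin_ticks:
            continue            # previous value was near 2**32: likely a wrap
        # Uptime went backwards: the device booted `ticks` ago.
        reboots.append(now - ticks / TICKS_PER_SECOND)
    return reboots


if __name__ == "__main__":
    # Hypothetical 5-minute polls; the third one shows only 60 s of uptime.
    polls = [(1000.0, 500000), (1300.0, 530000), (1600.0, 6000)]
    print(find_reboots(polls))
```

With 5-minute polling this still pins a reboot down to about the second, because the uptime value itself carries the boot time.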

TR-069 has settings to change the update frequency as well and it can
be persuaded to provide SNMPish information.

I also run a smokeping for _special_ customers.  I've found that 20
rapid pings every 1 minute gives me pretty good stats on jitter and if
they really are having an issue, I'll see it at that granularity.
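
As a rough illustration of what a 20-ping burst yields, the sketch below turns one burst of round-trip times into loss, median RTT, and a simple jitter figure (the mean absolute change between consecutive replies). This is not smokeping's own code or its exact statistics; the function name and the use of None for a lost ping are assumptions made for the example.

```python
from statistics import median
from typing import Dict, List, Optional


def burst_stats(rtts_ms: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Summarise one burst of pings; None marks a ping with no reply."""
    replies = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    if not replies:
        return {"loss_pct": loss_pct, "median_ms": None, "jitter_ms": None}
    # Jitter here: mean absolute change between consecutive received replies.
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter = sum(deltas) / len(deltas) if deltas else 0.0
    return {"loss_pct": loss_pct,
            "median_ms": median(replies),
            "jitter_ms": jitter}


if __name__ == "__main__":
    # Hypothetical burst of four pings with one lost reply.
    print(burst_stats([10.0, 12.0, None, 11.0]))
```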

/rh

On Sat, Dec 15, 2018 at 5:22 PM Baldur Norddahl
<baldur.norddahl at gmail.com> wrote:
>
> Hi
>
> Customers do not usually complain about 2 minutes of downtime unless it is a repeating event. We therefore offer to put such customers' lines in monitor mode, which means we add them to smokeping. You could also start pinging once a second, which would be no problem if only a few customers are in monitor mode.
>
> However, 2 minutes of downtime is more often a symptom of bad wifi than of the internet connection.
>
> Regards,
>
> Baldur
>
>
> On Sat, Dec 15, 2018 at 7:33 PM Colton Conor <colton.conor at gmail.com> wrote:
>>
>> The problem I am trying to solve is to be able to accurately tell a customer whether their home internet connection was up or down.  For example, a customer calls in and says "my internet was down for 2 minutes yesterday." We need to be able to verify that their internet connection was indeed down. Right now we have no easy way to do this.  Getting metrics like packet loss and jitter would be great too, though I realize the ICMP data path does not always equal customer experience, as many network devices prioritize ICMP traffic. However, ICMP pings over the internet do usually tell accurately whether a customer's modem is online or not.
>>
>> Most devices out in the field like ONTs and DSL modems do not support SNMP but rather use TR-069 for management. Most of these devices only check into the TR-069 ACS server once a day.
>> If the consumer device does support SNMP, it usually has a weak Broadcom or Qualcomm SoC processor, an outdated embedded Linux kernel, and limited RAM and storage. Most of these can't handle SNMP walks every 5 minutes, let alone every minute. We are talking about sub $100 routers here, not Juniper, Cisco, Arista, etc.
>>
>> Almost all of these consumer devices are connected to a carrier aggregation device like a DSLAM, OLT, ethernet switch, or wireless access point. These access devices do support SNMP, but most manufacturers recommend only 5 minute SNMP polling, so a 2 minute outage would not easily be detected. Plus it's hard to correlate that consumer X is on port Y on an access switch, and get that right for a tier 1 CSR.
>>
>> The only two ways I think I can accomplish this are:
>> 1. ICMP pings to a device every so many seconds. Almost every device supports responding to WAN ICMP pings.
>> or
>> 2. IPFIX sampling at the core router, and then drilling down by customer IP. I think this will tell me if any data was flowing to this customer's IP on a second by second basis, but won't necessarily give us an up or down indicator. Requires nothing from the consumer's router.
>>
>>
>>
>>
>>
>> On Sat, Dec 15, 2018 at 10:51 AM Stephen Satchell <list at satchell.net> wrote:
>>>
>>> On 12/15/18 7:48 AM, Colton Conor wrote:
>>> > How much compute and network resources does it take for a NMS to:
>>> >
>>> > 1. ICMP ping a device every second
>>> > 2. Record these results.
>>> > 3. Report an alarm after so many seconds of missed pings.
>>> >
>>> > We are looking for a system to monitor, in near real time, whether an
>>> > end customer's router is up or down. SNMP I assume would be too
>>> > resource intensive, so ICMP pings seem like the only logical solution.
>>> >
>>> > The question is: are once-a-second pings too much polling for an NMS
>>> > and a consumer grade router? Does it take much network bandwidth and
>>> > CPU resources on both the NMS and CPE side?
>>> >
>>> > Let's say this is for a 1,000 customer ISP.
>>>
>>> What problem are you trying to solve, exactly?  That more than anything
>>> will dictate what you do.
>>>
>>> Short answer: about 1500 bits of bandwidth, and the CPU loading on the
>>> remote device is almost invisible.  Remember the only real difference
>>> between ping and SNMP monitoring (UDP) is the organization of the bits
>>> in the packet and the protocol number in the IP header.  It's still one
>>> packet pair exchanged, unless you get really ambitious with your SNMP
>>> OID list.
>>>
>>> When I was in a medium-sized hosting company, I developed an SNMP-based
>>> monitoring system that would query a number of load parameters (CPU,
>>> disk, network, overall) on a once a minute schedule, and would keep
>>> history for hours on the monitoring server.  The boss fretted about the
>>> load such monitoring would impose.  He never saw any.
>>>
>>> For pure link monitoring, which is what I'm hearing you want to do, in
>>> my experience I found that a six-second ping cycle gives lots of early
>>> warning for link failures.  Again, it depends on the specifications and
>>> detection targets.
>>>
>>> Some things to consider:
>>>
>>> 1.  Router restarts take a while.  Consumer-grade routers can take a
>>> minute or more to complete a restart to the point where they will respond
>>> to ping.  Carrier-grade routers are more variable but in general have so
>>> many options built into them that it takes longer to complete a restart
>>> cycle.  Since you are talking consumer-grade gear, you probably don't
>>> want to be sensitive to CP power sags.
>>>
>>> 2.  Depending on the technology used on the link, you may get some
>>> short-term outages, on the order of seconds, so doing "rapid" pings does
>>> nothing for you.  During my DSL time, ATM would drop out for short
>>> intervals -- so watch out for nuisance trips.
>>>
>>> 3.  Some routers implement ping limiting, so you have to balance your
>>> monitoring sample rate against DoS susceptibility. Offhand, I don't know
>>> the granularity of consumer router ping limiting, as I've never had that
>>> question pop up.
>>>
>>> 4.  How large a monitoring server are you willing to devote to such a
>>> system?  My web host monitoring used a 400-MHz Pentium II box, and it
>>> didn't even breathe hard.  (A 1U Cobalt box, repurposed with Red Hat
>>> Linux, pulled from a junk pile.)  I was monitoring about 150 web host
>>> servers. Extrapolating the system load on that Cobalt box, I could have
>>> handled 1500 web host servers and more.
>>>
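
The three steps asked about in the thread (ping, record, alarm after so many missed pings) reduce to a small amount of bookkeeping once the ping results are recorded. The sketch below assumes the results are already stored as (timestamp, replied) pairs, one per probe, and reports each run of consecutive misses that reaches a threshold. The function name, the record format, and the default threshold of 5 are illustrative choices, not something from the thread.

```python
from typing import List, Tuple


def find_outages(results: List[Tuple[float, bool]],
                 min_missed: int = 5) -> List[Tuple[float, float]]:
    """Return (first_missed_time, first_reply_after_time) for each outage.

    results: (timestamp_seconds, replied) pairs in time order.
    An outage still in progress at the end of the data is reported with
    the last missed timestamp as its end.
    """
    outages = []
    run_start = None     # timestamp of the first miss in the current run
    last_miss = None     # timestamp of the most recent miss
    run_len = 0          # consecutive misses so far
    for ts, replied in results:
        if not replied:
            if run_len == 0:
                run_start = ts
            run_len += 1
            last_miss = ts
            continue
        if run_len >= min_missed:
            outages.append((run_start, ts))
        run_len = 0
    if run_len >= min_missed:
        outages.append((run_start, last_miss))
    return outages


if __name__ == "__main__":
    # Hypothetical once-a-second log: replies stop at t=2 and resume at t=8.
    log = [(float(t), not (2 <= t <= 7)) for t in range(10)]
    print(find_outages(log))
```

With once-a-second probes a 2-minute outage shows up as roughly 120 consecutive misses, so a threshold of a few seconds filters out the short link blips mentioned above without hiding the event the customer is calling about.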

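For scale, the "about 1500 bits" figure is consistent with one default-size ping exchange: a 56-byte ICMP payload plus 8 bytes of ICMP header, 20 bytes of IPv4 header, and 14 bytes of Ethernet header is 98 bytes each way. The sketch below multiplies that out for the 1,000-customer, once-a-second case. The frame size assumes IPv4 over untagged Ethernet with preamble and FCS ignored, and the function name is illustrative.

```python
ICMP_PAYLOAD = 56    # default payload of Linux and BSD ping
ICMP_HEADER = 8
IPV4_HEADER = 20
ETH_HEADER = 14      # assumption: untagged Ethernet, preamble and FCS ignored


def ping_load_bps(customers: int, pings_per_second: float = 1.0) -> float:
    """Total bits per second for request plus reply across all customers."""
    frame_bytes = ICMP_PAYLOAD + ICMP_HEADER + IPV4_HEADER + ETH_HEADER
    bits_per_exchange = 2 * frame_bytes * 8      # echo request + echo reply
    return customers * pings_per_second * bits_per_exchange


if __name__ == "__main__":
    print(ping_load_bps(1))       # one customer
    print(ping_load_bps(1000))    # the 1,000-customer ISP from the thread
```

Under these assumptions one customer costs 1568 bit/s and 1,000 customers about 1.6 Mbit/s in total, split evenly between the outbound requests and the inbound replies.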