Monitoring other people's sites (Was: Website for ipv6.level3.com returns "HTTP/1.1 500 Internal Server Error")

Jeroen Massar jeroen at unfix.org
Wed Mar 21 08:43:03 UTC 2012


On 2012-03-20 16:53, Nick Hilliard wrote:
> On 20/03/2012 14:54, Jeroen Massar wrote:
>> For everybody who is "monitoring" other people's websites, please please
>> please, monitor something static like /robots.txt as that can be
>> statically served and is kinda appropriate as it is intended for robots.
> 
> Depends on what you are monitoring.  If you're looking for layer 4 ipv6
> connectivity then robots.txt is fine.  If you're trying to determine
> whether a site is serving active content on ipv6 and not serving http
> errors, then it's pretty pointless to monitor robots.txt - you need to
> monitor /.

And as can be seen with the monitoring of ipv6.level3.com, it will tell
you 'it is broken', but as the person doing the monitoring has no
relationship with or contact at the site, it only leads to public
complaints which do not get resolved...

If the site operators themselves cannot be arsed to monitor their own
site, then why would you bother to do so?

Indeed, I agree that it can be useful, especially as an access ISP, to
monitor popular websites so that you know that you can reach them, but
that does not mean you need to pull large amounts of data.
(For determining MTU issues you do need to pull more data, yes, but you
likely have a full 1500-byte path anyway, thus those issues should as
much as possible not happen at all; see the sketch below.)
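
To make that MTU point concrete, a minimal sketch in Python (the host
and object name are hypothetical; the only requirement is that the
object is larger than one full-size frame):

import socket

HOST, BIG = "ipv6.example.net", "/static/pad-4k.txt"   # both hypothetical

def pmtu_probe(timeout=10):
    # Fetch an object bigger than one 1500-byte frame over IPv6 and
    # return how many bytes came back. A clean TCP handshake followed
    # by a stalled transfer is the classic PMTU-blackhole signature,
    # which a tiny probe like /robots.txt would never trigger.
    # Resolving for AF_INET6 explicitly prevents a silent IPv4 fallback.
    fam, typ, proto, _, sa = socket.getaddrinfo(
        HOST, 80, socket.AF_INET6, socket.SOCK_STREAM)[0]
    with socket.socket(fam, typ, proto) as s:
        s.settimeout(timeout)
        s.connect(sa)
        s.sendall((f"GET {BIG} HTTP/1.1\r\nHost: {HOST}\r\n"
                   "Connection: close\r\n\r\n").encode("ascii"))
        total = 0
        try:
            while chunk := s.recv(4096):
                total += len(chunk)
        except socket.timeout:
            pass  # stalled mid-transfer: big packets likely blackholed
    return total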

But unless you have a contact at the site it will be tough to resolve
the issue anyway.

>> Oh and of course do set the User-Agent to something logical and, to be
>> super nice, include a contact address, so that people who do check their
>> logs once in a while for fishy things at least know what is
>> happening there and that it is not a process run amok or something.
> 
> Good policy, yes.  Some robots do this but others don't.
> 
>> Of course, asking before doing tends to be a good idea too.
> 
> Depends on the scale.  I'm not going to ask permission to poll someone
> else's site every 5 minutes, and I would be surprised if they asked me the
> same.  OTOH, if they were polling to the point that it was causing issues,
> that might be different.

I was not talking about such a low rate; not a lot of people will notice
that. But the 1000 qps from 500 sources was quite noticeable, and thus
at first they got blocked, then we tried to find out who was doing it,
and then they repointed their checks at robots.txt, we unblocked them,
and all was fine.

>> The IPv6 Internet already consists way too much of monitoring by
>> pulling pages and doing pings...
> 
> "way too much" for what?  IPv6 is not widely adopted.

In comparison to real traffic. There has been a saying since the 6bone
days that IPv6 is just ICMPv6...

>> Fortunately that should change heavily in a few months.
> 
> We've been saying this for years.  World IPv6 day 2012 will come and go,
> and things are unlikely to change a whole lot.  The only thing that World
> IPv6 day 2012 will ensure is that people whose ipv6 configuration actively
> interferes with their daily Internet usage will be self-flagged and their
> configuration issues can be dealt with.

Fully agree, but at least at that point nobody will be able to claim
that they can't deploy IPv6 on the access side as there is no content ;)

>>  (who noticed a certain s....h company performing latency checks against
>> one of his sites, which was no problem, but the fact that they were
>> causing almost more hits/traffic/load than normal clients was a bit on
>> the much side
> 
> If that web page is configured to be as top-heavy as this, then I'd suggest
> putting a cache in front of it. nginx is good for this sort of thing.

nginx does not help if your content is not cacheable by nginx, for
instance if you simply show the IP address of the client and thus
whether they came in over IPv6 or IPv4.

In our case, indeed, everything that is static is served by nginx, which
is why hammering on /robots.txt is not an issue at all...
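
Roughly what that split looks like in the nginx config (a sketch; the
paths and the backend address are hypothetical):

server {
    listen [::]:80;
    server_name www.example.net;

    # Static: nginx answers /robots.txt itself, so monitors hammering
    # this path never touch the application.
    location = /robots.txt {
        root /var/www/static;
    }

    # Dynamic: anything that depends on the client, such as showing
    # their address, has to hit the backend and is not cacheable.
    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}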

On 2012-03-20 21:45, Charles N Wyble wrote:
> On 03/20/2012 09:54 AM, Jeroen Massar wrote:
>> On 2012-03-20 15:40, Vinny_Abello at Dell.com wrote:
>> 
>> For everybody who is "monitoring" other people's websites, please
>> please please, monitor something static like /robots.txt as that
>> can be statically served and is kinda appropriate as it is intended
>> for robots.
> 
> This could provide a false positive if one is interested in ensuring 
> that the full application stack is working.

As stated above, and given ipv6.level3.com from the original subject as
the example: what exactly are you going to do when it is not working?

And again, if the owner does not care, why should you?

Also, maybe they do a redesign of the site and remove the keywords or
other markers you are looking for. It is not your problem to monitor
the site for them, unless they hire you to do so, of course.

>> Oh and of course do set the User-Agent to something logical and, to
>> be super nice, include a contact address, so that people who do check
>> their logs once in a while for fishy things at least know what
>> is happening there and that it is not a process run amok or
>> something.
> 
> A server side process? Or client side?

Take a guess what something that polls an HTTP server is.

> If the client side monitoring
> is too aggressive, then your rate limiting firewall rules should
> kick in and block it. If you don't have a rate limiting firewall on
> your web server, (on the server itself, not in front of it) then you
> have bigger problems.

You indeed will have a lot of problems when you are doing connection
tracking for your web server, be that on the box itself or in front of
it in a separate TCP state engine.

>> Of course, asking before doing tends to be a good idea too.
> 
> 
> If you are running a public service, expect it to get 
> monitored/attacked/probed etc. If you don't want traffic from
> certain sources then block it.

That is exactly what happened, but if they had set a proper User-Agent
it would not have taken so long to figure out why they were doing it.

There is a big difference between malicious and good traffic; people
tend to want to serve the latter.

>> The IPv6 Internet already consists way too much of monitoring
>> by pulling pages and doing pings...
> 
> Who made you the arbiter of acceptable automated traffic levels?

I did!

And as you state yourself, if you do not like it, block it, which is
what we do. But that was not what this thread was about; if you recall,
it started by noting that you might want to ask for permission and that
you might want to provide proper contact details in the probes.

>> (who noticed a certain s....h company performing latency checks
>> against one of his sites, which was no problem, but the fact that
>> they were causing almost more hits/traffic/load than normal
>> clients was a bit on the much side,
> 
> Again. Use a firewall and limit them if the traffic isn't in line
> with your site policies.

I can only suggest once running a site that gets more than a few hits
per second, is distributed around the world, and has actual users ;)

>> And for the few folks pointing their Nagios at other people's sites:
>> they obviously do not understand that even if the alarm goes off
>> that something is broken, they cannot fix it anyway, thus why
>> bother...
> 
> You obviously do not understand why people are implementing these 
> monitors.

Having written various monitoring systems, I know exactly why they are
doing it. I also know that they are monitoring the wrong thing.

> It's to serve as a canary for v6 connectivity issues.

Just polling robots.txt is good enough for that.

Asking the site operator whether it is OK with them is also a good idea.
Providing contact details in the User-Agent is also a good idea.
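
In practice that is a handful of lines; a minimal sketch in Python (the
host, monitoring URL and contact address are of course hypothetical,
fill in your own):

import socket

HOST = "ipv6.example.net"   # the site being monitored (hypothetical)
UA = ("example-monitor/1.0 "
      "(+https://noc.example.net/monitoring; noc@example.net)")

def check_robots(family=socket.AF_INET6, timeout=10):
    # Poll /robots.txt over the given address family and return the
    # HTTP status code, or None when no response came back at all.
    try:
        fam, typ, proto, _, sa = socket.getaddrinfo(
            HOST, 80, family, socket.SOCK_STREAM)[0]
        with socket.socket(fam, typ, proto) as s:
            s.settimeout(timeout)
            s.connect(sa)
            s.sendall((f"GET /robots.txt HTTP/1.1\r\n"
                       f"Host: {HOST}\r\n"
                       f"User-Agent: {UA}\r\n"
                       "Connection: close\r\n\r\n").encode("ascii"))
            status_line = s.recv(4096).split(b"\r\n", 1)[0]
        # Status line looks like: HTTP/1.1 200 OK
        return int(status_line.split(b" ", 2)[1])
    except (OSError, IndexError, ValueError):
        return None

An operator who sees that User-Agent in their logs immediately knows
what the traffic is and whom to mail about it.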

> If I
> were implementing a monitor like this, I'd use the following logic:
> 
> HTTP 200 returned via v4/v6 == all is well.
> HTTP 200 returned via v4 or v6, no HTTP code returned via the other
> (ie one path works) == v6/v4 potentially broken.
> No HTTP code returned via either method == end site problem; nothing
> we can do; don't alert.

And when you then get an alert, who are you going to call?
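
The decision table itself is the trivial part; a rough transcription (a
sketch, reusing the hypothetical check_robots() from above, which
returns the HTTP status code or None when nothing came back):

import socket

def classify():
    # Map the two probe results onto the three outcomes quoted above.
    v4 = check_robots(socket.AF_INET)
    v6 = check_robots(socket.AF_INET6)
    if v4 == 200 and v6 == 200:
        return "all is well"
    if (v4 == 200) != (v6 == 200) and None in (v4, v6):
        return "v6/v4 potentially broken"   # exactly one path works
    if v4 is None and v6 is None:
        return "end site problem, nothing we can do, don't alert"
    # Leftover case the quoted logic does not mention: an HTTP-level
    # error such as the 500 that started this thread.
    return "HTTP error on at least one path"

Writing that down is easy; knowing whom to call when it fires is the
hard part.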

> Presumably you'd also implement a TCP 80 check as well.

Ehmmm, you do realize that if you are able to get an HTTP response, you
have (unless doing HTTPS) actually already contacted port 80 over
TCP? :)

Greets,
 Jeroen



