Looking for a performance/health monitoring and alerting solution

Amos Shapira amos.shapira at gmail.com
Mon Jun 16 03:49:06 IDT 2014


For a start, it looks like you're putting both trending and alerting in
one basket. I'd keep them separate, though alerting based on collected
trending data is useful (e.g. don't alert the moment a load threshold is
crossed, but only if the trending average for the past X minutes is above
the threshold, or even only if its derivative shows that it's not going
to get better soon enough).
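
To make that concrete, here's a rough sketch of the idea against
Graphite's render API (movingAverage and derivative are real Graphite
functions; the host and metric path below are made-up examples):

    # Alert on the 10-minute moving average rather than the raw value:
    curl -s 'http://graphite.example.com/render?target=movingAverage(servers.web1.load.load.shortterm,"10min")&from=-15min&format=json'

    # Or look at the trend itself - is the load still climbing?
    curl -s 'http://graphite.example.com/render?target=derivative(servers.web1.load.load.shortterm)&from=-15min&format=json'

Your alerting layer then parses the returned JSON and decides whether
the latest datapoints cross the threshold.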

See http://fractio.nl/2013/03/25/data-failures-compartments-pipelines/
for high-level theory about monitoring pipelines, and a bit of a pitch
for Flapjack (start by reading the first link from it). Lindsay is an
eloquent speaker and writer in general, and fun to watch and read.

Bottom line from the above: I'm not currently aware of a single silver
bullet that does everything you need for proper monitoring.

The last time I had to set up such a system (monitoring hundreds of
servers for trends AND alerts) I used:
1. collectd (https://collectd.org/) for trending data - it can sample
things down to once a second if you want (a minimal config sketch
follows this list)
2. statsd (https://github.com/etsy/statsd/) for event counting (e.g.
every time a Bamboo build plan started or stopped, failed or succeeded,
or other such events happened, an event was shot over to statsd to
coalesce and ship over to Graphite; see the one-liners after this
list). Nice overview:
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
3. Both of the above send their data to Graphite (
https://github.com/graphite-project)
4. To track things like "upgraded Bamboo" events, we used tricks like
http://codeascraft.com/2010/12/08/track-every-release/. I've since
learned about another project that helps attach extra data to events
(e.g. the version Bamboo was upgraded to), but I can't find it right
now.
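
For point 1, a minimal collectd.conf sketch (the Graphite host name is
an assumption; write_graphite ships with collectd 5.x):

    Interval 1                # sample every second
    LoadPlugin cpu
    LoadPlugin load
    LoadPlugin write_graphite
    <Plugin write_graphite>
      <Node "central">
        Host "graphite.example.com"   # assumed central Graphite box
        Port "2003"                   # carbon's plaintext listener
        Protocol "tcp"
      </Node>
    </Plugin>

Keep in mind that 1-second samples generate a lot of datapoints; make
sure your Whisper retention settings are sized to match.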
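
For points 2 and 4, a statsd counter is just a UDP datagram in the
"<metric>:<value>|c" line format, so you don't even need a client
library (the host and metric names here are invented):

    # count a Bamboo build success (statsd listens on UDP 8125 by default)
    echo -n "bamboo.build.succeeded:1|c" | nc -u -w0 statsd.example.com 8125

    # a release marker for point 4 - graph it later with Graphite's
    # drawAsInfinite() to get a vertical line on every release
    echo -n "deploys.bamboo:1|c" | nc -u -w0 statsd.example.com 8125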

Here is a good summary with Graphite tips:
http://kevinmccarthy.org/blog/2013/07/18/10-things-i-learned-deploying-graphite/

Alerts were generated by Opsview (stay away from it; it was a mistake),
which is yet another Nagios wrapper. Many of the checks were based on
reading the Graphite data whenever it was available (
https://github.com/olivierHa/check_graphite), but many also used plain
old "nrpe" (e.g. "is the collectd/bamboo/apache/mysql/postgres/whatever
process still running?").
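
The Graphite-reading kind boils down to something like this (a sketch
of the idea, not check_graphite's actual interface, which I don't
remember offhand; the metric and thresholds are invented):

    #!/bin/sh
    # Nagios-style check that reads the latest value from Graphite.
    # Exit codes follow the Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL.
    # A real check must also handle "None" (missing datapoints).
    VALUE=$(curl -s 'http://graphite.example.com/render?target=servers.web1.load.load.shortterm&from=-5min&format=raw' \
            | awk -F'|' '{n = split($2, v, ","); print v[n]}')
    if [ "$(echo "$VALUE > 10" | bc)" = 1 ]; then
      echo "CRITICAL - load is $VALUE"; exit 2
    elif [ "$(echo "$VALUE > 5" | bc)" = 1 ]; then
      echo "WARNING - load is $VALUE"; exit 1
    fi
    echo "OK - load is $VALUE"; exit 0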

I don't like Nagios specifically, and its centralization in general
(which affects all the other "Nagios replacement" implementations), and
would rather look for something else, perhaps Sensu
(http://sensuapp.org/), though it wasn't ready when I last evaluated it
about a year ago.

My main beef with Nagios and the other central monitoring systems is that
there is a central server which orchestrates most of the monitoring. This
means that:
1. There is one server which has to go through all the checks on all
monitored servers in each iteration. With hundreds of servers and
thousands of checks this can take a very long time. It could be busy
checking whether the root filesystem on a throw-away Bamboo agent is
full (while the previous check showed that it's far from that) while
your central Maven repository is burning for a few minutes. And it
wouldn't help to say "check the Maven repo more often", because that
would be like the IBM vs. DEC boat race - "row harder!" (
http://www.panix.com/~clp/humor/computers/programming/dec-ibm.html).
2. That server is a single point of failure, or you have to start using
complex clustering solutions to keep it (and only one of it!) up - no
parallel servers.
3. This server has to be very beefy to keep up with all the checks AND
serve the results. At one of my former workplaces (the second-largest
Australian ISP at the time) there was a cluster of four such servers
with the checks carefully spread among them. Updating the cluster
configuration was a delicate business, keeping the servers up wasn't
pleasant, and the web interface was still very slow.
4. The amount of traffic and load on the network and the monitored
servers is VERY wasteful - open a TCP connection for each check,
fork/exec via the NRPE agent, wait for the process to exit, collect the
results, rinse, repeat, millions of times a day.

Nagios doesn't encourage what it calls "passive checks" (i.e. the
monitored servers initiate checks and send results, whether positive or
negative, to a central server), and in general its protocol (NRPE)
means that the central monitoring data collector is a bottleneck.

Sensu, on the other hand, works around this by encouraging more
"passive monitoring": each monitored server is responsible for
monitoring itself, without the overhead of a central server doing the
rounds and loading the network. It uses a RabbitMQ message bus, so its
data transport and collection servers are more scalable (it also
supports multiple servers), and it's OK with not sending anything when
there is nothing to report (the system still has "keepalive" checks
(http://sensuapp.org/docs/0.12/keepalives) to catch nodes that went
down).
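
As a taste of that push model: the Sensu client of that generation
listens on a local socket (127.0.0.1, port 3030), so anything running
on the host can submit a check result by itself, with no central poller
involved (the check name and output here are invented):

    # push a check result into the local Sensu client socket
    echo '{"name": "bamboo_process", "output": "bamboo is not running", "status": 2}' \
      | nc 127.0.0.1 3030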

But my favourite idea for scalability is the one presented in
http://linux-ha.org/source-doc/assimilation/html/index.html - each
monitored host is responsible for monitoring itself, without bothering
anyone if there is nothing to write home about (so a bit like Sensu),
plus a couple of servers near it, so the "is this host alive" external
monitoring is distributed across the network (and doesn't fall on the
central servers alone, like in Sensu), which also saves unnecessary
network traffic. Unfortunately, it seems not to be ready yet (
http://linux-ha.org/source-doc/assimilation/html/_release_descriptions.html
).

More points:

Lack of VPN - if you can't set up a "proper" VPN then you can always
look at an SSH VPN (e.g. Ubuntu instructions:
https://help.ubuntu.com/community/SSH_VPN), and if you can't be
bothered with ssh_config's "Tunnel"/"TunnelDevice" (ssh's "-w" flag)
then even simple ssh port redirection with ssh -NT and autossh could
do.
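
For example, from a monitored host, keeping a forwarded port to the
central Graphite box up with autossh (the host names are made up):

    # Local port 2003 now reaches the central carbon listener, so
    # collectd/statsd can simply write to localhost:2003.
    # -M 0 relies on ssh's own keepalives, hence ServerAliveInterval.
    autossh -M 0 -f -NT -o ServerAliveInterval=30 \
        -L 2003:localhost:2003 monitor@graphite.example.com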

Log aggregation - look at Logstash (http://logstash.net/) for proper
log collection and analysis.

Hope this gives you some ideas.

--Amos

On 16 Jun 2014 09:13, "Ori Berger" <linux-il at orib.net> wrote:

> I'm looking for a single system that can track all of a remote server's
> health and performance status, store a detailed every-few-seconds
> history, and trigger alarms in "bad" situations (such as no disk space,
> etc.). So far, I haven't found one comprehensive system that does it
> all. Things I'm interested in (in parentheses - how I track them at the
> moment; note that shinken is a nagios-compatible thing):
>
> Free disk space (shinken)
> Server load (shinken)
> Debian package and security updates (shinken)
> NTP drift (shinken)
> Service ping/reply time (shinken)
> Upload/download rates per interface (mrtg)
> Temperatures (sensord, hddtemp)
> Security logs, warnings and alerts, e.g. fail2ban, auth.log (rsync of
> log files)
>
> I have a few tens of servers to monitor, which I would like to do with
> one piece of software and one console. Those servers are not all
> physically on the same network, nor do they have a VPN (so, no UDP),
> but TCP and ssh are mostly reliable even though they are low bandwidth.
>
> Please note that shinken (much like nagios) doesn't really give a good
> visible history of the things it measures - only alerts. Also, it can't
> really sample things every few seconds - the lowest reasonable update
> interval (given shinken's architecture) is ~5 minutes for the things it
> measures above.
>
> Any recommendations?
>
> Thanks in advance,
> Ori