Looking for a performance/health monitoring and alerting solution

Amos Shapira amos.shapira at gmail.com
Mon Jun 16 04:19:59 IDT 2014


Another thing - while I was digging through the Sydney DevOps meetups for a
talk about monitoring by a dude from Google, I stumbled across a reference
to InfluxDB: http://influxdb.com/.



On 16 June 2014 10:49, Amos Shapira <amos.shapira at gmail.com> wrote:

> For a start, it looks like you're putting trending and alerting in one
> basket. I'd keep them separate, though alerting based on collected trending
> data is useful (e.g. don't alert just when a load threshold is crossed, but
> only if the trending average for the past X minutes is above the threshold,
> or even only if its derivative shows that it's not going to get better
> soon enough).
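>
> A rough sketch of that idea, combining the two conditions (pure
> illustration - the window size, threshold and function name are made up,
> and in practice you'd probably read the samples out of Graphite rather
> than keep them in memory):
>
>     def should_alert(samples, threshold, window=5):
>         """samples: oldest-first list of (timestamp, load) pairs."""
>         recent = [value for _, value in samples[-window:]]
>         if len(recent) < window:
>             return False  # not enough history to judge a trend yet
>         average = sum(recent) / len(recent)
>         slope = recent[-1] - recent[0]  # crude derivative over the window
>         # Alert only if the windowed average crossed the threshold AND
>         # the trend isn't already heading back down.
>         return average > threshold and slope >= 0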
>
> See http://fractio.nl/2013/03/25/data-failures-compartments-pipelines/
> for high-level theory about monitoring pipelines, and a bit of a pitch for
> Flapjack (and start by reading the first link from it). Lindsay is a very
> eloquent speaker and author in general, and fun to watch and read.
>
> Bottom line from the above - I'm currently not aware of a single silver
> bullet to do everything you need for proper monitoring.
>
> Last time I had to set up such a system (monitoring hundreds of servers
> for trends AND alerts) I used:
> 1. collectd (https://collectd.org/) for trending data - it can sample
> things down to once a second if you want
> 2. statsd (https://github.com/etsy/statsd/) for event counting (e.g.
> every time a Bamboo build plan started, stopped, failed or succeeded, or
> other such events happened, an event was shot over to statsd to coalesce
> and ship over to Graphite - there's a small sketch of what that looks like
> after this list). Nice overview:
> http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
> 3. Both of the above send data to Graphite (
> https://github.com/graphite-project)
> 4. To track things like "upgraded Bamboo" events, we used tricks like
> http://codeascraft.com/2010/12/08/track-every-release/. I've since
> learned about another project that helps attach extra data to events (e.g.
> the version Bamboo was upgraded to), but I can't find it right now.
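>
> To give an idea of how lightweight the statsd side of this is, here is a
> small sketch (the host name, port and metric names are placeholders; the
> wire format is just statsd's plain "name:value|c" counter strings sent
> over UDP):
>
>     import socket
>
>     # Placeholder address - statsd listens on UDP 8125 by default.
>     STATSD_ADDR = ("statsd.example.internal", 8125)
>     sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
>
>     def count(metric, value=1):
>         # statsd counter format: "<name>:<value>|c", fire-and-forget
>         sock.sendto("{0}:{1}|c".format(metric, value).encode(), STATSD_ADDR)
>
>     # e.g. called from build hooks:
>     count("bamboo.build.started")
>     count("bamboo.build.failed")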
>
> Here is a good summary with Graphite tips:
> http://kevinmccarthy.org/blog/2013/07/18/10-things-i-learned-deploying-graphite/
>
> Alerts were generated by opsview (stay away from it, it was a mistake),
> which is yet another Nagios wrapper. Many of the checks were based on
> reading the Graphite data whenever it was available (
> https://github.com/olivierHa/check_graphite), but many also used plain
> old NRPE (e.g. "is the collectd/bamboo/apache/mysql/postgres/whatever
> process still running?").
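>
> The shape of a Graphite-backed check is roughly this (a sketch only, not
> check_graphite itself - the Graphite host, metric name and thresholds are
> placeholders; it reads Graphite's render API with format=json and returns
> the usual Nagios exit codes; Python 2, which is what we ran these under):
>
>     import json, sys, urllib2
>
>     GRAPHITE = "http://graphite.example.internal"   # placeholder
>     TARGET = "collectd.web01.load.load.shortterm"   # placeholder
>     WARN, CRIT = 4.0, 8.0
>
>     url = ("%s/render?target=%s&from=-10min&format=json"
>            % (GRAPHITE, TARGET))
>     reply = json.load(urllib2.urlopen(url))
>     points = []
>     if reply:
>         points = [v for v, _ in reply[0]["datapoints"] if v is not None]
>
>     if not points:
>         print "UNKNOWN - no datapoints for %s" % TARGET
>         sys.exit(3)
>
>     value = sum(points) / len(points)
>     if value >= CRIT:
>         print "CRITICAL - %s = %.2f" % (TARGET, value)
>         sys.exit(2)
>     elif value >= WARN:
>         print "WARNING - %s = %.2f" % (TARGET, value)
>         sys.exit(1)
>     print "OK - %s = %.2f" % (TARGET, value)
>     sys.exit(0)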
>
> I don't like Nagios specifically, and its centralised model in general
> (which affects all the other "Nagios replacement" implementations), and
> would rather look for something else, perhaps Sensu (http://sensuapp.org/),
> though it wasn't ready last time I evaluated it about a year ago.
>
> My main beef with Nagios and the other central monitoring systems is that
> there is a central server which orchestrates most of the monitoring. This
> means that:
> 1. There is one server which has to go through all the checks on all
> monitored servers in each iteration to trigger a check. With hundreds of
> servers and thousands of checks this can take a very long time. It could
> be busy checking whether the root filesystem on a throw-away Bamboo agent
> is full (while the previous check showed that it's far from that) while
> your central Maven repository is burning for a few minutes. And it wouldn't
> help to say "check the Maven repo more often", because it'd be like the IBM
> vs. DEC boat race - "row harder!" (
> http://www.panix.com/~clp/humor/computers/programming/dec-ibm.html).
> 2. That server is a single point of failure, or you have to start using
> complex clustering solutions to keep it (and only one of it!) up - no
> parallel servers.
> 3. This server has to be very beefy to keep up with all the checks AND
> serve the results. In one of my former workplaces (the second-largest
> Australian ISP at the time) there was a cluster of four such servers with
> the checks carefully spread among them. Updating the cluster configuration
> was a delicate business, keeping them up wasn't pleasant, and the web
> interface was still very slow to serve.
> 4. The amount of traffic and load on the network and monitored servers is
> VERY wasteful - open a TCP connection for each check, fork/exec via the
> NRPE agent, wait for the process to exit, collect the results, rinse,
> repeat, millions of times a day.
>
> Nagios doesn't encourage what it calls "passive monitoring" (i.e. the
> monitored servers initiate checks and send results, whether positive or
> negative, to a central server) and in general its protocol (NRPE) means
> that the central monitoring data collector is a bottleneck.
>
> Sensu, on the other hand, works around this by encouraging more "passive
> monitoring", i.e. each monitored server is responsible for monitoring
> itself, without the overhead of a central server doing the rounds and
> loading the network. It uses a RabbitMQ message bus, so its data transport
> and collection servers are more scalable (it also supports multiple
> servers), and it's OK with not sending anything if there is nothing to
> report (the system still has "keepalive" checks
> (http://sensuapp.org/docs/0.12/keepalives) to monitor for nodes which
> went down).
>
> But my favourite idea for scalability is the one presented in
> http://linux-ha.org/source-doc/assimilation/html/index.html - each
> monitored host is responsible for monitoring itself, without bothering
> anyone if there is nothing to write home about (so a bit like Sensu), plus
> a couple of servers near it, so the "is the host alive" external monitoring
> is distributed across the network (and doesn't fall on the central servers
> alone, as it does in Sensu). It also saves unnecessary network traffic.
> Unfortunately, it doesn't seem to be ready yet (
> http://linux-ha.org/source-doc/assimilation/html/_release_descriptions.html
> ).
>
> More points:
>
> Lack of VPN - if you can't set up a "proper" VPN then you can always look
> at an SSH VPN (e.g. Ubuntu instructions:
> https://help.ubuntu.com/community/SSH_VPN), and if you can't be bothered
> with ssh_config "Tunnel"/"TunnelDevice" (ssh's "-w" flag) then even a
> simple ssh port redirection with ssh -NT and autossh could do.
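>
> For example, something like this on each monitored box (the host name is a
> placeholder) keeps a tunnel open to the carbon line receiver, which listens
> on TCP 2003 by default, so local relays can just write to localhost:2003:
>
>     autossh -M 0 -f -NT -L 2003:localhost:2003 metrics@monitor.example.com
>
> (with -M 0 you'd want ServerAliveInterval/ServerAliveCountMax set so that
> dead tunnels are noticed and restarted).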
>
> Log aggregation - look at Logstash (http://logstash.net/) for proper
> log collection and analysis.
>
> Hope this gives you some ideas.
>
> --Amos
>
> On 16 Jun 2014 09:13, "Ori Berger" <linux-il at orib.net> wrote:
>
>> I'm looking for a single system that can track all of a remote server's
>> health and performance status, which stores a detailed every-few-seconds
>> history, and which triggers alarms in "bad" situations (such as no disk
>> space, etc). So far, I haven't found one comprehensive system that does
>> it all. Things I'm interested in (in parentheses: how I track them at the
>> moment; note that Shinken is a Nagios-compatible thing):
>>
>> Free disk space (shinken)
>> Server load (shinken)
>> Debian package and security updates (shinken)
>> NTP drift (shinken)
>> Service ping/reply time (shinken)
>> Upload/download rates per interface (mrtg)
>> Temperatures (sensord, hddtemp)
>> Security logs, warnings and alerts, e.g. fail2ban, auth.log (rsync of log
>> files)
>>
>> I have a few tens of servers to monitor, which I would like to do with
>> one piece of software and one console. Those servers are not all
>> physically on the same network, nor do they have a VPN (so, no UDP), but
>> TCP and SSH are mostly reliable even though the links are low-bandwidth.
>>
>> Please note that Shinken (much like Nagios) doesn't really give a good
>> visible history of the things it measures - only alerts. Also, it can't
>> really sample things every few seconds - the lowest reasonable update
>> interval (given Shinken's architecture) is ~5 minutes for the things it
>> measures above.
>>
>> Any recommendations?
>>
>> Thanks in advance,
>> Ori
>>
>



