Monitoring for Jessie and Beyond

What do we want/need to monitor (metrics/checks)

A non-exhaustive list of metrics and checks we need/would like to have, e.g. CPU usage for every machine, network traffic in/out, ...

Trying to create a detailed list of metrics to record.

Metrics

Should be monitored (compute nodes)

  • CPU: user/nice/system/wait IO (4)
  • disk: space available/free per local used block device (space/inodes), IO times/reads/writes per local physical device (8+6)
  • GPU: load/mem usage (2)
  • system: one minute load, number of processes (2)
  • memory: used/free/buffer (3)
  • network: received/sent (packets/bytes) per logical device, errors per physical device (8+8)
  • static host information: number of cores, type of cpu, speed (0)

Thus, about 50 metrics for up to 4000 machines assuming 4 bytes per datum, 15s resolution for 15 days: 70GByte (just raw data, no timestamps)
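
A quick sanity check of this estimate (a sketch; the 4 bytes per datum and the retention period are the assumptions stated above):

    # back-of-the-envelope check of the raw-data estimate above
    metrics_per_host = 50
    hosts = 4000
    bytes_per_datum = 4            # assumed: 4-byte samples, no timestamps
    resolution_s = 15
    retention_s = 15 * 24 * 3600   # 15 days

    samples = retention_s // resolution_s                        # 86,400 samples per metric
    total = metrics_per_host * hosts * bytes_per_datum * samples
    print(total / 1e9)             # ~69 GB, i.e. the ~70 GByte quoted above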

If recorded the way Ganglia does it now, this would mean:

Granularity
  • 15s with 5855 entries (one day)
  • 60s with 20160 entries (two weeks)
  • 600s with 52705 entries (one year)

Each RRD file has a size of 616 kB. With roughly 3000 nodes this adds up to a total of 27 GB for all RRDs. A benchmark of the RRD update capabilities of a system: https://gist.github.com/daniel-garcia/5470dced53234ba47ba4
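
If the RRD layout ever needs to be re-checked or benchmarked locally, a minimal sketch using the python rrdtool bindings (assuming the python-rrdtool package; one GAUGE data source with the three RRAs from the granularity list above) produces a file in the same ballpark:

    import os
    import rrdtool  # from the python-rrdtool package

    rrdtool.create(
        "test.rrd",
        "--step", "15",
        "DS:value:GAUGE:60:U:U",
        "RRA:AVERAGE:0.5:1:5855",    # 15 s resolution, ~1 day
        "RRA:AVERAGE:0.5:4:20160",   # 60 s resolution, 2 weeks
        "RRA:AVERAGE:0.5:40:52705",  # 600 s resolution, ~1 year
    )
    # 78720 rows x 8 bytes plus header -> roughly the 616 kB seen per file
    print(os.path.getsize("test.rrd"))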

Server metrics

About the same as for the compute nodes, possibly additional ones for Condor?

Services

  • global file systems (usage/busy/free/...)
  • tape usage
  • head node/submit node stats (number of users logged in, ..., plus compute node metrics)
  • E@H
  • condor stats (users/jobs/pipelines/states)
  • aggregated compute nodes stats (possibly same as above)
  • infrastructure (cooling/power): racks/LCP/UPS/cooling machines/pumps/temperatures
  • switches (data rates/errors) aggregated/per port

Einstein@Home specific metric features

  • standard server metrics (CPU load, network I/O, disk I/O)
  • metrics for mysql/mariadb server
  • create plugins to extract BOINC-specific metrics from the SQL DB (current resolution 150 s); see the sketch after this list
  • create custom dashboards and arrange metrics for wall monitors
  • ability to correlate metrics and do trend analysis
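
A minimal sketch of what such a plugin could look like, assuming a MySQL/MariaDB BOINC database with its result table and server_state column; host, credentials and the metric naming are placeholder assumptions, not our actual setup. The output is Graphite-style "name value timestamp" lines (the carbon plaintext protocol uses exactly this format), which most of the candidate tools can ingest; run from cron at the desired 150 s resolution.

    #!/usr/bin/env python
    # Hypothetical BOINC metrics extractor: counts results per server_state and
    # prints Graphite-style "name value timestamp" lines.
    # Host, credentials and metric names below are placeholders, not our setup.
    import time
    import pymysql  # assumed driver; MySQLdb would work similarly

    def collect(host="db.example", user="monitor", password="secret", db="einstein"):
        conn = pymysql.connect(host=host, user=user, password=password, database=db)
        now = int(time.time())
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT server_state, COUNT(*) FROM result GROUP BY server_state")
                for state, count in cur.fetchall():
                    print("boinc.results.server_state_%s %d %d" % (state, count, now))
        finally:
            conn.close()

    if __name__ == "__main__":
        collect()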

Service monitoring

Essentially everything Nagios is currently monitoring, with the time-series metrics moved to the part described above.

Einstein@Home specific monitoring/alerting features

  • create alerts (via email) for CPU load, disk space/utilization, HTTP connectivity, network load
  • create custom alerts based on SQL queries or directory contents (basically checking the output of ls | wc -l); see the sketch after this list
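
A minimal sketch of such a directory-contents check, following the usual Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL); the path and thresholds are example values only:

    #!/usr/bin/env python
    # Hypothetical check: alert when a spool directory accumulates too many files
    # (the scripted equivalent of "ls | wc -l"). Path and thresholds are examples.
    import os
    import sys

    def main(path="/var/spool/example", warn=100, crit=1000):
        count = len(os.listdir(path))
        if count >= crit:
            print("CRITICAL - %d files in %s" % (count, path))
            sys.exit(2)
        if count >= warn:
            print("WARNING - %d files in %s" % (count, path))
            sys.exit(1)
        print("OK - %d files in %s" % (count, path))
        sys.exit(0)

    if __name__ == "__main__":
        main()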

Logfile analysis

It would be nice to plan for the ability to gather, normalize and analyze application logfiles centrally. The use-case for Einstein@Home would be to make it easier to correlate events between the different components which currently reside on different servers.

  • write plugins to parse multi-line application logfiles which use non-standard formats
  • "untangle" logfiles where the output of two requests at the same time is mixed up in the file

We are currently producing at least 10 GB of raw logfiles per day (upload and webserver). The recommendation is to transform logs into JSON as early as possible, since heavy regex parsing is a serious CPU bottleneck in Logstash.
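
A minimal sketch of that transformation, assuming the webserver writes the common combined log format (the regex and field names are generic, not taken from our actual logs); emitting JSON at the source lets Logstash use its JSON codec instead of CPU-heavy grok/regex filters:

    #!/usr/bin/env python
    # Hypothetical log-to-JSON converter for combined-format access logs.
    # Reads plain log lines on stdin, writes one JSON object per line on stdout.
    import json
    import re
    import sys

    LINE = re.compile(
        r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    )

    for raw in sys.stdin:
        m = LINE.match(raw)
        if m is None:
            continue  # or route unparsed lines to a separate output
        sys.stdout.write(json.dumps(m.groupdict()) + "\n")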

Other requirements/wishes

  • easily extensible - we need to add our own custom checks/metrics easily
  • file system based configuration - data may be stored in DB, but configuration needs to be file based (to allow easy manipulation by salt)
  • web interface that does not suck(TM), i.e. capable of fast browsing/finding metrics/problems
  • easy to create own graphs
  • able to handle our infrastructure (thus aiming for at least double the current size)
  • filter metrics by names/values, e.g. show me all nodes from rack 52 with cpu_idle of 10% or more
  • interactive drill down, i.e. zoom into large graph, reorganize graph by hiding parts of it (e.g. show several metrics of several hosts in a graph and select/unselect some interactively)

Misc/Comments

  • maybe we want to keep the current separation between node and server (service) monitoring - or possibly draw a new distinction between time-series/metric-based monitoring and service-level monitoring/alerting?
  • ganglia will still be needed to feed LIGO's watchtower

Existing tools

CA would like to stick to open source tools, i.e. avoid lock-in to any vendor-imposed restrictions. We should also exclude any package that has not seen at least a minor release in 2015 or 2016. Finally, let's discuss/evaluate packages maintained within Debian versus an APT repo maintained by the project versus not packaged for Debian at all.

Trying to maintain alphabetical order (version numbers as currently packaged in Jessie/Stretch if available):

  • Ganglia (3.6.0/3.6.0) (stores data in RRDs)
  • Graphite (0.9.12/0.9.15) (stores data in Whisper, an RRD-like format)
  • Graylog (self hosted APT repo) (logfile analysis using ElasticSearch 2.4 and MongoDB)
  • Icinga 1 (1.11.6/1.13.3) (stores charts in RRDs)
  • Icinga 2 (2.1.1/?)
  • InfluxDB (1.0.2/?)
  • Kibana (ELK stack) (self hosted APT repo) (logfile analysis using ElasticSearch 5.1 and Logstash 5.1) Note: Kibana doesn't support authentication for free but https://floragunn.com/searchguard/ does (except LDAP/Kerberos)
  • MetricBeat (self hosted APT) (performance metrics on top of ElasticSearch with Kibana interface)
  • Monitorix (self hosted APT repo) (stores data at least in RRDs)
  • Munin (2.0.25/2.0.27) (stores data in RRDs)
  • Nagios (2.1.1/2.1.2) (stores charts in RRDs)
  • Prometheus (-/1.2.3) (looks like self created bulk storage with levelDB for indexes)
  • Shinken (2.0.3/?) (very nagios like)
  • Zabbix (2.2.7/3.0.6) (stores metrics/service data in SQL - mysql/postgresql/..)
  • Zenoss () - seems to be limited to 1000 devices in open source version

Another list of tools: https://github.com/monitoringsucks/tool-repos

* means this is probably for storing time series metrics only, i.e. no real service awareness

to be tested

  • Prometheus
  • Zabbix

not to be tested right now

  • Shinken (looks like Nagios)

results

  • ganglia: overview page with 2400 graphs takes 20s to load, "limit filtering" not possible

prometheus

  • Prometheus was killed by the OOM killer after one day of collecting data
  • when shutting Prometheus down, it dies with an out-of-memory message; increased the memory to 4 GB

the test

  • the monitoring server is one of the old RA servers, i.e. memory is limited
  • 42 clients provide 50 metrics each, recorded at a frequency of 1 Hz (about 2100 samples/s in total)

| tool       | Debianized     | charts | customized charts    | alerts | mobile client | performance |
| Prometheus | lenny backport | y      | y                    | y      |               |             |
| nagios     | y              | y      | by messing with rrds | y      | y             |             |
| zabbix     |                |        |                      |        |               |             |
| ganglia    | y              | y      | by messing with rrds |        |               |             |

Carsten's functional requirements

I'll try to label each item with W (wish) or R (requirement):
  • ingest all Atlas data from any host/service we directly control (R)
  • possibly ingest also any data relevant to E@H (W)
  • "hardware" specific metrics from every host (at least CPU usage, RAM usage, file systems, temperature) (R)
  • "service" specific metrics for every "important" service (specifics obviously needed per service) (R)
  • aggregation of metrics (interactive (R), programmatically (R))
  • alert generation based on metrics (R)
  • different alert levels (W)
  • alerts available remotely via mobile app (W!)
  • "fast" generation of predefined graphs (R)
  • creation of custom dashboards (W), auto-updates of dashboards (W)
  • retain full data set for at least a week (R), ideally for a month (W)
  • retain trend data for at least a year (R), ideally for three years (W)
  • easy to "drill-down" from (aggregated) graphs, e.g. try to correlate problem on storage unit X with usage pattern Y by user Z (R)
  • fully salt configurable, i.e. new hosts/services can simply be added via salt (R)

In the end, the whole monitoring/alerting system may consist of various pieces (backend/frontend), but they need to work together and must not duplicate existing feature sets.

Example scenarios which should be covered

  • BACnet queries from the cooling system. Alert if water flow/valve values are out of the good range iff the cooling machine is free to start up from the DDC
  • aggregate various values from the cooling system and display them, along with the original values, overlaid on an image
  • how to create templates which are automatically filled by clients (LCP/Rack cubes or compute nodes)
  • custom data ingestion, i.e. what possibilities do we have to add Condor job data into central db?
  • data mining - assume we see high IO waits on the scratch file servers; how can we correlate this with jobs in D state and/or user jobs just started/running? (see the sketch below)
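
For the last scenario, a minimal sketch of one building block: listing processes currently in D (uninterruptible sleep) state by scanning /proc on a node; correlating those PIDs with Condor job and user data would then have to happen in the central system.

    #!/usr/bin/env python
    # Hypothetical helper: list processes in D (uninterruptible sleep) state,
    # e.g. jobs blocked on I/O to the scratch file servers.
    import os

    def d_state_processes():
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/stat" % pid) as f:
                    stat = f.read()
                # the state flag is the first field after the ")" that ends the
                # process name, which may itself contain spaces
                state = stat[stat.rindex(")") + 2]
                if state == "D":
                    with open("/proc/%s/comm" % pid) as f:
                        yield pid, f.read().strip()
            except (EnvironmentError, ValueError, IndexError):
                continue  # process vanished while we were looking

    if __name__ == "__main__":
        for pid, comm in d_state_processes():
            print(pid, comm)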
-- CarstenAulbert - 15 Dec 2016