
hscloud monitoring
==================

Quick links
-----------

 - *Old Global Dashboard*: [monitoring.hackerspace.pl](https://monitoring.hackerspace.pl) - the old monitoring system, unrelated to this one. It was configured using Chef at management.hackerspace.pl (long since dead). This setup is meant to replace it.

Architecture
------------

The hscloud monitoring solution is two-tiered:

 - at the *global* tier we run metrics aggregation, long-term storage, dashboard and alerting.
 - at the *agent* tier we collect metrics from various sources (possibly including lower-tier agents).

Every agent sends its metrics to every global instance.


          .--------.     .--------.              '.
          | global |     | global |               > - global tier
          '--------'     '--------'              .'   (contains 'global instances')
            |    '---. .---'    |
            |         X         |
            |    .---' '---.    |
            |    |         |    |
    .--------------.     .--------------------. '.
    |   cluster    |     |    hswaw-proxy     |  |
    | k0.hswaw.net |     | waw.hackerspace.pl |   > - agent tier
    '--------------'     '--------------------' .'    (contains 'agents')


Agent - cluster
---------------

Cluster agents are responsible for collecting Kubernetes cluster metrics. Each one runs a Prometheus server that scrapes kubelet/cadvisor/... metrics and sends them to the global instances.
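
One common way for a Prometheus server to ship samples elsewhere is its `remote_write` configuration section, and VictoriaMetrics accepts that protocol on `/api/v1/write`. Whether hscloud uses exactly this mechanism is an assumption here, and the hostnames below are hypothetical placeholders. A minimal Python sketch that renders one `remote_write` entry per global instance:

```python
# Hypothetical global instance endpoints (assumptions, not taken from this README).
GLOBAL_INSTANCES = [
    "https://global-a.example.com/api/v1/write",
    "https://global-b.example.com/api/v1/write",
]


def remote_write_section(global_urls):
    """Render a Prometheus `remote_write` YAML section, one entry per URL."""
    lines = ["remote_write:"]
    for url in global_urls:
        lines.append(f"  - url: {url}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the fragment so it can be pasted into a Prometheus config file.
    print(remote_write_section(GLOBAL_INSTANCES), end="")
```

Listing every global instance in the fragment keeps an agent consistent with the rule that all agents send to all global instances.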

Global Instances
----------------

Global instances run VictoriaMetrics, ingest metrics from all agents, and provide long-term storage. In the future they will also run Grafana and Alertmanager.
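
VictoriaMetrics serves a Prometheus-compatible query API, so one simple way to check that a global instance is ingesting data is an instant query such as `up`. The sketch below only builds the request URL. The base URL is a hypothetical placeholder, not a real hscloud endpoint.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical base URL of one global instance (assumption, not from this README).
GLOBAL_BASE_URL = "https://global-a.example.com"


def instant_query_url(base_url, promql):
    """Build a URL for the Prometheus-compatible instant query endpoint."""
    query_string = urlencode({"query": promql})
    return urljoin(base_url, "/api/v1/query") + "?" + query_string


if __name__ == "__main__":
    # The printed URL can be fetched with curl or a browser.
    print(instant_query_url(GLOBAL_BASE_URL, "up"))
```

Running the same query against each global instance should return comparable results if every agent is writing to all of them.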