hscloud monitoring

Quick links

Old Global Dashboard: monitoring.hackerspace.pl - old monitoring system, unrelated to this one, configured using Chef at management.hackerspace.pl (long since dead). This setup is supposed to replace it.

Architecture

The hscloud monitoring solution is two-tiered:

at the global tier we run metrics aggregation, long-term storage, dashboards and alerting.
at the agent tier we collect metrics from various sources (possibly even from lower-tier agents).

All agent-tier agents send metrics to all global instances.

      .--------.     .--------.              '.
      | global |     | global |               > - global tier
      '--------'     '--------'              .'   (contains 'global instances')
        |    '---. .---'    |
        |         X         |
        |    .---' '---.    |
        |    |         |    |
.--------------.     .--------------------. '.
|   cluster    |     |    hswaw-proxy     |  |
| k0.hswaw.net |     | waw.hackerspace.pl |   > - agent tier
'--------------'     '--------------------' .'    (contains 'agents')
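
The fan-out in the diagram can be sketched as a small model. This is only an illustration: the agent names come from the diagram, but the global instance names (`global-a.example.com`, `global-b.example.com`) and the helper functions are hypothetical. It shows that, because every agent sends to every global instance, any global instance that stays up still receives metrics from all agents.

```python
from itertools import product

# Agent names are taken from the diagram above; the global instance names
# are hypothetical placeholders, since this README does not name them.
AGENTS = ["k0.hswaw.net", "waw.hackerspace.pl"]
GLOBALS = ["global-a.example.com", "global-b.example.com"]


def fanout(agents, global_instances):
    """Return every (agent, global) link: each agent sends to each global."""
    return list(product(agents, global_instances))


def coverage(links, down=()):
    """Map each global instance that is up to the set of agents feeding it."""
    seen = {}
    for agent, instance in links:
        if instance not in down:
            seen.setdefault(instance, set()).add(agent)
    return seen


links = fanout(AGENTS, GLOBALS)
for agent, instance in links:
    print(f"{agent} -> {instance}")

# With one global instance down, the other still hears from every agent.
remaining = coverage(links, down={"global-a.example.com"})
```

The number of links grows as agents times global instances, which is presumably acceptable while both tiers stay small.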

Agent - cluster

Cluster agents are responsible for collecting Kubernetes cluster metrics. Each runs a Prometheus server that scrapes kubelet/cadvisor/... metrics and sends them off to the global instances.
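
As a rough sketch of what such an agent's Prometheus configuration might contain, the snippet below builds one as a Python dictionary and prints it as JSON (which is also valid YAML). The real configuration is generated from jsonnet in this directory and may differ. The global instance hostnames, the `cluster` external label and the `agent_config` helper are assumptions made for illustration, and TLS and authentication settings are omitted. The `/api/v1/write` path is the Prometheus remote write endpoint exposed by Victoria Metrics.

```python
import json

# Hypothetical global instance hostnames; this README does not name them.
GLOBAL_INSTANCES = ["global-a.example.com", "global-b.example.com"]


def agent_config(cluster, global_instances):
    """Build a Prometheus-style config for one cluster agent (sketch only)."""
    return {
        "global": {
            # Assumed label so global instances can tell agents apart.
            "external_labels": {"cluster": cluster},
        },
        "scrape_configs": [
            {
                # Kubelet's own metrics, discovered per Kubernetes node.
                "job_name": "kubelet",
                "scheme": "https",
                "kubernetes_sd_configs": [{"role": "node"}],
            },
            {
                # cAdvisor metrics are served by the kubelet under this path.
                "job_name": "cadvisor",
                "scheme": "https",
                "metrics_path": "/metrics/cadvisor",
                "kubernetes_sd_configs": [{"role": "node"}],
            },
        ],
        # One remote_write entry per global instance: every agent
        # sends to every global instance.
        "remote_write": [
            {"url": f"https://{host}/api/v1/write"} for host in global_instances
        ],
    }


config = agent_config("k0.hswaw.net", GLOBAL_INSTANCES)
print(json.dumps(config, indent=2))
```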

Global Instances

Global instances run Victoria Metrics, ingest metrics from all agents, and perform long-term storage. In the future they will also run Grafana and Alertmanager.