Commit graph

35 commits

Author SHA1 Message Date
9a88f28805 cluster/{machines,certs}: add dcr03s16.hswaw.net
Also make dataplane-only nodes actually work:
- make kubeproxy use the same package as kubelet
- disable firewall

Change-Id: I7babbb749656e6f75151c8eda6e3f09f3c6bff5f
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1686
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-10-09 19:02:18 +00:00
9f0e1e88f1 cluster/clustercfg: rewrite it in Go
This replaces the old clustercfg script with a brand spanking new
mostly-equivalent Go reimplementation. But it's not exactly the same,
here are the differences:

 1. No cluster deployment logic anymore - we expect everyone to use ops/
    machine at this point.
 2. All certs/keys are Ed25519 and do not expire by default - but
    support for short-lived certificates is there, and is actually more
    generic and reusable. Currently it's only used for admincreds.
 3. Speaking of admincreds: the new admincreds automatically figure out
    your username.
 4. admincreds also doesn't shell out to kubectl anymore, and doesn't
    override your default context. The generated creds can live
    peacefully alongside your normal prodaccess creds.
 5. gencerts (the new nodestrap without deployment support) now
    automatically generates certs for all nodes, based on local Nix
    modules in ops/.
 6. No secretstore support. This will be changed once we rebuild
    secretstore in Go. For now users are expected to manually run
    secretstore sync on cluster/secrets.

Change-Id: Ida935f44e04fd933df125905eee10121ac078495
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1498
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-06-19 22:23:52 +00:00
7e841065b0 *: post-certmanager manifests update
Change-Id: I745c850268c31777c5722a9833c8152a55615aed
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1512
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-06-19 21:20:44 +00:00
f6e6abb0f5 ops: repin cluster machines to older nixpkgs checkout
Change-Id: I592c689e33d81c131d389d87153900165aac19e5
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1486
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-31 22:53:59 +00:00
8f0842341a ops: repin edge01.waw to old nixpkgs
We accidentally bumped nixpkgs at https://gerrit.hackerspace.pl/1441 and
forgot to upgrade it. We don't wanna upgrade it right now.

This doesn't give us back a zero-diff, but it's close enough.

Change-Id: I1a9f50df88e564cd4de76f67adfaa1e88a746f2e
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1471
Reviewed-by: patryk <patryk@hackerspace.pl>
2023-03-10 20:17:15 +00:00
712a5dc3e3 cluster: add bc01n05.hswaw.net
This will be our postgres pet machine.

Change-Id: Ifff6648394ca6407fb5b5daa853f4abc42541703
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1467
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-04 22:26:46 +00:00
3a9562ecfd cluster: k0: remove native ceph
After installing HBJ11s and spreading out the mons we're going full
Rook.

Change-Id: Ia00cbe953548f06cf27343371fc67890619c8262
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1466
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-04 22:26:39 +00:00
ef3aab6a14 k0: host os bump wip
This bumps it on bc01n01, but nowhere else yet.

We have to vendor some more kubelet bits unfortunately.

Change-Id: Ifb169dd9c2c19d60f88d946d065d4446141601b1
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1465
Reviewed-by: implr <implr@hackerspace.pl>
2023-03-04 22:26:14 +00:00
deeeff861e hswaw/machines: add sound.waw.hackerspace.pl
Change-Id: Id0e6a02d9ae4cf61d758713a99d21c6da0c72b66
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1401
Reviewed-by: vuko <vuko@hackerspace.pl>
Reviewed-by: informatic <informatic@hackerspace.pl>
2022-10-09 19:35:18 +00:00
957d91180a bgpwtf: edge01: bump nixpkgs, use networkd
Change-Id: I038f9518e090aecc90f464475f29c5b3c1570eff
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1339
Reviewed-by: implr <implr@hackerspace.pl>
2022-07-07 23:51:57 +00:00
c35ea6a220 ops: inject the machine's pkgs into the machine's hscloud tree
This ensures, for example, that the packets are for the correct
architecture.

Change-Id: If17c307fbad02ee72c6dd21a874c59514415ab2e
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1334
Reviewed-by: implr <implr@hackerspace.pl>
2022-07-07 18:10:40 +00:00
dcdbd8425c hswaw/machines: add tv2
Change-Id: I657c316bcc663c79b6886d5843b9de5cbf17f1c3
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1333
Reviewed-by: informatic <informatic@hackerspace.pl>
2022-07-07 18:07:18 +00:00
5ac5e4bec3 hswaw/machines: add tv1, larrythebuilder
This adds two brand new AArch64 machines: a generic builder (and
instructions on how to use it) and tv1.waw, an RPi4 acting as digital
signage in the space.

Change-Id: I8d38344ec35f99f4b872cf9526f6e6771fbffc43
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1330
Reviewed-by: informatic <informatic@hackerspace.pl>
2022-07-06 19:49:37 +00:00
55a486ae49 cluster: refactor nix machinery to fit //ops
This is a chonky refactor that get rids of the previous cluster-centric
defs-* plain nix file setup.

Now, nodes are configured individually in plain nixos modules, and are
provided a view of all other nodes in the 'machines' attribute. Cluster
logic is moved into modules which inspect this array to find other nodes
within the same cluster.

Kubernetes options are not fully clusterified yet (ie., they are still
hardcode to only provide the 'k0' cluster) but that can be fixed later.
The Ceph machinery is a good example of how that can be done.

The new NixOS configs are zero-diff against prod. While this is done
mostly by keeping the logic, we had to keep a few newly discovered
'bugs' around by adding some temporary options which keeps things as they
are. These will be removed in a future CL, then introducing a diff (but
no functional changes, hopefully).

We also remove the nix eval from clustercfg as it was not used anymore
(basically since we refactored certs at some point).

Change-Id: Id79772a96249b0e6344046f96f9c2cb481c4e1f4
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1322
Reviewed-by: informatic <informatic@hackerspace.pl>
2022-06-19 11:48:52 +00:00
a13208bf9b ops/sso: bump to latest version, roll out RSA JWT signing
Bump to:
https://code.hackerspace.pl/informatic/sso-v2/commit/?id=682322c98063c596d2e46f1e7844551c5a7226db

This introduces (and enables) support for RSA id_tokens (that are
required by oauth2_proxy for example) and fixes/improves handling of
non-active members.

Change-Id: Ia7d5e5ca7a2769f11f6190add78114e3b6141c6e
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1304
Reviewed-by: q3k <q3k@hackerspace.pl>
2022-05-01 08:17:57 +00:00
b6bc3e69b9 hswaw/machines/customs: upgrade to workspace nixos-unstable 2021-08-11
Change-Id: I6eb4408d40e14f24ebbe3f9f3aef0be952b44e8b
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1167
Reviewed-by: vuko <vuko@hackerspace.pl>
2021-10-20 20:58:16 +00:00
a01905ae64 hswaw/machines/customs: check in code.hackerspace.pl/vuko/customs
Change-Id: Ic698cce2ef0060a54b195cf90574696b8be1eb0f
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1162
Reviewed-by: informatic <informatic@hackerspace.pl>
2021-10-20 20:58:16 +00:00
a16af2db91 ops/machines.nix: inject workspace
This makes the hscloud readTree object available as following in NixOS
modules:

  { config, pkgs, workspace, ... }: {
    environment.systemPackages = [
      workspace.hswaw.laserproxy
    ];
  }

Change-Id: I9c8146f5156ffe5d06cb8408a2ce632657990d59
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1164
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-10-16 21:24:22 +00:00
9848e7e15f cluster: deploy NixOS-based ceph
First pass at a non-rook-managed Ceph cluster. We call it k0 instead of
ceph-waw4, as we pretty much are sure now that we will always have a
one-kube-cluster-to-one-ceph-cluster correspondence, with different Ceph
pools for different media kinds (if at all).

For now this has one mon and spinning rust OSDs. This can be iterated on
to make it less terrible with time.

See b/6 for more details.

Change-Id: Ie502a232c700af93f33fcad9fa1c57058161aa11
2021-09-11 20:33:24 +00:00
b3c6770f8d ops, cluster: consolidate NixOS provisioning
This moves the diff-and-activate logic from cluster/nix/provision.nix
into ops/{provision,machines}.nix that can be used for both cluster
machines and bgpwtf machines.

The provisioning scripts now live per-NixOS-config, and anything under
ops.machines.$fqdn now has a .passthru.hscloud.provision derivation
which is that script. When ran, it will attempt to deploy onto the
target machine.

There's also a top-level tool at `ops.provision` which builds all
configurations / machines and can be called with the machine name/fqdn
to call the corresponding provisioner script.

clustercfg is changed to use the new provisioning logic.

Change-Id: I258abce9e8e3db42af35af102f32ab7963046353
2021-09-10 23:55:52 +00:00
0ec06d7b75 ops: update deploy instructions to include profile set
This is necessary for the NixOS EFI boot machinery to pick up the new
derivation when switching to it, otherwise the machine will not boot
into the newly switched configuration.

Change-Id: I8b18956d2afeea09c38462f09a00c345cf86f80d
2021-04-18 18:13:33 +00:00
a0332a75a0 ops/machines: pin edge01.waw to its current version of nixpkgs
Stopgap until we finish b/3, need to deploy some changes on it without
rebooting into newer nixpkgs.

Change-Id: Ic2690dfcb398a419338961c8fcbc7e604298977a
2021-03-18 19:22:41 +00:00
7f8f3e9f9c ops/sso: upgrade sso-v2
Change in sso-v2 unifies id_token and userinfo endpoint handling - now
groups, nickname, email and preferred_username keys are present in
id_tokens as well.

https://code.hackerspace.pl/informatic/sso-v2/commit/?id=c4c810cd255a7bfcab5ced3fb88c8b311b518c34

Change-Id: Ib22994edc067fd83701590182f8096f6fca692ba
2021-02-01 17:03:27 +01:00
9e3ca9c841 ops/sso: move jsonnets to kube/
This is in preparation for moving the sso source code into hscloud.

Change-Id: I4325df617dc82c17fb4c96762743f0b70122976f
2021-01-31 15:52:06 +01:00
cc2ff79f01 ops/monitoring: move grafana to sso.
Change-Id: Ib2ecf6820454a160834db2ac212b31d9d5306972
2021-01-30 17:26:47 +01:00
q3k
d82807e024 Merge changes I84873bc3,I1eedb190
* changes:
  ops/monitoring: deploy grafana
  ops/monitoring: scrape apiserver, scheduler, and controller-manager
2021-01-30 16:22:44 +00:00
d6c97596cd ops/sso: "the hackerspace oidc/oauth2 provider" deployment
Change-Id: I092b844364ed30037eff00188dcdf5d6d3c228c5
2021-01-29 23:23:09 +01:00
4f7caf8d86 ops/monitoring: deploy grafana
This is a basic grafana running on:

    https://monitoring-global-dashboard.k0.hswaw.net/

It contains a data source pointing at the corresponding global victoria
metrics. There's no dashboards, these will be provisioned soon via
jsonnet/grafonnet.

Change-Id: I84873bc323d1727096e3ce818fae122a9af3e191
2020-12-17 22:10:31 +00:00
cfc0496266 ops/monitoring: scrape apiserver, scheduler, and controller-manager
These get scraped by public IP address, which get retrieved via service
discovery in Prometheus (by using the endpoints role on the
default/kubernetes service).

Also drive-by fix cluster prometheus resources - the default
configuration wants at least 3GB of physical memory.

Change-Id: I1eedb19051f62b40613f69e5f0f736d5958acf42
2020-12-17 22:09:56 +00:00
7d311e9602 ops/monitoring: pull in grafonnet-7.0
Change-Id: Ie036ef767419418876a18255a5ad378f5cfa1535
2020-10-10 15:59:45 +00:00
363bf4f341 monitoring: global: implement
This creates a basic Global instance, running Victoria Metrics on k0.

Change-Id: Ib03003213d79b41cc54efe40cd2c4837f652c0f4
2020-10-06 14:28:27 +00:00
6abe4fa771 bgpwtf/machines: init edge01.waw
This configures our WAW edge router using NixOS. This replaces our
previous Ubuntu installation.

Change-Id: Ibd72bde66ec413164401da407c5b268ad83fd3af
2020-10-03 14:57:38 +00:00
c1364e8d8a ops/monitoring: add implr to owners
This will fix future reviews from him having to require my +2.

Change-Id: Icde1f64fe4387e92d19943d7469ce0569eb45257
2020-06-07 02:23:09 +02:00
2022ac2338 ops/monitoring: split up jsonnet, add simple docs
Change-Id: I8120958a6862411de0446896875766834457aba9
2020-06-06 17:05:15 +02:00
ce81c39081 ops/metrics: basic cluster setup with prometheus
We handwavingly plan on implementing monitoring as a two-tier system:

 - a 'global' component that is reponsible for global aggregation,
   long-term storage and alerting.
 - multiple 'per-cluster' components, that collect metrics from
   Kubernetes clusters and export them to the global component.

In addition, several lower tiers (collected by per-cluster components)
might also be implemented in the future - for instance, specific to some
subprojects.

Here we start sketching out some basic jsonnet structure (currently all
in a single file, with little parametrization) and a cluster-level
prometheus server that scrapes Kubernetes Node and cAdvisor metrics.

This review is mostly to get this commited as early as possible, and to
make sure that the little existing Prometheus scrape configuration is
sane.

Change-Id: If37ac3b1243b8b6f464d65fee6d53080c36f992c
2020-06-06 15:56:10 +02:00