Commit Graph

332 Commits (fd505b8154e307ca23a0cb9eef8574c40e1f6bd3)

Author SHA1 Message Date
radex 26fb573055 doc: improve cluster/user docs, make them more discoverable
Change-Id: Icbb348865a442a01a3ab191dad88662a88635007
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1565
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-09-22 20:44:48 +00:00
q3k b6504238e7 *: add gomod placeholders for generated files
Change-Id: I8a4824ff31590185cd45fd43cc065bb8e2fa7bb2
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1580
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-09-01 16:50:48 +00:00
radex c2c66bf770 cluster/kube: update admitomatic settings for inventory
Change-Id: I62279519f93da338591b1b164878e33027b8f851
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1576
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-08-17 12:39:56 +00:00
q3k 8100a2de97 third_party: replace jq with gojq
Building jq portably is annoying, and the way we were doing it (which,
IIRC, we stole from some Google project) sucked. Let's use a Go jq clone
instead.

This is an alternative to 1535. jq is currently used only in one
script, which could really be replaced by a Go program, but let's keep
it simple for now.

Change-Id: Ie25dffadd545df143490f510e9b75a74adf81492
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1540
Reviewed-by: palid <palid@hackerspace.pl>
2023-07-24 14:47:54 +00:00
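
[Editor's note: gojq is usable both as a drop-in CLI and as a Go library, which is what makes the eventual replace-the-script-with-a-Go-program route easy. A minimal sketch of the library API (illustrative only, not code from this repo):]

    package main

    import (
        "fmt"
        "log"

        "github.com/itchyny/gojq"
    )

    func main() {
        // Parse a jq expression once, then run it over a JSON-like value.
        query, err := gojq.Parse(".spec.replicas")
        if err != nil {
            log.Fatal(err)
        }
        input := map[string]any{"spec": map[string]any{"replicas": 3}}
        iter := query.Run(input)
        for {
            v, ok := iter.Next()
            if !ok {
                break
            }
            if err, isErr := v.(error); isErr {
                log.Fatal(err)
            }
            fmt.Println(v) // 3
        }
    }
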
q3k 03c2d996a0 cluster: fix prodvider deploy (after new CA)
Change-Id: Icbdb5e3ac592e9eac3a033ba50af401b706c3e78
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1541
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-07-24 14:15:46 +00:00
informatic 10384cd394 cluster/registry: fix common namespaces
The public pull ACL in the middle had priority over our more specific
rules - moving these to the top fixes the common registry namespace ACLs.

Change-Id: Ia6f05cef09c0db4eb71155d2c0e2d9944b81f903
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1522
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-06-19 23:15:37 +00:00
q3k c1f372561a cluster/admitomatic: implement opt-out namespaces
Change-Id: I32d4b019211fa755e2b3b103b88ea3f4c14e500f
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1521
Reviewed-by: informatic <informatic@hackerspace.pl>
2023-06-19 22:54:33 +00:00
q3k 9f0e1e88f1 cluster/clustercfg: rewrite it in Go
This replaces the old clustercfg script with a brand spanking new,
mostly-equivalent Go reimplementation. But it's not exactly the same;
here are the differences:

 1. No cluster deployment logic anymore - we expect everyone to use ops/
    machine at this point.
 2. All certs/keys are Ed25519 and do not expire by default - but
    support for short-lived certificates is there, and is actually more
    generic and reusable. Currently it's only used for admincreds.
 3. Speaking of admincreds: the new admincreds automatically figure out
    your username.
 4. admincreds also doesn't shell out to kubectl anymore, and doesn't
    override your default context. The generated creds can live
    peacefully alongside your normal prodaccess creds.
 5. gencerts (the new nodestrap without deployment support) now
    automatically generates certs for all nodes, based on local Nix
    modules in ops/.
 6. No secretstore support. This will be changed once we rebuild
    secretstore in Go. For now users are expected to manually run
    secretstore sync on cluster/secrets.

Change-Id: Ida935f44e04fd933df125905eee10121ac078495
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1498
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-06-19 22:23:52 +00:00
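
[Editor's note: point 2 above boils down to something like the following with Go's standard library - a self-signed, never-expiring Ed25519 CA. A sketch under assumed names; the actual clustercfg code may differ:]

    package main

    import (
        "crypto/ed25519"
        "crypto/rand"
        "crypto/x509"
        "crypto/x509/pkix"
        "log"
        "math/big"
        "time"
    )

    func main() {
        pub, priv, err := ed25519.GenerateKey(rand.Reader)
        if err != nil {
            log.Fatal(err)
        }
        tmpl := &x509.Certificate{
            SerialNumber: big.NewInt(1),
            Subject:      pkix.Name{CommonName: "example-cluster-ca"},
            NotBefore:    time.Now(),
            // "Does not expire by default": an arbitrary far-future date.
            NotAfter:              time.Date(2100, 1, 1, 0, 0, 0, 0, time.UTC),
            IsCA:                  true,
            KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
            BasicConstraintsValid: true,
        }
        der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, pub, priv)
        if err != nil {
            log.Fatal(err)
        }
        _ = der // DER-encoded certificate, ready to PEM-encode and write out.
    }
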
informatic 7e841065b0 *: post-certmanager manifests update
Change-Id: I745c850268c31777c5722a9833c8152a55615aed
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1512
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-06-19 21:20:44 +00:00
q3k 3dd3ff5dcd cluster/cert-manager: update to v1.5.0
Change-Id: I7a4cdadc9956141292302bc004d09d6e9e22855e
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1497
Reviewed-by: informatic <informatic@hackerspace.pl>
2023-05-26 10:38:16 +00:00
q3k ffdb97b7dd cluster/prodaccess: fix cert migration bug
Change-Id: I7426e60731b09c571aa7385f5213e998f04675a6
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1510
Reviewed-by: ironbound <ironbound@hackerspace.pl>
2023-04-14 08:13:39 +00:00
q3k 57df027f28 cluster/kube: add k0-cert-manager.jsonnet view
Change-Id: I4d008839f6d6190d0d88fd3fff44974c4f2db2c0
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1499
Reviewed-by: implr <implr@hackerspace.pl>
2023-04-01 14:58:50 +00:00
q3k 9251121fa9 cluster/certs: remove old kube CA
This completes the migration away from the old CA/cert infrastructure.

The tool which was used to generate all these certs will come next. It's
effectively a reimplementation of clustercfg in Go.

We also removed the unused kube-serviceaccounts cert, which was
generated by the old tooling for no good reason (we only need a key for
service accounts, not an actual cert...).

Change-Id: Ied9e5d8fc90c64a6b4b9fdd20c33981410c884b4
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1501
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-04-01 13:55:18 +00:00
q3k bdf2fa326f cluster/certs: finish replacing all CAs
This finishes the regeneration of all cluster CAs/certs to be
never-expiring Ed25519 certs.

We still have leftovers of the old Kube CA (and it's still being
accepted in Kubernetes components). Cleaning that up is the next step.

Change-Id: I883f94fd8cef3e3b5feefdf56ee106e462bb04a9
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1500
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-04-01 13:55:14 +00:00
q3k 989dfa3183 cluster/kube: add k0-prodvider.jsonnet view
Change-Id: I170fbef3008f906c26ed79387858c3c1e4e2e10c
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1496
Reviewed-by: implr <implr@hackerspace.pl>
2023-04-01 13:54:49 +00:00
q3k 7572f0790c k0: add disks
Already deployed, now rebalancing.

Change-Id: I536a063bc346effd07a1700aeffe598cc35f6f7a
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1493
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-04-01 11:21:54 +00:00
q3k 073d850a95 cluster/prodvider: redeploy
Change-Id: I7a6cce06bb7c2f495d5354d3a2bebef64e307e42
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1491
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-04-01 11:18:25 +00:00
q3k bbc5a43d77 cluster: move kubernetes services to temporary CA bundle
This is already deployed, and it allows Kubernetes components
(temporary) freedom to use the old or new CA cert.

Change-Id: I8ac7f773a333c30fa22902b8edc327c0c700a482
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1490
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-31 22:53:59 +00:00
q3k 3a6d67e0c4 cluster/prodvider: rewrite against x509 lib for ed25519 support
This gets rid of cfssl for the kubernetes bits of prodvider, instead
using plain crypto/x509. This also allows us to support our new fancy
Ed25519 CA.

Change-Id: If677b3f4523014f56ea802b87499d1c0eb6d92e9
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1489
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-31 22:53:59 +00:00
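
[Editor's note: issuing a leaf certificate with plain crypto/x509 against an Ed25519 CA looks roughly like this. Names and lifetimes are assumptions, not prodvider's actual code:]

    package prodvider

    import (
        "crypto/ed25519"
        "crypto/rand"
        "crypto/x509"
        "crypto/x509/pkix"
        "math/big"
        "time"
    )

    // issueClientCert signs a short-lived client certificate for user with
    // the given CA cert/key. Sketch only; the real logic differs.
    func issueClientCert(ca *x509.Certificate, caKey ed25519.PrivateKey, user string) ([]byte, ed25519.PrivateKey, error) {
        pub, priv, err := ed25519.GenerateKey(rand.Reader)
        if err != nil {
            return nil, nil, err
        }
        tmpl := &x509.Certificate{
            SerialNumber: big.NewInt(time.Now().UnixNano()),
            Subject:      pkix.Name{CommonName: user},
            NotBefore:    time.Now(),
            NotAfter:     time.Now().Add(13 * time.Hour), // lifetime assumed
            KeyUsage:     x509.KeyUsageDigitalSignature,
            ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
        }
        der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, pub, caKey)
        if err != nil {
            return nil, nil, err
        }
        return der, priv, nil
    }
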
q3k 777aab92a9 cluster/prodaccess: use new kube CA cert
Change-Id: I1bff03008a4a212ad93e5eaa112adaa2b0cad3e7
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1488
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-31 22:53:59 +00:00
q3k a4f8a459b9 cluster: partial cert bump
Done:

 1. etcd peer CA & certs
 2. etcd client CA & certs
 3. kube CA (currently all components set to accept both new and old CA,
    new CA called ca-kube-new)
 4. kube apiserver
 5. kubelet & kube-proxy
 6. prodvider intermediate

TODO:

 1. kubernetes controller-manager & kubernetes scheduler
 2. kubefront CA
 3. admitomatic?
 4. undo bundle on kube CA components to fully transition away from old
    CA

Change-Id: If529eeaed9a6a2063bed23c9d81c57b36b9a0115
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1487
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-31 22:53:59 +00:00
implr 779727b39e machines/bc01n05: postgres: auth, hba, more ram
Change-Id: Id10b97efa3588a2a9147a349391da559e6cce7e5
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1482
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-28 21:22:50 +00:00
implr 3b0887397a machines/bc01n05: postgres tuning
Change-Id: I30925a84216b45bde9e92b67b007f15b2cdf58e8
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1481
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-26 12:16:20 +00:00
implr 821b839b16 machines/bc01n05: zfsify; initial postgres
Change-Id: I355ac4aa3c56a1e6a564b7a3c7cfc4e67b072dae
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1470
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-11 21:33:14 +00:00
implr 3320155d23 cluster/machines/base: enable microcode loading
This will happen at next boot via early microcode - no risk to currently
running processes.

Change-Id: I88553fa9a1350ebb80aaf978e29e8f1156783a2c
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1469
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-11 21:33:05 +00:00
q3k 712a5dc3e3 cluster: add bc01n05.hswaw.net
This will be our postgres pet machine.

Change-Id: Ifff6648394ca6407fb5b5daa853f4abc42541703
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1467
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-04 22:26:46 +00:00
q3k 3a9562ecfd cluster: k0: remove native ceph
After installing HBJ11s and spreading out the mons we're going full
Rook.

Change-Id: Ia00cbe953548f06cf27343371fc67890619c8262
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1466
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-04 22:26:39 +00:00
q3k ef3aab6a14 k0: host os bump wip
This bumps it on bc01n01, but nowhere else yet.

We have to vendor some more kubelet bits unfortunately.

Change-Id: Ifb169dd9c2c19d60f88d946d065d4446141601b1
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1465
Reviewed-by: implr <implr@hackerspace.pl>
2023-03-04 22:26:14 +00:00
implr 0156ab24ca cluster/kube/k0: remove implr-spark bucket, add implr bucket
The spark one was an abandoned experiment from years ago, and I could
use a personal one right now.

Change-Id: I78a706c3371d441b2f8460fd796d0cfd9a198cc6
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1464
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-02-26 16:41:23 +00:00
implr 0173f501d7 cockroach: v20.2 -> v21.1
Following https://www.cockroachlabs.com/docs/v21.1/upgrade-cockroach-version?filters=linux
--logtostderr is deprecated/removed, but AFAICT from the default config
it will still log there: https://www.cockroachlabs.com/docs/v21.1/configure-logs#default-logging-configuration

Change-Id: I7fb3f835693f955b37de24dc581140ea34b11630
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1461
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-01-30 21:16:42 +00:00
informatic 3b2a2a2ce1 cluster/k0: add paperless to admitomatic config
Change-Id: I54df444cddca8a05febfb96af07b9e2f614639fc
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1453
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-01-05 09:12:18 +00:00
patryk a2bcfeaf0b cluster: bump vm.max_map_count sysctl tunable to a higher value
This is needed for running some memory-intensive workloads, like
ElasticSearch/OpenSearch.

Change-Id: I7b00ec5faca73ec69bdbf1ca41c025d7efeae55c
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1443
Reviewed-by: implr <implr@hackerspace.pl>
2022-12-11 20:28:51 +00:00
q3k d171263d6e k0: remove waw-hdd-yolo-3
This was never used and only caused scary warnings during OSD reboots
due to lack of availability.

Change-Id: I14eacd88855bc56e06f2a61cc2d914d985330852
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1423
Reviewed-by: implr <implr@hackerspace.pl>
2022-11-20 12:28:20 +00:00
implr 4d98cf5ca8 calico: move from etcd to crd
Leaving the CRD definitions as YAML, extracted without modifications
from the original install file - this should make upgrades simpler.

Change-Id: I7211d2711e2af014b36dd887a951abb9e1032eb9
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1179
Reviewed-by: q3k <q3k@hackerspace.pl>
2022-11-19 21:40:34 +00:00
q3k 16842119d1 app/mastodon: deploy
Change-Id: I88c104d1a8d5627355b01a8c48dc235635fca5ed
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1421
Reviewed-by: implr <implr@hackerspace.pl>
2022-11-18 12:15:22 +00:00
q3k ee41e94e0a k0: bump certs
Change-Id: I9d7a48d64de5d1aa82a134a8c22bfc50ba8ad270
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1402
Reviewed-by: informatic <informatic@hackerspace.pl>
2022-10-09 20:22:43 +00:00
q3k 3c31f32307 cluster: bump prodvider certs
Change-Id: Ieefe3c733dd40a94c13a5e1c1648dd43d27c180a
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1386
Reviewed-by: implr <implr@hackerspace.pl>
2022-09-10 15:46:39 +00:00
implr e69e98da47 third_party/py: update rules_python, use pip-compile for requirements
Change-Id: If8309e8e3a4b58142f7479005a9eb4cbb1043cdb
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1324
Reviewed-by: q3k <q3k@hackerspace.pl>
2022-07-05 21:27:31 +00:00
q3k 437b0c335f rook: fix benji
This unforks benji back into upstream. The old fork didn't support a new
authentication method on Ceph, and we don't have multiple clusters
anymore (so we don't need the functionality of the fork).

Change-Id: Ie79313b2321ca2e22ad2874b75a71385af95105f
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1321
Reviewed-by: informatic <informatic@hackerspace.pl>
2022-06-19 11:49:12 +00:00
q3k 55a486ae49 cluster: refactor nix machinery to fit //ops
This is a chonky refactor that gets rid of the previous cluster-centric
defs-* plain nix file setup.

Now, nodes are configured individually in plain nixos modules, and are
provided a view of all other nodes in the 'machines' attribute. Cluster
logic is moved into modules which inspect this array to find other nodes
within the same cluster.

Kubernetes options are not fully clusterified yet (ie., they are still
hardcoded to only provide the 'k0' cluster) but that can be fixed later.
The Ceph machinery is a good example of how that can be done.

The new NixOS configs are zero-diff against prod. While this is done
mostly by keeping the logic, we had to keep a few newly discovered
'bugs' around by adding some temporary options which keep things as they
are. These will be removed in a future CL, which will then introduce a
diff (but no functional changes, hopefully).

We also remove the nix eval from clustercfg as it was not used anymore
(basically since we refactored certs at some point).

Change-Id: Id79772a96249b0e6344046f96f9c2cb481c4e1f4
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1322
Reviewed-by: informatic <informatic@hackerspace.pl>
2022-06-19 11:48:52 +00:00
q3k b0e3693c0e cluster/kube: calico: fix etcd endpoints
Change-Id: Ia93d355ca343fa5a42ec37fbcae9135cb5304f6e
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1285
Reviewed-by: implr <implr@hackerspace.pl>
2022-06-11 19:00:52 +00:00
implr 0544d27c04 tools, cluster/tools: bazel5 compat: remove unused import
Change-Id: I8b264a6c36e4d0f1535f38ad1f41495e62061f26
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1308
Reviewed-by: daz <daz@hackerspace.pl>
2022-06-04 19:56:40 +00:00
q3k d584e76ea3 cluster/clustercfg: fix for nix 2.4
Change-Id: I3f9ebd895495a23ec179ccd237389e8f3e531768
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1284
Reviewed-by: q3k <q3k@hackerspace.pl>
2022-04-04 17:51:44 +00:00
q3k 42c17872fd cluster/certs: bump certs
Change-Id: I549364c050a96f72859886e6b724e07924ee3964
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1282
Reviewed-by: q3k <q3k@hackerspace.pl>
2022-04-04 17:51:44 +00:00
implr 54a34b24a1 cluster/k0: ceph: add tape staging
Change-Id: I7fdba86b15f92157888850d2905440b45fb36f17
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1263
Reviewed-by: q3k <q3k@hackerspace.pl>
2022-03-05 22:45:29 +00:00
patryk d0a0b18e54 cluster: allow namespace admins to access certificate resources
Change-Id: I532dadfe1799da43d12598e388141f8f9a3872de
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1250
Reviewed-by: q3k <q3k@hackerspace.pl>
2022-02-05 15:08:47 +00:00
q3k bdd403c587 cluster: k0: move cockroachdb away from bc01n01, fixup joins
Prompted by a power failure on bc01n0{1,2}, we migrate away from at
least one of them onto another server.

We also fix up the startup join parameter to not include the node itself
(which is not necessary, but a nice thing to have nonetheless).

Since bc01n01 was the initial node of the cluster, we also disable the
init job for k0 (which we don't care about anyway).

Change-Id: I3406471c0f9542e9d802d39138e400b5a5e74794
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1176
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-12-13 22:30:46 +00:00
implr eca1e080d7 calico: restore CNI_NET_DIR
Change-Id: I04e17f8639505f5b7cc42e86392abc175b7922db
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1178
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-12-03 03:10:13 +00:00
implr 12f176c1eb calico 3.14 -> 3.15
Change-Id: I9eceaf26017e483235b97c8d08717d2750fabe25
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/995
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-11-20 22:12:52 +00:00
q3k 0f8e5a2132 *: do not require env.sh
This removes the need to source env.{sh,fish} when working with hscloud.

This is done by:

 1. Implementing a Go library to reliably detect the location of the
    active hscloud checkout. That in turn is enabled by
    BUILD_WORKSPACE_DIRECTORY now being a thing in Bazel.
 2. Creating a tool `hscloud`, with a command `hscloud workspace` that
    returns the workspace path.
 3. Wrapping this tool to be accessible from Python and Bash.
 4. Bumping all users of hscloud_root to use either the Go library or
    one of the two implemented wrappers.

We also drive-by replace tools/install.sh with a proper sh_binary, and
make it yell at people if it isn't being run as `bazel run
//tools:install`.

Finally, we also drive-by delete cluster/tools/nixops.sh which was never used.

Change-Id: I7873714319bfc38bbb930b05baa605c5aa36470a
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1169
Reviewed-by: informatic <informatic@hackerspace.pl>
2021-10-17 21:21:58 +00:00
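
[Editor's note: the workspace detection in point 1 can be sketched as follows. BUILD_WORKSPACE_DIRECTORY is the environment variable Bazel sets for `bazel run` targets; the WORKSPACE-file fallback here is an assumption:]

    package workspace

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // Root returns the hscloud checkout root. Under `bazel run`, Bazel
    // sets BUILD_WORKSPACE_DIRECTORY; otherwise we walk up from the cwd
    // looking for a WORKSPACE file.
    func Root() (string, error) {
        if dir := os.Getenv("BUILD_WORKSPACE_DIRECTORY"); dir != "" {
            return dir, nil
        }
        dir, err := os.Getwd()
        if err != nil {
            return "", err
        }
        for {
            if _, err := os.Stat(filepath.Join(dir, "WORKSPACE")); err == nil {
                return dir, nil
            }
            parent := filepath.Dir(dir)
            if parent == dir {
                return "", fmt.Errorf("not inside an hscloud checkout")
            }
            dir = parent
        }
    }
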
q3k 3b67afe81b cluster/certs: refresh
Change-Id: I2aa8fead4427b917afa4758ea0078125d9c4e914
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1153
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-10-07 19:58:35 +00:00
informatic e839f95079 cluster/kube/k0: add matrix and informatic personal ceph users
Change-Id: Ied8d474709b8053e9fc339435d3ca1ca5fdfa710
2021-09-14 22:21:22 +02:00
q3k 4b8ee32246 cluster/kube: always enable flexdriver
Documentation says [1] this is disabled by default in 1.1, but that
documentation kinda lies [2].

[1] - 235d5a384b/Documentation/flexvolume.md (ceph-flexvolume-configuration)

[2] - 64e28af741 (diff-d1eb5cba50e3770b61ccd3c730cd40514053e1da0233dfe09b5e7967e76a2a6cL424-L425)

Change-Id: Ia92c99e137ed751db62c0f56d42c4901986d0bb8
2021-09-14 21:39:39 +02:00
q3k 38f72fe094 cluster: k0: move ceph-waw3 to proper realm/zonegroup
With this we can use Ceph's multi-site support to easily migrate to our
new k0 Ceph cluster.

This migration was done by using radosgw-admin to rename the existing
realm/zonegroup to the new names (hscloud and eu), and then reworking
the jsonnet so that the Rook operator would effectively do nothing.

It sounds weird that creating a bunch of CRs like
Object{Realm,ZoneGroup,Zone} would be a no-op for the operator,
but that's how Rook works - a CephObjectStore generally creates
everything that the above CRs would create too, but implicitly. Adding
the extra CRs just allows specifying extra settings, like names.

(it wasn't fully a no-op, as the rgw daemon is parametrized by
realm/zonegroup/zone names, so that had to be restarted)

We also make the radosgw serve under object.ceph-eu.hswaw.net, which
allows us to right away start using a zonegroup URL instead of the
zone-only URL.

Change-Id: I4dca55a705edb3bd28e54f50982c85720a17b877
2021-09-14 21:39:39 +02:00
q3k 18084c1e86 cluster/nix: k0: enable rgw on osds
This enables radosgw wherever OSDs are. This should be fast, and it
works for us because we have few OSD hosts.

Change-Id: I4ed014d2790d6c02a2ba8e775aaa1846032dee1e
2021-09-14 21:39:39 +02:00
q3k 085a8ff247 cluster: k0: upgrade to ceph 16.2.5
This was fun. See b/6 for a log of how swimmingly this went.

Change-Id: I96c3c18b5d33ef86523b3506f49a390419e9ca7f
2021-09-14 21:39:39 +02:00
q3k 464fb04f39 cluster: k0: bump rook to 1.6
This is needed to get Rook to talk to an external Ceph 16/Pacific
cluster.

This is mostly a bunch of CRD/RBAC changes. Most notably, we yeet our
own CRD rewrite and just slurp in upstream CRD defs.

Change-Id: I08e7042585722ae4440f97019a5212d6cf733fcc
2021-09-14 21:39:37 +02:00
q3k 6579e842b0 kartongips: paper over^W^Wfix CRD updates
Ceph CRD updates would fail with:

  ERROR Error updating customresourcedefinitions cephclusters.ceph.rook.io: expected kind, but got map

This wasn't just https://github.com/bitnami/kubecfg/issues/259. We pull
in the 'solution' from Pulumi
(https://github.com/pulumi/pulumi-kubernetes/pull/622), which just
retries the update via a JSON update instead, and that seems to have
worked.

We also add some better error return wrapping, which I used to debug
this issue properly.

Oof.

Change-Id: I2007a7857e44128d74760174b61b59efa58e9cbc
2021-09-11 20:54:34 +00:00
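
[Editor's note: the retried-as-JSON workaround amounts to falling back from a typed update to a merge patch. A rough equivalent with client-go's dynamic client - assumed shape, not the actual kartongips code:]

    package kartongips

    import (
        "context"
        "encoding/json"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/dynamic"
    )

    // updateWithFallback tries a typed update and, on failure, re-submits
    // the object as a JSON merge patch, sidestepping the "expected kind,
    // but got map" conversion error.
    func updateWithFallback(ctx context.Context, c dynamic.ResourceInterface, obj *unstructured.Unstructured) error {
        if _, err := c.Update(ctx, obj, metav1.UpdateOptions{}); err == nil {
            return nil
        }
        data, err := json.Marshal(obj.Object)
        if err != nil {
            return err
        }
        _, err = c.Patch(ctx, obj.GetName(), types.MergePatchType, data, metav1.PatchOptions{})
        return err
    }
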
q3k 05c4b5515b cluster/nix: symlink /sbin/lvm
This is needed by the new Rook OSD daemons.

Change-Id: I16eb24332db40a8209e7eb9747a81fa852e5cad9
2021-09-11 20:45:45 +00:00
q3k 9848e7e15f cluster: deploy NixOS-based ceph
First pass at a non-rook-managed Ceph cluster. We call it k0 instead of
ceph-waw4, as we pretty much are sure now that we will always have a
one-kube-cluster-to-one-ceph-cluster correspondence, with different Ceph
pools for different media kinds (if at all).

For now this has one mon and spinning rust OSDs. This can be iterated on
to make it less terrible with time.

See b/6 for more details.

Change-Id: Ie502a232c700af93f33fcad9fa1c57058161aa11
2021-09-11 20:33:24 +00:00
q3k 1dbefed537 Merge "cluster/kube: remove ceph diff against k0 production" 2021-09-11 20:32:57 +00:00
q3k 9f639694ba Merge "kartongips: switch default diff behaviour to subset, nag users" 2021-09-11 20:18:34 +00:00
q3k 29f314b620 Merge "kartongips: implement proper diffing of aggregated ClusterRoles" 2021-09-11 20:18:28 +00:00
q3k 4f0468fa26 cluster/kube: remove ceph diff against k0 production
This now has a zero diff against prod.

location fields in CephCluster.storage.nodes seem to have been removed
from the CRD at some point. Not sure how the CRUSH tree now gets
populated, but whatever, it's been working like this for a while
already. Same for CephObjectStore.gateway.type.

The Rook Operator has been zero-scaled for a while now due to b/6.

Change-Id: I30a836f273f4c1529f60fa9297c96b7aac412f59
2021-09-11 12:43:53 +00:00
q3k 59c8149df4 kartongips: switch default diff behaviour to subset, nag users
Change-Id: I998cdf7e693f6d1ce86c7ea411f47320d72a5906
2021-09-11 12:43:50 +00:00
q3k 72d7574536 kartongips: implement proper diffing of aggregated ClusterRoles
For a while now we've had spurious diffs against Ceph on k0 because of
a ClusterRole with an aggregationRule.

The way these behave is that the config object has an empty rule list,
and instead populates an aggregationRule which combines other existing
ClusterRoles into that ClusterRole. The control plane then populates the
rule field when the object is read/acted on, which caused us to always
see a diff between the configured and live versions of that ClusterRole.

This hacks together a hardcoded fix for this particular behaviour.
Porting kubecfg over to SSA would probably also fix this - but that's
too much work for now.

Change-Id: I357c1417d4023691e5809f1af23f58f364353388
2021-09-11 12:40:18 +00:00
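
[Editor's note: the hardcoded fix can be pictured as a normalization step before diffing - if the configured ClusterRole carries an aggregationRule and no rules, ignore the server-populated rules on the live object. A sketch using typed RBAC objects; kartongips itself works on unstructured objects:]

    package kartongips

    import rbacv1 "k8s.io/api/rbac/v1"

    // normalizeForDiff clears control-plane-populated fields so that a
    // configured aggregated ClusterRole compares clean against the live
    // one.
    func normalizeForDiff(live, cfg *rbacv1.ClusterRole) {
        if cfg.AggregationRule != nil && len(cfg.Rules) == 0 {
            // The control plane fills .rules from the aggregated roles;
            // the config legitimately leaves it empty.
            live.Rules = nil
        }
    }
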
q3k b3c6770f8d ops, cluster: consolidate NixOS provisioning
This moves the diff-and-activate logic from cluster/nix/provision.nix
into ops/{provision,machines}.nix that can be used for both cluster
machines and bgpwtf machines.

The provisioning scripts now live per-NixOS-config, and anything under
ops.machines.$fqdn now has a .passthru.hscloud.provision derivation
which is that script. When ran, it will attempt to deploy onto the
target machine.

There's also a top-level tool at `ops.provision` which builds all
configurations / machines and can be called with the machine name/fqdn
to call the corresponding provisioner script.

clustercfg is changed to use the new provisioning logic.

Change-Id: I258abce9e8e3db42af35af102f32ab7963046353
2021-09-10 23:55:52 +00:00
q3k 432fa30ded cluster/certs: bump ca-kube-prodivider
Redeployed.

Change-Id: I01110433f89df5595de0f9587508104d6091a774
2021-08-29 17:20:59 +00:00
q3k 89a16f4de4 cluster/admitomatic: allow use-regex n-i-c annotation
This annotation is used to permit routes defined by regexes instead of
simple prefix matching. This is used by our synapse deployment for
routing incoming HTTP requests to different Synapse components.

I've stumbled upon this while deploying a new Matrix/Synapse instance.
This hasn't been a problem yet because the existing ingresses for Matrix
deployments predate admitomatic.

Change-Id: I821e58b214450ccf0de22d2585c3b0d11fbe71c0
2021-06-06 12:58:11 +00:00
q3k 7251f2720e Merge changes Ib068109f,I9a00487f,I1861fe7c,I254983e5,I3e2bedca, ...
* changes:
  cluster/identd/ident: update README
  cluster/kube: deploy identd
  cluster/identd: implement
  cluster/identd/kubenat: implement
  cluster/identd/cri: import
  cluster/identd/ident: add TestE2E
  cluster/identd/ident: add Query function
  cluster/identd/ident: add IdentError
  cluster/identd/ident: add basic ident protocol server
  cluster/identd/ident: add basic ident protocol client
2021-05-28 23:08:10 +00:00
q3k 46c3137d36 cluster/identd/ident: update README
Change-Id: Ib068109ff37749207e7b2a18c07f51d3c4ed3fd6
2021-05-26 19:46:13 +00:00
q3k 2414afe3c0 cluster/kube: deploy identd
Change-Id: I9a00487fc4a972ecb0904055dbaaab08221062c1
2021-05-26 19:46:09 +00:00
q3k 044386d638 cluster/identd: implement
This implements the main identd service that will run on our production
hosts. It's comparatively small, as most of the functionality is
implemented in //cluster/identd/ident and //cluster/identd/kubenat.

Change-Id: I1861fe7c93d105faa19a2bafbe9c85fe36502f73
2021-05-26 19:46:06 +00:00
q3k 6b649f8234 cluster/identd/kubenat: implement
This is a library to find pod information for a given TCP 4-tuple.

Change-Id: I254983e579e3aaa04c0c5491851f4af94a3f4249
2021-05-26 19:46:02 +00:00
q3k ae052f0804 cluster/identd/cri: import
This imports the CRI protobuf/gRPC specs. These are pulled from:

    https://raw.githubusercontent.com/kubernetes/cri-api/master/pkg/apis/runtime/v1alpha2/api.proto

Our host containerd does not implement v1, so we go with v1alpha2.

Change-Id: I3e2bedca76edc85eea9b61a8634c92175f0d2a30
2021-05-26 19:45:58 +00:00
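
[Editor's note: talking CRI v1alpha2 to the host containerd looks roughly like this. The sketch uses the upstream k8s.io/cri-api stubs rather than the vendored //cluster/identd/cri ones, and the socket path is an assumption:]

    package main

    import (
        "context"
        "log"
        "time"

        "google.golang.org/grpc"
        runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
    )

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        // Plaintext gRPC is fine over a local UNIX socket.
        conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock",
            grpc.WithInsecure(), grpc.WithBlock())
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        client := runtimeapi.NewRuntimeServiceClient(conn)
        resp, err := client.Version(ctx, &runtimeapi.VersionRequest{})
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("runtime: %s %s", resp.RuntimeName, resp.RuntimeVersion)
    }
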
q3k 3638a3d76a cluster/identd/ident: add TestE2E
Change-Id: I8a95fadf19376de2806cb63897b77e370559392f
2021-05-23 16:27:22 +00:00
q3k 8e603e13e5 cluster/identd/ident: add Query function
This is a high-level wrapper for querying identd, and uses IdentError to
carry errors received from the server.

Change-Id: I6444a67117193b97146ffd1548151cdb234d47b5
2021-05-23 16:27:17 +00:00
q3k 1c2bc12ad0 cluster/identd/ident: add IdentError
This adds a Go error type that can be used to wrap any ErrorResponse.

Change-Id: I57fbd056ac774f4e2ae3bdf85941c1010ada0656
2021-05-23 16:26:59 +00:00
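
[Editor's note: such an error type is small; a sketch with assumed field names:]

    package ident

    // IdentError carries an ident protocol error response (an RFC 1413
    // ERROR-TYPE such as NO-USER or HIDDEN-USER) as a Go error.
    type IdentError struct {
        Type string
    }

    func (e *IdentError) Error() string {
        return "ident error: " + e.Type
    }
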
q3k ce2737f2e7 cluster/identd/ident: add basic ident protocol server
This adds an ident protocol server and tests for it.

Change-Id: I830f85faa7dce4220bd7001635b20e88b4a8b417
2021-05-23 16:26:54 +00:00
q3k d4438d67a2 cluster/identd/ident: add basic ident protocol client
This is the first pass at an ident protocol client. In the end, we want
to implement an ident protocol server for our in-cluster identd, but
starting out with a client helps me get familiar with the protocol,
and will allow the server implementation to be tested against the
client.

Change-Id: Ic37b84577321533bab2f2fbf7fb53409a5defb95
2021-05-23 16:26:50 +00:00
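
[Editor's note: the ident protocol (RFC 1413) is small enough to sketch a client in a few lines - connect to port 113, send the two ports, read one response line. Illustrative only:]

    package main

    import (
        "bufio"
        "fmt"
        "net"
        "time"
    )

    // queryIdent asks identd on host which user owns the connection from
    // remotePort (on the identd host) to localPort (on our side).
    func queryIdent(host string, remotePort, localPort int) (string, error) {
        conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, "113"), 5*time.Second)
        if err != nil {
            return "", err
        }
        defer conn.Close()
        if _, err := fmt.Fprintf(conn, "%d,%d\r\n", remotePort, localPort); err != nil {
            return "", err
        }
        // Reply: "port,port:USERID:<os>:<user>" or "port,port:ERROR:<code>".
        return bufio.NewReader(conn).ReadString('\n')
    }
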
q3k e17f7edde0 cluster/kube: nginx: add Hscloud-Nic-Source-* headers
These can be used by production jobs to get the source port of the
client connecting over HTTP. A followup CR implements just that.

Change-Id: Ic8e29eaf806bb196d8cfcfb604ff66ae4d0d166a
2021-05-22 19:16:39 +00:00
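
[Editor's note: on the consuming side, a production job would read these headers off the request. The exact header names below are an assumption based on the Hscloud-Nic-Source-* prefix:]

    package main

    import "net/http"

    // handler reads the source-connection headers injected by the ingress.
    func handler(w http.ResponseWriter, r *http.Request) {
        srcIP := r.Header.Get("Hscloud-Nic-Source-Ip")
        srcPort := r.Header.Get("Hscloud-Nic-Source-Port")
        _, _ = srcIP, srcPort // e.g. hand off to an identd lookup
    }
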
q3k ba2f4d8215 cluster/prodvider: deploy
Change-Id: I01d931a664e4b09c0d75fb01fb3f2528bc0f1a53
2021-05-19 22:13:26 +00:00
q3k 02e1598eb3 cluster/prodvider: emit crdb certs
This emits short-lived user credentials for a `dev-user` in crdb-waw1
any time someone prodaccesses.

Change-Id: I0266a05c1f02225d762cfd2ca61976af0658639d
2021-05-19 22:13:22 +00:00
q3k bade46d45f go/pki: fix error return
DeveloperCredentialsLocation used to call glog.Exitf instead of returning
an error, and a consumer (prodaccess) didn't check the return code.
Bad refactor?

Change-Id: I6c2d05966ba6b3eb300c24a51584ccf5e324cd49
2021-05-19 22:12:08 +00:00
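
[Editor's note: the shape of the fix, sketched - the function returns an error for the caller to check instead of exiting the process from inside a library. The path here is illustrative:]

    package pki

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // Before (sketch): the library killed the process on failure via
    // glog.Exitf. After: the caller decides what a failure means.
    func DeveloperCredentialsLocation() (string, error) {
        home, err := os.UserHomeDir()
        if err != nil {
            return "", fmt.Errorf("UserHomeDir: %w", err)
        }
        return filepath.Join(home, ".local", "share", "hscloud"), nil
    }
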
q3k 5ae5cbec81 Merge "cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k" 2021-05-19 15:34:45 +00:00
q3k 99b91b11f1 cluster/k0/admitomatic: add .hswaw.net to hswaw-prod namespace
This was preventing certificate refresh in the hswaw-prod mirko ingress.

Change-Id: I14b18b642a3948a9864e2d9a90b2a2b2c145b9b1
2021-03-28 17:34:34 +00:00
q3k 7967ca177b cluster/certs: update k0 certs
This leaves us with the next set of expiring certs in September 2021.

Fixes b/36.

Change-Id: I536497626c0dd3807fccf28d4b61e5e531cf8d9c
2021-03-27 12:19:25 +00:00
q3k 41b882d053 cluster: remove bc01n03 certs/secrets
Decomissioned node, noticed while rolling over certs in b/36.

Change-Id: Ia386ff846998c52799662179c325b24e78f2eca8
2021-03-27 12:18:56 +00:00
q3k 2e8d24b84a cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k
This fixes CVE-2021-3450 and CVE-2021-3449.

Deployed on prod:

$ kubectl -n nginx-system exec nginx-ingress-controller-5c69c5cb59-2f8v4 -- openssl version
OpenSSL 1.1.1k  25 Mar 2021

Change-Id: I7115fd2367cca7b687c555deb2134b22d19a291a
2021-03-25 18:16:13 +00:00
q3k bf266c6aaf cluster/k0: add dns crdb user
In preparation for running PowerDNS on k0.

Change-Id: I853c7465a6a32d02628fa6cfdeb445eb9937b3be
2021-03-17 21:49:00 +00:00
q3k 3b8935378a cluster/crdb: make init job 'idempotent'
This enables its redeployment with a newer crdb image.

Change-Id: If039992674f401af53738c80d22cc2ca2818fe00
2021-03-17 21:48:30 +00:00
q3k 64de7afe32 cluster/kube/k0: fix syntax errors
This happened in 793ca1b3 and slipped past review.

Change-Id: Ie31f0e1ec03d6e4545d6683b21f528550bf4ef9f
2021-03-17 21:47:51 +00:00
q3k 793ca1b3b2 cluster/kube: limit OSDs in ceph-waw3 to 8GB RAM
Each OSD is connected to a 6TB drive, and with the good ol' 1TB storage
-> 1GB RAM rule of thumb for OSDs, we end up with 6GB. Or, to round up,
8GB.

I'm doing this because over the past few weeks OSDs in ceph-waw3 have
been using a _ton_ of RAM. This will probably not prevent that (and
instead they will OOM more often :/), but at least it will prevent us from
wasting resources (k0 started migrating pods to other nodes, and running
full nodes like that without an underlying request makes for a terrible
draining experience).

We need to get to the bottom of why this is happening in the first
place, though. Did this happen as we moved to containerd?

Followup: b.hswaw.net/29

Already deployed to production.

Change-Id: I98df63763c35017eb77595db7b9f2cce71756ed1
2021-03-07 00:09:58 +00:00
q3k 3ba5c1b591 *: docs pass
Change-Id: I87ca80d3f7728ed407071468ac233e6ad4574929
2021-03-06 22:21:28 +00:00
q3k bc0d3cb227 hackdoc: link to cs instead of gitweb
Change-Id: Ifca7a63517bceffe7ccc0452474d9d16626486de
2021-03-06 22:16:54 +00:00
q3k 0d26fc9780 cluster: disable nginx/acme
These are unused.

Change-Id: I2a428dabd0a27c060c595f5e0843d7d8d8e26dcd
2021-02-15 22:14:41 +01:00
q3k 765e369255 cluster: replace docker with containerd
This removes Docker and docker-shim from our production kubernetes, and
moves over to containerd/CRI. Docker support within Kubernetes was
always slightly shitty, and with 1.20 the integration was dropped
entirely. CRI/Containerd/runc is pretty much the new standard.

Change-Id: I98c89d5433f221b5fe766fcbef261fd72db530fe
2021-02-15 22:14:15 +01:00
q3k 4b613303b1 RFC: *: move away from rules_nixpkgs
This is an attempt to see how well we do without rules_nixpkgs.

rules_nixpkgs has the following problems:

 - complicates our build system significantly (generated external
   repository indirection for picking local/nix python and go)
 - creates builds that cannot run on production (as they are tainted by
   /nix/store libraries)
 - is not a full solution to the bazel hermeticity problem anyway, and
   we'll have to tackle that some other way (eg. by introducing proper
   C++ cross-compilation toolchains and building everything from C,
   including Python and Go)

Instead of rules_nixpkgs, we ship a shell.nix file, so NixOS users can
just:

  jane@hacker:~/hscloud $ nix-shell
  hscloud-build-chrootenv:jane@hacker:~/hscloud$ prodaccess

This shell.nix is in a way nicer, as it gives you all the tools
needed to access production straight away.

Change-Id: Ieceb5ae0fb4d32e87301e5c99416379cedc900c5
2021-02-15 22:11:35 +01:00
q3k 4842705406 cluster/nix: integrate with readtree
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:

   nix-build -A cluster.nix.provision

   result/bin/provision

Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
2021-02-14 14:46:07 +00:00
q3k 225a5c7ee9 nixpkgs: bump
Fixes b/3.

Change-Id: I2f734422cdad00f78956477815c4aea645c6c49e
2021-02-14 14:43:07 +00:00