Cluster Admin Docs
==================

Current cluster: `k0.hswaw.net`

Persistent Storage (waw3)
-------------------------

HDDs on dcr01s2{2,4}. 40TB total capacity for now. Use this.

The following storage classes use this cluster:

- `waw-hdd-yolo-3` - 1 replica
- `waw-hdd-redundant-3` - 2 replicas
- `waw-hdd-redundant-3-object` - 2 replicas, object store

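For illustration, a PersistentVolumeClaim against one of these classes could
look like the sketch below (the claim name, namespace, and size are
hypothetical):

    # Hypothetical PVC - adjust name, namespace, and size to your workload.
    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-data
      namespace: example
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: waw-hdd-redundant-3
      resources:
        requests:
          storage: 10Gi
    EOF
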
Rados Gateway (S3) is available at https://object.ceph-waw3.hswaw.net/. To
create a user, ask an admin.

PersistentVolumes currently bound to PVCs get automatically backed up (hourly
for the next 48 hours, then once every 4 weeks, then once every month for a
year).

Administration
==============

Provisioning nodes
------------------

- bring up a new node with NixOS; the configuration doesn't matter, as it will
  be nuked anyway
- add the machine to cluster/machines and ops/machines.nix
- generate certs with `bazel run //cluster/clustercfg gencerts`
- deploy using ops (see ops/README.md)

Applying kubecfg state
----------------------

First, decrypt/sync all secrets:

    secretstore sync cluster/secrets/

Then, run kubecfg. There are multiple top-level 'view' files that you can run,
all located in `//cluster/kube`. All of them use `k0.libsonnet` as the master
state of the Kubernetes configuration, but expose subsets of it to work around
the fact that kubecfg gets somewhat slow with a lot of resources.

- `k0.jsonnet`: everything that is defined for k0 in `//cluster/kube/...`.
- `k0-core.jsonnet`: definitions that are common across all clusters
  (networking, registry, etc.), without Rook.
- `k0-registry.jsonnet`: just the Docker registry on k0 (useful when changing
  ACLs).
- `k0-ceph.jsonnet`: everything Ceph/Rook related on k0.

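For example, to preview and then apply the full k0 state (a sketch assuming
kubecfg is invoked directly with its standard `diff`/`update` subcommands;
adjust the path to your checkout):

    # Hypothetical invocation - review the diff before applying.
    kubecfg diff cluster/kube/k0.jsonnet
    kubecfg update cluster/kube/k0.jsonnet
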
When in doubt, run `k0.jsonnet`. There's no harm in doing it; it might just be
slow. Running individual files without realizing that whatever change you
implemented also influenced something that was rendered in another file can
lead to production inconsistencies.

Feel free to add more view files for typical administrative tasks.

Ceph - Debugging
----------------

We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system`
namespace. To debug Ceph issues, start by looking at its logs.

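For example, to tail the operator logs (assuming the operator pod carries
Rook's usual `app=rook-ceph-operator` label; verify with `kubectl get pods`):

    # The label selector below is an assumption, not confirmed by this doc.
    kubectl -n ceph-rook-system logs -l app=rook-ceph-operator --tail=100 -f
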
A dashboard is available at https://ceph-waw2.hswaw.net/ and
https://ceph-waw3.hswaw.net/. To get the admin password, run:

    kubectl -n ceph-waw2 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo
    kubectl -n ceph-waw3 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo

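Equivalently, a jsonpath variant of the same lookup (functionally identical,
just with less piping):

    kubectl -n ceph-waw3 get secret rook-ceph-dashboard-password -o jsonpath='{.data.password}' | base64 --decode; echo
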
Ceph - Backups
--------------

Kubernetes PVs backed by Ceph RBDs get backed up using Benji. An hourly cronjob
runs in every Ceph cluster. You can also manually trigger a run by doing:

    kubectl -n ceph-waw2 create job --from=cronjob/ceph-waw2-benji ceph-waw2-benji-manual-$(date +%s)
    kubectl -n ceph-waw3 create job --from=cronjob/ceph-waw3-benji ceph-waw3-benji-manual-$(date +%s)

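To check on a manually triggered run (a routine kubectl sketch; substitute the
timestamp of the job you just created):

    # List backup jobs, then follow the logs of a specific one.
    kubectl -n ceph-waw3 get jobs
    kubectl -n ceph-waw3 logs job/ceph-waw3-benji-manual-<timestamp>
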
Ceph ObjectStorage pools (RADOSGW) are _not_ backed up yet!

Ceph - Object Storage
---------------------

To create an object store user, consult the rook.io manual
(https://rook.io/docs/rook/v0.9/ceph-object-store-user-crd.html).
The user authentication secret is generated in the Ceph cluster namespace
(`ceph-waw{2,3}`), and thus may need to be manually copied into the
application namespace (see the comment in `app/registry/prod.jsonnet`).

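A minimal sketch of such a user object (the user name, namespace, and store
name below are illustrative; the manual linked above is authoritative for the
schema):

    # Hypothetical CephObjectStoreUser - adjust metadata and spec.store.
    cat <<EOF | kubectl apply -f -
    apiVersion: ceph.rook.io/v1
    kind: CephObjectStoreUser
    metadata:
      name: example-user
      namespace: ceph-waw3
    spec:
      store: waw-hdd-redundant-3-object
      displayName: "Example User"
    EOF
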
`tools/rook-s3cmd-config` can be used to generate a test configuration file
for s3cmd. Remember to append `:default-placement` to your region name (e.g.
`waw-hdd-redundant-3-object:default-placement`).
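
Once generated, the configuration can be smoke-tested with stock s3cmd (the
config path and bucket name below are made up):

    # Hypothetical smoke test of the generated config.
    s3cmd -c /tmp/rook-s3cmd.cfg ls
    s3cmd -c /tmp/rook-s3cmd.cfg mb s3://example-bucket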