Cluster Admin Docs
==================

Current cluster: `k0.hswaw.net`

Persistent Storage (waw3)
-------------------------

HDDs on dcr01s2{2,4}. 40TB total capacity for now. Use this.

The following storage classes use this cluster:

- `waw-hdd-yolo-3` - 1 replica
- `waw-hdd-redundant-3` - 2 replicas
- `waw-hdd-redundant-3-object` - 2 replicas, object store
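
For example, a PVC using one of these classes could look like the following
(the namespace, name and size are made up; adjust them to your application):

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: my-data
      namespace: my-app
    spec:
      storageClassName: waw-hdd-redundant-3
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
    EOF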

Rados Gateway (S3) is available at https://object.ceph-waw3.hswaw.net/. To
create a user, ask an admin.

PersistentVolumes currently bound to PVCs get automatically backed up (hourly
for the next 48 hours, then once every 4 weeks, then once every month for a
year).

Administration
==============

Provisioning nodes
------------------

- bring up a new node with NixOS; the configuration doesn't matter and will be
  nuked anyway
|
2020-02-15 00:58:47 +01:00
|
|
|
- edit cluster/nix/defs-machines.nix
- `bazel run //cluster/clustercfg nodestrap bc01nXX.hswaw.net`
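
Once `nodestrap` finishes, the new node should eventually register itself with
Kubernetes. A quick sanity check (assuming your kubeconfig points at k0):

    kubectl get nodes -o wide | grep bc01nXX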

Applying kubecfg state
----------------------

First, decrypt/sync all secrets:

    secretstore sync cluster/secrets/

Then, run kubecfg. There are multiple top-level 'view' files that you can run,
all located in `//cluster/kube`. All of them use `k0.libsonnet` as the master
state of the Kubernetes configuration; they just expose subsets of it to work
around the fact that kubecfg gets somewhat slow with a lot of resources.

- `k0.jsonnet`: everything that is defined for k0 in `//cluster/kube/...`.
- `k0-core.jsonnet`: definitions that are in common across all clusters
  (networking, registry, etc.), without Rook.
- `k0-registry.jsonnet`: just the docker registry on k0 (useful when changing
  ACLs).
- `k0-ceph.jsonnet`: everything ceph/rook related on k0.

When in doubt, run `k0.jsonnet`. There's no harm in doing so; it might just be
slow. Running an individual file without realizing that your change also
influenced something rendered in another file can cause production
inconsistencies.
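
As an example, to preview and then apply the full k0 state, something along
these lines should work (a sketch: it assumes kubecfg is on your PATH and that
your current kubeconfig context points at k0; the exact paths/flags used
day-to-day may differ):

    kubecfg diff cluster/kube/k0.jsonnet
    kubecfg update cluster/kube/k0.jsonnet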

Feel free to add more view files for typical administrative tasks.

Ceph - Debugging
-----------------

We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system`
namespace. To debug Ceph issues, start by looking at its logs.
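
For example, to tail the operator logs (assuming the standard Rook deployment
name, `rook-ceph-operator`):

    kubectl -n ceph-rook-system logs --tail=100 deploy/rook-ceph-operator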

A dashboard is available at https://ceph-waw2.hswaw.net/ and
https://ceph-waw3.hswaw.net. To get the admin password, run:

    kubectl -n ceph-waw2 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo
    kubectl -n ceph-waw3 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo

Ceph - Backups
--------------

Kubernetes PVs backed by Ceph RBDs get backed up using Benji. An hourly cronjob
runs in every Ceph cluster. You can also manually trigger a run by doing:

    kubectl -n ceph-waw2 create job --from=cronjob/ceph-waw2-benji ceph-waw2-benji-manual-$(date +%s)
    kubectl -n ceph-waw3 create job --from=cronjob/ceph-waw3-benji ceph-waw3-benji-manual-$(date +%s)
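
To check on such a manually triggered run (substitute the timestamp of the job
you just created):

    kubectl -n ceph-waw3 get jobs | grep benji-manual
    kubectl -n ceph-waw3 logs -f job/ceph-waw3-benji-manual-<timestamp>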

Ceph ObjectStorage pools (RADOSGW) are _not_ backed up yet!

Ceph - Object Storage
---------------------

To create an object store user, consult the rook.io manual
(https://rook.io/docs/rook/v0.9/ceph-object-store-user-crd.html).
The user authentication secret is generated in the Ceph cluster namespace
(`ceph-waw{2,3}`) and thus may need to be manually copied into the application
namespace (see the comment in `app/registry/prod.jsonnet`).
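
A rough sketch of copying such a secret into an application namespace, assuming
jq is available (Rook names object user secrets
`rook-ceph-object-user-<store>-<user>`; the store, user and target namespace
below are examples):

    kubectl -n ceph-waw3 get secret rook-ceph-object-user-waw-hdd-redundant-3-object-myuser -o json \
        | jq '{apiVersion, kind, type, data, metadata: {name: .metadata.name, namespace: "my-app"}}' \
        | kubectl apply -f -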

`tools/rook-s3cmd-config` can be used to generate a test configuration file for
s3cmd. Remember to append `:default-placement` to your region name (i.e.
`waw-hdd-redundant-3-object:default-placement`).
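
For example, creating a bucket with s3cmd using such a config (the config path
and bucket name are made up):

    s3cmd -c path/to/rook-s3cmd.cfg \
        --bucket-location=waw-hdd-redundant-3-object:default-placement \
        mb s3://my-test-bucket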