HSCloud Clusters
================

Current cluster: `k0.hswaw.net`

Accessing via kubectl
---------------------

    prodaccess # get a short-lived certificate for your use via SSO
    kubectl version
    kubectl top nodes

Every user gets a `personal-$username` namespace. Feel free to use it for your own purposes, but watch out for resource usage!
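
For example, to poke around in your personal namespace (assuming your SSO username matches `$USER`; adjust otherwise):

    kubectl -n personal-$USER get all
    kubectl -n personal-$USER run test --image=nginx --restart=Never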

Persistent Storage
------------------

HDDs on bc01n0{1-3}. 3TB total capacity.

The following storage classes are backed by this cluster (an example claim follows the list):

 - `waw-hdd-paranoid-1` - 3 replicas
 - `waw-hdd-redundant-1` - erasure coded 2+1
 - `waw-hdd-yolo-1` - unreplicated (you _will_ lose your data)
 - `waw-hdd-redundant-1-object` - erasure coded 2+1 object store
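
For illustration, a claim against one of these classes could look like this (a sketch only; the claim name, namespace and size are placeholders):

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: my-data
      namespace: personal-$USER
    spec:
      storageClassName: waw-hdd-redundant-1
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
    EOF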

Rados Gateway (S3) is available at https://object.ceph-waw2.hswaw.net/. To create a user, ask an admin.

PersistentVolumes currently bound to PVCs get automatically backed up (hourly for the next 48 hours, then once every 4 weeks, then once every month for a year).
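
To check which volumes are currently bound (and therefore covered by this schedule):

    kubectl get pv | grep Bound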

Administration
==============

Provisioning nodes
------------------

 - bring up a new node with NixOS, running the configuration.nix from bootstrap (to be documented)
 - `bazel run //cluster/clustercfg nodestrap bc01nXX.hswaw.net`

Ceph - Debugging
-----------------

We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system` namespace. To debug Ceph issues, start by looking at its logs.
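
For example, to inspect the operator (the `app=rook-ceph-operator` label is Rook's default; verify with `get pods` if it doesn't match):

    kubectl -n ceph-rook-system get pods
    kubectl -n ceph-rook-system logs -l app=rook-ceph-operator --tail=100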

A dashboard is available at https://ceph-waw2.hswaw.net/. To get the admin password, run:

    kubectl -n ceph-waw2 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo


Ceph - Backups
--------------

Kubernetes PVs backed by Ceph RBDs get backed up using Benji. An hourly cronjob runs in every Ceph cluster. You can also trigger a run manually:

    kubectl -n ceph-waw2 create job --from=cronjob/ceph-waw2-benji ceph-waw2-benji-manual-$(date +%s)
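
The resulting job can be observed like any other Kubernetes job (substitute the timestamp from the name generated above):

    kubectl -n ceph-waw2 get jobs
    kubectl -n ceph-waw2 logs job/ceph-waw2-benji-manual-<timestamp>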

Ceph ObjectStorage pools (RADOSGW) are _not_ backed up yet!

Ceph - Object Storage
---------------------

To create an object store user, consult the rook.io manual (https://rook.io/docs/rook/v0.9/ceph-object-store-user-crd.html).
The user authentication secret is generated in the Ceph cluster namespace (`ceph-waw2`),
and thus may need to be manually copied into the application namespace (see the
comment in `app/registry/prod.jsonnet`).
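
A user resource roughly follows this shape (a sketch based on the Rook v0.9 CRD; the user name is a placeholder, and the `store` value assumes the object store shares the pool's name, so double-check both against the manual above):

    cat <<EOF | kubectl apply -f -
    apiVersion: ceph.rook.io/v1
    kind: CephObjectStoreUser
    metadata:
      name: my-user
      namespace: ceph-waw2
    spec:
      store: waw-hdd-redundant-1-object
      displayName: "My User"
    EOF

Rook then generates the authentication secret (typically named `rook-ceph-object-user-<store>-<user>`) in `ceph-waw2`.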

`tools/rook-s3cmd-config` can be used to generate a test configuration file for s3cmd.
Remember to append `:default-placement` to your region name (e.g. `waw-hdd-redundant-1-object:default-placement`).
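
With the generated configuration, standard s3cmd invocations should work, for example (config path and bucket name are placeholders):

    s3cmd -c rook-s3cmd.cfg --bucket-location=waw-hdd-redundant-1-object:default-placement mb s3://my-test-bucket
    s3cmd -c rook-s3cmd.cfg ls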