hscloud/cluster
Sergiusz Bazanski d186e9468d cluster: move prodvider to kubernetes.default.svc.k0.hswaw.net
In https://gerrit.hackerspace.pl/c/hscloud/+/70 we accidentally
introduced a split-horizon DNS situation:

 - k0.hswaw.net from the Internet resolves to nodes running the k8s API
   servers, and as such can serve API server traffic
 - k0.hswaw.net from the cluster returned no results

This broke prodvider in two ways:
 - it dialed the API servers at k0.hswaw.net
 - even after the endpoint was moved to
   kubernetes.default.svc.k0.hswaw.net, the apiserver cert didn't cover
   that

Thus, we not only had to change the prodvider endpoint, but also had
to change the apiserver certs to cover this new name.

I'm not sure this is the right fix long-term. I think at some point we
should start referring to in-cluster services only via their full (or
cluster.local) names, but right now k0.hswaw.net is an exception to
that and thus split-horizon, and we have no way to access the internal
services from the outside just yet.

However, getting prodvider to work is important enough that this fix is
IMO good enough for now.

Change-Id: I13d0681208c66f4060acecc78b7ae14b8f8d7125
2019-10-04 13:52:34 +02:00

HSCloud Clusters
================

Current cluster: `k0.hswaw.net`

Accessing via kubectl
---------------------

    prodaccess # get a short-lived certificate for your use via SSO
    kubectl version
    kubectl top nodes

Every user gets a `personal-$username` namespace. Feel free to use it for your own purposes, but watch out for resource usage!
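
For example, a minimal sketch (assuming your local username matches your SSO username):

    kubectl -n personal-$(whoami) get pods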

Persistent Storage
------------------

HDDs on bc01n0{1-3}. 3TB total capacity.

The following storage classes are backed by this cluster (an example claim follows the list):

 - `waw-hdd-paranoid-1` - 3 replicas
 - `waw-hdd-redundant-1` - erasure coded 2+1
 - `waw-hdd-yolo-1` - unreplicated (you _will_ lose your data)
 - `waw-hdd-redundant-1-object` - erasure coded 2+1 object store
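
For instance, a claim against one of these classes might look as follows. This is only a sketch: the claim name, size, and the assumption that you're working in your personal namespace are made up for illustration.

    kubectl -n personal-$(whoami) apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: test-data
    spec:
      storageClassName: waw-hdd-redundant-1
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    EOF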

Rados Gateway (S3) is available at https://object.ceph-waw2.hswaw.net/. To create a user, ask an admin.

PersistentVolumes currently bound to PVCs get automatically backed up (hourly for the next 48 hours, then once every 4 weeks, then once every month for a year).

Administration
==============

Provisioning nodes
------------------

 - bring up a new node with NixOS, running the configuration.nix from bootstrap (to be documented)
 - `bazel run //cluster/clustercfg nodestrap bc01nXX.hswaw.net`

Ceph - Debugging
-----------------

We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system` namespace. To debug Ceph issues, start by looking at its logs.
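
For example, to look at recent operator logs (the `app=rook-ceph-operator` label is the Rook default and is an assumption here):

    kubectl -n ceph-rook-system logs -l app=rook-ceph-operator --tail=100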

A dashboard is available at https://ceph-waw2.hswaw.net/. To get the admin password, run:

    kubectl -n ceph-waw2 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo


Ceph - Backups
--------------

Kubernetes PVs backed by Ceph RBDs get backed up using Benji. An hourly cronjob runs in every Ceph cluster. You can also manually trigger a run by doing:

    kubectl -n ceph-waw2 create job --from=cronjob/ceph-waw2-benji ceph-waw2-benji-manual-$(date +%s)
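
To check on the backup jobs afterwards (the manual job name is whatever you passed to the command above):

    kubectl -n ceph-waw2 get cronjob ceph-waw2-benji
    kubectl -n ceph-waw2 get jobs
    kubectl -n ceph-waw2 logs job/ceph-waw2-benji-manual-<timestamp>   # <timestamp> as used above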

Ceph ObjectStorage pools (RADOSGW) are _not_ backed up yet!

Ceph - Object Storage
---------------------

To create an object store user, consult the rook.io manual
(https://rook.io/docs/rook/v0.9/ceph-object-store-user-crd.html).
The user authentication secret is generated in the Ceph cluster namespace
(`ceph-waw2`), and thus may need to be manually copied into the
application namespace (see the comment in `app/registry/prod.jsonnet`).
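
A minimal user definition might look like this. This is only a sketch: the user name is made up, and the `store` value is assumed to match the object store behind `waw-hdd-redundant-1-object`; see the Rook manual above for the authoritative spec.

    kubectl apply -f - <<EOF
    apiVersion: ceph.rook.io/v1
    kind: CephObjectStoreUser
    metadata:
      name: testuser
      namespace: ceph-waw2
    spec:
      store: waw-hdd-redundant-1-object
      displayName: "Test User"
    EOF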

`tools/rook-s3cmd-config` can be used to generate a test configuration file for s3cmd.
Remember to append `:default-placement` to your region name (e.g. `waw-hdd-redundant-1-object:default-placement`).
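
Once you have a configuration file, basic usage looks like this (the config path and bucket name are arbitrary examples):

    s3cmd -c ~/.s3cfg-waw2 mb s3://my-test-bucket
    s3cmd -c ~/.s3cfg-waw2 put some-file s3://my-test-bucket/
    s3cmd -c ~/.s3cfg-waw2 ls s3://my-test-bucket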