HSCloud Clusters
================

Current cluster: `k0.hswaw.net`

Accessing via kubectl
---------------------

There isn't yet a service for getting short-term user certificates. Instead, you'll have to get admin certificates:

    clustercfg admincreds $(whoami)-admin
    kubectl get nodes

Provisioning nodes
------------------

 - bring up a new node with NixOS, running the configuration.nix from bootstrap (to be documented)
 - `clustercfg nodestrap bc01nXX.hswaw.net`

That's it!
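
For illustration, a complete provisioning run for a hypothetical node might look as follows (the node name is a placeholder; the final `kubectl get nodes` merely confirms that the node registered with the cluster):

    clustercfg nodestrap bc01nXX.hswaw.net
    kubectl get nodes | grep bc01nXX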

Ceph
====

We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system` namespace. To debug Ceph issues, start by looking at its logs.
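
For example, assuming Rook's standard operator deployment name (`rook-ceph-operator`, an assumption here rather than something verified against this cluster), the operator logs can be tailed with:

    kubectl -n ceph-rook-system logs deploy/rook-ceph-operator --tail=100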

The following Ceph clusters are available:

ceph-waw1
---------

HDDs on bc01n0{1-3}. 3TB total capacity.

The following storage classes use this cluster:

 - `waw-hdd-redundant-1` - erasure coded 2.1 (2 data, 1 coding chunks; usage sketch below)
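
As a usage sketch (the namespace, claim name, and size below are illustrative assumptions, not anything deployed), a PersistentVolumeClaim against this class could look like:

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-pvc
      namespace: default
    spec:
      storageClassName: waw-hdd-redundant-1
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
    EOF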

A dashboard is available at https://ceph-waw1.hswaw.net/. To get the admin password, run:

    kubectl -n ceph-waw1 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo
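
Equivalently, a jsonpath-based variant (a sketch assuming the same secret field name) avoids the grep/awk step:

    kubectl -n ceph-waw1 get secret rook-ceph-dashboard-password -o jsonpath='{.data.password}' | base64 --decode ; echo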

Known Issues
============

After running `nixos-rebuild switch` on a host, the shared host/container CNI plugin directory gets nuked, and pods will fail to schedule on that node (TODO(q3k): error message here). To fix this, restart the calico-node pod running on the affected node. The Calico Node pod will reschedule automatically and repopulate the CNI plugin directory.

    kubectl -n kube-system get pods -o wide | grep calico-node
    kubectl -n kube-system delete pod calico-node-XXXX
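
To confirm the directory came back, it can be listed on the affected host over SSH (the `/opt/cni/bin` path is the usual CNI plugin location and an assumption here; the node name is a placeholder):

    ssh bc01nXX.hswaw.net ls /opt/cni/bin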