cluster: some doc updates

2019-04-02 14:45:17 +02:00 · 2019-04-02 14:45:17 +02:00 · 2fd5861d24
commit 2fd5861d24
parent 5f2dc8530d
1 changed file with 29 additions and 0 deletions
--- a/cluster/README
+++ b/cluster/README
@@ -18,3 +18,32 @@ Provisioning nodes
 - `clustercfg nodestrap bc01nXX.hswaw.net`

 That's it!
+
+Ceph
+====
+
+We run Ceph via Rook. The Rook operator runs in the `ceph-rook-system` namespace; to debug Ceph issues, start by looking at the operator's logs.
+
+The following Ceph clusters are available:
+
+ceph-waw1
+---------
+
+HDDs on bc01n0{1-3}. 3TB total capacity.
+
+The following storage classes use this cluster:
+
+ - `waw-hdd-redundant-1` - erasure-coded 2+1 (2 data chunks, 1 coding chunk)
+
+A dashboard is available at https://ceph-waw1.hswaw.net/. To get the admin password, run:
+
+    kubectl -n ceph-waw1 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo
+
+Known Issues
+============
+
+After running `nixos-rebuild switch` on a host, the shared host/container CNI plugin directory gets wiped, and pods on that node will fail to start (TODO(q3k): error message here). To fix this, delete the calico-node pod running on each affected node. Its DaemonSet will recreate it automatically, and the new pod will restore the CNI plugin directory.
+
+    kubectl -n kube-system get pods -o wide | grep calico-node
+    kubectl -n kube-system delete pod calico-node-XXXX
+
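The dashboard password command in the ceph-waw1 section greps the Secret's YAML for `password:`. A possibly less fragile variant, sketched as a hypothetical `ceph_dashboard_password` function, asks kubectl for the `data.password` field directly via jsonpath output; it assumes the data key is `password`, which is what the grep in the README matches:

```shell
# Hypothetical helper: print the Ceph dashboard admin password for ceph-waw1.
# Reads the Secret's data.password field via jsonpath instead of grepping YAML.
# Secret data is base64-encoded, so decode it before printing.
ceph_dashboard_password() {
  kubectl -n ceph-waw1 get secret rook-ceph-dashboard-password \
    -o jsonpath='{.data.password}' | base64 --decode
  # The decoded password has no trailing newline; add one for the terminal.
  echo
}
```

This produces the same output as the grep/awk pipeline but does not depend on how the YAML happens to be laid out.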
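The Known Issues fix deletes calico-node pods by name after finding them with `get pods -o wide | grep`. A sketch of a hypothetical `restart_calico_node` helper that targets only the affected node, assuming the pods carry the label `k8s-app=calico-node` (as in the upstream Calico manifests) and that the argument is the Kubernetes node name as shown by `kubectl get nodes`:

```shell
# Hypothetical helper: restart the calico-node pod on one affected node.
# Assumes calico-node pods are labelled k8s-app=calico-node (upstream Calico
# default) and that NODE is the Kubernetes node name from `kubectl get nodes`.
restart_calico_node() {
  local node="$1"
  if [ -z "${node}" ]; then
    echo "usage: restart_calico_node NODE" >&2
    return 1
  fi
  # Deleting the pod lets its DaemonSet create a fresh one on the same node,
  # which restores the CNI plugin directory.
  kubectl -n kube-system delete pod -l k8s-app=calico-node \
    --field-selector "spec.nodeName=${node}"
}
```

Usage, with an illustrative node name: `restart_calico_node bc01n01.hswaw.net`. Filtering on `spec.nodeName` avoids copying pod names by hand and leaves calico-node pods on healthy nodes alone.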