Each OSD is connected to a 6TB drive, and with the good ol' 1TB storage
-> 1GB RAM rule of thumb for OSDs, we end up with 6GB. Or, to round up,
8GB.
I'm doing this because over the past few weeks OSDs in ceph-waw3 have
been using a _ton_ of RAM. This will probably not prevent that (and
instead they wil OOM more often :/), but it at will prevent us from
wasting resources (k0 started migrating pods to other nodes, and running
full nodes like that without an underlying request makes for a terrible
draining experience).
We need to get to the bottom of why this is happening in the first
place, though. Did this happen as we moved to containerd?
Followup: b.hswaw.net/29
Already deployed to production.
Change-Id: I98df63763c35017eb77595db7b9f2cce71756ed1
This removes Docker and docker-shim from our production kubernetes, and
moves over to containerd/CRI. Docker support within Kubernetes was
always slightly shitty, and with 1.20 the integration was dropped
entirely. CRI/Containerd/runc is pretty much the new standard.
Change-Id: I98c89d5433f221b5fe766fcbef261fd72db530fe
This is an attempt to see how well we do without rules_nixpkgs.
rules_nixpkgs has the following problems:
- complicates our build system significantly (generated external
repository indirection for picking local/nix python and go)
- creates builds that cannot run on production (as they are tainted by
/nix/store libraries)
- is not a full solution to the bazel hermeticity problem anyway, and
we'll have to tackle that some other way (eg. by introducing proper
C++ cross-compilation toolchains and building everything from C,
including Python and Go)
Instead of rules_nixpkgs, we ship a shell.nix file, so NixOS users can
just:
jane@hacker:~/hscloud $ nix-shell
hscloud-build-chrootenv:jane@hacker:~/hscloud$ prodaccess
This shell.nix is in a way nicer, as it immediately gives you all tools
needed to access production straight away.
Change-Id: Ieceb5ae0fb4d32e87301e5c99416379cedc900c5
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:
nix-build -A cluster.nix.provision
result/bin/provision
Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
This will permit any binding to system:admin-namespaces (eg. personal-*
namespaces, per-namespace extra admin access like matrix-0x3c) the
ability to create and updates ingresses.
Change-Id: I522896ebe290fe982d6fe46b7b1d604d22b4f72c
This turns admitomatic into a self-standing service that can be used as
an admission controller.
I've tested this E2E on a local k3s server, and have some early test
code for that - but that'll land up in a follow up CR, as it first needs
to be cleaned up.
Change-Id: I46da0fc49f9d1a3a1a96700a36deb82e5057249b
This gives us nearly everything required to run the admission
controller. In addition to checking for allowed domains, we also do some
nginx-inress-controller security checks.
Change-Id: Ib187de6d2c06c58bd8c320503d4f850df2ec8abd
This is the beginning of a validating admission controller which we will
use to permit end-users access to manage Ingresses.
This first pass implements an ingressFilter, which is the main structure
through which allowed namespace/dns combinations will be allowed. The
interface is currently via a test, but in the future this will likely be
configured via a command line, or via a serialized protobuf config.
Change-Id: I22dbed633ea8d8e1fa02c2a1598f37f02ea1b309
This change reflects the current production state.
Upgrade was done by going through following versions:
19.1.0 -> 19.2.12 -> 20.1.10 -> 20.2.4
Change-Id: I8b33b8116363f1a918423fd18ba3d1b5c910851c
It reached the stage of being crapped out so much that the OSDs spurious
IOPS killed the performance of disks colocated on the same M610 RAID
controllers. This made etcd _very_ slow, to the point of churning
through re-elections due to timeouts.
etcd/apiserver latencies, observe the difference at ~15:38:
https://object.ceph-waw3.hswaw.net/q3k-personal/4fbe8d4cfc8193cad307d487371b4e44358b931a7494aa88aff50b13fae9983c.png
I moved gerrit/* and matrix/appservice-irc-freenode PVCs to ceph-waw3 by
hand. The rest were non-critical so I removed them, they can be
recovered from benji backups if needed.
Change-Id: Iffbe87aefc06d8324a82b958a579143b7dd9914c
More as-builts. This has already been bumped. Had to coax ceph-waw2 to
upgrade despite the fact that it's horribly broken.
Change-Id: Ia762f5d7d88d6420c2fc25cf199037cbccde0cb3
This is after the monster^Wrook outage of the week two weeks ago caused
by bc01n03 dying.
Plan is to migrate ceph-waw3 to be external, yeet ceph-waw2, and extend
crdb-waw1 to another node.
Change-Id: I133af3b1171fea383b45bf06c51e48a5c40341e4
This disables DHCP on all k0 nodes. This change has been tentatively
deployed to bc01n01 (which is cordoned off in kube), and I will deploy
it to the rest of k0 machines once merged.
Change-Id: I96253a9d0acedb4512c877c64174992ffdb43d58
These tests are broken as they depend on some test data that we
currently don't have in hscloud. They should be fixed ASAP.
Change-Id: I2571c2958cb84e145a7e3a44171685ecf43cf499
This forks bitnami/kubecfg into kartongips. The rationale is that we
want to implement hscloud-specific functionality that wouldn't really be
upstreamable into kubecfg (like secret support, mulit-cluster support).
We forked off from github.com/q3k/kubecfg at commit b6817a94492c561ed61a44eeea2d92dcf2e6b8c0.
Change-Id: If5ba513905e0a86f971576fe7061a471c1d8b398
We want to be able to scrape controller-manager and scheduler metrics
into Prometheus. For that, each of them needs to:
1) listen on a secure port
2) have authn enabled
With this, any k8s user with the right permissions (and a bearer token
or TLS certificate) can come in and access metrics over a node's public
IP address. Access without a certificate/token gets thrown into the
system:anonymous user, which as no access to any API.
Change-Id: I267680f92f748ba63b6762e6aaba3c417446e50b
This notably fixes the annoying loopback issues that prevented hosts
from accessing externalip services with externalTrafficPolicy: local
from nodes that weren't running the service.
Which means, hopefuly, no more registry pull failures when
nginx-ingress gets misplaced!
Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
This has been deployed to k0 nodes.
Current state of cluster certificates:
cluster/certs/ca-etcd.crt
Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-etcdpeer.crt
Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kube.crt
Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kubefront.crt
Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kube-prodvider.cert
Not After : Sep 1 21:30:00 2021 GMT
cluster/certs/etcd-bc01n01.hswaw.net.cert
Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcd-bc01n02.hswaw.net.cert
Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcd-bc01n03.hswaw.net.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-calico.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-dcr01s22.hswaw.net.cert
Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/etcd-dcr01s24.hswaw.net.cert
Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/etcd-kube.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-bc01n01.hswaw.net.cert
Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcdpeer-bc01n02.hswaw.net.cert
Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcdpeer-bc01n03.hswaw.net.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-dcr01s22.hswaw.net.cert
Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/etcdpeer-dcr01s24.hswaw.net.cert
Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/etcd-root.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-apiserver.cert
Not After : Oct 3 15:26:00 2021 GMT
cluster/certs/kube-controllermanager.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kubefront-apiserver.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-bc01n01.hswaw.net.cert
Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/kube-kubelet-bc01n02.hswaw.net.cert
Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/kube-kubelet-bc01n03.hswaw.net.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s22.hswaw.net.cert
Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s24.hswaw.net.cert
Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/kube-proxy.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-scheduler.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-serviceaccounts.cert
Not After : Mar 28 15:15:00 2021 GMT
Change-Id: I94030ce78c10f7e9a0c0257d55145ef629195314