This removes Docker and dockershim from our production Kubernetes, and
moves it over to containerd/CRI. Docker support within Kubernetes was
always slightly shitty, and with 1.20 the integration was deprecated,
with removal to follow. CRI/containerd/runc is pretty much the new standard.
Change-Id: I98c89d5433f221b5fe766fcbef261fd72db530fe
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:
    nix-build -A cluster.nix.provision
    result/bin/provision
Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
This disables DHCP on all k0 nodes. This change has been tentatively
deployed to bc01n01 (which is cordoned off in kube), and I will deploy
it to the rest of the k0 machines once merged.
Change-Id: I96253a9d0acedb4512c877c64174992ffdb43d58
We want to be able to scrape controller-manager and scheduler metrics
into Prometheus. For that, each of them needs to:
1) listen on a secure port
2) have authn enabled
With this, any k8s user with the right permissions (and a bearer token
or TLS certificate) can come in and access metrics over a node's public
IP address. Access without a certificate/token gets thrown into the
system:anonymous user, which has no access to any API.
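
As a rough illustration of what such a scrape looks like from the client
side, the sketch below builds an authenticated request for a metrics
endpoint. The ports 10257 (controller-manager) and 10259 (scheduler) are
the upstream default secure ports and are assumed here, not taken from
this change; the node address and token are hypothetical placeholders.

```python
import urllib.request

# Upstream default secure ports; assumed, not confirmed by this change.
SECURE_PORTS = {"controller-manager": 10257, "scheduler": 10259}


def metrics_request(node_ip: str, component: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS GET for a component's /metrics."""
    port = SECURE_PORTS[component]
    url = f"https://{node_ip}:{port}/metrics"
    # Without this header the request is treated as system:anonymous.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

Actually sending the request (for example with urllib.request.urlopen)
would additionally need an SSL context that trusts the cluster CA.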
Change-Id: I267680f92f748ba63b6762e6aaba3c417446e50b
This notably fixes the annoying loopback issues that prevented nodes
that weren't running a service from reaching its externalIP when the
service uses externalTrafficPolicy: Local.
Which means, hopefully, no more registry pull failures when
nginx-ingress gets misplaced!
Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
- we update NixOS to 20.09pre
- we fix an ACME option that's now required
- we switch from systemd-timesyncd to chrony (as timesyncd took a long
  time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
  from ceph)
This has been deployed in production.
Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
We change the existing behaviour (copy files & run nixos-rebuild switch)
to something closer to nixops-style. This now means that the admin
machines used for provisioning need Nix installed locally, but that's
probably an okay choice to make.
The upside of this approach is that it's easier to debug and test
derivations, as all data is local to the repo and the workstation, and
deploying just means copying a configuration closure and switching the
system to it. At some point we should even be able to run the entire
cluster within a set of test VMs.
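
A minimal sketch of the "copy a closure and switch to it" step, assuming
the closure has already been built locally and that the target is reached
as root over SSH. The function name, host and store path are hypothetical;
this is not the provisioner from this change, only an outline of the idea.

```python
from typing import List


def deploy_commands(host: str, closure: str) -> List[List[str]]:
    """Return the commands that push a built system closure and activate it."""
    target = f"root@{host}"  # assumption: deployments run as root
    return [
        # Copy the closure (and its dependencies) into the target's Nix store.
        ["nix-copy-closure", "--to", target, closure],
        # Activate the copied configuration on the target.
        ["ssh", target, f"{closure}/bin/switch-to-configuration", "switch"],
    ]
```

Each returned list could be passed to subprocess.run(cmd, check=True). A
real deployment would likely also register the closure as the system
profile so that it survives a reboot; that step is omitted here.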
We also bump the kubernetes control plane to 1.14. Kubelets are still at
1.13 and their upgrade is coming up today too.
Change-Id: Ia9832c47f258ee223d93893d27946d1161cc4bbd
This port was leaking kubelet state, including information on running
pods. No secrets were leaked (unless they were pasted as plain text into
env/args), but this still shouldn't be available.
As far as I can tell, nothing depends on this port, other than some
enterprise load balancers that require HTTP for node 'health' checks.
Change-Id: I9549b73e0168fe3ea4dce43cbe8fdc2ca4575961
Prodaccess/Prodvider allow issuing short-lived certificates for all SSO
users to access the Kubernetes cluster.
Currently, all users get a personal-$username namespace in which they
have administrative rights. Otherwise, they get no access.
In addition, we define a static CRB to allow some admins access to
everything. In the future, this will be more granular.
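
For illustration, a per-user grant of this shape could be expressed as a
RoleBinding like the one generated below. This is a sketch under
assumptions: the binding name, the use of the built-in "admin"
ClusterRole, and the SSO username being used directly as the Kubernetes
user name are all hypothetical and not taken from this change.

```python
def personal_namespace_binding(username: str) -> dict:
    """Build a RoleBinding manifest granting a user admin in their namespace."""
    namespace = f"personal-{username}"
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        # Hypothetical binding name; only the namespace pattern is from the text.
        "metadata": {"name": "personal-admin", "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": "admin",  # assumption: built-in user-facing admin role
        },
        "subjects": [
            {"apiGroup": "rbac.authorization.k8s.io", "kind": "User", "name": username}
        ],
    }
```

Because the binding is namespaced, the user gets no rights outside
personal-$username, which matches the "otherwise no access" behaviour.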
We also update relevant documentation.
Change-Id: Ia18594eea8a9e5efbb3e9a25a04a28bbd6a42153