First pass at a non-rook-managed Ceph cluster. We call it k0 instead of
ceph-waw4, as we pretty much are sure now that we will always have a
one-kube-cluster-to-one-ceph-cluster correspondence, with different Ceph
pools for different media kinds (if at all).
For now this has one mon and spinning rust OSDs. This can be iterated on
to make it less terrible with time.
See b/6 for more details.
Change-Id: Ie502a232c700af93f33fcad9fa1c57058161aa11
This moves the diff-and-activate logic from cluster/nix/provision.nix
into ops/{provision,machines}.nix that can be used for both cluster
machines and bgpwtf machines.
The provisioning scripts now live per-NixOS-config, and anything under
ops.machines.$fqdn now has a .passthru.hscloud.provision derivation
which is that script. When ran, it will attempt to deploy onto the
target machine.
There's also a top-level tool at `ops.provision` which builds all
configurations / machines and can be called with the machine name/fqdn
to call the corresponding provisioner script.
clustercfg is changed to use the new provisioning logic.
Change-Id: I258abce9e8e3db42af35af102f32ab7963046353
This removes Docker and docker-shim from our production kubernetes, and
moves over to containerd/CRI. Docker support within Kubernetes was
always slightly shitty, and with 1.20 the integration was dropped
entirely. CRI/Containerd/runc is pretty much the new standard.
Change-Id: I98c89d5433f221b5fe766fcbef261fd72db530fe
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:
nix-build -A cluster.nix.provision
result/bin/provision
Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
This disables DHCP on all k0 nodes. This change has been tentatively
deployed to bc01n01 (which is cordoned off in kube), and I will deploy
it to the rest of k0 machines once merged.
Change-Id: I96253a9d0acedb4512c877c64174992ffdb43d58
We want to be able to scrape controller-manager and scheduler metrics
into Prometheus. For that, each of them needs to:
1) listen on a secure port
2) have authn enabled
With this, any k8s user with the right permissions (and a bearer token
or TLS certificate) can come in and access metrics over a node's public
IP address. Access without a certificate/token gets thrown into the
system:anonymous user, which as no access to any API.
Change-Id: I267680f92f748ba63b6762e6aaba3c417446e50b
This notably fixes the annoying loopback issues that prevented hosts
from accessing externalip services with externalTrafficPolicy: local
from nodes that weren't running the service.
Which means, hopefuly, no more registry pull failures when
nginx-ingress gets misplaced!
Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
- we update NixOS to 20.09pre
- we fix an ACME option that's now required
- we switch from systemd-timesyncd to chrony (as timesyncd took a long
time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
from ceph)
This has been deployed in production.
Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
We change the existing behaviour (copy files & run nixos-rebuild switch)
to something closer to nixops-style. This now means that provisioning
admin machines need Nix installed locally, but that's probably an okay
choice to make.
The upside of this approach is that it's easier to debug and test
derivations, as all data is local to the repo and the workstation, and
deploying just means copying a configuration closure and switching the
system to it. At some point we should even be able to run the entire
cluster within a set of test VMs.
We also bump the kubernetes control plane to 1.14. Kubelets are still at
1.13 and their upgrade is comint up today too.
Change-Id: Ia9832c47f258ee223d93893d27946d1161cc4bbd
This port was leaking kubelet state, including information on running
pods. No secrets were leaked (if they were not text-pasted into
env/args), but this still shouldn't be available.
As far as I can tell, nothing depends on this port, other than some
enterprise load balancers that require HTTP for node 'health' checks.
Change-Id: I9549b73e0168fe3ea4dce43cbe8fdc2ca4575961
Prodaccess/Prodvider allow issuing short-lived certificates for all SSO
users to access the kubernetes cluster.
Currently, all users get a personal-$username namespace in which they
have adminitrative rights. Otherwise, they get no access.
In addition, we define a static CRB to allow some admins access to
everything. In the future, this will be more granular.
We also update relevant documentation.
Change-Id: Ia18594eea8a9e5efbb3e9a25a04a28bbd6a42153