Commit Graph

25 Commits (5cc64bf60e40e8386b164c2e84cd7a641dc793e6)

Author SHA1 Message Date
q3k 05c4b5515b cluster/nix: symlink /sbin/lvm
This is needed by the new Rook OSD daemons.

Change-Id: I16eb24332db40a8209e7eb9747a81fa852e5cad9
2021-09-11 20:45:45 +00:00
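One way such a symlink can be expressed in a NixOS module is via systemd tmpfiles; this is only a sketch, and the lvm2 output attribute is an assumption rather than what cluster/nix actually does:

    # Sketch only; the real cluster/nix module may differ.
    { pkgs, ... }:
    {
      systemd.tmpfiles.rules = [
        # "L+" (re)creates a symlink at boot; lvm2's bin output is assumed here.
        "L+ /sbin/lvm - - - - ${pkgs.lvm2.bin}/bin/lvm"
      ];
    }
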
q3k 9848e7e15f cluster: deploy NixOS-based ceph
First pass at a non-rook-managed Ceph cluster. We call it k0 instead of
ceph-waw4, as we are now pretty much sure that we will always have a
one-kube-cluster-to-one-ceph-cluster correspondence, with different Ceph
pools for different media kinds (if at all).

For now this has one mon and spinning rust OSDs. This can be iterated on
to make it less terrible with time.

See b/6 for more details.

Change-Id: Ie502a232c700af93f33fcad9fa1c57058161aa11
2021-09-11 20:33:24 +00:00
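For a rough idea of what a NixOS-managed mon plus OSDs can look like, here is a sketch loosely based on the stock services.ceph module; the fsid, address and daemon names are placeholders, not the actual k0 configuration:

    { ... }:
    {
      services.ceph = {
        global = {
          fsid = "00000000-0000-0000-0000-000000000000";   # placeholder
          monInitialMembers = "mon-a";
          monHost = "10.0.0.1";                            # placeholder mon address
        };
        mon = { enable = true; daemons = [ "mon-a" ]; };
        osd = { enable = true; daemons = [ "0" "1" ]; };   # one per spinning disk
      };
    }
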
q3k b3c6770f8d ops, cluster: consolidate NixOS provisioning
This moves the diff-and-activate logic from cluster/nix/provision.nix
into ops/{provision,machines}.nix that can be used for both cluster
machines and bgpwtf machines.

The provisioning scripts now live per-NixOS-config, and anything under
ops.machines.$fqdn now has a .passthru.hscloud.provision derivation
which is that script. When run, it will attempt to deploy onto the
target machine.

There's also a top-level tool at `ops.provision` which builds all
configurations / machines and can be called with the machine name/fqdn
to call the corresponding provisioner script.

clustercfg is changed to use the new provisioning logic.

Change-Id: I258abce9e8e3db42af35af102f32ab7963046353
2021-09-10 23:55:52 +00:00
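The per-machine provisioner can be pictured roughly as below; this is an illustrative sketch only, with `machine` standing for an evaluated NixOS configuration and the fqdn made up, not the actual ops/provision.nix code:

    { pkgs, machine }:
    pkgs.writeShellScript "provision-machine.example.fqdn" ''
      set -euo pipefail
      closure=${machine.config.system.build.toplevel}
      # Ship the system closure to the target and activate it.
      nix-copy-closure --to root@machine.example.fqdn "$closure"
      ssh root@machine.example.fqdn "$closure/bin/switch-to-configuration" switch
    ''
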
q3k 0d26fc9780 cluster: disable nginx/acme
These are unused.

Change-Id: I2a428dabd0a27c060c595f5e0843d7d8d8e26dcd
2021-02-15 22:14:41 +01:00
q3k 765e369255 cluster: replace docker with containerd
This removes Docker and docker-shim from our production kubernetes, and
moves over to containerd/CRI. Docker support within Kubernetes was
always slightly shitty, and with 1.20 the integration was dropped
entirely. CRI/Containerd/runc is pretty much the new standard.

Change-Id: I98c89d5433f221b5fe766fcbef261fd72db530fe
2021-02-15 22:14:15 +01:00
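On a NixOS node the switch boils down to enabling containerd and pointing the kubelet at its socket; the sketch below uses the stock NixOS option names and ~1.20-era kubelet flags, which may not match the cluster's own modules:

    { ... }:
    {
      virtualisation.containerd.enable = true;
      # dockershim is gone; the kubelet speaks CRI to containerd directly.
      services.kubernetes.kubelet.extraOpts =
        "--container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock";
    }
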
q3k 4842705406 cluster/nix: integrate with readtree
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:

   nix-build -A cluster.nix.provision

   result/bin/provision

Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
2021-02-14 14:46:07 +00:00
q3k 225a5c7ee9 nixpkgs: bump
Fixes b/3.

Change-Id: I2f734422cdad00f78956477815c4aea645c6c49e
2021-02-14 14:43:07 +00:00
q3k f684535c6e k0: remove bc01n03 from nix defs
This only affects ETCD_INITIAL_* env vars, so it is effectively a no-op.

Deployed to prod.

Change-Id: Ic9118e17b088d1b58ebaf1ac0708a1ee6fcf2c06
2021-01-19 20:20:33 +01:00
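For illustration, this is the kind of etcd environment involved; the peer URLs are placeholders, and removing bc01n03 just means it no longer appears in the initial-cluster list:

    systemd.services.etcd.environment = {
      ETCD_INITIAL_CLUSTER_STATE = "existing";
      ETCD_INITIAL_CLUSTER =
        "bc01n01=https://bc01n01.example:2380,bc01n02=https://bc01n02.example:2380";
    };
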
q3k acdd665b08 cluster: use static addresses
This disables DHCP on all k0 nodes. This change has been tentatively
deployed to bc01n01 (which is cordoned off in kube), and I will deploy
it to the rest of k0 machines once merged.

Change-Id: I96253a9d0acedb4512c877c64174992ffdb43d58
2020-12-14 19:10:52 +01:00
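A minimal sketch of what static addressing looks like on a NixOS node, with a placeholder interface name and documentation-range addresses:

    { ... }:
    {
      networking.useDHCP = false;
      networking.interfaces.enp1s0.ipv4.addresses = [
        { address = "203.0.113.10"; prefixLength = 24; }
      ];
      networking.defaultGateway = "203.0.113.1";
      networking.nameservers = [ "203.0.113.1" ];
    }
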
q3k e77f7717d4 k0: bump to 1.16.5
Change-Id: I548808ce4e0deb0513a1e00963f383d84b9d920c
2020-10-10 22:39:50 +02:00
q3k 1257389d3d k0: expose controller-manager and scheduler metrics
We want to be able to scrape controller-manager and scheduler metrics
into Prometheus. For that, each of them needs to:

 1) listen on a secure port
 2) have authn enabled

With this, any k8s user with the right permissions (and a bearer token
or TLS certificate) can come in and access metrics over a node's public
IP address. Access without a certificate/token gets thrown into the
system:anonymous user, which has no access to any API.

Change-Id: I267680f92f748ba63b6762e6aaba3c417446e50b
2020-10-10 16:00:15 +00:00
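Flag-wise this amounts to something like the following for kube-controller-manager (the scheduler is analogous); the NixOS option name and kubeconfig path below are assumptions made for the sake of the sketch:

    services.kubernetes.controllerManager.extraOpts = toString [
      "--bind-address=0.0.0.0"    # serve metrics on the node address, secure port
      "--authentication-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig"
      "--authorization-kubeconfig=/etc/kubernetes/controller-manager.kubeconfig"
    ];
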
q3k 36224c617a clustercfg: show diff before switching to new configuration
This is mildly hacky, but lets us be more informed before we switch to a
new configuration.

Change-Id: I008f3f698db702f1e0992bd41a8d1050449d59b5
2020-10-10 16:00:11 +00:00
q3k 2e001e5046 k0: bump to 1.15.4
This notably fixes the annoying loopback issues that prevented hosts
from accessing externalIP services with externalTrafficPolicy: local
from nodes that weren't running the service.

Which means, hopefully, no more registry pull failures when
nginx-ingress gets misplaced!

Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
2020-10-03 16:32:38 +00:00
q3k fbe234bdb2 cluster: rename module-* into modules/*
Change-Id: I65e06f3e9cec2ba0071259eb755eddbbd1025b97
2020-10-03 14:57:30 +00:00
q3k 316411790a cluster/nix: update nodes
 - we update NixOS to 20.09pre
 - we fix an ACME option that's now required
 - we switch from systemd-timesyncd to chrony (as timesyncd took a long
   time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
   from ceph)

This has been deployed in production.

Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
2020-08-23 00:58:29 +02:00
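Roughly, the node-level changes listed above look like this; security.acme.acceptTerms is assumed to be the newly required ACME option, and the email is a placeholder:

    { ... }:
    {
      security.acme.acceptTerms = true;
      security.acme.email = "admin@example.org";

      # chrony instead of systemd-timesyncd, to keep Ceph mons from drifting.
      services.timesyncd.enable = false;
      services.chrony.enable = true;
    }
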
q3k d5918c8e72 cluster: change q3k's laptop key
Paranoia is dead, long live Mimeomia.

This has already been deployed to production.

Change-Id: Ibbc5015b5277380a3450f76e62d3fab6e71be1a0
2020-08-22 22:29:42 +02:00
q3k d7364520e9 cluster: bump kubelets to 1.14.3
Change-Id: I02ed978a49629cdfc3f3587ad640e8cc5a5fad23
2020-02-02 23:43:28 +01:00
q3k e2095b2ce9 cluster: remove unused module-cluster.nix
Change-Id: I819d803fc7454cfd63a11a109ec73c9578f598b8
2020-02-02 23:43:00 +01:00
q3k c78cc13528 cluster/nix: locally build nixos derivations
We change the existing behaviour (copy files & run nixos-rebuild switch)
to something closer to nixops-style. This now means that admin machines
used for provisioning need Nix installed locally, but that's probably an
okay choice to make.

The upside of this approach is that it's easier to debug and test
derivations, as all data is local to the repo and the workstation, and
deploying just means copying a configuration closure and switching the
system to it. At some point we should even be able to run the entire
cluster within a set of test VMs.

We also bump the kubernetes control plane to 1.14. Kubelets are still at
1.13 and their upgrade is coming up today too.

Change-Id: Ia9832c47f258ee223d93893d27946d1161cc4bbd
2020-02-02 22:31:53 +01:00
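The local-build side of this flow can be sketched as evaluating a machine's NixOS configuration into a system closure that nix-build can produce before anything touches the node; the path below is hypothetical:

    let
      machine = import <nixpkgs/nixos> {
        configuration = ./cluster/nix/example-node.nix;  # hypothetical path
      };
    in
      # nix-build on this attribute yields the closure that later gets copied
      # to the node and activated with switch-to-configuration.
      machine.config.system.build.toplevel
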
q3k 050af01b83 cluster: add q3k's new SSH key
Change-Id: I872a75cc89a62c9487433fa5e8e5767953e309c9
2019-12-17 01:58:58 +01:00
q3k d493ab66ca *: add dcr01s{22,24}
Change-Id: I072e825e2e1d199d9da50b9d38a9ffba68e61182
2019-10-31 17:07:50 +01:00
q3k 42553cd044 cluster: disable unauthenticated read only port on kubelets
This port was leaking kubelet state, including information on running
pods. No secrets were leaked (if they were not text-pasted into
env/args), but this still shouldn't be available.

As far as I can tell, nothing depends on this port, other than some
enterprise load balancers that require HTTP for node 'health' checks.

Change-Id: I9549b73e0168fe3ea4dce43cbe8fdc2ca4575961
2019-09-02 16:33:02 +02:00
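The knob in question is the kubelet's read-only port (10255 by default), which can be switched off with --read-only-port=0; sketched here against the stock NixOS kubelet options, which may differ from the cluster's own modules:

    services.kubernetes.kubelet.extraOpts = "--read-only-port=0";
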
q3k b13b7ffcdb prod{access,vider}: implement
Prodaccess/Prodvider allow issuing short-lived certificates for all SSO
users to access the kubernetes cluster.

Currently, all users get a personal-$username namespace in which they
have administrative rights. Otherwise, they get no access.

In addition, we define a static CRB to allow some admins access to
everything. In the future, this will be more granular.

We also update relevant documentation.

Change-Id: Ia18594eea8a9e5efbb3e9a25a04a28bbd6a42153
2019-08-30 23:08:18 +02:00
q3k d07861b7df ceph-waw1 -> ceph-waw2
Change-Id: I03d6244b9697a9efc06492114ef90cdb01e17601
2019-08-08 17:49:31 +02:00
q3k 116da981c9 nix/ -> cluster/nix/
These are related to cluster bootstrapping, not generic language
libraries (like go/ and bzl/).

Change-Id: I03a83c64f3e0fa6cb615d36b4e618f5e92d886ec
2019-07-21 15:53:20 +02:00