Commit Graph

284 Commits (7f5f2099c5d3e9762345e27bad1c2d69ca6220ff)

Author SHA1 Message Date
implr 12f176c1eb calico 3.14 -> 3.15
Change-Id: I9eceaf26017e483235b97c8d08717d2750fabe25
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/995
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-11-20 22:12:52 +00:00
q3k 0f8e5a2132 *: do not require env.sh
This removes the need to source env.{sh,fish} when working with hscloud.

This is done by:

 1. Implementing a Go library to reliably detect the location of the
    active hscloud checkout (see the sketch after this list). That in
    turn is enabled by BUILD_WORKSPACE_DIRECTORY now being a thing in
    Bazel.
 2. Creating a tool `hscloud`, with a command `hscloud workspace` that
    returns the workspace path.
 3. Wrapping this tool to be accessible from Python and Bash.
 4. Bumping all users of hscloud_root to use either the Go library or
    one of the two implemented wrappers.
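
For illustration, a minimal Go sketch of the detection from point 1
(names assumed; the actual library in hscloud may differ):

    package workspace

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // Get returns the root of the active hscloud checkout. Under
    // `bazel run`, Bazel sets BUILD_WORKSPACE_DIRECTORY to the workspace
    // the target was run from.
    func Get() (string, error) {
        if p := os.Getenv("BUILD_WORKSPACE_DIRECTORY"); p != "" {
            return p, nil
        }
        // Hypothetical fallback for non-Bazel invocations: walk up from
        // the working directory until a WORKSPACE file is found.
        dir, err := os.Getwd()
        if err != nil {
            return "", err
        }
        for {
            if _, err := os.Stat(filepath.Join(dir, "WORKSPACE")); err == nil {
                return dir, nil
            }
            parent := filepath.Dir(dir)
            if parent == dir {
                return "", fmt.Errorf("not in an hscloud checkout")
            }
            dir = parent
        }
    }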

We also drive-by replace tools/install.sh with a proper sh_binary, and
make it yell at people if it isn't being run as `bazel run
//tools:install`.

Finally, we also drive-by delete cluster/tools/nixops.sh which was never used.

Change-Id: I7873714319bfc38bbb930b05baa605c5aa36470a
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1169
Reviewed-by: informatic <informatic@hackerspace.pl>
2021-10-17 21:21:58 +00:00
q3k 3b67afe81b cluster/certs: refresh
Change-Id: I2aa8fead4427b917afa4758ea0078125d9c4e914
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1153
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-10-07 19:58:35 +00:00
informatic e839f95079 cluster/kube/k0: add matrix and informatic personal ceph users
Change-Id: Ied8d474709b8053e9fc339435d3ca1ca5fdfa710
2021-09-14 22:21:22 +02:00
q3k 4b8ee32246 cluster/kube: always enable flexdriver
Documentation says [1] this is disabled by default in 1.1, but that
documentation kinda lies [2].

[1] - 235d5a384b/Documentation/flexvolume.md (ceph-flexvolume-configuration)

[2] - 64e28af741 (diff-d1eb5cba50e3770b61ccd3c730cd40514053e1da0233dfe09b5e7967e76a2a6cL424-L425)

Change-Id: Ia92c99e137ed751db62c0f56d42c4901986d0bb8
2021-09-14 21:39:39 +02:00
q3k 38f72fe094 cluster: k0: move ceph-waw3 to proper realm/zonegroup
With this we can use Ceph's multi-site support to easily migrate to our
new k0 Ceph cluster.

This migration was done by using radosgw-admin to rename the existing
realm/zonegroup to the new names (hscloud and eu), and then reworking
the jsonnet so that the Rook operator would effectively do nothing.

It sounds weird that creating a bunch of CRs like
Object{Realm,ZoneGroup,Zone} would be a no-op for the operator,
but that's how Rook works - a CephObjectStore generally creates
everything that the above CRs would create too, but implicitly. Adding
the extra CRs just allows specifying extra settings, like names.

(it wasn't fully a no-op, as the rgw daemon is parametrized by
realm/zonegroup/zone names, so that had to be restarted)

We also make the radosgw serve under object.ceph-eu.hswaw.net, which
allows us to right away start using a zonegroup URL instead of the
zone-only URL.

Change-Id: I4dca55a705edb3bd28e54f50982c85720a17b877
2021-09-14 21:39:39 +02:00
q3k 18084c1e86 cluster/nix: k0: enable rgw on osds
This enables radosgw wherever OSDs are. This should be fast and works
for us because we have few OSD hosts.

Change-Id: I4ed014d2790d6c02a2ba8e775aaa1846032dee1e
2021-09-14 21:39:39 +02:00
q3k 085a8ff247 cluster: k0: upgrade to ceph 16.2.5
This was fun. See b/6 for a log of how swimmingly this went.

Change-Id: I96c3c18b5d33ef86523b3506f49a390419e9ca7f
2021-09-14 21:39:39 +02:00
q3k 464fb04f39 cluster: k0: bump rook to 1.6
This is needed to get Rook to talk to an external Ceph 16/Pacific
cluster.

This is mostly a bunch of CRD/RBAC changes. Most notably, we yeet our
own CRD rewrite and just slurp in upstream CRD defs.

Change-Id: I08e7042585722ae4440f97019a5212d6cf733fcc
2021-09-14 21:39:37 +02:00
q3k 6579e842b0 kartongips: paper over^W^Wfix CRD updates
Ceph CRD updates would fail with:

  ERROR Error updating customresourcedefinitions cephclusters.ceph.rook.io: expected kind, but got map

This wasn't just https://github.com/bitnami/kubecfg/issues/259 . We pull
in the 'solution' from Pulumi
(https://github.com/pulumi/pulumi-kubernetes/pull/622) which just
retries the update via a JSON update instead, and that seems to have
worked.

We also add some better error return wrapping, which I used to debug
this issue properly.

Oof.
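
As an illustration of that wrapping (illustrative Go only, not the
actual kartongips code):

    package main

    import (
        "errors"
        "fmt"
    )

    var errKind = errors.New("expected kind, but got map")

    // updateCRD stands in for the apiserver call that failed.
    func updateCRD(name string) error {
        return errKind
    }

    func main() {
        if err := updateCRD("cephclusters.ceph.rook.io"); err != nil {
            // Wrapping with %w keeps the cause reachable via errors.Is/As
            // while adding the context that made this bug debuggable.
            err = fmt.Errorf("updating cephclusters.ceph.rook.io: %w", err)
            fmt.Println(err)
            fmt.Println(errors.Is(err, errKind)) // true
        }
    }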

Change-Id: I2007a7857e44128d74760174b61b59efa58e9cbc
2021-09-11 20:54:34 +00:00
q3k 05c4b5515b cluster/nix: symlink /sbin/lvm
This is needed by the new Rook OSD daemons.

Change-Id: I16eb24332db40a8209e7eb9747a81fa852e5cad9
2021-09-11 20:45:45 +00:00
q3k 9848e7e15f cluster: deploy NixOS-based ceph
First pass at a non-rook-managed Ceph cluster. We call it k0 instead of
ceph-waw4, as we are now pretty much sure that we will always have a
one-kube-cluster-to-one-ceph-cluster correspondence, with different Ceph
pools for different media kinds (if at all).

For now this has one mon and spinning rust OSDs. This can be iterated on
to make it less terrible with time.

See b/6 for more details.

Change-Id: Ie502a232c700af93f33fcad9fa1c57058161aa11
2021-09-11 20:33:24 +00:00
q3k 1dbefed537 Merge "cluster/kube: remove ceph diff against k0 production" 2021-09-11 20:32:57 +00:00
q3k 9f639694ba Merge "kartongips: switch default diff behaviour to subset, nag users" 2021-09-11 20:18:34 +00:00
q3k 29f314b620 Merge "kartongips: implement proper diffing of aggregated ClusterRoles" 2021-09-11 20:18:28 +00:00
q3k 4f0468fa26 cluster/kube: remove ceph diff against k0 production
This now has a zero diff against prod.

location fields in CephCluster.storage.nodes seem to have been removed
from the CRD at some point. Not sure how the CRUSH tree now gets
populated, but whatever, it's been working like this for a while
already. Same for CephObjectStore.gateway.type.

The Rook Operator has been zero-scaled for a while now due to b/6.

Change-Id: I30a836f273f4c1529f60fa9297c96b7aac412f59
2021-09-11 12:43:53 +00:00
q3k 59c8149df4 kartongips: switch default diff behaviour to subset, nag users
Change-Id: I998cdf7e693f6d1ce86c7ea411f47320d72a5906
2021-09-11 12:43:50 +00:00
q3k 72d7574536 kartongips: implement proper diffing of aggregated ClusterRoles
For a while now we've had spurious diffs against Ceph on k0 because of
a ClusterRole with an aggregationRule.

The way these behave is that the config object has an empty rule list,
and instead populates an aggregationRule which combines other existing
ClusterRoles into that ClusterRole. The control plane then populates the
rule field when the object is read/acted on, which caused us to always
see a diff between the configuration of that ClusterRole.

This hacks together a hardcoded fix for this particular behaviour.
Porting kubecfg over to SSA would probably also fix this - but that's
too much work for now.
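
The shape of that hardcoded fix, as a sketch (names assumed; the real
kartongips code differs):

    package main

    import "fmt"

    // pruneAggregated drops the control-plane-populated rules field from
    // a live ClusterRole that carries an aggregationRule, so diffing it
    // against the (empty-ruled) config object yields no spurious diff.
    func pruneAggregated(live map[string]interface{}) {
        if _, ok := live["aggregationRule"]; ok {
            delete(live, "rules")
        }
    }

    func main() {
        live := map[string]interface{}{
            "aggregationRule": map[string]interface{}{},
            "rules":           []interface{}{"populated-by-control-plane"},
        }
        pruneAggregated(live)
        fmt.Println(live) // map[aggregationRule:map[]]
    }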

Change-Id: I357c1417d4023691e5809f1af23f58f364353388
2021-09-11 12:40:18 +00:00
q3k b3c6770f8d ops, cluster: consolidate NixOS provisioning
This moves the diff-and-activate logic from cluster/nix/provision.nix
into ops/{provision,machines}.nix, which can be used for both cluster
machines and bgpwtf machines.

The provisioning scripts now live per-NixOS-config, and anything under
ops.machines.$fqdn now has a .passthru.hscloud.provision derivation
which is that script. When run, it will attempt to deploy onto the
target machine.

There's also a top-level tool at `ops.provision` which builds all
configurations / machines and can be called with the machine name/fqdn
to call the corresponding provisioner script.

clustercfg is changed to use the new provisioning logic.

Change-Id: I258abce9e8e3db42af35af102f32ab7963046353
2021-09-10 23:55:52 +00:00
q3k 432fa30ded cluster/certs: bump ca-kube-prodvider
Redeployed.

Change-Id: I01110433f89df5595de0f9587508104d6091a774
2021-08-29 17:20:59 +00:00
q3k 89a16f4de4 cluster/admitomatic: allow use-regex n-i-c annotation
This annotation is used to permit routes defined by regexes instead of
simple prefix matching. This is used by our synapse deployment for
routing incoming HTTP requests to different Synapse components.

I've stumbled upon this while deploying a new Matrix/Synapse instance.
This hasn't yet been a problem because the existing ingresses for Matrix
deployments predate admitomatic.

Change-Id: I821e58b214450ccf0de22d2585c3b0d11fbe71c0
2021-06-06 12:58:11 +00:00
q3k 7251f2720e Merge changes Ib068109f,I9a00487f,I1861fe7c,I254983e5,I3e2bedca, ...
* changes:
  cluster/identd/ident: update README
  cluster/kube: deploy identd
  cluster/identd: implement
  cluster/identd/kubenat: implement
  cluster/identd/cri: import
  cluster/identd/ident: add TestE2E
  cluster/identd/ident: add Query function
  cluster/identd/ident: add IdentError
  cluster/identd/ident: add basic ident protocol server
  cluster/identd/ident: add basic ident protocol client
2021-05-28 23:08:10 +00:00
q3k 46c3137d36 cluster/identd/ident: update README
Change-Id: Ib068109ff37749207e7b2a18c07f51d3c4ed3fd6
2021-05-26 19:46:13 +00:00
q3k 2414afe3c0 cluster/kube: deploy identd
Change-Id: I9a00487fc4a972ecb0904055dbaaab08221062c1
2021-05-26 19:46:09 +00:00
q3k 044386d638 cluster/identd: implement
This implements the main identd service that will run on our production
hosts. It's comparatively small, as most of the functionality is
implemented in //cluster/identd/ident and //cluster/identd/kubenat.

Change-Id: I1861fe7c93d105faa19a2bafbe9c85fe36502f73
2021-05-26 19:46:06 +00:00
q3k 6b649f8234 cluster/identd/kubenat: implement
This is a library to find pod information for a given TCP 4-tuple.
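
A hypothetical shape for such a lookup (the real //cluster/identd/kubenat
API may differ):

    package kubenat

    import (
        "errors"
        "net"
    )

    // Tuple identifies a TCP connection as seen on the host.
    type Tuple struct {
        SrcIP   net.IP
        SrcPort uint16
        DstIP   net.IP
        DstPort uint16
    }

    // PodInfo describes the pod owning a connection.
    type PodInfo struct {
        Namespace string
        Name      string
    }

    // Resolve maps a connection tuple to pod information, eg. by
    // consulting conntrack and the container runtime (stubbed here).
    func Resolve(t Tuple) (*PodInfo, error) {
        return nil, errors.New("not implemented in this sketch")
    }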

Change-Id: I254983e579e3aaa04c0c5491851f4af94a3f4249
2021-05-26 19:46:02 +00:00
q3k ae052f0804 cluster/identd/cri: import
This imports the CRI protobuf/gRPC specs. These are pulled from:

    https://raw.githubusercontent.com/kubernetes/cri-api/master/pkg/apis/runtime/v1alpha2/api.proto

Our host containerd does not implement v1, so we go with v1alpha2.

Change-Id: I3e2bedca76edc85eea9b61a8634c92175f0d2a30
2021-05-26 19:45:58 +00:00
q3k 3638a3d76a cluster/identd/ident: add TestE2E
Change-Id: I8a95fadf19376de2806cb63897b77e370559392f
2021-05-23 16:27:22 +00:00
q3k 8e603e13e5 cluster/identd/ident: add Query function
This is a high-level wrapper for querying identd, and uses IdentError to
carry errors received from the server.

Change-Id: I6444a67117193b97146ffd1548151cdb234d47b5
2021-05-23 16:27:17 +00:00
q3k 1c2bc12ad0 cluster/identd/ident: add IdentError
This adds a Go error type that can be used to wrap any ErrorResponse.
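
A minimal sketch of what such a type can look like (shape assumed):

    package ident

    // ErrorResponse is an ident ERROR reply code, eg. NO-USER.
    type ErrorResponse string

    // IdentError wraps an ErrorResponse as a Go error.
    type IdentError struct {
        Response ErrorResponse
    }

    func (e *IdentError) Error() string {
        return "ident error: " + string(e.Response)
    }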

Change-Id: I57fbd056ac774f4e2ae3bdf85941c1010ada0656
2021-05-23 16:26:59 +00:00
q3k ce2737f2e7 cluster/identd/ident: add basic ident protocol server
This adds an ident protocol server and tests for it.

Change-Id: I830f85faa7dce4220bd7001635b20e88b4a8b417
2021-05-23 16:26:54 +00:00
q3k d4438d67a2 cluster/identd/ident: add basic ident protocol client
This is the first pass at an ident protocol client. In the end, we want
to implement an ident protocol server for our in-cluster identd, but
starting out with a client helps me get familiar with the protocol,
and will allow the server implementation to be tested against the
client.
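
For flavour, a minimal RFC 1413 query in Go (a sketch, not the actual
//cluster/identd/ident client):

    package main

    import (
        "bufio"
        "fmt"
        "net"
    )

    // query asks the identd on host (the remote end of a TCP connection
    // we received) who owns that connection. Per RFC 1413 the request is
    // "<port-on-their-side> , <port-on-our-side>".
    func query(host string, theirPort, ourPort int) (string, error) {
        conn, err := net.Dial("tcp", net.JoinHostPort(host, "113"))
        if err != nil {
            return "", err
        }
        defer conn.Close()
        fmt.Fprintf(conn, "%d, %d\r\n", theirPort, ourPort)
        return bufio.NewReader(conn).ReadString('\n')
    }

    func main() {
        // Example: we run an IRC server on :6667 and got a connection
        // from 192.0.2.1:51034 (addresses hypothetical).
        resp, err := query("192.0.2.1", 51034, 6667)
        if err != nil {
            panic(err)
        }
        fmt.Print(resp)
    }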

Change-Id: Ic37b84577321533bab2f2fbf7fb53409a5defb95
2021-05-23 16:26:50 +00:00
q3k e17f7edde0 cluster/kube: nginx: add Hscloud-Nic-Source-* headers
These can be used by production jobs to get the source port of the
client connecting over HTTP. A followup CR implements just that.
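
A hypothetical consumer in Go; the concrete header name is assumed here
from the Hscloud-Nic-Source-* family in the title:

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func handler(w http.ResponseWriter, r *http.Request) {
        // Header name assumed for illustration - set for us by
        // nginx-ingress-controller per this change.
        port := r.Header.Get("Hscloud-Nic-Source-Port")
        fmt.Fprintf(w, "source port: %s\n", port)
    }

    func main() {
        http.HandleFunc("/", handler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }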

Change-Id: Ic8e29eaf806bb196d8cfcfb604ff66ae4d0d166a
2021-05-22 19:16:39 +00:00
q3k ba2f4d8215 cluster/prodvider: deploy
Change-Id: I01d931a664e4b09c0d75fb01fb3f2528bc0f1a53
2021-05-19 22:13:26 +00:00
q3k 02e1598eb3 cluster/prodvider: emit crdb certs
This emits short-lived user credentials for a `dev-user` in crdb-waw1
any time someone prodaccesses.

Change-Id: I0266a05c1f02225d762cfd2ca61976af0658639d
2021-05-19 22:13:22 +00:00
q3k bade46d45f go/pki: fix error return
DeveloperCredentialsLocation used to call glog.Exitf instead of
returning an error, and a consumer (prodaccess) did not check the return code.
Bad refactor?
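
The shape of the fix, as a hedged sketch (signature and path assumed,
not the actual go/pki code):

    package pki

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // DeveloperCredentialsLocation returns where developer credentials
    // live, returning an error instead of glog.Exitf-ing, so callers
    // like prodaccess can handle the failure themselves.
    func DeveloperCredentialsLocation() (string, error) {
        home, err := os.UserHomeDir()
        if err != nil {
            return "", fmt.Errorf("getting home directory: %w", err)
        }
        // Path assumed for illustration.
        return filepath.Join(home, ".hscloud"), nil
    }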

Change-Id: I6c2d05966ba6b3eb300c24a51584ccf5e324cd49
2021-05-19 22:12:08 +00:00
q3k 5ae5cbec81 Merge "cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k" 2021-05-19 15:34:45 +00:00
q3k 99b91b11f1 cluster/k0/admitomatic: add .hswaw.net to hswaw-prod namespace
This was preventing certificate refresh in the hswaw-prod mirko ingress.

Change-Id: I14b18b642a3948a9864e2d9a90b2a2b2c145b9b1
2021-03-28 17:34:34 +00:00
q3k 7967ca177b cluster/certs: update k0 certs
This leaves us with the next set of expiring certs in September 2021.

Fixes b/36.

Change-Id: I536497626c0dd3807fccf28d4b61e5e531cf8d9c
2021-03-27 12:19:25 +00:00
q3k 41b882d053 cluster: remove bc01n03 certs/secrets
Decommissioned node, noticed while rolling over certs in b/36.

Change-Id: Ia386ff846998c52799662179c325b24e78f2eca8
2021-03-27 12:18:56 +00:00
q3k 2e8d24b84a cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k
This fixes CVE-2021-3450 and CVE-2021-3449.

Deployed on prod:

$ kubectl -n nginx-system exec nginx-ingress-controller-5c69c5cb59-2f8v4 -- openssl version
OpenSSL 1.1.1k  25 Mar 2021

Change-Id: I7115fd2367cca7b687c555deb2134b22d19a291a
2021-03-25 18:16:13 +00:00
q3k bf266c6aaf cluster/k0: add dns crdb user
In preparation for running PowerDNS on k0.

Change-Id: I853c7465a6a32d02628fa6cfdeb445eb9937b3be
2021-03-17 21:49:00 +00:00
q3k 3b8935378a cluster/crdb: make init job 'idempotent'
This enables its redeployment with a newer crdb image.

Change-Id: If039992674f401af53738c80d22cc2ca2818fe00
2021-03-17 21:48:30 +00:00
q3k 64de7afe32 cluster/kube/k0: fix syntax errors
This happened in 793ca1b3 and slipped past review.

Change-Id: Ie31f0e1ec03d6e4545d6683b21f528550bf4ef9f
2021-03-17 21:47:51 +00:00
q3k 793ca1b3b2 cluster/kube: limit OSDs in ceph-waw3 to 8GB RAM
Each OSD is connected to a 6TB drive, and with the good ol' 1TB storage
-> 1GB RAM rule of thumb for OSDs, we end up with 6GB. Or, to round up,
8GB.

I'm doing this because over the past few weeks OSDs in ceph-waw3 have
been using a _ton_ of RAM. This will probably not prevent that (and
instead they will OOM more often :/), but it will at least prevent us from
wasting resources (k0 started migrating pods to other nodes, and running
full nodes like that without an underlying request makes for a terrible
draining experience).

We need to get to the bottom of why this is happening in the first
place, though. Did this happen as we moved to containerd?

Followup: b.hswaw.net/29

Already deployed to production.

Change-Id: I98df63763c35017eb77595db7b9f2cce71756ed1
2021-03-07 00:09:58 +00:00
q3k 3ba5c1b591 *: docs pass
Change-Id: I87ca80d3f7728ed407071468ac233e6ad4574929
2021-03-06 22:21:28 +00:00
q3k bc0d3cb227 hackdoc: link to cs instead of gitweb
Change-Id: Ifca7a63517bceffe7ccc0452474d9d16626486de
2021-03-06 22:16:54 +00:00
q3k 0d26fc9780 cluster: disable nginx/acme
These are unused.

Change-Id: I2a428dabd0a27c060c595f5e0843d7d8d8e26dcd
2021-02-15 22:14:41 +01:00
q3k 765e369255 cluster: replace docker with containerd
This removes Docker and docker-shim from our production kubernetes, and
moves over to containerd/CRI. Docker support within Kubernetes was
always slightly shitty, and with 1.20 the integration was dropped
entirely. CRI/Containerd/runc is pretty much the new standard.

Change-Id: I98c89d5433f221b5fe766fcbef261fd72db530fe
2021-02-15 22:14:15 +01:00
q3k 4b613303b1 RFC: *: move away from rules_nixpkgs
This is an attempt to see how well we do without rules_nixpkgs.

rules_nixpkgs has the following problems:

 - complicates our build system significantly (generated external
   repository indirection for picking local/nix python and go)
 - creates builds that cannot run on production (as they are tainted by
   /nix/store libraries)
 - is not a full solution to the bazel hermeticity problem anyway, and
   we'll have to tackle that some other way (eg. by introducing proper
   C++ cross-compilation toolchains and building everything from C,
   including Python and Go)

Instead of rules_nixpkgs, we ship a shell.nix file, so NixOS users can
just:

  jane@hacker:~/hscloud $ nix-shell
  hscloud-build-chrootenv:jane@hacker:~/hscloud$ prodaccess

This shell.nix is in a way nicer, as it gives you all the tools needed
to access production straight away.

Change-Id: Ieceb5ae0fb4d32e87301e5c99416379cedc900c5
2021-02-15 22:11:35 +01:00
q3k 4842705406 cluster/nix: integrate with readtree
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:

   nix-build -A cluster.nix.provision

   result/bin/provision

Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
2021-02-14 14:46:07 +00:00
q3k 225a5c7ee9 nixpkgs: bump
Fixes b/3.

Change-Id: I2f734422cdad00f78956477815c4aea645c6c49e
2021-02-14 14:43:07 +00:00
q3k 78d6f11cb2 Merge "cluster/admitomatic: allow whitelist-source-range" 2021-02-08 17:21:59 +00:00
q3k 877cf0af26 🅱️
Fixes b/8

Change-Id: I5a5779c3688451d89c0601dc913143d75048c9f6
2021-02-08 15:10:11 +00:00
q3k 943ab5b1a6 cluster/admitomatic: allow whitelist-source-range
Without this, cert-manager gets stuck.

Deployed to prod.

Change-Id: I356cd44f455b6f4aecea9ae396f6a05e1a727859
2021-02-07 23:35:28 +00:00
q3k f40c9249ce cluster/kube: allow system:admin-namespaces to modify ingresses
This will permit any binding to system:admin-namespaces (eg. personal-*
namespaces, per-namespace extra admin access like matrix-0x3c) the
ability to create and update ingresses.

Change-Id: I522896ebe290fe982d6fe46b7b1d604d22b4f72c
2021-02-07 19:24:43 +00:00
q3k 41bbf1436a cluster/kube: deploy admitomatic webhook
This has been (successfully) tested on prod and then rolled back.

Change-Id: I22657f66b4aeaa8a0ae452035ba18a79f4549b14
2021-02-07 19:19:23 +00:00
q3k 3c5d836c56 cluster/kube: deploy admitomatic
This doesn't yet enable a webhook, but deploys admitomatic itself.

Change-Id: Id177bc8841c873031f9c196b8ff3c12dd846ba8e
2021-02-07 19:19:02 +00:00
q3k 3ab5f07c64 cluster/admitomatic: build docker image
Change-Id: I086a8b17a4dc7257de1bae3a6f0c95400af7e115
2021-02-07 19:18:53 +00:00
q3k c80321d17e Merge "cluster: add admitomatic CA/certificate" 2021-02-06 23:18:59 +00:00
q3k 04604b2aae cluster: add admitomatic CA/certificate
Change-Id: Idb32dc38b897aa266b6d2d6fd57a5e38b47db7fc
2021-02-06 17:18:58 +00:00
informatic f4a6a56662 cluster/kube/k0: add issues.hackerspace.pl crdb user
Change-Id: If78f795e0e35360b65c666e6b217037fc34a2ccf
2021-02-01 21:32:25 +01:00
informatic 3b8a43f35d cluster/kube/k0: add issues.hackerspace.pl ceph s3 user
Change-Id: If5eef3404bdc08ded88e46f45bad0f9abcdb0f1c
2021-02-01 21:19:59 +01:00
q3k c6118649ab cluster/admitomatic: finish up service
This turns admitomatic into a standalone service that can be used as
an admission controller.

I've tested this E2E on a local k3s server, and have some early test
code for that - but that'll land in a follow-up CR, as it first needs
to be cleaned up.

Change-Id: I46da0fc49f9d1a3a1a96700a36deb82e5057249b
2021-01-31 12:18:16 +01:00
q3k 5d2c8fcda0 cluster/admitomatic: finish up ingress admission logic
This gives us nearly everything required to run the admission
controller. In addition to checking for allowed domains, we also do some
nginx-ingress-controller security checks.

Change-Id: Ib187de6d2c06c58bd8c320503d4f850df2ec8abd
2021-01-31 12:18:16 +01:00
q3k 649565324b cluster/admitomatic: implement basic dns/ns filtering
This is the beginning of a validating admission controller which we will
use to permit end-users access to manage Ingresses.

This first pass implements an ingressFilter, which is the main structure
through which allowed namespace/DNS combinations will be permitted. The
interface is currently via a test, but in the future this will likely be
configured via a command line, or via a serialized protobuf config.
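
A hedged sketch of the idea (actual admitomatic types and matching
semantics may differ):

    package main

    import (
        "fmt"
        "strings"
    )

    // ingressFilter maps DNS suffixes to the namespaces allowed to claim
    // hostnames under them.
    type ingressFilter struct {
        allowed map[string][]string
    }

    func (f *ingressFilter) permit(namespace, host string) bool {
        for suffix, namespaces := range f.allowed {
            if host != suffix && !strings.HasSuffix(host, "."+suffix) {
                continue
            }
            for _, ns := range namespaces {
                if ns == namespace {
                    return true
                }
            }
        }
        return false
    }

    func main() {
        f := &ingressFilter{allowed: map[string][]string{
            "hswaw.net": {"hswaw-prod"},
        }}
        fmt.Println(f.permit("hswaw-prod", "matrix.hswaw.net"))   // true
        fmt.Println(f.permit("personal-q3k", "matrix.hswaw.net")) // false
    }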

Change-Id: I22dbed633ea8d8e1fa02c2a1598f37f02ea1b309
2021-01-30 19:19:35 +01:00
patryk edf14cc5f4 crdb: replace bc01n03 with dcr01s22, upgrade to v20.2.4
This change reflects the current production state.

Upgrade was done by going through following versions:
19.1.0 -> 19.2.12 -> 20.1.10 -> 20.2.4

Change-Id: I8b33b8116363f1a918423fd18ba3d1b5c910851c
2021-01-23 23:00:29 +01:00
patryk f3153888a8 cluster/kube: Add k0-cockroach.jsonnet, add Gitea client cert
Change-Id: Ibc5db1b0114b2540b6dc806e75e9a36cf9a3bc50
2021-01-23 15:38:50 +01:00
q3k 61f978a0a0 *: tear down ceph-waw2
It reached the stage of being so crapped out that the OSDs' spurious
IOPS killed the performance of disks colocated on the same M610 RAID
controllers. This made etcd _very_ slow, to the point of churning
through re-elections due to timeouts.

etcd/apiserver latencies, observe the difference at ~15:38:

https://object.ceph-waw3.hswaw.net/q3k-personal/4fbe8d4cfc8193cad307d487371b4e44358b931a7494aa88aff50b13fae9983c.png

I moved gerrit/* and matrix/appservice-irc-freenode PVCs to ceph-waw3 by
hand. The rest were non-critical, so I removed them; they can be
recovered from benji backups if needed.

Change-Id: Iffbe87aefc06d8324a82b958a579143b7dd9914c
2021-01-22 16:26:09 +01:00
q3k 3b9ee5f1c0 ceph: bump to 14.2.16
More as-builts. This has already been bumped. Had to coax ceph-waw2 to
upgrade despite the fact that it's horribly broken.

Change-Id: Ia762f5d7d88d6420c2fc25cf199037cbccde0cb3
2021-01-19 21:45:26 +00:00
q3k 2c04c8410a rook: bump to 1.2.7
As-built: deployed to ceph-waw{2,3} already.

Change-Id: I27189b273cf72638cf2036681054832db99591da
2021-01-19 21:41:13 +01:00
q3k f684535c6e k0: remove bc01n03 from nix defs
This only affects ETCD_INITIAL_* env vars, so it is effectively a no-op.

Deployed to prod.

Change-Id: Ic9118e17b088d1b58ebaf1ac0708a1ee6fcf2c06
2021-01-19 20:20:33 +01:00
q3k cf842b0442 k0: reflect reality
This is after the monster^Wrook outage two weeks ago, caused by bc01n03
dying.

Plan is to migrate ceph-waw3 to be external, yeet ceph-waw2, and extend
crdb-waw1 to another node.

Change-Id: I133af3b1171fea383b45bf06c51e48a5c40341e4
2021-01-19 20:08:26 +01:00
q3k 9708ba02ec Merge "cluster: use static addresses" 2020-12-15 18:53:54 +00:00
q3k acdd665b08 cluster: use static addresses
This disables DHCP on all k0 nodes. This change has been tentatively
deployed to bc01n01 (which is cordoned off in kube), and I will deploy
it to the rest of k0 machines once merged.

Change-Id: I96253a9d0acedb4512c877c64174992ffdb43d58
2020-12-14 19:10:52 +01:00
patryk cae7cf776f k0: add missing curly brace termination in woju's S3 user name
Change-Id: Ib2752d798f6e23493daee446a834e244f858330e
2020-11-28 14:36:48 +01:00
patryk 34668a5b7b k0: add cz3's personal s3 user
Change-Id: I51ee80eb05c34cfd8b03e15fcaefb5f235587c50
2020-11-28 13:45:25 +01:00
q3k f18a531f9b prodvider: bump to Go 1.15.5
Change-Id: I0f7999deb571aef12533f0ceee21c0283bc0bdc4
2020-11-27 09:50:09 +00:00
q3k 0754ed86a2 prodvider: fix build after k8s update, add to CI presubmit
Change-Id: I5a3794541853abd1fb16e67e285edfa29c2f5cf7
2020-11-27 09:43:47 +00:00
q3k e00fe3a448 cluster/tools/kartongips: skip tests broken by fork
These tests are broken as they depend on some test data that we
currently don't have in hscloud. They should be fixed ASAP.

Change-Id: I2571c2958cb84e145a7e3a44171685ecf43cf499
2020-11-12 00:45:15 +01:00
q3k 640336144d cluster/tools: integrate kartongips as main kubecfg tool
Change-Id: If6a6c8e9c9163f0fc25adcaa8680857fdca69cd3
2020-11-12 00:40:08 +01:00
q3k be538db63b cluster/tools/kartongips: init
This forks bitnami/kubecfg into kartongips. The rationale is that we
want to implement hscloud-specific functionality that wouldn't really be
upstreamable into kubecfg (like secret support, multi-cluster support).

We forked off from github.com/q3k/kubecfg at commit b6817a94492c561ed61a44eeea2d92dcf2e6b8c0.

Change-Id: If5ba513905e0a86f971576fe7061a471c1d8b398
2020-11-12 00:39:34 +01:00
q3k bfe9bb0e3a k0: add woju's personal s3 user
Change-Id: I8ed5bb5428594b74460f1b89185d684cb6c26268
2020-10-27 20:50:50 +01:00
q3k e77f7717d4 k0: bump to 1.16.5
Change-Id: I548808ce4e0deb0513a1e00963f383d84b9d920c
2020-10-10 22:39:50 +02:00
q3k 1257389d3d k0: expose controller-manager and scheduler metrics
We want to be able to scrape controller-manager and scheduler metrics
into Prometheus. For that, each of them needs to:

 1) listen on a secure port
 2) have authn enabled

With this, any k8s user with the right permissions (and a bearer token
or TLS certificate) can come in and access metrics over a node's public
IP address. Access without a certificate/token gets thrown into the
system:anonymous user, which has no access to any API.
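
Concretely, that means flags along these lines on kube-controller-manager
(values illustrative; the scheduler gets an analogous set):

    kube-controller-manager \
        --secure-port=10257 \
        --authentication-kubeconfig=... \
        --authorization-kubeconfig=... \
        ...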

Change-Id: I267680f92f748ba63b6762e6aaba3c417446e50b
2020-10-10 16:00:15 +00:00
q3k 36224c617a clustercfg: show diff before switching to new configuration
This is mildly hacky, but lets us be more informed before we switch to a
new configuration.

Change-Id: I008f3f698db702f1e0992bd41a8d1050449d59b5
2020-10-10 16:00:11 +00:00
q3k 2e001e5046 k0: bump to 1.15.4
This notably fixes the annoying loopback issues that prevented hosts
from accessing ExternalIP services with externalTrafficPolicy: local
from nodes that weren't running the service.

Which means, hopefully, no more registry pull failures when
nginx-ingress gets misplaced!

Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
2020-10-03 16:32:38 +00:00
q3k 2a223705fd cluster: bump certs
This has been deployed to k0 nodes.

Current state of cluster certificates:

cluster/certs/ca-etcd.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-etcdpeer.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kube.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kubefront.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kube-prodvider.cert
            Not After : Sep  1 21:30:00 2021 GMT
cluster/certs/etcd-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcd-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcd-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-calico.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/etcd-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/etcd-kube.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcdpeer-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcdpeer-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/etcdpeer-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/etcd-root.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-apiserver.cert
            Not After : Oct  3 15:26:00 2021 GMT
cluster/certs/kube-controllermanager.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kubefront-apiserver.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/kube-kubelet-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/kube-kubelet-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/kube-proxy.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-scheduler.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-serviceaccounts.cert
            Not After : Mar 28 15:15:00 2021 GMT

Change-Id: I94030ce78c10f7e9a0c0257d55145ef629195314
2020-10-03 16:32:32 +00:00
q3k fbe234bdb2 cluster: rename module-* into modules/*
Change-Id: I65e06f3e9cec2ba0071259eb755eddbbd1025b97
2020-10-03 14:57:30 +00:00
q3k c7de7e562f cluster: do not export metallb routes to mesh peers
This prevents metallb routes from being announced by all peers to our ToR,
thereby preventing issues with traffic hitting services with
externalTrafficPolicy: local.

There still is the from-host loopback issue, but that will be fixed by
upgrading to kube 1.15.

Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04
2020-10-03 14:56:52 +00:00
q3k f0acf16564 prodvider: use SANs in service certificates
This fixes compatibility with prodaccess tools built with Go 1.15, which
introduced 'X.509 CommonName deprecation' [1].

[1] - https://golang.org/doc/go1.15#commonname
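
The relevant bit, sketched in Go (an illustration with an assumed
hostname, not the actual prodvider code):

    package main

    import (
        "crypto/x509"
        "crypto/x509/pkix"
        "fmt"
        "math/big"
    )

    func main() {
        tmpl := x509.Certificate{
            SerialNumber: big.NewInt(1),
            Subject:      pkix.Name{CommonName: "prodvider.example.svc"},
            // Go 1.15+ ignores CommonName during verification; the SAN
            // below is what actually gets checked.
            DNSNames: []string{"prodvider.example.svc"},
        }
        fmt.Println(tmpl.DNSNames)
    }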

Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3
2020-10-03 14:56:35 +00:00
q3k 44628f2b9e Merge "k0.hswaw.net: pass metallb through Calico" 2020-10-02 22:54:57 +00:00
q3k e7fca3acd8 ci_presubmit: init
This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).

For now, you can manually run it to ensure that Everything At Least
Kinda Works.

Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
2020-09-25 21:15:07 +00:00
q3k a5ed644980 k0.hswaw.net: pass metallb through Calico
Previously, we had the following setup:

                          .-----------.
                          | .....     |
                        .-----------.-|
                        | dcr01s24  | |
                      .-----------.-| |
                      | dcr01s22  | | |
                  .---|-----------| |-'
    .--------.    |   |---------. | |
    | dcsw01 | <----- | metallb | |-'
    '--------'        |---------' |
                      '-----------'

Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.

Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.

                      .-------------------------.
                      | dcr01s24                |
                      |-------------------------|
    .--------.        |---------.   .---------. |
    | dcsw01 | <----- | Calico  |<--| metallb | |
    '--------'        |---------'   '---------' |
                      '-------------------------'

This makes Calico announce our pod/service networks into our L3 fabric!

Calico and metallb talk to each other over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to passive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.

We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. registry access from containerd.
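
Illustratively, the two overrides might boil down to something like this
(a BIRD-style sketch with an assumed ASN, not our actual template):

    # Peer with metallb over loopback; passive, so only metallb initiates.
    protocol bgp metallb {
        local as 65530;
        neighbor 127.0.0.1 as 65530;
        passive;
    }

    # Kernel programming filter: don't program routes bird considers
    # unreachable (next-hop 127.0.0.1) into the kernel routing table.
    filter to_kernel {
        if dest = RTD_UNREACHABLE then reject;
        accept;
    }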

All routes pass through without route reflectors or a full mesh, as we
use eBGP over private ASNs in our fabric.

We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.

This is all mildly hacky. Here's hoping that Calico will someday gain
metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...

There is, however, one problem with this change (but I'm not fixing it
yet as it's not critical): metallb would previously announce IPs only
from nodes that were serving that service. Now, the Calico internal
mesh makes those appear from every node. This can probably be fixed by
disabling local meshing and enabling route reflection on dcsw01 (to
recreate the mesh by routing through dcsw01), or maybe by some more
hacking of the Calico BIRD config :/.

Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
2020-09-23 18:55:12 +00:00
q3k 059fdfed3b k0: add resource requests/limits to nginx, remove gitea
We just had an outage seemingly caused by N-I-C sending tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.

I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.

This change reflects ad-hoc production changes.

Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
2020-09-20 22:53:40 +00:00
q3k 242ec58a33 k0: add waw-hdd-redundant-q3k-3
Change-Id: Id3718877d1e67d48c6726d7649a565db657cfc82
2020-09-20 15:36:24 +00:00
patryk 8d069d8d1a cluster/certs: refresh prodvider CA
Change-Id: I35578fb62ddf10e7419c2c347e70322cf4ea0b6a
2020-09-01 22:02:52 +00:00
q3k 316411790a cluster/nix: update nodes
 - we update NixOS to 20.09pre
 - we fix an ACME option that's now required
 - we switch from systemd-timesyncd to chrony (as timesyncd took a long
   time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
   from ceph)

This has been deployed in production.

Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
2020-08-23 00:58:29 +02:00
q3k bc73a44519 cluster/clustercfg: fix BUILD
This is continued fallout after migrating from rules_pip.

Change-Id: Idb9b4d4f22aa36512d220ac31375bae7a0f25e4e
2020-08-22 20:33:37 +00:00
q3k d5918c8e72 cluster: change q3k's laptop key
Paranoia is dead, long live Mimeomia.

This has already been deployed to production.

Change-Id: Ibbc5015b5277380a3450f76e62d3fab6e71be1a0
2020-08-22 22:29:42 +02:00