Commit Graph

170 Commits (master)

Author SHA1 Message Date
q3k e17f7edde0 cluster/kube: nginx: add Hscloud-Nic-Source-* headers
These can be used by production jobs to get the source port of the
client connecting over HTTP. A followup CR implements just that.

Change-Id: Ic8e29eaf806bb196d8cfcfb604ff66ae4d0d166a
2021-05-22 19:16:39 +00:00
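
The mechanism behind such headers in nginx-ingress-controller is the proxy-set-headers option, which points at a ConfigMap of header names and values. A minimal jsonnet sketch of that shape follows; the exact header names and the nginx variables used are assumptions, not taken from this CR:

  // Sketch only: a custom-headers ConfigMap referenced by the controller's
  // 'proxy-set-headers' option. Header names below are assumed, not the
  // literal hscloud ones.
  {
    customHeaders: {
      apiVersion: 'v1',
      kind: 'ConfigMap',
      metadata: { name: 'nginx-custom-headers', namespace: 'nginx-system' },
      data: {
        'Hscloud-Nic-Source-IP': '$remote_addr',
        'Hscloud-Nic-Source-Port': '$remote_port',
      },
    },
    // Referenced from the controller's main ConfigMap:
    controllerConfig: {
      apiVersion: 'v1',
      kind: 'ConfigMap',
      metadata: { name: 'nginx', namespace: 'nginx-system' },
      data: { 'proxy-set-headers': 'nginx-system/nginx-custom-headers' },
    },
  }

With this in place the headers are added to every proxied request, so backends can read them like any other HTTP header.
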
q3k ba2f4d8215 cluster/prodvider: deploy
Change-Id: I01d931a664e4b09c0d75fb01fb3f2528bc0f1a53
2021-05-19 22:13:26 +00:00
q3k 5ae5cbec81 Merge "cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k" 2021-05-19 15:34:45 +00:00
q3k 99b91b11f1 cluster/k0/admitomatic: add .hswaw.net to hswaw-prod namespace
This was preventing certificate refresh in the hswaw-prod mirko ingress.

Change-Id: I14b18b642a3948a9864e2d9a90b2a2b2c145b9b1
2021-03-28 17:34:34 +00:00
q3k 2e8d24b84a cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k
This fixes CVE-2021-3450 and CVE-2021-3449.

Deployed on prod:

$ kubectl -n nginx-system exec nginx-ingress-controller-5c69c5cb59-2f8v4 -- openssl version
OpenSSL 1.1.1k  25 Mar 2021

Change-Id: I7115fd2367cca7b687c555deb2134b22d19a291a
2021-03-25 18:16:13 +00:00
q3k bf266c6aaf cluster/k0: add dns crdb user
In preparation for running PowerDNS on k0.

Change-Id: I853c7465a6a32d02628fa6cfdeb445eb9937b3be
2021-03-17 21:49:00 +00:00
q3k 3b8935378a cluster/crdb: make init job 'idempotent'
This enables its redeployment with a newer crdb image.

Change-Id: If039992674f401af53738c80d22cc2ca2818fe00
2021-03-17 21:48:30 +00:00
q3k 64de7afe32 cluster/kube/k0: fix syntax errors
This happened in 793ca1b3 and slipped past review.

Change-Id: Ie31f0e1ec03d6e4545d6683b21f528550bf4ef9f
2021-03-17 21:47:51 +00:00
q3k 793ca1b3b2 cluster/kube: limit OSDs in ceph-waw3 to 8GB RAM
Each OSD is connected to a 6TB drive, and with the good ol' 1TB storage
-> 1GB RAM rule of thumb for OSDs, we end up with 6GB. Or, to round up,
8GB.

I'm doing this because over the past few weeks OSDs in ceph-waw3 have
been using a _ton_ of RAM. This will probably not prevent that (and
instead they will OOM more often :/), but at least it will prevent us from
wasting resources (k0 started migrating pods to other nodes, and running
full nodes like that without an underlying request makes for a terrible
draining experience).

We need to get to the bottom of why this is happening in the first
place, though. Did this happen as we moved to containerd?

Followup: b.hswaw.net/29

Already deployed to production.

Change-Id: I98df63763c35017eb77595db7b9f2cce71756ed1
2021-03-07 00:09:58 +00:00
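
In a Rook-managed cluster, the limit described above is expressed through the CephCluster spec's resources.osd field. A rough jsonnet sketch of that intent (the request value and exact field placement are assumptions; this is not the actual hscloud diff):

  // Sketch: cap OSD memory via the Rook CephCluster spec. The 6Gi request
  // follows the 1TB -> 1GB rule of thumb mentioned above; the 8Gi limit is
  // the rounded-up cap. Other resource fields elided.
  {
    spec+: {
      resources: {
        osd: {
          requests: { memory: '6Gi' },
          limits: { memory: '8Gi' },
        },
      },
    },
  }
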
q3k 78d6f11cb2 Merge "cluster/admitomatic: allow whitelist-source-range" 2021-02-08 17:21:59 +00:00
q3k 877cf0af26 🅱️
Fixes b/8

Change-Id: I5a5779c3688451d89c0601dc913143d75048c9f6
2021-02-08 15:10:11 +00:00
q3k 943ab5b1a6 cluster/admitomatic: allow whitelist-source-range
Without this, cert-manager gets stuck.

Deployed to prod.

Change-Id: I356cd44f455b6f4aecea9ae396f6a05e1a727859
2021-02-07 23:35:28 +00:00
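
Context for the above: cert-manager's HTTP01 solver ingresses carry the ingress-nginx whitelist-source-range annotation, so admitomatic has to let that annotation through. An illustrative example of such an Ingress (names and rules are made up; the annotation itself is the standard ingress-nginx one):

  // Illustrative solver-style Ingress carrying the annotation that
  // admitomatic must now permit.
  {
    apiVersion: 'networking.k8s.io/v1beta1',
    kind: 'Ingress',
    metadata: {
      name: 'cm-acme-http-solver-example',  // name made up for illustration
      namespace: 'example-prod',
      annotations: {
        'kubernetes.io/ingress.class': 'nginx',
        'nginx.ingress.kubernetes.io/whitelist-source-range': '0.0.0.0/0,::/0',
      },
    },
    // spec.rules elided
  }
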
q3k f40c9249ce cluster/kube: allow system:admin-namespaces to modify ingresses
This will permit anything bound to system:admin-namespaces (eg. personal-*
namespaces, per-namespace extra admin access like matrix-0x3c) to create
and update ingresses.

Change-Id: I522896ebe290fe982d6fe46b7b1d604d22b4f72c
2021-02-07 19:24:43 +00:00
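
In RBAC terms this amounts to an extra rule on the role that system:admin-namespaces bindings point at; a hedged sketch (the real rule in cluster/kube may differ in API groups and verbs):

  // Sketch of the ingress rule granted to namespace admins.
  {
    apiVersion: 'rbac.authorization.k8s.io/v1',
    kind: 'ClusterRole',
    metadata: { name: 'system:admin-namespaces' },
    rules: [
      {
        apiGroups: ['extensions', 'networking.k8s.io'],
        resources: ['ingresses'],
        verbs: ['create', 'update', 'patch', 'get', 'list'],
      },
    ],
  }
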
q3k 41bbf1436a cluster/kube: deploy admitomatic webhook
This has been (successfully) tested on prod and then rolled back.

Change-Id: I22657f66b4aeaa8a0ae452035ba18a79f4549b14
2021-02-07 19:19:23 +00:00
q3k 3c5d836c56 cluster/kube: deploy admitomatic
This doesn't yet enable a webhook, but deploys admitomatic itself.

Change-Id: Id177bc8841c873031f9c196b8ff3c12dd846ba8e
2021-02-07 19:19:02 +00:00
informatic f4a6a56662 cluster/kube/k0: add issues.hackerspace.pl crdb user
Change-Id: If78f795e0e35360b65c666e6b217037fc34a2ccf
2021-02-01 21:32:25 +01:00
informatic 3b8a43f35d cluster/kube/k0: add issues.hackerspace.pl ceph s3 user
Change-Id: If5eef3404bdc08ded88e46f45bad0f9abcdb0f1c
2021-02-01 21:19:59 +01:00
patryk edf14cc5f4 crdb: replace bc01n03 with dcr01s22, upgrade to v20.2.4
This change reflects the current production state.

Upgrade was done by going through following versions:
19.1.0 -> 19.2.12 -> 20.1.10 -> 20.2.4

Change-Id: I8b33b8116363f1a918423fd18ba3d1b5c910851c
2021-01-23 23:00:29 +01:00
patryk f3153888a8 cluster/kube: Add k0-cockroach.jsonnet, add Gitea client cert
Change-Id: Ibc5db1b0114b2540b6dc806e75e9a36cf9a3bc50
2021-01-23 15:38:50 +01:00
q3k 61f978a0a0 *: tear down ceph-waw2
It reached the stage of being so crapped out that the OSDs' spurious
IOPS killed the performance of disks colocated on the same M610 RAID
controllers. This made etcd _very_ slow, to the point of churning
through re-elections due to timeouts.

etcd/apiserver latencies, observe the difference at ~15:38:

https://object.ceph-waw3.hswaw.net/q3k-personal/4fbe8d4cfc8193cad307d487371b4e44358b931a7494aa88aff50b13fae9983c.png

I moved gerrit/* and matrix/appservice-irc-freenode PVCs to ceph-waw3 by
hand. The rest were non-critical so I removed them, they can be
recovered from benji backups if needed.

Change-Id: Iffbe87aefc06d8324a82b958a579143b7dd9914c
2021-01-22 16:26:09 +01:00
q3k 3b9ee5f1c0 ceph: bump to 14.2.16
More as-builts. This has already been bumped. Had to coax ceph-waw2 to
upgrade despite the fact that it's horribly broken.

Change-Id: Ia762f5d7d88d6420c2fc25cf199037cbccde0cb3
2021-01-19 21:45:26 +00:00
q3k 2c04c8410a rook: bump to 1.2.7
As-built: deployed to ceph-waw{2,3} already.

Change-Id: I27189b273cf72638cf2036681054832db99591da
2021-01-19 21:41:13 +01:00
q3k cf842b0442 k0: reflect reality
This is after the monster^Wrook outage of the week two weeks ago caused
by bc01n03 dying.

Plan is to migrate ceph-waw3 to be external, yeet ceph-waw2, and extend
crdb-waw1 to another node.

Change-Id: I133af3b1171fea383b45bf06c51e48a5c40341e4
2021-01-19 20:08:26 +01:00
patryk cae7cf776f k0: add missing curly brace termination in woju's S3 user name
Change-Id: Ib2752d798f6e23493daee446a834e244f858330e
2020-11-28 14:36:48 +01:00
patryk 34668a5b7b k0: add cz3's personal s3 user
Change-Id: I51ee80eb05c34cfd8b03e15fcaefb5f235587c50
2020-11-28 13:45:25 +01:00
q3k f18a531f9b prodvider: bump to Go 1.15.5
Change-Id: I0f7999deb571aef12533f0ceee21c0283bc0bdc4
2020-11-27 09:50:09 +00:00
q3k bfe9bb0e3a k0: add woju's personal s3 user
Change-Id: I8ed5bb5428594b74460f1b89185d684cb6c26268
2020-10-27 20:50:50 +01:00
q3k c7de7e562f cluster: do not export metallb routes to mesh peers
This prevents metallb routes being announced from all peers to our ToR,
thereby preventing issues with traffic hitting services with
externalTrafficPolicy: local.

There still is the from-host loopback issue, but that will be fixed by
upgrading to kube 1.15.

Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04
2020-10-03 14:56:52 +00:00
q3k f0acf16564 prodvider: use SANs in service certificates
This fixes compatibility with prodaccess tools built with Go 1.15, which
introduced 'X.509 CommonName deprecation' [1].

[1] - https://golang.org/doc/go1.15#commonname

Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3
2020-10-03 14:56:35 +00:00
q3k a5ed644980 k0.hswaw.net: pass metallb through Calico
Previously, we had the following setup:

                          .-----------.
                          | .....     |
                        .-----------.-|
                        | dcr01s24  | |
                      .-----------.-| |
                      | dcr01s22  | | |
                  .---|-----------| |-'
    .--------.    |   |---------. | |
    | dcsw01 | <----- | metallb | |-'
    '--------'        |---------' |
                      '-----------'

Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.

Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.

                      .-------------------------.
                      | dcr01s24                |
                      |-------------------------|
    .--------.        |---------.   .---------. |
    | dcsw01 | <----- | Calico  |<--| metallb | |
    '--------'        |---------'   '---------' |
                      '-------------------------'

This makes Calico announce our pod/service networks into our L3 fabric!

Calico and metallb talk to each other over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to passive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.

We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. registry access from containerd.

All routes pass through without route reflectors or a full mesh, as we
use eBGP over private ASNs in our fabric.

We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.

This is all mildly hacky. Here's hoping that Calico will be able to some
day gain metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...

There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing, enabling route reflection
on dcsw01 (to recreate the mesh routing through dcsw01). Or, maybe by
some more hacking of the Calico BIRD config :/.

Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
2020-09-23 18:55:12 +00:00
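
For reference, the metallb half of the setup described above reduces to peering with the node-local BIRD over loopback; a simplified sketch in the ConfigMap format metallb used at the time (ASNs and the address pool are placeholders, not production values). The Calico half, i.e. the passive 127.0.0.1 peer and the kernel-filter override, lives in the overridden BIRD templates and is not shown here.

  // Sketch: metallb peers with the local Calico BIRD over loopback.
  // ASNs and the pool below are placeholders.
  local config = {
    peers: [
      { 'peer-address': '127.0.0.1', 'peer-asn': 65001, 'my-asn': 65002 },
    ],
    'address-pools': [
      { name: 'public', protocol: 'bgp', addresses: ['203.0.113.0/24'] },
    ],
  };

  {
    apiVersion: 'v1',
    kind: 'ConfigMap',
    metadata: { name: 'config', namespace: 'metallb-system' },
    data: { config: std.manifestYamlDoc(config) },
  }
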
q3k 059fdfed3b k0: add resource requests/limits to nginx, remove gitea
We just had an outage seemingly caused by N-I-C sending tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.

I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.

This change reflects ad-hoc production changes.

Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
2020-09-20 22:53:40 +00:00
q3k 242ec58a33 k0: add waw-hdd-redundant-q3k-3
Change-Id: Id3718877d1e67d48c6726d7649a565db657cfc82
2020-09-20 15:36:24 +00:00
q3k 0581bbf8a0 games/factorio: add modproxy
This adds a mod proxy system, called, well, modproxy.

It sits between Factorio server instances and the Factorio mod portal,
allowing for arbitrary mod download without needing the servers to know
Factorio credentials.

Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212
2020-08-14 13:03:46 +02:00
q3k 3d29484ebb k0: move registry to ceph-waw3
ceph-waw2 currently has some production issues [1] which have started to
cause write failures in the registry. The registry is the only user of
ceph-waw2's affected pool, so we reduce the dumpster fire blast radius
by moving it over to ceph-waw3.

This has already been deployed and data has been migrated over (via
s3cmd sync), and the migration has been verified (by a push and pull,
and pull of an older image).

[1] - pgs stuck inactive in the object storage pool

Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347
2020-08-04 01:36:51 +02:00
q3k 4ded56ab8a prodvider: emit client/server cert
Change-Id: I024782a7dfa6e16ff5f562a62ddd8fe3bf299c51
2020-08-01 22:01:05 +02:00
q3k f3312ef77e *: developer machine HSPKI credentials
In addition to k8s certificates, prodaccess now issues HSPKI
certificates, with DN=$username.sso.hswaw.net. These are installed into
XDG_CONFIG_HOME (or os equiv).

//go/pki will now automatically attempt to load these certificates. This
means you can now run any pki-dependent tool with -hspki_disable, and
with automatic mTLS!

Change-Id: I5b28e193e7c968d621bab0d42aabd6f0510fed6d
2020-08-01 17:15:52 +02:00
q3k 509ab6e29a k0/cockroach: add public DNS entry for cockroach
Change-Id: I934bf348e2165148b515b709e853ab67f039a402
2020-07-30 22:56:30 +02:00
informatic 97a6ca8a8b Merge "cluster/kube/lib/nginx: add gitea-prod ingress service" 2020-07-02 17:15:53 +00:00
informatic 0697e01144 cluster/kube/lib/registry: allow auth'd users to pull all images
"Anyone can pull all images" rule did only match on anonymous users. Now
it should match all users, including authenticated ones.

Change-Id: I2205299093feca51f30526ba305eadbaa0a68ecb
2020-07-02 18:45:42 +02:00
informatic f00edf6ee8 cluster/kube/lib/nginx: add gitea-prod ingress service
We would like gitea to have its ssh server exposed on TCP port 22 on the
same address as its web interface. We would also still like to use all
the automation around ingresses already in place (like cert-manager
integration).

To solve this, we create an additional LoadBalancer service for
nginx-ingress-controller and set up a special tcp-services forwarding rule
to pass port 22 traffic to gitea-prod/gitea service, like we already do
in case of gerrit.

Change-Id: I5bfc901ebe858464f8e9c2f3b2216b254ccd6c4d
2020-07-02 18:30:38 +02:00
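
The tcp-services mechanism referenced above is the stock nginx-ingress-controller ConfigMap that maps a TCP port to a namespace/service:port target. A minimal sketch (the ConfigMap name/namespace and the target port on the gitea service are assumptions; the gitea-prod/gitea target is from the message above):

  // Sketch: forward TCP port 22 on the controller to the gitea service.
  {
    apiVersion: 'v1',
    kind: 'ConfigMap',
    metadata: { name: 'tcp-services', namespace: 'nginx-system' },
    data: {
      '22': 'gitea-prod/gitea:22',
    },
  }
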
q3k b1aadd88ff k0: add q3k's personal s3 user
Change-Id: I5681774e1dca2cf4a865d9e1a24602ed4334f006
2020-06-24 17:19:36 +00:00
implr d9df5879e3 add radosgw bucket for spark
Change-Id: Id8ea8901ce038ccbf11afabe0e6272c358b32cf2
2020-06-13 21:31:56 +02:00
q3k 9b2ce179a8 Merge "cluster/kube: split up cluster.jsonnet" 2020-06-13 17:52:27 +00:00
q3k dbfa988c73 cluster/kube: split up cluster.jsonnet
It was getting large and unwieldy (to the point where kubecfg was slow).
In this change, we:

 - move the Cluster function to cluster.libsonnet
 - move the Cluster instantiation into k0.libsonnet
 - shuffle some fields around to make sure things are well split between
   k0-specific and general cluster configs.
 - add 'view' files that build on 'cluster.libsonnet' to allow rendering
   either the entire k0 state, or some subsets (for speed)
 - update the documentation, drive-by some small fixes and reindentation

Change-Id: I4b8d920b600df79100295267efe21b8c82699d5b
2020-06-13 19:51:58 +02:00
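
A hypothetical example of what one of the 'view' files mentioned above could look like; the file name and the fields selected are made up for illustration, only the pattern (import the library, render a subset) is the point:

  // k0-nodes-view.jsonnet (hypothetical): render only part of k0 state,
  // so kubecfg doesn't have to evaluate everything.
  local k0 = import 'k0.libsonnet';

  {
    calico: k0.cluster.calico,
    coredns: k0.cluster.coredns,
  }
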
q3k ce81c39081 ops/metrics: basic cluster setup with prometheus
We handwavingly plan on implementing monitoring as a two-tier system:

 - a 'global' component that is responsible for global aggregation,
   long-term storage and alerting.
 - multiple 'per-cluster' components, that collect metrics from
   Kubernetes clusters and export them to the global component.

In addition, several lower tiers (collected by per-cluster components)
might also be implemented in the future - for instance, specific to some
subprojects.

Here we start sketching out some basic jsonnet structure (currently all
in a single file, with little parametrization) and a cluster-level
prometheus server that scrapes Kubernetes Node and cAdvisor metrics.

This review is mostly to get this committed as early as possible, and to
make sure that the little existing Prometheus scrape configuration is
sane.

Change-Id: If37ac3b1243b8b6f464d65fee6d53080c36f992c
2020-06-06 15:56:10 +02:00
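
The node-level scraping described above typically boils down to two kubernetes_sd_configs jobs, one for the kubelet and one for cAdvisor. A condensed sketch of that configuration as a jsonnet value (TLS paths are the standard in-cluster serviceaccount ones; relabeling and the actual hscloud parametrization are elided):

  // Sketch of the two node-level scrape jobs, to be serialized into
  // prometheus.yml. Not the actual hscloud jsonnet.
  local saDir = '/var/run/secrets/kubernetes.io/serviceaccount';
  {
    scrape_configs: [
      {
        job_name: 'kubernetes-nodes',
        scheme: 'https',
        kubernetes_sd_configs: [{ role: 'node' }],
        tls_config: { ca_file: saDir + '/ca.crt' },
        bearer_token_file: saDir + '/token',
      },
      {
        job_name: 'kubernetes-cadvisor',
        scheme: 'https',
        metrics_path: '/metrics/cadvisor',
        kubernetes_sd_configs: [{ role: 'node' }],
        tls_config: { ca_file: saDir + '/ca.crt' },
        bearer_token_file: saDir + '/token',
      },
    ],
  }
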
patryk c410432d94 personal/patryk/arma3: create a S3 bucket account for Arma3 mods
Change-Id: Idd31b5f46fcaebfcd72334dc82fbc8df805203b1
2020-06-04 18:51:51 +02:00
informatic cb96eb6df6 Merge "crdb.k0: add sso client" 2020-05-31 12:26:04 +00:00
q3k e55493f635 calico: fix access to resources from controller
This fixes even more networking issues.

Change-Id: I754656a01e3de8a34055280908b343a1a25a4707
2020-05-30 17:57:05 +02:00
q3k ba375e62b2 calico: fix node name selection
This was an attempt to make new calico nodes use a full FQDN. However,
this change seemingly also makes the calico control plane use the FQDN
for all existing nodes, as such breaking CNI for new pods.

We revert this change, thereby keeping all calico nodes names as
hostnames. We could fix this by editing /var/lib/calico/nodename on
hosts to FQDNs, but it might not be worth the effort.

See https://github.com/projectcalico/calico/issues/1093 for more
context.

Change-Id: I52bfb00f604053d57d3009aebd6c50db7dc74f58
2020-05-30 16:18:13 +02:00
informatic 42da0e9aec crdb.k0: add sso client
Change-Id: I7490a3594694d61a19910e436983937667ed34bd
2020-05-30 14:34:33 +02:00
q3k d81bf72d7f calico: upgrade to 3.14, fix calicoctl
We still use etcd as the data store (and as such didn't set up k8s CRDs
for Calico), but that's okay for now.

Change-Id: If6d66f505c6b40f2646ffae7d33d0d641d34a963
2020-05-28 16:47:16 +02:00
q3k 1223cde4d4 cluster: fix nuke's personal storage
Change-Id: I422a6d9f7a483e7c44cc8dfd8c0d8a98d9e17e46
2020-05-16 17:38:23 +02:00
q3k 741c08f66c cluster: add nuke's personal storage
He needs some personal backup space, and we have enough best effort
spare capacity for that.

Change-Id: I75ed6f62e79d33907c0974ec5f2839389ce62543
2020-05-14 18:13:53 +00:00
q3k a168c50132 SECURITY: cluster: limit api objects modifiable by namespace admins
This previously allowed all namespace admins (ie. personal-$user namespace
users) to create any sort of object they wanted within that namespace.

This could've been exploited to allow creation of a RoleBinding that
would then allow binding a serviceaccount to the insecure
podsecuritypolicy, thereby allowing escalation to root on nodes.

As far as I've checked, this hasn't been exploited, and the access to
the k8s cluster has so far also been limited to trusted users.

This has been deployed to production.

Change-Id: Icf8747d765ccfa9fed843ec9e7b0b957ff27d96e
2020-05-11 20:49:31 +02:00
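
The shape of the fix is replacing a wildcard rule with an explicit resource allowlist in the namespace-admin role, notably excluding roles and rolebindings. A hedged sketch; the role name and the exact allowlist are assumptions, not the actual hscloud rule:

  // Before (roughly): rules allowing '*' on '*' within the namespace.
  // After (sketch): an explicit allowlist that omits RBAC objects, so a
  // namespace admin can no longer bind to the insecure podsecuritypolicy.
  {
    apiVersion: 'rbac.authorization.k8s.io/v1',
    kind: 'ClusterRole',
    metadata: { name: 'system:admin-namespaces' },
    rules: [
      {
        apiGroups: ['', 'apps', 'extensions'],
        resources: [
          'pods', 'deployments', 'services', 'configmaps',
          'secrets', 'persistentvolumeclaims',
        ],
        verbs: ['*'],
      },
    ],
  }
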
q3k d436de2010 cluster/rook: bump to 1.1.9
This bumps Rook/Ceph. The new resources (mostly RBAC) come from
following https://rook.io/docs/rook/v1.1/ceph-upgrade.html .

It's already deployed on production. The new CSI driver has not been
tested, but the old flexvolume-based provisioners still work. We'll
migrate when Rook offers a nice solution for this.

We've hit a kubecfg bug that does not allow controlling the CephCluster
CRD directly anymore (I had to apply it via kubecfg show / kubectl apply
-f instead). This might be due to our bazel/prod k8s version mismatch,
or it might be related to https://github.com/bitnami/kubecfg/issues/259.

Change-Id: Icd69974b294b823e60b8619a656d4834bd6520fd
2020-05-02 23:30:52 +02:00
Bartosz Stebel 98ef1518e0 add vpn insecure namespace
Change-Id: I8a774ae625342af3521ad0ab11a8f6d4e4ef6c97
2020-04-24 13:28:38 +02:00
q3k 0dcc702c64 cluster: bump nearly-expired certs
This makes clustercfg ensure certificates are valid for at least 30
days, and renew them otherwise.

We use this to bump all the certs that were about to expire in a week.
They are now valid until 2021.

There's still some certs that expire in 2020. We need to figure out a
better story for this, especially as the next expiry is 2021 - today's
prod rollout was somewhat disruptive (basically this was done by a full
cluster upgrade-like rollout flow, via clustercfg).

We also drive-by bump the number of mons in ceph-waw3 to 3, as it should
be (this gets rid of a nasty SPOF that would've bitten us during this
upgrade otherwise).

Change-Id: Iee050b1b9cba4222bc0f3c7bce9e4cf9b25c8bdc
2020-03-28 18:01:40 +01:00
q3k 90e8e68bab crdb.k0: add bugless-dev (for q3k)
Change-Id: I3988e1c37f0a0c54ef1ba248f01e026d6e8c72b6
2020-03-25 10:55:05 +01:00
q3k e186c87c1b cluster: bump rook to 1.0.6
In preparation for updating to 1.1.0, which will be much more involved.

Also fix a typo in registry.libsonnet, whoops.

Change-Id: I7668bf53c7580f99fdf56fe6227f04a468f8de50
2020-02-21 12:57:02 +01:00
q3k 114edc2398 kube/mirko: add kube.CephObjectStoreUser
Change-Id: I2a67076eeaf41ada41f5ae3ee588025e4c16b9e1
2020-02-18 22:55:13 +01:00
q3k 0d83300b18 cluster: set ceph-waw3 mon replicas to 1
This reflects current production. This needs to get bumped up to 3 at some point as otherwise we lose HA for this cluster.

Change-Id: Ie5937e6a216b635ecbc4c82ecd182a410167c3f8
2020-02-15 11:48:39 +00:00
q3k aa76e55eea cert-manager: fix DNS for http01 k0 splitdns
Change-Id: I73847daec9796cb891cf2fe58c2633c5fa768861
2019-12-29 02:49:30 +01:00
q3k 0c337acf89 benji: fix in waw2, run in waw3
This needed an upstream change to allow only some pools to be backed up,
otherwise benji would crash when stumbling upon the first PVC from a
pool that wasn't backed by the ceph cluster it was acting upon.

Change-Id: I52bf163c16352cb59fdd3dbdd576145ce1dbac03
2019-12-21 23:45:07 +01:00
q3k fd323a0f55 cluster: sync to prod
Change-Id: If311f1ce44653bb54e0a10ad2fdd65685722a64d
2019-11-17 19:49:04 +01:00
q3k c33ebcc79f cluster: add ceph-waw3, move metallb to bgp
Change-Id: Iebf369f9a02e44be163ef4afc2e0f23c4b009898
2019-11-01 18:43:45 +01:00
q3k d493ab66ca *: add dcr01s{22,24}
Change-Id: I072e825e2e1d199d9da50b9d38a9ffba68e61182
2019-10-31 17:07:50 +01:00
q3k 6f773e0004 smsgw: productionize, implement kube/mirko
This productionizes smsgw.

We also add some jsonnet machinery to provide a unified service for Go
micro/mirkoservices.

This machinery provides all the nice stuff:
 - a deployment
 - a service for all your types of ports
 - TLS certificates for HSPKI

We also update and test hspki for a new name scheme.

Change-Id: I292d00f858144903cbc8fe0c1c26eb1180d636bc
2019-10-04 13:52:34 +02:00
q3k d186e9468d cluster: move prodvider to kubernetes.default.svc.k0.hswaw.net
In https://gerrit.hackerspace.pl/c/hscloud/+/70 we accidentally
introduced a split-horizon DNS situation:

 - k0.hswaw.net from the Internet resolves to nodes running the k8s API
   servers, and as such can serve API server traffic
 - k0.hswaw.net from the cluster returned no results

This broke prodvider in two ways:
 - it dialed the API servers at k0.hswaw.net
 - even after the endpoint was moved to
   kubernetes.default.svc.k0.hswaw.net, the apiserver cert didn't cover
   that

Thus, not only we had to change the prodvider endpoint but also change
the APIserver certs to cover this new name.

I'm not sure this should be the target fix. I think at some point we
should only start referring to in-cluster services via their full (or
cluster.local) names, but right now k0.hswaw.net is an exception and as
such a split, and we have no way to access the internal services from
the outside just yet.

However, getting prodvider to work is important enough that this fix is
IMO good enough for now.

Change-Id: I13d0681208c66f4060acecc78b7ae14b8f8d7125
2019-10-04 13:52:34 +02:00
q3k e31d64f265 kube: move cert-manager resources to kube.local.libsonnet
This way kubernetes consumers don't have to import anything from
cluster/, hopefully.

We also create a small abstraction for local additions for
kube.libsonnet without having to modify upstream.

Change-Id: I209095781f91c8867250a647fe944370cddd67d0
2019-10-02 21:03:13 +02:00
q3k 54490d385e cluster/coredns: add cluster fqdn top level domain
This means that in addition to services being discoverable the 'classic'
way:

    <svcname>.<namespace>.svc.cluster.local

They are now discoverable as:

    <svcname>.<namespace>.svc.<fqdn>

For instance, on k0 you can now internally resolve:

    $ kubectl run --rm -it foo --image=nixery.dev/shell/dnsutils bash
    bash-4.4# dig +short coffee-svc.default.svc.k0.hswaw.net
    10.10.12.192

Change-Id: Ie6875b54ed6358f30f888ca0cd96e011520ace20
2019-10-02 20:49:13 +02:00
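
Mechanically this is the CoreDNS kubernetes plugin serving an extra zone next to cluster.local. A minimal sketch of the relevant Corefile fragment, wrapped in the ConfigMap CoreDNS reads (the namespace and surrounding plugins are assumptions, not the actual hscloud Corefile):

  // Sketch: the kubernetes plugin can serve multiple zones, so adding the
  // cluster FQDN next to cluster.local is enough.
  {
    apiVersion: 'v1',
    kind: 'ConfigMap',
    metadata: { name: 'coredns', namespace: 'kube-system' },
    data: {
      Corefile: |||
        .:53 {
            errors
            kubernetes cluster.local k0.hswaw.net in-addr.arpa ip6.arpa {
                pods insecure
            }
            forward . /etc/resolv.conf
            cache 30
        }
      |||,
    },
  }
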
q3k 95868eeddc benji: back up daily instead of hourly
Every benji backup seems to cycle blocks (eg. delete some and recreate
them).

Since wasabi has a minimum billing retention policy of 90 days, this
means that every object uploaded and then deleted an hour later costs
us.

Currently we seem to be storing around 200G of data in wasabi for Benji
but already have 600G of deleted objects. This is suboptimal.

This change has already been deployed on production.

Change-Id: I67302d23a1c45974fb5d51ec9a8cff28260830dc
2019-09-26 21:49:24 +00:00
q3k 5f3a5e0310 cluster/kube: emergency fixes after eviction
Some pods got evicted. Some of them broke.

  - postgres in matrix and nginx in internet because of the new policies
    (chown issues)
  - cas proxy in matrix because apparently the image was not reuploaded
    to the registry after ceph-waw1 died, and another node didn't have it
  - registry because it had a weak image pin and downgraded to some
    broken version on another node

Change-Id: I836036872629843c8ede1b7f67982112c90d71f0
2019-09-25 02:58:15 +02:00
q3k db2a2a029f Merge "Get in the Cluster, Benji!" 2019-09-18 20:40:12 +00:00
q3k a01c487a6e cluster: allow insecure pods in rook-ceph-system
This is required for the agent to start a socket on each host for
kubelet-to-rook access.

Change-Id: I78529df81185aeaacdcb494138f72f0224a029c6
2019-09-05 16:01:19 +00:00
q3k 13bb1bf4e3 Get in the Cluster, Benji!
Here we introduce benji [1], a backup system based on backy2. It lets us
backup Ceph RBD objects from Rook into Wasabi, our offsite S3-compatible
storage provider.

Benji runs as a k8s CronJob, every hour at 42 minutes. It does the
following:
 - runs benji-pvc-backup, which iterates over all PVCs in k8s, and backs
   up their respective PVs to Wasabi
 - runs benji enforce, marking backups outside our backup policy [2] as
   to be deleted
 - runs benji cleanup, to remove unneeded backups
 - runs a custom script to backup benji's sqlite3 database into wasabi
   (unencrypted, but we're fine with that - as the metadata only contains
   image/pool names, thus Ceph PV and pool names)

[1] - https://benji-backup.me/index.html
[2] - latest3,hours48,days7,months12, which means the latest 3 backups,
      then one backup per hour for the next 48 hours, then one per day for
      the next 7 days, then one per month for the next 12 months, for a
      total of 65 backups (deduplicated, of course)

We also drive-by update some docs (make them more separated into
user/admin docs).

Change-Id: Ibe0942fd38bc232399c0e1eaddade3f4c98bc6b4
2019-09-02 16:33:02 +02:00
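
A rough sketch of the CronJob shape described above, scheduled every hour at minute 42; the names, image and wrapper script are placeholders, not the actual hscloud definitions:

  // Sketch of the hourly benji CronJob; most spec details elided.
  {
    apiVersion: 'batch/v1beta1',
    kind: 'CronJob',
    metadata: { name: 'benji', namespace: 'ceph-benji' },  // placeholders
    spec: {
      schedule: '42 * * * *',  // every hour at minute 42, per the description
      concurrencyPolicy: 'Forbid',
      jobTemplate: { spec: { template: { spec: {
        restartPolicy: 'Never',
        containers: [{
          name: 'benji',
          image: 'benji:placeholder',  // placeholder image
          // Wrapper assumed to run benji-pvc-backup, benji enforce and
          // benji cleanup as described above; exact invocation not shown.
          command: ['/usr/local/bin/benji-backup.sh'],
        }],
      } } } },
    },
  }
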
q3k 9496d9910a cluster: add nextcloud user for object store
Change-Id: Ib08be16f71ff5e1b72ca6ad436de4b12427dd407
2019-09-02 16:33:02 +02:00
q3k 896926c921 prodvider: clean up LDAP connections
Change-Id: Ic95e6d1b845832fa0fb2da51b418bcdcb8fd05c4
2019-08-31 15:00:51 +02:00
q3k 71a21c7693 rook/ceph: bump
Change-Id: I046df292cad11650adb829cc8a73100cc1d1ecc8
2019-08-30 23:08:26 +02:00
q3k b13b7ffcdb prod{access,vider}: implement
Prodaccess/Prodvider allow issuing short-lived certificates for all SSO
users to access the kubernetes cluster.

Currently, all users get a personal-$username namespace in which they
have administrative rights. Otherwise, they get no access.

In addition, we define a static CRB to allow some admins access to
everything. In the future, this will be more granular.

We also update relevant documentation.

Change-Id: Ia18594eea8a9e5efbb3e9a25a04a28bbd6a42153
2019-08-30 23:08:18 +02:00
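
What a personal-$username namespace with administrative rights amounts to is roughly a Namespace plus a RoleBinding issued alongside the short-lived certificate; a hedged jsonnet sketch (the role name and subject format are assumptions):

  // Sketch of per-user namespace admin access.
  local user = 'q3k';
  {
    namespace: {
      apiVersion: 'v1',
      kind: 'Namespace',
      metadata: { name: 'personal-' + user },
    },
    admin: {
      apiVersion: 'rbac.authorization.k8s.io/v1',
      kind: 'RoleBinding',
      metadata: { name: 'admin', namespace: 'personal-' + user },
      roleRef: {
        apiGroup: 'rbac.authorization.k8s.io',
        kind: 'ClusterRole',
        name: 'system:admin-namespaces',  // illustrative role name
      },
      subjects: [{
        apiGroup: 'rbac.authorization.k8s.io',
        kind: 'User',
        name: user + '@hackerspace.pl',  // subject format is an assumption
      }],
    },
  }
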
q3k d16454badc cert-manager: bump to v0.9.1
We just got this email:

We've been working with Jetstack, the authors of cert-manager, on a
series of fixes to the client. Cert-manager sometimes falls into a
traffic pattern where it sends really excessive traffic to Let's
Encrypt's servers, continuously. To mitigate this, we plan to start
blocking all traffic from cert-manager versions less than 0.8.0 (the
current semver minor release), as of November 1, 2019. Please upgrade
all of your cert-manager instances before then.

We're sending this email because this is the contact address of your
cert-manager instance at:

 185.236.240.37 .

Version 0.8.0 is much better but we still observe excessive traffic in
some cases. We're working with Jetstack to improve these cases. As new
versions of cert-manager are released, we will add the non-current
versions to our block list after 3 months. We strongly encourage
cert-manager users to stay up-to-date with new versions.

Also, there is an opportunity to help both Jetstack and Let's Encrypt.
Once you've upgraded, please check the logs for your cert-manager
instances from time to time. Are they making excessive requests to Let's
Encrypt (more than, say, 10 per day over multiple days)? If so, please
share details at https://github.com/jetstack/cert-manager/issues/1948 .

Thanks,
Let's Encrypt Team

Change-Id: Ic7152150ac1c96941423878c6d4b6209e07429cf
2019-08-29 17:21:49 +02:00
q3k 1fad2e5c6e bgpwtf/cccampix: draw the rest of the fucking owl
Change-Id: I49fd5906e69512e8f2d414f406edc0179522f225
2019-08-11 23:43:25 +02:00
q3k d533892efa Fix crdb-waw1
We accidentally created crdb-waw2 in
https://gerrit.hackerspace.pl/c/hscloud/+/2.

We remove it now and also backport a manual change that makes the
crdb-waw1 service public via a LoadBalancer.

Change-Id: I3bbd6f01b82c6efa458cc44776f086ba36e9f20c
2019-08-11 23:42:47 +02:00
q3k d07861b7df ceph-waw1 -> ceph-waw2
Change-Id: I03d6244b9697a9efc06492114ef90cdb01e17601
2019-08-08 17:49:31 +02:00
q3k 4d61d20aec app/registry: integrate into cluster/kube
This makes a registry be automatically part of the cluster
infrastructure.

Tested by running kubecfg diff, no diffs (apart from out-of-date ACLs)
found.

Change-Id: Ic0635e789cf3fb851f410bcf2865326f1fa87545
2019-07-21 16:56:41 +02:00
q3k 92be486f39 Revert "cluster/kube/lib/nginx: use Local traffic policy"
This reverts commit 09a0f06d2a.

Reason for revert: prevents registry from being accessible on nodes:

q3k@anathema ~/Software/hscloud $ curl registry.k0.hswaw.net
<html>
[..., ok]

[root@bc01n03:~]# curl registry.k0.hswaw.net
^C

Change-Id: I0da97aaf7a8791ea3f62c70b6c1502f4a48a300f
2019-06-29 22:58:19 +00:00
q3k 09a0f06d2a cluster/kube/lib/nginx: use Local traffic policy
Diff against prod:

  - live services nginx-system.ingress-nginx
  + config services nginx-system.ingress-nginx
    {
      "apiVersion": "v1",
      "kind": "Service",
      "metadata": {
        "annotations": {},
        "labels": {
          "app.kubernetes.io/name": "ingress-nginx",
          "app.kubernetes.io/part-of": "ingress-nginx"
        },
        "name": "ingress-nginx",
        "namespace": "nginx-system"
      },
      "spec": {
  -     "externalTrafficPolicy": "Cluster",
  +     "externalTrafficPolicy": "Local",
        "ports": [
          {
            "name": "ssh",
            "port": 22,
            "protocol": "TCP",
            "targetPort": 22
          },
          {
            "name": "http",
            "port": 80,
            "protocol": "TCP",
            "targetPort": 80
          },
          {
            "name": "https",
            "port": 443,
            "protocol": "TCP",
            "targetPort": 443
          }
        ],
        "selector": {
          "app.kubernetes.io/name": "ingress-nginx",
          "app.kubernetes.io/part-of": "ingress-nginx"
        },
        "type": "LoadBalancer"
      }
    }

Change-Id: I0dd66e3f1643efa975d6180cc163a265d4b484ef
2019-06-29 22:44:53 +02:00
q3k 543b412a65 cluster/kube/lib/nginx: add gerrit forwarding
This is already running in production since gerrit was deployed - it
just got lost during submit.

Change-Id: I8a1580b1ca3ec3142a8fa4320dc9f51a599a914f
2019-06-29 22:42:39 +02:00
q3k 184678b0f4 cluster/cube/lib/cockroachdb: clean up topology
IP addresses are not necessary in the topology definitions of a
cockroach cluster.

They were mis-committed leftovers from trying to run the cluster on
DaemonSets with hostNetworking: true.

Change-Id: I4ef1f6ed9a745efc6b05846bc13aba9d1f8dc7c8
2019-06-22 21:18:29 +00:00
q3k dec401c7dd cluster/kube/lib/cockroach: move client to deployment
This prevents a bug where kubecfg fails to update the client pod when
running a cluster/kube/cluster.jsonnet update. The pod update is
attempted because of runtime/intent differences at serviceAccounts
specification, which causes kubecfg to see a diff, which causes it to
attempt an update, which causes kube-apiserver to reject the change
(because pods are immutable), which causes kubecfg to fail.

Change-Id: I20b0ecbb264213a2eb483d475c7683b4965c82be
2019-06-22 23:14:25 +02:00
q3k c7258f4644 cluster/kube: refactor, add crdb-waw1 2019-06-21 00:24:09 +02:00
q3k e53e39a8be cluster/kube/lib/cockroachdb: use manual node pinning
We move away from the StatefulSet based deployment to manually starting
a deployment per intended node. This allows us to pin individual
instances of Cockroach to particular nodes, so that they stay
co-located with their data.
2019-06-20 23:36:35 +02:00
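
The manual pinning described above is essentially one single-replica Deployment per intended node, each with a nodeSelector on that node's hostname; a simplified sketch (node names other than bc01n03 and the image tag are illustrative):

  // Sketch: one Deployment per CockroachDB node, pinned by hostname so
  // each instance stays next to its data.
  local instance(node) = {
    apiVersion: 'apps/v1',
    kind: 'Deployment',
    metadata: { name: 'crdb-' + node },
    spec: {
      replicas: 1,
      selector: { matchLabels: { app: 'crdb-' + node } },
      template: {
        metadata: { labels: { app: 'crdb-' + node } },
        spec: {
          nodeSelector: { 'kubernetes.io/hostname': node },
          containers: [{
            name: 'cockroachdb',
            image: 'cockroachdb/cockroach:v19.1.0',  // per the later upgrade note
            // command, ports and local storage volumes elided
          }],
        },
      },
    },
  };

  [instance(n) for n in ['bc01n01', 'bc01n02', 'bc01n03']]
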
q3k 662a3cdcca cluster/kube/lib/cockroachdb: refactor
We refactor this library to:

 - support multiple databases, but with a strong suggestion of having
   one per k8s cluster
 - drop the database creation logic
 - redo naming (allowing for two options: multiple clusters per
   namespace or an exclusive namespace for the cluster)
 - unhardcode dns names
2019-06-20 19:45:03 +02:00
q3k 224a50bbfe cluster/kube/lib/cockroach: fix imports 2019-06-20 16:43:01 +02:00
q3k 3c117fa841 make cockroachdb into a cluster service 2019-06-20 16:43:01 +02:00
q3k c3b0f7627c cluster/kube: set operator replicas to 0 2019-06-20 16:42:19 +02:00
q3k 36cc4fb61a bazel-cache: deploy, add waw-hdd-yolo-1 ceph pool 2019-05-17 18:09:39 +02:00
informatic fc514a9b52 cluster/kube/cert-manager: don't add APIService when webhooks are disabled 2019-05-05 12:12:13 +02:00
informatic b187bf5b2c cluster/kube/metallb: downgrade to 0.7.3 2019-05-05 12:11:14 +02:00
q3k 321fad9865 cluster/kube/rook: lower debug 2019-04-19 14:14:36 +02:00
q3k ed2e670c8b cluster/kube/rook: bump to ceph v14 fully 2019-04-19 13:27:20 +02:00