hscloud

mirror of https://gerrit.hackerspace.pl/hscloud

Author	SHA1	Message	Date
radex	d45584aa6d	kube: clean up SimpleIngress Rename `target_service` to `target` to mirror Service's `target`; rename `extra_paths` to `extraPaths` to follow the camelCase convention used everywhere except for a few places in kube.upstream (assumed to be a mistake) Change-Id: Icfcb70ef889e3359bf0391c465034817f4b70cce Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1809 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-12-04 20:33:10 +00:00
radex	36964dca3b	kube: clean up PersistentVolumeClaims There's no difference as far as jsonnet is concerned, but it may confuse newbies, as Service and SimpleIngress use double colon for their top-level kube helpers. This also removes any ambiguity as to whether this is manifested in final JSON. So we can make that a convention. Change-Id: I01ad4ea63f4d5d8ee6e5d41c79637ba186548c6f Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1803 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-11-24 20:37:53 +00:00
radex	8b8f3876a9	kube: add target:: convenience field to Service Change-Id: If69116d93b6074136a36d98973e1aa997e2ebbef Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1802 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-11-24 20:37:48 +00:00
radex	f28cd62c0e	*: Simplify kube.PersistentVolumeClaims Change-Id: I0a3e44de9f1c4db146fd1e493741f5fe381da3ae Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1768 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-11-18 12:36:00 +00:00
implr	6f1fda4329	cluster/k/l/cockroach: make publicService select all nodes Change-Id: I705b89057f9c191eb62771e3683224376b2207a1 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1762 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-11-01 23:30:52 +00:00
q3k	ab2e470bd3	cluster/kube: generate namespaces in NamespaceAdmins Change-Id: I37981a4d8d7cf9b85b9b9ab8cfdfc6c66eaa4453 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1760 Reviewed-by: radex <radex@hackerspace.pl>	2023-10-31 10:52:01 +00:00
q3k	633fb2e8ce	cluster/admitomatic: deploy Change-Id: Id08c4b428a9c01b310b69396890083f999090928 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1749 Reviewed-by: radex <radex@hackerspace.pl>	2023-10-28 20:12:30 +00:00
radex	f5844311eb	*/kube: Add kube.SimpleIngress Change-Id: Iddcac629b9938f228dd93b32e58bb14606d5c6e5 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1745 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-10-28 17:55:48 +00:00
radex	0776a79df3	cluster/kube: Centralize namespace admin RoleBindings Change-Id: Iec3505b2f4a1647e67cf47cf189c77534b5be6ac Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1696 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-10-10 17:34:22 +00:00
q3k	03c2d996a0	cluster: fix prodvider deploy (after new CA) Change-Id: Icbdb5e3ac592e9eac3a033ba50af401b706c3e78 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1541 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-07-24 14:15:46 +00:00
informatic	10384cd394	cluster/registry: fix common namespaces Public pull ACL in the middle had priority over our more specific rules - moving these to the top fixes common registry namespace ACLs. Change-Id: Ia6f05cef09c0db4eb71155d2c0e2d9944b81f903 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1522 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-06-19 23:15:37 +00:00
q3k	c1f372561a	cluster/admitomatic: implement opt-out namespaces Change-Id: I32d4b019211fa755e2b3b103b88ea3f4c14e500f Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1521 Reviewed-by: informatic <informatic@hackerspace.pl>	2023-06-19 22:54:33 +00:00
informatic	7e841065b0	*: post-certmanager manifests update Change-Id: I745c850268c31777c5722a9833c8152a55615aed Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1512 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-06-19 21:20:44 +00:00
q3k	3dd3ff5dcd	cluster/cert-manager: update to v1.5.0 Change-Id: I7a4cdadc9956141292302bc004d09d6e9e22855e Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1497 Reviewed-by: informatic <informatic@hackerspace.pl>	2023-05-26 10:38:16 +00:00
q3k	073d850a95	cluster/prodvider: redeploy Change-Id: I7a6cce06bb7c2f495d5354d3a2bebef64e307e42 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1491 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-04-01 11:18:25 +00:00
q3k	3a6d67e0c4	cluster/prodvider: rewrite against x509 lib for ed25519 support This gets rid of cfssl for the kubernetes bits of prodvider, instead using plain crypto/x509. This also allows us to support our new fancy ED25519 CA. Change-Id: If677b3f4523014f56ea802b87499d1c0eb6d92e9 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1489 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-03-31 22:53:59 +00:00
implr	0173f501d7	cockroach: v20.2 -> v21.1 Following https://www.cockroachlabs.com/docs/v21.1/upgrade-cockroach-version?filters=linux --logtostderr is deprecated/removed, but AFAICT from the default config it will still log there: https://www.cockroachlabs.com/docs/v21.1/configure-logs#default-logging-configuration Change-Id: I7fb3f835693f955b37de24dc581140ea34b11630 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1461 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-01-30 21:16:42 +00:00
implr	4d98cf5ca8	calico: move from etcd to crd Leaving the CRD definitions as YAML, extracted without modifications from the original install file - this should make upgrades simpler. Change-Id: I7211d2711e2af014b36dd887a951abb9e1032eb9 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1179 Reviewed-by: q3k <q3k@hackerspace.pl>	2022-11-19 21:40:34 +00:00
q3k	437b0c335f	rook: fix benji This unforks benji back into upstream. The old fork didn't support a new authentication method on Ceph, and we don't have multiple clusters anymore (so we don't need the functionality of the fork). Change-Id: Ie79313b2321ca2e22ad2874b75a71385af95105f Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1321 Reviewed-by: informatic <informatic@hackerspace.pl>	2022-06-19 11:49:12 +00:00
q3k	b0e3693c0e	cluster/kube: calico: fix etcd endpoints Change-Id: Ia93d355ca343fa5a42ec37fbcae9135cb5304f6e Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1285 Reviewed-by: implr <implr@hackerspace.pl>	2022-06-11 19:00:52 +00:00
q3k	bdd403c587	cluster: k0: move cockroachdb away from bc01n01, fixup joins Reminded by a power failure on bc01n0{1,2}, we migrate away from at least one of them into another server. We also fix up the startup join parameter to not include the node itself (which is not necessary, but a nice thing to have nonetheless). Since bc01n01 was the initial node of the cluster, we also disable the init job for k0 (which we don't care about anyway). Change-Id: I3406471c0f9542e9d802d39138e400b5a5e74794 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1176 Reviewed-by: q3k <q3k@hackerspace.pl>	2021-12-13 22:30:46 +00:00
implr	eca1e080d7	calico: restore CNI_NET_DIR Change-Id: I04e17f8639505f5b7cc42e86392abc175b7922db Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1178 Reviewed-by: q3k <q3k@hackerspace.pl>	2021-12-03 03:10:13 +00:00
implr	12f176c1eb	calico 3.14 -> 1.15 Change-Id: I9eceaf26017e483235b97c8d08717d2750fabe25 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/995 Reviewed-by: q3k <q3k@hackerspace.pl>	2021-11-20 22:12:52 +00:00
q3k	4b8ee32246	cluster/kube: always enable flexdriver Documentation says [1] this is disabled by default in 1.1, but that documentation kinda lies [2]. [1] - `235d5a384b/Documentation/flexvolume.md (ceph-flexvolume-configuration)` [2] - `64e28af741 (diff-d1eb5cba50e3770b61ccd3c730cd40514053e1da0233dfe09b5e7967e76a2a6cL424-L425)` Change-Id: Ia92c99e137ed751db62c0f56d42c4901986d0bb8	2021-09-14 21:39:39 +02:00
q3k	38f72fe094	cluster: k0: move ceph-waw3 to proper realm/zonegroup With this we can use Ceph's multi-site support to easily migrate to our new k0 Ceph cluster. This migration was done by using radosgw-admin to rename the existing realm/zonegroup to the new names (hscloud and eu), and then reworking the jsonnet so that the Rook operator would effectively do nothing. It sounds weird that creating a bunch of CRs like Object{Realm,ZoneGroup,Zone} realm would be a no-op for the operator, but that's how Rook works - a CephObjectStore generally creates everything that the above CRs would create too, but implicitly. Adding the extra CRs just allows specifying extra settings, like names. (it wasn't fully a no-op, as the rgw daemon is parametrized by realm/zonegroup/zone names, so that had to be restarted) We also make the radosgw serve under object.ceph-eu.hswaw.net, which allows us to right away start using a zonegroup URL instead of the zone-only URL. Change-Id: I4dca55a705edb3bd28e54f50982c85720a17b877	2021-09-14 21:39:39 +02:00
q3k	085a8ff247	cluster: k0: upgrade to ceph 16.2.5 This was fun. See b/6 for a log of how swimmingly this went. Change-Id: I96c3c18b5d33ef86523b3506f49a390419e9ca7f	2021-09-14 21:39:39 +02:00
q3k	464fb04f39	cluster: k0: bump rook to 1.6 This is needed to get Rook to talk to an external Ceph 16/Pacific cluster. This is mostly a bunch of CRD/RBAC changes. Most notably, we yeet our own CRD rewrite and just slurp in upstream CRD defs. Change-Id: I08e7042585722ae4440f97019a5212d6cf733fcc	2021-09-14 21:39:37 +02:00
q3k	6579e842b0	kartongips: paper over^W^Wfix CRD updates Ceph CRD updates would fail with: ERROR Error updating customresourcedefinitions cephclusters.ceph.rook.io: expected kind, but got map This wasn't just https://github.com/bitnami/kubecfg/issues/259 . We pull in the 'solution' from Pulumi (https://github.com/pulumi/pulumi-kubernetes/pull/622) which just retries the update via a JSON update instead, and that seems to have worked. We also add some better error return wrapping, which I used to debug this issue properly. Oof. Change-Id: I2007a7857e44128d74760174b61b59efa58e9cbc	2021-09-11 20:54:34 +00:00
q3k	4f0468fa26	cluster/kube: remove ceph diff against k0 production This now has a zero diff against prod. location fields in CephCluster.storage.nodes seem to have been removed from the CRD at some point. Not sure how the CRUSH tree now gets populated, but whatever, it's been working like this for a while already. Same for CephObjectStore.gateway.type. The Rook Operator has been zero-scaled for a while now due to b/6. Change-Id: I30a836f273f4c1529f60fa9297c96b7aac412f59	2021-09-11 12:43:53 +00:00
q3k	89a16f4de4	cluster/admitomatic: allow use-regex n-i-c annotation This annotation is used to permit routes defined by regexes instead of simple prefix matching. This is used by our synapse deployment for routing incoming HTTP requests to different Synapse components. I've stumbled upon this while deploying a new Matrix/Synapse instance. This hasn't been a problem yet because the existing ingresses for Matrix deployments predate admitomatic. Change-Id: I821e58b214450ccf0de22d2585c3b0d11fbe71c0	2021-06-06 12:58:11 +00:00
q3k	7251f2720e	Merge changes Ib068109f,I9a00487f,I1861fe7c,I254983e5,I3e2bedca, ... * changes: cluster/identd/ident: update README cluster/kube: deploy identd cluster/identd: implement cluster/identd/kubenat: implement cluster/identd/cri: import cluster/identd/ident: add TestE2E cluster/identd/ident: add Query function cluster/identd/ident: add IdentError cluster/identd/ident: add basic ident protocol server cluster/identd/ident: add basic ident protocol client	2021-05-28 23:08:10 +00:00
q3k	2414afe3c0	cluster/kube: deploy identd Change-Id: I9a00487fc4a972ecb0904055dbaaab08221062c1	2021-05-26 19:46:09 +00:00
q3k	e17f7edde0	cluster/kube: nginx: add Hscloud-Nic-Source-* headers These can be used by production jobs to get the source port of the client connecting over HTTP. A followup CR implements just that. Change-Id: Ic8e29eaf806bb196d8cfcfb604ff66ae4d0d166a	2021-05-22 19:16:39 +00:00
q3k	ba2f4d8215	cluster/prodvider: deploy Change-Id: I01d931a664e4b09c0d75fb01fb3f2528bc0f1a53	2021-05-19 22:13:26 +00:00
q3k	5ae5cbec81	Merge "cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k"	2021-05-19 15:34:45 +00:00
q3k	2e8d24b84a	cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k This fixes CVE-2021-3450 and CVE-2021-3449. Deployed on prod: $ kubectl -n nginx-system exec nginx-ingress-controller-5c69c5cb59-2f8v4 -- openssl version OpenSSL 1.1.1k 25 Mar 2021 Change-Id: I7115fd2367cca7b687c555deb2134b22d19a291a	2021-03-25 18:16:13 +00:00
q3k	3b8935378a	cluster/crdb: make init job 'idempotent' This enables its redeployment with a newer crdb image. Change-Id: If039992674f401af53738c80d22cc2ca2818fe00	2021-03-17 21:48:30 +00:00
q3k	943ab5b1a6	cluster/admitomatic: allow whitelist-source-range Without this, cert-manager gets stuck. Deployed to prod. Change-Id: I356cd44f455b6f4aecea9ae396f6a05e1a727859	2021-02-07 23:35:28 +00:00
q3k	41bbf1436a	cluster/kube: deploy admitomatic webhook This has been (successfully) tested on prod and then rolled back. Change-Id: I22657f66b4aeaa8a0ae452035ba18a79f4549b14	2021-02-07 19:19:23 +00:00
q3k	3c5d836c56	cluster/kube: deploy admitomatic This doesn't yet enable a webhook, but deploys admitomatic itself. Change-Id: Id177bc8841c873031f9c196b8ff3c12dd846ba8e	2021-02-07 19:19:02 +00:00
patryk	edf14cc5f4	crdb: replace bc01n03 with dcr01s22, upgrade to v20.2.4 This change reflects the current production state. Upgrade was done by going through following versions: 19.1.0 -> 19.2.12 -> 20.1.10 -> 20.2.4 Change-Id: I8b33b8116363f1a918423fd18ba3d1b5c910851c	2021-01-23 23:00:29 +01:00
q3k	3b9ee5f1c0	ceph: bump to 14.2.16 More as-builts. This has already been bumped. Had to coax ceph-waw2 to upgrade despite the fact that it's horribly broken. Change-Id: Ia762f5d7d88d6420c2fc25cf199037cbccde0cb3	2021-01-19 21:45:26 +00:00
q3k	2c04c8410a	rook: bump to 1.2.7 As-built: deployed to ceph-waw{2,3} already. Change-Id: I27189b273cf72638cf2036681054832db99591da	2021-01-19 21:41:13 +01:00
q3k	f18a531f9b	prodvider: bump to Go 1.15.5 Change-Id: I0f7999deb571aef12533f0ceee21c0283bc0bdc4	2020-11-27 09:50:09 +00:00
q3k	c7de7e562f	cluster: do not export metallb routes to mesh peers This prevents metallb routes being announced from all peers to our ToR, thereby preventing issues with traffic hitting services with externalTrafficPolicy: local. There still is the from-host loopback issue, but that will be fixed by upgrading to kube 1.15. Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04	2020-10-03 14:56:52 +00:00
q3k	f0acf16564	prodvider: use SANs in service certificates This fixes compatibility with prodaccess tools built with Go 1.15, which introduced 'X.509 CommonName deprecation' [1]. [1] - https://golang.org/doc/go1.15#commonname Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3	2020-10-03 14:56:35 +00:00
q3k	a5ed644980	k0.hswaw.net: pass metallb through Calico Previously, we had the following setup: [diagram: metallb on each node (dcr01s22, dcr01s24, ...) --> dcsw01] Ie., each metallb on each node directly talked to dcsw01 over BGP to announce ExternalIPs to our L3 fabric. Now, we rejigger the configuration to instead have Calico's BIRD instances talk BGP to dcsw01, and have metallb talk locally to Calico. [diagram: on dcr01s24, metallb --> Calico --> dcsw01] This makes Calico announce our pod/service networks into our L3 fabric! Calico and metallb talk to each other over 127.0.0.1 (they both run with Host Networking), but that requires one side to flip to passive mode. We chose to do that with Calico, by overriding its BIRD config and special-casing any 127.0.0.1 peer to enable passive mode. We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle with the kernel programming filter (ie. to-kernel-routing-table filter), where we disable programming unreachable routes. This is because routes coming from metallb have their next-hop set to 127.0.0.1, which makes bird mark them as unreachable. Unreachable routes in the kernel will break local access to ExternalIPs, eg. register access from containerd. All routes pass through without route reflectors and a full mesh as we use eBGP over private ASNs in our fabric. We also have to make Calico aware of metallb pools - otherwise, routes announced by metallb end up being filtered by Calico. This is all mildly hacky. Here's hoping that Calico will be able to some day gain metallb-like functionality, ie. IPAM for externalIPs/LoadBalancers/... 
There seems to be however one problem with this change (but I'm not fixing it yet as it's not critical): metallb would previously only announce IPs from nodes that were serving that service. Now, however, the Calico internal mesh makes those appear from every node. This can probably be fixed by disabling local meshing, enabling route reflection on dcsw01 (to recreate the mesh routing through dcsw01). Or, maybe by some more hacking of the Calico BIRD config :/. Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836	2020-09-23 18:55:12 +00:00
q3k	059fdfed3b	k0: add resource requests/limits to nginx, remove gitea We just had an outage seemingly caused by N-I-C sending tons of traffic to gitea, which in turn caused N-I-C to balloon in memory/CPU usage. I haven't debugged the cause of this traffic, but I have disabled the gitea TCP forward to Stop The Bleeding. This change reflects ad-hoc production changes. Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9	2020-09-20 22:53:40 +00:00
q3k	0581bbf8a0	games/factorio: add modproxy This adds a mod proxy system, called, well, modproxy. It sits between Factorio server instances and the Factorio mod portal, allowing for arbitrary mod download without needing the servers to know Factorio credentials. Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212	2020-08-14 13:03:46 +02:00
q3k	3d29484ebb	k0: move registry to ceph-waw3 ceph-waw2 currently has some production issues [1] which have started to cause write failures in the registry. The registry is the only user of ceph-waw2's affected pool, so we reduce the dumpster fire blast radius by moving it over to ceph-waw3. This has already been deployed and data has been migrated over (via s3cmd sync), and the migration has been verified (by a push and pull, and pull of an older image). [1] - pgs stuck inactive in the object storage pool Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347	2020-08-04 01:36:51 +02:00

116 Commits (7a4c27d28cfa9c90473ec3092e122887400c40e6)