hscloud

Author	SHA1	Message	Date
Bartosz Stebel	0156ab24ca	cluster/kube/k0: remove implr-spark bucket, add implr bucket the spark one has been an abandoned experiment from years ago, and I could use a personal one right now Change-Id: I78a706c3371d441b2f8460fd796d0cfd9a198cc6 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1464 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-02-26 16:41:23 +00:00
Piotr Dobrowolski	3b2a2a2ce1	cluster/k0: add paperless to admitomatic config Change-Id: I54df444cddca8a05febfb96af07b9e2f614639fc Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1453 Reviewed-by: q3k <q3k@hackerspace.pl>	2023-01-05 09:12:18 +00:00
Serge Bazanski	d171263d6e	k0: remove waw-hdd-yolo-3 This was never used and only caused scary warnings during OSD reboots due to lack of availability. Change-Id: I14eacd88855bc56e06f2a61cc2d914d985330852 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1423 Reviewed-by: implr <implr@hackerspace.pl>	2022-11-20 12:28:20 +00:00
Serge Bazanski	16842119d1	app/mastodon: deploy Change-Id: I88c104d1a8d5627355b01a8c48dc235635fca5ed Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1421 Reviewed-by: implr <implr@hackerspace.pl>	2022-11-18 12:15:22 +00:00
Bartosz Stebel	54a34b24a1	cluster/k0: ceph: add tape staging Change-Id: I7fdba86b15f92157888850d2905440b45fb36f17 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1263 Reviewed-by: q3k <q3k@hackerspace.pl>	2022-03-05 22:45:29 +00:00
Serge Bazanski	bdd403c587	cluster: k0: move cockroachdb away from bc01n01, fixup joins Reminded by a power failure on bc01n0{1,2}, we migrate away from at least one of them into another server. We also fix up the startup join parameter to not include the node itself (which is not necessary, but a nice thing to have nonetheless). Since bc01n01 was the initial node of the cluster, we also disable the init job for k0 (which we don't care about anyway). Change-Id: I3406471c0f9542e9d802d39138e400b5a5e74794 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1176 Reviewed-by: q3k <q3k@hackerspace.pl>	2021-12-13 22:30:46 +00:00
Piotr Dobrowolski	e839f95079	cluster/kube/k0: add matrix and informatic personal ceph users Change-Id: Ied8d474709b8053e9fc339435d3ca1ca5fdfa710	2021-09-14 22:21:22 +02:00
Serge Bazanski	38f72fe094	cluster: k0: move ceph-waw3 to proper realm/zonegroup With this we can use Ceph's multi-site support to easily migrate to our new k0 Ceph cluster. This migration was done by using radosgw-admin to rename the existing realm/zonegroup to the new names (hscloud and eu), and then reworking the jsonnet so that the Rook operator would effectively do nothing. It sounds weird that creating a bunch of CRs like Object{Realm,ZoneGroup,Zone} would be a no-op for the operator, but that's how Rook works - a CephObjectStore generally creates everything that the above CRs would create too, but implicitly. Adding the extra CRs just allows specifying extra settings, like names. (it wasn't fully a no-op, as the rgw daemon is parametrized by realm/zonegroup/zone names, so that had to be restarted) We also make the radosgw serve under object.ceph-eu.hswaw.net, which allows us to start using a zonegroup URL right away instead of the zone-only URL. Change-Id: I4dca55a705edb3bd28e54f50982c85720a17b877	2021-09-14 21:39:39 +02:00
Serge Bazanski	085a8ff247	cluster: k0: upgrade to ceph 16.2.5 This was fun. See b/6 for a log of how swimmingly this went. Change-Id: I96c3c18b5d33ef86523b3506f49a390419e9ca7f	2021-09-14 21:39:39 +02:00
Serge Bazanski	464fb04f39	cluster: k0: bump rook to 1.6 This is needed to get Rook to talk to an external Ceph 16/Pacific cluster. This is mostly a bunch of CRD/RBAC changes. Most notably, we yeet our own CRD rewrite and just slurp in upstream CRD defs. Change-Id: I08e7042585722ae4440f97019a5212d6cf733fcc	2021-09-14 21:39:37 +02:00
Serge Bazanski	4f0468fa26	cluster/kube: remove ceph diff against k0 production This now has a zero diff against prod. location fields in CephCluster.storage.nodes seem to have been removed from the CRD at some point. Not sure how the CRUSH tree now gets populated, but whatever, it's been working like this for a while already. Same for CephObjectStore.gateway.type. The Rook Operator has been zero-scaled for a while now due to b/6. Change-Id: I30a836f273f4c1529f60fa9297c96b7aac412f59	2021-09-11 12:43:53 +00:00
Serge Bazanski	99b91b11f1	cluster/k0/admitomatic: add .hswaw.net to hswaw-prod namespace This was preventing certificate refresh in the hswaw-prod mirko ingress. Change-Id: I14b18b642a3948a9864e2d9a90b2a2b2c145b9b1	2021-03-28 17:34:34 +00:00
Serge Bazanski	bf266c6aaf	cluster/k0: add dns crdb user In preparation for running PowerDNS on k0. Change-Id: I853c7465a6a32d02628fa6cfdeb445eb9937b3be	2021-03-17 21:49:00 +00:00
Serge Bazanski	64de7afe32	cluster/kube/k0: fix syntax errors This happened in `793ca1b3` and slipped past review. Change-Id: Ie31f0e1ec03d6e4545d6683b21f528550bf4ef9f	2021-03-17 21:47:51 +00:00
Serge Bazanski	793ca1b3b2	cluster/kube: limit OSDs in ceph-waw3 to 8GB RAM Each OSD is connected to a 6TB drive, and with the good ol' 1TB storage -> 1GB RAM rule of thumb for OSDs, we end up with 6GB. Or, to round up, 8GB. I'm doing this because over the past few weeks OSDs in ceph-waw3 have been using a _ton_ of RAM. This will probably not prevent that (and instead they will OOM more often :/), but it will at least prevent us from wasting resources (k0 started migrating pods to other nodes, and running full nodes like that without an underlying request makes for a terrible draining experience). We need to get to the bottom of why this is happening in the first place, though. Did this happen as we moved to containerd? Followup: b.hswaw.net/29 Already deployed to production. Change-Id: I98df63763c35017eb77595db7b9f2cce71756ed1	2021-03-07 00:09:58 +00:00
Serge Bazanski	877cf0af26	🅱️ Fixes b/8 Change-Id: I5a5779c3688451d89c0601dc913143d75048c9f6	2021-02-08 15:10:11 +00:00
Serge Bazanski	3c5d836c56	cluster/kube: deploy admitomatic This doesn't yet enable a webhook, but deploys admitomatic itself. Change-Id: Id177bc8841c873031f9c196b8ff3c12dd846ba8e	2021-02-07 19:19:02 +00:00
Piotr Dobrowolski	f4a6a56662	cluster/kube/k0: add issues.hackerspace.pl crdb user Change-Id: If78f795e0e35360b65c666e6b217037fc34a2ccf	2021-02-01 21:32:25 +01:00
Piotr Dobrowolski	3b8a43f35d	cluster/kube/k0: add issues.hackerspace.pl ceph s3 user Change-Id: If5eef3404bdc08ded88e46f45bad0f9abcdb0f1c	2021-02-01 21:19:59 +01:00
Patryk Jakuszew	edf14cc5f4	crdb: replace bc01n03 with dcr01s22, upgrade to v20.2.4 This change reflects the current production state. Upgrade was done by going through following versions: 19.1.0 -> 19.2.12 -> 20.1.10 -> 20.2.4 Change-Id: I8b33b8116363f1a918423fd18ba3d1b5c910851c	2021-01-23 23:00:29 +01:00
Patryk Jakuszew	f3153888a8	cluster/kube: Add k0-cockroach.jsonnet, add Gitea client cert Change-Id: Ibc5db1b0114b2540b6dc806e75e9a36cf9a3bc50	2021-01-23 15:38:50 +01:00
Serge Bazanski	61f978a0a0	: tear down ceph-waw2 It reached the stage of being crapped out so much that the OSDs' spurious IOPS killed the performance of disks colocated on the same M610 RAID controllers. This made etcd _very_ slow, to the point of churning through re-elections due to timeouts. etcd/apiserver latencies, observe the difference at ~15:38: https://object.ceph-waw3.hswaw.net/q3k-personal/4fbe8d4cfc8193cad307d487371b4e44358b931a7494aa88aff50b13fae9983c.png I moved gerrit/ and matrix/appservice-irc-freenode PVCs to ceph-waw3 by hand. The rest were non-critical so I removed them, they can be recovered from benji backups if needed. Change-Id: Iffbe87aefc06d8324a82b958a579143b7dd9914c	2021-01-22 16:26:09 +01:00
Serge Bazanski	3b9ee5f1c0	ceph: bump to 14.2.16 More as-builts. This has already been bumped. Had to coax ceph-waw2 to upgrade despite the fact that it's horribly broken. Change-Id: Ia762f5d7d88d6420c2fc25cf199037cbccde0cb3	2021-01-19 21:45:26 +00:00
Serge Bazanski	cf842b0442	k0: reflect reality This is after the monster^Wrook outage of the week two weeks ago caused by bc01n03 dying. Plan is to migrate ceph-waw3 to be external, yeet ceph-waw2, and extend crdb-waw1 to another node. Change-Id: I133af3b1171fea383b45bf06c51e48a5c40341e4	2021-01-19 20:08:26 +01:00
Patryk Jakuszew	cae7cf776f	k0: add missing curly brace termination in woju's S3 user name Change-Id: Ib2752d798f6e23493daee446a834e244f858330e	2020-11-28 14:36:48 +01:00
Patryk Jakuszew	34668a5b7b	k0: add cz3's personal s3 user Change-Id: I51ee80eb05c34cfd8b03e15fcaefb5f235587c50	2020-11-28 13:45:25 +01:00
Serge Bazanski	bfe9bb0e3a	k0: add woju's personal s3 user Change-Id: I8ed5bb5428594b74460f1b89185d684cb6c26268	2020-10-27 20:50:50 +01:00
Serge Bazanski	a5ed644980	k0.hswaw.net: pass metallb through Calico Previously, we had the following setup: [diagram: dcsw01 <----- metallb, on each of dcr01s22, dcr01s24, ...] Ie., each metallb on each node directly talked to dcsw01 over BGP to announce ExternalIPs to our L3 fabric. Now, we rejigger the configuration to instead have Calico's BIRD instances talk BGP to dcsw01, and have metallb talk locally to Calico. [diagram: dcsw01 <----- Calico <-- metallb, both within dcr01s24] This makes Calico announce our pod/service networks into our L3 fabric! Calico and metallb talk to each other over 127.0.0.1 (they both run with Host Networking), but that requires one side to flip to passive mode. We chose to do that with Calico, by overriding its BIRD config and special-casing any 127.0.0.1 peer to enable passive mode. We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle with the kernel programming filter (ie. to-kernel-routing-table filter), where we disable programming unreachable routes. This is because routes coming from metallb have their next-hop set to 127.0.0.1, which makes bird mark them as unreachable. Unreachable routes in the kernel will break local access to ExternalIPs, eg. register access from containerd. All routes pass through without route reflectors and a full mesh as we use eBGP over private ASNs in our fabric. We also have to make Calico aware of metallb pools - otherwise, routes announced by metallb end up being filtered by Calico. This is all mildly hacky. Here's hoping that Calico will be able to some day gain metallb-like functionality, ie. IPAM for externalIPs/LoadBalancers/... There seems, however, to be one problem with this change (but I'm not fixing it yet as it's not critical): metallb would previously only announce IPs from nodes that were serving that service. Now, however, the Calico internal mesh makes those appear from every node. This can probably be fixed by disabling local meshing, enabling route reflection on dcsw01 (to recreate the mesh routing through dcsw01). Or, maybe by some more hacking of the Calico BIRD config :/. Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836	2020-09-23 18:55:12 +00:00
Serge Bazanski	242ec58a33	k0: add waw-hdd-redundant-q3k-3 Change-Id: Id3718877d1e67d48c6726d7649a565db657cfc82	2020-09-20 15:36:24 +00:00
Serge Bazanski	3d29484ebb	k0: move registry to ceph-waw3 ceph-waw2 currently has some production issues [1] which have started to cause write failures in the registry. The registry is the only user of ceph-waw2's affected pool, so we reduce the dumpster fire blast radius by moving it over to ceph-waw3. This has already been deployed and data has been migrated over (via s3cmd sync), and the migration has been verified (by a push and pull, and pull of an older image). [1] - pgs stuck inactive in the object storage pool Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347	2020-08-04 01:36:51 +02:00
Serge Bazanski	509ab6e29a	k0/cockroach: add public DNS entry for cockroach Change-Id: I934bf348e2165148b515b709e853ab67f039a402	2020-07-30 22:56:30 +02:00
Sergiusz Bazanski	b1aadd88ff	k0: add q3k's personal s3 user Change-Id: I5681774e1dca2cf4a865d9e1a24602ed4334f006	2020-06-24 17:19:36 +00:00
Bartosz Stebel	d9df5879e3	add radosgw bucket for spark Change-Id: Id8ea8901ce038ccbf11afabe0e6272c358b32cf2	2020-06-13 21:31:56 +02:00
Sergiusz Bazanski	dbfa988c73	cluster/kube: split up cluster.jsonnet It was getting large and unwieldy (to the point where kubecfg was slow). In this change, we: - move the Cluster function to cluster.libsonnet - move the Cluster instantiation into k0.libsonnet - shuffle some fields around to make sure things are well split between k0-specific and general cluster configs. - add 'view' files that build on 'cluster.libsonnet' to allow rendering either the entire k0 state, or some subsets (for speed) - update the documentation, drive-by some small fixes and reindentation Change-Id: I4b8d920b600df79100295267efe21b8c82699d5b	2020-06-13 19:51:58 +02:00

34 commits