Commit Graph

110 Commits (a6592b845c6c2417c531bb8dc904286919b8acd7)

q3k 633fb2e8ce cluster/admitomatic: deploy
Change-Id: Id08c4b428a9c01b310b69396890083f999090928
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1749
Reviewed-by: radex <radex@hackerspace.pl>
2023-10-28 20:12:30 +00:00
radex f5844311eb */kube: Add kube.SimpleIngress
Change-Id: Iddcac629b9938f228dd93b32e58bb14606d5c6e5
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1745
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-10-28 17:55:48 +00:00
radex 0776a79df3 cluster/kube: Centralize namespace admin RoleBindings
Change-Id: Iec3505b2f4a1647e67cf47cf189c77534b5be6ac
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1696
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-10-10 17:34:22 +00:00
q3k 03c2d996a0 cluster: fix prodvider deploy (after new CA)
Change-Id: Icbdb5e3ac592e9eac3a033ba50af401b706c3e78
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1541
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-07-24 14:15:46 +00:00
informatic 10384cd394 cluster/registry: fix common namespaces
The public pull ACL in the middle had priority over our more specific rules
- moving these to the top fixes the common registry namespace ACLs.
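
To illustrate why rule order matters here, a minimal first-match-wins ACL
evaluation sketch (the rule shape and field names are illustrative, not the
actual registry configuration):

    package registryacl

    import "strings"

    // aclRule is a hypothetical, simplified registry ACL rule: it matches a
    // user class and an image name prefix, and allows or denies the pull.
    type aclRule struct {
        anonymousOnly bool   // rule only applies to anonymous users
        prefix        string // image name prefix; "" matches everything
        allow         bool
    }

    // allowPull evaluates rules top to bottom and returns the first match -
    // which is why a broad public-pull rule sitting above the more specific
    // per-namespace rules ends up deciding everything.
    func allowPull(rules []aclRule, anonymous bool, image string) bool {
        for _, r := range rules {
            if r.anonymousOnly && !anonymous {
                continue
            }
            if strings.HasPrefix(image, r.prefix) {
                return r.allow
            }
        }
        return false
    }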

Change-Id: Ia6f05cef09c0db4eb71155d2c0e2d9944b81f903
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1522
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-06-19 23:15:37 +00:00
q3k c1f372561a cluster/admitomatic: implement opt-out namespaces
Change-Id: I32d4b019211fa755e2b3b103b88ea3f4c14e500f
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1521
Reviewed-by: informatic <informatic@hackerspace.pl>
2023-06-19 22:54:33 +00:00
informatic 7e841065b0 *: post-certmanager manifests update
Change-Id: I745c850268c31777c5722a9833c8152a55615aed
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1512
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-06-19 21:20:44 +00:00
q3k 3dd3ff5dcd cluster/cert-manager: update to v1.5.0
Change-Id: I7a4cdadc9956141292302bc004d09d6e9e22855e
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1497
Reviewed-by: informatic <informatic@hackerspace.pl>
2023-05-26 10:38:16 +00:00
q3k 073d850a95 cluster/prodvider: redeploy
Change-Id: I7a6cce06bb7c2f495d5354d3a2bebef64e307e42
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1491
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-04-01 11:18:25 +00:00
q3k 3a6d67e0c4 cluster/prodvider: rewrite against x509 lib for ed25519 support
This gets rid of cfssl for the kubernetes bits of prodvider, instead
using plain crypto/x509. This also allows us to support our new fancy
ED25519 CA.
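
A minimal sketch of issuing a leaf certificate from an ED25519 CA with plain
crypto/x509 (subjects, lifetimes and key usages here are illustrative, not
prodvider's actual values):

    package main

    import (
        "crypto/ed25519"
        "crypto/rand"
        "crypto/x509"
        "crypto/x509/pkix"
        "encoding/pem"
        "log"
        "math/big"
        "os"
        "time"
    )

    func main() {
        // Generate an ED25519 CA keypair and self-sign a CA certificate.
        caPub, caPriv, err := ed25519.GenerateKey(rand.Reader)
        if err != nil {
            log.Fatal(err)
        }
        caTmpl := &x509.Certificate{
            SerialNumber:          big.NewInt(1),
            Subject:               pkix.Name{CommonName: "sketch-ca"},
            NotBefore:             time.Now(),
            NotAfter:              time.Now().Add(10 * 365 * 24 * time.Hour),
            IsCA:                  true,
            KeyUsage:              x509.KeyUsageCertSign,
            BasicConstraintsValid: true,
        }
        caDER, err := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, caPub, caPriv)
        if err != nil {
            log.Fatal(err)
        }
        caCert, err := x509.ParseCertificate(caDER)
        if err != nil {
            log.Fatal(err)
        }

        // Issue a Kubernetes client certificate signed by that CA; the group
        // goes into Organization, the username into CommonName.
        leafPub, _, err := ed25519.GenerateKey(rand.Reader)
        if err != nil {
            log.Fatal(err)
        }
        leafTmpl := &x509.Certificate{
            SerialNumber: big.NewInt(2),
            Subject:      pkix.Name{CommonName: "example-user", Organization: []string{"example-group"}},
            NotBefore:    time.Now(),
            NotAfter:     time.Now().Add(30 * 24 * time.Hour),
            KeyUsage:     x509.KeyUsageDigitalSignature,
            ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
        }
        leafDER, err := x509.CreateCertificate(rand.Reader, leafTmpl, caCert, leafPub, caPriv)
        if err != nil {
            log.Fatal(err)
        }
        pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: leafDER})
    }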

Change-Id: If677b3f4523014f56ea802b87499d1c0eb6d92e9
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1489
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-03-31 22:53:59 +00:00
implr 0173f501d7 cockroach: v20.2 -> v21.1
Following https://www.cockroachlabs.com/docs/v21.1/upgrade-cockroach-version?filters=linux
--logtostderr is deprecated/removed, but AFAICT from the default config
it will still log there: https://www.cockroachlabs.com/docs/v21.1/configure-logs#default-logging-configuration

Change-Id: I7fb3f835693f955b37de24dc581140ea34b11630
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1461
Reviewed-by: q3k <q3k@hackerspace.pl>
2023-01-30 21:16:42 +00:00
implr 4d98cf5ca8 calico: move from etcd to crd
Leaving the CRD definitions as YAML, extracted without modifications
from the original install file - this should make upgrades simpler.

Change-Id: I7211d2711e2af014b36dd887a951abb9e1032eb9
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1179
Reviewed-by: q3k <q3k@hackerspace.pl>
2022-11-19 21:40:34 +00:00
q3k 437b0c335f rook: fix benji
This unforks benji back into upstream. The old fork didn't support a new
authentication method on Ceph, and we don't have multiple clusters
anymore (so we don't need the functionality of the fork).

Change-Id: Ie79313b2321ca2e22ad2874b75a71385af95105f
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1321
Reviewed-by: informatic <informatic@hackerspace.pl>
2022-06-19 11:49:12 +00:00
q3k b0e3693c0e cluster/kube: calico: fix etcd endpoints
Change-Id: Ia93d355ca343fa5a42ec37fbcae9135cb5304f6e
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1285
Reviewed-by: implr <implr@hackerspace.pl>
2022-06-11 19:00:52 +00:00
q3k bdd403c587 cluster: k0: move cockroachdb away from bc01n01, fixup joins
Prompted by a power failure on bc01n0{1,2}, we migrate away from at
least one of them onto another server.

We also fix up the startup join parameter to not include the node itself
(which is not necessary, but a nice thing to have nonetheless).

Since bc01n01 was the initial node of the cluster, we also disable the
init job for k0 (which we don't care about anyway).
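
A sketch of the join-flag logic described above, i.e. building a --join list
that excludes the node itself (hostnames and the default port 26257 are used
for illustration only):

    package crdbjoin

    import "strings"

    // joinFlag returns a CockroachDB --join flag listing all cluster nodes
    // except the node it is generated for. Including the node itself is
    // harmless, but excluding it keeps the startup configuration tidy.
    func joinFlag(self string, nodes []string) string {
        var peers []string
        for _, n := range nodes {
            if n == self {
                continue
            }
            peers = append(peers, n+":26257")
        }
        return "--join=" + strings.Join(peers, ",")
    }

    // Example: joinFlag("bc01n02", []string{"bc01n02", "dcr01s22", "dcr01s24"})
    // yields "--join=dcr01s22:26257,dcr01s24:26257".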

Change-Id: I3406471c0f9542e9d802d39138e400b5a5e74794
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1176
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-12-13 22:30:46 +00:00
implr eca1e080d7 calico: restore CNI_NET_DIR
Change-Id: I04e17f8639505f5b7cc42e86392abc175b7922db
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1178
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-12-03 03:10:13 +00:00
implr 12f176c1eb calico 3.14 -> 3.15
Change-Id: I9eceaf26017e483235b97c8d08717d2750fabe25
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/995
Reviewed-by: q3k <q3k@hackerspace.pl>
2021-11-20 22:12:52 +00:00
q3k 4b8ee32246 cluster/kube: always enable flexdriver
Documentation says [1] this is disabled by default in 1.1, but that
documentation kinda lies [2].

[1] - 235d5a384b/Documentation/flexvolume.md (ceph-flexvolume-configuration)

[2] - 64e28af741 (diff-d1eb5cba50e3770b61ccd3c730cd40514053e1da0233dfe09b5e7967e76a2a6cL424-L425)

Change-Id: Ia92c99e137ed751db62c0f56d42c4901986d0bb8
2021-09-14 21:39:39 +02:00
q3k 38f72fe094 cluster: k0: move ceph-waw3 to proper realm/zonegroup
With this we can use Ceph's multi-site support to easily migrate to our
new k0 Ceph cluster.

This migration was done by using radosgw-admin to rename the existing
realm/zonegroup to the new names (hscloud and eu), and then reworking
the jsonnet so that the Rook operator would effectively do nothing.

It sounds weird that creating a bunch of CRs like
Object{Realm,ZoneGroup,Zone} would be a no-op for the operator,
but that's how Rook works - a CephObjectStore generally creates
everything that the above CRs would create too, but implicitly. Adding
the extra CRs just allows specifying extra settings, like names.

(it wasn't fully a no-op, as the rgw daemon is parametrized by
realm/zonegroup/zone names, so that had to be restarted)

We also make the radosgw serve under object.ceph-eu.hswaw.net, which
allows us to right away start using a zonegroup URL instead of the
zone-only URL.

Change-Id: I4dca55a705edb3bd28e54f50982c85720a17b877
2021-09-14 21:39:39 +02:00
q3k 085a8ff247 cluster: k0: upgrade to ceph 16.2.5
This was fun. See b/6 for a log of how swimmingly this went.

Change-Id: I96c3c18b5d33ef86523b3506f49a390419e9ca7f
2021-09-14 21:39:39 +02:00
q3k 464fb04f39 cluster: k0: bump rook to 1.6
This is needed to get Rook to talk to an external Ceph 16/Pacific
cluster.

This is mostly a bunch of CRD/RBAC changes. Most notably, we yeet our
own CRD rewrite and just slurp in upstream CRD defs.

Change-Id: I08e7042585722ae4440f97019a5212d6cf733fcc
2021-09-14 21:39:37 +02:00
q3k 6579e842b0 kartongips: paper over^W^Wfix CRD updates
Ceph CRD updates would fail with:

  ERROR Error updating customresourcedefinitions cephclusters.ceph.rook.io: expected kind, but got map

This wasn't just https://github.com/bitnami/kubecfg/issues/259 . We pull
in the 'solution' from Pulumi
(https://github.com/pulumi/pulumi-kubernetes/pull/622) which just
retries the update via a JSON update instead, and that seems to have
worked.

We also add some better error return wrapping, which I used to debug
this issue properly.

Oof.
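
A rough sketch of the retry-as-JSON-update idea, written here against
client-go's dynamic client purely for illustration (this is not kartongips'
actual code):

    package crdupdate

    import (
        "context"
        "encoding/json"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/dynamic"
    )

    // updateWithJSONFallback tries a plain update first; if that fails (as
    // the CRD updates above did), it re-serializes the desired object to
    // JSON and applies it as a merge patch instead, wrapping errors so that
    // failures stay debuggable.
    func updateWithJSONFallback(ctx context.Context, c dynamic.ResourceInterface, obj *unstructured.Unstructured) error {
        _, err := c.Update(ctx, obj, metav1.UpdateOptions{})
        if err == nil {
            return nil
        }
        data, merr := json.Marshal(obj.Object)
        if merr != nil {
            return fmt.Errorf("marshaling %q: %w", obj.GetName(), merr)
        }
        if _, perr := c.Patch(ctx, obj.GetName(), types.MergePatchType, data, metav1.PatchOptions{}); perr != nil {
            return fmt.Errorf("updating %q (JSON fallback after %v): %w", obj.GetName(), err, perr)
        }
        return nil
    }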

Change-Id: I2007a7857e44128d74760174b61b59efa58e9cbc
2021-09-11 20:54:34 +00:00
q3k 4f0468fa26 cluster/kube: remove ceph diff against k0 production
This now has a zero diff against prod.

location fields in CephCluster.storage.nodes seem to have been removed
from the CRD at some point. Not sure how the CRUSH tree now gets
populated, but whatever, it's been working like this for a while
already. Same for CephObjectStore.gateway.type.

The Rook Operator has been zero-scaled for a while now due to b/6.

Change-Id: I30a836f273f4c1529f60fa9297c96b7aac412f59
2021-09-11 12:43:53 +00:00
q3k 89a16f4de4 cluster/admitomatic: allow use-regex n-i-c annotation
This annotation is used to permit routes defined by regexes instead of
simple prefix matching. This is used by our synapse deployment for
routing incoming HTTP requests to different Synapse components.

I've stumbled upon this while deploying a new Matrix/Synapse instance.
This hasn't yet been a problem because the existing ingresses for Matrix
deployments predate admitomatic.
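
A toy sketch of the kind of per-annotation policy decision described above
(not admitomatic's actual implementation or configuration format):

    package admission

    // allowedNICAnnotation is a hypothetical allowlist check for
    // nginx-ingress-controller annotations on incoming Ingress objects:
    // use-regex is permitted, but only with a boolean value.
    func allowedNICAnnotation(key, value string) bool {
        switch key {
        case "nginx.ingress.kubernetes.io/use-regex":
            return value == "true" || value == "false"
        default:
            return false
        }
    }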

Change-Id: I821e58b214450ccf0de22d2585c3b0d11fbe71c0
2021-06-06 12:58:11 +00:00
q3k 7251f2720e Merge changes Ib068109f,I9a00487f,I1861fe7c,I254983e5,I3e2bedca, ...
* changes:
  cluster/identd/ident: update README
  cluster/kube: deploy identd
  cluster/identd: implement
  cluster/identd/kubenat: implement
  cluster/identd/cri: import
  cluster/identd/ident: add TestE2E
  cluster/identd/ident: add Query function
  cluster/identd/ident: add IdentError
  cluster/identd/ident: add basic ident protocol server
  cluster/identd/ident: add basic ident protocol client
2021-05-28 23:08:10 +00:00
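
For reference, a minimal sketch of an ident (RFC 1413) query like the one the
cluster/identd/ident client performs (this shows the protocol shape only, not
the package's actual API):

    package identsketch

    import (
        "bufio"
        "fmt"
        "net"
        "strings"
        "time"
    )

    // query asks the identd running on host who owns the TCP connection from
    // host:remotePort to us:localPort, and returns the reported user ID.
    func query(host string, remotePort, localPort int) (string, error) {
        conn, err := net.DialTimeout("tcp", net.JoinHostPort(host, "113"), 5*time.Second)
        if err != nil {
            return "", err
        }
        defer conn.Close()

        // The request is the port pair: the port on the queried host first,
        // then the port on the querying host.
        if _, err := fmt.Fprintf(conn, "%d, %d\r\n", remotePort, localPort); err != nil {
            return "", err
        }

        // A successful reply looks like "6113, 23 : USERID : UNIX : someuser".
        resp, err := bufio.NewReader(conn).ReadString('\n')
        if err != nil {
            return "", err
        }
        parts := strings.Split(resp, ":")
        if len(parts) < 4 || strings.TrimSpace(parts[1]) != "USERID" {
            return "", fmt.Errorf("ident error or malformed response: %q", strings.TrimSpace(resp))
        }
        return strings.TrimSpace(parts[3]), nil
    }
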
q3k 2414afe3c0 cluster/kube: deploy identd
Change-Id: I9a00487fc4a972ecb0904055dbaaab08221062c1
2021-05-26 19:46:09 +00:00
q3k e17f7edde0 cluster/kube: nginx: add Hscloud-Nic-Source-* headers
These can be used by production jobs to get the source port of the
client connecting over HTTP. A followup CR implements just that.
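
A sketch of how a production job might consume such headers; the exact header
names behind Hscloud-Nic-Source-* are assumptions for illustration:

    package sourceinfo

    import "net/http"

    // clientSource returns the original client address and port as reported
    // by nginx-ingress-controller. The header names below are hypothetical
    // stand-ins for the Hscloud-Nic-Source-* family mentioned above.
    func clientSource(r *http.Request) (ip, port string) {
        return r.Header.Get("Hscloud-Nic-Source-IP"), r.Header.Get("Hscloud-Nic-Source-Port")
    }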

Change-Id: Ic8e29eaf806bb196d8cfcfb604ff66ae4d0d166a
2021-05-22 19:16:39 +00:00
q3k ba2f4d8215 cluster/prodvider: deploy
Change-Id: I01d931a664e4b09c0d75fb01fb3f2528bc0f1a53
2021-05-19 22:13:26 +00:00
q3k 5ae5cbec81 Merge "cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k" 2021-05-19 15:34:45 +00:00
q3k 2e8d24b84a cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k
This fixes CVE-2021-3450 and CVE-2021-3449.

Deployed on prod:

$ kubectl -n nginx-system exec nginx-ingress-controller-5c69c5cb59-2f8v4 -- openssl version
OpenSSL 1.1.1k  25 Mar 2021

Change-Id: I7115fd2367cca7b687c555deb2134b22d19a291a
2021-03-25 18:16:13 +00:00
q3k 3b8935378a cluster/crdb: make init job 'idempotent'
This enables its redeployment with a newer crdb image.

Change-Id: If039992674f401af53738c80d22cc2ca2818fe00
2021-03-17 21:48:30 +00:00
q3k 943ab5b1a6 cluster/admitomatic: allow whitelist-source-range
Without this, cert-manager gets stuck.

Deployed to prod.

Change-Id: I356cd44f455b6f4aecea9ae396f6a05e1a727859
2021-02-07 23:35:28 +00:00
q3k 41bbf1436a cluster/kube: deploy admitomatic webhook
This has been (successfully) tested on prod and then rolled back.

Change-Id: I22657f66b4aeaa8a0ae452035ba18a79f4549b14
2021-02-07 19:19:23 +00:00
q3k 3c5d836c56 cluster/kube: deploy admitomatic
This doesn't yet enable a webhook, but deploys admitomatic itself.

Change-Id: Id177bc8841c873031f9c196b8ff3c12dd846ba8e
2021-02-07 19:19:02 +00:00
patryk edf14cc5f4 crdb: replace bc01n03 with dcr01s22, upgrade to v20.2.4
This change reflects the current production state.

The upgrade was done by going through the following versions:
19.1.0 -> 19.2.12 -> 20.1.10 -> 20.2.4

Change-Id: I8b33b8116363f1a918423fd18ba3d1b5c910851c
2021-01-23 23:00:29 +01:00
q3k 3b9ee5f1c0 ceph: bump to 14.2.16
More as-builts. This has already been bumped. Had to coax ceph-waw2 to
upgrade despite the fact that it's horribly broken.

Change-Id: Ia762f5d7d88d6420c2fc25cf199037cbccde0cb3
2021-01-19 21:45:26 +00:00
q3k 2c04c8410a rook: bump to 1.2.7
As-built: deployed to ceph-waw{2,3} already.

Change-Id: I27189b273cf72638cf2036681054832db99591da
2021-01-19 21:41:13 +01:00
q3k f18a531f9b prodvider: bump to Go 1.15.5
Change-Id: I0f7999deb571aef12533f0ceee21c0283bc0bdc4
2020-11-27 09:50:09 +00:00
q3k c7de7e562f cluster: do not export metallb routes to mesh peers
This prevents metallb routes from being announced from all peers to our ToR,
thereby preventing issues with traffic hitting services with
externalTrafficPolicy: local.

There still is the from-host loopback issue, but that will be fixed by
upgrading to kube 1.15.

Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04
2020-10-03 14:56:52 +00:00
q3k f0acf16564 prodvider: use SANs in service certificates
This fixes compatibility with prodaccess tools built with Go 1.15, which
introduced 'X.509 CommonName deprecation' [1].

[1] - https://golang.org/doc/go1.15#commonname
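
A minimal sketch of the fix: putting the service's DNS names into the
certificate's Subject Alternative Names rather than relying on CommonName
(names are illustrative; serial number, validity and key usages are omitted):

    package sancert

    import (
        "crypto/x509"
        "crypto/x509/pkix"
    )

    // serviceTemplate builds a certificate template whose hostnames live in
    // DNSNames (SANs). Go 1.15+ clients ignore CommonName when verifying
    // hostnames, so the SANs are what actually matters.
    func serviceTemplate(name, namespace string) *x509.Certificate {
        host := name + "." + namespace + ".svc.cluster.local"
        return &x509.Certificate{
            Subject:  pkix.Name{CommonName: host},
            DNSNames: []string{host, name + "." + namespace, name},
        }
    }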

Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3
2020-10-03 14:56:35 +00:00
q3k a5ed644980 k0.hswaw.net: pass metallb through Calico
Previously, we had the following setup:

                          .-----------.
                          | .....     |
                        .-----------.-|
                        | dcr01s24  | |
                      .-----------.-| |
                      | dcr01s22  | | |
                  .---|-----------| |-'
    .--------.    |   |---------. | |
    | dcsw01 | <----- | metallb | |-'
    '--------'        |---------' |
                      '-----------'

Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.

Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.

                      .-------------------------.
                      | dcr01s24                |
                      |-------------------------|
    .--------.        |---------.   .---------. |
    | dcsw01 | <----- | Calico  |<--| metallb | |
    '--------'        |---------'   '---------' |
                      '-------------------------'

This makes Calico announce our pod/service networks into our L3 fabric!

Calico and metallb talk to each other over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to passive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.

We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. registry access from containerd.

All routes pass through without route reflectors or a full mesh, as we
use eBGP over private ASNs in our fabric.

We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.

This is all mildly hacky. Here's hoping that Calico will some day gain
metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...

There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing and enabling route reflection
on dcsw01 (to recreate the mesh by routing through dcsw01), or maybe by
some more hacking of the Calico BIRD config :/.

Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
2020-09-23 18:55:12 +00:00
q3k 059fdfed3b k0: add resource requests/limits to nginx, remove gitea
We just had an outage seemingly caused by N-I-C sending tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.

I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.

This change reflects ad-hoc production changes.

Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
2020-09-20 22:53:40 +00:00
q3k 0581bbf8a0 games/factorio: add modproxy
This adds a mod proxy system, called, well, modproxy.

It sits between Factorio server instances and the Factorio mod portal,
allowing arbitrary mod downloads without the servers needing to know
Factorio credentials.
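
A rough sketch of the idea - a credential-injecting reverse proxy in front of
the mod portal; the query parameter names and portal URL are assumptions, not
modproxy's actual implementation:

    package modproxysketch

    import (
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    // portalProxy returns a reverse proxy that forwards mod download requests
    // to the Factorio mod portal, appending the portal credentials on the way
    // - so game servers behind it never need to hold those credentials.
    func portalProxy(username, token string) http.Handler {
        target, _ := url.Parse("https://mods.factorio.com")
        proxy := httputil.NewSingleHostReverseProxy(target)
        orig := proxy.Director
        proxy.Director = func(r *http.Request) {
            orig(r)
            q := r.URL.Query()
            q.Set("username", username)
            q.Set("token", token)
            r.URL.RawQuery = q.Encode()
        }
        return proxy
    }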

Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212
2020-08-14 13:03:46 +02:00
q3k 3d29484ebb k0: move registry to ceph-waw3
ceph-waw2 currently has some production issues [1] which have started to
cause write failures in the registry. The registry is the only user of
ceph-waw2's affected pool, so we reduce the dumpster fire blast radius
by moving it over to ceph-waw3.

This has already been deployed and data has been migrated over (via
s3cmd sync), and the migration has been verified (by a push and pull,
and pull of an older image).

[1] - pgs stuck inactive in the object storage pool

Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347
2020-08-04 01:36:51 +02:00
q3k 4ded56ab8a prodvider: emit client/server cert
Change-Id: I024782a7dfa6e16ff5f562a62ddd8fe3bf299c51
2020-08-01 22:01:05 +02:00
q3k f3312ef77e *: developer machine HSPKI credentials
In addition to k8s certificates, prodaccess now issues HSPKI
certificates, with DN=$username.sso.hswaw.net. These are installed into
XDG_CONFIG_HOME (or os equiv).

//go/pki will now automatically attempt to load these certificates. This
means you can now run any pki-dependent tool without -hspki_disable, and
with automatic mTLS!
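
A sketch of resolving that per-user location in Go; the hspki subdirectory
and file names are assumptions for illustration, not the actual layout:

    package hspkisketch

    import (
        "os"
        "path/filepath"
    )

    // credentialPaths returns where a developer's HSPKI certificate and key
    // might live: under XDG_CONFIG_HOME on Linux, or the OS equivalent
    // returned by os.UserConfigDir elsewhere.
    func credentialPaths() (cert, key string, err error) {
        dir, err := os.UserConfigDir()
        if err != nil {
            return "", "", err
        }
        return filepath.Join(dir, "hspki", "cert.pem"), filepath.Join(dir, "hspki", "key.pem"), nil
    }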

Change-Id: I5b28e193e7c968d621bab0d42aabd6f0510fed6d
2020-08-01 17:15:52 +02:00
q3k 509ab6e29a k0/cockroach: add public DNS entry for cockroach
Change-Id: I934bf348e2165148b515b709e853ab67f039a402
2020-07-30 22:56:30 +02:00
informatic 97a6ca8a8b Merge "cluster/kube/lib/nginx: add gitea-prod ingress service" 2020-07-02 17:15:53 +00:00
informatic 0697e01144 cluster/kube/lib/registry: allow auth'd users to pull all images
"Anyone can pull all images" rule did only match on anonymous users. Now
it should match all users, including authenticated ones.

Change-Id: I2205299093feca51f30526ba305eadbaa0a68ecb
2020-07-02 18:45:42 +02:00
informatic f00edf6ee8 cluster/kube/lib/nginx: add gitea-prod ingress service
We would like gitea to have its ssh server exposed on TCP port 22 on the
same address as its web interface. We would also still like to use all
the automation around ingresses already in place (like cert-manager
integration).

To solve this, we create an additional LoadBalancer service for
nginx-ingress-controller and set up a special tcp-services forwarding rule
to pass port 22 traffic to the gitea-prod/gitea service, like we already do
for gerrit.
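
For reference, the tcp-services mechanism boils down to a ConfigMap entry
mapping a listening port to namespace/service:port. Sketched here with
client-go types purely for illustration (the ConfigMap name and the service
port are assumptions; the repo manages this in its own manifests):

    package tcpservices

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // giteaSSHForward builds an nginx-ingress-controller tcp-services entry
    // that forwards TCP port 22 to the gitea service in gitea-prod.
    func giteaSSHForward() *corev1.ConfigMap {
        return &corev1.ConfigMap{
            ObjectMeta: metav1.ObjectMeta{Name: "tcp-services", Namespace: "nginx-system"},
            Data: map[string]string{
                "22": "gitea-prod/gitea:22",
            },
        }
    }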

Change-Id: I5bfc901ebe858464f8e9c2f3b2216b254ccd6c4d
2020-07-02 18:30:38 +02:00