Commit Graph

215 Commits (56ff18c486afbd56ebbfe337f8613932b0aefbc9)

Author SHA1 Message Date
q3k 432fa30ded cluster/certs: bump ca-kube-prodivider
Redeployed.

Change-Id: I01110433f89df5595de0f9587508104d6091a774
2021-08-29 17:20:59 +00:00
q3k 89a16f4de4 cluster/admitomatic: allow use-regex n-i-c annotation
This annotation is used to permit routes defined by regexes instead of
simple prefix matching. This is used by our synapse deployment for
routing incomming HTTP requests to diffferent Synapse components.

I've stumbled upon this while deploying a new Matrix/Synapse instance.
This hasn't been yet a problem because the existing ingresses for Matrix
deployments predate admitomatic.

Change-Id: I821e58b214450ccf0de22d2585c3b0d11fbe71c0
2021-06-06 12:58:11 +00:00
q3k 7251f2720e Merge changes Ib068109f,I9a00487f,I1861fe7c,I254983e5,I3e2bedca, ...
* changes:
  cluster/identd/ident: update README
  cluster/kube: deploy identd
  cluster/identd: implement
  cluster/identd/kubenat: implement
  cluster/identd/cri: import
  cluster/identd/ident: add TestE2E
  cluster/identd/ident: add Query function
  cluster/identd/ident: add IdentError
  cluster/identd/ident: add basic ident protocol server
  cluster/identd/ident: add basic ident protocol client
2021-05-28 23:08:10 +00:00
q3k 46c3137d36 cluster/identd/ident: update README
Change-Id: Ib068109ff37749207e7b2a18c07f51d3c4ed3fd6
2021-05-26 19:46:13 +00:00
q3k 2414afe3c0 cluster/kube: deploy identd
Change-Id: I9a00487fc4a972ecb0904055dbaaab08221062c1
2021-05-26 19:46:09 +00:00
q3k 044386d638 cluster/identd: implement
This implements the main identd service that will run on our production
hosts. It's comparatively small, as most of the functionality is
implemented in //cluster/identd/ident and //cluster/identd/kubenat.

Change-Id: I1861fe7c93d105faa19a2bafbe9c85fe36502f73
2021-05-26 19:46:06 +00:00
q3k 6b649f8234 cluster/identd/kubenat: implement
This is a library to find pod information for a given TCP 4-tuple.

Change-Id: I254983e579e3aaa04c0c5491851f4af94a3f4249
2021-05-26 19:46:02 +00:00
q3k ae052f0804 cluster/identd/cri: import
This imports the CRI protobuf/gRPC specs. These are pulled from:

    https://raw.githubusercontent.com/kubernetes/cri-api/master/pkg/apis/runtime/v1alpha2/api.proto

Our host containerd does not implement v1, so we go with v1alpha2.

Change-Id: I3e2bedca76edc85eea9b61a8634c92175f0d2a30
2021-05-26 19:45:58 +00:00
q3k 3638a3d76a cluster/identd/ident: add TestE2E
Change-Id: I8a95fadf19376de2806cb63897b77e370559392f
2021-05-23 16:27:22 +00:00
q3k 8e603e13e5 cluster/identd/ident: add Query function
This is a high-level wrapper for querying identd, and uses IdentError to
carry errors received from the server.

Change-Id: I6444a67117193b97146ffd1548151cdb234d47b5
2021-05-23 16:27:17 +00:00
q3k 1c2bc12ad0 cluster/identd/ident: add IdentError
This adds a Go error type that can be used to wrap any ErrorResponse.

Change-Id: I57fbd056ac774f4e2ae3bdf85941c1010ada0656
2021-05-23 16:26:59 +00:00
q3k ce2737f2e7 cluster/identd/ident: add basic ident protocol server
This adds an ident protocol server and tests for it.

Change-Id: I830f85faa7dce4220bd7001635b20e88b4a8b417
2021-05-23 16:26:54 +00:00
q3k d4438d67a2 cluster/identd/ident: add basic ident protocol client
This is the first pass at an ident protocol client. In the end, we want
to implement an ident protocol server for our in-cluster identd, but
starting out with a client helps me getting familiar with the protocol,
and will allow the server implementation to be tested against the
client.

Change-Id: Ic37b84577321533bab2f2fbf7fb53409a5defb95
2021-05-23 16:26:50 +00:00
q3k e17f7edde0 cluster/kube: nginx: add Hscloud-Nic-Source-* headers
These can be used by production jobs to get the source port of the
client connecting over HTTP. A followup CR implements just that.

Change-Id: Ic8e29eaf806bb196d8cfcfb604ff66ae4d0d166a
2021-05-22 19:16:39 +00:00
q3k ba2f4d8215 cluster/prodvider: deploy
Change-Id: I01d931a664e4b09c0d75fb01fb3f2528bc0f1a53
2021-05-19 22:13:26 +00:00
q3k 02e1598eb3 cluster/prodvider: emit crdb certs
This emits short-lived user credentials for a `dev-user` in crdb-waw1
any time someone prodaccesses.

Change-Id: I0266a05c1f02225d762cfd2ca61976af0658639d
2021-05-19 22:13:22 +00:00
q3k bade46d45f go/pki: fix error return
DeveloperCredentialsLocation used to glog.Exitf instead of returning an
error, and a consumer (prodaccess) used to not check the return code.
Bad refactor?

Change-Id: I6c2d05966ba6b3eb300c24a51584ccf5e324cd49
2021-05-19 22:12:08 +00:00
q3k 5ae5cbec81 Merge "cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k" 2021-05-19 15:34:45 +00:00
q3k 99b91b11f1 cluster/k0/admitomatic: add .hswaw.net to hswaw-prod namespace
This was preventing certificate refresh in the hswaw-prod mirko ingress.

Change-Id: I14b18b642a3948a9864e2d9a90b2a2b2c145b9b1
2021-03-28 17:34:34 +00:00
q3k 7967ca177b cluster/certs: update k0 certs
This leaves us with the next set of expiring certs in September 2021.

Fixes b/36.

Change-Id: I536497626c0dd3807fccf28d4b61e5e531cf8d9c
2021-03-27 12:19:25 +00:00
q3k 41b882d053 cluster: remove bc01n03 certs/secrets
Decomissioned node, noticed while rolling over certs in b/36.

Change-Id: Ia386ff846998c52799662179c325b24e78f2eca8
2021-03-27 12:18:56 +00:00
q3k 2e8d24b84a cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k
This fixes CVE-2021-3450 and CVE-2021-3449.

Deployed on prod:

$ kubectl -n nginx-system exec nginx-ingress-controller-5c69c5cb59-2f8v4 -- openssl version
OpenSSL 1.1.1k  25 Mar 2021

Change-Id: I7115fd2367cca7b687c555deb2134b22d19a291a
2021-03-25 18:16:13 +00:00
q3k bf266c6aaf cluster/k0: add dns crdb user
In preparation for running PowerDNS on k0.

Change-Id: I853c7465a6a32d02628fa6cfdeb445eb9937b3be
2021-03-17 21:49:00 +00:00
q3k 3b8935378a cluster/crdb: make init job 'idempotent'
This enables its redeployment with a newer crdb image.

Change-Id: If039992674f401af53738c80d22cc2ca2818fe00
2021-03-17 21:48:30 +00:00
q3k 64de7afe32 cluster/kube/k0: fix syntax errors
This happened in 793ca1b3 and slipped past review.

Change-Id: Ie31f0e1ec03d6e4545d6683b21f528550bf4ef9f
2021-03-17 21:47:51 +00:00
q3k 793ca1b3b2 cluster/kube: limit OSDs in ceph-waw3 to 8GB RAM
Each OSD is connected to a 6TB drive, and with the good ol' 1TB storage
-> 1GB RAM rule of thumb for OSDs, we end up with 6GB. Or, to round up,
8GB.

I'm doing this because over the past few weeks OSDs in ceph-waw3 have
been using a _ton_ of RAM. This will probably not prevent that (and
instead they wil OOM more often :/), but it at will prevent us from
wasting resources (k0 started migrating pods to other nodes, and running
full nodes like that without an underlying request makes for a terrible
draining experience).

We need to get to the bottom of why this is happening in the first
place, though. Did this happen as we moved to containerd?

Followup: b.hswaw.net/29

Already deployed to production.

Change-Id: I98df63763c35017eb77595db7b9f2cce71756ed1
2021-03-07 00:09:58 +00:00
q3k 3ba5c1b591 *: docs pass
Change-Id: I87ca80d3f7728ed407071468ac233e6ad4574929
2021-03-06 22:21:28 +00:00
q3k bc0d3cb227 hackdoc: link to cs instead of gitweb
Change-Id: Ifca7a63517bceffe7ccc0452474d9d16626486de
2021-03-06 22:16:54 +00:00
q3k 0d26fc9780 cluster: disable nginx/acme
These are unused.

Change-Id: I2a428dabd0a27c060c595f5e0843d7d8d8e26dcd
2021-02-15 22:14:41 +01:00
q3k 765e369255 cluster: replace docker with containerd
This removes Docker and docker-shim from our production kubernetes, and
moves over to containerd/CRI. Docker support within Kubernetes was
always slightly shitty, and with 1.20 the integration was dropped
entirely. CRI/Containerd/runc is pretty much the new standard.

Change-Id: I98c89d5433f221b5fe766fcbef261fd72db530fe
2021-02-15 22:14:15 +01:00
q3k 4b613303b1 RFC: *: move away from rules_nixpkgs
This is an attempt to see how well we do without rules_nixpkgs.

rules_nixpkgs has the following problems:

 - complicates our build system significantly (generated external
   repository indirection for picking local/nix python and go)
 - creates builds that cannot run on production (as they are tainted by
   /nix/store libraries)
 - is not a full solution to the bazel hermeticity problem anyway, and
   we'll have to tackle that some other way (eg. by introducing proper
   C++ cross-compilation toolchains and building everything from C,
   including Python and Go)

Instead of rules_nixpkgs, we ship a shell.nix file, so NixOS users can
just:

  jane@hacker:~/hscloud $ nix-shell
  hscloud-build-chrootenv:jane@hacker:~/hscloud$ prodaccess

This shell.nix is in a way nicer, as it immediately gives you all tools
needed to access production straight away.

Change-Id: Ieceb5ae0fb4d32e87301e5c99416379cedc900c5
2021-02-15 22:11:35 +01:00
q3k 4842705406 cluster/nix: integrate with readtree
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:

   nix-build -A cluster.nix.provision

   result/bin/provision

Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
2021-02-14 14:46:07 +00:00
q3k 225a5c7ee9 nixpkgs: bump
Fixes b/3.

Change-Id: I2f734422cdad00f78956477815c4aea645c6c49e
2021-02-14 14:43:07 +00:00
q3k 78d6f11cb2 Merge "cluster/admitomatic: allow whitelist-source-range" 2021-02-08 17:21:59 +00:00
q3k 877cf0af26 🅱️
Fixes b/8

Change-Id: I5a5779c3688451d89c0601dc913143d75048c9f6
2021-02-08 15:10:11 +00:00
q3k 943ab5b1a6 cluster/admitomatic: allow whitelist-source-range
Without this, cert-manager get stuck.

Deployed to prod.

Change-Id: I356cd44f455b6f4aecea9ae396f6a05e1a727859
2021-02-07 23:35:28 +00:00
q3k f40c9249ce cluster/kube: allow system:admin-namespaces to modify ingresses
This will permit any binding to system:admin-namespaces (eg. personal-*
namespaces, per-namespace extra admin access like matrix-0x3c) the
ability to create and updates ingresses.

Change-Id: I522896ebe290fe982d6fe46b7b1d604d22b4f72c
2021-02-07 19:24:43 +00:00
q3k 41bbf1436a cluster/kube: deploy admitomatic webhook
This has been (succesfully) tested on prod and then rolled back.

Change-Id: I22657f66b4aeaa8a0ae452035ba18a79f4549b14
2021-02-07 19:19:23 +00:00
q3k 3c5d836c56 cluster/kube: deploy admitomatic
This doesn't yet enable a webhook, but deploys admitomatic itself.

Change-Id: Id177bc8841c873031f9c196b8ff3c12dd846ba8e
2021-02-07 19:19:02 +00:00
q3k 3ab5f07c64 cluster/admitomatic: build docker image
Change-Id: I086a8b17a4dc7257de1bae3a6f0c95400af7e115
2021-02-07 19:18:53 +00:00
q3k c80321d17e Merge "cluster: add admitomatic CA/certificate" 2021-02-06 23:18:59 +00:00
q3k 04604b2aae cluster: add admitomatic CA/certificate
Change-Id: Idb32dc38b897aa266b6d2d6fd57a5e38b47db7fc
2021-02-06 17:18:58 +00:00
informatic f4a6a56662 cluster/kube/k0: add issues.hackerspace.pl crdb user
Change-Id: If78f795e0e35360b65c666e6b217037fc34a2ccf
2021-02-01 21:32:25 +01:00
informatic 3b8a43f35d cluster/kube/k0: add issues.hackerspace.pl ceph s3 user
Change-Id: If5eef3404bdc08ded88e46f45bad0f9abcdb0f1c
2021-02-01 21:19:59 +01:00
q3k c6118649ab cluster/admitomatic: finish up service
This turns admitomatic into a self-standing service that can be used as
an admission controller.

I've tested this E2E on a local k3s server, and have some early test
code for that - but that'll land up in a follow up CR, as it first needs
to be cleaned up.

Change-Id: I46da0fc49f9d1a3a1a96700a36deb82e5057249b
2021-01-31 12:18:16 +01:00
q3k 5d2c8fcda0 cluster/admitomatic: finish up ingress admission logic
This gives us nearly everything required to run the admission
controller. In addition to checking for allowed domains, we also do some
nginx-inress-controller security checks.

Change-Id: Ib187de6d2c06c58bd8c320503d4f850df2ec8abd
2021-01-31 12:18:16 +01:00
q3k 649565324b cluster/admitomatic: implement basic dns/ns filtering
This is the beginning of a validating admission controller which we will
use to permit end-users access to manage Ingresses.

This first pass implements an ingressFilter, which is the main structure
through which allowed namespace/dns combinations will be allowed. The
interface is currently via a test, but in the future this will likely be
configured via a command line, or via a serialized protobuf config.

Change-Id: I22dbed633ea8d8e1fa02c2a1598f37f02ea1b309
2021-01-30 19:19:35 +01:00
patryk edf14cc5f4 crdb: replace bc01n03 with dcr01s22, upgrade to v20.2.4
This change reflects the current production state.

Upgrade was done by going through following versions:
19.1.0 -> 19.2.12 -> 20.1.10 -> 20.2.4

Change-Id: I8b33b8116363f1a918423fd18ba3d1b5c910851c
2021-01-23 23:00:29 +01:00
patryk f3153888a8 cluster/kube: Add k0-cockroach.jsonnet, add Gitea client cert
Change-Id: Ibc5db1b0114b2540b6dc806e75e9a36cf9a3bc50
2021-01-23 15:38:50 +01:00
q3k 61f978a0a0 *: tear down ceph-waw2
It reached the stage of being crapped out so much that the OSDs spurious
IOPS killed the performance of disks colocated on the same M610 RAID
controllers. This made etcd _very_ slow, to the point of churning
through re-elections due to timeouts.

etcd/apiserver latencies, observe the difference at ~15:38:

https://object.ceph-waw3.hswaw.net/q3k-personal/4fbe8d4cfc8193cad307d487371b4e44358b931a7494aa88aff50b13fae9983c.png

I moved gerrit/* and matrix/appservice-irc-freenode PVCs to ceph-waw3 by
hand. The rest were non-critical so I removed them, they can be
recovered from benji backups if needed.

Change-Id: Iffbe87aefc06d8324a82b958a579143b7dd9914c
2021-01-22 16:26:09 +01:00