Commit Graph

289 Commits (6e10e46f96875ef15a1e1588e9b03d4f1f6b42a5)

Author SHA1 Message Date
q3k 3ba5c1b591 *: docs pass
Change-Id: I87ca80d3f7728ed407071468ac233e6ad4574929
2021-03-06 22:21:28 +00:00
q3k bc0d3cb227 hackdoc: link to cs instead of gitweb
Change-Id: Ifca7a63517bceffe7ccc0452474d9d16626486de
2021-03-06 22:16:54 +00:00
q3k 0d26fc9780 cluster: disable nginx/acme
These are unused.

Change-Id: I2a428dabd0a27c060c595f5e0843d7d8d8e26dcd
2021-02-15 22:14:41 +01:00
q3k 765e369255 cluster: replace docker with containerd
This removes Docker and docker-shim from our production kubernetes, and
moves over to containerd/CRI. Docker support within Kubernetes was
always slightly shitty, and with 1.20 the integration was dropped
entirely. CRI/Containerd/runc is pretty much the new standard.

Change-Id: I98c89d5433f221b5fe766fcbef261fd72db530fe
2021-02-15 22:14:15 +01:00
q3k 4b613303b1 RFC: *: move away from rules_nixpkgs
This is an attempt to see how well we do without rules_nixpkgs.

rules_nixpkgs has the following problems:

 - complicates our build system significantly (generated external
   repository indirection for picking local/nix python and go)
 - creates builds that cannot run on production (as they are tainted by
   /nix/store libraries)
 - is not a full solution to the bazel hermeticity problem anyway, and
   we'll have to tackle that some other way (eg. by introducing proper
   C++ cross-compilation toolchains and building everything from C,
   including Python and Go)

Instead of rules_nixpkgs, we ship a shell.nix file, so NixOS users can
just:

  jane@hacker:~/hscloud $ nix-shell
  hscloud-build-chrootenv:jane@hacker:~/hscloud$ prodaccess

This shell.nix is in a way nicer, as it immediately gives you all tools
needed to access production straight away.

Change-Id: Ieceb5ae0fb4d32e87301e5c99416379cedc900c5
2021-02-15 22:11:35 +01:00
q3k 4842705406 cluster/nix: integrate with readtree
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:

   nix-build -A cluster.nix.provision

   result/bin/provision

Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
2021-02-14 14:46:07 +00:00
q3k 225a5c7ee9 nixpkgs: bump
Fixes b/3.

Change-Id: I2f734422cdad00f78956477815c4aea645c6c49e
2021-02-14 14:43:07 +00:00
q3k 78d6f11cb2 Merge "cluster/admitomatic: allow whitelist-source-range" 2021-02-08 17:21:59 +00:00
q3k 877cf0af26 🅱️
Fixes b/8

Change-Id: I5a5779c3688451d89c0601dc913143d75048c9f6
2021-02-08 15:10:11 +00:00
q3k 943ab5b1a6 cluster/admitomatic: allow whitelist-source-range
Without this, cert-manager gets stuck.

Deployed to prod.

Change-Id: I356cd44f455b6f4aecea9ae396f6a05e1a727859
2021-02-07 23:35:28 +00:00
q3k f40c9249ce cluster/kube: allow system:admin-namespaces to modify ingresses
This will permit any binding to system:admin-namespaces (eg. personal-*
namespaces, per-namespace extra admin access like matrix-0x3c) the
ability to create and update ingresses.

Change-Id: I522896ebe290fe982d6fe46b7b1d604d22b4f72c
2021-02-07 19:24:43 +00:00
q3k 41bbf1436a cluster/kube: deploy admitomatic webhook
This has been (successfully) tested on prod and then rolled back.

Change-Id: I22657f66b4aeaa8a0ae452035ba18a79f4549b14
2021-02-07 19:19:23 +00:00
q3k 3c5d836c56 cluster/kube: deploy admitomatic
This doesn't yet enable a webhook, but deploys admitomatic itself.

Change-Id: Id177bc8841c873031f9c196b8ff3c12dd846ba8e
2021-02-07 19:19:02 +00:00
q3k 3ab5f07c64 cluster/admitomatic: build docker image
Change-Id: I086a8b17a4dc7257de1bae3a6f0c95400af7e115
2021-02-07 19:18:53 +00:00
q3k c80321d17e Merge "cluster: add admitomatic CA/certificate" 2021-02-06 23:18:59 +00:00
q3k 04604b2aae cluster: add admitomatic CA/certificate
Change-Id: Idb32dc38b897aa266b6d2d6fd57a5e38b47db7fc
2021-02-06 17:18:58 +00:00
informatic f4a6a56662 cluster/kube/k0: add issues.hackerspace.pl crdb user
Change-Id: If78f795e0e35360b65c666e6b217037fc34a2ccf
2021-02-01 21:32:25 +01:00
informatic 3b8a43f35d cluster/kube/k0: add issues.hackerspace.pl ceph s3 user
Change-Id: If5eef3404bdc08ded88e46f45bad0f9abcdb0f1c
2021-02-01 21:19:59 +01:00
q3k c6118649ab cluster/admitomatic: finish up service
This turns admitomatic into a self-standing service that can be used as
an admission controller.

I've tested this E2E on a local k3s server, and have some early test
code for that - but that'll land up in a follow up CR, as it first needs
to be cleaned up.

Change-Id: I46da0fc49f9d1a3a1a96700a36deb82e5057249b
2021-01-31 12:18:16 +01:00
q3k 5d2c8fcda0 cluster/admitomatic: finish up ingress admission logic
This gives us nearly everything required to run the admission
controller. In addition to checking for allowed domains, we also do some
nginx-ingress-controller security checks.

Change-Id: Ib187de6d2c06c58bd8c320503d4f850df2ec8abd
2021-01-31 12:18:16 +01:00
q3k 649565324b cluster/admitomatic: implement basic dns/ns filtering
This is the beginning of a validating admission controller which we will
use to permit end-users access to manage Ingresses.

This first pass implements an ingressFilter, which is the main structure
through which permitted namespace/dns combinations will be expressed. It
is currently only exercised via a test, but in the future this will likely
be configured via a command line, or via a serialized protobuf config.

Change-Id: I22dbed633ea8d8e1fa02c2a1598f37f02ea1b309
2021-01-30 19:19:35 +01:00
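
The namespace/dns filtering idea from 649565324b can be illustrated with a
small sketch. This is not the admitomatic code (which lives in the repository
and is not shown in this log); the class name, method names and the `*.`
wildcard semantics below are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IngressFilter:
    """Hypothetical stand-in for an ingress filter: maps a namespace to the
    DNS patterns that Ingresses in that namespace may use."""

    allowed: dict = field(default_factory=dict)

    def allow(self, namespace, pattern):
        # Register one allowed DNS pattern for a namespace.
        self.allowed.setdefault(namespace, []).append(pattern.lower())

    def domain_allowed(self, namespace, domain):
        # A domain is allowed if it equals a pattern, or (assumed semantics)
        # is a strict subdomain of a "*.suffix" wildcard pattern.
        domain = domain.lower()
        for pattern in self.allowed.get(namespace, []):
            if pattern.startswith("*."):
                suffix = pattern[1:]  # "*.example.org" -> ".example.org"
                if domain.endswith(suffix) and len(domain) > len(suffix):
                    return True
            elif domain == pattern:
                return True
        return False


f = IngressFilter()
f.allow("matrix", "matrix.hackerspace.pl")
f.allow("personal-jane", "*.jane.example.org")
```

A namespace with no registered patterns is denied everything, which matches
the default-deny behaviour one would want from a validating admission
controller.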
patryk edf14cc5f4 crdb: replace bc01n03 with dcr01s22, upgrade to v20.2.4
This change reflects the current production state.

Upgrade was done by going through following versions:
19.1.0 -> 19.2.12 -> 20.1.10 -> 20.2.4

Change-Id: I8b33b8116363f1a918423fd18ba3d1b5c910851c
2021-01-23 23:00:29 +01:00
patryk f3153888a8 cluster/kube: Add k0-cockroach.jsonnet, add Gitea client cert
Change-Id: Ibc5db1b0114b2540b6dc806e75e9a36cf9a3bc50
2021-01-23 15:38:50 +01:00
q3k 61f978a0a0 *: tear down ceph-waw2
It reached the stage of being crapped out so much that the OSDs' spurious
IOPS killed the performance of disks colocated on the same M610 RAID
controllers. This made etcd _very_ slow, to the point of churning
through re-elections due to timeouts.

etcd/apiserver latencies, observe the difference at ~15:38:

https://object.ceph-waw3.hswaw.net/q3k-personal/4fbe8d4cfc8193cad307d487371b4e44358b931a7494aa88aff50b13fae9983c.png

I moved gerrit/* and matrix/appservice-irc-freenode PVCs to ceph-waw3 by
hand. The rest were non-critical so I removed them, they can be
recovered from benji backups if needed.

Change-Id: Iffbe87aefc06d8324a82b958a579143b7dd9914c
2021-01-22 16:26:09 +01:00
q3k 3b9ee5f1c0 ceph: bump to 14.2.16
More as-builts. This has already been bumped. Had to coax ceph-waw2 to
upgrade despite the fact that it's horribly broken.

Change-Id: Ia762f5d7d88d6420c2fc25cf199037cbccde0cb3
2021-01-19 21:45:26 +00:00
q3k 2c04c8410a rook: bump to 1.2.7
As-built: deployed to ceph-waw{2,3} already.

Change-Id: I27189b273cf72638cf2036681054832db99591da
2021-01-19 21:41:13 +01:00
q3k f684535c6e k0: remove bc01n03 from nix defs
This only affects ETCD_INITIAL_* env vars, so it is effectively a no-op.

Deployed to prod.

Change-Id: Ic9118e17b088d1b58ebaf1ac0708a1ee6fcf2c06
2021-01-19 20:20:33 +01:00
q3k cf842b0442 k0: reflect reality
This is after the monster^Wrook outage of the week two weeks ago caused
by bc01n03 dying.

Plan is to migrate ceph-waw3 to be external, yeet ceph-waw2, and extend
crdb-waw1 to another node.

Change-Id: I133af3b1171fea383b45bf06c51e48a5c40341e4
2021-01-19 20:08:26 +01:00
q3k 9708ba02ec Merge "cluster: use static addresses" 2020-12-15 18:53:54 +00:00
q3k acdd665b08 cluster: use static addresses
This disables DHCP on all k0 nodes. This change has been tentatively
deployed to bc01n01 (which is cordoned off in kube), and I will deploy
it to the rest of k0 machines once merged.

Change-Id: I96253a9d0acedb4512c877c64174992ffdb43d58
2020-12-14 19:10:52 +01:00
patryk cae7cf776f k0: add missing curly brace termination in woju's S3 user name
Change-Id: Ib2752d798f6e23493daee446a834e244f858330e
2020-11-28 14:36:48 +01:00
patryk 34668a5b7b k0: add cz3's personal s3 user
Change-Id: I51ee80eb05c34cfd8b03e15fcaefb5f235587c50
2020-11-28 13:45:25 +01:00
q3k f18a531f9b prodvider: bump to Go 1.15.5
Change-Id: I0f7999deb571aef12533f0ceee21c0283bc0bdc4
2020-11-27 09:50:09 +00:00
q3k 0754ed86a2 prodvider: fix build after k8s update, add to CI presubmit
Change-Id: I5a3794541853abd1fb16e67e285edfa29c2f5cf7
2020-11-27 09:43:47 +00:00
q3k e00fe3a448 cluster/tools/kartongips: skip tests broken by fork
These tests are broken as they depend on some test data that we
currently don't have in hscloud. They should be fixed ASAP.

Change-Id: I2571c2958cb84e145a7e3a44171685ecf43cf499
2020-11-12 00:45:15 +01:00
q3k 640336144d cluster/tools: integrate kartongips as main kubecfg tool
Change-Id: If6a6c8e9c9163f0fc25adcaa8680857fdca69cd3
2020-11-12 00:40:08 +01:00
q3k be538db63b cluster/tools/kartongips: init
This forks bitnami/kubecfg into kartongips. The rationale is that we
want to implement hscloud-specific functionality that wouldn't really be
upstreamable into kubecfg (like secret support, multi-cluster support).

We forked off from github.com/q3k/kubecfg at commit b6817a94492c561ed61a44eeea2d92dcf2e6b8c0.

Change-Id: If5ba513905e0a86f971576fe7061a471c1d8b398
2020-11-12 00:39:34 +01:00
q3k bfe9bb0e3a k0: add woju's personal s3 user
Change-Id: I8ed5bb5428594b74460f1b89185d684cb6c26268
2020-10-27 20:50:50 +01:00
q3k e77f7717d4 k0: bump to 1.16.5
Change-Id: I548808ce4e0deb0513a1e00963f383d84b9d920c
2020-10-10 22:39:50 +02:00
q3k 1257389d3d k0: expose controller-manager and scheduler metrics
We want to be able to scrape controller-manager and scheduler metrics
into Prometheus. For that, each of them needs to:

 1) listen on a secure port
 2) have authn enabled

With this, any k8s user with the right permissions (and a bearer token
or TLS certificate) can come in and access metrics over a node's public
IP address. Access without a certificate/token gets thrown into the
system:anonymous user, which has no access to any API.

Change-Id: I267680f92f748ba63b6762e6aaba3c417446e50b
2020-10-10 16:00:15 +00:00
q3k 36224c617a clustercfg: show diff before switching to new configuration
This is mildly hacky, but lets us be more informed before we switch to a
new configuration.

Change-Id: I008f3f698db702f1e0992bd41a8d1050449d59b5
2020-10-10 16:00:11 +00:00
q3k 2e001e5046 k0: bump to 1.15.4
This notably fixes the annoying loopback issues that prevented hosts
from accessing externalip services with externalTrafficPolicy: local
from nodes that weren't running the service.

Which means, hopefully, no more registry pull failures when
nginx-ingress gets misplaced!

Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
2020-10-03 16:32:38 +00:00
q3k 2a223705fd cluster: bump certs
This has been deployed to k0 nodes.

Current state of cluster certificates:

cluster/certs/ca-etcd.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-etcdpeer.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kube.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kubefront.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kube-prodvider.cert
            Not After : Sep  1 21:30:00 2021 GMT
cluster/certs/etcd-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcd-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcd-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-calico.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/etcd-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/etcd-kube.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcdpeer-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcdpeer-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/etcdpeer-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/etcd-root.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-apiserver.cert
            Not After : Oct  3 15:26:00 2021 GMT
cluster/certs/kube-controllermanager.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kubefront-apiserver.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/kube-kubelet-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/kube-kubelet-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/kube-proxy.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-scheduler.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-serviceaccounts.cert
            Not After : Mar 28 15:15:00 2021 GMT

Change-Id: I94030ce78c10f7e9a0c0257d55145ef629195314
2020-10-03 16:32:32 +00:00
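
A listing like the one in 2a223705fd is easier to act on when sorted by
expiry. The following is a minimal sketch, assuming the input alternates
between a certificate path line and an openssl-style `Not After :` line as
above; the function name and sample data are illustrative and not part of
clustercfg. Month-name parsing assumes an English/C locale.

```python
from datetime import datetime, timezone


def parse_expiries(listing):
    """Return (path, expiry) pairs, soonest expiry first."""
    result = []
    path = None
    for raw in listing.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Not After :"):
            # Collapse the double space openssl emits for single-digit days.
            stamp = " ".join(line.split(":", 1)[1].split())
            when = datetime.strptime(stamp, "%b %d %H:%M:%S %Y GMT")
            result.append((path, when.replace(tzinfo=timezone.utc)))
        else:
            # Any other non-empty line is taken to be a certificate path.
            path = line
    return sorted(result, key=lambda item: item[1])


SAMPLE = """cluster/certs/ca-kube.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/etcd-kube.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-apiserver.cert
            Not After : Oct  3 15:26:00 2021 GMT
"""

expiries = parse_expiries(SAMPLE)
```

The first element of `expiries` is then the certificate that needs attention
soonest.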
q3k fbe234bdb2 cluster: rename module-* into modules/*
Change-Id: I65e06f3e9cec2ba0071259eb755eddbbd1025b97
2020-10-03 14:57:30 +00:00
q3k c7de7e562f cluster: do not export metallb routes to mesh peers
This prevents metallb routes being announced from all peers to our ToR,
thereby preventing issues with traffic hitting services with
externalTrafficPolicy: local.

There still is the from-host loopback issue, but that will be fixed by
upgrading to kube 1.15.

Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04
2020-10-03 14:56:52 +00:00
q3k f0acf16564 prodvider: use SANs in service certificates
This fixes compatibility with prodaccess tools built with Go 1.15, which
introduced 'X.509 CommonName deprecation' [1].

[1] - https://golang.org/doc/go1.15#commonname

Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3
2020-10-03 14:56:35 +00:00
q3k 44628f2b9e Merge "k0.hswaw.net: pass metallb through Calico" 2020-10-02 22:54:57 +00:00
q3k e7fca3acd8 ci_presubmit: init
This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).

For now, you can manually run it to ensure that Everything At Least
Kinda Works.

Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
2020-09-25 21:15:07 +00:00
q3k a5ed644980 k0.hswaw.net: pass metallb through Calico
Previously, we had the following setup:

                          .-----------.
                          | .....     |
                        .-----------.-|
                        | dcr01s24  | |
                      .-----------.-| |
                      | dcr01s22  | | |
                  .---|-----------| |-'
    .--------.    |   |---------. | |
    | dcsw01 | <----- | metallb | |-'
    '--------'        |---------' |
                      '-----------'

Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.

Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.

                      .-------------------------.
                      | dcr01s24                |
                      |-------------------------|
    .--------.        |---------.   .---------. |
    | dcsw01 | <----- | Calico  |<--| metallb | |
    '--------'        |---------'   '---------' |
                      '-------------------------'

This makes Calico announce our pod/service networks into our L3 fabric!

Calico and metallb talk to each other over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to passive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.

We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. registry access from containerd.

All routes pass through without route reflectors and a full mesh as we
use eBGP over private ASNs in our fabric.

We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.

This is all mildly hacky. Here's hoping that Calico will be able to some
day gain metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...

There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing, enabling route reflection
on dcsw01 (to recreate the mesh routing through dcsw01). Or, maybe by
some more hacking of the Calico BIRD config :/.

Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
2020-09-23 18:55:12 +00:00
q3k 059fdfed3b k0: add resource requests/limits to nginx, remove gitea
We just had an outage seemingly caused by N-I-C sending tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.

I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.

This change reflects ad-hoc production changes.

Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
2020-09-20 22:53:40 +00:00
q3k 242ec58a33 k0: add waw-hdd-redundant-q3k-3
Change-Id: Id3718877d1e67d48c6726d7649a565db657cfc82
2020-09-20 15:36:24 +00:00
patryk 8d069d8d1a cluster/certs: refresh prodvider CA
Change-Id: I35578fb62ddf10e7419c2c347e70322cf4ea0b6a
2020-09-01 22:02:52 +00:00
q3k 316411790a cluster/nix: update nodes
 - we update NixOS to 20.09pre
 - we fix an ACME option that's now required
 - we switch from systemd-timesyncd to chrony (as timesyncd took a long
   time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
   from ceph)

This has been deployed in production.

Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
2020-08-23 00:58:29 +02:00
q3k bc73a44519 cluster/clustercfg: fix BUILD
This is continued fallout after migrating from rules_pip.

Change-Id: Idb9b4d4f22aa36512d220ac31375bae7a0f25e4e
2020-08-22 20:33:37 +00:00
q3k d5918c8e72 cluster: change q3k's laptop key
Paranoia is dead, long live Mimeomia.

This has already been deployed to production.

Change-Id: Ibbc5015b5277380a3450f76e62d3fab6e71be1a0
2020-08-22 22:29:42 +02:00
q3k 0581bbf8a0 games/factorio: add modproxy
This adds a mod proxy system, called, well, modproxy.

It sits between Factorio server instances and the Factorio mod portal,
allowing for arbitrary mod download without needing the servers to know
Factorio credentials.

Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212
2020-08-14 13:03:46 +02:00
q3k 3d29484ebb k0: move registry to ceph-waw3
ceph-waw2 has currently some production issues [1] which have started to
cause write failures in the registry. The registry is the only user of
ceph-waw2's affected pool, so we reduce the dumpster fire blast radius
by moving it over to ceph-waw3.

This has already been deployed and data has been migrated over (via
s3cmd sync), and the migration has been verified (by a push and pull,
and pull of an older image).

[1] - pgs stuck inactive in the object storage pool

Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347
2020-08-04 01:36:51 +02:00
q3k 4ded56ab8a prodvider: emit client/server cert
Change-Id: I024782a7dfa6e16ff5f562a62ddd8fe3bf299c51
2020-08-01 22:01:05 +02:00
q3k f3312ef77e *: developer machine HSPKI credentials
In addition to k8s certificates, prodaccess now issues HSPKI
certificates, with DN=$username.sso.hswaw.net. These are installed into
XDG_CONFIG_HOME (or os equiv).

//go/pki will now automatically attempt to load these certificates. This
means you can now run any pki-dependent tool with -hspki_disable, and
with automatic mTLS!

Change-Id: I5b28e193e7c968d621bab0d42aabd6f0510fed6d
2020-08-01 17:15:52 +02:00
q3k 509ab6e29a k0/cockroach: add public DNS entry for cockroach
Change-Id: I934bf348e2165148b515b709e853ab67f039a402
2020-07-30 22:56:30 +02:00
implr cae27ecd99 Replace rules_pip with rules_python; use bazel built upstream grpc
instead of Python packages

As usual with Python sadness, the @pydeps wheels are built on the bazel
host, so stuffing them inside a container_image (or py_image) will cause
new and unexpected kinds of misery.

Change-Id: Id4e4d53741cf2da367f01aa15c21c133c5cf0dba
2020-07-08 18:55:34 +02:00
informatic 97a6ca8a8b Merge "cluster/kube/lib/nginx: add gitea-prod ingress service" 2020-07-02 17:15:53 +00:00
informatic 0697e01144 cluster/kube/lib/registry: allow auth'd users to pull all images
"Anyone can pull all images" rule did only match on anonymous users. Now
it should match all users, including authenticated ones.

Change-Id: I2205299093feca51f30526ba305eadbaa0a68ecb
2020-07-02 18:45:42 +02:00
informatic f00edf6ee8 cluster/kube/lib/nginx: add gitea-prod ingress service
We would like gitea to have its ssh server exposed on TCP port 22 on the
same address as its web interface. We would also still like to use all
the automation around ingresses already in place (like cert-manager
integration).

To solve this, we create an additional LoadBalancer service for
nginx-ingress-controller and set up special tcp-services forwarding rule
to pass port 22 traffic to gitea-prod/gitea service, like we already do
in case of gerrit.

Change-Id: I5bfc901ebe858464f8e9c2f3b2216b254ccd6c4d
2020-07-02 18:30:38 +02:00
q3k b1aadd88ff k0: add q3k's personal s3 user
Change-Id: I5681774e1dca2cf4a865d9e1a24602ed4334f006
2020-06-24 17:19:36 +00:00
q3k 0037edaa5b cluster/tools/rook-s3cmd-config: build using bazel
This turns the existing script into a proper sh_binary, and injects
dependencies (kubectl and jq) as deps into it.

This change also pulls in BUILDfiles for jq, and a dep (oniguruma) into
//third_party, and adds buildable external repositories for them.

The jq/oniguruma BUILDfiles are lifted from
https://github.com/attilaolah/bazel-tools/.

Change-Id: If2e548bd60a8fd34e4f3be767ae59c6b2f2286d9
2020-06-13 22:46:41 +02:00
implr d9df5879e3 add radosgw bucket for spark
Change-Id: Id8ea8901ce038ccbf11afabe0e6272c358b32cf2
2020-06-13 21:31:56 +02:00
q3k 9b2ce179a8 Merge "cluster/kube: split up cluster.jsonnet" 2020-06-13 17:52:27 +00:00
q3k dbfa988c73 cluster/kube: split up cluster.jsonnet
It was getting large and unwieldy (to the point where kubecfg was slow).
In this change, we:

 - move the Cluster function to cluster.libsonnet
 - move the Cluster instantiation into k0.libsonnet
 - shuffle some fields around to make sure things are well split between
   k0-specific and general cluster configs.
 - add 'view' files that build on 'cluster.libsonnet' to allow rendering
   either the entire k0 state, or some subsets (for speed)
 - update the documentation, drive-by some small fixes and reindentation

Change-Id: I4b8d920b600df79100295267efe21b8c82699d5b
2020-06-13 19:51:58 +02:00
q3k 66a26a8f02 WORKSPACE: remove nixpkgs/rules_nix
We're not using them for anything. Initially they were going to be used
for nixops, but nixops is not very good, so let's just drop them.

We still have a Nix dependency for clustercfg.py when provisioning
nodes, but rules_nix/nixpkgs in WORKSPACE were unrelated to that.

Change-Id: I28c249507d1be9c5dbbd1ee764deccd9ab038549
2020-06-07 02:22:14 +02:00
q3k ce81c39081 ops/metrics: basic cluster setup with prometheus
We handwavingly plan on implementing monitoring as a two-tier system:

 - a 'global' component that is responsible for global aggregation,
   long-term storage and alerting.
 - multiple 'per-cluster' components, that collect metrics from
   Kubernetes clusters and export them to the global component.

In addition, several lower tiers (collected by per-cluster components)
might also be implemented in the future - for instance, specific to some
subprojects.

Here we start sketching out some basic jsonnet structure (currently all
in a single file, with little parametrization) and a cluster-level
prometheus server that scrapes Kubernetes Node and cAdvisor metrics.

This review is mostly to get this committed as early as possible, and to
make sure that the little existing Prometheus scrape configuration is
sane.

Change-Id: If37ac3b1243b8b6f464d65fee6d53080c36f992c
2020-06-06 15:56:10 +02:00
q3k 7371b7288b tools/secretstore: add sync command, re-encrypt
This kills two birds with one stone:

 - update the secretstore tool to be slightly smarter about secrets, to
   the point where we can now just point it at a secret directory and
   ask it to 'sync' all secrets in there
 - runs the new fancy sync command on all keys to update them, which
   is a follow up to gerrit/328.

Change-Id: I0eec4a3e8afcd9481b0b248154983aac25657c40
2020-06-04 19:25:07 +00:00
patryk c410432d94 personal/patryk/arma3: create a S3 bucket account for Arma3 mods
Change-Id: Idd31b5f46fcaebfcd72334dc82fbc8df805203b1
2020-06-04 18:51:51 +02:00
informatic cb96eb6df6 Merge "crdb.k0: add sso client" 2020-05-31 12:26:04 +00:00
q3k e55493f635 calico: fix access to resources from controller
This fixes even more networking issues.

Change-Id: I754656a01e3de8a34055280908b343a1a25a4707
2020-05-30 17:57:05 +02:00
q3k ba375e62b2 calico: fix node name selection
This was an attempt to make new calico nodes use a full FQDN. However,
this change seemingly also makes the calico control plane use the FQDN
for all existing nodes, as such breaking CNI for new pods.

We revert this change, thereby keeping all calico nodes names as
hostnames. We could fix this by editing /var/lib/calico/nodename on
hosts to FQDNs, but it might not be worth the effort.

See https://github.com/projectcalico/calico/issues/1093 for more
context.

Change-Id: I52bfb00f604053d57d3009aebd6c50db7dc74f58
2020-05-30 16:18:13 +02:00
informatic 42da0e9aec crdb.k0: add sso client
Change-Id: I7490a3594694d61a19910e436983937667ed34bd
2020-05-30 14:34:33 +02:00
q3k d81bf72d7f calico: upgrade to 3.14, fix calicoctl
We still use etcd as the data store (and as such didn't set up k8s CRDs
for Calico), but that's okay for now.

Change-Id: If6d66f505c6b40f2646ffae7d33d0d641d34a963
2020-05-28 16:47:16 +02:00
q3k 1223cde4d4 cluster: fix nuke's personal storage
Change-Id: I422a6d9f7a483e7c44cc8dfd8c0d8a98d9e17e46
2020-05-16 17:38:23 +02:00
q3k 741c08f66c cluster: add nuke's personal storage
He needs some personal backup space, and we have enough best effort
spare capacity for that.

Change-Id: I75ed6f62e79d33907c0974ec5f2839389ce62543
2020-05-14 18:13:53 +00:00
q3k a168c50132 SECURITY: cluster: limit api objects modifiable by namespace admins
This previously allowed all namespace admins (ie. personal-$user namespace
users) to create any sort of object they wanted within that namespace.

This could've been exploited to allow creation of a RoleBinding that
would then allow to bind a serviceaccount to the insecure
podsecuritypolicy, thereby allowing escalation to root on nodes.

As far as I've checked, this hasn't been exploited, and the access to
the k8s cluster has so far also been limited to trusted users.

This has been deployed to production.

Change-Id: Icf8747d765ccfa9fed843ec9e7b0b957ff27d96e
2020-05-11 20:49:31 +02:00
q3k d436de2010 cluster/rook: bump to 1.1.9
This bumps Rook/Ceph. The new resources (mostly RBAC) come from
following https://rook.io/docs/rook/v1.1/ceph-upgrade.html .

It's already deployed on production. The new CSI driver has not been
tested, but the old flexvolume-based provisioners still work. We'll
migrate when Rook offers a nice solution for this.

We've hit a kubecfg bug that does not allow controlling the CephCluster
CRD directly anymore (I had to apply it via kubecfg show / kubectl apply
-f instead). This might be due to our bazel/prod k8s version mismatch,
or it might be related to https://github.com/bitnami/kubecfg/issues/259.

Change-Id: Icd69974b294b823e60b8619a656d4834bd6520fd
2020-05-02 23:30:52 +02:00
Bartosz Stebel 98ef1518e0 add vpn insecure namespace
Change-Id: I8a774ae625342af3521ad0ab11a8f6d4e4ef6c97
2020-04-24 13:28:38 +02:00
q3k 8adbd49051 *: more hackdoc updates
Change-Id: Ib9830c66fe36c423d38f447905c470b67cde5399
2020-04-10 22:10:18 +02:00
q3k 4f7cc0064f Revert "*: update docs for hackdoc"
This reverts commit cc8c69c897.

Reason for revert: <INSERT REASONING HERE>

Change-Id: I1315e930e2ef69db3188eda05e4aa0b12db24274
2020-04-10 20:09:35 +00:00
q3k cc8c69c897 *: update docs for hackdoc
Change-Id: I256ec4499da2289f8f7ea3766ce40f2b0ffb0dc1
2020-04-10 21:20:53 +02:00
q3k c881cf3c22 devtools/hackdoc: init
This is hackdoc, a documentation rendering tool for monorepos.

This is the first code iteration, that can only serve from a local git
checkout.

The code is incomplete, and is WIP.

Change-Id: I68ef7a991191c1bb1b0fdd2a8d8353aba642e28f
2020-04-08 20:03:12 +02:00
q3k 0dcc702c64 cluster: bump nearly-expired certs
This makes clustercfg ensure certificates are valid for at least 30
days, and renew them otherwise.

We use this to bump all the certs that were about to expire in a week.
They are now valid until 2021.

There's still some certs that expire in 2020. We need to figure out a
better story for this, especially as the next expiry is 2021 - today's
prod rollout was somewhat disruptive (basically this was done by a full
cluster upgrade-like rollout flow, via clustercfg).

We also drive-by bump the number of mons in ceph-waw3 to 3, as it should
be (this gets rid of a nasty SPOF that would've bitten us during this
upgrade otherwise).

Change-Id: Iee050b1b9cba4222bc0f3c7bce9e4cf9b25c8bdc
2020-03-28 18:01:40 +01:00
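
The renewal rule described in 0dcc702c64 (keep a certificate only if it is
valid for at least 30 more days) boils down to a single comparison. A minimal
sketch, assuming the expiry has already been read from the certificate; the
function and constant names are hypothetical and not taken from clustercfg.

```python
from datetime import datetime, timedelta, timezone

# Minimum remaining validity before a certificate gets reissued,
# per the commit message above.
MIN_VALIDITY = timedelta(days=30)


def needs_renewal(not_after, now, min_validity=MIN_VALIDITY):
    """True if the certificate expires less than min_validity after now."""
    return not_after - now < min_validity


now = datetime(2020, 3, 28, tzinfo=timezone.utc)
expiring_soon = datetime(2020, 4, 4, tzinfo=timezone.utc)
freshly_issued = datetime(2021, 3, 28, tzinfo=timezone.utc)
```

Passing `now` explicitly keeps the check deterministic, which makes it easy
to test against fixed dates such as the ones above.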
q3k 90e8e68bab crdb.k0: add bugless-dev (for q3k)
Change-Id: I3988e1c37f0a0c54ef1ba248f01e026d6e8c72b6
2020-03-25 10:55:05 +01:00
q3k e186c87c1b cluster: bump rook to 1.0.6
In preparation for updating to 1.1.0, which will be much more involved.

Also fix a typo in registry.libsonnet, whoops.

Change-Id: I7668bf53c7580f99fdf56fe6227f04a468f8de50
2020-02-21 12:57:02 +01:00
q3k 114edc2398 kube/mirko: add kube.CephObjectStoreUser
Change-Id: I2a67076eeaf41ada41f5ae3ee588025e4c16b9e1
2020-02-18 22:55:13 +01:00
q3k 0d83300b18 cluster: set ceph-waw3 mon replicas to 1
This reflects current production. This needs to get bumped up to 3 at some point as otherwise we lose HA for this cluster.

Change-Id: Ie5937e6a216b635ecbc4c82ecd182a410167c3f8
2020-02-15 11:48:39 +00:00
q3k 58d08595f1 {cluster,}/README: update
Change-Id: Ie211fd34316c407f29506b67187632fd22a4f75b
2020-02-15 01:00:42 +01:00
q3k d7364520e9 cluster: bump kubelets to 1.14.3
Change-Id: I02ed978a49629cdfc3f3587ad640e8cc5a5fad23
2020-02-02 23:43:28 +01:00
q3k e2095b2ce9 cluster: remove unused module-cluster.nix
Change-Id: I819d803fc7454cfd63a11a109ec73c9578f598b8
2020-02-02 23:43:00 +01:00
q3k c78cc13528 cluster/nix: locally build nixos derivations
We change the existing behaviour (copy files & run nixos-rebuild switch)
to something closer to nixops-style. This now means that provisioning
admin machines need Nix installed locally, but that's probably an okay
choice to make.

The upside of this approach is that it's easier to debug and test
derivations, as all data is local to the repo and the workstation, and
deploying just means copying a configuration closure and switching the
system to it. At some point we should even be able to run the entire
cluster within a set of test VMs.

We also bump the kubernetes control plane to 1.14. Kubelets are still at
1.13 and their upgrade is coming up today too.

Change-Id: Ia9832c47f258ee223d93893d27946d1161cc4bbd
2020-02-02 22:31:53 +01:00
q3k aa76e55eea cert-manager: fix DNS for http01 k0 splitdns
Change-Id: I73847daec9796cb891cf2fe58c2633c5fa768861
2019-12-29 02:49:30 +01:00
q3k 0c337acf89 benji: fix in waw2, run in waw3
This needed an upstream change to allow only some pools to be backed up,
otherwise benji would crash when stumbling upon the first PVC from a
pool that wasn't backed by the ceph cluster it was acting upon.

Change-Id: I52bf163c16352cb59fdd3dbdd576145ce1dbac03
2019-12-21 23:45:07 +01:00
q3k ba8e79e8f4 kube-apiserver: fix cert mismatch, again
This time from a bare hscloud checkout to make sure _nothing_ is fucked
up.

This causes no change remotely, just makes the repo reflect reality.

Change-Id: Ie8db01300771268e0371c3cdaf1930c8d7cbfb1a
2019-12-17 02:13:55 +01:00
q3k 050af01b83 cluster: add q3k's new SSH key
Change-Id: I872a75cc89a62c9487433fa5e8e5767953e309c9
2019-12-17 01:58:58 +01:00
q3k e5a956a1c8 *: bump to q3k's kubecfg, kubernetes 1.16
Change-Id: I302876d5a45cbfb63d87ad9f6ea9aaeff7bec17d
2019-11-17 22:38:40 +01:00
q3k fd323a0f55 cluster: sync to prod
Change-Id: If311f1ce44653bb54e0a10ad2fdd65685722a64d
2019-11-17 19:49:04 +01:00
q3k 96c428f7d7 nixops: fix
Change-Id: I15ebde319fcae3f9771da6a549e52783e0ec4409
2019-11-17 19:00:46 +01:00
q3k c33ebcc79f cluster: add ceph-waw3, move metallb to bgp
Change-Id: Iebf369f9a02e44be163ef4afc2e0f23c4b009898
2019-11-01 18:43:45 +01:00
q3k e67f6fec98 cluster/secrets: really try to fix apiserver key/cert
Change-Id: I6b0ea601246b665585adb040b9819344bc683e78
2019-10-31 17:36:44 +01:00
q3k 737cafd548 cluster/certs: fix kube-apiserver
key/cert mismatch :/

Change-Id: I3601a18d3ab1eae4183b59be43c497cd27dfe704
2019-10-31 17:30:48 +01:00
q3k d493ab66ca *: add dcr01s{22,24}
Change-Id: I072e825e2e1d199d9da50b9d38a9ffba68e61182
2019-10-31 17:07:50 +01:00
q3k 6f773e0004 smsgw: productionize, implement kube/mirko
This productionizes smsgw.

We also add some jsonnet machinery to provide a unified service for Go
micro/mirkoservices.

This machinery provides all the nice stuff:
 - a deployment
 - a service for all your types of ports
 - TLS certificates for HSPKI

We also update and test hspki for a new name scheme.

Change-Id: I292d00f858144903cbc8fe0c1c26eb1180d636bc
2019-10-04 13:52:34 +02:00
q3k d186e9468d cluster: move prodvider to kubernetes.default.svc.k0.hswaw.net
In https://gerrit.hackerspace.pl/c/hscloud/+/70 we accidentally
introduced a split-horizon DNS situation:

 - k0.hswaw.net from the Internet resolves to nodes running the k8s API
   servers, and as such can serve API server traffic
 - k0.hswaw.net from the cluster returned no results

This broke prodvider in two ways:
 - it dialed the API servers at k0.hswaw.net
 - even after the endpoint was moved to
   kubernetes.default.svc.k0.hswaw.net, the apiserver cert didn't cover
   that

Thus, not only did we have to change the prodvider endpoint, but we also
had to change the APIserver certs to cover this new name.

I'm not sure this should be the target fix. I think at some point we
should only start referring to in-cluster services via their full (or
cluster.local) names, but right now k0.hswaw.net is an exception and as
such a split, and we have no way to access the internal services from
the outside just yet.

However, getting prodvider to work is important enough that this fix is
IMO good enough for now.

Change-Id: I13d0681208c66f4060acecc78b7ae14b8f8d7125
2019-10-04 13:52:34 +02:00
q3k e31d64f265 kube: move cert-manager resources to kube.local.libsonnet
This way kubernetes consumers don't have to import anything from
cluster/, hopefully.

We also create a small abstraction for local additions for
kube.libsonnet without having to modify upstream.

Change-Id: I209095781f91c8867250a647fe944370cddd67d0
2019-10-02 21:03:13 +02:00
q3k 54490d385e cluster/coredns: add cluster fqdn top level domain
This means that in addition to services being discoverable the 'classic'
way:

    <svcname>.<namespace>.svc.cluster.local

They are now discoverable as:

    <svcname>.<namespace>.svc.<fqdn>

For instance, on k0 you can now internally resolve:

    $ kubectl run --rm -it foo --image=nixery.dev/shell/dnsutils bash
    bash-4.4# dig +short coffee-svc.default.svc.k0.hswaw.net
    10.10.12.192
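
The two naming schemes above can be sketched as a small helper. This is
only an illustration: the function name service_dns_names is hypothetical
and not part of hscloud; the example values are the ones from the dig
session above.

```python
from typing import List


def service_dns_names(svc: str, namespace: str, cluster_fqdn: str) -> List[str]:
    """Return the in-cluster DNS names under which a service can be resolved.

    Hypothetical helper for illustration only; it just formats the two
    name patterns described in this commit message.
    """
    return [
        # The 'classic' Kubernetes name.
        f"{svc}.{namespace}.svc.cluster.local",
        # The additional name under the cluster's FQDN.
        f"{svc}.{namespace}.svc.{cluster_fqdn}",
    ]


if __name__ == "__main__":
    for name in service_dns_names("coffee-svc", "default", "k0.hswaw.net"):
        print(name)
```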

Change-Id: Ie6875b54ed6358f30f888ca0cd96e011520ace20
2019-10-02 20:49:13 +02:00
q3k 95868eeddc benji: back up daily instead of hourly
Every benji backup seems to cycle blocks (eg. delete some and recreate
them).

Since wasabi has a minimum billing retention policy of 90 days, this
means that every object that is uploaded and then deleted an hour later
still costs us.

Currently we seem to be storing around 200G of data in wasabi for Benji
but already have 600G of deleted objects. This is suboptimal.

This change has already been deployed on production.
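
A rough sketch of the arithmetic behind this, assuming that objects
deleted before the 90-day minimum are still billed as if they were
stored (the function name billed_gb is made up for this example, and the
figures are the approximate ones quoted above):

```python
def billed_gb(live_gb: float, deleted_in_retention_gb: float) -> float:
    """Storage we pay for: live data plus deleted data still inside the
    minimum billing retention window (assumption: both are billed alike)."""
    return live_gb + deleted_in_retention_gb


if __name__ == "__main__":
    live = 200.0     # approximate live Benji data, in GB
    deleted = 600.0  # approximate deleted-but-still-billed data, in GB
    total = billed_gb(live, deleted)
    # Ratio of what we pay for to what we actually keep.
    print(total, total / live)
```

Backing up less often means fewer blocks get cycled, so the deleted
share should grow more slowly.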

Change-Id: I67302d23a1c45974fb5d51ec9a8cff28260830dc
2019-09-26 21:49:24 +00:00
q3k 57515a2525 Merge "rules_pip: update to new version" 2019-09-25 12:05:58 +00:00
q3k 5f9b1ecd67 rules_pip: update to new version
rules_pip has a new version [1] of their rule system, incompatible with the
version we used, that fixes a bunch of issues, notably:
 - explicit tagging of repositories for PY2/PY3/PY23 support
 - removal of dependency on host pip (in exchange for having to vendor
   wheels)
 - higher quality tooling for locking

We update to the newer version of rules_pip, rename the external
repository to pydeps and move requirements.txt, the lockfile and the
newly vendored wheels to third_party/, where they belong.

[1] - https://github.com/apt-itude/rules_pip/issues/16

Change-Id: I1065ee2fc410e52fca2be89fcbdd4cc5a4755d55
2019-09-25 14:05:07 +02:00
q3k 5f3a5e0310 cluster/kube: emergency fixes after eviction
Some pods got evicted. Some of them broke.

  - postgres in matrix and nginx in internet because of the new policies
    (chown issues)
  - cas proxy in matrix because apparently the image was not reuploaded
    to the registry after ceph-waw1 died, and another node didn't have it
  - registry because it had a weak image pin and downgraded to some
    broken version on another node

Change-Id: I836036872629843c8ede1b7f67982112c90d71f0
2019-09-25 02:58:15 +02:00
q3k db2a2a029f Merge "Get in the Cluster, Benji!" 2019-09-18 20:40:12 +00:00
q3k a01c487a6e cluster: allow insecure pods in rook-ceph-system
This is required for the agent to start a socket on each host for
kubelet-to-rook access.

Change-Id: I78529df81185aeaacdcb494138f72f0224a029c6
2019-09-05 16:01:19 +00:00
q3k 13bb1bf4e3 Get in the Cluster, Benji!
Here we introduce benji [1], a backup system based on backy2. It lets us
backup Ceph RBD objects from Rook into Wasabi, our offsite S3-compatible
storage provider.

Benji runs as a k8s CronJob, every hour at 42 minutes. It does the
following:
 - runs benji-pvc-backup, which iterates over all PVCs in k8s, and backs
   up their respective PVs to Wasabi
 - runs benji enforce, marking backups outside our backup policy [2] as
   to be deleted
 - runs benji cleanup, to remove unneeded backups
 - runs a custom script to backup benji's sqlite3 database into wasabi
   (unencrypted, but we're fine with that - as the metadata only contains
   image/pool names, thus Ceph PV and pool names)

[1] - https://benji-backup.me/index.html
[2] - latest3,hours48,days7,months12, which means the latest 3 backups,
      then one backup for the next 48 hours, then one backup for the next
      7 days, then one backup for the next 12 months, for a total of 65
      backups (deduplicated, of course)
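
As a sketch of the "every hour at 42 minutes" schedule: in cron syntax
this would typically be written as "42 * * * *" (an assumption, the
actual CronJob spec is not quoted here). The helper next_run below is
hypothetical and only illustrates when the next run would fire.

```python
from datetime import datetime, timedelta


def next_run(now: datetime) -> datetime:
    """Return the next time strictly after `now` that falls on minute 42.

    Mirrors an hourly cron schedule of "42 * * * *" (assumed form).
    """
    candidate = now.replace(minute=42, second=0, microsecond=0)
    if candidate <= now:
        # Minute 42 of this hour has already passed; use the next hour.
        candidate += timedelta(hours=1)
    return candidate


if __name__ == "__main__":
    print(next_run(datetime(2019, 9, 2, 16, 33, 2)))
```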

We also drive-by update some docs (make them more separated into
user/admin docs).

Change-Id: Ibe0942fd38bc232399c0e1eaddade3f4c98bc6b4
2019-09-02 16:33:02 +02:00
q3k 9496d9910a cluster: add nextcloud user for object store
Change-Id: Ib08be16f71ff5e1b72ca6ad436de4b12427dd407
2019-09-02 16:33:02 +02:00
q3k 42553cd044 cluster: disable unauthenticated read only port on kubelets
This port was leaking kubelet state, including information on running
pods. No secrets were leaked (if they were not text-pasted into
env/args), but this still shouldn't be available.

As far as I can tell, nothing depends on this port, other than some
enterprise load balancers that require HTTP for node 'health' checks.

Change-Id: I9549b73e0168fe3ea4dce43cbe8fdc2ca4575961
2019-09-02 16:33:02 +02:00
q3k 896926c921 prodvider: clean up LDAP connections
Change-Id: Ic95e6d1b845832fa0fb2da51b418bcdcb8fd05c4
2019-08-31 15:00:51 +02:00
q3k 71a21c7693 rook/ceph: bump
Change-Id: I046df292cad11650adb829cc8a73100cc1d1ecc8
2019-08-30 23:08:26 +02:00
q3k b13b7ffcdb prod{access,vider}: implement
Prodaccess/Prodvider allow issuing short-lived certificates for all SSO
users to access the kubernetes cluster.

Currently, all users get a personal-$username namespace in which they
have administrative rights. Otherwise, they get no access.

In addition, we define a static CRB to allow some admins access to
everything. In the future, this will be more granular.

We also update relevant documentation.

Change-Id: Ia18594eea8a9e5efbb3e9a25a04a28bbd6a42153
2019-08-30 23:08:18 +02:00
q3k d16454badc cert-manager: bump to v0.9.1
We just got this email:

We've been working with Jetstack, the authors of cert-manager, on a
series of fixes to the client. Cert-manager sometimes falls into a
traffic pattern where it sends really excessive traffic to Let's
Encrypt's servers, continuously. To mitigate this, we plan to start
blocking all traffic from cert-manager versions less than 0.8.0 (the
current semver minor release), as of November 1, 2019. Please upgrade
all of your cert-manager instances before then.

We're sending this email because this is the contact address of your
cert-manager instance at:

 185.236.240.37 .

Version 0.8.0 is much better but we still observe excessive traffic in
some cases. We're working with Jetstack to improve these cases. As new
versions of cert-manager are released, we will add the non-current
versions to our block list after 3 months. We strongly encourage
cert-manager users to stay up-to-date with new versions.

Also, there is an opportunity to help both Jetstack and Let's Encrypt.
Once you've upgraded, please check the logs for your cert-manager
instances from time to time. Are they making excessive requests to Let's
Encrypt (more than, say, 10 per day over multiple days)? If so, please
share details at https://github.com/jetstack/cert-manager/issues/1948 .

Thanks,
Let's Encrypt Team

Change-Id: Ic7152150ac1c96941423878c6d4b6209e07429cf
2019-08-29 17:21:49 +02:00
q3k 1fad2e5c6e bgpwtf/cccampix: draw the rest of the fucking owl
Change-Id: I49fd5906e69512e8f2d414f406edc0179522f225
2019-08-11 23:43:25 +02:00
q3k d533892efa Fix crdb-waw1
We accidentally created crdb-waw2 in
https://gerrit.hackerspace.pl/c/hscloud/+/2.

We remove it now and also backport a manual change that makes the
crdb-waw1 service public via a LoadBalancer.

Change-Id: I3bbd6f01b82c6efa458cc44776f086ba36e9f20c
2019-08-11 23:42:47 +02:00
q3k d07861b7df ceph-waw1 -> ceph-waw2
Change-Id: I03d6244b9697a9efc06492114ef90cdb01e17601
2019-08-08 17:49:31 +02:00
q3k f774f2f31d Merge "app/registry: integrate into cluster/kube" 2019-08-02 00:28:10 +00:00
q3k 654c70dad7 cluster/tools/install.sh: fix nixops graceful degradation
Nixops requires nix_rules, which in turn requires a working nix
installation.

When we split tools/install.sh into tools/install.sh and
cluster/tools/install.sh [1], we accidentally made the latter always install
all cluster tools, including nixops - even if the install.sh script
detected that the system does not have Nix installed.

[1] - https://gerrit.hackerspace.pl/c/hscloud/+/81

Change-Id: Ib5357cfe125f1393b395b28062787f3f0091f549
2019-07-23 01:37:11 +02:00
q3k 4d61d20aec app/registry: integrate into cluster/kube
This makes a registry be automatically part of the cluster
infrastructure.

Tested by running kubecfg diff, no diffs (apart from out-of-date ACLs)
found.

Change-Id: Ic0635e789cf3fb851f410bcf2865326f1fa87545
2019-07-21 16:56:41 +02:00
q3k 1663e0e93b tools: move cluster-specific stuff to cluster/tools
Change-Id: I1813bb221d1bff0d6067eceb84d23510face60ff
2019-07-21 14:26:51 +00:00
q3k 116da981c9 nix/ -> cluster/nix/
These are related to cluster bootstrapping, not generic language
libraries (like go/ and bzl/).

Change-Id: I03a83c64f3e0fa6cb615d36b4e618f5e92d886ec
2019-07-21 15:53:20 +02:00
Serge Bazanski 2ce367681a *: move away from python_rules
python_rules is completely broken when it comes to py2/py3 support.

Here, we replace it with native python rules from new Bazel versions [1] and rules_pip for PyPI dependencies [2].

rules_pip is somewhat little known and experimental, but it seems to work much better than what we had previously.

We also unpin rules_docker and fix .bazelrc to force Bazel into Python 2 mode - hopefully, this repo will now work
fine under operating systems where `python` is python2 (as the standard dictates).

[1] - https://docs.bazel.build/versions/master/be/python.html

[2] - https://github.com/apt-itude/rules_pip

Change-Id: Ibd969a4266db564bf86e9c96275deffb9610dd44
2019-07-16 22:22:05 +00:00
q3k 92be486f39 Revert "cluster/kube/lib/nginx: use Local traffic policy"
This reverts commit 09a0f06d2a.

Reason for revert: prevents registry from being accessible on nodes:

q3k@anathema ~/Software/hscloud $ curl registry.k0.hswaw.net
<html>
[..., ok]

[root@bc01n03:~]# curl registry.k0.hswaw.net
^C

Change-Id: I0da97aaf7a8791ea3f62c70b6c1502f4a48a300f
2019-06-29 22:58:19 +00:00
q3k 09a0f06d2a cluster/kube/lib/nginx: use Local traffic policy
Diff against prod:

  - live services nginx-system.ingress-nginx
  + config services nginx-system.ingress-nginx
    {
      "apiVersion": "v1",
      "kind": "Service",
      "metadata": {
        "annotations": {},
        "labels": {
          "app.kubernetes.io/name": "ingress-nginx",
          "app.kubernetes.io/part-of": "ingress-nginx"
        },
        "name": "ingress-nginx",
        "namespace": "nginx-system"
      },
      "spec": {
  -     "externalTrafficPolicy": "Cluster",
  +     "externalTrafficPolicy": "Local",
        "ports": [
          {
            "name": "ssh",
            "port": 22,
            "protocol": "TCP",
            "targetPort": 22
          },
          {
            "name": "http",
            "port": 80,
            "protocol": "TCP",
            "targetPort": 80
          },
          {
            "name": "https",
            "port": 443,
            "protocol": "TCP",
            "targetPort": 443
          }
        ],
        "selector": {
          "app.kubernetes.io/name": "ingress-nginx",
          "app.kubernetes.io/part-of": "ingress-nginx"
        },
        "type": "LoadBalancer"
      }
    }

Change-Id: I0dd66e3f1643efa975d6180cc163a265d4b484ef
2019-06-29 22:44:53 +02:00
q3k 543b412a65 cluster/kube/lib/nginx: add gerrit forwarding
This has been running in production since gerrit was deployed - it
just got lost during submit.

Change-Id: I8a1580b1ca3ec3142a8fa4320dc9f51a599a914f
2019-06-29 22:42:39 +02:00
q3k 59f5fd315c cluster/openssl.cnf: remove
This was used in the old openssl-based TLS certificate generation code.

Change-Id: I5da8c5b012b6af8c2f8b990237b3c4933b90a349
2019-06-25 15:02:45 +02:00
q3k 184678b0f4 cluster/kube/lib/cockroachdb: clean up topology
IP addresses are not necessary in the topology definitions of a
cockroach cluster.

They were mis-committed leftovers from trying to run the cluster on
DaemonSets with hostNetworking: true.

Change-Id: I4ef1f6ed9a745efc6b05846bc13aba9d1f8dc7c8
2019-06-22 21:18:29 +00:00
q3k dec401c7dd cluster/kube/lib/cockroach: move client to deployment
This prevents a bug where kubecfg fails to update the client pod when
running a cluster/kube/cluster.jsonnet update. The pod update is
attempted because of runtime/intent differences at serviceAccounts
specification, which causes kubecfg to see a diff, which causes it to
attempt an update, which causes kube-apiserver to reject the change
(because pods are immutable), which causes kubecfg to fail.

Change-Id: I20b0ecbb264213a2eb483d475c7683b4965c82be
2019-06-22 23:14:25 +02:00
q3k c7258f4644 cluster/kube: refactor, add crdb-waw1 2019-06-21 00:24:09 +02:00
q3k e53e39a8be cluster/kube/lib/cockroachdb: use manual node pinning
We move away from the StatefulSet based deployment to manually starting
a deployment per intended node. This allows us to pin individual
instances of Cockroach to particular nodes, so that they stay
co-located with their data.
2019-06-20 23:36:35 +02:00
q3k 662a3cdcca cluster/kube/lib/cockroachdb: refactor
We refactor this library to:

 - support multiple databases, but with a strong suggestion of having
   one per k8s cluster
 - drop the database creation logic
 - redo naming (allowing for two options: multiple clusters per
   namespace or an exclusive namespace for the cluster)
 - unhardcode dns names
2019-06-20 19:45:03 +02:00
q3k 224a50bbfe cluster/kube/lib/cockroach: fix imports 2019-06-20 16:43:01 +02:00
q3k 3c117fa841 make cockroachdb into a cluster service 2019-06-20 16:43:01 +02:00
q3k c3b0f7627c cluster/kube: set operator replicas to 0 2019-06-20 16:42:19 +02:00
q3k c0fc3ee442 cluster/clustercfg: add clustercfg-nocerts 2019-06-20 16:11:38 +02:00
q3k f970a7ef0f nix/cluster-configuration: fix CNI plugins being deleted on kubelet restart 2019-06-20 12:51:51 +02:00
q3k f81f7d462a cluster/clustercfg: gitignore __pycache__ 2019-05-19 03:11:18 +02:00
q3k aa68f3fdd8 secretstore: add implr 2019-05-18 00:15:25 +02:00
q3k 36cc4fb61a bazel-cache: deploy, add waw-hdd-yolo-1 ceph pool 2019-05-17 18:09:39 +02:00