1
0
Fork 0
Commit Graph

568 Commits (c824405e2e75e5a322acb3770982c79bfe6aeb23)

Author SHA1 Message Date
q3k c824405e2e Update COPYING
Change-Id: I22661254b16840bcea7b352d51171a232fa7041a
2020-10-10 15:59:10 +00:00
q3k 531cacf14a Merge "WORKSPACE: use nix for python/go if available" 2020-10-07 12:56:38 +00:00
q3k eb09c6a347 speedtest: fix mimetype on served JS
Change-Id: Ifcb1d4f8a58a5e6120f31373b2a8c0e307e414be
2020-10-06 15:29:08 +00:00
q3k 363bf4f341 monitoring: global: implement
This creates a basic Global instance, running Victoria Metrics on k0.

Change-Id: Ib03003213d79b41cc54efe40cd2c4837f652c0f4
2020-10-06 14:28:27 +00:00
q3k 2e001e5046 k0: bump to 1.15.4
This notably fixes the annoying loopback issues that prevented hosts
from accessing externalip services with externalTrafficPolicy: local
from nodes that weren't running the service.

Which means, hopefuly, no more registry pull failures when
nginx-ingress gets misplaced!

Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
2020-10-03 16:32:38 +00:00
q3k 2a223705fd cluster: bump certs
This has been deployed to k0 nodes.

Current state of cluster certificates:

cluster/certs/ca-etcd.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-etcdpeer.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kube.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kubefront.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kube-prodvider.cert
            Not After : Sep  1 21:30:00 2021 GMT
cluster/certs/etcd-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcd-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcd-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-calico.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/etcd-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/etcd-kube.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcdpeer-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcdpeer-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/etcdpeer-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/etcd-root.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-apiserver.cert
            Not After : Oct  3 15:26:00 2021 GMT
cluster/certs/kube-controllermanager.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kubefront-apiserver.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/kube-kubelet-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/kube-kubelet-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/kube-proxy.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-scheduler.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-serviceaccounts.cert
            Not After : Mar 28 15:15:00 2021 GMT

Change-Id: I94030ce78c10f7e9a0c0257d55145ef629195314
2020-10-03 16:32:32 +00:00
q3k 194b1c8e62 WORKSPACE: use nix for python/go if available
This introduces Nix, the package manager, and nixpkgs, the package
collection, into hscloud's bazel build machinery.

There are two reasons behind this:

 - on NixOS, it's painful or at least very difficult to run hscloud out
   of the box. Especially with rules_go, that download a blob from the
   Internet to get a Go toolchain, it just fails outright. This solves
   this and allows hscloud to be used on NixOS.

 - on non-NixOS platforms that still might have access to Nix this
   allows to somewhat hermeticize the build. Notably, Python now comes
   from nixpkgs, and is fabricobbled in a way that makes pip3_import
   use Nix system dependencies for ncurses and libpq.

This has been tested to run ci_presubmit on NixOS 20.09pre and Gentoo
~amd64.

Change-Id: Ic16e4827cb52a05aea0df0eed84d80c5e9ae0e07
2020-10-03 18:31:38 +02:00
q3k 6abe4fa771 bgpwtf/machines: init edge01.waw
This configures our WAW edge router using NixOS. This replaces our
previous Ubuntu installation.

Change-Id: Ibd72bde66ec413164401da407c5b268ad83fd3af
2020-10-03 14:57:38 +00:00
q3k 2efb698d22 *: add default.nix/readTree
This makes all Nix files addressable from root by file path.

For instance, if a file is located in //foo/bar:baz.nix containing:

    { pkgs, ... }:

    pkgs.stdenv.mkDerivation {
      pname = "foo";
      # ...
    }

You can then do:

    nix-build -A foo.bar.baz

All nix files loaded this way must be a function taking a 'config'
attrset - see nix/readTree.nix for more information. Currently the
config attrset contains the following fields:

 - hscloud: the root of the hscloud repository itself, which allows
            for traversal via readTree (eg. hscloud.foo.bar.baz)
 - pkgs: nixpkgs
 - pkgsSrc: nixpkgs souce/channel, useful to load NixOS modules.
 - lib, stdenv: lib and stdenv from pkgs.

Change-Id: Ieaacdcabceec18dd6c670d346928bff08b66cf79
2020-10-03 14:57:34 +00:00
q3k fbe234bdb2 cluster: rename module-* into modules/*
Change-Id: I65e06f3e9cec2ba0071259eb755eddbbd1025b97
2020-10-03 14:57:30 +00:00
q3k c7de7e562f cluster: do not export metallb routes to mesh peers
This prevents metallb routes being announced from all peers to our ToR,
thereby preventing issues with traffic hitting services with
externalTrafficPolicy: local.

There still is the from-host loopback issue, but that will be fixed by
upgrading to kube 1.15.

Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04
2020-10-03 14:56:52 +00:00
q3k f0acf16564 prodvider: use SANs in service certificates
This fixes compatibility with prodaccess tools built with Go 1.15, which
introduced 'X.509 CommonName deprecation' [1].

[1] - https://golang.org/doc/go1.15#commonname

Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3
2020-10-03 14:56:35 +00:00
q3k 44628f2b9e Merge "k0.hswaw.net: pass metallb through Calico" 2020-10-02 22:54:57 +00:00
q3k e7fca3acd8 ci_presubmit: init
This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).

For now, you can manually run it to ensure that Everything At Least
Kinda Works.

Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
2020-09-25 21:15:07 +00:00
q3k f00a701f27 tools: remove unused go_sdk.bzl
This is a leftover from an old attempt at NixOS compatibility.

Change-Id: I5050f76b83f47796cdfa6235db8ee5efe8daf3e2
2020-09-25 21:01:12 +00:00
q3k 4e8622df35 djtest: use pyelftools to find uwsgi ld.so
Change-Id: I54bdaa588ff15d8c6ca73c4307076a93a5682d78
2020-09-25 21:00:11 +00:00
q3k a5ed644980 k0.hswaw.net: pass metallb through Calico
Previously, we had the following setup:

                          .-----------.
                          | .....     |
                        .-----------.-|
                        | dcr01s24  | |
                      .-----------.-| |
                      | dcr01s22  | | |
                  .---|-----------| |-'
    .--------.    |   |---------. | |
    | dcsw01 | <----- | metallb | |-'
    '--------'        |---------' |
                      '-----------'

Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.

Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.

                      .-------------------------.
                      | dcr01s24                |
                      |-------------------------|
    .--------.        |---------.   .---------. |
    | dcsw01 | <----- | Calico  |<--| metallb | |
    '--------'        |---------'   '---------' |
                      '-------------------------'

This makes Calico announce our pod/service networks into our L3 fabric!

Calico and metallb talk to eachother over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to pasive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.

We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. register access from containerd.

All routes pass through without route reflectors and a full mesh as we
use eBGP over private ASNs in our fabric.

We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.

This is all mildly hacky. Here's hoping that Calico will be able to some
day gain metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...

There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing, enabling route reflection
on dcsw01 (to recreate the mesh routing through dcsw01). Or, maybe by
some more hacking of the Calico BIRD config :/.

Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
2020-09-23 18:55:12 +00:00
q3k 0dd5195766 hackdoc: bump
Change-Id: I027a7d8f30d55773ec0e2ec7700bd780e417cb19
2020-09-23 18:31:35 +00:00
q3k 2b8f3c4af7 Merge changes Ib91e4d3b,I5d41fa12,I839863a8
* changes:
  hackdoc: render TOC inline
  hackdoc: fix pub_listen flag in readme
  hackdoc: do not add ?ref= to intra-links unless necessary
2020-09-23 18:14:49 +00:00
q3k 0a2f413b4c hackdoc: render TOC inline
Change-Id: Ib91e4d3b73354e7e19095ea62eed70a23ef96512
2020-09-23 18:13:20 +00:00
q3k 80380f4444 hackdoc: fix pub_listen flag in readme
Change-Id: I5d41fa12f29ec5cff9251bb0ad77fc5fdafef786
2020-09-23 18:13:20 +00:00
q3k 26f44da5f1 hackdoc: do not add ?ref= to intra-links unless necessary
Change-Id: I839863a8c10c54fae11100b885c972bed348eba6
2020-09-23 18:13:20 +00:00
q3k 059fdfed3b k0: add resource requests/limits to nginx, remove gitea
We just had an outage seemingly caused by N-I-C sendings tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.

I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.

This change reflects ad-hoc production changes.

Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
2020-09-20 22:53:40 +00:00
q3k 242ec58a33 k0: add waw-hdd-redundant-q3k-3
Change-Id: Id3718877d1e67d48c6726d7649a565db657cfc82
2020-09-20 15:36:24 +00:00
q3k c09d8fedcc Merge "app/onlyoffice: init" 2020-09-16 16:59:06 +00:00
q3k 5533ce9075 matrix: bump synapse to 1.19.2
This has already been deployed to production.

Change-Id: I0ebf818193bd161d6565a9ec4eddc785e79d9077
2020-09-16 14:20:09 +00:00
q3k 06b61d4d47 app/onlyoffice: init
This deploys office.hackerspace.pl. It's a collaborative document
editing server that works with Nextcloud.

This is already live, and can be tested with owncloud.hackerspace.pl
(new -> document).

Change-Id: Ic8055a8a6679e7a0695ebb9e41108074d8f789af
2020-09-15 18:23:08 +00:00
q3k 1230ac38b5 matrix: enable metrics
Change-Id: Ia916cb1311ab079153ba37818455170e85e437bc
2020-09-12 22:26:12 +00:00
patryk 8d069d8d1a cluster/certs: refresh prodvider CA
Change-Id: I35578fb62ddf10e7419c2c347e70322cf4ea0b6a
2020-09-01 22:02:52 +00:00
radex 81da4e5823 laserproxy: extend deadline to 60min & random changes
Change-Id: I2601d2da8da567d8dd6beecc630de911d5d161c3
2020-08-28 19:52:38 +02:00
radex 30b6be82e6 Revert "radex: test"
This reverts commit 04f9d2e2f1.

Reason for revert: <INSERT REASONING HERE>

Change-Id: If29d212656ef30cf9cf53f507ff029f83c9da028
2020-08-27 20:36:46 +00:00
radex 04f9d2e2f1 radex: test
Change-Id: I780578d44eac4e81624b88e20aa7da85b8fd5505
2020-08-27 20:33:26 +00:00
q3k dc496d21a1 Merge "cluster/nix: update nodes" 2020-08-27 15:13:51 +00:00
q3k 1db03c32b6 matrix: fix iOS signup issues by specifying public_baseurl
WHITE
WHALE
HOLY
GRAIL

Complex systems are complex. Let me tell you a story about that.

Matrix clients perform their last stage of login by performing a POST to
/_matrix/client/r0/login on the Matrix homeserver they log in to. How
they reach the Homeserver is specified earlier - either by using
discovery via SRV or .well-known, or by the client manually specifying
the Matrix homeserver URL.

Regardless of how they reach this endpoint in the first place, this POST
endpoint, as per the Matrix Client-Server API Specification (r0.6.1),
MAY return a `well_known` key, which MUST contain a `homeserver`
address, pointing to the address of the homeserver which the client
should talk to. If present, the client SHOULD use that instead of
whatever it connected to so far.

Issue the first: the iOS client requires `well_known` in that response,
and doesn't work otherwise. https://github.com/vector-im/element-ios/issues/3448

Issue the second: Synapse will return `well_known` accordingly, but only
if `public_baseurl` is set in its configuration. It is not required to
be set. If not set, it will simply not return this key.

Shrek the third: we never set `public_baseurl` in Synapse, and the first
issue (iOS needing `well_known`) only became a regression in
https://github.com/vector-im/element-ios/issues/2715 . As such, it was
difficult to troubleshoot this issue, and we kept getting on some red
herrings: is it the SSO? Is our server broken? Is the iOS implementation
broken?

But now we know - https://github.com/vector-im/element-ios/issues/2715
seems to be the true culprit.

Change-Id: I913792e31e3c6813d4e51d4befdba720cad3f532
2020-08-26 18:10:36 +00:00
q3k de6275101b matrix: add Telegram bridge appservice.
Configuring this one is a bit different from appservice-irc. Notably,
there's no way to give it a registration.yaml to overlay on top of a
config, se we end up using an init container with yq to do that for us.

Also, I had to manually copy the regsitration.yaml in synapse, from
/appservices/telegram-prod/registration.yaml to
/data/appservices/telegram-prod.jsonnet, in order to make it work with
the synapse docker start magic. :/

Otherwise, this is deployed and seems to be working.

Change-Id: Id747a0e310221855556c1d280439376f0c4e5ed6
2020-08-24 21:20:39 +00:00
q3k cdba291e7d matrix: split up appservice to separate file
This is in preparation for adding a Telegram bridge appservice. The main
jsonnet file was getting quite chonky.

This does not affect production, and is just a refactor.

Change-Id: I7cdee2bd71aedb40a9f6c3e5148f829023171dcb
2020-08-24 19:14:04 +00:00
q3k c0c037aad9 app/matrix: migrate postgres and data to waw3
The way this was migrated is not to be spoken of.

(hint: it involved downtime, and mounting two volumes at once)

appservice-irc has some storage, we should migrate that to waw3, too. But
it's not as critical.

The new storage (waw3) is _much_ faster.

Change-Id: I4b4bd32e4fedc514753d25bac35d001e8a9c5f00
2020-08-24 19:12:08 +00:00
q3k 35d437883b kube/policies: implement mostlysecure
This now allows to run apt and should allow to run most upstream docker
images. In return, we prohibit some mildly sketchy stuff. But this is
safe enough for project namespaces with limited administrative access.

We should still get gvisor sooner than later...

Change-Id: Ida5ccfae440bacb6f3fd55dcc34ca0addfddd5ae
2020-08-23 11:32:44 +00:00
q3k ed71be4392 Merge "devtools: fix sourcegraph" 2020-08-23 11:06:27 +00:00
q3k b7898a8038 devtools: fix sourcegraph
Permissions get mangled on container restart. This adds an init
container to fix them.

Change-Id: I37c44e23a75b8ec41e6aba2ed38eee223496b8b9
2020-08-23 11:05:57 +00:00
q3k 99db0cd62f Merge "cluster/clustercfg: fix BUILD" 2020-08-23 01:38:25 +00:00
q3k 1b15dc46ea app/matrix: move appservice-irc to bc01n03
When deploying https://gerrit.hackerspace.pl/c/hscloud/+/401 we manually
re-pinned appservice-irc to run on bc01n03 (to prevent reschedule as
bc01n02 was updated while bc01n03 was already done). This change makes
git reflect production.

Change-Id: I2518a8a227bfacefd9f1905ded5a1d65e379845f
2020-08-23 01:03:00 +02:00
q3k 316411790a cluster/nix: update nodes
- we update NixOS to 20.09pre
 - we fix an ACME option that's now required
 - we switch from systemd-timesyncd to chrony (as timesyncd took a long
   time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
   from ceph)

This has been deployed in production.

Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
2020-08-23 00:58:29 +02:00
q3k bc73a44519 cluster/clustercfg: fix BUILD
This is continued fallout after migrating from rules_pip.

Change-Id: Idb9b4d4f22aa36512d220ac31375bae7a0f25e4e
2020-08-22 20:33:37 +00:00
q3k 31e41d5ff7 Merge changes I4ecc5002,Iff21654e,I312be8e8
* changes:
  kube/kube.libsonnet: add OpenAPI.Require
  kube/kube.libsonnet: add Contain to Namespace
  kube/kube.libsonnet: add CertificateVolume
2020-08-22 20:32:02 +00:00
q3k d5918c8e72 cluster: change q3k's laptop key
Paranoia is dead, long live Mimeomia.

This has already been deployed to production.

Change-Id: Ibbc5015b5277380a3450f76e62d3fab6e71be1a0
2020-08-22 22:29:42 +02:00
q3k 0b6d5d526f kube/kube.libsonnet: add OpenAPI.Require
This allows for the following:

    local oa = kube.OpenAPI,

    vaidation: oa.Validation(oa.Dict {
        foo: oa.Required(oa.String),
        bar: oa.Required(oa.Array(oa.Dict {
            baz: oa.Boolean,
        })),
    }),

No more `oa.String { required:: true }`!

Change-Id: I4ecc5002e83a8a1cfcdf083d425d7decd4cf8871
2020-08-22 19:01:01 +00:00
q3k 5a89d225e7 kube/kube.libsonnet: add Contain to Namespace
This allow for the following:

    ns: kube.Namespace("foo"),

    service: self.ns.Contain(kube.Service("bar")) {
        spec+: {
            // ...
        },
    },

No more `metadata+: { namespace: ... }` !

Change-Id: Iff21654e18919afbe60c574e560356c6bd6d9b89
2020-08-22 18:57:30 +00:00
q3k 394dd83219 kube/kube.libsonnet: add CertificateVolume
CertificateVolume is like SecretVolume, but for secrets generated from
Certificates.

Change-Id: I312be8e84c856221173583df478ec5317aa948c0
2020-08-22 18:56:53 +00:00
q3k 8887655aa8 go/mirko: fix trace logging
Change-Id: I95b8ce32ad529ffe0b43282f5761495df78b2b10
2020-08-16 13:25:40 +00:00