Commit Graph

1008 Commits (20c6bcb7305d4b85c5fd6dfc72c04c68b772d15f)

Author SHA1 Message Date
q3k e7fca3acd8 ci_presubmit: init
This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).

For now, you can manually run it to ensure that Everything At Least
Kinda Works.

Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
2020-09-25 21:15:07 +00:00
q3k f00a701f27 tools: remove unused go_sdk.bzl
This is a leftover from an old attempt at NixOS compatibility.

Change-Id: I5050f76b83f47796cdfa6235db8ee5efe8daf3e2
2020-09-25 21:01:12 +00:00
q3k 4e8622df35 djtest: use pyelftools to find uwsgi ld.so
Change-Id: I54bdaa588ff15d8c6ca73c4307076a93a5682d78
2020-09-25 21:00:11 +00:00
q3k a5ed644980 k0.hswaw.net: pass metallb through Calico
Previously, we had the following setup:

                          .-----------.
                          | .....     |
                        .-----------.-|
                        | dcr01s24  | |
                      .-----------.-| |
                      | dcr01s22  | | |
                  .---|-----------| |-'
    .--------.    |   |---------. | |
    | dcsw01 | <----- | metallb | |-'
    '--------'        |---------' |
                      '-----------'

Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.

Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.

                      .-------------------------.
                      | dcr01s24                |
                      |-------------------------|
    .--------.        |---------.   .---------. |
    | dcsw01 | <----- | Calico  |<--| metallb | |
    '--------'        |---------'   '---------' |
                      '-------------------------'

This makes Calico announce our pod/service networks into our L3 fabric!

Calico and metallb talk to each other over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to passive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.

We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. registry access from containerd.
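
As a rough illustration of the filter described above, here is a toy Python
model (not BIRD configuration, and not Calico's actual template). The `Route`
type, both function names, and the example prefixes are hypothetical. The only
behaviour taken from the text is that routes learned with a 127.0.0.1 next hop
are treated as unreachable and are kept out of the kernel table.

```python
from dataclasses import dataclass
from typing import List

LOOPBACK = "127.0.0.1"


@dataclass
class Route:
    prefix: str
    next_hop: str
    # Set when the next hop cannot be resolved. In this toy model that is
    # exactly the case where the route was learned over loopback.
    unreachable: bool = False


def learn_from_peer(prefix: str, next_hop: str) -> Route:
    """Hypothetical import step: mark loopback next hops as unreachable."""
    return Route(prefix, next_hop, unreachable=(next_hop == LOOPBACK))


def export_to_kernel(routes: List[Route]) -> List[Route]:
    """Hypothetical kernel filter: never program unreachable routes."""
    return [r for r in routes if not r.unreachable]


routes = [
    learn_from_peer("198.51.100.0/24", "192.0.2.1"),  # ordinary route
    learn_from_peer("192.0.2.10/32", LOOPBACK),       # stands in for a metallb route
]
kernel_table = export_to_kernel(routes)
```

In this sketch the loopback-learned route still exists in `routes` (so it can
be announced onwards), but it never reaches `kernel_table`.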

All routes pass through without route reflectors and a full mesh as we
use eBGP over private ASNs in our fabric.

We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.

This is all mildly hacky. Here's hoping that Calico will some day be
able to gain metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...

There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing and enabling route
reflection on dcsw01 (to recreate the mesh routing through dcsw01). Or,
maybe by some more hacking of the Calico BIRD config :/.

Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
2020-09-23 18:55:12 +00:00
q3k 0dd5195766 hackdoc: bump
Change-Id: I027a7d8f30d55773ec0e2ec7700bd780e417cb19
2020-09-23 18:31:35 +00:00
q3k 2b8f3c4af7 Merge changes Ib91e4d3b,I5d41fa12,I839863a8
* changes:
  hackdoc: render TOC inline
  hackdoc: fix pub_listen flag in readme
  hackdoc: do not add ?ref= to intra-links unless necessary
2020-09-23 18:14:49 +00:00
q3k 0a2f413b4c hackdoc: render TOC inline
Change-Id: Ib91e4d3b73354e7e19095ea62eed70a23ef96512
2020-09-23 18:13:20 +00:00
q3k 80380f4444 hackdoc: fix pub_listen flag in readme
Change-Id: I5d41fa12f29ec5cff9251bb0ad77fc5fdafef786
2020-09-23 18:13:20 +00:00
q3k 26f44da5f1 hackdoc: do not add ?ref= to intra-links unless necessary
Change-Id: I839863a8c10c54fae11100b885c972bed348eba6
2020-09-23 18:13:20 +00:00
q3k 059fdfed3b k0: add resource requests/limits to nginx, remove gitea
We just had an outage seemingly caused by N-I-C sending tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.

I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.

This change reflects ad-hoc production changes.

Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
2020-09-20 22:53:40 +00:00
q3k 242ec58a33 k0: add waw-hdd-redundant-q3k-3
Change-Id: Id3718877d1e67d48c6726d7649a565db657cfc82
2020-09-20 15:36:24 +00:00
q3k c09d8fedcc Merge "app/onlyoffice: init" 2020-09-16 16:59:06 +00:00
q3k 5533ce9075 matrix: bump synapse to 1.19.2
This has already been deployed to production.

Change-Id: I0ebf818193bd161d6565a9ec4eddc785e79d9077
2020-09-16 14:20:09 +00:00
q3k 06b61d4d47 app/onlyoffice: init
This deploys office.hackerspace.pl. It's a collaborative document
editing server that works with Nextcloud.

This is already live, and can be tested with owncloud.hackerspace.pl
(new -> document).

Change-Id: Ic8055a8a6679e7a0695ebb9e41108074d8f789af
2020-09-15 18:23:08 +00:00
q3k 1230ac38b5 matrix: enable metrics
Change-Id: Ia916cb1311ab079153ba37818455170e85e437bc
2020-09-12 22:26:12 +00:00
patryk 8d069d8d1a cluster/certs: refresh prodvider CA
Change-Id: I35578fb62ddf10e7419c2c347e70322cf4ea0b6a
2020-09-01 22:02:52 +00:00
radex 81da4e5823 laserproxy: extend deadline to 60min & random changes
Change-Id: I2601d2da8da567d8dd6beecc630de911d5d161c3
2020-08-28 19:52:38 +02:00
radex 30b6be82e6 Revert "radex: test"
This reverts commit 04f9d2e2f1.

Reason for revert: <INSERT REASONING HERE>

Change-Id: If29d212656ef30cf9cf53f507ff029f83c9da028
2020-08-27 20:36:46 +00:00
radex 04f9d2e2f1 radex: test
Change-Id: I780578d44eac4e81624b88e20aa7da85b8fd5505
2020-08-27 20:33:26 +00:00
q3k dc496d21a1 Merge "cluster/nix: update nodes" 2020-08-27 15:13:51 +00:00
q3k 1db03c32b6 matrix: fix iOS signup issues by specifying public_baseurl
WHITE
WHALE
HOLY
GRAIL

Complex systems are complex. Let me tell you a story about that.

Matrix clients perform their last stage of login by performing a POST to
/_matrix/client/r0/login on the Matrix homeserver they log in to. How
they reach the Homeserver is specified earlier - either by using
discovery via SRV or .well-known, or by the client manually specifying
the Matrix homeserver URL.

Regardless of how they reach this endpoint in the first place, this POST
endpoint, as per the Matrix Client-Server API Specification (r0.6.1),
MAY return a `well_known` key, which MUST contain a `homeserver`
address, pointing to the address of the homeserver which the client
should talk to. If present, the client SHOULD use that instead of
whatever it connected to so far.
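
A minimal sketch of the client-side rule just described, in Python. The
function name is hypothetical. The nested key layout (`well_known` ->
`m.homeserver` -> `base_url`) is how the r0.6.1 login response is usually
documented, but treat it as an assumption and check the spec before relying on
it.

```python
from typing import Any, Dict


def effective_homeserver(connected_url: str, login_response: Dict[str, Any]) -> str:
    """Return the base URL a client should use after a successful login.

    Falls back to the URL the client was already talking to whenever the
    optional `well_known` data is absent or malformed.
    """
    well_known = login_response.get("well_known")
    if not isinstance(well_known, dict):
        return connected_url
    # Assumed key names, see the note above.
    homeserver = well_known.get("m.homeserver")
    if not isinstance(homeserver, dict):
        return connected_url
    base_url = homeserver.get("base_url")
    if isinstance(base_url, str) and base_url:
        return base_url.rstrip("/")
    return connected_url
```

A tolerant client written this way keeps working when the key is missing,
which is the case the rest of this message is about.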

Issue the first: the iOS client requires `well_known` in that response,
and doesn't work otherwise. https://github.com/vector-im/element-ios/issues/3448

Issue the second: Synapse will return `well_known` accordingly, but only
if `public_baseurl` is set in its configuration. It is not required to
be set. If not set, it will simply not return this key.

Shrek the third: we never set `public_baseurl` in Synapse, and the first
issue (iOS needing `well_known`) only became a regression in
https://github.com/vector-im/element-ios/issues/2715 . As such, it was
difficult to troubleshoot this issue, and we kept chasing red
herrings: is it the SSO? Is our server broken? Is the iOS implementation
broken?

But now we know - https://github.com/vector-im/element-ios/issues/2715
seems to be the true culprit.

Change-Id: I913792e31e3c6813d4e51d4befdba720cad3f532
2020-08-26 18:10:36 +00:00
q3k de6275101b matrix: add Telegram bridge appservice.
Configuring this one is a bit different from appservice-irc. Notably,
there's no way to give it a registration.yaml to overlay on top of a
config, so we end up using an init container with yq to do that for us.
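
The overlay step can be pictured as a recursive merge where the overlay's
values win. The Python sketch below only illustrates that idea. It is not the
yq invocation used in the init container, and the key names in the example are
made up.

```python
from typing import Any, Dict


def overlay(base: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `patch` merged on top, recursively."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing.
            merged[key] = overlay(merged[key], value)
        else:
            # Scalars, lists and new keys: the overlay value wins.
            merged[key] = value
    return merged


# Hypothetical config and registration fragments.
config = {"appservice": {"address": "http://bridge:8080", "as_token": "PLACEHOLDER"}}
registration = {"appservice": {"as_token": "secret-from-registration"}}
final_config = overlay(config, registration)
```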

Also, I had to manually copy the registration.yaml in synapse, from
/appservices/telegram-prod/registration.yaml to
/data/appservices/telegram-prod.jsonnet, in order to make it work with
the synapse docker start magic. :/

Otherwise, this is deployed and seems to be working.

Change-Id: Id747a0e310221855556c1d280439376f0c4e5ed6
2020-08-24 21:20:39 +00:00
q3k cdba291e7d matrix: split up appservice to separate file
This is in preparation for adding a Telegram bridge appservice. The main
jsonnet file was getting quite chonky.

This does not affect production, and is just a refactor.

Change-Id: I7cdee2bd71aedb40a9f6c3e5148f829023171dcb
2020-08-24 19:14:04 +00:00
q3k c0c037aad9 app/matrix: migrate postgres and data to waw3
The way this was migrated is not to be spoken of.

(hint: it involved downtime, and mounting two volumes at once)

appservice-irc has some storage, we should migrate that to waw3, too. But
it's not as critical.

The new storage (waw3) is _much_ faster.

Change-Id: I4b4bd32e4fedc514753d25bac35d001e8a9c5f00
2020-08-24 19:12:08 +00:00
q3k 35d437883b kube/policies: implement mostlysecure
This now allows running apt and should allow running most upstream docker
images. In return, we prohibit some mildly sketchy stuff. But this is
safe enough for project namespaces with limited administrative access.

We should still get gvisor sooner rather than later...

Change-Id: Ida5ccfae440bacb6f3fd55dcc34ca0addfddd5ae
2020-08-23 11:32:44 +00:00
q3k ed71be4392 Merge "devtools: fix sourcegraph" 2020-08-23 11:06:27 +00:00
q3k b7898a8038 devtools: fix sourcegraph
Permissions get mangled on container restart. This adds an init
container to fix them.

Change-Id: I37c44e23a75b8ec41e6aba2ed38eee223496b8b9
2020-08-23 11:05:57 +00:00
q3k 99db0cd62f Merge "cluster/clustercfg: fix BUILD" 2020-08-23 01:38:25 +00:00
q3k 1b15dc46ea app/matrix: move appservice-irc to bc01n03
When deploying https://gerrit.hackerspace.pl/c/hscloud/+/401 we manually
re-pinned appservice-irc to run on bc01n03 (to prevent reschedule as
bc01n02 was updated while bc01n03 was already done). This change makes
git reflect production.

Change-Id: I2518a8a227bfacefd9f1905ded5a1d65e379845f
2020-08-23 01:03:00 +02:00
q3k 316411790a cluster/nix: update nodes
- we update NixOS to 20.09pre
- we fix an ACME option that's now required
- we switch from systemd-timesyncd to chrony (as timesyncd took a long
  time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
  from ceph)

This has been deployed in production.

Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
2020-08-23 00:58:29 +02:00
q3k bc73a44519 cluster/clustercfg: fix BUILD
This is continued fallout after migrating from rules_pip.

Change-Id: Idb9b4d4f22aa36512d220ac31375bae7a0f25e4e
2020-08-22 20:33:37 +00:00
q3k 31e41d5ff7 Merge changes I4ecc5002,Iff21654e,I312be8e8
* changes:
  kube/kube.libsonnet: add OpenAPI.Require
  kube/kube.libsonnet: add Contain to Namespace
  kube/kube.libsonnet: add CertificateVolume
2020-08-22 20:32:02 +00:00
q3k d5918c8e72 cluster: change q3k's laptop key
Paranoia is dead, long live Mimeomia.

This has already been deployed to production.

Change-Id: Ibbc5015b5277380a3450f76e62d3fab6e71be1a0
2020-08-22 22:29:42 +02:00
q3k 0b6d5d526f kube/kube.libsonnet: add OpenAPI.Require
This allows for the following:

    local oa = kube.OpenAPI,

    validation: oa.Validation(oa.Dict {
        foo: oa.Required(oa.String),
        bar: oa.Required(oa.Array(oa.Dict {
            baz: oa.Boolean,
        })),
    }),

No more `oa.String { required:: true }`!
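
For context on why a wrapper is convenient: in an OpenAPI v3 schema,
requiredness is not a flag on the field itself but a `required` list of names
on the enclosing object. The Python sketch below shows one way a builder could
collect wrapped fields into that list. It is a stand-in for illustration only,
not how kube.libsonnet implements it, and all helper names are hypothetical.

```python
from typing import Any, Dict

Schema = Dict[str, Any]

STRING: Schema = {"type": "string"}
BOOLEAN: Schema = {"type": "boolean"}


def required(schema: Schema) -> Schema:
    """Mark a field schema as required (private marker, stripped by obj())."""
    return {**schema, "_required": True}


def array(items: Schema) -> Schema:
    return {"type": "array", "items": items}


def obj(fields: Dict[str, Schema]) -> Schema:
    """Build an object schema, hoisting markers into a `required` list."""
    properties: Dict[str, Schema] = {}
    required_names = []
    for name, schema in fields.items():
        schema = dict(schema)  # do not mutate shared constants
        if schema.pop("_required", False):
            required_names.append(name)
        properties[name] = schema
    out: Schema = {"type": "object", "properties": properties}
    if required_names:
        out["required"] = sorted(required_names)
    return out


schema = obj({
    "foo": required(STRING),
    "bar": required(array(obj({"baz": BOOLEAN}))),
})
```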

Change-Id: I4ecc5002e83a8a1cfcdf083d425d7decd4cf8871
2020-08-22 19:01:01 +00:00
q3k 5a89d225e7 kube/kube.libsonnet: add Contain to Namespace
This allows for the following:

    ns: kube.Namespace("foo"),

    service: self.ns.Contain(kube.Service("bar")) {
        spec+: {
            // ...
        },
    },

No more `metadata+: { namespace: ... }` !

Change-Id: Iff21654e18919afbe60c574e560356c6bd6d9b89
2020-08-22 18:57:30 +00:00
q3k 394dd83219 kube/kube.libsonnet: add CertificateVolume
CertificateVolume is like SecretVolume, but for secrets generated from
Certificates.

Change-Id: I312be8e84c856221173583df478ec5317aa948c0
2020-08-22 18:56:53 +00:00
q3k 8887655aa8 go/mirko: fix trace logging
Change-Id: I95b8ce32ad529ffe0b43282f5761495df78b2b10
2020-08-16 13:25:40 +00:00
q3k b97a303f89 Merge "hswaw/ldapweb: bump" 2020-08-15 18:44:03 +00:00
q3k fceedd1bab hswaw/ldapweb: bump
This pulls in https://code.hackerspace.pl/q3k/ldap-web-public/commit/?id=1cced0d613f4ec8b454c1a6c6fd9bb01eed391e3

Change-Id: Ib676d09084bf1bd00bfa88eab980353550525729
2020-08-15 18:43:46 +00:00
q3k 0581bbf8a0 games/factorio: add modproxy
This adds a mod proxy system, called, well, modproxy.

It sits between Factorio server instances and the Factorio mod portal,
allowing for arbitrary mod download without needing the servers to know
Factorio credentials.
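
A minimal Python sketch of the core idea: the proxy rewrites an incoming
download path into an upstream URL that carries the proxy's own credentials.
The function name and the placeholder host are hypothetical, and the
assumption that the portal takes credentials as `username` and `token` query
parameters should be checked against the mod portal documentation. This is not
the actual modproxy code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def upstream_url(portal_base: str, request_path: str, username: str, token: str) -> str:
    """Build the URL the proxy would fetch on behalf of a game server."""
    scheme, netloc, _, _, _ = urlsplit(portal_base)
    path, _, raw_query = request_path.partition("?")
    # Drop any credentials the caller sent, then add the proxy's own.
    query = [(k, v) for k, v in parse_qsl(raw_query) if k not in ("username", "token")]
    query += [("username", username), ("token", token)]
    return urlunsplit((scheme, netloc, path, urlencode(query), ""))


# Placeholder host and credentials, for illustration only.
url = upstream_url("https://mods.example", "/download/some-mod/abc123", "proxyuser", "s3cret")
```

Keeping the credentials on the proxy side means the game servers only ever
see credential-free paths.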

Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212
2020-08-14 13:03:46 +02:00
q3k 791ab6d1a5 factorio: bump to 1.0.0
Change-Id: I24c96e556ae4054fb1b25e671341f2cb671010c2
2020-08-14 10:35:28 +00:00
q3k 15db04c705 hackdoc: deploy
There's an issue with the registry that forbids me from pushing into
anything but my personal namespace - might have been introduced by
0697e01144 . For now, I move the hackdoc
image to my personal namespace, as at some point in the future I want to
revamp the registry system, anyway.

We also drive-by fix a mirko.libsonnet typo that, for some reason,
hasn't manifested itself yet.

Change-Id: I8544e4a52610fb84c5c9d8b0de449f785248f60f
2020-08-10 18:57:26 +02:00
q3k d40bd1bd71 README: link to cs instead of gitiles
Change-Id: Iaaa6cbe1327fc75dfd642bbfe5677740bb9b2fb6
2020-08-10 18:03:04 +02:00
q3k 77a5a4b388 Merge "hackdoc: do not render links to pages that wouldn't serve anything" 2020-08-10 16:01:51 +00:00
q3k d701c4ebc6 hackdoc: do not render links to pages that wouldn't serve anything
This gets rid of annoying clickable 404 links.

Change-Id: Ibf767875af29f4571e7f935d494b44dde002fac6
2020-08-10 18:01:13 +02:00
q3k 03c9a5ed86 app/matrix: add q3k to OWNERS
(apparently these don't get inherited?)

Change-Id: Ie0052677585863da6dade8c184e25b8c15ddf42c
2020-08-05 23:04:29 +02:00
q3k fe33aa6489 Merge "third_party/py: bump cffi and psycopg2 to latest versions" 2020-08-05 20:58:12 +00:00
q3k 970b7687f3 factorio: bump all to 0.18.40
Change-Id: Iaf9b28ce6fed9ba791075307ee3e75f218267d23
2020-08-04 20:33:25 +02:00
q3k 3d29484ebb k0: move registry to ceph-waw3
ceph-waw2 currently has some production issues [1] which have started to
cause write failures in the registry. The registry is the only user of
ceph-waw2's affected pool, so we reduce the dumpster fire blast radius
by moving it over to ceph-waw3.

This has already been deployed and data has been migrated over (via
s3cmd sync), and the migration has been verified (by a push and pull,
and pull of an older image).

[1] - pgs stuck inactive in the object storage pool

Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347
2020-08-04 01:36:51 +02:00
q3k 1773f32c8a factorio: bump to 0.18.40
Change-Id: I065a5e8a8c6608a137c0ae4f1cb04f8254ef6ddd
2020-08-01 22:02:38 +02:00