This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).
For now, you can manually run it to ensure that Everything At Least
Kinda Works.
Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
Previously, we had the following setup:
.-----------.
| ..... |
.-----------.-|
| dcr01s24 | |
.-----------.-| |
| dcr01s22 | | |
.---|-----------| |-'
.--------. | |---------. | |
| dcsw01 | <----- | metallb | |-'
'--------' |---------' |
'-----------'
Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.
Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.
.-------------------------.
| dcr01s24 |
|-------------------------|
.--------. |---------. .---------. |
| dcsw01 | <----- | Calico |<--| metallb | |
'--------' |---------' '---------' |
'-------------------------'
This makes Calico announce our pod/service networks into our L3 fabric!
Calico and metallb talk to eachother over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to pasive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.
We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. register access from containerd.
All routes pass through without route reflectors and a full mesh as we
use eBGP over private ASNs in our fabric.
We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.
This is all mildly hacky. Here's hoping that Calico will be able to some
day gain metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...
There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing, enabling route reflection
on dcsw01 (to recreate the mesh routing through dcsw01). Or, maybe by
some more hacking of the Calico BIRD config :/.
Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
We just had an outage seemingly caused by N-I-C sendings tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.
I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.
This change reflects ad-hoc production changes.
Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
This deploys office.hackerspace.pl. It's a collaborative document
editing server that works with Nextcloud.
This is already live, and can be tested with owncloud.hackerspace.pl
(new -> document).
Change-Id: Ic8055a8a6679e7a0695ebb9e41108074d8f789af
WHITE
WHALE
HOLY
GRAIL
Complex systems are complex. Let me tell you a story about that.
Matrix clients perform their last stage of login by performing a POST to
/_matrix/client/r0/login on the Matrix homeserver they log in to. How
they reach the Homeserver is specified earlier - either by using
discovery via SRV or .well-known, or by the client manually specifying
the Matrix homeserver URL.
Regardless of how they reach this endpoint in the first place, this POST
endpoint, as per the Matrix Client-Server API Specification (r0.6.1),
MAY return a `well_known` key, which MUST contain a `homeserver`
address, pointing to the address of the homeserver which the client
should talk to. If present, the client SHOULD use that instead of
whatever it connected to so far.
Issue the first: the iOS client requires `well_known` in that response,
and doesn't work otherwise. https://github.com/vector-im/element-ios/issues/3448
Issue the second: Synapse will return `well_known` accordingly, but only
if `public_baseurl` is set in its configuration. It is not required to
be set. If not set, it will simply not return this key.
Shrek the third: we never set `public_baseurl` in Synapse, and the first
issue (iOS needing `well_known`) only became a regression in
https://github.com/vector-im/element-ios/issues/2715 . As such, it was
difficult to troubleshoot this issue, and we kept getting on some red
herrings: is it the SSO? Is our server broken? Is the iOS implementation
broken?
But now we know - https://github.com/vector-im/element-ios/issues/2715
seems to be the true culprit.
Change-Id: I913792e31e3c6813d4e51d4befdba720cad3f532
Configuring this one is a bit different from appservice-irc. Notably,
there's no way to give it a registration.yaml to overlay on top of a
config, se we end up using an init container with yq to do that for us.
Also, I had to manually copy the regsitration.yaml in synapse, from
/appservices/telegram-prod/registration.yaml to
/data/appservices/telegram-prod.jsonnet, in order to make it work with
the synapse docker start magic. :/
Otherwise, this is deployed and seems to be working.
Change-Id: Id747a0e310221855556c1d280439376f0c4e5ed6
This is in preparation for adding a Telegram bridge appservice. The main
jsonnet file was getting quite chonky.
This does not affect production, and is just a refactor.
Change-Id: I7cdee2bd71aedb40a9f6c3e5148f829023171dcb
The way this was migrated is not to be spoken of.
(hint: it involved downtime, and mounting two volumes at once)
appservice-irc has some storage, we should migrate that to waw3, too. But
it's not as critical.
The new storage (waw3) is _much_ faster.
Change-Id: I4b4bd32e4fedc514753d25bac35d001e8a9c5f00
This now allows to run apt and should allow to run most upstream docker
images. In return, we prohibit some mildly sketchy stuff. But this is
safe enough for project namespaces with limited administrative access.
We should still get gvisor sooner than later...
Change-Id: Ida5ccfae440bacb6f3fd55dcc34ca0addfddd5ae
When deploying https://gerrit.hackerspace.pl/c/hscloud/+/401 we manually
re-pinned appservice-irc to run on bc01n03 (to prevent reschedule as
bc01n02 was updated while bc01n03 was already done). This change makes
git reflect production.
Change-Id: I2518a8a227bfacefd9f1905ded5a1d65e379845f
- we update NixOS to 20.09pre
- we fix an ACME option that's now required
- we switch from systemd-timesyncd to chrony (as timesyncd took a long
time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
from ceph)
This has been deployed in production.
Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
This allows for the following:
local oa = kube.OpenAPI,
vaidation: oa.Validation(oa.Dict {
foo: oa.Required(oa.String),
bar: oa.Required(oa.Array(oa.Dict {
baz: oa.Boolean,
})),
}),
No more `oa.String { required:: true }`!
Change-Id: I4ecc5002e83a8a1cfcdf083d425d7decd4cf8871
This adds a mod proxy system, called, well, modproxy.
It sits between Factorio server instances and the Factorio mod portal,
allowing for arbitrary mod download without needing the servers to know
Factorio credentials.
Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212
There's an issue with the registry that forbids me from pushing into
anything but my personal namespace - might have been introduced by
0697e01144 . For now, I move the hackdoc
image to my personal namespace, as at some point in the future I want to
revamp the registry system, anyway.
We also drive-by fix a mirko.libsonnet typo that, for some reason,
hasn't manifested itself yet.
Change-Id: I8544e4a52610fb84c5c9d8b0de449f785248f60f
ceph-waw2 has currently some production issues [1] which have started to
cause write failures in the registry. The registry is the only user of
ceph-waw2's affected pool, so we reduce the dumpster fire blast radious
by moving it over to ceph-waw3.
This has already been deployed and data has been migrated over (via
s3cmd sync), and the migration has been verified (by a push and pull,
and pull of an older image).
[1] - pgs stuck inactive in the object storage pool
Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347