This notably fixes the annoying loopback issues that prevented hosts
from accessing externalip services with externalTrafficPolicy: local
from nodes that weren't running the service.
Which means, hopefuly, no more registry pull failures when
nginx-ingress gets misplaced!
Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
This has been deployed to k0 nodes.
Current state of cluster certificates:
cluster/certs/ca-etcd.crt
Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-etcdpeer.crt
Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kube.crt
Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kubefront.crt
Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kube-prodvider.cert
Not After : Sep 1 21:30:00 2021 GMT
cluster/certs/etcd-bc01n01.hswaw.net.cert
Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcd-bc01n02.hswaw.net.cert
Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcd-bc01n03.hswaw.net.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-calico.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-dcr01s22.hswaw.net.cert
Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/etcd-dcr01s24.hswaw.net.cert
Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/etcd-kube.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-bc01n01.hswaw.net.cert
Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcdpeer-bc01n02.hswaw.net.cert
Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcdpeer-bc01n03.hswaw.net.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-dcr01s22.hswaw.net.cert
Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/etcdpeer-dcr01s24.hswaw.net.cert
Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/etcd-root.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-apiserver.cert
Not After : Oct 3 15:26:00 2021 GMT
cluster/certs/kube-controllermanager.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kubefront-apiserver.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-bc01n01.hswaw.net.cert
Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/kube-kubelet-bc01n02.hswaw.net.cert
Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/kube-kubelet-bc01n03.hswaw.net.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s22.hswaw.net.cert
Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s24.hswaw.net.cert
Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/kube-proxy.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-scheduler.cert
Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-serviceaccounts.cert
Not After : Mar 28 15:15:00 2021 GMT
Change-Id: I94030ce78c10f7e9a0c0257d55145ef629195314
This makes all Nix files addressable from root by file path.
For instance, if a file is located in //foo/bar:baz.nix containing:
{ pkgs, ... }:
pkgs.stdenv.mkDerivation {
pname = "foo";
# ...
}
You can then do:
nix-build -A foo.bar.baz
All nix files loaded this way must be a function taking a 'config'
attrset - see nix/readTree.nix for more information. Currently the
config attrset contains the following fields:
- hscloud: the root of the hscloud repository itself, which allows
for traversal via readTree (eg. hscloud.foo.bar.baz)
- pkgs: nixpkgs
- pkgsSrc: nixpkgs souce/channel, useful to load NixOS modules.
- lib, stdenv: lib and stdenv from pkgs.
Change-Id: Ieaacdcabceec18dd6c670d346928bff08b66cf79
This prevents metallb routes being announced from all peers to our ToR,
thereby preventing issues with traffic hitting services with
externalTrafficPolicy: local.
There still is the from-host loopback issue, but that will be fixed by
upgrading to kube 1.15.
Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04
This fixes compatibility with prodaccess tools built with Go 1.15, which
introduced 'X.509 CommonName deprecation' [1].
[1] - https://golang.org/doc/go1.15#commonname
Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3
This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).
For now, you can manually run it to ensure that Everything At Least
Kinda Works.
Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
Previously, we had the following setup:
.-----------.
| ..... |
.-----------.-|
| dcr01s24 | |
.-----------.-| |
| dcr01s22 | | |
.---|-----------| |-'
.--------. | |---------. | |
| dcsw01 | <----- | metallb | |-'
'--------' |---------' |
'-----------'
Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.
Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.
.-------------------------.
| dcr01s24 |
|-------------------------|
.--------. |---------. .---------. |
| dcsw01 | <----- | Calico |<--| metallb | |
'--------' |---------' '---------' |
'-------------------------'
This makes Calico announce our pod/service networks into our L3 fabric!
Calico and metallb talk to eachother over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to pasive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.
We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. register access from containerd.
All routes pass through without route reflectors and a full mesh as we
use eBGP over private ASNs in our fabric.
We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.
This is all mildly hacky. Here's hoping that Calico will be able to some
day gain metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...
There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing, enabling route reflection
on dcsw01 (to recreate the mesh routing through dcsw01). Or, maybe by
some more hacking of the Calico BIRD config :/.
Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
We just had an outage seemingly caused by N-I-C sendings tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.
I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.
This change reflects ad-hoc production changes.
Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
This deploys office.hackerspace.pl. It's a collaborative document
editing server that works with Nextcloud.
This is already live, and can be tested with owncloud.hackerspace.pl
(new -> document).
Change-Id: Ic8055a8a6679e7a0695ebb9e41108074d8f789af
WHITE
WHALE
HOLY
GRAIL
Complex systems are complex. Let me tell you a story about that.
Matrix clients perform their last stage of login by performing a POST to
/_matrix/client/r0/login on the Matrix homeserver they log in to. How
they reach the Homeserver is specified earlier - either by using
discovery via SRV or .well-known, or by the client manually specifying
the Matrix homeserver URL.
Regardless of how they reach this endpoint in the first place, this POST
endpoint, as per the Matrix Client-Server API Specification (r0.6.1),
MAY return a `well_known` key, which MUST contain a `homeserver`
address, pointing to the address of the homeserver which the client
should talk to. If present, the client SHOULD use that instead of
whatever it connected to so far.
Issue the first: the iOS client requires `well_known` in that response,
and doesn't work otherwise. https://github.com/vector-im/element-ios/issues/3448
Issue the second: Synapse will return `well_known` accordingly, but only
if `public_baseurl` is set in its configuration. It is not required to
be set. If not set, it will simply not return this key.
Shrek the third: we never set `public_baseurl` in Synapse, and the first
issue (iOS needing `well_known`) only became a regression in
https://github.com/vector-im/element-ios/issues/2715 . As such, it was
difficult to troubleshoot this issue, and we kept getting on some red
herrings: is it the SSO? Is our server broken? Is the iOS implementation
broken?
But now we know - https://github.com/vector-im/element-ios/issues/2715
seems to be the true culprit.
Change-Id: I913792e31e3c6813d4e51d4befdba720cad3f532
Configuring this one is a bit different from appservice-irc. Notably,
there's no way to give it a registration.yaml to overlay on top of a
config, se we end up using an init container with yq to do that for us.
Also, I had to manually copy the regsitration.yaml in synapse, from
/appservices/telegram-prod/registration.yaml to
/data/appservices/telegram-prod.jsonnet, in order to make it work with
the synapse docker start magic. :/
Otherwise, this is deployed and seems to be working.
Change-Id: Id747a0e310221855556c1d280439376f0c4e5ed6
This is in preparation for adding a Telegram bridge appservice. The main
jsonnet file was getting quite chonky.
This does not affect production, and is just a refactor.
Change-Id: I7cdee2bd71aedb40a9f6c3e5148f829023171dcb
The way this was migrated is not to be spoken of.
(hint: it involved downtime, and mounting two volumes at once)
appservice-irc has some storage, we should migrate that to waw3, too. But
it's not as critical.
The new storage (waw3) is _much_ faster.
Change-Id: I4b4bd32e4fedc514753d25bac35d001e8a9c5f00
This now allows to run apt and should allow to run most upstream docker
images. In return, we prohibit some mildly sketchy stuff. But this is
safe enough for project namespaces with limited administrative access.
We should still get gvisor sooner than later...
Change-Id: Ida5ccfae440bacb6f3fd55dcc34ca0addfddd5ae
When deploying https://gerrit.hackerspace.pl/c/hscloud/+/401 we manually
re-pinned appservice-irc to run on bc01n03 (to prevent reschedule as
bc01n02 was updated while bc01n03 was already done). This change makes
git reflect production.
Change-Id: I2518a8a227bfacefd9f1905ded5a1d65e379845f
- we update NixOS to 20.09pre
- we fix an ACME option that's now required
- we switch from systemd-timesyncd to chrony (as timesyncd took a long
time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
from ceph)
This has been deployed in production.
Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
This allows for the following:
local oa = kube.OpenAPI,
vaidation: oa.Validation(oa.Dict {
foo: oa.Required(oa.String),
bar: oa.Required(oa.Array(oa.Dict {
baz: oa.Boolean,
})),
}),
No more `oa.String { required:: true }`!
Change-Id: I4ecc5002e83a8a1cfcdf083d425d7decd4cf8871
This adds a mod proxy system, called, well, modproxy.
It sits between Factorio server instances and the Factorio mod portal,
allowing for arbitrary mod download without needing the servers to know
Factorio credentials.
Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212