Commit Graph

596 Commits (c3f36e9bf199909c5fa65644249ff61976ddaf84)

Author SHA1 Message Date
q3k c3f36e9bf1 third_party/go: bump kubernetes to 1.19.3
Change-Id: Id9245765936997088e94135fde409ff4c1539bba
2020-11-03 21:15:41 +01:00
q3k 99ce53c79a third_party: remove uWSGI
It's not being used outside of personal/q3k for now, and it's really
fucking up the build system.

Change-Id: Ie8f3e59e40e8be8ef3ec32118a591da2274e398c
2020-10-29 01:43:37 +01:00
q3k b1de757249 laserproxy: add nix build
Change-Id: If93f4ba69afa028fed9098663a523f46d6134f7c
2020-10-29 00:43:43 +01:00
q3k bfe9bb0e3a k0: add woju's personal s3 user
Change-Id: I8ed5bb5428594b74460f1b89185d684cb6c26268
2020-10-27 20:50:50 +01:00
q3k 491542589b tools/gostatic: init
This adds Bazel/hscloud integration to gostatic, via gostatic_tarball.

A sample is provided in //tools/gostatic/example, it can be built using:

    bazel build //tools/gostatic/example

The resulting tarball can then be extracted and viewed in a web
browser.

Change-Id: Idf8d4a8e0ee3a5ae07f7449a25909478c2d8b105
2020-10-26 12:08:33 +01:00
q3k 94a1af8714 hackdoc: add table css, make it colorful
Change-Id: Idab1f911c10832ef4cfcf7073f77577d1b8673ff
2020-10-24 20:20:18 +02:00
q3k 79b506bcc2 third_party/go: unbreak build
This was missed in gerrit/486. Whoops, we should CI sooner than later.

Change-Id: Ic70b742c75d52dd615d4e2f946233783d156cead
2020-10-24 17:36:25 +02:00
q3k b4c3f342e4 third_party/go: add gostatic
To test:

    bazel run '@com_github_piranha_gostatic//:gostatic'

Change-Id: Ie846429df0d1f1914f2734735591edebd5d29094
2020-10-24 17:30:44 +02:00
q3k e401735fdd Merge "bgpwtf: add static v6 routes via bird" 2020-10-16 17:09:18 +00:00
q3k d9a6365f8b bgpwtf: add static v6 routes via bird
A customer was missing a static v6 route via their router. Since we
don't want to add them to networking.interfaces.routes.* (as this
restarts the whole scripted network stack in NixOS), we add them to
bird. This requires implementing hscloud.routing.static.

Change-Id: I0a205ed1e1f17a86de43aaf72ab6c2694a069112
2020-10-16 19:07:52 +02:00
q3k 78753aa275 Merge "k0: bump to 1.16.5" 2020-10-10 20:40:56 +00:00
q3k b014a95e0a Merge "k0: expose controller-manager and scheduler metrics" 2020-10-10 20:40:35 +00:00
q3k bfe2fe6455 Merge "clustercfg: show diff before switching to new configuration" 2020-10-10 20:40:31 +00:00
q3k e77f7717d4 k0: bump to 1.16.5
Change-Id: I548808ce4e0deb0513a1e00963f383d84b9d920c
2020-10-10 22:39:50 +02:00
informatic cf47f08481 app/covid-formity: enable redis password
This has already been deployed in production

Change-Id: I9c603a4985332d422d8875ecf6f8dca157f32f22
2020-10-10 18:40:45 +00:00
informatic 7e3447f3ff Merge "kube/redis: implement optional cfg.password option" 2020-10-10 18:40:37 +00:00
q3k d9e32f19f6 Merge "kube/upstream: bump to 1.14.4" 2020-10-10 18:24:48 +00:00
informatic 89a1ee90cd kube/redis: implement optional cfg.password option
If set, this enables internal redis authentication scheme. Supports
secretRefs, as well as values passed directly.

Change-Id: Ie902b8d79fdc4aa83ad8ad123e79f0bc80c1251f
2020-10-10 19:44:14 +02:00
informatic 018d219dc9 Merge changes Ie974e7e8,I0bda7f6e
* changes:
  app/covid-formity: add kurjerzy integration
  app/covid-formity: image update, add /qr1, /manual, /video redirect
2020-10-10 17:13:53 +00:00
q3k 1257389d3d k0: expose controller-manager and scheduler metrics
We want to be able to scrape controller-manager and scheduler metrics
into Prometheus. For that, each of them needs to:

 1) listen on a secure port
 2) have authn enabled

With this, any k8s user with the right permissions (and a bearer token
or TLS certificate) can come in and access metrics over a node's public
IP address. Access without a certificate/token gets thrown into the
system:anonymous user, which as no access to any API.

Change-Id: I267680f92f748ba63b6762e6aaba3c417446e50b
2020-10-10 16:00:15 +00:00
q3k 36224c617a clustercfg: show diff before switching to new configuration
This is mildly hacky, but lets us be more informed before we switch to a
new configuration.

Change-Id: I008f3f698db702f1e0992bd41a8d1050449d59b5
2020-10-10 16:00:11 +00:00
q3k a4a5a66f88 Merge "nix: provide a python2 toolchain" 2020-10-10 15:59:58 +00:00
q3k 7d311e9602 ops/monitoring: pull in grafonnet-7.0
Change-Id: Ie036ef767419418876a18255a5ad378f5cfa1535
2020-10-10 15:59:45 +00:00
q3k 3af7da1988 third_party/licenses: create, import Apache-2.0
Change-Id: I3f1a9ede192e70244c8d51bd58e9232a186a203f
2020-10-10 15:59:29 +00:00
q3k c824405e2e Update COPYING
Change-Id: I22661254b16840bcea7b352d51171a232fa7041a
2020-10-10 15:59:10 +00:00
q3k 531cacf14a Merge "WORKSPACE: use nix for python/go if available" 2020-10-07 12:56:38 +00:00
q3k eb09c6a347 speedtest: fix mimetype on served JS
Change-Id: Ifcb1d4f8a58a5e6120f31373b2a8c0e307e414be
2020-10-06 15:29:08 +00:00
q3k 363bf4f341 monitoring: global: implement
This creates a basic Global instance, running Victoria Metrics on k0.

Change-Id: Ib03003213d79b41cc54efe40cd2c4837f652c0f4
2020-10-06 14:28:27 +00:00
q3k 27885a9979 nix: provide a python2 toolchain
This allows us to use rules_docker from NixOS. However, the built
binaries are broken because of the Docker base image not being NixOS
based.

Change-Id: I29b93f1bae1575b04f97265c67497081d11a1910
2020-10-03 16:41:39 +00:00
q3k 2e001e5046 k0: bump to 1.15.4
This notably fixes the annoying loopback issues that prevented hosts
from accessing externalip services with externalTrafficPolicy: local
from nodes that weren't running the service.

Which means, hopefuly, no more registry pull failures when
nginx-ingress gets misplaced!

Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
2020-10-03 16:32:38 +00:00
q3k 2a223705fd cluster: bump certs
This has been deployed to k0 nodes.

Current state of cluster certificates:

cluster/certs/ca-etcd.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-etcdpeer.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kube.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kubefront.crt
            Not After : Apr  4 17:59:00 2024 GMT
cluster/certs/ca-kube-prodvider.cert
            Not After : Sep  1 21:30:00 2021 GMT
cluster/certs/etcd-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcd-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcd-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-calico.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/etcd-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/etcd-kube.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcdpeer-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcdpeer-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/etcdpeer-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/etcd-root.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-apiserver.cert
            Not After : Oct  3 15:26:00 2021 GMT
cluster/certs/kube-controllermanager.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kubefront-apiserver.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-bc01n01.hswaw.net.cert
            Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/kube-kubelet-bc01n02.hswaw.net.cert
            Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/kube-kubelet-bc01n03.hswaw.net.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s22.hswaw.net.cert
            Not After : Oct  3 15:33:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s24.hswaw.net.cert
            Not After : Oct  3 15:38:00 2021 GMT
cluster/certs/kube-proxy.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-scheduler.cert
            Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-serviceaccounts.cert
            Not After : Mar 28 15:15:00 2021 GMT

Change-Id: I94030ce78c10f7e9a0c0257d55145ef629195314
2020-10-03 16:32:32 +00:00
q3k 194b1c8e62 WORKSPACE: use nix for python/go if available
This introduces Nix, the package manager, and nixpkgs, the package
collection, into hscloud's bazel build machinery.

There are two reasons behind this:

 - on NixOS, it's painful or at least very difficult to run hscloud out
   of the box. Especially with rules_go, that download a blob from the
   Internet to get a Go toolchain, it just fails outright. This solves
   this and allows hscloud to be used on NixOS.

 - on non-NixOS platforms that still might have access to Nix this
   allows to somewhat hermeticize the build. Notably, Python now comes
   from nixpkgs, and is fabricobbled in a way that makes pip3_import
   use Nix system dependencies for ncurses and libpq.

This has been tested to run ci_presubmit on NixOS 20.09pre and Gentoo
~amd64.

Change-Id: Ic16e4827cb52a05aea0df0eed84d80c5e9ae0e07
2020-10-03 18:31:38 +02:00
q3k 6abe4fa771 bgpwtf/machines: init edge01.waw
This configures our WAW edge router using NixOS. This replaces our
previous Ubuntu installation.

Change-Id: Ibd72bde66ec413164401da407c5b268ad83fd3af
2020-10-03 14:57:38 +00:00
q3k 2efb698d22 *: add default.nix/readTree
This makes all Nix files addressable from root by file path.

For instance, if a file is located in //foo/bar:baz.nix containing:

    { pkgs, ... }:

    pkgs.stdenv.mkDerivation {
      pname = "foo";
      # ...
    }

You can then do:

    nix-build -A foo.bar.baz

All nix files loaded this way must be a function taking a 'config'
attrset - see nix/readTree.nix for more information. Currently the
config attrset contains the following fields:

 - hscloud: the root of the hscloud repository itself, which allows
            for traversal via readTree (eg. hscloud.foo.bar.baz)
 - pkgs: nixpkgs
 - pkgsSrc: nixpkgs souce/channel, useful to load NixOS modules.
 - lib, stdenv: lib and stdenv from pkgs.

Change-Id: Ieaacdcabceec18dd6c670d346928bff08b66cf79
2020-10-03 14:57:34 +00:00
q3k fbe234bdb2 cluster: rename module-* into modules/*
Change-Id: I65e06f3e9cec2ba0071259eb755eddbbd1025b97
2020-10-03 14:57:30 +00:00
q3k c7de7e562f cluster: do not export metallb routes to mesh peers
This prevents metallb routes being announced from all peers to our ToR,
thereby preventing issues with traffic hitting services with
externalTrafficPolicy: local.

There still is the from-host loopback issue, but that will be fixed by
upgrading to kube 1.15.

Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04
2020-10-03 14:56:52 +00:00
q3k f0acf16564 prodvider: use SANs in service certificates
This fixes compatibility with prodaccess tools built with Go 1.15, which
introduced 'X.509 CommonName deprecation' [1].

[1] - https://golang.org/doc/go1.15#commonname

Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3
2020-10-03 14:56:35 +00:00
q3k 44628f2b9e Merge "k0.hswaw.net: pass metallb through Calico" 2020-10-02 22:54:57 +00:00
q3k e7fca3acd8 ci_presubmit: init
This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).

For now, you can manually run it to ensure that Everything At Least
Kinda Works.

Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
2020-09-25 21:15:07 +00:00
q3k f00a701f27 tools: remove unused go_sdk.bzl
This is a leftover from an old attempt at NixOS compatibility.

Change-Id: I5050f76b83f47796cdfa6235db8ee5efe8daf3e2
2020-09-25 21:01:12 +00:00
q3k 4e8622df35 djtest: use pyelftools to find uwsgi ld.so
Change-Id: I54bdaa588ff15d8c6ca73c4307076a93a5682d78
2020-09-25 21:00:11 +00:00
q3k a5ed644980 k0.hswaw.net: pass metallb through Calico
Previously, we had the following setup:

                          .-----------.
                          | .....     |
                        .-----------.-|
                        | dcr01s24  | |
                      .-----------.-| |
                      | dcr01s22  | | |
                  .---|-----------| |-'
    .--------.    |   |---------. | |
    | dcsw01 | <----- | metallb | |-'
    '--------'        |---------' |
                      '-----------'

Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.

Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.

                      .-------------------------.
                      | dcr01s24                |
                      |-------------------------|
    .--------.        |---------.   .---------. |
    | dcsw01 | <----- | Calico  |<--| metallb | |
    '--------'        |---------'   '---------' |
                      '-------------------------'

This makes Calico announce our pod/service networks into our L3 fabric!

Calico and metallb talk to eachother over 127.0.0.1 (they both run with
Host Networking), but that requires one side to flip to pasive mode. We
chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.

We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. register access from containerd.

All routes pass through without route reflectors and a full mesh as we
use eBGP over private ASNs in our fabric.

We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.

This is all mildly hacky. Here's hoping that Calico will be able to some
day gain metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...

There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing, enabling route reflection
on dcsw01 (to recreate the mesh routing through dcsw01). Or, maybe by
some more hacking of the Calico BIRD config :/.

Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
2020-09-23 18:55:12 +00:00
q3k 0dd5195766 hackdoc: bump
Change-Id: I027a7d8f30d55773ec0e2ec7700bd780e417cb19
2020-09-23 18:31:35 +00:00
q3k 2b8f3c4af7 Merge changes Ib91e4d3b,I5d41fa12,I839863a8
* changes:
  hackdoc: render TOC inline
  hackdoc: fix pub_listen flag in readme
  hackdoc: do not add ?ref= to intra-links unless necessary
2020-09-23 18:14:49 +00:00
q3k 0a2f413b4c hackdoc: render TOC inline
Change-Id: Ib91e4d3b73354e7e19095ea62eed70a23ef96512
2020-09-23 18:13:20 +00:00
q3k 80380f4444 hackdoc: fix pub_listen flag in readme
Change-Id: I5d41fa12f29ec5cff9251bb0ad77fc5fdafef786
2020-09-23 18:13:20 +00:00
q3k 26f44da5f1 hackdoc: do not add ?ref= to intra-links unless necessary
Change-Id: I839863a8c10c54fae11100b885c972bed348eba6
2020-09-23 18:13:20 +00:00
q3k 059fdfed3b k0: add resource requests/limits to nginx, remove gitea
We just had an outage seemingly caused by N-I-C sendings tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.

I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.

This change reflects ad-hoc production changes.

Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
2020-09-20 22:53:40 +00:00
q3k 242ec58a33 k0: add waw-hdd-redundant-q3k-3
Change-Id: Id3718877d1e67d48c6726d7649a565db657cfc82
2020-09-20 15:36:24 +00:00
q3k c09d8fedcc Merge "app/onlyoffice: init" 2020-09-16 16:59:06 +00:00