18 Commits (4afed98e4eef3ff7074a05595a569d1ef581fc40)

Author SHA1 Message Date
q3k 4842705406 cluster/nix: integrate with readtree
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:

   nix-build -A cluster.nix.provision

   result/bin/provision

Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
2021-02-14 14:46:07 +00:00
q3k 04604b2aae cluster: add admitomatic CA/certificate
Change-Id: Idb32dc38b897aa266b6d2d6fd57a5e38b47db7fc
2021-02-06 17:18:58 +00:00
q3k 36224c617a clustercfg: show diff before switching to new configuration
This is mildly hacky, but lets us be more informed before we switch to a
new configuration.

Change-Id: I008f3f698db702f1e0992bd41a8d1050449d59b5
2020-10-10 16:00:11 +00:00
q3k e7fca3acd8 ci_presubmit: init
This will be, at some point, a script to run on Gerrit presubmit (i.e.
right before merge).

For now, you can manually run it to ensure that Everything At Least
Kinda Works.

Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
2020-09-25 21:15:07 +00:00
q3k bc73a44519 cluster/clustercfg: fix BUILD
This is continued fallout after migrating from rules_pip.

Change-Id: Idb9b4d4f22aa36512d220ac31375bae7a0f25e4e
2020-08-22 20:33:37 +00:00
implr cae27ecd99 Replace rules_pip with rules_python; use bazel built upstream grpc
instead of Python packages

As usual with Python sadness, the @pydeps wheels are built on the bazel
host, so stuffing them inside a container_image (or py_image) will cause
new and unexpected kinds of misery.

Change-Id: Id4e4d53741cf2da367f01aa15c21c133c5cf0dba
2020-07-08 18:55:34 +02:00
q3k 0dcc702c64 cluster: bump nearly-expired certs
This makes clustercfg ensure certificates are valid for at least 30
days, and renew them otherwise.
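
The 30-day rule described above can be sketched roughly as follows. This is a minimal illustration, not clustercfg's actual code: the name `needs_renewal` and the `MIN_VALIDITY` constant are assumptions made for the example, and the certificate's expiry timestamp is assumed to have already been parsed into a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold mirroring the "at least 30 days" rule in the message.
MIN_VALIDITY = timedelta(days=30)


def needs_renewal(not_after, now=None, min_validity=MIN_VALIDITY):
    """Return True if a cert expiring at `not_after` should be renewed.

    A cert is kept only if it remains valid for at least `min_validity`
    counted from `now`; anything with less time left is renewed.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return not_after - now < min_validity


if __name__ == "__main__":
    today = datetime(2020, 3, 28, tzinfo=timezone.utc)
    expiring_soon = today + timedelta(days=7)
    print(needs_renewal(expiring_soon, today))
```

With this check, a certificate expiring in a week is flagged for renewal, while one with exactly 30 days or more remaining is left alone.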

We use this to bump all the certs that were about to expire in a week.
They are now valid until 2021.

There are still some certs that expire in 2020. We need to figure out a
better story for this, especially as the next expiry is 2021 - today's
prod rollout was somewhat disruptive (basically this was done by a full
cluster upgrade-like rollout flow, via clustercfg).

We also drive-by bump the number of mons in ceph-waw3 to 3, as it should
be (this gets rid of a nasty SPOF that would've bitten us during this
upgrade otherwise).

Change-Id: Iee050b1b9cba4222bc0f3c7bce9e4cf9b25c8bdc
2020-03-28 18:01:40 +01:00
q3k c78cc13528 cluster/nix: locally build nixos derivations
We change the existing behaviour (copy files & run nixos-rebuild switch)
to something closer to nixops-style. This now means that provisioning
admin machines need Nix installed locally, but that's probably an okay
choice to make.

The upside of this approach is that it's easier to debug and test
derivations, as all data is local to the repo and the workstation, and
deploying just means copying a configuration closure and switching the
system to it. At some point we should even be able to run the entire
cluster within a set of test VMs.

We also bump the kubernetes control plane to 1.14. Kubelets are still at
1.13 and their upgrade is coming up today too.

Change-Id: Ia9832c47f258ee223d93893d27946d1161cc4bbd
2020-02-02 22:31:53 +01:00
q3k d186e9468d cluster: move prodvider to kubernetes.default.svc.k0.hswaw.net
In https://gerrit.hackerspace.pl/c/hscloud/+/70 we accidentally
introduced a split-horizon DNS situation:

 - k0.hswaw.net from the Internet resolves to nodes running the k8s API
   servers, and as such can serve API server traffic
 - k0.hswaw.net from the cluster returned no results

This broke prodvider in two ways:
 - it dialed the API servers at k0.hswaw.net
 - even after the endpoint was moved to
   kubernetes.default.svc.k0.hswaw.net, the apiserver cert didn't cover
   that

Thus, not only did we have to change the prodvider endpoint, but we also
had to change the APIserver certs to cover this new name.
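
The second failure mode (a certificate that does not cover the new endpoint name) can be illustrated with a small sketch. This is an assumption-laden simplification, not how prodvider or the apiserver validate names: `name_covered` is a hypothetical helper, the SAN lists are made up for the example, and wildcard handling is reduced to matching a single left-most label.

```python
def name_covered(hostname, sans):
    """Return True if `hostname` matches one of the DNS names in `sans`.

    Exact matches are accepted, and a leading "*." wildcard matches
    exactly one left-most label (simplified illustration only).
    """
    hostname = hostname.lower().rstrip(".")
    for san in sans:
        san = san.lower().rstrip(".")
        if san == hostname:
            return True
        if san.startswith("*."):
            head, _, rest = hostname.partition(".")
            if head and rest == san[2:]:
                return True
    return False


if __name__ == "__main__":
    endpoint = "kubernetes.default.svc.k0.hswaw.net"
    # Hypothetical SAN lists before and after reissuing the cert.
    old_sans = ["k0.hswaw.net"]
    new_sans = old_sans + [endpoint]
    print(name_covered(endpoint, old_sans))
    print(name_covered(endpoint, new_sans))
```

Under these assumptions, moving the endpoint alone is not enough: the client rejects the connection until the new name is added to the certificate.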

I'm not sure this should be the target fix. I think at some point we
should only start referring to in-cluster services via their full (or
cluster.local) names, but right now k0.hswaw.net is an exception and as
such a split, and we have no way to access the internal services from
the outside just yet.

However, getting prodvider to work is important enough that this fix is
IMO good enough for now.

Change-Id: I13d0681208c66f4060acecc78b7ae14b8f8d7125
2019-10-04 13:52:34 +02:00
q3k 5f9b1ecd67 rules_pip: update to new version
rules_pip has a new version [1] of their rule system, incompatible with the
version we used, that fixes a bunch of issues, notably:
 - explicit tagging of repositories for PY2/PY3/PY23 support
 - removal of dependency on host pip (in exchange for having to vendor
   wheels)
 - higher quality tooling for locking

We update to the newer version of rules_pip, rename the external
repository to pydeps and move requirements.txt, the lockfile and the
newly vendored wheels to third_party/, where they belong.

[1] - https://github.com/apt-itude/rules_pip/issues/16

Change-Id: I1065ee2fc410e52fca2be89fcbdd4cc5a4755d55
2019-09-25 14:05:07 +02:00
q3k b13b7ffcdb prod{access,vider}: implement
Prodaccess/Prodvider allow issuing short-lived certificates for all SSO
users to access the kubernetes cluster.

Currently, all users get a personal-$username namespace in which they
have administrative rights. Otherwise, they get no access.

In addition, we define a static CRB to allow some admins access to
everything. In the future, this will be more granular.

We also update relevant documentation.

Change-Id: Ia18594eea8a9e5efbb3e9a25a04a28bbd6a42153
2019-08-30 23:08:18 +02:00
q3k 116da981c9 nix/ -> cluster/nix/
These are related to cluster bootstrapping, not generic language
libraries (like go/ and bzl/).

Change-Id: I03a83c64f3e0fa6cb615d36b4e618f5e92d886ec
2019-07-21 15:53:20 +02:00
Serge Bazanski 2ce367681a *: move away from python_rules
python_rules is completely broken when it comes to py2/py3 support.

Here, we replace it with native python rules from new Bazel versions [1] and rules_pip for PyPI dependencies [2].

rules_pip is little known and experimental, but it seems to work much better than what we had previously.

We also unpin rules_docker and fix .bazelrc to force Bazel into Python 2 mode - hopefully, this repo will now work
fine under operating systems where `python` is python2 (as the standard dictates).

[1] - https://docs.bazel.build/versions/master/be/python.html

[2] - https://github.com/apt-itude/rules_pip

Change-Id: Ibd969a4266db564bf86e9c96275deffb9610dd44
2019-07-16 22:22:05 +00:00
q3k c0fc3ee442 cluster/clustercfg: add clustercfg-nocerts
2019-06-20 16:11:38 +02:00
q3k f81f7d462a cluster/clustercfg: gitignore __pycache__
2019-05-19 03:11:18 +02:00
informatic e24ccd678c clustercfg: fix broken admincreds generation
2019-04-09 13:43:54 +02:00
informatic 598a079f57 clustercfg: extract cfssl handling to separate function
2019-04-09 13:29:33 +02:00
q3k 73cef11c85 *: rejigger tls certs and more
This pretty large change does the following:

 - moves nix from bootstrap.hswaw.net to nix/
 - changes clustercfg to use cfssl and moves it to cluster/clustercfg
 - changes clustercfg to source information about target location of
   certs from nix
 - changes clustercfg to push nix config
 - changes tls certs to have more than one CA
 - recalculates all TLS certs
   (it keeps the old serviceaccounts key, otherwise we end up with
   invalid serviceaccounts - the cert doesn't match, but who cares,
   it's not used anyway)
2019-04-07 00:06:23 +02:00