This is a mega-change, but attempting to split this up further is
probably not worth the effort.
Summary:
1. Bump up bazel, rules_go, and others.
2. Switch to new go target naming (bye bye go_default_library)
3. Move Go deps to go.mod/go.sum; BUILD rules are now generated from that with gazelle
4. Bump up Python deps a bit
And also whatever was required to actually get things to work - loads of
small useless changes.
Tested to work on NixOS and Ubuntu 20.04:
$ bazel build //...
$ bazel test //...
Change-Id: I8364bdaa1406b9ae4d0385a6b607f3e7989f98a9
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1583
Reviewed-by: q3k <q3k@hackerspace.pl>
This replaces the old clustercfg script with a brand spanking new,
mostly-equivalent Go reimplementation. It's not exactly the same, though;
here are the differences:
1. No cluster deployment logic anymore - we expect everyone to use
   ops/machine at this point.
2. All certs/keys are Ed25519 and do not expire by default - but
   support for short-lived certificates is there, and is actually more
   generic and reusable (see the sketch after this list). Currently
   it's only used for admincreds.
3. Speaking of admincreds: the new admincreds automatically figure out
your username.
4. admincreds also doesn't shell out to kubectl anymore, and doesn't
override your default context. The generated creds can live
peacefully alongside your normal prodaccess creds.
5. gencerts (the new nodestrap without deployment support) now
automatically generates certs for all nodes, based on local Nix
modules in ops/.
6. No secretstore support. This will be changed once we rebuild
secretstore in Go. For now users are expected to manually run
secretstore sync on cluster/secrets.
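A minimal sketch of the Ed25519 plus optional-expiry approach from point 2,
assuming only the Go standard library; names and structure are illustrative,
not the actual clustercfg code:

    package sketch

    import (
        "crypto/ed25519"
        "crypto/rand"
        "crypto/x509"
        "crypto/x509/pkix"
        "math/big"
        "time"
    )

    // issueCert self-signs an Ed25519 certificate for cn. A zero ttl means
    // "effectively never expires"; a non-zero ttl yields a short-lived cert,
    // as used for admincreds.
    func issueCert(cn string, ttl time.Duration) ([]byte, ed25519.PrivateKey, error) {
        pub, priv, err := ed25519.GenerateKey(rand.Reader)
        if err != nil {
            return nil, nil, err
        }
        notAfter := time.Now().Add(100 * 365 * 24 * time.Hour)
        if ttl != 0 {
            notAfter = time.Now().Add(ttl)
        }
        tmpl := &x509.Certificate{
            SerialNumber: big.NewInt(time.Now().UnixNano()),
            Subject:      pkix.Name{CommonName: cn},
            NotBefore:    time.Now(),
            NotAfter:     notAfter,
        }
        der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, pub, priv)
        return der, priv, err
    }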
Change-Id: Ida935f44e04fd933df125905eee10121ac078495
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1498
Reviewed-by: q3k <q3k@hackerspace.pl>
This is a chonky refactor that gets rid of the previous cluster-centric
defs-* plain nix file setup.
Now, nodes are configured individually in plain nixos modules, and are
provided a view of all other nodes in the 'machines' attribute. Cluster
logic is moved into modules which inspect this array to find other nodes
within the same cluster.
Kubernetes options are not fully clusterified yet (i.e., they are still
hardcoded to only provide the 'k0' cluster), but that can be fixed later.
The Ceph machinery is a good example of how that can be done.
The new NixOS configs are zero-diff against prod. While this is done
mostly by keeping the logic as-is, we had to keep a few newly discovered
'bugs' around by adding some temporary options which keep things as they
are. These will be removed in a future CL, which will then introduce a
diff (but hopefully no functional changes).
We also remove the nix eval from clustercfg as it was not used anymore
(basically since we refactored certs at some point).
Change-Id: Id79772a96249b0e6344046f96f9c2cb481c4e1f4
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1322
Reviewed-by: informatic <informatic@hackerspace.pl>
This removes the need to source env.{sh,fish} when working with hscloud.
This is done by:
1. Implementing a Go library to reliably detect the location of the
   active hscloud checkout (see the sketch after this list). That in turn
   is enabled by BUILD_WORKSPACE_DIRECTORY now being a thing in Bazel.
2. Creating a tool `hscloud`, with a command `hscloud workspace` that
returns the workspace path.
3. Wrapping this tool to be accessible from Python and Bash.
4. Bumping all users of hscloud_root to use either the Go library or
one of the two implemented wrappers.
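A hypothetical sketch of the detection logic from point 1, not the actual
library API: under `bazel run`, BUILD_WORKSPACE_DIRECTORY points at the
checkout; otherwise we walk up from the working directory looking for the
WORKSPACE file.

    package sketch

    import (
        "errors"
        "os"
        "path/filepath"
    )

    // workspaceRoot returns the root of the active hscloud checkout.
    func workspaceRoot() (string, error) {
        // Set by Bazel when running under `bazel run`.
        if dir := os.Getenv("BUILD_WORKSPACE_DIRECTORY"); dir != "" {
            return dir, nil
        }
        dir, err := os.Getwd()
        if err != nil {
            return "", err
        }
        for {
            if _, err := os.Stat(filepath.Join(dir, "WORKSPACE")); err == nil {
                return dir, nil
            }
            parent := filepath.Dir(dir)
            if parent == dir {
                return "", errors.New("not inside an hscloud checkout")
            }
            dir = parent
        }
    }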
We also drive-by replace tools/install.sh to be a proper sh_binary, and
make it yell at people if it isn't being run as `bazel run
//tools:install`.
Finally, we also drive-by delete cluster/tools/nixops.sh which was never used.
Change-Id: I7873714319bfc38bbb930b05baa605c5aa36470a
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1169
Reviewed-by: informatic <informatic@hackerspace.pl>
This moves the diff-and-activate logic from cluster/nix/provision.nix
into ops/{provision,machines}.nix that can be used for both cluster
machines and bgpwtf machines.
The provisioning scripts now live per-NixOS-config, and anything under
ops.machines.$fqdn now has a .passthru.hscloud.provision derivation
which is that script. When ran, it will attempt to deploy onto the
target machine.
There's also a top-level tool at `ops.provision` which builds all
configurations / machines and can be invoked with a machine name/fqdn to
run the corresponding provisioner script.
clustercfg is changed to use the new provisioning logic.
Change-Id: I258abce9e8e3db42af35af102f32ab7963046353
This unifies nixpkgs with the one defined in //default.nix and makes it
possible to use readTree to build the provisioners:
nix-build -A cluster.nix.provision
result/bin/provision
Change-Id: I68dd70b9c8869c7c0b59f5007981eac03667b862
This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).
For now, you can manually run it to ensure that Everything At Least
Kinda Works.
Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
instead of Python packages
As usual with Python sadness, the @pydeps wheels are built on the bazel
host, so stuffing them inside a container_image (or py_image) will cause
new and unexpected kinds of misery.
Change-Id: Id4e4d53741cf2da367f01aa15c21c133c5cf0dba
This makes clustercfg ensure certificates are valid for at least 30
days, and renew them otherwise.
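A minimal sketch of the 30-day check, written in Go for illustration only
(the actual clustercfg logic here is not Go and may differ):

    package sketch

    import (
        "crypto/x509"
        "time"
    )

    // needsRenewal reports whether cert expires within minValidity,
    // e.g. needsRenewal(cert, 30*24*time.Hour).
    func needsRenewal(cert *x509.Certificate, minValidity time.Duration) bool {
        return time.Now().Add(minValidity).After(cert.NotAfter)
    }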
We use this to bump all the certs that were about to expire in a week.
They are now valid until 2021.
There are still some certs that expire in 2020. We need to figure out a
better story for this, especially as the next expiry is 2021 - today's
prod rollout was somewhat disruptive (basically this was done by a full
cluster upgrade-like rollout flow, via clustercfg).
We also drive-by bump the number of mons in ceph-waw3 to 3, as it should
be (this gets rid of a nasty SPOF that would've bitten us during this
upgrade otherwise).
Change-Id: Iee050b1b9cba4222bc0f3c7bce9e4cf9b25c8bdc
We change the existing behaviour (copy files & run nixos-rebuild switch)
to something closer to nixops-style. This now means that admin machines
used for provisioning need Nix installed locally, but that's probably an
okay choice to make.
The upside of this approach is that it's easier to debug and test
derivations, as all data is local to the repo and the workstation, and
deploying just means copying a configuration closure and switching the
system to it. At some point we should even be able to run the entire
cluster within a set of test VMs.
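An illustrative sketch of that deploy step, assuming a locally built NixOS
system closure and standard Nix tooling (nix-copy-closure,
switch-to-configuration); the surrounding Go structure is hypothetical:

    package sketch

    import (
        "fmt"
        "os/exec"
    )

    // deploy copies a locally built system closure to a node and activates it.
    func deploy(host, systemClosure string) error {
        // Copy the closure (and all its dependencies) to the target machine.
        copyCmd := exec.Command("nix-copy-closure", "--to", "root@"+host, systemClosure)
        if out, err := copyCmd.CombinedOutput(); err != nil {
            return fmt.Errorf("copying closure: %v: %s", err, out)
        }
        // Switch the running system to the new configuration.
        activate := exec.Command("ssh", "root@"+host,
            systemClosure+"/bin/switch-to-configuration", "switch")
        if out, err := activate.CombinedOutput(); err != nil {
            return fmt.Errorf("switching configuration: %v: %s", err, out)
        }
        return nil
    }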
We also bump the kubernetes control plane to 1.14. Kubelets are still at
1.13 and their upgrade is coming up today too.
Change-Id: Ia9832c47f258ee223d93893d27946d1161cc4bbd
In https://gerrit.hackerspace.pl/c/hscloud/+/70 we accidentally
introduced a split-horizon DNS situation:
- k0.hswaw.net from the Internet resolves to nodes running the k8s API
servers, and as such can serve API server traffic
- k0.hswaw.net from the cluster returned no results
This broke prodvider in two ways:
- it dialed the API servers at k0.hswaw.net
- even after the endpoint was moved to
kubernetes.default.svc.k0.hswaw.net, the apiserver cert didn't cover
that
Thus, we not only had to change the prodvider endpoint but also change
the apiserver certs to cover this new name.
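A small sketch of the check that matters here: whether a certificate's
SANs actually cover the name clients dial (Go standard library; the helper
name is illustrative):

    package sketch

    import "crypto/x509"

    // coversName reports whether cert is valid for the given DNS name,
    // e.g. coversName(apiserverCert, "kubernetes.default.svc.k0.hswaw.net").
    func coversName(cert *x509.Certificate, name string) bool {
        return cert.VerifyHostname(name) == nil
    }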
I'm not sure this should be the target fix. I think at some point we
should only start referring to in-cluster services via their full (or
cluster.local) names, but right now k0.hswaw.net is an exception and as
such remains split-horizon, and we have no way to access the internal
services from the outside just yet.
However, getting prodvider to work is important enough that this fix is
IMO good enough for now.
Change-Id: I13d0681208c66f4060acecc78b7ae14b8f8d7125
rules_pip has a new version [1] of its rule system, incompatible with the
version we used, that fixes a bunch of issues, notably:
- explicit tagging of repositories for PY2/PY3/PY23 support
- removal of dependency on host pip (in exchange for having to vendor
wheels)
- higher quality tooling for locking
We update to the newer version of rules_pip, rename the external
repository to pydeps and move requirements.txt, the lockfile and the
newly vendored wheels to third_party/, where they belong.
[1] - https://github.com/apt-itude/rules_pip/issues/16
Change-Id: I1065ee2fc410e52fca2be89fcbdd4cc5a4755d55
Prodaccess/Prodvider allow issuing short-lived certificates for all SSO
users to access the kubernetes cluster.
Currently, all users get a personal-$username namespace in which they
have administrative rights. Otherwise, they get no access.
In addition, we define a static CRB to allow some admins access to
everything. In the future, this will be more granular.
We also update relevant documentation.
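A hedged sketch of the kind of short-lived client certificate involved;
the real prodvider code differs and the group name is hypothetical.
Kubernetes maps a client certificate's CN to the username and its O
entries to groups, which RBAC objects like the static CRB above can then
reference.

    package sketch

    import (
        "crypto/x509/pkix"
        "time"
    )

    // clientCertSubject returns the subject and expiry for a user's
    // short-lived certificate.
    func clientCertSubject(username string) (pkix.Name, time.Time) {
        subject := pkix.Name{
            CommonName:   username,              // Kubernetes username
            Organization: []string{"sso:users"}, // hypothetical group
        }
        return subject, time.Now().Add(12 * time.Hour) // expires within hours
    }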
Change-Id: Ia18594eea8a9e5efbb3e9a25a04a28bbd6a42153
python_rules is completely broken when it comes to py2/py3 support.
Here, we replace it with native python rules from new Bazel versions [1]
and rules_pip for PyPI dependencies [2]. rules_pip is somewhat little
known and experimental, but it seems to work much better than what we had
previously.
We also unpin rules_docker and fix .bazelrc to force Bazel into Python 2
mode - hopefully, this repo will now work fine under operating systems
where `python` is python2 (as the standard dictates).
[1] - https://docs.bazel.build/versions/master/be/python.html
[2] - https://github.com/apt-itude/rules_pip
Change-Id: Ibd969a4266db564bf86e9c96275deffb9610dd44
This pretty large change does the following:
- moves nix from bootstrap.hswaw.net to nix/
- changes clustercfg to use cfssl and moves it to cluster/clustercfg
- changes clustercfg to source information about target location of
certs from nix
- changes clustercfg to push nix config
- changes tls certs to have more than one CA
- recalculates all TLS certs
(it keeps the old serviceaccounts key, otherwise we end up with
invalid serviceaccounts - the cert doesn't match, but who cares,
it's not used anyway)