This unforks benji back into upstream. The old fork didn't support a new
authentication method on Ceph, and we don't have multiple clusters
anymore (so we don't need the functionality of the fork).
Change-Id: Ie79313b2321ca2e22ad2874b75a71385af95105f
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1321
Reviewed-by: informatic <informatic@hackerspace.pl>
This is a chonky refactor that get rids of the previous cluster-centric
defs-* plain nix file setup.
Now, nodes are configured individually in plain nixos modules, and are
provided a view of all other nodes in the 'machines' attribute. Cluster
logic is moved into modules which inspect this array to find other nodes
within the same cluster.
Kubernetes options are not fully clusterified yet (ie., they are still
hardcode to only provide the 'k0' cluster) but that can be fixed later.
The Ceph machinery is a good example of how that can be done.
The new NixOS configs are zero-diff against prod. While this is done
mostly by keeping the logic, we had to keep a few newly discovered
'bugs' around by adding some temporary options which keeps things as they
are. These will be removed in a future CL, then introducing a diff (but
no functional changes, hopefully).
We also remove the nix eval from clustercfg as it was not used anymore
(basically since we refactored certs at some point).
Change-Id: Id79772a96249b0e6344046f96f9c2cb481c4e1f4
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1322
Reviewed-by: informatic <informatic@hackerspace.pl>
Reminded by a power failure on bc01n0{1,2}, we migrate away from at
least one of them into another server.
We also fix up the startup join parameter to not include the node itself
(which is not necessary, but a nice thing to have nonetheless).
Since bc01n01 was the initial node of the cluster, we also disable the
init job for k0 (which we don't care about anyway).
Change-Id: I3406471c0f9542e9d802d39138e400b5a5e74794
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1176
Reviewed-by: q3k <q3k@hackerspace.pl>
This removes the need to source env.{sh,fish} when working with hscloud.
This is done by:
1. Implementing a Go library to reliably detect the location of the
active hscloud checkout. That in turn is enabled by
BUILD_WORKSPACE_DIRECTORY being now a thing in Bazel.
2. Creating a tool `hscloud`, with a command `hscloud workspace` that
returns the workspace path.
3. Wrapping this tool to be accessible from Python and Bash.
4. Bumping all users of hscloud_root to use either the Go library or
one of the two implemented wrappers.
We also drive-by replace tools/install.sh to be a proper sh_binary, and
make it yell at people if it isn't being ran as `bazel run
//tools:install`.
Finally, we also drive-by delete cluster/tools/nixops.sh which was never used.
Change-Id: I7873714319bfc38bbb930b05baa605c5aa36470a
Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1169
Reviewed-by: informatic <informatic@hackerspace.pl>
With this we can use Ceph's multi-site support to easily migrate to our
new k0 Ceph cluster.
This migration was done by using radosgw-admin to rename the existing
realm/zonegroup to the new names (hscloud and eu), and then reworking
the jsonnet so that the Rook operator would effectively do nothing.
It sounds weird that creating a bunch of CRs like
Object{Realm,ZoneGroup,Zone} realm would be a no-op for the operator,
but that's how Rook works - a CephObjectStore generally creates
everything that the above CRs would create too, but implicitly. Adding
the extra CRs just allows specifying extra settings, like names.
(it wasn't fully a no-op, as the rgw daemon is parametrized by
realm/zonegroup/zone names, so that had to be restarted)
We also make the radosgw serve under object.ceph-eu.hswaw.net, which
allows us to right away start using a zonegroup URL instead of the
zone-only URL.
Change-Id: I4dca55a705edb3bd28e54f50982c85720a17b877
This enables radosgw wherever osds are. This should be fast and works
for us because we have little osd hosts.
Change-Id: I4ed014d2790d6c02a2ba8e775aaa1846032dee1e
This is needed to get Rook to talk to an external Ceph 16/Pacific
cluster.
This is mostly a bunch of CRD/RBAC changes. Most notably, we yeet our
own CRD rewrite and just slurp in upstream CRD defs.
Change-Id: I08e7042585722ae4440f97019a5212d6cf733fcc
Ceph CRD updates would fail with:
ERROR Error updating customresourcedefinitions cephclusters.ceph.rook.io: expected kind, but got map
This wasn't just https://github.com/bitnami/kubecfg/issues/259 . We pull
in the 'solution' from Pulumi
(https://github.com/pulumi/pulumi-kubernetes/pull/622) which just
retries the update via a JSON update instead, and that seems to have
worked.
We also add some better error return wrapping, which I used to debug
this issue properly.
Oof.
Change-Id: I2007a7857e44128d74760174b61b59efa58e9cbc
First pass at a non-rook-managed Ceph cluster. We call it k0 instead of
ceph-waw4, as we pretty much are sure now that we will always have a
one-kube-cluster-to-one-ceph-cluster correspondence, with different Ceph
pools for different media kinds (if at all).
For now this has one mon and spinning rust OSDs. This can be iterated on
to make it less terrible with time.
See b/6 for more details.
Change-Id: Ie502a232c700af93f33fcad9fa1c57058161aa11
This now has a zero diff against prod.
location fields in CephCluster.storage.nodes seem to have been removed
from the CRD at some point. Not sure how the CRUSH tree now gets
populated, but whatever, it's been working like this for a while
already. Same for CephObjectStore.gateway.type.
The Rook Operator has been zero-scaled for a while now due to b/6.
Change-Id: I30a836f273f4c1529f60fa9297c96b7aac412f59
For a while now we've had spurious diffs against Ceph on k0 because of
a ClusterRole with an aggregationRule.
The way these behave is that the config object has an empty rule list,
and instead populates an aggregationRule which combines other existing
ClusterRoles into that ClusterRole. The control plane then populates the
rule field when the object is read/acted on, which caused us to always
see a diff between the configuration of that ClusterRole.
This hacks together a hardcoded fix for this particular behaviour.
Porting kubecfg over to SSA would probably also fix this - but that's
too much work for now.
Change-Id: I357c1417d4023691e5809f1af23f58f364353388
This moves the diff-and-activate logic from cluster/nix/provision.nix
into ops/{provision,machines}.nix that can be used for both cluster
machines and bgpwtf machines.
The provisioning scripts now live per-NixOS-config, and anything under
ops.machines.$fqdn now has a .passthru.hscloud.provision derivation
which is that script. When ran, it will attempt to deploy onto the
target machine.
There's also a top-level tool at `ops.provision` which builds all
configurations / machines and can be called with the machine name/fqdn
to call the corresponding provisioner script.
clustercfg is changed to use the new provisioning logic.
Change-Id: I258abce9e8e3db42af35af102f32ab7963046353
This annotation is used to permit routes defined by regexes instead of
simple prefix matching. This is used by our synapse deployment for
routing incomming HTTP requests to diffferent Synapse components.
I've stumbled upon this while deploying a new Matrix/Synapse instance.
This hasn't been yet a problem because the existing ingresses for Matrix
deployments predate admitomatic.
Change-Id: I821e58b214450ccf0de22d2585c3b0d11fbe71c0
This implements the main identd service that will run on our production
hosts. It's comparatively small, as most of the functionality is
implemented in //cluster/identd/ident and //cluster/identd/kubenat.
Change-Id: I1861fe7c93d105faa19a2bafbe9c85fe36502f73
This is a high-level wrapper for querying identd, and uses IdentError to
carry errors received from the server.
Change-Id: I6444a67117193b97146ffd1548151cdb234d47b5
This is the first pass at an ident protocol client. In the end, we want
to implement an ident protocol server for our in-cluster identd, but
starting out with a client helps me getting familiar with the protocol,
and will allow the server implementation to be tested against the
client.
Change-Id: Ic37b84577321533bab2f2fbf7fb53409a5defb95
These can be used by production jobs to get the source port of the
client connecting over HTTP. A followup CR implements just that.
Change-Id: Ic8e29eaf806bb196d8cfcfb604ff66ae4d0d166a
This emits short-lived user credentials for a `dev-user` in crdb-waw1
any time someone prodaccesses.
Change-Id: I0266a05c1f02225d762cfd2ca61976af0658639d
DeveloperCredentialsLocation used to glog.Exitf instead of returning an
error, and a consumer (prodaccess) used to not check the return code.
Bad refactor?
Change-Id: I6c2d05966ba6b3eb300c24a51584ccf5e324cd49