hscloud

Author	SHA1	Message	Date
Serge Bazanski	ee41e94e0a	k0: bump certs Change-Id: I9d7a48d64de5d1aa82a134a8c22bfc50ba8ad270 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1402 Reviewed-by: informatic <informatic@hackerspace.pl>	2022-10-09 20:22:43 +00:00
Serge Bazanski	3c31f32307	cluster: bump prodvider certs Change-Id: Ieefe3c733dd40a94c13a5e1c1648dd43d27c180a Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1386 Reviewed-by: implr <implr@hackerspace.pl>	2022-09-10 15:46:39 +00:00
Bartosz Stebel	e69e98da47	third_party/py: update rules_python, use pip-compile for requirements Change-Id: If8309e8e3a4b58142f7479005a9eb4cbb1043cdb Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1324 Reviewed-by: q3k <q3k@hackerspace.pl>	2022-07-05 21:27:31 +00:00
Serge Bazanski	437b0c335f	rook: fix benji This unforks benji back into upstream. The old fork didn't support a new authentication method on Ceph, and we don't have multiple clusters anymore (so we don't need the functionality of the fork). Change-Id: Ie79313b2321ca2e22ad2874b75a71385af95105f Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1321 Reviewed-by: informatic <informatic@hackerspace.pl>	2022-06-19 11:49:12 +00:00
Serge Bazanski	55a486ae49	cluster: refactor nix machinery to fit //ops This is a chonky refactor that gets rid of the previous cluster-centric defs-* plain nix file setup. Now, nodes are configured individually in plain nixos modules, and are provided a view of all other nodes in the 'machines' attribute. Cluster logic is moved into modules which inspect this array to find other nodes within the same cluster. Kubernetes options are not fully clusterified yet (i.e., they are still hardcoded to only provide the 'k0' cluster) but that can be fixed later. The Ceph machinery is a good example of how that can be done. The new NixOS configs are zero-diff against prod. While this is done mostly by keeping the logic, we had to keep a few newly discovered 'bugs' around by adding some temporary options which keep things as they are. These will be removed in a future CL, then introducing a diff (but no functional changes, hopefully). We also remove the nix eval from clustercfg as it was not used anymore (basically since we refactored certs at some point). Change-Id: Id79772a96249b0e6344046f96f9c2cb481c4e1f4 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1322 Reviewed-by: informatic <informatic@hackerspace.pl>	2022-06-19 11:48:52 +00:00
Serge Bazanski	b0e3693c0e	cluster/kube: calico: fix etcd endpoints Change-Id: Ia93d355ca343fa5a42ec37fbcae9135cb5304f6e Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1285 Reviewed-by: implr <implr@hackerspace.pl>	2022-06-11 19:00:52 +00:00
Bartosz Stebel	0544d27c04	tools, cluster/tools: bazel5 compat: remove unused import Change-Id: I8b264a6c36e4d0f1535f38ad1f41495e62061f26 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1308 Reviewed-by: daz <daz@hackerspace.pl>	2022-06-04 19:56:40 +00:00
Serge Bazanski	d584e76ea3	cluster/clustercfg: fix for nix 2.4 Change-Id: I3f9ebd895495a23ec179ccd237389e8f3e531768 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1284 Reviewed-by: q3k <q3k@hackerspace.pl>	2022-04-04 17:51:44 +00:00
Serge Bazanski	42c17872fd	cluster/certs: bump certs Change-Id: I549364c050a96f72859886e6b724e07924ee3964 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1282 Reviewed-by: q3k <q3k@hackerspace.pl>	2022-04-04 17:51:44 +00:00
Bartosz Stebel	54a34b24a1	cluster/k0: ceph: add tape staging Change-Id: I7fdba86b15f92157888850d2905440b45fb36f17 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1263 Reviewed-by: q3k <q3k@hackerspace.pl>	2022-03-05 22:45:29 +00:00
Patryk Jakuszew	d0a0b18e54	cluster: allow namespace admins to access certificate resources Change-Id: I532dadfe1799da43d12598e388141f8f9a3872de Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1250 Reviewed-by: q3k <q3k@hackerspace.pl>	2022-02-05 15:08:47 +00:00
Serge Bazanski	bdd403c587	cluster: k0: move cockroachdb away from bc01n01, fixup joins Reminded by a power failure on bc01n0{1,2}, we migrate away from at least one of them into another server. We also fix up the startup join parameter to not include the node itself (which is not necessary, but a nice thing to have nonetheless). Since bc01n01 was the initial node of the cluster, we also disable the init job for k0 (which we don't care about anyway). Change-Id: I3406471c0f9542e9d802d39138e400b5a5e74794 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1176 Reviewed-by: q3k <q3k@hackerspace.pl>	2021-12-13 22:30:46 +00:00
Bartosz Stebel	eca1e080d7	calico: restore CNI_NET_DIR Change-Id: I04e17f8639505f5b7cc42e86392abc175b7922db Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1178 Reviewed-by: q3k <q3k@hackerspace.pl>	2021-12-03 03:10:13 +00:00
Bartosz Stebel	12f176c1eb	calico 3.14 -> 1.15 Change-Id: I9eceaf26017e483235b97c8d08717d2750fabe25 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/995 Reviewed-by: q3k <q3k@hackerspace.pl>	2021-11-20 22:12:52 +00:00
Serge Bazanski	0f8e5a2132	*: do not require env.sh This removes the need to source env.{sh,fish} when working with hscloud. This is done by: 1. Implementing a Go library to reliably detect the location of the active hscloud checkout. That in turn is enabled by BUILD_WORKSPACE_DIRECTORY being now a thing in Bazel. 2. Creating a tool `hscloud`, with a command `hscloud workspace` that returns the workspace path. 3. Wrapping this tool to be accessible from Python and Bash. 4. Bumping all users of hscloud_root to use either the Go library or one of the two implemented wrappers. We also drive-by replace tools/install.sh to be a proper sh_binary, and make it yell at people if it isn't being ran as `bazel run //tools:install`. Finally, we also drive-by delete cluster/tools/nixops.sh which was never used. Change-Id: I7873714319bfc38bbb930b05baa605c5aa36470a Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1169 Reviewed-by: informatic <informatic@hackerspace.pl>	2021-10-17 21:21:58 +00:00
Serge Bazanski	3b67afe81b	cluster/certs: refresh Change-Id: I2aa8fead4427b917afa4758ea0078125d9c4e914 Reviewed-on: https://gerrit.hackerspace.pl/c/hscloud/+/1153 Reviewed-by: q3k <q3k@hackerspace.pl>	2021-10-07 19:58:35 +00:00
Piotr Dobrowolski	e839f95079	cluster/kube/k0: add matrix and informatic personal ceph users Change-Id: Ied8d474709b8053e9fc339435d3ca1ca5fdfa710	2021-09-14 22:21:22 +02:00
Serge Bazanski	4b8ee32246	cluster/kube: always enable flexdriver Documentation says [1] this is disabled by default in 1.1, but that documentation kinda lies [2]. [1] - `235d5a384b/Documentation/flexvolume.md (ceph-flexvolume-configuration)` [2] - `64e28af741 (diff-d1eb5cba50e3770b61ccd3c730cd40514053e1da0233dfe09b5e7967e76a2a6cL424-L425)` Change-Id: Ia92c99e137ed751db62c0f56d42c4901986d0bb8	2021-09-14 21:39:39 +02:00
Serge Bazanski	38f72fe094	cluster: k0: move ceph-waw3 to proper realm/zonegroup With this we can use Ceph's multi-site support to easily migrate to our new k0 Ceph cluster. This migration was done by using radosgw-admin to rename the existing realm/zonegroup to the new names (hscloud and eu), and then reworking the jsonnet so that the Rook operator would effectively do nothing. It sounds weird that creating a bunch of CRs like Object{Realm,ZoneGroup,Zone} realm would be a no-op for the operator, but that's how Rook works - a CephObjectStore generally creates everything that the above CRs would create too, but implicitly. Adding the extra CRs just allows specifying extra settings, like names. (it wasn't fully a no-op, as the rgw daemon is parametrized by realm/zonegroup/zone names, so that had to be restarted) We also make the radosgw serve under object.ceph-eu.hswaw.net, which allows us to right away start using a zonegroup URL instead of the zone-only URL. Change-Id: I4dca55a705edb3bd28e54f50982c85720a17b877	2021-09-14 21:39:39 +02:00
Serge Bazanski	18084c1e86	cluster/nix: k0: enable rgw on osds This enables radosgw wherever osds are. This should be fast and works for us because we have little osd hosts. Change-Id: I4ed014d2790d6c02a2ba8e775aaa1846032dee1e	2021-09-14 21:39:39 +02:00
Serge Bazanski	085a8ff247	cluster: k0: upgrade to ceph 16.2.5 This was fun. See b/6 for a log of how swimmingly this went. Change-Id: I96c3c18b5d33ef86523b3506f49a390419e9ca7f	2021-09-14 21:39:39 +02:00
Serge Bazanski	464fb04f39	cluster: k0: bump rook to 1.6 This is needed to get Rook to talk to an external Ceph 16/Pacific cluster. This is mostly a bunch of CRD/RBAC changes. Most notably, we yeet our own CRD rewrite and just slurp in upstream CRD defs. Change-Id: I08e7042585722ae4440f97019a5212d6cf733fcc	2021-09-14 21:39:37 +02:00
Serge Bazanski	6579e842b0	kartongips: paper over^W^Wfix CRD updates Ceph CRD updates would fail with: ERROR Error updating customresourcedefinitions cephclusters.ceph.rook.io: expected kind, but got map This wasn't just https://github.com/bitnami/kubecfg/issues/259 . We pull in the 'solution' from Pulumi (https://github.com/pulumi/pulumi-kubernetes/pull/622) which just retries the update via a JSON update instead, and that seems to have worked. We also add some better error return wrapping, which I used to debug this issue properly. Oof. Change-Id: I2007a7857e44128d74760174b61b59efa58e9cbc	2021-09-11 20:54:34 +00:00
Serge Bazanski	05c4b5515b	cluster/nix: symlink /sbin/lvm This is needed by the new Rook OSD daemons. Change-Id: I16eb24332db40a8209e7eb9747a81fa852e5cad9	2021-09-11 20:45:45 +00:00
Serge Bazanski	9848e7e15f	cluster: deploy NixOS-based ceph First pass at a non-rook-managed Ceph cluster. We call it k0 instead of ceph-waw4, as we pretty much are sure now that we will always have a one-kube-cluster-to-one-ceph-cluster correspondence, with different Ceph pools for different media kinds (if at all). For now this has one mon and spinning rust OSDs. This can be iterated on to make it less terrible with time. See b/6 for more details. Change-Id: Ie502a232c700af93f33fcad9fa1c57058161aa11	2021-09-11 20:33:24 +00:00
q3k	1dbefed537	Merge "cluster/kube: remove ceph diff against k0 production"	2021-09-11 20:32:57 +00:00
q3k	9f639694ba	Merge "kartongips: switch default diff behaviour to subset, nag users"	2021-09-11 20:18:34 +00:00
q3k	29f314b620	Merge "kartongips: implement proper diffing of aggregated ClusterRoles"	2021-09-11 20:18:28 +00:00
Serge Bazanski	4f0468fa26	cluster/kube: remove ceph diff against k0 production This now has a zero diff against prod. location fields in CephCluster.storage.nodes seem to have been removed from the CRD at some point. Not sure how the CRUSH tree now gets populated, but whatever, it's been working like this for a while already. Same for CephObjectStore.gateway.type. The Rook Operator has been zero-scaled for a while now due to b/6. Change-Id: I30a836f273f4c1529f60fa9297c96b7aac412f59	2021-09-11 12:43:53 +00:00
Serge Bazanski	59c8149df4	kartongips: switch default diff behaviour to subset, nag users Change-Id: I998cdf7e693f6d1ce86c7ea411f47320d72a5906	2021-09-11 12:43:50 +00:00
Serge Bazanski	72d7574536	kartongips: implement proper diffing of aggregated ClusterRoles For a while now we've had spurious diffs against Ceph on k0 because of a ClusterRole with an aggregationRule. The way these behave is that the config object has an empty rule list, and instead populates an aggregationRule which combines other existing ClusterRoles into that ClusterRole. The control plane then populates the rule field when the object is read/acted on, which caused us to always see a diff between the configuration of that ClusterRole. This hacks together a hardcoded fix for this particular behaviour. Porting kubecfg over to SSA would probably also fix this - but that's too much work for now. Change-Id: I357c1417d4023691e5809f1af23f58f364353388	2021-09-11 12:40:18 +00:00
Serge Bazanski	b3c6770f8d	ops, cluster: consolidate NixOS provisioning This moves the diff-and-activate logic from cluster/nix/provision.nix into ops/{provision,machines}.nix that can be used for both cluster machines and bgpwtf machines. The provisioning scripts now live per-NixOS-config, and anything under ops.machines.$fqdn now has a .passthru.hscloud.provision derivation which is that script. When ran, it will attempt to deploy onto the target machine. There's also a top-level tool at `ops.provision` which builds all configurations / machines and can be called with the machine name/fqdn to call the corresponding provisioner script. clustercfg is changed to use the new provisioning logic. Change-Id: I258abce9e8e3db42af35af102f32ab7963046353	2021-09-10 23:55:52 +00:00
Serge Bazanski	432fa30ded	cluster/certs: bump ca-kube-prodivider Redeployed. Change-Id: I01110433f89df5595de0f9587508104d6091a774	2021-08-29 17:20:59 +00:00
Serge Bazanski	89a16f4de4	cluster/admitomatic: allow use-regex n-i-c annotation This annotation is used to permit routes defined by regexes instead of simple prefix matching. This is used by our Synapse deployment for routing incoming HTTP requests to different Synapse components. I stumbled upon this while deploying a new Matrix/Synapse instance. This hasn't been a problem yet because the existing ingresses for Matrix deployments predate admitomatic. Change-Id: I821e58b214450ccf0de22d2585c3b0d11fbe71c0	2021-06-06 12:58:11 +00:00
q3k	7251f2720e	Merge changes Ib068109f,I9a00487f,I1861fe7c,I254983e5,I3e2bedca, ... * changes: cluster/identd/ident: update README cluster/kube: deploy identd cluster/identd: implement cluster/identd/kubenat: implement cluster/identd/cri: import cluster/identd/ident: add TestE2E cluster/identd/ident: add Query function cluster/identd/ident: add IdentError cluster/identd/ident: add basic ident protocol server cluster/identd/ident: add basic ident protocol client	2021-05-28 23:08:10 +00:00
Serge Bazanski	46c3137d36	cluster/identd/ident: update README Change-Id: Ib068109ff37749207e7b2a18c07f51d3c4ed3fd6	2021-05-26 19:46:13 +00:00
Serge Bazanski	2414afe3c0	cluster/kube: deploy identd Change-Id: I9a00487fc4a972ecb0904055dbaaab08221062c1	2021-05-26 19:46:09 +00:00
Serge Bazanski	044386d638	cluster/identd: implement This implements the main identd service that will run on our production hosts. It's comparatively small, as most of the functionality is implemented in //cluster/identd/ident and //cluster/identd/kubenat. Change-Id: I1861fe7c93d105faa19a2bafbe9c85fe36502f73	2021-05-26 19:46:06 +00:00
Serge Bazanski	6b649f8234	cluster/identd/kubenat: implement This is a library to find pod information for a given TCP 4-tuple. Change-Id: I254983e579e3aaa04c0c5491851f4af94a3f4249	2021-05-26 19:46:02 +00:00
Serge Bazanski	ae052f0804	cluster/identd/cri: import This imports the CRI protobuf/gRPC specs. These are pulled from: https://raw.githubusercontent.com/kubernetes/cri-api/master/pkg/apis/runtime/v1alpha2/api.proto Our host containerd does not implement v1, so we go with v1alpha2. Change-Id: I3e2bedca76edc85eea9b61a8634c92175f0d2a30	2021-05-26 19:45:58 +00:00
Serge Bazanski	3638a3d76a	cluster/identd/ident: add TestE2E Change-Id: I8a95fadf19376de2806cb63897b77e370559392f	2021-05-23 16:27:22 +00:00
Serge Bazanski	8e603e13e5	cluster/identd/ident: add Query function This is a high-level wrapper for querying identd, and uses IdentError to carry errors received from the server. Change-Id: I6444a67117193b97146ffd1548151cdb234d47b5	2021-05-23 16:27:17 +00:00
Serge Bazanski	1c2bc12ad0	cluster/identd/ident: add IdentError This adds a Go error type that can be used to wrap any ErrorResponse. Change-Id: I57fbd056ac774f4e2ae3bdf85941c1010ada0656	2021-05-23 16:26:59 +00:00
Serge Bazanski	ce2737f2e7	cluster/identd/ident: add basic ident protocol server This adds an ident protocol server and tests for it. Change-Id: I830f85faa7dce4220bd7001635b20e88b4a8b417	2021-05-23 16:26:54 +00:00
Serge Bazanski	d4438d67a2	cluster/identd/ident: add basic ident protocol client This is the first pass at an ident protocol client. In the end, we want to implement an ident protocol server for our in-cluster identd, but starting out with a client helps me getting familiar with the protocol, and will allow the server implementation to be tested against the client. Change-Id: Ic37b84577321533bab2f2fbf7fb53409a5defb95	2021-05-23 16:26:50 +00:00
Serge Bazanski	e17f7edde0	cluster/kube: nginx: add Hscloud-Nic-Source-* headers These can be used by production jobs to get the source port of the client connecting over HTTP. A followup CR implements just that. Change-Id: Ic8e29eaf806bb196d8cfcfb604ff66ae4d0d166a	2021-05-22 19:16:39 +00:00
Serge Bazanski	ba2f4d8215	cluster/prodvider: deploy Change-Id: I01d931a664e4b09c0d75fb01fb3f2528bc0f1a53	2021-05-19 22:13:26 +00:00
Serge Bazanski	02e1598eb3	cluster/prodvider: emit crdb certs This emits short-lived user credentials for a `dev-user` in crdb-waw1 any time someone prodaccesses. Change-Id: I0266a05c1f02225d762cfd2ca61976af0658639d	2021-05-19 22:13:22 +00:00
Serge Bazanski	bade46d45f	go/pki: fix error return DeveloperCredentialsLocation used to glog.Exitf instead of returning an error, and a consumer (prodaccess) used to not check the return code. Bad refactor? Change-Id: I6c2d05966ba6b3eb300c24a51584ccf5e324cd49	2021-05-19 22:12:08 +00:00
q3k	5ae5cbec81	Merge "cluster/kube: bump nginx-ingress-controller, backport openssl 1.1.1k"	2021-05-19 15:34:45 +00:00


247 commits