This turns the existing script into a proper sh_binary and injects its
dependencies (kubectl and jq) into it as deps.
This change also pulls in BUILD files for jq and its dependency
(oniguruma) into //third_party, and adds buildable external
repositories for them.
The jq/oniguruma BUILD files are lifted from
https://github.com/attilaolah/bazel-tools/.
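For illustration, the script can now be run hermetically through
Bazel, with kubectl and jq coming from the build graph instead of the
host (target label hypothetical):

  $ bazel run //cluster/tools:script  # deps provide kubectl and jq, not $PATH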
Change-Id: If2e548bd60a8fd34e4f3be767ae59c6b2f2286d9
The cluster configuration was getting large and unwieldy (to the point
where kubecfg was slow).
In this change, we:
- move the Cluster function to cluster.libsonnet
- move the Cluster instantiation into k0.libsonnet
- shuffle some fields around to make sure things are well split between
k0-specific and general cluster configs.
- add 'view' files that build on 'cluster.libsonnet' to allow rendering
  either the entire k0 state, or some subsets (for speed) - see the
  sketch below
- update the documentation, with some drive-by small fixes and
  reindentation
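As a sketch of the intended workflow (file paths hypothetical):

  $ kubecfg show cluster/kube/k0.jsonnet       # render the entire k0 state (slow)
  $ kubecfg show cluster/kube/k0-ceph.jsonnet  # render only one subset (fast)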
Change-Id: I4b8d920b600df79100295267efe21b8c82699d5b
We're not using rules_nix/nixpkgs for anything. Initially they were
going to be used
for nixops, but nixops is not very good, so let's just drop them.
We still have a Nix dependency for clustercfg.py when provisioning
nodes, but rules_nix/nixpkgs in WORKSPACE were unrelated to that.
Change-Id: I28c249507d1be9c5dbbd1ee764deccd9ab038549
We handwavingly plan on implementing monitoring as a two-tier system:
- a 'global' component that is responsible for global aggregation,
long-term storage and alerting.
- multiple 'per-cluster' components, that collect metrics from
Kubernetes clusters and export them to the global component.
In addition, several lower tiers (collected by per-cluster components)
might also be implemented in the future - for instance, specific to some
subprojects.
Here we start sketching out some basic jsonnet structure (currently all
in a single file, with little parametrization) and a cluster-level
prometheus server that scrapes Kubernetes Node and cAdvisor metrics.
This review is mostly to get this committed as early as possible, and to
make sure that the little existing Prometheus scrape configuration is
sane.
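Once this is running, one way to sanity-check the resulting scrape
configuration (namespace and service names hypothetical):

  $ kubectl -n monitoring port-forward svc/prometheus 9090 &
  $ curl -s localhost:9090/api/v1/targets \
      | jq '.data.activeTargets[] | .labels.job + ": " + .health'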
Change-Id: If37ac3b1243b8b6f464d65fee6d53080c36f992c
This kills two birds with one stone:
- update the secretstore tool to be slightly smarter about secrets, to
  the point where we can now just point it at a secret directory and
  ask it to 'sync' all secrets in there (see the example below)
- run the new fancy sync command on all keys to update them, as a
  follow-up to gerrit/328.
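ie. roughly (directory path hypothetical):

  $ secretstore sync secrets/  # re-syncs every secret under this directory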
Change-Id: I0eec4a3e8afcd9481b0b248154983aac25657c40
This was an attempt to make new calico nodes use a full FQDN. However,
this change seemingly also makes the calico control plane use the FQDN
for all existing nodes, thus breaking CNI for new pods.
We revert this change, thereby keeping all calico node names as
hostnames. We could fix this by editing /var/lib/calico/nodename on
hosts to FQDNs, but it might not be worth the effort.
See https://github.com/projectcalico/calico/issues/1093 for more
context.
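For reference, the manual fix on each affected host would look
something like this (untested sketch):

  $ hostname -f | sudo tee /var/lib/calico/nodename
  $ # ...then restart calico-node on that host to pick up the new name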
Change-Id: I52bfb00f604053d57d3009aebd6c50db7dc74f58
We still use etcd as the data store (and as such didn't set up k8s CRDs
for Calico), but that's okay for now.
Change-Id: If6d66f505c6b40f2646ffae7d33d0d641d34a963
This previously allowed all namespace admins (ie. personal-$user
namespace users) to create any sort of object they wanted within that
namespace.
This could've been exploited to allow creation of a RoleBinding that
would then allow binding a serviceaccount to the insecure
podsecuritypolicy, thereby allowing escalation to root on nodes.
As far as I've checked, this hasn't been exploited, and access to the
k8s cluster has so far also been limited to trusted users.
This has been deployed to production.
Change-Id: Icf8747d765ccfa9fed843ec9e7b0b957ff27d96e
This bumps Rook/Ceph. The new resources (mostly RBAC) come from
following https://rook.io/docs/rook/v1.1/ceph-upgrade.html .
It's already deployed on production. The new CSI driver has not been
tested, but the old flexvolume-based provisioners still work. We'll
migrate when Rook offers a nice solution for this.
We've hit a kubecfg bug that does not allow controlling the CephCluster
CRD directly anymore (I had to apply it via kubecfg show / kubectl apply
-f instead). This might be due to our bazel/prod k8s version mismatch,
or it might be related to https://github.com/bitnami/kubecfg/issues/259.
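The workaround boils down to (file path hypothetical):

  $ kubecfg show cluster/kube/k0.jsonnet | kubectl apply -f -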
Change-Id: Icd69974b294b823e60b8619a656d4834bd6520fd
This is hackdoc, a documentation rendering tool for monorepos.
This is the first code iteration, which can only serve from a local git
checkout.
The code is incomplete and WIP.
Change-Id: I68ef7a991191c1bb1b0fdd2a8d8353aba642e28f
This makes clustercfg ensure certificates are valid for at least 30
days, and renew them otherwise.
We use this to bump all the certs that were about to expire in a week.
They are now valid until 2021.
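The check clustercfg now performs is roughly equivalent to the
following (certificate path hypothetical):

  $ openssl x509 -checkend $((30*24*3600)) -noout -in kube-apiserver.crt \
      || echo 'expires within 30 days, renew'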
There are still some certs that expire in 2020. We need to figure out a
better story for this, especially as the next expiry is 2021 - today's
prod rollout was somewhat disruptive (basically this was done via a
full cluster upgrade-like rollout flow, via clustercfg).
We also drive-by bump the number of mons in ceph-waw3 to 3, as it should
be (this gets rid of a nasty SPOF that would've bitten us during this
upgrade otherwise).
Change-Id: Iee050b1b9cba4222bc0f3c7bce9e4cf9b25c8bdc
In preparation for updating to 1.1.0, which will be much more involved.
Also fix a typo in registry.libsonnet, whoops.
Change-Id: I7668bf53c7580f99fdf56fe6227f04a468f8de50
This reflects current production. This needs to get bumped up to 3 at
some point as otherwise we lose HA for this cluster.
Change-Id: Ie5937e6a216b635ecbc4c82ecd182a410167c3f8
We change the existing behaviour (copy files & run nixos-rebuild switch)
to something closer to nixops-style. This now means that admin
machines used for provisioning need Nix installed locally, but that's
probably an okay choice to make.
The upside of this approach is that it's easier to debug and test
derivations, as all data is local to the repo and the workstation, and
deploying just means copying a configuration closure and switching the
system to it. At some point we should even be able to run the entire
cluster within a set of test VMs.
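In practice, a deploy now looks roughly like this (attribute and host
names hypothetical):

  $ closure=$(nix-build -A cluster.machines.node1 --no-out-link)
  $ nix-copy-closure --to root@node1.hswaw.net "$closure"
  $ ssh root@node1.hswaw.net "$closure/bin/switch-to-configuration switch"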
We also bump the kubernetes control plane to 1.14. Kubelets are still at
1.13 and their upgrade is coming up today too.
Change-Id: Ia9832c47f258ee223d93893d27946d1161cc4bbd
This needed an upstream change to allow only some pools to be backed up,
otherwise benji would crash when stumbling upon the first PVC from a
pool that wasn't backed by the ceph cluster it was acting upon.
Change-Id: I52bf163c16352cb59fdd3dbdd576145ce1dbac03
This time from a bare hscloud checkout to make sure _nothing_ is fucked
up.
This causes no change remotely, just makes the repo reflect reality.
Change-Id: Ie8db01300771268e0371c3cdaf1930c8d7cbfb1a
This productionizes smsgw.
We also add some jsonnet machinery to provide a unified service for Go
micro/mirkoservices.
This machinery provides all the nice stuff:
- a deployment
- a service for all your types of ports
- TLS certificates for HSPKI
We also update and test hspki for a new name scheme.
Change-Id: I292d00f858144903cbc8fe0c1c26eb1180d636bc
In https://gerrit.hackerspace.pl/c/hscloud/+/70 we accidentally
introduced a split-horizon DNS situation:
- k0.hswaw.net from the Internet resolves to nodes running the k8s API
servers, and as such can serve API server traffic
- k0.hswaw.net from the cluster returned no results
This broke prodvider in two ways:
- it dialed the API servers at k0.hswaw.net
- even after the endpoint was moved to
kubernetes.default.svc.k0.hswaw.net, the apiserver cert didn't cover
that
Thus, not only did we have to change the prodvider endpoint, we also
had to change the APIserver certs to cover this new name.
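To verify that the new certs actually cover the name (port number
hypothetical):

  $ echo | openssl s_client -connect k0.hswaw.net:443 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'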
I'm not sure this should be the target fix. I think at some point we
should only start referring to in-cluster services via their full (or
cluster.local) names, but right now k0.hswaw.net is an exception and
as such remains split, and we have no way to access the internal
services from the outside just yet.
However, getting prodvider to work is important enough that this fix is
IMO good enough for now.
Change-Id: I13d0681208c66f4060acecc78b7ae14b8f8d7125
This way kubernetes consumers don't have to import anything from
cluster/, hopefully.
We also create a small abstraction for local additions to
kube.libsonnet without having to modify upstream.
Change-Id: I209095781f91c8867250a647fe944370cddd67d0
This means that in addition to services being discoverable the 'classic'
way:
<svcname>.<namespace>.svc.cluster.local
They are now discoverable as:
<svcname>.<namespace>.svc.<fqdn>
For instance, on k0 you can now internally resolve:
$ kubectl run --rm -it foo --image=nixery.dev/shell/dnsutils bash
bash-4.4# dig +short coffee-svc.default.svc.k0.hswaw.net
10.10.12.192
Change-Id: Ie6875b54ed6358f30f888ca0cd96e011520ace20
Every benji backup seems to cycle blocks (eg. delete some and recreate
them).
Since wasabi has a minimum billing retention policy of 90 days, this
means that every object that gets uploaded and then deleted an hour
later still costs us.
Currently we seem to be storing around 200G of data in wasabi for Benji
but already have 600G of deleted objects. This is suboptimal.
This change has already been deployed on production.
Change-Id: I67302d23a1c45974fb5d51ec9a8cff28260830dc
rules_pip has a new version [1] of its rule system, incompatible with the
version we used, that fixes a bunch of issues, notably:
- explicit tagging of repositories for PY2/PY3/PY23 support
- removal of dependency on host pip (in exchange for having to vendor
wheels)
- higher quality tooling for locking
We update to the newer version of pip_rules, rename the external
repository to pydeps and move requirements.txt, the lockfile and the
newly vendored wheels to third_party/, where they belong.
[1] - https://github.com/apt-itude/rules_pip/issues/16
Change-Id: I1065ee2fc410e52fca2be89fcbdd4cc5a4755d55