These tests are broken as they depend on some test data that we
currently don't have in hscloud. They should be fixed ASAP.
Change-Id: I2571c2958cb84e145a7e3a44171685ecf43cf499
This forks bitnami/kubecfg into kartongips. The rationale is that we
want to implement hscloud-specific functionality that wouldn't really be
upstreamable into kubecfg (like secret support, multi-cluster support).
We forked off from github.com/q3k/kubecfg at commit b6817a94492c561ed61a44eeea2d92dcf2e6b8c0.
Change-Id: If5ba513905e0a86f971576fe7061a471c1d8b398
We want to be able to scrape controller-manager and scheduler metrics
into Prometheus. For that, each of them needs to:
1) listen on a secure port
2) have authn enabled
With this, any k8s user with the right permissions (and a bearer token
or TLS certificate) can come in and access metrics over a node's public
IP address. Access without a certificate/token gets thrown into the
system:anonymous user, which has no access to any API.
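
For illustration, the Prometheus side then boils down to a scrape job
like the following (a sketch only, in jsonnet; 10257/10259 are the
upstream default secure ports for controller-manager/scheduler, and the
token path is the standard in-cluster serviceaccount mount - none of
this is lifted from our actual config):

    // Sketch: scrape kube-controller-manager over its secure port,
    // authenticating with the scraper's serviceaccount bearer token.
    // The scheduler job is analogous, on port 10259.
    {
      job_name: 'controller-manager',
      scheme: 'https',
      bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token',
      tls_config: { insecure_skip_verify: true },  // or ca_file with our CA
      static_configs: [
        { targets: ['bc01n01.hswaw.net:10257'] },  // example node
      ],
    }
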
Change-Id: I267680f92f748ba63b6762e6aaba3c417446e50b
This notably fixes the annoying loopback issues that prevented hosts
from accessing externalip services with externalTrafficPolicy: local
from nodes that weren't running the service.
Which means, hopefully, no more registry pull failures when
nginx-ingress gets misplaced!
Change-Id: Id4923fd0fce2e28c31a1e65518b0e984165ca9ec
This has been deployed to k0 nodes.
Current state of cluster certificates:
cluster/certs/ca-etcd.crt
    Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-etcdpeer.crt
    Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kube.crt
    Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kubefront.crt
    Not After : Apr 4 17:59:00 2024 GMT
cluster/certs/ca-kube-prodvider.cert
    Not After : Sep 1 21:30:00 2021 GMT
cluster/certs/etcd-bc01n01.hswaw.net.cert
    Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcd-bc01n02.hswaw.net.cert
    Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcd-bc01n03.hswaw.net.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-calico.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcd-dcr01s22.hswaw.net.cert
    Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/etcd-dcr01s24.hswaw.net.cert
    Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/etcd-kube.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-bc01n01.hswaw.net.cert
    Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/etcdpeer-bc01n02.hswaw.net.cert
    Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/etcdpeer-bc01n03.hswaw.net.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/etcdpeer-dcr01s22.hswaw.net.cert
    Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/etcdpeer-dcr01s24.hswaw.net.cert
    Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/etcd-root.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-apiserver.cert
    Not After : Oct 3 15:26:00 2021 GMT
cluster/certs/kube-controllermanager.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kubefront-apiserver.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-bc01n01.hswaw.net.cert
    Not After : Mar 28 15:53:00 2021 GMT
cluster/certs/kube-kubelet-bc01n02.hswaw.net.cert
    Not After : Mar 28 16:45:00 2021 GMT
cluster/certs/kube-kubelet-bc01n03.hswaw.net.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s22.hswaw.net.cert
    Not After : Oct 3 15:33:00 2021 GMT
cluster/certs/kube-kubelet-dcr01s24.hswaw.net.cert
    Not After : Oct 3 15:38:00 2021 GMT
cluster/certs/kube-proxy.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-scheduler.cert
    Not After : Mar 28 15:15:00 2021 GMT
cluster/certs/kube-serviceaccounts.cert
    Not After : Mar 28 15:15:00 2021 GMT
Change-Id: I94030ce78c10f7e9a0c0257d55145ef629195314
This prevents metallb routes being announced from all peers to our ToR,
thereby preventing issues with traffic hitting services with
externalTrafficPolicy: local.
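
For the record, the rough shape of such peer pinning in metallb's
config (a hypothetical sketch in jsonnet - the address, ASNs and label
are made up, and this may not be exactly how we did it):

    // Sketch: only nodes matching the selector peer with the ToR.
    {
      peers: [
        {
          'peer-address': '10.0.0.1',  // made-up dcsw01 address
          'peer-asn': 65001,           // made-up ASNs
          'my-asn': 65002,
          'node-selectors': [
            { 'match-labels': { 'hswaw.net/metallb-peer': 'dcsw01' } },  // made-up label
          ],
        },
      ],
    }
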
There still is the from-host loopback issue, but that will be fixed by
upgrading to kube 1.15.
Change-Id: Ifc9964b46840aee82d99f0b6550188550e46fe04
This fixes compatibility with prodaccess tools built with Go 1.15, which
introduced 'X.509 CommonName deprecation' [1].
[1] - https://golang.org/doc/go1.15#commonname
Change-Id: I228cde3e5651a3e36f527783f2ccb4a2f6b7a8e3
This will be, at some point, a script to run on Gerrit presubmit (ie.
right before merge).
For now, you can manually run it to ensure that Everything At Least
Kinda Works.
Change-Id: I28b305fa81a4ca4a8e94ce4daa06fe9ae0184fe8
Previously, we had the following setup:
.-----------.
| ..... |
.-----------.-|
| dcr01s24 | |
.-----------.-| |
| dcr01s22 | | |
.---|-----------| |-'
.--------. | |---------. | |
| dcsw01 | <----- | metallb | |-'
'--------' |---------' |
'-----------'
Ie., each metallb on each node directly talked to dcsw01 over BGP to
announce ExternalIPs to our L3 fabric.
Now, we rejigger the configuration to instead have Calico's BIRD
instances talk BGP to dcsw01, and have metallb talk locally to Calico.
.-------------------------.
| dcr01s24 |
|-------------------------|
.--------. |---------. .---------. |
| dcsw01 | <----- | Calico |<--| metallb | |
'--------' |---------' '---------' |
'-------------------------'
This makes Calico announce our pod/service networks into our L3 fabric!
Calico and metallb talk to each other over 127.0.0.1 (they both run
with Host Networking), but that requires one side to flip to passive
mode. We chose to do that with Calico, by overriding its BIRD config and
special-casing any 127.0.0.1 peer to enable passive mode.
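
The override amounts to a BIRD fragment along these lines (a sketch
carried as a jsonnet text block; the ASN and protocol name are
illustrative, not the actual confd template):

    {
      // Sketch: a 127.0.0.1 peer (ie. metallb) gets passive mode, so
      // only metallb initiates the TCP session - no connect race.
      birdMetallbPeer:: |||
        protocol bgp Mesh_metallb {
          local as 65001;               # made-up ASN
          neighbor 127.0.0.1 as 65001;  # metallb speaker on this host
          passive on;                   # wait for metallb to connect
        }
      |||,
    }
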
We also override Calico's Other Bird Template (bird_ipam.cfg) to fiddle
with the kernel programming filter (ie. to-kernel-routing-table filter),
where we disable programming unreachable routes. This is because routes
coming from metallb have their next-hop set to 127.0.0.1, which makes
bird mark them as unreachable. Unreachable routes in the kernel will
break local access to ExternalIPs, eg. registry access from containerd.
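
The bird_ipam.cfg tweak is, in spirit, the following (again a sketch,
not the literal template):

    {
      // Sketch: refuse to program unreachable routes (metallb routes
      // with a 127.0.0.1 next-hop) into the kernel routing table.
      birdKernelFilter:: |||
        filter calico_kernel_programming {
          if dest = RTD_UNREACHABLE then reject;
          accept;
        }
      |||,
    }
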
All routes pass through without route reflectors or a full mesh, as we
use eBGP over private ASNs in our fabric.
We also have to make Calico aware of metallb pools - otherwise, routes
announced by metallb end up being filtered by Calico.
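
Making Calico aware of a pool means mirroring it as a disabled IPPool,
roughly as below (made-up name and CIDR; with the etcd datastore this
goes in via calicoctl, not CRDs):

    // Sketch: a Calico IPPool covering a metallb range, with IPAM
    // disabled so Calico only learns the range, never allocates from it.
    {
      apiVersion: 'projectcalico.org/v3',
      kind: 'IPPool',
      metadata: { name: 'metallb-public' },
      spec: {
        cidr: '203.0.113.0/28',  // made-up metallb range
        disabled: true,
      },
    }
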
This is all mildly hacky. Here's hoping that Calico will be able to some
day gain metallb-like functionality, ie. IPAM for
externalIPs/LoadBalancers/...
There seems to be however one problem with this change (but I'm not
fixing it yet as it's not critical): metallb would previously only
announce IPs from nodes that were serving that service. Now, however,
the Calico internal mesh makes those appear from every node. This can
probably be fixed by disabling local meshing and enabling route
reflection on dcsw01 (to recreate the mesh routing through dcsw01). Or,
maybe by
some more hacking of the Calico BIRD config :/.
Change-Id: I3df1f6ae7fa1911dd53956ced3b073581ef0e836
We just had an outage seemingly caused by N-I-C sending tons of traffic
to gitea, which in turn caused N-I-C to balloon in memory/CPU usage.
I haven't debugged the cause of this traffic, but I have disabled the
gitea TCP forward to Stop The Bleeding.
This change reflects ad-hoc production changes.
Change-Id: I37e11609f408fa3e3fbfafafba44dc83149b90a9
- we update NixOS to 20.09pre
- we fix an ACME option that's now required
- we switch from systemd-timesyncd to chrony (as timesyncd took a long
time to sync clocks after restart, leading to MON_CLOCK_SKEW errors
from ceph)
This has been deployed in production.
Change-Id: Ibfcd41567235bae3e3d8abeeed61f4694ae614ad
This adds a mod proxy system, called, well, modproxy.
It sits between Factorio server instances and the Factorio mod portal,
allowing for arbitrary mod download without needing the servers to know
Factorio credentials.
Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212
ceph-waw2 currently has some production issues [1] which have started
to cause write failures in the registry. The registry is the only user
of ceph-waw2's affected pool, so we reduce the dumpster fire blast
radius by moving it over to ceph-waw3.
This has already been deployed and data has been migrated over (via
s3cmd sync), and the migration has been verified (by a push and pull,
and pull of an older image).
[1] - pgs stuck inactive in the object storage pool
Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347
In addition to k8s certificates, prodaccess now issues HSPKI
certificates, with DN=$username.sso.hswaw.net. These are installed into
XDG_CONFIG_HOME (or the OS equivalent).
//go/pki will now automatically attempt to load these certificates. This
means you can now run any pki-dependent tool without -hspki_disable, and
get automatic mTLS!
Change-Id: I5b28e193e7c968d621bab0d42aabd6f0510fed6d
instead of Python packages
As usual with Python sadness, the @pydeps wheels are built on the bazel
host, so stuffing them inside a container_image (or py_image) will cause
new and unexpected kinds of misery.
Change-Id: Id4e4d53741cf2da367f01aa15c21c133c5cf0dba
"Anyone can pull all images" rule did only match on anonymous users. Now
it should match all users, including authenticated ones.
Change-Id: I2205299093feca51f30526ba305eadbaa0a68ecb
We would like gitea to have its ssh server exposed on TCP port 22 on the
same address as its web interface. We would also still like to use all
the automation around ingresses already in place (like cert-manager
integration).
To solve this, we create an additional LoadBalancer service for
nginx-ingress-controller and set up a special tcp-services forwarding
rule to pass port 22 traffic to the gitea-prod/gitea service, like we
already do in case of gerrit.
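
Schematically, the forwarding rule is one entry in the controller's
tcp-services ConfigMap, in nginx-ingress's 'namespace/service:port'
format (jsonnet sketch; the ConfigMap name/namespace here are
assumptions):

    {
      apiVersion: 'v1',
      kind: 'ConfigMap',
      metadata: { name: 'tcp-services', namespace: 'nginx-system' },
      data: {
        '22': 'gitea-prod/gitea:22',  // exposed port -> gitea SSH
      },
    }
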
Change-Id: I5bfc901ebe858464f8e9c2f3b2216b254ccd6c4d
This turns the existing script into a proper sh_binary, and injects
dependencies (kubectl and jq) as deps into it.
This change also pulls in BUILDfiles for jq, and a dep (oniguruma) into
//third_party, and adds buildable external repositories for them.
The jq/oniguruma BUILDfiles are lifted from
https://github.com/attilaolah/bazel-tools/.
Change-Id: If2e548bd60a8fd34e4f3be767ae59c6b2f2286d9
It was getting large and unwieldy (to the point where kubecfg was slow).
In this change, we:
- move the Cluster function to cluster.libsonnet
- move the Cluster instantiation into k0.libsonnet
- shuffle some fields around to make sure things are well split between
k0-specific and general cluster configs.
- add 'view' files that build on 'cluster.libsonnet' to allow rendering
either the entire k0 state, or some subsets (for speed); see the sketch
below
- update the documentation, drive-by some small fixes and reindentation
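
A 'view' file is tiny - in spirit (field names illustrative):

    // Sketch: render only one subsystem of k0, so kubecfg doesn't
    // have to evaluate the entire cluster state.
    local k0 = import "k0.libsonnet";

    { registry: k0.registry }
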
Change-Id: I4b8d920b600df79100295267efe21b8c82699d5b
We're not using them for anything. Initially they were going to be used
for nixops, but nixops is not very good, so let's just drop them.
We still have a Nix dependency for clustercfg.py when provisioning
nodes, but rules_nix/nixpkgs in WORKSPACE were unrelated to that.
Change-Id: I28c249507d1be9c5dbbd1ee764deccd9ab038549
We handwavingly plan on implementing monitoring as a two-tier system:
- a 'global' component that is responsible for global aggregation,
long-term storage and alerting.
- multiple 'per-cluster' components, that collect metrics from
Kubernetes clusters and export them to the global component.
In addition, several lower tiers (collected by per-cluster components)
might also be implemented in the future - for instance, specific to some
subprojects.
Here we start sketching out some basic jsonnet structure (currently all
in a single file, with little parametrization) and a cluster-level
prometheus server that scrapes Kubernetes Node and cAdvisor metrics.
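
Concretely, the scrape configuration under review is of this shape
(a jsonnet sketch; relabelings elided):

    // Sketch: discover kubelets through the Kubernetes API and scrape
    // their cAdvisor metrics endpoint.
    {
      job_name: 'cadvisor',
      scheme: 'https',
      kubernetes_sd_configs: [{ role: 'node' }],
      bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token',
      tls_config: { insecure_skip_verify: true },
      metrics_path: '/metrics/cadvisor',
    }
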
This review is mostly to get this committed as early as possible, and to
make sure that the little existing Prometheus scrape configuration is
sane.
Change-Id: If37ac3b1243b8b6f464d65fee6d53080c36f992c
This kills two birds with one stone:
- update the secretstore tool to be slightly smarter about secrets, to
the point where we can now just point it at a secret directory and
ask it to 'sync' all secrets in there
- we run the new fancy sync command on all keys to update them, which
is a follow-up to gerrit/328.
Change-Id: I0eec4a3e8afcd9481b0b248154983aac25657c40
This was an attempt to make new calico nodes use a full FQDN. However,
this change seemingly also makes the calico control plane use the FQDN
for all existing nodes, thereby breaking CNI for new pods.
We revert this change, thereby keeping all calico node names as
hostnames. We could fix this by editing /var/lib/calico/nodename on
hosts to FQDNs, but it might not be worth the effort.
See https://github.com/projectcalico/calico/issues/1093 for more
context.
Change-Id: I52bfb00f604053d57d3009aebd6c50db7dc74f58
We still use etcd as the data store (and as such didn't set up k8s CRDs
for Calico), but that's okay for now.
Change-Id: If6d66f505c6b40f2646ffae7d33d0d641d34a963
This previously allowed all namespace admins (ie. personal-$user
namespace users) to create any sort of object they wanted within that
namespace.
This could've been exploited to allow creation of a RoleBinding that
would then allow binding a serviceaccount to the insecure
podsecuritypolicy, thereby allowing escalation to root on nodes.
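
The offending grant was, in spirit, a blanket wildcard rule like the
below (a sketch, not the exact rule from our config):

    // Sketch: everything-on-everything within the namespace - enough
    // to create RoleBindings to roles that use the insecure PSP.
    {
      apiGroups: ['*'],
      resources: ['*'],
      verbs: ['*'],
    }
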
As far as I've checked, this hasn't been exploited, and the access to
the k8s cluster has so far also been limited to trusted users.
This has been deployed to production.
Change-Id: Icf8747d765ccfa9fed843ec9e7b0b957ff27d96e