Commit Graph

135 Commits (31e41d5ff705d9e211b28855c9367111b855e378)

Author SHA1 Message Date
q3k d5918c8e72 cluster: change q3k's laptop key
Paranoia is dead, long live Mimeomia.

This has already been deployed to production.

Change-Id: Ibbc5015b5277380a3450f76e62d3fab6e71be1a0
2020-08-22 22:29:42 +02:00
q3k 0581bbf8a0 games/factorio: add modproxy
This adds a mod proxy system, called, well, modproxy.

It sits between Factorio server instances and the Factorio mod portal,
allowing for arbitrary mod download without needing the servers to know
Factorio credentials.

Change-Id: I7bc405a25b6f9559cae1f23295249f186761f212
2020-08-14 13:03:46 +02:00
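For a rough picture of how modproxy slots into the cluster, here is a minimal jsonnet sketch of a proxy Deployment and Service that Factorio server instances could use as their mod portal endpoint; the namespace, image and port are assumptions for illustration, not the actual //games/factorio configuration.

```jsonnet
// Sketch only: a modproxy Deployment plus a Service that Factorio server
// instances point their mod downloads at instead of the upstream mod
// portal. Only the proxy holds Factorio credentials. All names, the image
// and the port are illustrative assumptions.
{
  local ns = 'factorio',
  local port = 8080,

  deployment: {
    apiVersion: 'apps/v1',
    kind: 'Deployment',
    metadata: { name: 'modproxy', namespace: ns },
    spec: {
      replicas: 1,
      selector: { matchLabels: { app: 'modproxy' } },
      template: {
        metadata: { labels: { app: 'modproxy' } },
        spec: {
          containers: [{
            name: 'modproxy',
            image: 'registry.k0.hswaw.net/games/factorio/modproxy:latest',  // hypothetical image
            ports: [{ containerPort: port }],
          }],
        },
      },
    },
  },

  service: {
    apiVersion: 'v1',
    kind: 'Service',
    metadata: { name: 'modproxy', namespace: ns },
    spec: {
      selector: { app: 'modproxy' },
      ports: [{ port: port, targetPort: port }],
    },
  },
}
```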
q3k 3d29484ebb k0: move registry to ceph-waw3
ceph-waw2 currently has some production issues [1] which have started to
cause write failures in the registry. The registry is the only user of
ceph-waw2's affected pool, so we reduce the dumpster fire blast radius
by moving it over to ceph-waw3.

This has already been deployed and data has been migrated over (via
s3cmd sync), and the migration has been verified (by a push and pull,
and pull of an older image).

[1] - pgs stuck inactive in the object storage pool

Change-Id: I26789b52008bb7be953954ec3fd3dd727ac15347
2020-08-04 01:36:51 +02:00
q3k 4ded56ab8a prodvider: emit client/server cert
Change-Id: I024782a7dfa6e16ff5f562a62ddd8fe3bf299c51
2020-08-01 22:01:05 +02:00
q3k f3312ef77e *: developer machine HSPKI credentials
In addition to k8s certificates, prodaccess now issues HSPKI
certificates, with DN=$username.sso.hswaw.net. These are installed into
XDG_CONFIG_HOME (or the OS equivalent).

//go/pki will now automatically attempt to load these certificates. This
means you can now run any pki-dependent tool without -hspki_disable, and
with automatic mTLS!

Change-Id: I5b28e193e7c968d621bab0d42aabd6f0510fed6d
2020-08-01 17:15:52 +02:00
q3k 509ab6e29a k0/cockroach: add public DNS entry for cockroach
Change-Id: I934bf348e2165148b515b709e853ab67f039a402
2020-07-30 22:56:30 +02:00
implr cae27ecd99 Replace rules_pip with rules_python; use bazel built upstream grpc
instead of Python packages

As usual with Python sadness, the @pydeps wheels are built on the bazel
host, so stuffing them inside a container_image (or py_image) will cause
new and unexpected kinds of misery.

Change-Id: Id4e4d53741cf2da367f01aa15c21c133c5cf0dba
2020-07-08 18:55:34 +02:00
informatic 97a6ca8a8b Merge "cluster/kube/lib/nginx: add gitea-prod ingress service" 2020-07-02 17:15:53 +00:00
informatic 0697e01144 cluster/kube/lib/registry: allow auth'd users to pull all images
"Anyone can pull all images" rule did only match on anonymous users. Now
it should match all users, including authenticated ones.

Change-Id: I2205299093feca51f30526ba305eadbaa0a68ecb
2020-07-02 18:45:42 +02:00
informatic f00edf6ee8 cluster/kube/lib/nginx: add gitea-prod ingress service
We would like gitea to have its ssh server exposed on TCP port 22 on the
same address as its web interface. We would also still like to use all
the automation around ingresses already in place (like cert-manager
integration).

To solve this, we create an additional LoadBalancer service for
nginx-ingress-controller and set up special tcp-services forwarding rule
to pass port 22 traffic to gitea-prod/gitea service, like we already do
in case of gerrit.

Change-Id: I5bfc901ebe858464f8e9c2f3b2216b254ccd6c4d
2020-07-02 18:30:38 +02:00
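For reference, ingress-nginx handles such raw TCP forwarding via a tcp-services ConfigMap whose keys are exposed ports and whose values have the form namespace/service:port. A minimal jsonnet sketch follows; the controller namespace is an assumption, while the gitea-prod/gitea target comes from the commit message.

```jsonnet
// Sketch of the tcp-services ConfigMap consumed by
// nginx-ingress-controller: forward raw TCP port 22 to gitea's SSH
// server, alongside whatever gerrit rule already exists.
{
  apiVersion: 'v1',
  kind: 'ConfigMap',
  metadata: {
    name: 'tcp-services',
    namespace: 'nginx-system',  // assumed controller namespace
  },
  data: {
    // exposed port -> namespace/service:port
    '22': 'gitea-prod/gitea:22',
  },
}
```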
q3k b1aadd88ff k0: add q3k's personal s3 user
Change-Id: I5681774e1dca2cf4a865d9e1a24602ed4334f006
2020-06-24 17:19:36 +00:00
q3k 0037edaa5b cluster/tools/rook-s3cmd-config: build using bazel
This turns the existing script into a proper sh_binary, and injects
dependencies (kubectl and jq) as deps into it.

This change also pulls in BUILDfiles for jq, and a dep (oniguruma) into
//third_party, and adds buildable external repositories for them.

The jq/oniguruma BUILDfiles are lifted from
https://github.com/attilaolah/bazel-tools/.

Change-Id: If2e548bd60a8fd34e4f3be767ae59c6b2f2286d9
2020-06-13 22:46:41 +02:00
implr d9df5879e3 add radosgw bucket for spark
Change-Id: Id8ea8901ce038ccbf11afabe0e6272c358b32cf2
2020-06-13 21:31:56 +02:00
q3k 9b2ce179a8 Merge "cluster/kube: split up cluster.jsonnet" 2020-06-13 17:52:27 +00:00
q3k dbfa988c73 cluster/kube: split up cluster.jsonnet
It was getting large and unwieldy (to the point where kubecfg was slow).
In this change, we:

 - move the Cluster function to cluster.libsonnet
 - move the Cluster instantiation into k0.libsonnet
 - shuffle some fields around to make sure things are well split between
   k0-specific and general cluster configs.
 - add 'view' files that build on 'cluster.libsonnet' to allow rendering
   either the entire k0 state, or some subsets (for speed)
 - update the documentation, with some drive-by fixes and reindentation

Change-Id: I4b8d920b600df79100295267efe21b8c82699d5b
2020-06-13 19:51:58 +02:00
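A self-contained jsonnet sketch of the described split; in the real tree these would be separate files, and all field names here are illustrative assumptions.

```jsonnet
// Sketch of the split. In the repo these would be three separate files
// (cluster.libsonnet, k0.libsonnet and a per-subsystem view); they are
// inlined here only so the example evaluates on its own.
local cluster = {
  // cluster.libsonnet: the general, parametrized Cluster function.
  Cluster(name):: {
    name: name,
    ceph: { /* ceph resources for this cluster */ },
    registry: { /* docker registry resources */ },
  },
};

local k0 = {
  // k0.libsonnet: the concrete k0 instantiation with k0-specific config.
  k0: cluster.Cluster('k0'),
};

// A 'view' file: render only one subtree of the k0 state, which keeps
// kubecfg evaluation fast compared to the former monolithic
// cluster.jsonnet.
k0.k0.ceph
```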
q3k 66a26a8f02 WORKSPACE: remove nixpkgs/rules_nix
We're not using them for anything. Initially they were going to be used
for nixops, but nixops is not very good, so let's just drop them.

We still have a Nix dependency for clustercfg.py when provisioning
nodes, but rules_nix/nixpkgs in WORKSPACE were unrelated to that.

Change-Id: I28c249507d1be9c5dbbd1ee764deccd9ab038549
2020-06-07 02:22:14 +02:00
q3k ce81c39081 ops/metrics: basic cluster setup with prometheus
We handwavingly plan on implementing monitoring as a two-tier system:

 - a 'global' component that is responsible for global aggregation,
   long-term storage and alerting.
 - multiple 'per-cluster' components, that collect metrics from
   Kubernetes clusters and export them to the global component.

In addition, several lower tiers (collected by per-cluster components)
might also be implemented in the future - for instance, specific to some
subprojects.

Here we start sketching out some basic jsonnet structure (currently all
in a single file, with little parametrization) and a cluster-level
prometheus server that scrapes Kubernetes Node and cAdvisor metrics.

This review is mostly to get this committed as early as possible, and to
make sure that the little existing Prometheus scrape configuration is
sane.

Change-Id: If37ac3b1243b8b6f464d65fee6d53080c36f992c
2020-06-06 15:56:10 +02:00
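The Node and cAdvisor scrape jobs mentioned above boil down to two kubernetes_sd_configs entries. Below is a minimal sketch of that fragment of the Prometheus configuration, written as jsonnet; the job names and in-cluster TLS paths are the conventional defaults, not necessarily what the repo uses.

```jsonnet
// Sketch of the two scrape jobs: kubelet (Node) metrics and cAdvisor
// metrics, both discovered via the Kubernetes API and scraped through
// the apiserver proxy.
{
  scrape_configs: [
    {
      job_name: 'kubernetes-nodes',
      scheme: 'https',
      kubernetes_sd_configs: [{ role: 'node' }],
      tls_config: { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' },
      bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token',
      relabel_configs: [
        // Scrape nodes via the apiserver proxy instead of reaching
        // kubelets directly.
        { target_label: '__address__', replacement: 'kubernetes.default.svc:443' },
        {
          source_labels: ['__meta_kubernetes_node_name'],
          regex: '(.+)',
          target_label: '__metrics_path__',
          replacement: '/api/v1/nodes/$1/proxy/metrics',
        },
      ],
    },
    {
      job_name: 'kubernetes-cadvisor',
      scheme: 'https',
      kubernetes_sd_configs: [{ role: 'node' }],
      tls_config: { ca_file: '/var/run/secrets/kubernetes.io/serviceaccount/ca.crt' },
      bearer_token_file: '/var/run/secrets/kubernetes.io/serviceaccount/token',
      relabel_configs: [
        { target_label: '__address__', replacement: 'kubernetes.default.svc:443' },
        {
          source_labels: ['__meta_kubernetes_node_name'],
          regex: '(.+)',
          target_label: '__metrics_path__',
          replacement: '/api/v1/nodes/$1/proxy/metrics/cadvisor',
        },
      ],
    },
  ],
}
```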
q3k 7371b7288b tools/secretstore: add sync command, re-encrypt
This kills two birds with one stone:

 - update the secretstore tool to be slightly smarter about secrets, to
   the point where we can now just point it at a secret directory and
   ask it to 'sync' all secrets in there
 - run the new fancy sync command on all keys to update them, which
   is a follow-up to gerrit/328.

Change-Id: I0eec4a3e8afcd9481b0b248154983aac25657c40
2020-06-04 19:25:07 +00:00
patryk c410432d94 personal/patryk/arma3: create an S3 bucket account for Arma3 mods
Change-Id: Idd31b5f46fcaebfcd72334dc82fbc8df805203b1
2020-06-04 18:51:51 +02:00
informatic cb96eb6df6 Merge "crdb.k0: add sso client" 2020-05-31 12:26:04 +00:00
q3k e55493f635 calico: fix access to resources from controller
This fixes even more networking issues.

Change-Id: I754656a01e3de8a34055280908b343a1a25a4707
2020-05-30 17:57:05 +02:00
q3k ba375e62b2 calico: fix node name selection
This was an attempt to make new calico nodes use a full FQDN. However,
this change seemingly also makes the calico control plane use the FQDN
for all existing nodes, breaking CNI for new pods.

We revert this change, thereby keeping all calico node names as
hostnames. We could fix this by editing /var/lib/calico/nodename on
hosts to FQDNs, but it might not be worth the effort.

See https://github.com/projectcalico/calico/issues/1093 for more
context.

Change-Id: I52bfb00f604053d57d3009aebd6c50db7dc74f58
2020-05-30 16:18:13 +02:00
informatic 42da0e9aec crdb.k0: add sso client
Change-Id: I7490a3594694d61a19910e436983937667ed34bd
2020-05-30 14:34:33 +02:00
q3k d81bf72d7f calico: upgrade to 3.14, fix calicoctl
We still use etcd as the data store (and as such didn't set up k8s CRDs
for Calico), but that's okay for now.

Change-Id: If6d66f505c6b40f2646ffae7d33d0d641d34a963
2020-05-28 16:47:16 +02:00
q3k 1223cde4d4 cluster: fix nuke's personal storage
Change-Id: I422a6d9f7a483e7c44cc8dfd8c0d8a98d9e17e46
2020-05-16 17:38:23 +02:00
q3k 741c08f66c cluster: add nuke's personal storage
He needs some personal backup space, and we have enough best effort
spare capacity for that.

Change-Id: I75ed6f62e79d33907c0974ec5f2839389ce62543
2020-05-14 18:13:53 +00:00
q3k a168c50132 SECURITY: cluster: limit api objects modifiable by namespace admins
This previously allowed all namespace admins (i.e. personal-$user namespace
users) to create any sort of object they wanted within that namespace.

This could've been exploited to allow creation of a RoleBinding that
would then allow binding a serviceaccount to the insecure
podsecuritypolicy, thereby allowing escalation to root on nodes.

As far as I've checked, this hasn't been exploited, and the access to
the k8s cluster has so far also been limited to trusted users.

This has been deployed to production.

Change-Id: Icf8747d765ccfa9fed843ec9e7b0b957ff27d96e
2020-05-11 20:49:31 +02:00
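A hedged jsonnet sketch of the kind of rule this change implies: an explicit allow-list of namespaced resources instead of a wildcard, so namespace admins can no longer create RoleBindings. The role name and concrete resource list are illustrative assumptions, not the production rule set.

```jsonnet
// Sketch: replace a wildcard namespace-admin role with an explicit
// whitelist that notably excludes rbac.authorization.k8s.io resources,
// closing the RoleBinding -> insecure PodSecurityPolicy escalation path.
{
  apiVersion: 'rbac.authorization.k8s.io/v1',
  kind: 'ClusterRole',
  metadata: { name: 'namespace-admin' },  // assumed name
  rules: [
    {
      apiGroups: ['', 'apps', 'batch', 'extensions', 'networking.k8s.io'],
      resources: [
        'pods', 'services', 'configmaps', 'secrets',
        'deployments', 'statefulsets', 'jobs', 'cronjobs', 'ingresses',
      ],
      verbs: ['*'],
    },
    // Deliberately no rule for roles/rolebindings: those would allow
    // binding a serviceaccount to the insecure podsecuritypolicy.
  ],
}
```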
q3k d436de2010 cluster/rook: bump to 1.1.9
This bumps Rook/Ceph. The new resources (mostly RBAC) come from
following https://rook.io/docs/rook/v1.1/ceph-upgrade.html .

It's already deployed on production. The new CSI driver has not been
tested, but the old flexvolume-based provisioners still work. We'll
migrate when Rook offers a nice solution for this.

We've hit a kubecfg bug that does not allow controlling the CephCluster
CRD directly anymore (I had to apply it via kubecfg show / kubectl apply
-f instead). This might be due to our bazel/prod k8s version mismatch,
or it might be related to https://github.com/bitnami/kubecfg/issues/259.

Change-Id: Icd69974b294b823e60b8619a656d4834bd6520fd
2020-05-02 23:30:52 +02:00
Bartosz Stebel 98ef1518e0 add vpn insecure namespace
Change-Id: I8a774ae625342af3521ad0ab11a8f6d4e4ef6c97
2020-04-24 13:28:38 +02:00
q3k 8adbd49051 *: more hackdoc updates
Change-Id: Ib9830c66fe36c423d38f447905c470b67cde5399
2020-04-10 22:10:18 +02:00
q3k 4f7cc0064f Revert "*: update docs for hackdoc"
This reverts commit cc8c69c897.

Reason for revert: <INSERT REASONING HERE>

Change-Id: I1315e930e2ef69db3188eda05e4aa0b12db24274
2020-04-10 20:09:35 +00:00
q3k cc8c69c897 *: update docs for hackdoc
Change-Id: I256ec4499da2289f8f7ea3766ce40f2b0ffb0dc1
2020-04-10 21:20:53 +02:00
q3k c881cf3c22 devtools/hackdoc: init
This is hackdoc, a documentation rendering tool for monorepos.

This is the first code iteration, that can only serve from a local git
checkout.

The code is incomplete, and is WIP.

Change-Id: I68ef7a991191c1bb1b0fdd2a8d8353aba642e28f
2020-04-08 20:03:12 +02:00
q3k 0dcc702c64 cluster: bump nearly-expired certs
This makes clustercfg ensure certificates are valid for at least 30
days, and renew them otherwise.

We use this to bump all the certs that were about to expire in a week.
They are now valid until 2021.

There's still some certs that expire in 2020. We need to figure out a
better story for this, especially as the next expiry is in 2021 - today's
prod rollout was somewhat disruptive (basically this was done by a full
cluster upgrade-like rollout flow, via clustercfg).

We also drive-by bump the number of mons in ceph-waw3 to 3, as it should
be (this gets rid of a nasty SPOF that would've bitten us during this
upgrade otherwise).

Change-Id: Iee050b1b9cba4222bc0f3c7bce9e4cf9b25c8bdc
2020-03-28 18:01:40 +01:00
q3k 90e8e68bab crdb.k0: add bugless-dev (for q3k)
Change-Id: I3988e1c37f0a0c54ef1ba248f01e026d6e8c72b6
2020-03-25 10:55:05 +01:00
q3k e186c87c1b cluster: bump rook to 1.0.6
In preparation for updating to 1.1.0, which will be much more involved.

Also fix a typo in registry.libsonnet, whoops.

Change-Id: I7668bf53c7580f99fdf56fe6227f04a468f8de50
2020-02-21 12:57:02 +01:00
q3k 114edc2398 kube/mirko: add kube.CephObjectStoreUser
Change-Id: I2a67076eeaf41ada41f5ae3ee588025e4c16b9e1
2020-02-18 22:55:13 +01:00
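A sketch of what such a helper might look like: a thin jsonnet wrapper emitting Rook's CephObjectStoreUser CRD. The function signature and hidden-field defaults are assumptions; only the CRD shape comes from Rook.

```jsonnet
// Sketch of a CephObjectStoreUser helper: given a name, target object
// store and display name, emit the corresponding Rook CRD.
{
  CephObjectStoreUser(name):: {
    local user = self,
    store:: error 'store must be set',  // name of the CephObjectStore
    displayName:: name,

    apiVersion: 'ceph.rook.io/v1',
    kind: 'CephObjectStoreUser',
    metadata: { name: name },
    spec: {
      store: user.store,
      displayName: user.displayName,
    },
  },
}
```

It would be consumed roughly as CephObjectStoreUser('registry') { store:: 'some-objectstore' }, overriding the hidden fields as needed.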
q3k 0d83300b18 cluster: set ceph-waw3 mon replicas to 1
This reflects current production. This needs to get bumped up to 3 at some point as otherwise we lose HA for this cluster.

Change-Id: Ie5937e6a216b635ecbc4c82ecd182a410167c3f8
2020-02-15 11:48:39 +00:00
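In Rook's CephCluster spec this is a single field; a minimal jsonnet fragment, with the surrounding cluster definition omitted.

```jsonnet
// Sketch: the mon count knob in a Rook CephCluster spec. 1 reflects
// current production for ceph-waw3; it should eventually be 3 to
// restore monitor HA.
{
  spec+: {
    mon: {
      count: 1,
      allowMultiplePerNode: false,
    },
  },
}
```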
q3k 58d08595f1 {cluster,}/README: update
Change-Id: Ie211fd34316c407f29506b67187632fd22a4f75b
2020-02-15 01:00:42 +01:00
q3k d7364520e9 cluster: bump kubelets to 1.14.3
Change-Id: I02ed978a49629cdfc3f3587ad640e8cc5a5fad23
2020-02-02 23:43:28 +01:00
q3k e2095b2ce9 cluster: remove unused module-cluster.nix
Change-Id: I819d803fc7454cfd63a11a109ec73c9578f598b8
2020-02-02 23:43:00 +01:00
q3k c78cc13528 cluster/nix: locally build nixos derivations
We change the existing behaviour (copy files & run nixos-rebuild switch)
to something closer to nixops-style. This now means that admin machines
used for provisioning need Nix installed locally, but that's probably an
okay choice to make.

The upside of this approach is that it's easier to debug and test
derivations, as all data is local to the repo and the workstation, and
deploying just means copying a configuration closure and switching the
system to it. At some point we should even be able to run the entire
cluster within a set of test VMs.

We also bump the kubernetes control plane to 1.14. Kubelets are still at
1.13 and their upgrade is coming up today too.

Change-Id: Ia9832c47f258ee223d93893d27946d1161cc4bbd
2020-02-02 22:31:53 +01:00
q3k aa76e55eea cert-manager: fix DNS for http01 k0 splitdns
Change-Id: I73847daec9796cb891cf2fe58c2633c5fa768861
2019-12-29 02:49:30 +01:00
q3k 0c337acf89 benji: fix in waw2, run in waw3
This needed an upstream change to allow only some pools to be backed up,
otherwise benji would crash when stumbling upon the first PVC from a
pool that wasn't backed by the ceph cluster it was acting upon.

Change-Id: I52bf163c16352cb59fdd3dbdd576145ce1dbac03
2019-12-21 23:45:07 +01:00
q3k ba8e79e8f4 kube-apiserver: fix cert mismatch, again
This time from a bare hscloud checkout to make sure _nothing_ is fucked
up.

This causes no change remotely, just makes the repo reflect reality.

Change-Id: Ie8db01300771268e0371c3cdaf1930c8d7cbfb1a
2019-12-17 02:13:55 +01:00
q3k 050af01b83 cluster: add q3k's new SSH key
Change-Id: I872a75cc89a62c9487433fa5e8e5767953e309c9
2019-12-17 01:58:58 +01:00
q3k e5a956a1c8 *: bump to q3k's kubecfg, kubernetes 1.16
Change-Id: I302876d5a45cbfb63d87ad9f6ea9aaeff7bec17d
2019-11-17 22:38:40 +01:00
q3k fd323a0f55 cluster: sync to prod
Change-Id: If311f1ce44653bb54e0a10ad2fdd65685722a64d
2019-11-17 19:49:04 +01:00
q3k 96c428f7d7 nixops: fix
Change-Id: I15ebde319fcae3f9771da6a549e52783e0ec4409
2019-11-17 19:00:46 +01:00
q3k c33ebcc79f cluster: add ceph-waw3, move metallb to bgp
Change-Id: Iebf369f9a02e44be163ef4afc2e0f23c4b009898
2019-11-01 18:43:45 +01:00