
530 Commits (35d437883b6a02fd73c0e5258d40bcf065ce3065)

Author SHA1 Message Date
q3k e31d64f265 kube: move cert-manager resources to kube.local.libsonnet
This way kubernetes consumers don't have to import anything from
cluster/, hopefully.

We also create a small abstraction for local additions for
kube.libsonnet without having to modify upstream.

Change-Id: I209095781f91c8867250a647fe944370cddd67d0
2019-10-02 21:03:13 +02:00
q3k 54490d385e cluster/coredns: add cluster fqdn top level domain
This means that in addition to services being discoverable the 'classic'
way:

    <svcname>.<namespace>.svc.cluster.local

They are now discoverable as:

    <svcname>.<namespace>.svc.<fqdn>

For instance, on k0 you can now internally resolve:

    $ kubectl run --rm -it foo --image=nixery.dev/shell/dnsutils bash
    bash-4.4# dig +short coffee-svc.default.svc.k0.hswaw.net
    10.10.12.192
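
The two naming schemes above can be sketched as a small helper. This is an illustration only: `serviceNames` and its parameters are hypothetical names, not part of the CoreDNS configuration in this change.

```go
package main

import "fmt"

// serviceNames returns the two DNS names under which a Service is
// expected to resolve: the classic cluster.local form and the form
// rooted at the cluster FQDN. The function name is hypothetical.
func serviceNames(svc, namespace, clusterFQDN string) (classic, fqdn string) {
	classic = fmt.Sprintf("%s.%s.svc.cluster.local", svc, namespace)
	fqdn = fmt.Sprintf("%s.%s.svc.%s", svc, namespace, clusterFQDN)
	return classic, fqdn
}

func main() {
	classic, fqdn := serviceNames("coffee-svc", "default", "k0.hswaw.net")
	fmt.Println(classic)
	fmt.Println(fqdn)
}
```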

Change-Id: Ie6875b54ed6358f30f888ca0cd96e011520ace20
2019-10-02 20:49:13 +02:00
q3k 325e9476bf hswaw/smsgw: implement
The SMS gateway service allows consumers to subscribe to SMS messages
received by a Twilio phone number.

This is useful for receiving SMS auth messages.

Change-Id: Ib02a4306ad0d856dd10c7ca9241d9163809e7084
2019-09-27 12:54:16 +02:00
q3k 95868eeddc benji: back up daily instead of hourly
Every benji backup seems to cycle blocks (eg. delete some and recreate
them).

Since wasabi has a minimum billing retention policy of 90 days, this
means that every object uploaded and then deleted an hour later still
costs us.

Currently we seem to be storing around 200G of data in wasabi for Benji
but already have 600G of deleted objects. This is suboptimal.
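
A rough sketch of the arithmetic, assuming all 600G of deleted objects are still inside the 90-day minimum retention window and are billed like live data. `billedGB` is a hypothetical helper, not anything from Benji or Wasabi.

```go
package main

import "fmt"

// billedGB returns the volume billed under a minimum-retention policy:
// live data plus deleted data still inside the retention window.
// Assumption: deleted-but-retained data is billed at the same rate.
func billedGB(liveGB, deletedInWindowGB float64) float64 {
	return liveGB + deletedInWindowGB
}

func main() {
	const live, deleted = 200.0, 600.0
	total := billedGB(live, deleted)
	fmt.Printf("billed: %.0fG, of which deleted: %.0f%%\n", total, deleted/total*100)
}
```

Under that assumption, three quarters of the billed volume is data that no longer exists.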

This change has already been deployed on production.

Change-Id: I67302d23a1c45974fb5d51ec9a8cff28260830dc
2019-09-26 21:49:24 +00:00
q3k 47b7e850e7 dc/arista-proxy: fix by using github.com/q3k/cursedjson
Change-Id: Id9657a30af8c16afe4ddde7e2ac04f4508a2fd18
2019-09-26 18:32:39 +02:00
q3k 6781f62ec4 Merge "app/radio: add support for following relays" 2019-09-25 12:06:17 +00:00
q3k 57515a2525 Merge "rules_pip: update to new version" 2019-09-25 12:05:58 +00:00
q3k 5f9b1ecd67 rules_pip: update to new version
rules_pip has a new version [1] of their rule system, incompatible with the
version we used, that fixes a bunch of issues, notably:
 - explicit tagging of repositories for PY2/PY3/PY23 support
 - removal of dependency on host pip (in exchange for having to vendor
   wheels)
 - higher quality tooling for locking

We update to the newer version of rules_pip, rename the external
repository to pydeps and move requirements.txt, the lockfile and the
newly vendored wheels to third_party/, where they belong.

[1] - https://github.com/apt-itude/rules_pip/issues/16

Change-Id: I1065ee2fc410e52fca2be89fcbdd4cc5a4755d55
2019-09-25 14:05:07 +02:00
q3k 2d81427410 app/radio: add support for following relays
Change-Id: Ib079d657239b1bf5294ad8457370d56a0093dd6d
2019-09-25 13:59:08 +02:00
q3k 5f3a5e0310 cluster/kube: emergency fixes after eviction
Some pods got evicted. Some of them broke.

  - postgres in matrix and nginx in internet because of the new policies
    (chown issues)
  - cas proxy in matrix because apparently the image was not reuploaded
    to the registry after ceph-waw1 died, and another node didn't have it
  - registry because it had a weak image pin and downgraded to some
    broken version on another node

Change-Id: I836036872629843c8ede1b7f67982112c90d71f0
2019-09-25 02:58:15 +02:00
q3k db2a2a029f Merge "Get in the Cluster, Benji!" 2019-09-18 20:40:12 +00:00
q3k a01c487a6e cluster: allow insecure pods in rook-ceph-system
This is required for the agent to start a socket on each host for
kubelet-to-rook access.

Change-Id: I78529df81185aeaacdcb494138f72f0224a029c6
2019-09-05 16:01:19 +00:00
q3k 350aa88421 Merge "cluster: add nextcloud user for object store" 2019-09-02 14:33:24 +00:00
q3k 8c009bb302 Merge "cluster: disable unauthenticated read only port on kubelets" 2019-09-02 14:33:13 +00:00
q3k 13bb1bf4e3 Get in the Cluster, Benji!
Here we introduce benji [1], a backup system based on backy2. It lets us
back up Ceph RBD objects from Rook into Wasabi, our offsite S3-compatible
storage provider.

Benji runs as a k8s CronJob, every hour at 42 minutes. It does the
following:
 - runs benji-pvc-backup, which iterates over all PVCs in k8s, and backs
   up their respective PVs to Wasabi
 - runs benji enforce, marking backups outside our backup policy [2] as
   to be deleted
 - runs benji cleanup, to remove unneeded backups
 - runs a custom script to back up benji's sqlite3 database into wasabi
   (unencrypted, but we're fine with that - as the metadata only contains
   image/pool names, thus Ceph PV and pool names)

[1] - https://benji-backup.me/index.html
[2] - latest3,hours48,days7,months12, which means the latest 3 backups,
      then one backup for the next 48 hours, then one backup for the next
      7 days, then one backup for the next 12 months, for a total of 65
      backups (deduplicated, of course)

We also drive-by update some docs (make them more separated into
user/admin docs).

Change-Id: Ibe0942fd38bc232399c0e1eaddade3f4c98bc6b4
2019-09-02 16:33:02 +02:00
q3k 9496d9910a cluster: add nextcloud user for object store
Change-Id: Ib08be16f71ff5e1b72ca6ad436de4b12427dd407
2019-09-02 16:33:02 +02:00
q3k 42553cd044 cluster: disable unauthenticated read only port on kubelets
This port was leaking kubelet state, including information on running
pods. No secrets were leaked (if they were not text-pasted into
env/args), but this still shouldn't be available.

As far as I can tell, nothing depends on this port, other than some
enterprise load balancers that require HTTP for node 'health' checks.

Change-Id: I9549b73e0168fe3ea4dce43cbe8fdc2ca4575961
2019-09-02 16:33:02 +02:00
q3k c349ccf2fd Merge "prodvider: clean up LDAP connections" 2019-08-31 14:57:44 +00:00
q3k 896926c921 prodvider: clean up LDAP connections
Change-Id: Ic95e6d1b845832fa0fb2da51b418bcdcb8fd05c4
2019-08-31 15:00:51 +02:00
q3k 1503983c27 Merge "rook/ceph: bump" 2019-08-30 23:21:13 +00:00
q3k ed9cf98316 Merge "prod{access,vider}: implement" 2019-08-30 23:21:09 +00:00
informatic eabbe8a11e app/matrix: update software components, refactor config handling
Dynamic config generation based on environment variables in Synapse is
no longer supported. To pass secrets to the container we use a patch that
implements configuration overrides via environment variables directly.
(to be upstreamed...)

Due to the Synapse update, appservice configuration ConfigMaps don't need to
be copied into Synapse /data volume anymore.

Change-Id: I70e6480983bfb997362739c6ce0ec3c313320836
2019-08-30 23:21:53 +02:00
informatic b20b366092 app/matrix: change storageclass to waw-hdd-paranoid-2
Change-Id: I757942409f4ef4da69d4cf1925d26dc758c65311
2019-08-30 23:21:53 +02:00
q3k 71a21c7693 rook/ceph: bump
Change-Id: I046df292cad11650adb829cc8a73100cc1d1ecc8
2019-08-30 23:08:26 +02:00
q3k b13b7ffcdb prod{access,vider}: implement
Prodaccess/Prodvider allow issuing short-lived certificates for all SSO
users to access the kubernetes cluster.

Currently, all users get a personal-$username namespace in which they
have administrative rights. Otherwise, they get no access.

In addition, we define a static CRB to allow some admins access to
everything. In the future, this will be more granular.

We also update relevant documentation.

Change-Id: Ia18594eea8a9e5efbb3e9a25a04a28bbd6a42153
2019-08-30 23:08:18 +02:00
q3k d16454badc cert-manager: bump to v0.9.1
We just got this email:

We've been working with Jetstack, the authors of cert-manager, on a
series of fixes to the client. Cert-manager sometimes falls into a
traffic pattern where it sends really excessive traffic to Let's
Encrypt's servers, continuously. To mitigate this, we plan to start
blocking all traffic from cert-manager versions less than 0.8.0 (the
current semver minor release), as of November 1, 2019. Please upgrade
all of your cert-manager instances before then.

We're sending this email because this is the contact address of your
cert-manager instance at:

 185.236.240.37 .

Version 0.8.0 is much better but we still observe excessive traffic in
some cases. We're working with Jetstack to improve these cases. As new
versions of cert-manager are released, we will add the non-current
versions to our block list after 3 months. We strongly encourage
cert-manager users to stay up-to-date with new versions.

Also, there is an opportunity to help both Jetstack and Let's Encrypt.
Once you've upgraded, please check the logs for your cert-manager
instances from time to time. Are they making excessive requests to Let's
Encrypt (more than, say, 10 per day over multiple days)? If so, please
share details at https://github.com/jetstack/cert-manager/issues/1948 .

Thanks,
Let's Encrypt Team

Change-Id: Ic7152150ac1c96941423878c6d4b6209e07429cf
2019-08-29 17:21:49 +02:00
Serge Bazanski ef93747aec cccampix: updates from camp
Change-Id: I77e6d9fb6e91b0b7e2d1f89e80164ee8116b5d50
2019-08-29 14:53:18 +02:00
Serge Bazanski a2960f526c birdie: use passwords
Change-Id: I2204ba0b09648799dfd5bd01bd15d2580b3cb3c8
2019-08-22 20:13:47 +02:00
Serge Bazanski ec71cb50bd Draw the actual rest of the fucking owl.
Change-Id: Ia04fb49ebbe3a5afccc57e62f6335e35b45192fe
2019-08-22 18:14:35 +02:00
Serge Bazanski 915b265b8a bgpwtf/cccampix: deploy pgpencryptor
Change-Id: I3714c81b663781d9b449695760d83c1b8841d0e0
2019-08-22 18:14:02 +02:00
Serge Bazanski 187c4bb60a pgpencryptor: potentially fix crash on encryptor close
We seem to be hitting a bug where the encryptor doesn't initialize
because of a lacking gpg binary, and then crashes on .Close().

This should fix the issue, but is untested.

    goroutine 70 [running]:
    code.hackerspace.pl/hscloud/bgpwtf/cccampix/pgpencryptor/gpg.(*CLIEncryptor).Close(0x0)
            bgpwtf/cccampix/pgpencryptor/gpg/gpg.go:144 +0x22
    main.(*service).Encrypt(0xc000345e00, 0x16d13a0, 0xc00047f260, 0x1688400, 0xc00003d4a0)
            bgpwtf/cccampix/pgpencryptor/main.go:132 +0x6f9
    code.hackerspace.pl/hscloud/bgpwtf/cccampix/proto._PGPEncryptor_Encrypt_Handler(0x133bf00, 0xc000345e00, 0x16c6300, 0xc0000d6000, 0x2247b78, 0xc0001f8000)
            bazel-out/k8-fastbuild/bin/bgpwtf/cccampix/proto/linux_amd64_stripped/ix_go_proto%/code.hackerspace.pl/hscloud/bgpwtf/cccampix/proto/ix.pb.go:1816 +0xad
    google.golang.org/grpc.(*Server).processStreamingRPC(0xc000160c00, 0x16d6ce0, 0xc000161500, 0xc0001f8000, 0xc0004244e0, 0x21b00e0, 0xc0000c6ff0, 0x0, 0x0)
            external/org_golang_google_grpc/server.go:1175 +0xacd
    google.golang.org/grpc.(*Server).handleStream(0xc000160c00, 0x16d6ce0, 0xc000161500, 0xc0001f8000, 0xc0000c6ff0)
            external/org_golang_google_grpc/server.go:1254 +0xcbe
    google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000404770, 0xc000160c00, 0x16d6ce0, 0xc000161500, 0xc0001f8000)
            external/org_golang_google_grpc/server.go:690 +0x9f
    created by google.golang.org/grpc.(*Server).serveStreams.func1
            external/org_golang_google_grpc/server.go:688 +0xa1
    created by google.golang.org/grpc.(*Server).serveStreams.func1
            external/org_golang_google_grpc/server.go:688 +0xa1
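
The `Close(0x0)` frame in the trace indicates a nil receiver. One common way to make that safe is a nil guard in the method, sketched below. The `encryptor` type and `newEncryptor` constructor are hypothetical stand-ins, and this is not necessarily the fix applied in this change.

```go
package main

import (
	"errors"
	"fmt"
)

// encryptor is a hypothetical stand-in for the real CLIEncryptor type.
type encryptor struct {
	closed bool
}

// newEncryptor mimics an initializer that fails when the gpg binary is
// missing, returning a nil pointer together with an error.
func newEncryptor(haveBinary bool) (*encryptor, error) {
	if !haveBinary {
		return nil, errors.New("gpg binary not found")
	}
	return &encryptor{}, nil
}

// Close tolerates a nil receiver, so cleanup after a failed
// initialization is a no-op instead of a nil pointer dereference.
func (e *encryptor) Close() error {
	if e == nil {
		return nil
	}
	e.closed = true
	return nil
}

func main() {
	enc, err := newEncryptor(false)
	fmt.Println(enc == nil, err)
	fmt.Println(enc.Close())
}
```

An alternative is for the caller to return early on the initialization error and never call `Close` on a nil value; the guard simply makes the method robust to either calling style.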

Change-Id: Idd167a120e157005f44d255a61ef13dc80e8eeed
2019-08-22 18:14:02 +02:00
Serge Bazanski bfcaedcf2b prodimage: add gnupg, use pl mirrors
Change-Id: I6245e9b1b127c5db574d58e35b5f3006551d795b
2019-08-14 19:21:48 +02:00
q3k 73b96184c7 Merge "bgpwtf/cccampix: cronjobify ripe-sync" 2019-08-14 12:34:10 +00:00
Serge Bazanski 821fa5fcc4 bgpwtf/cccampix: cronjobify ripe-sync
Change-Id: I185c2702384941b6537a6a4048bdb2e1c4e183ba
2019-08-14 14:33:30 +02:00
lb5tr 716ecf6bc5 bgpwtf/cccampix/pgpencryptor: implement service
TODO:
  * tests

Change-Id: I5d0506542070236a8ee879fcb54bc9518e23b5e3
2019-08-12 19:17:05 -07:00
Serge Bazanski 49bf87f8e1 bgpwtf/cccampix: fix da build
Change-Id: Id890b0f4c7a7bd7d961d2105b388b1b0b14f9015
2019-08-11 23:51:50 +02:00
q3k 1fad2e5c6e bgpwtf/cccampix: draw the rest of the fucking owl
Change-Id: I49fd5906e69512e8f2d414f406edc0179522f225
2019-08-11 23:43:25 +02:00
q3k ddfd6591f8 *: bump docker images and storage pools
This brings all core services back to life after The Failure.

Change-Id: I98b0c104c66fa11f646864018356e9c3a226a1f9
2019-08-11 23:42:47 +02:00
q3k d533892efa Fix crdb-waw1
We accidentally created crdb-waw2 in
https://gerrit.hackerspace.pl/c/hscloud/+/2.

We remove it now and also backport a manual change that makes the
crdb-waw1 service public via a LoadBalancer.

Change-Id: I3bbd6f01b82c6efa458cc44776f086ba36e9f20c
2019-08-11 23:42:47 +02:00
q3k 17641a8607 Merge "annoyatron: temp fix" 2019-08-11 17:25:37 +00:00
lb5tr e5f8e8ae0c bgpwtf/cccampix/pgpencryptor: add service base
Add emacs swap files to .gitignore.

Change-Id: I5e0e3e31a0a0cd6d73e6c89a82b73412f0f78a15
2019-08-10 10:51:07 -07:00
q3k 0e223ec77f bgpwtf/cccampix/proto: add PGPEncryptor service
Change-Id: I932ce6bf5fdb792eb83945a8e46551f169e51c97
2019-08-09 19:02:32 +02:00
q3k 1f3674fafa annoyatron: temp fix
Change-Id: Ib70425f69b9ea5811c1adff3316789c5d5042d82
2019-08-08 17:49:39 +02:00
q3k d07861b7df ceph-waw1 -> ceph-waw2
Change-Id: I03d6244b9697a9efc06492114ef90cdb01e17601
2019-08-08 17:49:31 +02:00
q3k 30317b4278 go/mirko: add SQL migrations machinery
This uses github.com/golang-migrate/migrate and adds a Source that
allows using go_embed data files.

We also provide a test/example.
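
As a rough illustration of what a migration Source has to do, the sketch below parses file names in the `{version}_{title}.up.sql` / `.down.sql` convention used by golang-migrate and orders the up migrations by version. It uses only the standard library; the `migration` type, `parseName` and `upMigrations` are hypothetical and do not reflect the mirko code or the golang-migrate API.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// migration describes one embedded migration file (hypothetical type).
type migration struct {
	Version uint64
	Title   string
	Up      bool
}

// parseName splits a name such as "1_initial.up.sql" into its parts.
func parseName(file string) (migration, error) {
	parts := strings.Split(file, ".")
	if len(parts) != 3 || parts[2] != "sql" || (parts[1] != "up" && parts[1] != "down") {
		return migration{}, fmt.Errorf("unexpected migration file name %q", file)
	}
	head := strings.SplitN(parts[0], "_", 2)
	if len(head) != 2 {
		return migration{}, fmt.Errorf("missing version prefix in %q", file)
	}
	v, err := strconv.ParseUint(head[0], 10, 64)
	if err != nil {
		return migration{}, fmt.Errorf("bad version in %q: %v", file, err)
	}
	return migration{Version: v, Title: head[1], Up: parts[1] == "up"}, nil
}

// upMigrations returns the up migrations sorted by ascending version.
func upMigrations(files []string) ([]migration, error) {
	var out []migration
	for _, f := range files {
		m, err := parseName(f)
		if err != nil {
			return nil, err
		}
		if m.Up {
			out = append(out, m)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Version < out[j].Version })
	return out, nil
}

func main() {
	// Hypothetical file names, standing in for embedded data files.
	files := []string{"2_add_index.up.sql", "1_initial.down.sql", "1_initial.up.sql"}
	ms, err := upMigrations(files)
	if err != nil {
		panic(err)
	}
	for _, m := range ms {
		fmt.Println(m.Version, m.Title)
	}
}
```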

Change-Id: Icd2b6c7f7d0f728073b3fdf39b432b33ce61a3cd
2019-08-03 23:49:43 +02:00
q3k 2316ac0e99 bgpwtf/cccampix/irr: limit concurrency
Change-Id: I958322f33c86469f9c3e21d1bd962faede2a3fee
2019-08-03 23:49:43 +02:00
q3k e06c314e92 Merge "bgpwtf/cccampix: add IRR daemon" 2019-08-02 11:42:39 +00:00
q3k 113baaf9c1 Merge "bgpwtf/cccampix/peeringdb: allow multiple routers per peer" 2019-08-02 11:41:48 +00:00
q3k 6eaaaf9bab bgpwtf/cccampix: add IRR daemon
We add a small IRR service for getting a parsed RPSL from IRRs. For now,
we only support RIPE and ARIN, and only the following attributes:
 - remarks
 - import
 - export

Since RPSL/RFC2622 is fucking insane, there is no guarantee that the
parser, especially the import/export parser, is correct. But it should
be good enough for our use. We even throw in some tests for good
measure.

    $ grpcurl -format text -plaintext -d 'as: "26625"' 127.0.0.1:4200 ix.IRR.Query
    source: SOURCE_ARIN
    attributes: <
      import: <
        expressions: <
          peering: "AS6083"
          actions: "pref=10"
        >
        filter: "ANY"
      >
    >
    attributes: <
      import: <
        expressions: <
          peering: "AS12491"
          actions: "pref=10"
        >
        filter: "ANY"
      >
    >
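
To give a feel for the import parsing, here is a minimal sketch that handles only the simplest RFC 2622 form, `from <peering> [action <actions>;] accept <filter>`. The input line in `main` is an assumed reconstruction matching the output above, and `importAttr` and `parseSimpleImport` are hypothetical names rather than the daemon's actual parser.

```go
package main

import (
	"fmt"
	"strings"
)

// importAttr mirrors the fields shown in the query output (hypothetical type).
type importAttr struct {
	Peering string
	Actions []string
	Filter  string
}

// parseSimpleImport handles "from <peering> [action <a>; ...] accept <filter>".
// Anything more complex (multiple peerings, refine/except) is rejected or
// not understood by this sketch.
func parseSimpleImport(s string) (importAttr, error) {
	f := strings.Fields(s)
	if len(f) < 4 || !strings.EqualFold(f[0], "from") {
		return importAttr{}, fmt.Errorf("unsupported import %q", s)
	}
	accept := -1
	for i, w := range f {
		if strings.EqualFold(w, "accept") {
			accept = i
		}
	}
	if accept < 2 || accept == len(f)-1 {
		return importAttr{}, fmt.Errorf("missing accept clause in %q", s)
	}
	out := importAttr{Peering: f[1], Filter: strings.Join(f[accept+1:], " ")}
	if accept > 2 {
		if !strings.EqualFold(f[2], "action") {
			return importAttr{}, fmt.Errorf("unsupported peering in %q", s)
		}
		// Actions are separated by semicolons.
		for _, a := range strings.Split(strings.Join(f[3:accept], " "), ";") {
			if a = strings.TrimSpace(a); a != "" {
				out.Actions = append(out.Actions, a)
			}
		}
	}
	return out, nil
}

func main() {
	attr, err := parseSimpleImport("from AS6083 action pref=10; accept ANY")
	if err != nil {
		panic(err)
	}
	fmt.Printf("peering=%s actions=%v filter=%s\n", attr.Peering, attr.Actions, attr.Filter)
}
```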

Change-Id: I8b240ffe2cd3553a25ce33dbd3917c0aef64e804
2019-08-02 13:39:42 +02:00
q3k 0607abae1d bgpwtf/cccampix/peeringdb: allow multiple routers per peer
Change-Id: I84200cc0056d569e962c104cf082ce10f9c4025f
2019-08-02 13:39:41 +02:00