Kubernetes workload identity looks pretty scary when you read about it in the docs, but it really is a better (and simpler) way to give specific permissions to Kubernetes workloads than less secure methods such as using service account keys. I had a specific use case in mind – getting a set of collections from mongodb to bigquery on a regular schedule – and since I’m running Kube in that project anyway, it seemed a reasonable solution to use a Kube cronjob.
Maybe you’re not using kubernetes at all but just want to transfer data from mongo to bigquery – I’ll show you how to run those parts of the article locally too.
Even if that doesn’t match your exact end to end use case, there should be something here for anyone who wants to work with any of the topics covered in the (long) journey this article takes.
Here’s a summary of the main topics:
- Cloud build to create images to run on Kubernetes
- Cloud builder container images – Google maintained prebuilt images
- Artifact registry for serving your build images
- GCP service accounts versus Kubernetes service accounts
- Iam policy binding
- Kubernetes workload identity federation
- Kubernetes jobs and cronjobs
- bq for bigquery
- gsutil to move data to cloud storage
- yq and jq for manipulating yml and json files
- Mongoexport to get data out of mongo
- Doppler and Kubernetes Secrets to manage credentials
I’ll also provide generalized bash scripts to set everything up, so you can cannibalize them to fit your own use case. Details on the repo for those at the end of the article.
I’m using an autopilot Kubernetes cluster for this project. There are a few extra steps to prepare the node pools on other kinds of cluster for workload identity. I won’t be covering that in this article, but you can read more about it in the first couple of sections here.
Environment
I have multiple NAMESPACES in my cluster, and the ones I refer to here are ‘gcp-stg’ and ‘gcp-prod’. I also refer to MODES. Many of the resource names in the scripts are derived from either the NAMESPACE or the MODE. The MODE is derived from the NAMESPACE so MODE ‘stg’ refers to NAMESPACE ‘gcp-stg’ and MODE ‘prod’ refers to NAMESPACE ‘gcp-prod’. You may need to tweak some of the scripts to mirror your naming conventions.
The configuration details for my Mongo instances use doppler as the source of truth for secrets, but these are already in my cluster as kubernetes secrets as my api uses these to access Mongo. I cover how to do that in Sharing secrets between Doppler, GCP and Kubernetes. For the purposes of the kubernetes focused sections of this article, let’s assume you’ve created a kubernetes secret with your Mongo credentials safely stored.
You’ll also need kubectl, gcloud, gsutil, yq and jq installed, as well as a modern bash.
Creating GCP and Kubernetes service accounts
A GCP service account (GSA) defines the permissions granted on cloud resources, whereas a Kubernetes service account (KSA) is the Kubernetes resource that maps to that GSA. When creating Kubernetes pods, we refer to the KSA, which then refers to the GSA behind the scenes. To put it another way, the KSA is an abstraction of a linked GSA, so there are no GSA keys to manage and no risk of leaking them. It’s this mapping that provides the basis for Workload identity federation.
The scripts to create service accounts
I use a wrapper for this to ensure that the 2nd script runs under bash regardless of which shell executes it. I’m adding this to my cluster startup process in case I restart the cluster at some future point.
Some notes on the script
Execute sh mtob-sa.sh ‘your-namespace’ (in my case ‘sh mtob-sa.sh gcp-dev’) and it will
- Use jq to create a condition json to apply when we allocate a role to the GSA. Conditions are useful to fine tune the permission – for example how long it should be valid for. You may want to only allow it for say “+2 hours”. In my case, I’m creating a cronjob that will run for as long as the cluster is running – so I’ve given it a long expiry time.
- Delete and recreate both the KSA and GSA. We want to make sure that the permissions currently on the GSA are only the ones required now, not some that may have been set in the past.
- The roles I’m assigning are the minimum required to load exported data from mongo to cloud storage, then load that data into bigquery – creating and replacing tables as required. I’m assuming the bigquery dataset they will live in already exists.
- Give the GSA ‘roles/iam.workloadIdentityUser’. Note the very specific member email address for this purpose.
- Finally map the GSA to the KSA we created earlier using an annotation to enable workload identity.
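The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script from the repo: the function names, the condition title, the expiry date and the three roles are illustrative guesses. The account names follow the mtob-gsa-<namespace> and mtob-robot-<namespace> pattern used in the describe commands in this article.

```shell
# Sketch only: helper names, roles and the expiry date are illustrative assumptions.

# Names derived from the namespace, following the pattern used in this article.
gsa_name() { echo "mtob-gsa-$1"; }
ksa_name() { echo "mtob-robot-$1"; }
gsa_email() { echo "$(gsa_name "$2")@$1.iam.gserviceaccount.com"; }

# The special member that represents a KSA in the project's workload identity pool.
wi_member() { echo "serviceAccount:$1.svc.id.goog[$2/$(ksa_name "$2")]"; }

setup_service_accounts() {
  local project="$1" namespace="$2"
  local gsa ksa email condition_file role
  gsa="$(gsa_name "$namespace")"
  ksa="$(ksa_name "$namespace")"
  email="$(gsa_email "$project" "$namespace")"
  condition_file="$(mktemp)"

  # Build the IAM condition with jq (hypothetical long expiry).
  jq -n --arg expr 'request.time < timestamp("2030-01-01T00:00:00Z")' \
    '{title: "mtob-expiry", description: "long lived for a cronjob", expression: $expr}' \
    > "$condition_file" || return 1

  # Delete and recreate both accounts so no stale permissions survive.
  gcloud iam service-accounts delete "$email" --project "$project" --quiet || true
  gcloud iam service-accounts create "$gsa" --project "$project" || return 1
  kubectl delete serviceaccount "$ksa" --namespace "$namespace" --ignore-not-found
  kubectl create serviceaccount "$ksa" --namespace "$namespace" || return 1

  # Assumed roles for writing to cloud storage and loading into bigquery.
  for role in roles/storage.objectAdmin roles/bigquery.dataEditor roles/bigquery.jobUser; do
    gcloud projects add-iam-policy-binding "$project" \
      --member "serviceAccount:$email" --role "$role" \
      --condition-from-file "$condition_file" || return 1
  done

  # Let the KSA impersonate the GSA, then link them with the annotation.
  gcloud iam service-accounts add-iam-policy-binding "$email" --project "$project" \
    --role roles/iam.workloadIdentityUser \
    --member "$(wi_member "$project" "$namespace")" || return 1
  kubectl annotate serviceaccount "$ksa" --namespace "$namespace" \
    "iam.gke.io/gcp-service-account=$email"
}

# usage: setup_service_accounts yourprojectid gcp-stg
```

Keeping the name derivation in small functions means the same naming can be reused when injecting the service account name into the yaml templates later.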
You can examine the status of these service accounts (substituting your namespace name and project id) with
gcloud iam service-accounts describe mtob-gsa-gcp-stg@yourprojectid.iam.gserviceaccount.com
and
kubectl describe serviceaccounts mtob-robot-gcp-stg -n gcp-stg
Create the script to export a collection from mongo
This script will run in a Kubernetes pod. You should create the receiving dataset in the bigquery console, and the cloud storage bucket before starting. Note that these scripts expect stg and prod data in separate datasets, with the dataset name derived from the MODE value. You may want to tweak this for your own naming conventions.
Some notes on this script
- This takes 2 arguments – MODE (in my case this is stg or prod – the data is kept separate through the entire process) and TABLE (what to call it on bigquery – this is probably going to be the same as the collection name on mongo)
- The bucket and folder. This bucket and folder combination, along with the MODE, is the staging location the mongo data passes through on its way to bigquery. It will also be a handy backup of the latest state of each collection.
- ndjson – this is a variant of json which instead of commas between each element of a JSON array, inserts a new line. This is the format that bigquery likes to import from and allows the loading of files that would otherwise be too large to represent in a single json file.
- Schemas – although bigquery can automatically detect data format, it doesn’t always get it right. I normally load the data once with --autodetect, check and correct the schema it generates, download it, then use that corrected schema for subsequent uploads. There’s a script to do that later in this article.
- The mongo environment variables such as DB_HOST etc come from a Kubernetes secret which you should create (or use your preferred alternative method to inject env) – for example see Sharing secrets between Doppler, GCP and Kubernetes and the section later in this article for running the scripts locally.
- Mongo exports various special fields (like $oid and $date) – bigquery doesn’t like dollars in fieldnames, so we have to do a small edit to drop them. This is fine for my tables but you may need to make it more sophisticated if you have some other anomalies.
- The location variable for bigquery will depend on where you are located – I’m using eu.
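To make the notes above concrete, here is a minimal sketch of the export flow. It is not the mtob.sh from the repo: the bucket, folder, dataset naming, schema location and function names are all placeholders I have made up for illustration, and the sed expression only handles simple keys such as $oid and $date.

```shell
# Sketch only: names and layout are assumptions, not the repo's mtob.sh.

# Drop the leading dollar from Mongo extended JSON keys such as "$oid" and "$date".
strip_dollar_keys() {
  sed -E 's/"\$([A-Za-z_]+)":/"\1":/g'
}

export_collection() {
  local mode="$1" table="$2"
  local bucket="gs://your-bucket/mtob/${mode}"   # hypothetical bucket and folder
  local dataset="yourdataset_${mode}"            # hypothetical dataset naming
  local raw clean
  raw="$(mktemp)"
  clean="$(mktemp)"

  # mongoexport writes one document per line (ndjson) by default.
  mongoexport --host "$DB_HOST" --username "$DB_USER" --password "$DB_PASSWORD" \
    --db "$DB_NAME" --collection "$table" --out "$raw" || return 1

  # Clean up the field names, then stage the file in cloud storage.
  strip_dollar_keys < "$raw" > "$clean" || return 1
  gsutil cp "$clean" "${bucket}/${table}.ndjson" || return 1

  # Load from cloud storage, replacing the table, using a saved schema file.
  bq --location=eu load --replace --source_format=NEWLINE_DELIMITED_JSON \
    "${dataset}.${table}" "${bucket}/${table}.ndjson" "schemas/${table}.json"
}

# usage: export_collection stg users
```

Only keys are touched: a string value that happens to contain a dollar sign, such as "$5", is left alone because it is not followed by a colon.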
Create a script to move all the required collections
This script will run in kubernetes and call mtob.sh for each mongo collection that needs to be imported into a bigquery table.
Some notes on this script
- Use bash (not sh) to run
- The NAMESPACE variable comes from Kubernetes – more on that later when we look at yaml files
- In my example, mode and namespace are related – NAMESPACE gcp-stg implies MODE stg. You may need to tweak the MODE extraction to your use case
- Notice we exit with a failure code if something goes wrong. This is a signal that will be picked up in the job yaml file that it should report an error in kubectl listings – more on that later
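As a rough illustration of those points, a wrapper along these lines would do the job. The collection names and function names are placeholders, and the prefix stripping assumes the gcp-<mode> namespace convention described earlier.

```shell
# Sketch only: the collection names are placeholders.

# NAMESPACE "gcp-stg" implies MODE "stg": strip the "gcp-" prefix.
mode_from_namespace() { echo "${1#gcp-}"; }

run_all() {
  local mode collection
  mode="$(mode_from_namespace "$NAMESPACE")"
  for collection in users orders; do   # hypothetical collections
    if ! bash mtob.sh "$mode" "$collection"; then
      echo "export of ${collection} failed" >&2
      exit 88   # a distinctive code the job yaml can react to
    fi
  done
}

# usage: NAMESPACE=gcp-stg run_all
```

Exiting with a distinctive code rather than a plain 1 makes it easy to tell a failure raised by the script apart from any other container failure.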
Running these scripts locally
It’s possible that you’ll want to run these scripts locally – maybe you’re not using kubernetes at all but just want to transfer data from mongo to bigquery, but even if you are using kubernetes you’ll probably want to test them before building your image. These scripts should work as is, except we need to find a way to provide the mongo credentials (DB_HOST, DB_USER, DB_PASSWORD, DB_NAME) and NAMESPACE that normally would come from the kubernetes environment.
As manually set environment variables
Just set these in your terminal shell – eg (NAMESPACE=gcp-stg) before running the scripts
From Doppler
If you are using doppler, after logging into the correct doppler project and configuration you can inject environment variables like this
doppler run bash mtob-all.sh
From a Kubernetes secret
Set up each value in your script like this
DB_NAME=$(kubectl get secret yoursecretname -o jsonpath='{.data.DB_NAME}' | base64 --decode)
See Kubernetes secret values as shell environment variables for a handy way to automate this.
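Whichever method you pick, it can help to load and check the variables in one place before running anything. The following is only a sketch: the function names and the MTOB_VARS list are my own placeholders, and it assumes the secret keys have the same names as the environment variables.

```shell
# Sketch only: the function names are placeholders.
MTOB_VARS="DB_HOST DB_USER DB_PASSWORD DB_NAME"

# Export each credential from a Kubernetes secret into the current shell.
load_secret_env() {
  local secret="$1" namespace="$2" key value
  for key in $MTOB_VARS; do
    value="$(kubectl get secret "$secret" -n "$namespace" \
      -o jsonpath="{.data.${key}}" | base64 --decode)" || return 1
    export "${key}=${value}"
  done
}

# Report any variable that is still missing before running the scripts.
require_env() {
  local key value missing=0
  for key in $MTOB_VARS NAMESPACE; do
    eval "value=\${${key}:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# usage: load_secret_env yoursecretname gcp-stg && NAMESPACE=gcp-stg require_env
```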
Schemas
The scripts are expecting to find schemas to describe how to import the collections into bigquery tables. Here’s the extract
bq --location=eu load \
--replace \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
"${DS}.${TABLE}" \
"${URI}" \
"${SCHEMA}"
Initially of course these won’t exist, so you can use the autodetect flag to create them – just remove the $SCHEMA parameter from the above and run locally. This will have a stab at creating an initial table in bigquery. We can then download the schema, tweak it if necessary and use it to ensure that future loads are consistent.
Downloading the schemas
Here’s a script to download the schemas – just run it for every collection
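A minimal sketch of such a download, assuming the schemas are kept in a local schemas folder named after the table (that layout and the function names are my assumptions, not necessarily what the repo uses):

```shell
# Sketch only: the schemas/ folder and function names are assumptions.
schema_path() { echo "schemas/$1.json"; }

download_schema() {
  local dataset="$1" table="$2"
  mkdir -p schemas
  # bq show --schema prints just the table's field definitions as JSON.
  bq show --schema --format=prettyjson "${dataset}.${table}" > "$(schema_path "$table")"
}

# usage: download_schema yourdataset_stg users
```

The downloaded file can be edited by hand and then passed as the last argument to bq load in place of autodetection.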
Building the image
Returning to the Kubernetes setup, now we can build the image with cloud build, with
gcloud builds submit --config=mtob-build.yaml
Some notes on the build yaml file
- You should create (or use an existing) artifact repository in your project’s Artifact registry
- Substitute the region, artifacts, and gcpimage with your names. The gcp image name can be anything you want. It’s what will run in Kubernetes.
- Cloud build automatically detects PROJECT_ID
- We use the prebuilt docker image from Cloud builder container images to create our image using the Docker file mtob-Dockerfile (more on that shortly)
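For orientation, a build file matching those notes could be generated like this. It is a sketch only: the region, repository and image values are placeholders, and I write it to a separate file name so it cannot overwrite a real mtob-build.yaml.

```shell
# Sketch only: REGION, ARTIFACTS and GCPIMAGE are placeholders to substitute.
REGION="europe-west2"
ARTIFACTS="your-artifact-repo"
GCPIMAGE="mtob"

# \$PROJECT_ID is left for Cloud Build to substitute; the rest is expanded by the shell.
IMAGE="${REGION}-docker.pkg.dev/\$PROJECT_ID/${ARTIFACTS}/${GCPIMAGE}"

cat > mtob-build-sketch.yaml <<EOF
steps:
  # The prebuilt docker builder image builds our image from mtob-Dockerfile.
  - name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "${IMAGE}", "-f", "mtob-Dockerfile", "."]
# Images listed here are pushed to Artifact registry when the build succeeds.
images:
  - "${IMAGE}"
EOF

# then: gcloud builds submit --config=mtob-build-sketch.yaml
```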
Build docker file
This is referenced by the cloud build yaml.
Some notes on this Dockerfile
- We’ll use the prebuilt gcloud-slim image from Cloud builder container images
- We need to add Mongoexport to that image.
- The gcloud container has a number of entry points. Since we’ll be running the mtob-all.bash script created earlier, we want to use the bash entry point. Anything in the CMD section will be passed over to bash to run.
Recap
We’re almost there – all that remains now is to set up a job or a cronjob – I’ll show how to do both – to run this container to do all the exports according to some schedule. Here’s what we have
- A kubernetes service account (KSA) linked to a gcp service account (GSA) which has the necessary permissions to access cloud storage and bigquery.
- Workload identity federation setup to use this KSA
- Credentials for mongo in a kubernetes secret
- A script that can take an array of collections from mongo, back them up to cloud storage and load them into a bigquery dataset
- An image in the artifact repository that kubernetes can run to do all that, parameterized to segregate multiple modes and namespaces
Kubernetes yaml files
We’ll create 2 yaml files
- one to run a Kubernetes job to do a once off import
- one to run a Kubernetes cronjob to run at regular intervals
job yaml file
Some notes on the job yaml file
- The podFailurePolicy is used to communicate to kubectl that there has been an error (so kubectl get pods will report the error). Back in mtob-all.bash, we exit 88 on an error. This yaml file tells kubernetes what to do if an error 88 is detected.
- envFrom is used to inject my Mongo credentials from the Kubernetes secret which contains them.
- env is used to extract the namespace we are running in from the pod metadata and pass it on to the mtob-all.bash script
- ‘restartPolicy: Never’ means don’t keep running the pod over and over. Just stop when it’s run once.
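Putting those notes together, a job definition could look something like the one written out below. Treat it as a sketch: the job name, container name, image path and secret name are placeholders, and backoffLimit: 0 is my own choice rather than something taken from the repo.

```shell
# Sketch only: names, image path and secret name are placeholders.
cat > mtob-job-sketch.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: mtob-job
spec:
  backoffLimit: 0
  podFailurePolicy:
    rules:
      # Fail the whole job as soon as the script exits with 88.
      - action: FailJob
        onExitCodes:
          containerName: mtob
          operator: In
          values: [88]
  template:
    spec:
      serviceAccountName: mtob-robot-gcp-stg
      restartPolicy: Never
      containers:
        - name: mtob
          image: REGION-docker.pkg.dev/PROJECT_ID/ARTIFACTS/GCPIMAGE
          # Mongo credentials from the Kubernetes secret.
          envFrom:
            - secretRef:
                name: yoursecretname
          # Pass the pod's own namespace to the script.
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
EOF

# then: kubectl apply -f mtob-job-sketch.yaml -n gcp-stg
```

Kubernetes only accepts a podFailurePolicy when the pod template uses restartPolicy: Never, so the two settings go together.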
cron job yaml file
Some notes on the cron job yaml file.
- This is very similar to the job yaml files, except that we add a section to do with scheduling and history.
- The schedule uses crontab syntax to set when and how often to run this thing. This example runs it every hour, on the hour. Here’s a very handy site to interpret cron syntax.
- When a Kube job finishes, it leaves the pod visible with a status of ‘completed’. Running something every hour will leave a clutter, so setting history limits can clean away old pod entries from kubectl. Worth setting these high to start, then reducing them when you’re comfortable everything is golden.
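The scheduling and history settings wrap the same pod template as the job. Here is a cut down sketch: the names, the image path and the history limits of 5 are placeholder values, and the env and podFailurePolicy sections are left out for brevity.

```shell
# Sketch only: names, image path and limits are placeholders.
cat > mtob-cronjob-sketch.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mtob-cronjob
spec:
  # minute hour day-of-month month day-of-week: minute 0 of every hour
  schedule: "0 * * * *"
  # How many finished jobs (and their pods) to keep around.
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: mtob-robot-gcp-stg
          restartPolicy: Never
          containers:
            - name: mtob
              image: REGION-docker.pkg.dev/PROJECT_ID/ARTIFACTS/GCPIMAGE
EOF

# then: kubectl apply -f mtob-cronjob-sketch.yaml -n gcp-stg
```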
yq
So as to avoid creating separate yaml files for each namespace and service account name, we can use yq to make minor changes to a basic template, which we’ll do with the scripts below.
This one injects the namespace and service account name into the job yaml
This one injects the namespace and service account name into the cron job yaml
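As a sketch of the idea, assuming the Go based yq version 4 and template files called mtob-job.yaml and mtob-cronjob.yaml (both assumptions, as are the function names), the injection could look like this:

```shell
# Sketch only: assumes yq v4 (the Go implementation) and hypothetical template file names.
ksa_name() { echo "mtob-robot-$1"; }

# Print the job template with the namespace and service account filled in.
render_job() {
  NAMESPACE="$1" KSA="$(ksa_name "$1")" yq \
    '.metadata.namespace = strenv(NAMESPACE) | .spec.template.spec.serviceAccountName = strenv(KSA)' \
    mtob-job.yaml
}

# Same idea for the cron job, where the pod template sits one level deeper.
render_cronjob() {
  NAMESPACE="$1" KSA="$(ksa_name "$1")" yq \
    '.metadata.namespace = strenv(NAMESPACE) | .spec.jobTemplate.spec.template.spec.serviceAccountName = strenv(KSA)' \
    mtob-cronjob.yaml
}

# usage: render_job gcp-stg | kubectl apply -f -
```

Printing the rendered yaml to stdout rather than editing the file in place keeps the template untouched and lets you inspect the result before piping it to kubectl.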
Summary
Since there’s a lot going on in this article, I’ve used a series of scripts to illustrate the steps instead of going into too much detail on the background. Watch out for future deep dive articles on some of these topics, in particular workload identity. As usual, contact me if you are interested in any particular subject mentioned here.
Links
These scripts are on github.