Kubernetes workload identity looks pretty scary when you read about it in the docs, but it really is a better (and simpler) way to give specific permissions to Kubernetes workloads than less secure methods such as using service account keys. I had a specific use case in mind – getting a set of collections from mongodb to bigquery on a regular schedule – and since I’m running Kube in that project anyway, it seemed a reasonable solution to use a Kube cronjob.

Maybe you’re not using kubernetes at all but just want to transfer data from mongo to bigquery – I’ll show you how to run those parts of the article locally too.

Even if that doesn’t match your exact end to end use case, there should be something here for anyone who wants to work with any of the topics covered along the (long) journey this article takes.

Here’s a summary of the main topics:

  • setting up linked GCP and Kubernetes service accounts and enabling workload identity
  • exporting mongodb collections and loading them into bigquery via cloud storage, with managed schemas
  • building the container image with cloud build and a custom Dockerfile
  • running it all as a Kubernetes job or cronjob, using yq to adapt the yaml per namespace

I’ll also provide generalized bash scripts to set everything up, so you can cannibalize them to fit your own use case. Details on the repo for those at the end of the article.

I’m using an autopilot Kubernetes cluster for this project. There are a few extra steps to prepare the node pools on other kinds of cluster for workload identity. I won’t be covering that in this article, but you can read more about it in the first couple of sections here.

Environment

I have multiple NAMESPACES in my cluster, and the ones I refer to here are ‘gcp-stg’ and ‘gcp-prod’. I also refer to MODES. Many of the resource names in the scripts are derived from either the NAMESPACE or the MODE. The MODE is derived from the NAMESPACE, so MODE ‘stg’ refers to NAMESPACE ‘gcp-stg’ and MODE ‘prod’ refers to NAMESPACE ‘gcp-prod’. You may need to tweak some of the scripts to mirror your naming conventions.

The configuration details for my Mongo instances use doppler as the source of truth for secrets, but these are already in my cluster as kubernetes secrets because my api uses them to access Mongo. I cover how to do that in Sharing secrets between Doppler, GCP and Kubernetes. For the purposes of the kubernetes focused sections of this article, let’s assume you’ve created a kubernetes secret with your Mongo credentials safely stored.
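If you’re not using doppler, a minimal sketch of creating an equivalent secret by hand might look like this – I’m assuming the secret is called ‘doppler-secrets’ (the name the job yaml later in this article expects) and uses the DB_ keys the export scripts read; the values are placeholders:

# create a namespaced secret holding the mongo connection details
kubectl create secret generic doppler-secrets \
--namespace gcp-stg \
--from-literal=DB_HOST=yourcluster.example.mongodb.net \
--from-literal=DB_NAME=yourdb \
--from-literal=DB_USER=youruser \
--from-literal=DB_PASSWORD=yourpassword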

You’ll also need kubectl, gcloud, gsutil, yq and jq installed, as well as a modern bash.

Creating GCP and Kubernetes service accounts

A GCP service account (GSA) defines the permissions granted to access cloud resources, whereas a Kubernetes service account (KSA) completes the mapping between a kubernetes resource and that GSA. When creating Kubernetes pods, we refer to the KSA, which then refers to the GSA behind the scenes. To put it another way, the KSA is an abstraction of a linked GSA – meaning there is no need for the potential credential leakage associated with managing GSA keys. It’s this mapping that provides the basis for Workload identity federation.
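To make the mapping concrete before we get into the scripts, it boils down to two statements – an IAM binding on the GSA and an annotation on the KSA. The names here are placeholders; the full script below builds the real ones from the namespace.

# 1 - allow the KSA to impersonate the GSA via workload identity
gcloud iam service-accounts add-iam-policy-binding GSA_NAME@PROJECT_ID.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA_NAME]"

# 2 - tell GKE which GSA the KSA maps to
kubectl annotate serviceaccount KSA_NAME --namespace NAMESPACE \
iam.gke.io/gcp-service-account=GSA_NAME@PROJECT_ID.iam.gserviceaccount.com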

The scripts to create service accounts

I use a wrapper for this to ensure that the 2nd script runs under bash regardless of which shell executes it. I’m adding this to my cluster startup process in case I restart the cluster at some future point.

# set up workload identity service accounts for mtob
NS=$1
if [ -z "${NS}" ]; then
echo "first arg should be a namespace like gcp-prod"
exit 1
fi
# this needs to be run under bash
bash mtob-setup-wid.bash "${NS}"
Create Kubernetes and GCP service accounts – mtob-sa.sh

Some notes on the script

Execute sh mtob-sa.sh ‘your-namespace’ (in my case ‘sh mtob-sa.sh gcp-stg’) and it will

  • Use jq to create a condition json to apply when we allocate a role to the GSA. Conditions are useful to fine tune the permission – for example, how long it should remain valid. You may want to allow it for only, say, “+2 hours”. In my case, I’m creating a cronjob that will run for as long as the cluster is running, so I’ve given it a long expiry time. There’s an example of the generated condition file just after this list.
  • Delete and recreate both the KSA and GSA. This makes sure the permissions on the GSA are only the ones required now, not ones that may have been set in the past.
  • The roles I’m assigning are the minimum required to load exported data from mongo to cloud storage, then load that data into bigquery – creating and replacing tables as required. I’m assuming the bigquery dataset they will live in already exists.
  • Give the GSA ‘roles/iam.workloadIdentityUser’. Note the very specific member email address for this purpose.
  • Finally map the GSA to the KSA we created earlier using an annotation to enable workload identity.
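For reference, the condition file generated by the jq step looks something like this – the timestamp is simply whatever ${TS} works out to be:

{
  "expression": "request.time < timestamp('2030-01-01T00:00:00Z')",
  "title": "Temporary sa for mongo to bigquery",
  "description": "expires at 2030-01-01T00:00:00Z"
}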

#!/bin/bash

NS=$1
if [ -z "${NS}" ]; then
echo "first arg should be a namespace like gcp-prod"
exit 1
fi


REGION="...your project region eg europe-west1"
CLUSTERNAME="...your cluster name"
PROJECT="...your project ID"

# these are the names I'm using throughout - you can use your own
KSA="mtob-robot-${NS}"
GSANAME="mtob-gsa-${NS}"
GSA="${GSANAME}@${PROJECT}.iam.gserviceaccount.com"

# for demonstration i'm putting a condition on when permissions
# should expire
EXPIRE="5 years"
TITLE="Temporary sa for mongo to bigquery"
TS=$(date -d "${EXPIRE}" --utc +%FT%TZ)

# just a temporary file to use for a condition file
TEMP=$(mktemp)


# make conditions for iam binding
jq -n ' {"expression": $E, title: $T, description: $D}' \
--arg T "${TITLE}" \
--arg E "request.time < timestamp('${TS}')" \
--arg D "expires at ${TS}" > $TEMP

# get to the right cluster and ns
gcloud container clusters get-credentials ${CLUSTERNAME} \
--region=${REGION}

# not strictly necessary, but handy for checking things more easily later
kubectl config set-context --current --namespace=${NS}

# create a kube service account
KL=$(kubectl get serviceaccount -n ${NS} | grep ${KSA})
if [ -n "${KL}" ]; then
kubectl delete serviceaccount ${KSA} \
--namespace ${NS}
fi

kubectl create serviceaccount ${KSA} \
--namespace ${NS}

# create service account and give it required roles
SL=$(gcloud iam service-accounts list | grep ${GSA})
if [ -n "${SL}" ]; then
gcloud iam service-accounts delete ${GSA} --quiet
fi

gcloud iam service-accounts create ${GSANAME} \
--project=${PROJECT} \
--display-name="GCP SA ${GSANAME} for use with kube ${KSA} for mongo to bq"

# roles required for storage and bigquery to be assigned to gcp service account
ROLES=('bigquery.dataEditor' 'bigquery.user' 'storage.objectAdmin')


# assign each of the required roles
for role in "${ROLES[@]}"
do
gcloud projects add-iam-policy-binding ${PROJECT} \
--member "serviceAccount:${GSA}" \
--role "roles/${role}" \
--condition-from-file=${TEMP}
done


# assign it to kube sa
gcloud iam service-accounts add-iam-policy-binding ${GSA} \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:${PROJECT}.svc.id.goog[${NS}/${KSA}]"

kubectl annotate serviceaccount ${KSA} \
--namespace ${NS} \
iam.gke.io/gcp-service-account=${GSA} \
--overwrite

# clean up
rm $TEMP
mtob-setup-wid.bash sets up SA with required permissions

You can examine the status of these service accounts (substituting your namespace name and project id) with

gcloud iam service-accounts describe mtob-gsa-gcp-stg@yourprojectid.iam.gserviceaccount.com

and

kubectl describe serviceaccounts mtob-robot-gcp-stg -n gcp-stg
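You can also confirm both halves of the workload identity mapping with something like

# should show a roles/iam.workloadIdentityUser binding for the KSA member
gcloud iam service-accounts get-iam-policy mtob-gsa-gcp-stg@yourprojectid.iam.gserviceaccount.com

# should print the GSA email from the iam.gke.io/gcp-service-account annotation
kubectl get serviceaccount mtob-robot-gcp-stg -n gcp-stg \
-o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'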

Create the script to export a collection from mongo

This script will run in a Kubernetes pod. You should create the receiving dataset in the bigquery console, and the cloud storage bucket before starting. Note that these scripts expect stg and prod data in separate datasets, with the dataset name derived from the MODE value. You may want to tweak this for your own naming conventions.
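If you’d rather create those from the command line than the console, something along these lines should do it – the dataset and bucket names are the placeholders used in the script, and I’m using the EU multi-region to match the bq location used later:

# one dataset per MODE, eg yourdataset_stg and yourdataset_prod
bq --location=eu mk --dataset yourdataset_stg

# the bucket used to stage the exports
gsutil mb -l EU gs://yourbucket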

Some notes on this script

  • This takes 2 arguments – MODE (in my case stg or prod – the data is kept separate through the entire process) and TABLE (what to call it on bigquery – this is probably going to be the same as the collection name on mongo)
  • The bucket and folder combination, along with the MODE, is where the mongo data is staged on its way to bigquery. It also serves as a handy backup of the latest state of each collection.
  • ndjson – a variant of json which puts each element of a JSON array on its own line instead of separating elements with commas. This is the format bigquery likes to import from, and it allows the loading of files that would otherwise be too large to represent as a single json file. There’s a short illustration just after this list.
  • Schemas – although bigquery can automatically detect data format, it doesn’t always get it right. I normally load the data once with --autodetect, check and correct the schema it generates, download it, then use that corrected schema for subsequent uploads. There’s a script to do that later in this article.
  • The mongo environment variables such as DB_HOST etc come from a Kubernetes secret which you should create (or use your preferred alternative method to inject env) – for example see Sharing secrets between Doppler, GCP and Kubernetes and the section later in this article for running the scripts locally.
  • Mongo exports various special fields (like $oid and $date) – bigquery doesn’t like dollars in fieldnames, so we have to do a small edit to drop them. This is fine for my tables but you may need to make it more sophisticated if you have some other anomalies.
  • The location variable for bigquery will depend on where you are located – I’m using eu.
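To illustrate the ndjson point, here are the same two (made up) documents as a regular json array and as ndjson:

# json array - a single document
[{"name": "a", "value": 1}, {"name": "b", "value": 2}]

# ndjson - one complete json document per line
{"name": "a", "value": 1}
{"name": "b", "value": 2}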

# usage mode table (e.g stg yourcollection)
# you also need to be logged into doppler
MODE=$1
TABLE=$2
DS="yourdataset_${MODE}"
BUCKET="gs://yourbucket"
FOLDER="/exmongo"
TEMP=$(mktemp)
TEMP2=$(mktemp)
FILE=${TABLE}.ndjson
PREFIX="${FOLDER}/${MODE}"
SCHEMA="./schemas/prod-${TABLE}.json"
NS="gcp-${MODE}"

# 1- extract from mongo
echo "exporting ${FILE} and cleaning ${TEMP}"

# these DB_ come from kube secrets in env
HOST="mongodb+srv://${DB_HOST}"
CONNECTION="${HOST}/${DB_NAME}"

# get files out of mongo
mongoexport --uri "${CONNECTION}" --username "${DB_USER}" --password "${DB_PASSWORD}" --type=json -c "${TABLE}" -o "${TEMP}"

# bigquery doesn't like dollars in field names (eg $oid, $date) so drop the $ after the opening quote
sed -E 's/"\$/"/g' "${TEMP}" > "${TEMP2}"
rm "${TEMP}"

# the destination on cloud storage, eg gs://yourbucket/exmongo/prod/yourcollection.ndjson
URI="${BUCKET}${PREFIX}/${FILE}"

echo "...moving ${FILE} to ${URI}"
gsutil mv "${TEMP2}" "${URI}"

# now to bq
echo "...loading ${URI} to bigquery table ${DS}.${TABLE} with schema ${SCHEMA}"

bq --location=eu load \
--replace \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
"${DS}.${TABLE}" \
"${URI}" \
"${SCHEMA}"
mtob.sh – extract data from a collection in mongo and load into bigquery

Create a script to move all the required collections

This script will run in kubernetes and call mtob.sh for each mongo collection that needs to be imported into a bigquery table.

Some notes on this script

  • Use bash (not sh) to run
  • The NAMESPACE variable comes from Kubernetes – more on that later when we look at yaml files
  • In my example, mode and namespace are related – NAMESPACE gcp-stg implies MODE stg. You may need to tweak the MODE extraction to your use case
  • Notice we exit with a failure code if something goes wrong. This is a signal that will be picked up in the job yaml file that it should report an error in kubectl listings – more on that later
#!/bin/bash

# this comes from the kube yaml file env
echo "Running in namespace ${NAMESPACE}"

# extract the mode
MODE=$(echo "${NAMESPACE}" | sed -E 's/^.*-//')
echo "...running mode ${MODE} on namespace ${NAMESPACE}"

# run all the tables in this ns
COLLECTIONS=(
'collectiona' 'collectionb' 'collectionc'
)
GOOD=0
BAD=0

for c in "${COLLECTIONS[@]}"
do
echo ""
echo "----working on collection ${c}----"
sh mtob.sh "${MODE}" "${c}"
if [ $? -ne 0 ]; then
echo "ERROR - failed on collection ${c}"
BAD=$((BAD+1))
else
echo "...finished on collection ${c}"
GOOD=$((GOOD+1))
fi
done

echo ""
echo "----all done----"
echo "loaded ${GOOD} from ${#COLLECTIONS[@]} collections from mongo to bigquery"
if [ $BAD -ne 0 ]; then
echo "ERROR there were ${BAD} failures"
exit 88
else
exit 0
fi
mtob-all.bash – a wrapper to export all the required collections

Running these scripts locally

It’s possible that you’ll want to run these scripts locally – maybe you’re not using kubernetes at all but just want to transfer data from mongo to bigquery, but even if you are using kubernetes you’ll probably want to test them before building your image. These scripts should work as is, except we need to find a way to provide the mongo credentials (DB_HOST, DB_USER, DB_PASSWORD, DB_NAME) and NAMESPACE that normally would come from the kubernetes environment.

As manually set environment variables

Just set and export these in your terminal shell before running the scripts.
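For example, a minimal set for the stg namespace might look like this (all values are placeholders):

export NAMESPACE=gcp-stg
export DB_HOST=yourcluster.example.mongodb.net
export DB_NAME=yourdb
export DB_USER=youruser
export DB_PASSWORD=yourpassword

bash mtob-all.bash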

From Doppler

If you are using doppler, after logging into the correct doppler project and configuration you can inject the environment variables like this

doppler run -- bash mtob-all.bash
From a Kubernetes secret

Set up each value in your script like this

DB_NAME=$(kubectl get secret yoursecretname -o jsonpath='{.data.DB_NAME}' | base64 --decode)

See Kubernetes secret values as shell environment variables for a handy way to automate this.
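If you just want a quick one-off, a minimal sketch that pulls all of them from the same secret might look like this – I’m assuming the secret is called doppler-secrets and lives in gcp-stg:

# export each mongo credential from the kubernetes secret
for key in DB_HOST DB_NAME DB_USER DB_PASSWORD; do
export "${key}=$(kubectl get secret doppler-secrets -n gcp-stg \
-o jsonpath="{.data.${key}}" | base64 --decode)"
done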

Schemas

The scripts are expecting to find schemas to describe how to import the collections into bigquery tables. Here’s the extract

bq --location=eu load \
--replace \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
"${DS}.${TABLE}" \
"${URI}" \
"${SCHEMA}"

Initially of course these won’t exist, so you can use the autodetect flag to create them – just remove the $SCHEMA parameter from the above and run locally. This will have a stab at creating an initial table in bigquery. We can then download the schema, tweak it if necessary and use it to ensure that future loads are consistent.
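In other words, the very first load for a given collection, run locally, looks something like this (names are the same placeholders used in the script):

bq --location=eu load \
--replace \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
"yourdataset_stg.yourcollection" \
"gs://yourbucket/exmongo/stg/yourcollection.ndjson"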

Downloading the schemas

Here’s a script to download the schemas – just run it for every collection

MODE=$1
TABLE=$2
DS="yourdataset_${MODE}"
bq show --format json ${DS}.${TABLE} | jq '.schema.fields' > schemas/${MODE}-${TABLE}.json
getschema.sh – download for each table after autodetecting
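For example, to pull down the stg schemas for the same collections used in mtob-all.bash:

for c in collectiona collectionb collectionc; do
bash getschema.sh stg "${c}"
done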

Building the image

Returning to the Kubernetes setup, we can now build the image with cloud build, with

gcloud builds submit --config=mtob-build.yaml

Some notes on the build yaml file

  • You should create (or use an existing) artifact repository in your project’s Artifact registry
  • Substitute the region, artifact repository and image names with your own. The image name can be anything you want – it’s what will run in Kubernetes.
  • Cloud build automatically detects PROJECT_ID
  • We use the prebuilt docker image from Cloud builder container images to create our image using the Dockerfile mtob-Dockerfile (more on that shortly)

steps:
- name: 'gcr.io/cloud-builders/docker'
  args:
  - "build"
  - "--file"
  - "mtob-Dockerfile"
  - "--tag"
  - "${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_ARTIFACTS}/${_GCP_IMAGE}"
  - "."

substitutions:
  _REGION: yourregion
  _ARTIFACTS: yourartifactrepository
  _GCP_IMAGE: yourimage

images:
- "${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_ARTIFACTS}/${_GCP_IMAGE}"
mtob-build.yaml
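Once the build finishes, you can check the image has landed in your repository with something like

gcloud artifacts docker images list \
yourregion-docker.pkg.dev/yourproject/yourartifactrepo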

Build docker file

This is referenced by the cloud build yaml.

Some notes on this Dockerfile

  • We’ll use the prebuilt gcloud-slim image from Cloud builder container images
  • We need to add Mongoexport to that image.
  • The gcloud container has a number of entry points. Since we’ll be running the mtob-all.bash script created earlier, we want to use the bash entry point. Anything in the CMD section will be passed over to bash to run.
FROM gcr.io/cloud-builders/gcloud-slim
COPY ./ /

RUN curl "https://fastdl.mongodb.org/tools/db/mongodb-database-tools-ubuntu2004-x86_64-100.9.4.tgz" --output mdb.tgz
RUN tar -zxvf mdb.tgz && mv mongodb-database-tools-ubuntu2004-x86_64-100.9.4 mdb
RUN cp mdb/bin/* /usr/local/bin/ && rm mdb.tgz && rm -r mdb

ENTRYPOINT [ "/bin/bash"]
CMD ["mtob-all.bash"]
mtob-Dockerfile – build our image with gcloud and mongoexport installed

Recap

We’re almost there – all that remains now is to set up a job or a cronjob (I’ll show how to do both) to run this container and do all the exports according to some schedule. Here’s what we have

  • A kubernetes service account (KSA) linked to a gcp service account (GSA) which has the necessary permissions to access cloud storage and bigquery.
  • Workload identity federation setup to use this KSA
  • Credentials for mongo in a kubernetes secret
  • A script that can take an array of collections from mongo, back them up to cloud storage and load them into a bigquery dataset
  • An image in the artifact repository that kubernetes can run to do all that, parameterized to segregate multiple modes and namespaces

Kubernetes yaml files

We’ll create 2 yaml files

  • one to run a Kubernetes job to do a once off import
  • one to run a Kubernetes cronjob to run at regular intervals

job yaml file

Some notes on the job yaml file

  • The podFailurePolicy is used to communicate to kubectl that there has been an error (so kubectl get pods will report the error). Back in mtob-all.bash, we exit 88 on an error. This yaml file tells kubernetes what to do if exit code 88 is detected.
  • envFrom is used to inject my Mongo credentials from the Kubernetes secret which contains them.
  • env is used to extract the namespace we are running in from the pod metadata and pass it on to the mtob-all.bash script
  • ‘restartPolicy: Never’ means don’t keep running the pod over and over. Just stop when it’s run once.
apiVersion: batch/v1
kind: Job
metadata:
  name: mtob-job
  namespace: gcp-stg
spec:
  backoffLimit: 2
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: mtob
        operator: In
        values: [88]
  template:
    spec:
      serviceAccountName: mtob-robot-gcp-stg
      containers:
      - envFrom:
        - secretRef:
            name: doppler-secrets
        image: yourregion-docker.pkg.dev/yourproject/yourartifactrepo/yourimage:latest
        name: mtob
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      restartPolicy: Never
mtob-job.yaml – run the container once off
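Once the job has been applied (directly, or via the yq wrapper script later in this article), the usual kubectl commands will show how it’s getting on:

kubectl get pods -n gcp-stg
kubectl logs job/mtob-job -n gcp-stg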

cron job yaml file

Some notes on the cron job yaml file.

  • This is very similar to the job yaml files, except that we add a section to do with scheduling and history.
  • The schedule uses crontab syntax to set when and how often to run this thing. This example runs it at the top of every hour. Here’s a very handy site to interpret cron syntax.
  • When a Kube job finishes, it leaves the pod visible with a status of ‘completed’. Running something every hour will leave clutter, so setting history limits can clean away old pod entries from kubectl. It’s worth setting these high to start, then reducing them once you’re comfortable everything is golden.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mtob-cron
  namespace: gcp-stg
spec:
  schedule: "0 0/1 * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      podFailurePolicy:
        rules:
        - action: FailJob
          onExitCodes:
            containerName: mtob
            operator: In
            values: [88]
      template:
        spec:
          serviceAccountName: mtob-robot-gcp-stg
          containers:
          - envFrom:
            - secretRef:
                name: doppler-secrets
            image: yourregion-docker.pkg.dev/yourproject/yourartifactrepo/yourimage
            name: mtob
            env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          restartPolicy: Never
mtob-cron.yaml – to run the container according to a schedule
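To check the schedule has registered, or to trigger an immediate run without waiting for the next slot, something like this works:

kubectl get cronjobs -n gcp-stg

# fire a one off job from the cronjob spec
kubectl create job mtob-manual --from=cronjob/mtob-cron -n gcp-stg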

yq

To avoid creating separate yaml files for each namespace and service account name, we can use yq to make minor changes to a basic template, which we’ll do with the scripts below.

This one injects the namespace and service account name into the job yaml

# create a job for mtob
NS=$1
if [ -z "${NS}" ]; then
echo "1st arg should be namespace eg gcp-stg"
exit 1
fi

JOB=mtob-job

# delete current version of job if it exists
C=$(kubectl get jobs -n ${NS} | grep "${JOB}")
if [ -n "${C}" ]; then
kubectl delete job ${JOB} -n ${NS}
fi

# the job template
FILE="${JOB}.yaml"

# the kube service account name
KSA="mtob-robot-${NS}"

# build the spec and apply
yq e ".metadata.namespace = \"${NS}\"" $FILE | \
yq e ".spec.template.spec.serviceAccountName = \"${KSA}\"" - | \
kubectl apply -f -
mtob-job.sh – tweak and apply job yaml

This one injects the namespace and service account name into the cron job yaml

# create a cronjob for mtob
NS=$1
if [ -z "${NS}" ]; then
echo "1st arg should be namespace eg gcp-stg"
exit 1
fi

CRON="mtob-cron"

# delete current version of cronjob if it exists
C=$(kubectl get cronjobs -n ${NS} | grep "${CRON}")
if [ -n "${C}" ]; then
kubectl delete cronjob ${CRON} -n ${NS}
fi

# the job template
FILE="${CRON}.yaml"

# the kube service account name
KSA="mtob-robot-${NS}"

# build the spec and apply
yq e ".metadata.namespace = \"${NS}\"" $FILE | \
yq e ".spec.jobTemplate.spec.template.spec.serviceAccountName = \"${KSA}\"" - | \
kubectl apply -f -
mtob-cron.sh – tweak the cronjob yaml and submit
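For example, deploying the cronjob to both namespaces is then just

sh mtob-cron.sh gcp-stg
sh mtob-cron.sh gcp-prod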

Summary

Since there’s a lot going on in this article, I’ve used a series of scripts to illustrate the steps instead of going into too much detail on the background. Watch out for future deep-dive articles on some of these topics, in particular on workload identity. As usual, contact me if you are interested in any particular subject mentioned here.

Links

These scripts are on github.