A friend of mine hit an Apps Script problem the other day when transferring large amounts of data between Drive and Cloud Storage. We’ve all had the mysterious ‘unexpected JavaScript runtime error’ from the IDE that comes and goes, and it’s down to a number of things, one of which appears to be UrlFetch silently timing out.

I wondered if there might be a better way to solve this problem by using Cloud Run to handle the cloud transfers directly. Of course this means that Apps Script isn’t really involved at all, except to make the initial Cloud Run request, and if Apps Script isn’t involved it also means we can do streaming transfers rather than heave around large POST bodies.

The solution I came up with runs on Node, from the CLI or really from any app that can make a POST. Here’s the writeup from GitHub.

bm-drive-cloud

For transferring files between drive and gcs from node or using cloud run

background

The idea here is to use the Google Cloud Storage and Drive APIs to stream files between them, using Node to control what’s being copied. All the files are copied in parallel, and, even more interestingly, a flavour of this can be deployed on Google Cloud Run – this means that you can use it from Apps Script (or anything else that can POST) to transfer files blisteringly fast between Drive and GCS.
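Just to illustrate the streaming idea, here’s a minimal sketch – not the library’s actual code, and it assumes the googleapis and @google-cloud/storage packages, an already-built Drive auth client, and GCS credentials already in place:

const { google } = require('googleapis');
const { Storage } = require('@google-cloud/storage');

// copy one Drive file into a GCS bucket without holding the whole thing
// in memory - the download stream is piped straight into the upload stream
async function copyDriveFileToGcs(driveAuth, fileId, bucketName, destPath) {
  const drive = google.drive({ version: 'v3', auth: driveAuth });
  const storage = new Storage(); // assumes GCS credentials are already set up
  const res = await drive.files.get(
    { fileId, alt: 'media' },
    { responseType: 'stream' }
  );
  return new Promise((resolve, reject) => {
    res.data
      .pipe(storage.bucket(bucketName).file(destPath).createWriteStream())
      .on('finish', resolve)
      .on('error', reject);
  });
}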

configuration and parameter files

Transfer and authentication is controlled via JSON files.

service account files

Service accounts are used to give bm-drive-cloud access to each of the cloud platforms. In the case of Drive, impersonation is also supported, as the concept of ‘My Drive’ would normally be associated with a user rather than a service account. You can supply multiple service accounts, as you may want to keep separate permissions depending on the platform.

sa.json

This should be a combined file containing an array of the service account credentials needed to access Drive and/or GCS. Here’s an example.

[
  {
    "name": "drive",
    "content": {
      "type": "service_account",
      "project_id": "xxx",
      "private_key_id": "xxx",
      ...etc as downloaded from the cloud console
    }
  },
  {
    "name": "gcs",
    "content": {
      "type": "service_account",
      "project_id": "xxx",
      "private_key_id": "xxx",
      ...etc as downloaded from the cloud console
    }
  }
]

The name can be whatever you like, and the content should be the complete contents of the .json service account credential file. Additional info on how to set up service accounts and enable Drive delegation will come later in this documentation.

work.json

This contains the copying work required. All files mentioned are streamed simultaneously. The work file is an array of instructions, each instruction with 4 parts:

| property | purpose | values |
| --- | --- | --- |
| op | the operation | currently only cp (copy) is implemented |
| from | the ‘from’ platform | the source of the files to operate on |
| from.sa | service account name | the name you’ve given the service account in sa.json |
| from.type | type of platform | ‘drive’ or ‘gcs’ is supported |
| from.subject | who to impersonate | the email address of the person to impersonate – valid for drive, ignored for gcs |
| to | the ‘to’ platform | all the properties are the same as for ‘from’ |
| files | array of files | the files that this from/to applies to |
| files.from | a path | where to copy the file from, using the ‘from’ credentials already provided |
| files.to | a path | where to copy the file to, using the ‘to’ credentials already provided |

work.json example

[
  {
    "op": "cp",
    "from": {
      "sa": "gcs",
      "type": "gcs"
    },
    "to": {
      "sa": "drive",
      "type": "drive",
      "subject": "bruce@mcpher.com"
    },
    "files": [
      {
        "to": "/images/d.png",
        "from": "mybucketname/images/c.png"
      },
      {
        "to": "/images/e.png",
        "from": "mybucketname/images/e.png"
      }
    ]
  },
  {
    "op": "cp",
    "to": {
      "sa": "gcs",
      "type": "gcs"
    },
    "from": {
      "sa": "drive",
      "type": "drive",
      "subject": "bruce@mcpher.com"
    },
    "files": [
      {
        "from": "/images/x.png",
        "to": "somebucketname/images/x.png"
      },
      {
        "from": "xxxxxx_zzz (accepts fileids as well as folder paths)",
        "to": "someotherbucketname/images/i.png"
      }
    ]
  }
]

Note that files.to and files.from for Drive accept a folder path on My Drive, although files.from with a path may resolve to multiple files – in which case the latest is used. files.from also accepts a Drive fileId.

For the GCS platform, the bucket name must be included in the path.

Note that you can also copy between the same platform (for example from one folder in Drive to another, or from one GCS bucket to another).
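For example, a Drive-to-Drive instruction might look something like this (the paths here are just made up for illustration):

[
  {
    "op": "cp",
    "from": {
      "sa": "drive",
      "type": "drive",
      "subject": "bruce@mcpher.com"
    },
    "to": {
      "sa": "drive",
      "type": "drive",
      "subject": "bruce@mcpher.com"
    },
    "files": [
      {
        "from": "/images/c.png",
        "to": "/backup/c.png"
      }
    ]
  }
]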

result

The result from both the CLI and Cloud Run is a summary of what was copied, along with how long each file took, its size and the created fileId. This is the result from Cloud Run of copying multiple large files. Note that it seems to run at about 20 megabytes per second on Cloud Run – this example was 2 x 30MB files copied simultaneously. I’m not sure at this point whether parallelism even slows things down – I’ll need to try with a whole bunch of files simultaneously when I have time.

[
	[{
		"size": 33027954,
		"took": 3366,
		"to": {
			"pathName": "bmcrusher-test-bucket-store/dump/202106/14.csv",
			"type": "gcs",
			"mimeType": "text/csv",
			"fileId": "dump%2F202106%2F14.csv"
		},
		"from": {
			"pathName": "/dump/20210614.csv",
			"type": "drive",
			"mimeType": "text/csv",
			"fileId": "xxx"
		}
	}, {
		"size": 29189802,
		"took": 3035,
		"to": {
			"pathName": "bmcrusher-test-bucket-store/dump/202106/16.csv",
			"type": "gcs",
			"mimeType": "text/csv",
			"fileId": "dump%2F202106%2F16.csv"
		},
		"from": {
			"pathName": "/dump/20210616.csv",
			"type": "drive",
			"mimeType": "text/csv",
			"fileId": "xxx"
		}
	}]
]

how to run

clone this repo, and set up your work.json and sa.json files

from cli

node localindex -s sa/sa.json -w jsons/work.json

as an express app

this is normally run from cloud run, but you can test it out with curl.

this will kick off a server

node index
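For orientation, the shape of such a server is roughly the sketch below – this is not the repo’s actual index.js, and doTransfers is a made-up stand-in for whatever kicks off the copies:

const express = require('express');

const app = express();
app.use(express.json({ limit: '50mb' })); // the body is just the two config JSONs

app.post('/', async (req, res) => {
  const { work, sa } = req.body;
  // doTransfers is a hypothetical name for the real copy logic
  const result = await doTransfers(work, sa);
  res.json(result);
});

// Cloud Run tells the container which port to listen on via PORT
app.listen(process.env.PORT || 8080);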

The post body consists of the sa and work JSONs combined – like this

{
  work: { ... the content of work.json },
  sa: {... the content of sa.json}
}

If your json files are separate, you can combine them to post with curl like this

curl -X POST -H "Content-Type:application/json"  -d "{\"work\":$(cat jsons/work.json),\"sa\":$(cat sa/sa.json)}" \
http://localhost:8080
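If you’d rather do the same thing from Node instead of curl, a minimal sketch (assuming Node 18+ so fetch is built in, and the same file locations as above) might look like this:

const fs = require('fs');

async function postWork(url) {
  // build the same combined body as the curl command above
  const body = {
    work: JSON.parse(fs.readFileSync('jsons/work.json', 'utf8')),
    sa: JSON.parse(fs.readFileSync('sa/sa.json', 'utf8'))
  };
  const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body)
  });
  console.log(await response.json());
}

postWork('http://localhost:8080');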

from cloud run

There’s a fair bit of setup to go through to get your Cloud Run endpoint up and running.

authenticate

First make sure you are logged in to gcloud (otherwise you’ll have all sorts of worrying and surprising errors)

gcloud auth login

Build a container

You can use the Dockerfile in the repo. Here my project id is ‘bmcrusher-test’ and my build is going to be called ‘bm-drive-cloud’. This will use Cloud Build to create a container, and will also upload it to the container registry. There are various APIs it needs to enable in your project, but the build process generally asks if it can. You’ll need billing enabled on your project, but Cloud Run has a generous free tier.

gcloud builds submit --tag gcr.io/bmcrusher-test/bm-drive-cloud

deploy the container

This will deploy and create an endpoint for your app on Cloud Run. You’ll be asked a few things – you’ll most likely want to pick ‘Cloud Run (fully managed)’ and accept the default service name. There are some hints during the dialog on how to make your choices permanent for the future if you want to.

Next you’ll be asked for a region – probably best to pick the same one that your Cloud Storage is hosted in.

Finally you’ll be asked whether unauthenticated invocations are allowed. We’ll deal with authentication (to the Cloud Run endpoint) later. Authentication to Drive and Storage is of course already in place via your service accounts in sa.json – this setting only controls who can invoke the Cloud Run endpoint at all. For now, just allow unauthenticated invocations.

gcloud run deploy --image gcr.io/bmcrusher-test/bm-drive-cloud
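If you’d rather skip the prompts, the same choices can be passed as flags – the region here is just an example, so substitute your own:

gcloud run deploy bm-drive-cloud \
  --image gcr.io/bmcrusher-test/bm-drive-cloud \
  --platform managed \
  --region europe-west1 \
  --allow-unauthenticated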

A link to a log file will be displayed that you can look at to check for any deployment errors.

test cloud run

At the end of the deployment you’ll see something like this

Service [bm-drive-cloud] revision [bm-drive-cloud-xxxx] has been deployed and is serving 100 percent of traffic.
Service URL: https://bm-drive-cloud-zzzzzz.run.app

The service URL is how you access your app via a POST, so we can just repeat the curl command from earlier to test it, this time using the service URL.

curl -X POST -H "Content-Type:application/json"  -d "{\"work\":$(cat jsons/work.json),\"sa\":$(cat sa/sa.json)}" \
https://bm-drive-cloud-zzzzzz.run.app

You can see the log files in the cloud console under Cloud Run.

setting up the service accounts

You need to create service accounts with Storage and Drive permissions (or you could use one with both permissions), download the key file(s) and set up your sa.json file as described earlier. If you are using Drive, you’ll also need to enable G Suite domain-wide delegation and allow it to access Drive scopes. All this is desperately yawn-making stuff, so I won’t repeat it here – but you can see some screenshots of how to do it in The nodejs Drive API, service account impersonation and async iterators here.
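For illustration only, here’s roughly how impersonation looks in Node with google-auth-library – this isn’t bm-drive-cloud’s actual code, and the key file path and scope are made-up examples:

const { JWT } = require('google-auth-library');
const { google } = require('googleapis');

// made-up path - use wherever you keep your downloaded Drive key file
const key = require('./sa/drive-key.json');

const auth = new JWT({
  email: key.client_email,
  key: key.private_key,
  scopes: ['https://www.googleapis.com/auth/drive'],
  // the user whose My Drive you want to act on - only works once
  // domain-wide delegation has been set up for this service account
  subject: 'bruce@mcpher.com'
});

const drive = google.drive({ version: 'v3', auth });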

plugins

Currently this only supports Drive and GCS. However, it’s a plugin architecture, so we can add other providers like Dropbox, and of course local files from Node too – watch this space.

todo

Error handling still needs some work, and I’ll be releasing the core of this as an npm module shortly so you can build your own wrapper apps easily.

apps script

I’ll post an Apps Script demo of this shortly (but it’s just a vanilla UrlFetchApp post)
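In the meantime, a minimal sketch of what that post might look like – the service URL is the one from your deployment, and getWorkJson/getSaJson are hypothetical helpers that return objects shaped like work.json and sa.json:

function transferFiles() {
  // the service URL from your Cloud Run deployment
  const url = 'https://bm-drive-cloud-zzzzzz.run.app';
  // the same combined body shape as the curl examples
  const payload = {
    work: getWorkJson(), // hypothetical helpers returning objects shaped
    sa: getSaJson()      // like work.json and sa.json
  };
  const response = UrlFetchApp.fetch(url, {
    method: 'post',
    contentType: 'application/json',
    payload: JSON.stringify(payload)
  });
  Logger.log(response.getContentText());
}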

cloud run authentication

I’ll cover how to protect your cloud run end point with authentication in another post

collaboration

If you want to participate in more development of this, ping me at bruce@mcpher.com.

Links

github

Bruce Mcpherson – July 2021