is my latest solution to these conundrums and has these features:
- Run stages in sequence
- Automatically split stages of jobs into manageable chunks that can be run in parallel.
- Watch progress on a webapp dashboard.
- Persist all the results server side, and have access to the results of any previous stage.
- Automatically reduce the results of parallel chunks.
- Be restartable (not yet, but soon)
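To make the split, map and reduce idea concrete, here is a minimal sketch in plain JavaScript. It is not the orchestration code itself: `splitIntoChunks`, `runStage`, `runStages` and the stage object shape are hypothetical names invented for illustration, and the chunks run one after another here, whereas the real thing hands each chunk to a separate script instance.

```javascript
// Hypothetical helper: split an array into consecutive chunks of at most chunkSize items.
function splitIntoChunks(items, chunkSize) {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new Error("chunkSize must be a positive integer");
  }
  const chunks = [];
  for (let i = 0; i < items.length; i += chunkSize) {
    chunks.push(items.slice(i, i + chunkSize));
  }
  return chunks;
}

// Hypothetical helper: run one stage as split -> map each chunk -> reduce.
// In this sketch the chunks are processed sequentially, not in parallel.
function runStage(items, stage) {
  const chunks = splitIntoChunks(items, stage.chunkSize);
  const chunkResults = chunks.map((chunk, index) => stage.mapChunk(chunk, index));
  return stage.reduceResults(chunkResults);
}

// Run stages in sequence: the reduced output of one stage is the input of the next.
function runStages(stages, initialItems) {
  return stages.reduce((items, stage) => runStage(items, stage), initialItems);
}

// Reduce by concatenating the per-chunk arrays back into one array.
const concatChunks = (results) => [].concat(...results);

const stages = [
  { chunkSize: 3, mapChunk: (chunk) => chunk.map((n) => n * 2), reduceResults: concatChunks },
  { chunkSize: 4, mapChunk: (chunk) => chunk.filter((n) => n % 4 === 0), reduceResults: concatChunks }
];

const output = runStages(stages, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
// output is [4, 8, 12, 16, 20]
```

Keeping each stage as a plain object with its own chunk size and map/reduce functions is just one way to describe a pipeline; it makes it easy to see that a stage without parallel threads is simply one with a single chunk.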
The job is to retrieve almost 1m rows from a Fusion table, analyze and manipulate the data, and generate optimized SQL scripts so that it can be imported into a Google Spanner or CockroachDB SQL database. This amount of data can't even be loaded into Apps Script within the regular run time limit. In other words, I can't even get to stage 3.
Here’s a couple of screenshots from a large run, in the middle of running, with some notes.
- The Getting data stage has been split into 24 “Threads”, all running the same script “Orchestration.getData”, in parallel.
- The light yellow blobs are threads currently in progress, and the dark yellow are threads that have completed so far.
- “Items in” shows the number of items that need to be processed by each stage, and “Items out” shows how many items the stage created.
- There are implicit split, map and reduce operations associated with each stage.
- The data is never passed over to the client (in this case it would be too large anyway). It is persisted server side, using compression and cache record linking to overcome cache quota limits (see Zipping to make stuff fit in cache or properties service).
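As a rough illustration of the record linking idea mentioned in the last note, the sketch below spreads one large string across several small entries and keeps a header entry that lists their keys. Everything here is an assumption made for illustration: `makeStore` is an in-memory stand-in for a cache, and `putLinked` and `getLinked` are hypothetical names, not the functions used in the actual run. Compression is left out; it would be applied to the string before it is split.

```javascript
// Hypothetical in-memory stand-in for a key/value cache with a per-entry size limit.
function makeStore() {
  const map = new Map();
  return {
    put: (key, value) => { map.set(key, value); },
    get: (key) => (map.has(key) ? map.get(key) : null),
    size: () => map.size
  };
}

// Store text under key, split into linked records of at most maxChunkLength characters.
// Note: length is measured in UTF-16 code units here, not bytes.
function putLinked(store, key, text, maxChunkLength) {
  const chunkKeys = [];
  for (let i = 0; i < text.length; i += maxChunkLength) {
    const chunkKey = key + "_chunk_" + chunkKeys.length;
    store.put(chunkKey, text.slice(i, i + maxChunkLength));
    chunkKeys.push(chunkKey);
  }
  // The header record links to every chunk, in order.
  store.put(key, JSON.stringify({ chunkKeys: chunkKeys }));
}

// Follow the links in the header record and reassemble the original text.
// Returns null if the header or any linked record is missing.
function getLinked(store, key) {
  const header = store.get(key);
  if (header === null) return null;
  const parts = [];
  for (const chunkKey of JSON.parse(header).chunkKeys) {
    const part = store.get(chunkKey);
    if (part === null) return null;
    parts.push(part);
  }
  return parts.join("");
}

// Usage: persist a stage result that is bigger than the per-entry limit.
const store = makeStore();
const stageResult = JSON.stringify({ rows: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] });
putLinked(store, "stage1", stageResult, 10);
const restored = getLinked(store, "stage1");
```

Returning `null` when any linked record is missing treats a partly expired value the same as a cache miss, which seems the safest behaviour for a cache where entries can disappear independently.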
Here we are a little further along. The blue shows a reduce operation in progress.
And finally, the job is complete.
The final output is a zip file containing a set of SQL files optimized for bringing this data into CockroachDB, which I’ll use over on my Getting cockroachdb running on google cloud platform page.
It took just over 4 minutes to run 9 stages on the data from almost 1m rows of a Fusion table. 4 of the stages had parallel threads, so in fact there were 84 script instances running here. That's 2 hours of processing in 4 minutes!