As the scriptDB deprecation date approaches, you may have decided which database to move to and be making plans for the switchover. The problem is that if you have many thousands of records, you might hit the execution-time quota. You could certainly split the work up, but as noted in Comparing all back ends performance, the most you'll get out of scriptDB querying is about 2000 records a minute, and if your back end is parse.com (about the fastest), the most you can write is about 5000 a minute. So if you have more than about 8000 records you'll probably hit the 6 minute limit. Running a very large conversion in serial could tie you up for a while.
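Those throughput figures make it easy to estimate whether a migration will fit in a single run. Here's a rough sketch – the rates and quota are the approximate figures quoted above, not guarantees, and real throughput will vary:

```javascript
// Back-of-the-envelope migration time estimate, using the rough
// throughput figures quoted above (these will vary in practice).
var READ_PER_MIN = 2000;   // approx scriptDB query throughput
var WRITE_PER_MIN = 5000;  // approx parse.com write throughput
var QUOTA_MINUTES = 6;     // Apps Script execution time limit

function estimateMinutes(records) {
  // reading and writing happen in the same run, so the times add up
  return records / READ_PER_MIN + records / WRITE_PER_MIN;
}

function fitsInOneRun(records) {
  return estimateMinutes(records) <= QUOTA_MINUTES;
}
```

At these rates, 8000 records takes about 5.6 minutes – uncomfortably close to the limit, which is why anything bigger needs splitting up.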
Parallel processing with unpredictable sequencing is not an easy subject, but once you figure it out, it may be useful for all sorts of longer running processes.
Libraries
You’ll need these libraries
cTriggerHappy:
MuIOvLUHIRpRlID7V_gEpMqi_d-phDA33
which you can include or fork as you prefer. Another library you need in your application is Database abstraction with google apps script, which is
Mj61W-201_t_zC9fJg1IzYiz3TLx7pV4j
Starting off
You should read Parallel processing in Apps Script for the detail of how this works, but in summary –
- Your job is split into a number of chunks.
- Each chunk is run independently and simultaneously – each of these runs is called a 'map'.
- An additional run called 'reduce' is scheduled to run when all the mapping is finished. This consolidates the results of each of the maps into a single result.
- A final run called 'process' is scheduled and run. This does any post-processing required once all the data has been consolidated by the reduce function.
Something like this.
function migrate() {
  var scriptHandler = new cDataHandler.DataHandler ('random',
    cDataHandler.dhConstants.DB.SCRIPTDB,
    undefined,
    'myddb',
    ScriptDb.getMyDb());
  assert(scriptHandler.isHappy(), 'unable to get scriptdb handler', 'handler');

  var userStore = PropertiesService.getScriptProperties();
  var handler = new cDataHandler.DataHandler ('random',
    cDataHandler.dhConstants.DB.PARSE,
    undefined,
    'myParse',
    JSON.parse(userStore.getProperty("parseKeys")));
  assert(handler.isHappy(), 'unable to get parse handler', 'handler');

  var result = handler.save (scriptHandler.query().data);
  assert (result.handleCode >= 0, result, 'writing to parse');
}
The problem starts when you hit quota limits on the time to read and write. Parse is much faster than scriptDB (see Comparing all back ends performance), but you can still soon hit the processing time limits doing both. We could split it into bits like this – this would be the easy route.
var result = handler.save (scriptHandler.query(undefined, {limit:1000}).data);
assert (result.handleCode >= 0, result, 'writing to parse');

var result = handler.save (scriptHandler.query(undefined, {limit:1000, skip:1000}).data);
assert (result.handleCode >= 0, result, 'writing to parse');

var result = handler.save (scriptHandler.query(undefined, {limit:1000, skip:2000}).data);
assert (result.handleCode >= 0, result, 'writing to parse');

var result = handler.save (scriptHandler.query(undefined, {limit:1000, skip:3000}).data);
assert (result.handleCode >= 0, result, 'writing to parse');

... etc
But we'd need to run each one separately, and looking after that could take a long time. Using the boilerplate from Parallel processing in Apps Script, we can do something like this.
The split
We'll set some maximum chunk size that should be run in one go, then use the boilerplate below – it deletes all the current parse.com data, gets the size of the current scriptDB, and lines up a number of map jobs to run.
function splitJobIntoTasks () {
  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl());

  // get the scriptdb to see how much work there is
  var handler = getTheScriptDb();
  var result = handler.count();
  assert (result.handleCode >= 0, result, 'counting scriptdb');
  var toDo = result.data[0].count;
  var total = toDo;
  var maxChunkSize = 1000;

  tHappy.log (null, 'deleting parse data', 'splitJobIntoTasks');
  var parseHandler = getTheParse();
  var result = parseHandler.remove();
  assert (result.handleCode >= 0, result, 'deleting parse data');

  // i'm splitting the work in chunks
  tHappy.log (null, 'starting to split', 'splitJobIntoTasks');
  tHappy.init ();

  var chunkNumber;
  for (chunkNumber = 0; toDo > 0 && maxChunkSize; chunkNumber++) {
    var n = Math.min(toDo, maxChunkSize);
    tHappy.saveTask ({ index: chunkNumber, skip: total - toDo, limit: n });
    toDo -= n;
  }

  tHappy.log (null, 'finished splitting', 'splitJobIntoTasks');
  tHappy.triggerTasks ();
  tHappy.log (null, 'triggering is done', 'splitJobIntoTasks');
  return chunkNumber;
}
The Map
Some number of these will be scheduled to run simultaneously – each one will run a chunk of work as defined by tHappy.saveTask() during the split phase.
function workMap() {
  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl());
  try {
    // your result goes here - first find something to do
    var result = {
      data: null,
      handleCode: 0,
      handleError: '',
      task: tHappy.somethingToMap()
    };
    // if anything to do
    if (result.task) {
      tHappy.log (null, 'starting mapping for job ' + result.task.jobKey +
        '/' + result.task.taskIndex + ' task ' + result.task.key, 'workMap');

      // get the scriptDB and parse.com handlers
      var handler = getTheScriptDb();
      var parseHandler = getTheParse();

      // get the data segment
      var sData = handler.query(undefined, {
        skip: result.task.params.skip,
        limit: result.task.params.limit
      });
      assert (sData.handleCode >= 0 && sData.data.length === result.task.params.limit,
        sData, ' getting scriptdb data');
      tHappy.log (null, 'got ' + sData.data.length +
        ' records from scriptDB starting at ' + result.task.params.skip, 'workMap');
      result.data = [result.task.params.skip, sData.data.length];

      // write it to parse
      var pData = parseHandler.save(sData.data);
      assert (pData.handleCode >= 0, result, ' writing parse data');
      tHappy.log (null, 'wrote ' + sData.data.length +
        ' records to parse starting at ' + result.task.params.skip, 'workMap');

      tHappy.finished (result);
      tHappy.log (null, 'finished mapping', 'workMap');
    }
  }
  catch (err) {
    tHappy.log (null, err, 'mapping failure');
    throw err;
  }
  return {handleError: result.handleError, handleCode: result.handleCode};
}
The reduction
In this example there's not much to do, since the map operations were independent of each other; this step just consolidates the results and triggers a process. To ensure workflow integrity, a reduce must always follow a set of maps, even though in this case it's not doing a lot.
function workReduce () {
  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl());
  try {
    // bring all the results together
    tHappy.log (null, 'starting reduction', 'workReduce');
    tHappy.reduce();
    tHappy.log (null, 'finishing reduction', 'workReduce');
  }
  catch (err) {
    tHappy.log (null, err, 'reduction failure');
    throw err;
  }
}
The process
Again, not too much to do. We’ll just check that parse.com ended up with the expected number of records.
function workProcess() {
  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl());
  // all is over - we get the reduced data and do something with it
  try {
    var reduced = tHappy.somethingToProcess ();
    tHappy.log (null, 'starting processing for job ' +
      (reduced ? JSON.stringify(reduced) : ' - but nothing to do'), 'workProcess');
    if (reduced) {
      // we'll just report on the parse records
      var parseHandler = getTheParse();
      var pData = parseHandler.count();
      assert (pData.handleCode >= 0, pData, 'counting parse data');
      tHappy.log (null, 'parse contains ' + pData.data[0].count + ' records', 'workProcess');

      // mark it as processed
      tHappy.processed (reduced);
      tHappy.log (null, 'finished processing', 'workProcess');

      // clean up any triggers - we know we're done
      tHappy.cleanupAllTriggers();
    }
  }
  catch (err) {
    tHappy.log (null, err, 'processing failure');
    throw err;
  }
}
Some utilities
I created a couple of functions that are used a number of times here. Note that I've specified a siloId ('random') in this example. If you are already using Database abstraction with google apps script, then your scriptDB will have this. If not, you can just leave it as null and it will take the whole scriptDB. Note also that you'll need to have set up your parse.com credentials in your property store (see DriverParse).
function getTheScriptDb() {
  var handler = new cDataHandler.DataHandler ('random',
    cDataHandler.dhConstants.DB.SCRIPTDB,
    undefined,
    'myddb',
    ScriptDb.getMyDb());
  assert(handler.isHappy(), 'unable to get scriptdb handler', 'handler');
  return handler;
}

function getTheParse() {
  var userStore = PropertiesService.getScriptProperties();
  var handler = new cDataHandler.DataHandler ('random',
    cDataHandler.dhConstants.DB.PARSE,
    undefined,
    'myParse',
    JSON.parse(userStore.getProperty("parseKeys")));
  assert(handler.isHappy(), 'unable to get parse handler', 'handler');
  return handler;
}
The log
You'll notice tHappy.log() being called a few times. It's tough to debug simultaneous triggered tasks, but .log() can be used to report the progress of each of the various threads in a single place. Here's a typical log extract. You'll see that pretty much all 6 mapping operations were running simultaneously, and about 15 minutes' worth of processing was done in about 6.
30 July 2014 15:07:34 BST | triggered 1 tasks from 1 (thread limit 0)
The control
This can use any of the Database abstraction with google apps script supported backends for control files and logs, but I prefer using Drive for control files and Sheets for the log file. You can set all that up here. You'll need to create and specify Drive folders and a spreadsheet to use for logging if you are using the defaults below.
function getControl () {
  return {
    script: {
      id: "1i5qxNh1qs_-b6hKRMY_W1MR4nnwuyhHQG6Y9tSINrgyqFQW6qH-GMP__",
      reduceFunction: 'workReduce',
      taskFunction: 'workMap',
      processFunction: 'workProcess'
    },
    taskAccess: {
      siloId: 'migrateTasks.json',
      db: cDataHandler.dhConstants.DB.DRIVE,
      driverSpecific: '/datahandler/driverdrive/tasks',
      driverOb: null
    },
    logAccess: {
      siloId: 'thappylog',
      db: cDataHandler.dhConstants.DB.SHEET,
      driverSpecific: '12pTwh5Wzg0W4ZnGBiUI3yZY8QFoNI8NNx_oCPynjGYY',
      driverOb: null
    },
    reductionAccess: {
      siloId: 'migrateReductions.json',
      db: cDataHandler.dhConstants.DB.DRIVE,
      driverSpecific: '/datahandler/driverdrive/tasks',
      driverOb: null
    },
    jobAccess: {
      siloId: 'migrateJobs.json',
      db: cDataHandler.dhConstants.DB.DRIVE,
      driverSpecific: '/datahandler/driverdrive/tasks',
      driverOb: null
    },
    reportAccess: {
      siloId: 'thappyreport',
      db: cDataHandler.dhConstants.DB.SHEET,
      driverSpecific: '12pTwh5Wzg0W4ZnGBiUI3yZY8QFoNI8NNx_oCPynjGYY',
      driverOb: null
    },
    triggers: true,
    delay: 5000,
    enableLogging: true,
    threads: 0,
    stagger: 1000,
    killStage: false,
    debugGrabbing: false
  };
}
The report
Getting a report of the orchestration database or Drive files can sometimes be helpful. You can get one like this, and it will end up wherever you've set control.reportAccess to save to.
function report () {
  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl());
  tHappy.report();
}
Orchestration
One of the biggest challenges of all this was the locking mechanism between multiple instances of the same script. I'm using the library from Using named locks with Google Apps Scripts for all locking. Whether or not to log debugging of locking attempts can be switched on and off with the control.debugGrabbing property. It's used like this. Take a look at this log output if things don't seem to be going as expected.
function grabControl (who, f) {
  if (control_.debugGrabbing) {
    self.log (null, who, 'trying to grab ' + new Date().getTime());
  }
  var info;
  var r = new cNamedLock.NamedLock(undefined, undefined, true)
    .setKey('thappy', control_.jobAccess)
    .protect(who + '_' + instance_, function (lock) {
      if (control_.debugGrabbing) {
        info = ':lock-' + JSON.stringify(lock.getInfo());
        self.log (null, who, 'grabbed ' + new Date().getTime() + info);
      }
      return f();
    })
    .result;
  if (control_.debugGrabbing) {
    self.log (null, who, 'ungrabbed ' + new Date().getTime() + info);
  }
  return r;
}
You can see the kind of output you'll get below, where you can follow the game between the tasks as each tries to control who does what next. Note that you should use the timestamp when examining the log – since the log is also protected by locking, it's not guaranteed that locks are granted in the order they are asked for, because deadlocks are avoided by using a random retry interval.
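The random retry idea can be sketched as below. This is a simplified illustration of the technique only, not the actual cNamedLock internals; tryToGrab and sleep are hypothetical stand-ins, and in Apps Script you would pass Utilities.sleep as the sleep function.

```javascript
// Simplified sketch of lock acquisition with randomized backoff.
// A random wait between retries stops two script instances from
// retrying in lockstep and starving each other indefinitely.
function acquireWithBackoff(tryToGrab, sleep, maxAttempts) {
  for (var attempt = 0; attempt < maxAttempts; attempt++) {
    if (tryToGrab()) return true;                 // got the lock
    // wait somewhere between 500ms and 2s before trying again
    sleep(500 + Math.floor(Math.random() * 1500));
  }
  return false;                                   // gave up
}
```

In Apps Script you'd call this as acquireWithBackoff(tryFn, Utilities.sleep, 5). cNamedLock wraps this kind of loop up with lock ownership and expiry; this is just the retry skeleton.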
For help and more information join our forum, follow the blog or follow me on Twitter