
Parallel processing in Apps Script

There's no getting away from the fact that Apps Script is slower than the equivalent client-based JavaScript processing. It is fundamentally synchronous in implementation, and also has limits on processing time and a host of other quotas. For a cloud-based, free service that's about extending Drive capabilities rather than being scalable in the manner of Google App Engine, I suppose that's normal. But let's see if we can at least subvert these two things:
  • get over the 6 minute maximum execution time for Apps Script
  • run things in parallel
I figured that if I implemented a rudimentary Map/Reduce capability that could split a meaty task into multiple chunks, run them all at the same time on separate threads, then bring the results together for final processing, I could achieve both of these goals. The TriggerBuilder service is key to this, but it's rather difficult to control execution. Specifically, consider this innocent-looking sentence taken from the documentation:

Specifies the duration (in milliseconds) after the current time that the trigger will run. (plus or minus 15 minutes).

Plus or minus 15 minutes…  (why specify in milliseconds?)
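To make that concrete, here's roughly what a deferred trigger request looks like using the standard ScriptApp service (a minimal sketch; 'myFunction' is just an example name):

function scheduleSoon () {
  // ask for myFunction to run 5000 milliseconds from now -
  // but per the documentation, it could actually fire
  // as much as 15 minutes either side of that
  ScriptApp.newTrigger('myFunction')
    .timeBased()
    .after(5000)
    .create();
}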

In any case, let's press on and see what we have here. Here's a primer for a way to orchestrate parallel tasks:

TriggerHappy


Libraries

I provide a library (cTriggerHappy) - MuIOvLUHIRpRlID7V_gEpMqi_d-phDA33 - which you can include or fork as you prefer. Another library you need in your application is Database abstraction with google apps script, which is Mj61W-201_t_zC9fJg1IzYiz3TLx7pV4j. If you are forking the library, it needs Database abstraction with google apps script and Using named locks with Google Apps Scripts.

How to set up

This is fairly extreme scripting, so it's a little complex. You should first take a look at the primer slides and start with a copy of an example application.

The control object. 

This is used to manage orchestration, and specifies various things including some setup so you can use Database abstraction with google apps script. Although you could probably use any of the supported back end databases, I recommend Google Drive for the data, and a spreadsheet for logging and reporting. Here's an example control function, which you should tailor to your own environment. There are five data types, each of which could be held in an independent data store if required.

// this is the orchestration package for a piece of work that will be split into tasks
// it describes where to store itself, and keeps track of all the chunks
// it can be stored in any of the back end databases described in http://ramblings.mcpher.com/Home/excelquirks/dbabstraction
// this example is using google drive

// this identifies this script and the functions it will run
function getControl () {
  return {
    script: {
      id: "1A7lJCKs1KFlj20fBqXjQFne0IhWV0ZpKcYrsYulwxvu__rSZBFnIJPwJ",
      reduceFunction: 'workReduce',
      taskFunction:'workMap',
      processFunction:'workProcess'
    },
    taskAccess: {
      siloId:  'tasks.json',
      db: cDataHandler.dhConstants.DB.DRIVE,
      driverSpecific: '/datahandler/driverdrive/tasks',
      driverOb: null
    },
    logAccess: {
      siloId:  'thappylog',
      db: cDataHandler.dhConstants.DB.SHEET,
      driverSpecific: '12pTwh5Wzg0W4ZnGBiUI3yZY8QFoNI8NNx_oCPynjGYY',
      driverOb: null
    },
    reductionAccess: {
      siloId:  'reductions.json',
      db: cDataHandler.dhConstants.DB.DRIVE,
      driverSpecific: '/datahandler/driverdrive/tasks',
      driverOb: null
    },
    jobAccess: {
      siloId:  'jobs.json',
      db: cDataHandler.dhConstants.DB.DRIVE,
      driverSpecific: '/datahandler/driverdrive/tasks',
      driverOb: null
    },
    reportAccess: {
      siloId:  'thappyreport',
      db: cDataHandler.dhConstants.DB.SHEET,
      driverSpecific: '12pTwh5Wzg0W4ZnGBiUI3yZY8QFoNI8NNx_oCPynjGYY',
      driverOb: null
    },
    triggers: true,
    delay:5000,
    enableLogging:true,
    threads:0,
    stagger:1000
  };
}

Other control parameters.

triggers:true

For testing, you should run with this set to false, then change it to true when everything looks good. With triggers set to false, no triggers are generated; instead the process runs sequentially, inline.

delay:5000

This is the number of milliseconds to wait between trigger creation and execution. The TriggerBuilder choreography seems to be a little more solid if you wait a bit before starting execution of a trigger.

enableLogging:true

Debugging can be tricky with detached processes. This allows logging material to be written to the store described in logAccess:{}

threads:0

When threads is 0, TriggerHappy will attempt to create as many parallel threads as are needed to run everything at once. This might cause quota problems, so you can set it to some number other than 0 to limit the number of parallel processes. When one completes, others will be generated as required.

stagger:1000

This is the number of milliseconds to wait between trigger creations. The TriggerBuilder choreography seems to be a little more solid if you wait a bit between creating triggers. 
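To illustrate how delay and stagger play together, here's a sketch of the idea using the standard ScriptApp and Utilities services (an illustration only - not the library's actual internals):

function triggerSketch (control, nTasks) {
  for (var i = 0; i < nTasks; i++) {
    // delay: ask for each trigger to fire a little while after creation
    ScriptApp.newTrigger(control.script.taskFunction)
      .timeBased()
      .after(control.delay)
      .create();
    // stagger: wait a bit between creating successive triggers
    Utilities.sleep(control.stagger);
  }
}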

script.id:"1A7lJCKs1KFlj20fBqXjQFne0IhWV0ZpKcYrsYulwxvu__rSZBFnIJPwJ"

This is a unique script ID that allows multiple scripts to use the same database. Triggers associated with the given script ID will only execute tasks intended for it.
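Any unique string will do, but if you want the ID to track the containing script automatically, the standard ScriptApp service can supply one (my assumption here is that the library treats this ID as an opaque key, as the description above suggests):

function getUniqueScriptId () {
  // returns the Apps Script project id of the current script
  return ScriptApp.getScriptId();
}

You could then use getUniqueScriptId() in place of the hard-coded string in getControl().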

script.reduceFunction, taskFunction, processFunction

The names of the three functions that will be triggered to do the map, reduce, and process stages.

Splitting up the work

Each job needs to be split into work packages called tasks. These tasks should be capable of running in any order, and need to be independent of each other. Here's an example that splits a job into 5 chunks.


function splitJobIntoTasks () {

  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl());

  // i'm splitting the work into chunks
  tHappy.log(null, 'starting to split','splitJobIntoTasks');
  tHappy.init ();
  var nChunks = 5;

  for (var i=0; i < nChunks ; i++) {
    // this is the parameter package for each task chunk
    // add anything your taskFunction will need to the parameters object
    tHappy.saveTask ( {index:i, something:'some user values', numObs:tHappy.randBetween(20,100)});
  }
  // launch everything
  tHappy.log(null, 'finished splitting');
  tHappy.triggerTasks ();
  tHappy.log(null, 'triggering is done','splitJobIntoTasks');
  return nChunks;
}

The .saveTask() method allows you to pass any parameters you want; they will be available to your taskFunction. Note that I'm using the .log() method regularly to report progress in the log.
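Parameters saved with .saveTask() turn up in the task package that .somethingToMap() later hands to your taskFunction, so you can retrieve them like this (a small sketch based on the workMap example below):

// saved when splitting the job
tHappy.saveTask ({index: i, numObs: 50});

// ...later, inside your taskFunction
var task = tHappy.somethingToMap();
if (task) {
  Logger.log(task.params.numObs);    // 50
}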

The .triggerTasks() method sets off the whole business of scheduling tasks to be mapped.
 

the taskFunction

This is the map stage. Tasks will be scheduled to run each of the chunks of work. The taskFunction is the one that gets called for each chunk, and this is where you execute the point of your application. In this example, I'm generating various random objects, controlled by values I passed to each chunk when I split the tasks in the first place. Depending on the setting of control.threads, all or some of these tasks will be triggered to run simultaneously. Additional threads will be initiated as required until there are no more tasks to deal with.

Note that there are a few mandatory requirements here.

  • create an object with handleCode, handleError, and task properties, filling the task property with something to do via the .somethingToMap() method
var result = {data:null, handleCode:0, handleError:'', task:tHappy.somethingToMap()};
  • store the result (an array) in the data property of the same object
result.data = obs;
  • signal any errors if necessary
result.handleCode = TASK_STATUS.FAILED;
result.handleError = err;
  • signal when complete
tHappy.finished (result);



function workMap() {

  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl());

  // first find something to do, and set up the result package
  var result = {data:null, handleCode:0, handleError:'', task:tHappy.somethingToMap()};

  // if there's anything to do
  if (result.task) {

    tHappy.log(null, 'starting mapping for job ' + result.task.jobKey + '/' + result.task.taskIndex + ' task ' + result.task.key, 'workMap');
    var ob = generateRandomObject(10);
    var obs = [];
    try {
      // this is the work - for illustration, make some random data controlled by the params
      for (var i=0; i < result.task.params.numObs; i++) {
        obs.push(generateRandomValues(ob));
      }
      // store the result and status
      result.data = obs;
    }
    catch(err) {
      // store the error
      result.handleCode = TASK_STATUS.FAILED;
      result.handleError = err;
      tHappy.log(null, err, 'workMap');
      throw(err);
    }
    // update task status
    tHappy.finished (result);
    tHappy.log(null, 'finished mapping');
  }

  return {handleError: result.handleError, handleCode: result.handleCode};

  // makes an object with n empty properties
  function generateRandomObject (n) {
    var ob = {};
    for (var i=0; i<n; i++) {
      ob['x'+i] = null;
    }
    return ob;
  }
  // fills a copy of the given object's properties with random strings
  function generateRandomValues (ob) {
    return Object.keys(ob).reduce(function(p,c) {
      p[c] = tHappy.arbitraryString(tHappy.randBetween(5,20));
      return p;
    }, {});
  }
}

the reduceFunction

This is the reduce stage. A reduce will automatically be scheduled when all the mapping tasks of the job are completed. It's a fairly straightforward process, and your reduce function will almost certainly use the provided .reduce() method, although you could do some special processing if you really wanted to. All that happens here is that the independent results of the mapping tasks are combined into a single result.

function workReduce () {
     
  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl()); 
  
  // bring all the results together
  tHappy.log(null, 'starting reduction','workReduce');
  tHappy.reduce();
  tHappy.log(null, 'finishing reduction','workReduce');

}

the processFunction

Once the reduce function has completed, you can continue and finish the work. In our example, the random objects that we created in each of the chunks have been combined by workReduce, and now the whole thing can be written to a sheet.

Note that there are a few mandatory requirements here.

  • if there is anything to do, this will return the reduced data
var reduced = tHappy.somethingToProcess ();
  • signal that we are done
tHappy.processed(reduced);
  • clean up all triggers when done - very important to avoid running out of trigger space (see the sketch after this list)
tHappy.cleanupAllTriggers();
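For the curious, that kind of trigger cleanup boils down to something like this sketch using the standard ScriptApp service (an illustration of the idea, not the library's actual implementation):

function deleteAllMyTriggers () {
  // remove every trigger owned by this script project
  ScriptApp.getProjectTriggers().forEach(function (trigger) {
    ScriptApp.deleteTrigger(trigger);
  });
}

Here's the full example processFunction.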

function workProcess() {

     
  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl()); 
  
  // all is over, we get the reduced data and do something with it.
  var reduced = tHappy.somethingToProcess ();
  tHappy.log( null, 'starting processing for job ' + (reduced ? JSON.stringify(reduced) : ' - but nothing to do'),'workProcess');
  if (reduced) {
   
    // do something with the data - for this example we're going to copy it to a spreadsheet
    var sheetHandler = new cDataHandler.DataHandler (
        'thappytest',
        cDataHandler.dhConstants.DB.SHEET,
        undefined,
        '12pTwh5Wzg0W4ZnGBiUI3yZY8QFoNI8NNx_oCPynjGYY');
      
    if (!sheetHandler.isHappy()) {
      throw ('failed to get handler for sheet processing');
    }  
  
    // delete current sheet
    tHappy.handledOk(sheetHandler.remove());
    
    // add new data
    tHappy.handledOk(sheetHandler.save(reduced.result));
  
    // mark it as processed
    tHappy.processed(reduced);
    
    // we'll use the logger too
    tHappy.log( null, 'finished processing','workProcess');
    
    // clean up any triggers now that we know we're done
    tHappy.cleanupAllTriggers();
    
  }
}

Debugging

As mentioned, debugging is tricky. It's better to have a function that runs your scripts serially on a subset of data before moving on to running by triggers.

Set control.triggers = false, then create a function like the one below. This will run through the mapping of tasks one by one, then the reduce function, then the processing function. Once you reliably get the result you want, you can move on to trying it in parallel by setting control.triggers = true;

function endTest () {

  // divide up the work
  var control = getControl();
  var n = splitJobIntoTasks();

  if (control.triggers) {
    // nothing more to do here - the triggers created by
    // splitJobIntoTasks will take care of the rest
  }
  else {
    // this is just a direct end to end test - no triggers

    // do each of the tasks
    for (var i=0; i < n; i++) {
      workMap();
    }
    // reduce
    workReduce();
    // do something with the result
    workProcess();
  }
}

Logging

A .log() method is provided to allow you to log whatever you want. Various logging is done by default, but you can add your own, for example:

tHappy.log(null, 'finishing reduction','workReduce');


Here's an example of a fragment of a log file - with triggering disabled


Now the same thing with triggering enabled


Reporting


It's sometimes useful to take a look inside the orchestration files. If you've used Drive as your database, you can just open them. However, there is also a .report() method to give a summary view.

function report () {   
  // need this for each function that might be triggered
  new cTriggerHappy.TriggerHappy (getControl()).report(); 
}


Keys and instances

Each task, job, and reduction has a unique key. This will help you track down problems if you need to. You'll also notice an instance id in the logger. Each triggered task has a unique instance id so you can track its progress in the log. Note that this is independent of the trigger ID, which is allocated by GAS. The instance id is used in both triggered and inline operation.

Cleaning up

TriggerHappy does not automatically clean up its files. It may be that the reduce data, or even the individual task data, needs to be reused. I'm also considering enabling a rescheduler so that entire jobs can be run multiple times - that would mean the job files could also be useful. However, if you don't need any of that, there are pre-baked methods for cleaning up. This function will clear everything.

function cleanupAll () {

  // need this for each function that might be triggered
  var tHappy = new cTriggerHappy.TriggerHappy (getControl());

  tHappy.cleanupAllTriggers();
  tHappy.cleanupTasks();
  tHappy.cleanupJobs();
  tHappy.cleanupReduction();
  tHappy.cleanupLog();
}

 

Summary

This approach is probably not for everyone, but it does exercise a number of interesting ideas such as Using named locks with Google Apps Scripts, triggers, and of course the concept of using multiple threads - since it is the cloud after all. I have found that triggers are a little fragile, and that work executed in the context of a trigger runs more slowly than the same task run as a regular script. For more like this see Google Apps Scripts snippets.

Here's a substantial example, copying from one database format to another - convertingfromscriptb

The library code.


For help and more information join our forum, follow the blog or follow me on twitter. For more stuff like this see Google Apps Scripts snippets