Searching and cataloging Apps Script projects on GitHub
In Every Google Apps Script project on GitHub visualized I demonstrated an app for exploring what every Apps Script developer who has shared their code is working on. It can be used to find samples and libraries, and generally to browse what everyone is up to. Creating the source data from GitHub presented a few challenges, so I thought I'd write about them here, along with how I got around them.
One approach would have been to use the Script service in Apps Script itself, but that would have meant everyone in the world granting OAuth access to their projects, so it was a non-starter. Next I looked at the public GitHub table on BigQuery, but found it had barely any Apps Script in it – either it's way out of date or it's only a sample. That left using the GitHub API to either
occasionally consolidate GitHub Apps Script projects and cache the results
use the API live from the App.
GitHub API
Here are some of the problems I hit with the API
the GraphQL version doesn't support code searches across multiple repositories – so that was out from the start. So, on to the REST API
it has a strict rate limit on searches (max 30 per minute when authenticated, 10 when not)
a small page size (max 100)
an absolute maximum of 1000 results
All of this meant it wouldn't return enough results, would be too slow, and would require every user of the app to authenticate. I wanted it to be public without fuss, so I decided to go for the caching option.
Inconsistent results
Another problem I came across was that, even if you keep within the rate limits, there is a hidden background quota, presumably related to resources used: even if you ask for a page size of 50, you might only receive 45. This happens silently, without the response being flagged as incomplete.
Testing on a small sample is awkward, as you don't always get results back in the same order, even if you explicitly ask for a sorted result.
Finally, the total count of matches reported doesn't match the number of results actually returned, so you can't rely on it to know when you're finished. You're finished only when a request comes back empty.
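Since the reported total can't be trusted, the only safe termination rule is "stop when a page comes back empty". Here's a minimal sketch of that rule, using a hypothetical fetchPage function in place of the real API call:

```javascript
// fetch pages until one comes back empty - the reported total_count is
// ignored because it can't be trusted (fetchPage is a hypothetical
// stand-in for a real API call)
const fetchAll = async (fetchPage) => {
  const all = [];
  for (let page = 0; ; page++) {
    const items = await fetchPage(page);
    if (!items.length) break; // finished means: nothing more came back
    all.push(...items);
  }
  return all;
};
```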
Client
I used the Node Octokit client, which is the official one. It has some nice features, but of course it's subject to all the constraints mentioned above. It has built-in pagination and an iterator – however, I noticed that it doesn't respect rate limits – so I built my own iterator, which we'll look at later in this article.
Cache
In the end I decided simply to put the packaged results in a gist. It won't make much sense to read, as it's compressed, but it turned out to be a simple and effective solution. The cache is updated from time to time using the app we'll be looking at here – and that's the approach I settled on.
Rate limiting
The GitHub API helpfully returns how many goes you've got left, along with how long to wait until the rate limit window resets. There are actually two types of access I do with the client – search and get – each of which has different constraints. I used qottle to manage asynchronous queuing, and created an iterator that paces results according to the rate limit responses from each request.
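As a sketch, here's how those headers can be turned into a wait time. The header names are GitHub's real ones; the 2500ms floor (which also appears in the fetcher later) allows for clock skew between client and server:

```javascript
// derive a wait time from GitHub's rate limit headers - if there's
// budget left there's no need to wait, otherwise wait until the window
// resets (x-ratelimit-reset is in epoch seconds), with a floor to
// allow for mis-synced clocks
const waitFromHeaders = (headers, now = Date.now()) => {
  const remaining = Number(headers["x-ratelimit-remaining"]);
  const reset = Number(headers["x-ratelimit-reset"]);
  return remaining > 1 ? 0 : Math.max(2500, reset * 1000 - now);
};
```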
Page randomness
As noted, the API doesn't reliably return the page size you request, but I found that keeping the page size down to 15 gave consistent results. Anything above that started to drop results.
Workarounds and code
Here are a few extracts from the code showing the workarounds for all this. You can see the complete app on GitHub.
Maximum results
To get round the 1000-result maximum per search, I broke the search into pieces by filtering on various file sizes. The simple query required to find Apps Script projects is this:
q: "filename:appsscript extension:.json"
search github for apps script projects
By making the query multiple times, but further constraining it by size, each request could be made to return fewer than 1000 results – in other words, adding a size-range term to each query.
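For example, slicing the search by file size might look like this – the boundaries below are illustrative, not the exact ones the app uses:

```javascript
// partition the search into size slices so each one stays under the
// 1000-result cap (these boundaries are just an illustration)
const base = "filename:appsscript extension:.json";
const sizeRanges = [
  "size:0..200",
  "size:201..500",
  "size:501..1000",
  "size:1001..5000",
  "size:>5000",
];
const queries = sizeRanges.map((range) => `${base} ${range}`);
```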
To minimize rate limit problems, it's important to keep the number of concurrent requests down. Searching has a very strict limit, but getting data is much more generous. Because of the info returned in the headers via Octokit I didn't need to use qottle's rate limiting capabilities, but I did need it for concurrency management, using these two queues.
const Qottle = require("qottle")

const decorateQueue = new Qottle({
  concurrent: 3
})

const searchQueue = new Qottle({
  concurrent: 1,
});
concurrent request management with qottle
Using that approach, all the requests could be queued up without worrying about handling rate limits.
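qottle does the queuing in the real app; as a rough illustration of what its concurrent: 1 search queue guarantees, here's a minimal promise-chain version:

```javascript
// a minimal stand-in for a concurrent:1 queue like the searchQueue
// above: each added task only starts once the previous one has settled
const makeSerialQueue = () => {
  let tail = Promise.resolve();
  return {
    add(task) {
      const result = tail.then(() => task());
      // keep the chain alive even if a task rejects
      tail = result.catch(() => {});
      return result;
    },
  };
};
```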
This is a great way to abstract away the details of rate limiting and paging. The idea is that the entire process is centered around a for await ... of loop, and each iteration of the loop delivers a single result. It's up to the iterator to figure out how long to wait between requests, and when to fetch a fresh page. It also features a transformer, which processes each row before delivering the iteration value.
// iterator to go through the whole thing
const grate = giterator({
  options: {
    ...options,
    q: `${options.q} ${range}`,
  },
  max,
  transformer,
  fetcher: gitSearcher,
  keepAll: false,
});

for await (let { index, data, pack } of grate) {
  // for logging
  // console.log(index, data.repository.full_name, pack.waitTime)
}

return Promise.resolve(gd);
};
looping using an iterator
The iterator
This is called from within the for await ... of loop and is responsible for delivering the next item to it. The next item might simply be the next item from a page it already has from a previous fetch, or it may mean going off to get some more data, respecting the rate limit while doing so. It's a bit of a read, because it's written for general-purpose use so I can use it in other, unrelated projects.
/**
 * make an iterator
 * @param {object} args
 * @param {function} args.fetcher how to get some more
 * @param {object} [args.options] the options
 * @param {boolean} [args.keepAll=true] whether to keep all the items received
 * @param {number} [args.max=Infinity] the max number to retrieve in total
 * @param {number} [args.initialWaitTime=100] the initial wait time before starting
 * @param {number} [args.minWait=200] the minimum wait between fetches
 * @param {function} [args.transformer] a transformation to apply before returning anything
 * @return {object} the response
 */
const giterator = ({
  fetcher,
  options,
  keepAll = true,
  max = Infinity,
  transformer,
  initialWaitTime = 100,
  minWait = 200,
}) => {
  return {
    [Symbol.asyncIterator]() {
      return {
        // whether the last chunk has been received
        finished: false,

        // which page we are on
        page: 0,

        // what we got in the last fetch
        pack: null,

        // the full set of items kept if keepAll is true
        items: [],

        // progress counters
        stats: {
          startedAt: null,
          finishedAt: null,
          numberOfFetches: 0,
          totalWaitTime: 0,
        },

        // overall index of the next item to deliver
        index: 0,

        // how long to wait before going
        waitTime() {
          return this.pack ? this.pack.waitTime : initialWaitTime;
        },

        // index into items to return on next()
        itemIndex: 0,

        // wait for some amount of time before next
        waiter() {
          const waitTime = Math.max(this.waitTime(), minWait);
          this.stats.totalWaitTime += waitTime;
          // if (waitTime > minWait) console.log("...waiting for", waitTime);
          return waitTime
            ? new Promise((resolve) =>
                setTimeout(() => resolve(waitTime), waitTime)
              )
            : Promise.resolve(0);
        },

        // report might be called for progress reporting
        report() {
          return {
            finished: this.finished,
            keepAll,
            index: this.index - 1,
            stats: this.stats,
            max,
          };
        },

        // get another chunk
        getMore() {
          if (this.finished) {
            throw new Error("attempt to get more after finish");
          }
          // record that we've started
          if (!this.pack) {
            this.stats.startedAt = new Date().getTime();
          }

          return this.waiter().then(() =>
            fetcher({ options, page: this.page }).then((pack) => {
              // if we didn't get anything, then assume it's all over
              this.pack = pack;
              const { items } = pack;
              this.stats.numberOfFetches++;

              if (!items.length) {
                this.wrapup();
              } else {
                // this is whether we need to keep all the results ever got
                if (keepAll) {
                  Array.prototype.push.apply(this.items, items);
                } else {
                  this.items = items;
                  this.itemIndex = 0;
                }
              }
              // ready for next page
              this.page++;
            })
          );
        },

        wrapup() {
          this.finished = true;
        },

        // checking hasNext will potentially involve a get
        hasNext() {
          // definitely finished
          if (this.finished) {
            return Promise.resolve(false);
          }
          // finished because we've had enough
          if (this.index >= max) {
            this.wrapup();
            return Promise.resolve(false);
          }

          // haven't done with those we already have
          if (this.itemIndex < this.items.length) return Promise.resolve(true);

          // we don't know if it's finished so get some more and find out
          return this.getMore().then(() => !this.finished);
        },

        // get next item
        async next() {
          // see if there are any - this will fetch some if needed
          const hasNext = await this.hasNext();
          this.stats.finishedAt = new Date().getTime();

          // wrap up
          if (!hasNext) {
            return Promise.resolve({
              done: true,
            });
          }

          // construct the result to deliver
          const value = {
            // these are like the args returned by [].forEach (data, index, items)
            data: this.items[this.itemIndex++],
            index: this.index++,
            items: this.items,

            // this is the response to the last fetch - could be useful for things like total items
            pack: this.pack,
            nextPage: this.page,

            // this is the progress report
            report: this.report(),
          };

          // if there's a transformer, apply it
          if (transformer) {
            value.transformation = transformer(value);
          }
          return {
            done: false,
            value,
          };
        },
      };
    },
  };
};
github search iterator
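To test the loop shape without hitting the API, the same consuming pattern can be driven by a plain async generator – a much-simplified stand-in for giterator, fed by a stub fetcher:

```javascript
// a simplified stand-in for giterator: pages through a fetcher and
// yields one item at a time, so the consumer's for await...of loop
// looks exactly like the real one
async function* pageIterator({ fetcher, max = Infinity }) {
  let index = 0;
  for (let page = 0; index < max; page++) {
    const pack = await fetcher({ page });
    if (!pack.items.length) return; // empty page means we're done
    for (const data of pack.items) {
      if (index >= max) return;
      yield { index: index++, data, pack };
    }
  }
}
```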
Querying Github
The iterator needs to know how to get pages, and whether to wait before getting them, so we pass in a fetcher built on Octokit.
/**
 * do a search
 * @param {object} [args]
 * @param {object} args.options the search options
 * @param {number} [args.page=0] the page number to get
 * @return {object} the response
 */
const gitSearcher = ({ options, page = 0 } = {}) => {
  // noticed that
  // - sort doesn't work: you get things back in a random order -
  //   run the same query twice, get different results
  // - make the per_page too big and it silently drops some results

  // octokit is an authenticated Octokit instance created elsewhere in the app
  return octokit.rest.search.code({ ...options, page }).then((response) => {
    const { data, status, headers } = response;
    const { items, total_count } = data;
    const ratelimitRemaining = Number(headers["x-ratelimit-remaining"]);
    const ratelimitReset = Number(headers["x-ratelimit-reset"]);

    // sometimes there's items, sometimes not
    if (total_count) {
      console.log({
        total_count,
        item_count: items.length,
        incomplete: data.incomplete_results,
      });
    }
    if (options.per_page !== items.length && total_count) {
      console.log("...asked for", options.per_page, "got", items.length);
    }

    return {
      // these are needed
      items,
      total: total_count,

      // this is how long the next attempt will have to wait before trying again
      // wait an additional time to allow for missynced times
      waitTime:
        ratelimitRemaining > 1
          ? 0
          : Math.max(2500, ratelimitReset * 1000 - new Date().getTime()),

      // this might be handy
      ratelimitReset,
      ratelimitRemaining,
      incomplete: data.incomplete_results,
      status,
      response,
    };
  });
};
untangle octokit response
Summary and links
Those are the main workarounds for dealing with some of the gotchas of the GitHub API.
bruce mcpherson is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.mcpher.com. Permissions beyond the scope of this license may be available at code use guidelines