Searching and cataloging Apps Script projects on GitHub
In Every Google Apps Script project on GitHub visualized I demonstrated an app for exploring what every Apps Script developer who has shared their code is working on. It can be used to find samples and libraries, and generally to browse what everyone is up to. Creating the source data from GitHub presented a few challenges, so I thought I'd write about them here, along with how I got around them.
One approach would have been to use the Script service in Apps Script itself, but that would have meant everyone in the world granting OAuth access to their projects, so it was a non-starter. Next I looked at the public GitHub table on BigQuery, but found it had barely any Apps Script in it – either it's way out of date or it's only a sample. That left using the GitHub API to either
occasionally consolidate GitHub Apps Script projects and cache the results
use the API live from the App.
GitHub API
Here are some of the problems I hit with the API
the GraphQL version doesn't support code searches across multiple repositories – so that was out from the start. So, on to the REST API
it has a strict rate limit on searches (max 30 per minute when authenticated, 10 when not)
a small page size (max 100)
an absolute maximum of 1000 results
All of this meant it wouldn't return enough results, would be too slow, and would require every user of the app to authenticate. I wanted it to be public without fuss, so I decided to go for the caching option.
Inconsistent results
Another problem I came across was that, even if you keep within the rate limits, there is a hidden background quota, presumably related to resources used: even if you ask for a page size of 50, you might only receive 45. This happens silently, without the response being flagged as incomplete.
Testing on a small sample is awkward, as you don't always get results back in the same order, even if you explicitly ask for a sorted result.
Finally, the total count of matches reported doesn't match the number of results actually returned, so you can't rely on it to know when you're finished. You're finished only when a request comes back empty.
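Since the reported total can't be trusted, the only safe termination rule is "stop when a page comes back empty". Here's a minimal sketch of that rule, using a hypothetical fetchPage function in place of the real API call:

```javascript
// fetch pages until one comes back empty - the reported total_count is
// ignored because it can't be trusted (fetchPage is a hypothetical
// stand-in for a real API call)
const fetchAll = async (fetchPage) => {
  const all = [];
  for (let page = 0; ; page++) {
    const items = await fetchPage(page);
    if (!items.length) break; // finished means: nothing more came back
    all.push(...items);
  }
  return all;
};
```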
Client
I used the Node Octokit client, which is the official one. It has some nice features, but of course it's subject to all the constraints mentioned above. It has built-in pagination and an iterator – however, I noticed that it doesn't respect rate limits – so I built my own iterator, which we'll look at later in this article.
Cache
In the end I decided simply to put the packaged results in a gist. It won't make much sense to read, as it's compressed, but it turned out to be a simple and effective solution. The cache is updated from time to time using the app we'll be looking at here – and that's the approach I settled on.
Rate limiting
The GitHub API helpfully returns how many goes you've got left, along with how long to wait until the rate limit window resets. There are actually two types of access I do with the client – search and get – each of which has different constraints. I used qottle to manage asynchronous queuing, and created an iterator that paces results according to the rate limit responses from each request.
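As a sketch, here's how those headers can be turned into a wait time. The header names are GitHub's real ones; the 2500ms floor (which also appears in the fetcher later) allows for clock skew between client and server:

```javascript
// derive a wait time from GitHub's rate limit headers - if there's
// budget left there's no need to wait, otherwise wait until the window
// resets (x-ratelimit-reset is in epoch seconds), with a floor to
// allow for mis-synced clocks
const waitFromHeaders = (headers, now = Date.now()) => {
  const remaining = Number(headers["x-ratelimit-remaining"]);
  const reset = Number(headers["x-ratelimit-reset"]);
  return remaining > 1 ? 0 : Math.max(2500, reset * 1000 - now);
};
```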
Page randomness
As noted, the API doesn't reliably return the page size you request, but I found that keeping the page size down to 15 gave consistent results. Anything above that started to drop results.
Workarounds and code
Here are a few extracts from the code showing the workarounds for all this. You can see the complete app on GitHub.
Maximum results
To get round the 1000-result maximum per search, I broke the search into pieces by filtering on various file sizes. The simple query required to find Apps Script projects is this:
q: "filename:appsscript extension:.json"
search github for apps script projects
By making the query multiple times, but further constraining it by size, each request could be made to return fewer than 1000 results – in other words, adding a size-range term to each query.
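For example, slicing the search by file size might look like this – the boundaries below are illustrative, not the exact ones the app uses:

```javascript
// partition the search into size slices so each one stays under the
// 1000-result cap (these boundaries are just an illustration)
const base = "filename:appsscript extension:.json";
const sizeRanges = [
  "size:0..200",
  "size:201..500",
  "size:501..1000",
  "size:1001..5000",
  "size:>5000",
];
const queries = sizeRanges.map((range) => `${base} ${range}`);
```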
To minimize rate limit problems, it's important to keep the number of concurrent requests down. Searching has a very strict limit, but getting data is much more generous. Because of the info returned in the headers via Octokit I didn't need to use qottle's rate limiting capabilities, but I did need it for concurrency management, using these two queues.
const Qottle = require("qottle")

const decorateQueue = new Qottle({
  concurrent: 3
})

const searchQueue = new Qottle({
  concurrent: 1,
});
concurrent request management with qottle
Using that approach, all the requests could be queued up without worrying about handling rate limits.
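qottle does the queuing in the real app; as a rough illustration of what its concurrent: 1 search queue guarantees, here's a minimal promise-chain version:

```javascript
// a minimal stand-in for a concurrent:1 queue like the searchQueue
// above: each added task only starts once the previous one has settled
const makeSerialQueue = () => {
  let tail = Promise.resolve();
  return {
    add(task) {
      const result = tail.then(() => task());
      // keep the chain alive even if a task rejects
      tail = result.catch(() => {});
      return result;
    },
  };
};
```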
This is a great way to abstract away the details of rate limiting and paging. The idea is that the entire process is centered around a for await ... of loop, and each iteration of the loop delivers a single result. It's up to the iterator to figure out how long to wait between requests, and when to fetch a fresh page. It also features a transformer, which processes each row before delivering the iteration value.
// iterator to go through the whole thing
const grate = giterator({
  options: {
    ...options,
    q: `${options.q} ${range}`,
  },
  max,
  transformer,
  fetcher: gitSearcher,
  keepAll: false,
});

for await (let { index, data, pack } of grate) {
  // for logging
  // console.log(index, data.repository.full_name, pack.waitTime)
}

return Promise.resolve(gd);
};
looping using an iterator
The iterator
This is called from within the for await ... of loop and is responsible for delivering the next item to it. The next item might simply be the next item from a page it already has from a previous fetch, or it may mean going off to get some more data, respecting the rate limit while doing so. It's a bit of a read, because it's written for general-purpose use so I can use it in other, unrelated projects.
/**
 * make an iterator
 * @param {object} args
 * @param {function} args.fetcher how to get some more
 * @param {object} [args.options] the options
 * @param {boolean} [args.keepAll=true] whether to keep all the items received
 * @param {number} [args.max=Infinity] the max number to retrieve in total
 * @param {number} [args.initialWaitTime=100] the initial wait time before starting
 * @param {number} [args.minWait=200] the minimum wait between fetches
 * @param {function} [args.transformer] a transformation to apply before returning anything
 * @return {object} the response
 */
const giterator = ({
  fetcher,
  options,
  keepAll = true,
  max = Infinity,
  transformer,
  initialWaitTime = 100,
  minWait = 200,
}) => {
  return {
    [Symbol.asyncIterator]() {
      return {
        // whether the last chunk has been received
        finished: false,

        // which page we are on
        page: 0,

        // what we got in the last fetch
        pack: null,

        // the full set of items kept if keepAll is true
        items: [],

        // progress counters
        stats: {
          startedAt: null,
          finishedAt: null,
          numberOfFetches: 0,
          totalWaitTime: 0,
        },

        // overall index of the next item to deliver
        index: 0,

        // how long to wait before going
        waitTime() {
          return this.pack ? this.pack.waitTime : initialWaitTime;
        },

        // index into items to return on next()
        itemIndex: 0,

        // wait for some amount of time before next
        waiter() {
          const waitTime = Math.max(this.waitTime(), minWait);
          this.stats.totalWaitTime += waitTime;
          // if (waitTime > minWait) console.log("...waiting for", waitTime);
          return waitTime
            ? new Promise((resolve) =>
                setTimeout(() => resolve(waitTime), waitTime)
              )
            : Promise.resolve(0);
        },

        // report might be called for progress reporting
        report() {
          return {
            finished: this.finished,
            keepAll,
            index: this.index - 1,
            stats: this.stats,
            max,
          };
        },

        // get another chunk
        getMore() {
          if (this.finished) {
            throw new Error("attempt to get more after finish");
          }
          // record that we've started
          if (!this.pack) {
            this.stats.startedAt = new Date().getTime();
          }

          return this.waiter().then(() =>
            fetcher({ options, page: this.page }).then((pack) => {
              // if we didn't get anything, then assume it's all over
              this.pack = pack;
              const { items } = pack;
              this.stats.numberOfFetches++;

              if (!items.length) {
                this.wrapup();
              } else {
                // this is whether we need to keep all the results ever got
                if (keepAll) {
                  Array.prototype.push.apply(this.items, items);
                } else {
                  this.items = items;
                  this.itemIndex = 0;
                }
              }
              // ready for next page
              this.page++;
            })
          );
        },

        wrapup() {
          this.finished = true;
        },

        // checking hasNext will potentially involve a get
        hasNext() {
          // definitely finished
          if (this.finished) {
            return Promise.resolve(false);
          }
          // finished because we've had enough
          if (this.index >= max) {
            this.wrapup();
            return Promise.resolve(false);
          }

          // haven't done with those we already have
          if (this.itemIndex < this.items.length) return Promise.resolve(true);

          // we don't know if it's finished so get some more and find out
          return this.getMore().then(() => !this.finished);
        },

        // get next item
        async next() {
          // see if there are any - this will fetch some if needed
          const hasNext = await this.hasNext();
          this.stats.finishedAt = new Date().getTime();

          // wrap up
          if (!hasNext) {
            return Promise.resolve({
              done: true,
            });
          }

          // construct the result to deliver
          const value = {
            // these are like the args returned by [].forEach (data, index, items)
            data: this.items[this.itemIndex++],
            index: this.index++,
            items: this.items,

            // this is the response to the last fetch - could be useful for things like total items
            pack: this.pack,
            nextPage: this.page,

            // this is the progress report
            report: this.report(),
          };

          // if there's a transformer, apply it
          if (transformer) {
            value.transformation = transformer(value);
          }
          return {
            done: false,
            value,
          };
        },
      };
    },
  };
};
github search iterator
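To test the loop shape without hitting the API, the same consuming pattern can be driven by a plain async generator – a much-simplified stand-in for giterator, fed by a stub fetcher:

```javascript
// a simplified stand-in for giterator: pages through a fetcher and
// yields one item at a time, so the consumer's for await...of loop
// looks exactly like the real one
async function* pageIterator({ fetcher, max = Infinity }) {
  let index = 0;
  for (let page = 0; index < max; page++) {
    const pack = await fetcher({ page });
    if (!pack.items.length) return; // empty page means we're done
    for (const data of pack.items) {
      if (index >= max) return;
      yield { index: index++, data, pack };
    }
  }
}
```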
Querying Github
The iterator needs to know how to get pages, and whether to wait before getting them, so we pass in a fetcher built on Octokit.
/**
 * do a search
 * @param {object} [args]
 * @param {object} args.options the search options
 * @param {number} [args.page=0] the page number to get
 * @return {object} the response
 */
const gitSearcher = ({ options, page = 0 } = {}) => {
  // noticed that
  // - sort doesn't work: you get things back in a random order -
  //   run the same query twice, get different results
  // - make the per_page too big and it silently drops some results

  // octokit is an authenticated Octokit instance created elsewhere in the app
  return octokit.rest.search.code({ ...options, page }).then((response) => {
    const { data, status, headers } = response;
    const { items, total_count } = data;
    const ratelimitRemaining = Number(headers["x-ratelimit-remaining"]);
    const ratelimitReset = Number(headers["x-ratelimit-reset"]);

    // sometimes there's items, sometimes not
    if (total_count) {
      console.log({
        total_count,
        item_count: items.length,
        incomplete: data.incomplete_results,
      });
    }
    if (options.per_page !== items.length && total_count) {
      console.log("...asked for", options.per_page, "got", items.length);
    }

    return {
      // these are needed
      items,
      total: total_count,

      // this is how long the next attempt will have to wait before trying again
      // wait an additional time to allow for missynced times
      waitTime:
        ratelimitRemaining > 1
          ? 0
          : Math.max(2500, ratelimitReset * 1000 - new Date().getTime()),

      // this might be handy
      ratelimitReset,
      ratelimitRemaining,
      incomplete: data.incomplete_results,
      status,
      response,
    };
  });
};
untangle octokit response
Summary and links
Those are the main workarounds for dealing with some of the gotchas of the GitHub API.
bruce mcpherson is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.mcpher.com. Permissions beyond the scope of this license may be available at code use guidelines