In Every Google Apps Script project on Github visualized I demonstrated an app for exploring what every Apps Script developer who has shared their code is working on. It can be used to find samples and libraries, and generally to browse what everyone is up to. Creating the source data from GitHub had a few challenges, so I thought I’d write about them here, and how I got around them.

Finding Apps Script projects

One way would have been to use the script service on Apps Script itself, but that would have meant everyone in the world giving OAuth access to their projects, so it was a non-starter. Next I looked at the public GitHub dataset on BigQuery, but found that it had barely any Apps Script in it – so either it’s way out of date or it’s only a sample. That left using the GitHub API to either

  • occasionally consolidate GitHub Apps Script projects and cache the results
  • use the API live from the app.

GitHub API

Here are some of the problems I hit with the API

  • the GraphQL version doesn’t support searches across multiple repositories – so that was out from the start, and it was on to the REST API
  • it has a strict rate limit on searches (30 per minute when authenticated, 10 when not)
  • a small page size (max 100)
  • an absolute maximum of 1000 results per query

All of this meant it wouldn’t return enough results, would be too slow, and would require everyone using the app to authenticate. I wanted it to be public without fuss, so I decided to go for the caching option.

Inconsistent results

Another problem I came across was that, even if you keep within the rate limits, there seems to be a hidden background quota, presumably related to resources used: even if you ask for a page size of 50, you might only receive 45. This happens silently, without the response being flagged as incomplete.

Testing on a small sample is awkward too, as you don’t always get results back in the same order, even if you explicitly ask for a sorted result.

Finally, the total count of matches reported doesn’t match the number of results actually returned, so you can’t rely on it to know when you’re finished. Finished is simply defined as not getting any more.
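
Because of that instability, it’s worth de-duplicating as results arrive. Here’s a minimal sketch of one way to do it – not the app’s actual code – keyed on fields that GitHub code search items do carry (repository.full_name and path):

// minimal de-duplication sketch (not the app's actual code) -
// since ordering is unstable, the same item can turn up more than
// once across pages or runs, so key each result on repo + file path
const seen = new Set();
const addUnique = (item) => {
  const key = `${item.repository.full_name}/${item.path}`;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
};
de-duplicating results (sketch)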

Client

I used the Node Octokit client, which is the official one, and it has some nice features, but of course it’s subject to all the constraints mentioned above. It has built-in pagination and an iterator – however, I noticed that the iterator doesn’t respect rate limits – so I built my own, which we’ll look at later in this article.

Cache

In the end I decided to simply put the packaged results in a gist. It won’t make any sense if you look at it, as it’s compressed, but it turned out to be a simple and effective solution. The cache is updated from time to time using the app we’ll be looking at here – and that’s the approach I settled on.
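
For illustration, here’s a minimal sketch of reading such a cache back from a gist with Octokit – the gist id, file name and compression scheme are placeholders, not the ones the real app uses:

const { Octokit } = require("@octokit/rest");
const zlib = require("zlib");

// sketch only - the gist id, file name and compression scheme here
// are assumptions, not the real app's
const readCache = async (gistId) => {
  const octokit = new Octokit();
  const { data } = await octokit.gists.get({ gist_id: gistId });
  // assume the payload was gzipped then base64 encoded
  const packed = data.files["cache.b64"].content;
  return JSON.parse(zlib.gunzipSync(Buffer.from(packed, "base64")).toString());
};
reading the cache from a gist (sketch)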

Rate limiting

The GitHub API helpfully returns how many goes you’ve got left, along with how long to wait until the rate limit window resets. There are actually 2 types of access I do with the client – search and get – each of which has different constraints. I used qottle to manage asynchronous queuing, and created an iterator that paces requests according to the rate limit responses from each one.

Page randomness

As noted, the API doesn’t reliably return the page size you request, but I found that keeping the page size down to 15 gave consistent results. Anything above that started to drop results.

Workarounds and code

Here are a few extracts from the code showing the workarounds for all this. You can see the complete app on GitHub.

Maximum results

To get round the problem of the 1000 maximum per search, I broke the search into pieces by filtering on various file sizes. The simple query required to find Apps Script projects is this.

q: "filename:appsscript extension:.json"
search github for apps script projects

By making the query multiple times, each further constrained by file size, each request could be made to return fewer than 1000 results – in other words, adding each of these terms to the query.

ranges: [
  "size:<=100",
  "size:101..250",
  "size:251..400",
  "size:401..550",
  "size:>550",
],
split query into sizes

Queue management

To minimize rate limit problems, it’s important to keep the number of concurrent requests down. Searching has a very strict limit, but getting data is much more generous. Because of the info returned in the headers by Octokit I didn’t need to use qottle’s rate limiting capabilities, but I did need it for concurrency management, using these 2 queues.

const Qottle = require("qottle");
const decorateQueue = new Qottle({
  concurrent: 3,
});
const searchQueue = new Qottle({
  concurrent: 1,
});
concurrent request management with qottle

Using that approach, all the requests could be queued up without worrying about handling rate limits.

const fetchAllCode = async (options, max) => {
  const gd = new GitData();
  return Promise.all(
    queryDefinition.ranges.map((range) => {
      return searchQueue.add(() => {
        return fetchAllCodePart({ gd, options, range, max });
      });
    })
  ).then(() => gd);
};
queuing search requests
const po = Promise.all(
  gd.items("owners").map((f) => decorateQueue.add(() => decorateOwner(f)))
).then(() =>
  console.log(`....decorated ${gd.items("owners").length} owners`)
);
queueing get requests
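
decorateOwner itself isn’t shown in these extracts. As a hypothetical sketch of what such a get request might look like – assuming each owner item carries a login property and octokit is already authenticated – it could fetch the owner’s public profile and merge in a few fields:

// hypothetical sketch - not the app's actual decorateOwner; assumes
// each owner item has a login property
const decorateOwner = (owner) =>
  octokit.users.getByUsername({ username: owner.login }).then(({ data }) =>
    Object.assign(owner, {
      name: data.name,
      blog: data.blog,
      publicRepos: data.public_repos,
    })
  );
a possible shape for decorateOwner (sketch)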

Using an iterator

This is a great way to abstract away the details of rate limiting and paging. The idea is that the entire processing is centered around a for await … of loop, and each iteration of the loop delivers a single result. It’s up to the iterator to figure out how long to wait between requests, and whether to fetch a fresh page. It also features a transformer, which processes each row before delivering the iteration value.

const fetchAllCodePart = async ({ gd, options, max, range }) => {
  const transformer = ({ data }) => {
    return gd.add(data);
  };

  // iterator to go through the whole thing
  const grate = giterator({
    options: {
      ...options,
      q: `${options.q} ${range}`,
    },
    max,
    transformer,
    fetcher: gitSearcher,
    keepAll: false,
  });

  for await (let { index, data, pack } of grate) {
    // for logging
    // console.log(index, data.repository.full_name, pack.waitTime)
  }

  return gd;
};
looping using an iterator

The iterator

This is called from within the for await … of and is responsible for delivering the next item to that loop. The next item might simply be the next one from a page it already has from a previous fetch, or it may mean going off to get some more data, respecting the rate limit while doing so. It’s a bit of a read, because it’s written for general purpose use so I can use it in other, unrelated projects.

/**
 * make an iterator
 * @param {object} args
 * @param {function} args.fetcher how to get some more
 * @param {object} [args.options] the options
 * @param {boolean} [args.keepAll=true] whether to keep all the items received
 * @param {number} [args.max=Infinity] the max number to retrieve in total
 * @param {number} [args.initialWaitTime=100] the initial wait time before starting
 * @param {number} [args.minWait=200] the minimum wait between fetches
 * @param {function} [args.transformer] a transformation to apply before returning anything
 * @return {object} the iterable
 */
const giterator = ({
  fetcher,
  options,
  keepAll = true,
  max = Infinity,
  transformer,
  initialWaitTime = 100,
  minWait = 200,
}) => {
  return {
    [Symbol.asyncIterator]() {
      return {
        // whether the last chunk has been received
        finished: false,

        // which page we are on
        page: 0,

        // what we got in the last fetch
        pack: null,

        // the full set of items kept if keepAll is true
        items: [],

        // how long to wait before going
        waitTime() {
          return this.pack ? this.pack.waitTime : initialWaitTime;
        },

        // index into items to return on next()
        itemIndex: 0,

        // the overall index
        index: 0,

        // for reporting
        stats: {
          startedAt: null,
          finishedAt: null,
          numberOfFetches: 0,
          totalWaitTime: 0,
        },

        // wait for some amount of time before the next fetch
        waiter() {
          const waitTime = this.waitTime() + minWait;
          this.stats.totalWaitTime += waitTime;
          // if (waitTime > minWait) console.log("...waiting for", waitTime);
          return waitTime
            ? new Promise((resolve) =>
                setTimeout(() => resolve(waitTime), waitTime)
              )
            : Promise.resolve(0);
        },

        // report might be called for progress reporting
        report() {
          return {
            finished: this.finished,
            keepAll,
            index: this.index - 1,
            stats: this.stats,
            max,
          };
        },

        // get another chunk
        getMore() {
          if (this.finished) {
            throw new Error("attempt to get more after finish");
          }
          // record that we've started
          if (!this.pack) {
            this.stats.startedAt = new Date().getTime();
          }

          return this.waiter().then(() =>
            fetcher({ options, page: this.page }).then((pack) => {
              // if we didn't get anything, then assume it's all over
              this.pack = pack;
              const { items } = pack;
              this.stats.numberOfFetches++;

              if (!items.length) {
                this.wrapup();
              } else {
                // this is whether we need to keep all the results ever fetched
                if (keepAll) {
                  Array.prototype.push.apply(this.items, items);
                } else {
                  this.items = items;
                  this.itemIndex = 0;
                }
              }
              // ready for next page
              this.page++;
            })
          );
        },

        wrapup() {
          this.finished = true;
        },

        // checking hasNext will potentially involve a fetch
        hasNext() {
          // definitely finished
          if (this.finished) {
            return Promise.resolve(false);
          }
          // finished because we've had enough
          if (this.index >= max) {
            this.wrapup();
            return Promise.resolve(false);
          }

          // haven't finished with those we already have
          if (this.itemIndex < this.items.length) return Promise.resolve(true);

          // we don't know if it's finished, so get some more and find out
          return this.getMore().then(() => !this.finished);
        },

        // get next item
        async next() {
          // see if there are any - this will fetch some if needed
          const hasNext = await this.hasNext();
          this.stats.finishedAt = new Date().getTime();

          // wrap up
          if (!hasNext) {
            return {
              done: true,
            };
          }

          // construct the result to deliver
          const value = {
            // these are like the args returned by [].forEach (data, index, items)
            data: this.items[this.itemIndex++],
            index: this.index++,
            items: this.items,

            // this is the response to the last fetch - could be useful for things like total items
            pack: this.pack,
            nextPage: this.page,

            // this is the progress report
            report: this.report(),
          };

          // if there's a transformer, apply it
          if (transformer) {
            value.transformation = transformer(value);
          }
          return {
            done: false,
            value,
          };
        },
      };
    },
  };
};
github search iterator
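
Since it’s general purpose, the iterator can be tried out away from the GitHub API altogether. Here’s a small usage sketch with a made-up fetcher – fakeFetcher is invented for this demo and just serves a few numbered pages with no wait time:

// fake fetcher to demo giterator in isolation - serves 3 pages of
// 2 items each, then an empty page to signal the end
const fakeFetcher = ({ page }) =>
  Promise.resolve({
    items: page < 3 ? [page * 2, page * 2 + 1] : [],
    waitTime: 0,
  });

const demo = async () => {
  for await (const { index, data } of giterator({ fetcher: fakeFetcher })) {
    console.log(index, data); // 0 0, 1 1 ... 5 5, then stops
  }
};
demo();
trying the iterator with a fake fetcher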

Querying Github

The iterator needs to know how to get pages, and whether to wait before getting them, so we pass it a fetcher for Octokit.

/**
 * do a search
 * @param {object} [args]
 * @param {object} args.options the search options
 * @param {number} [args.page=0] the page number to get
 * @return {object} the response
 */
const gitSearcher = ({ options, page = 0 } = {}) => {
  // noticed that:
  // - sort doesn't work: you get things back in a random order
  //   (run the same query twice and get different results)
  // - make the per_page too big and it silently drops some results
  const searchOptions = {
    per_page: 15,
    ...options,
    page,
  };

  return octokit.search
    .code(searchOptions)
    .then((response) => gitUntangle(response, searchOptions));
};
searcher for github

And something to untangle the results from Octokit.

/**
 * sort out the response from octokit
 * @param {object} response the response from octokit
 * @param {object} options the search options that were used
 * @return {object} the unpacked result
 */
const gitUntangle = (response, options) => {
  const { data, headers } = response;
  const { total_count } = data;
  const { status } = headers;
  const items = data.items || data;
  const ratelimitRemaining = headers["x-ratelimit-remaining"];
  const ratelimitReset = headers["x-ratelimit-reset"];

  // sometimes there's items, sometimes not
  if (total_count) {
    console.log({
      total_count,
      item_count: items.length,
      incomplete: data.incomplete_results,
    });
  }
  if (options.per_page !== items.length && total_count) {
    console.log("...asked for", options.per_page, "got", items.length);
  }
  return {
    // these are needed
    items,
    total: total_count,
    // this is how long the next attempt will have to wait before trying again
    // wait some extra time to allow for out of sync clocks
    waitTime:
      ratelimitRemaining > 1
        ? 0
        : Math.max(2500, ratelimitReset * 1000 - new Date().getTime()),
    // this might be handy
    ratelimitReset,
    ratelimitRemaining,
    incomplete: data.incomplete_results,
    status,
    response,
  };
};
untangle octokit response

Summary and links

Those are the main workarounds for dealing with some of the gotchas in the GitHub API.

All about scrviz