In Every Google Apps Script project on Github visualized, I demonstrated an app for exploring what every Apps Script developer who has shared their code is working on. It can be used to find samples and libraries, and generally to browse what everyone is up to. Creating the source data from GitHub presented a few challenges, so I thought I'd write about them here, and how I got around them.
Finding Apps Script projects
One way would have been to use the script service on Apps Script itself, but that would have meant everyone in the world giving OAuth access to their projects, so it was a non-starter. Next I looked at the public GitHub table on BigQuery, but found that it had barely any Apps Script in it – either it's way out of date or it's only a sample – so that left using the GitHub API to either
- occasionally consolidate GitHub Apps Script projects and cache the results
- use the API live from the App.
Here are some of the problems I hit with the API:
- the GraphQL version doesn’t support searches across multiple repositories – so that was out from the start. So, on to the REST API
- it has a strict rate limit on searches (a maximum of 30 requests per minute when authenticated, 10 when not)
- a small page size (max 100)
- an absolute maximum of 1000 results
All of this meant it wouldn’t return enough results, would be too slow, and would require authentication for everyone using the App. I wanted it to be public without fuss, so I decided to go for the caching option.
Another problem I came across was a hidden background quota, presumably related to resources used: even if you keep within the rate limits and ask for a page size of, say, 50, you might only receive 45 – and this happens silently, without the response being flagged as incomplete.
Testing on a small sample is awkward, as you don’t always get back results in the same order, even if you explicitly ask for a sorted result.
Finally, the total count of matches returned doesn’t match the actual results returned, so you can’t rely on it to know when you’re finished. “Finished” simply means you don’t get any more results.
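A minimal sketch of that termination logic – fetchPage here is a hypothetical stand-in for a real search call, not the app's actual code:

```javascript
// Because total_count can't be trusted, keep fetching pages until one
// comes back empty. fetchPage(page) is assumed to resolve to an array
// of items for that page.
const fetchAllResults = async (fetchPage) => {
  const all = [];
  let page = 1;
  while (true) {
    const items = await fetchPage(page);
    // "finished" means an empty page, not reaching some reported count
    if (!items.length) break;
    all.push(...items);
    page++;
  }
  return all;
};
```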
I used the Node Octokit client, which is the official one. It has some nice features, including built-in pagination and an iterator, but of course it is subject to all the constraints mentioned above. I also noticed that its iterator doesn’t respect rate limits, so I built my own, which we’ll look at later in this article.
In the end I decided to simply put the packaged results in a gist. It won’t make much sense to read, as it’s compressed, but it turned out to be a simple and effective solution. The cache is updated from time to time using the app we’ll be looking at here – and that’s the approach I settled on.
The GitHub API helpfully returns how many goes you’ve got left, along with how long to wait until the rate limit window resets. There are actually two types of access I make with the client – search and get – each of which has different constraints. I used qottle to manage asynchronous queuing, and created an iterator that paced its requests according to the rate limit information returned with each response.
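Here's a sketch of how a wait time might be derived from those headers. The header names are GitHub's documented ones, but the function itself is illustrative rather than the app's actual code:

```javascript
// Derive how long to pause before the next request from the rate limit
// headers GitHub returns on every response.
//   x-ratelimit-remaining: requests left in the current window
//   x-ratelimit-reset: epoch seconds when the window resets
const waitFor = (headers, now = Date.now()) => {
  const remaining = Number(headers["x-ratelimit-remaining"]);
  const resetMs = Number(headers["x-ratelimit-reset"]) * 1000;
  // still have quota: no need to wait at all
  if (remaining > 0) return 0;
  // otherwise wait until the window resets (never negative)
  return Math.max(0, resetMs - now);
};
```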
As noted, the API doesn’t reliably return the page size you request, but I found that keeping the page size down to 15 gave consistent results. Anything above that started to drop results.
Workarounds and code
Here are a few extracts from the code showing the workarounds for all this. You can see the complete app on GitHub.
To get round the 1000-result maximum per search, I broke the search into pieces by filtering on various file sizes. The simple query required to find Apps Script projects is this.
By making the query multiple times, each time further constrained by size, each request could be made to return fewer than 1000 results – in other words, adding each of these terms to the query
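For illustration, here's one way such size partitioning could be built. The base query (searching for the appsscript.json manifest file) and the size boundaries are my assumptions for the sketch, not necessarily the exact query the app uses:

```javascript
// Assumed base query: every Apps Script project contains this manifest
const BASE_QUERY = "filename:appsscript.json";

// bounds like [0, 500] -> ["size:0..500", "size:>500"]
const sizeRanges = (bounds) => {
  const ranges = [];
  for (let i = 0; i < bounds.length - 1; i++) {
    const lo = i === 0 ? bounds[0] : bounds[i] + 1;
    ranges.push(`size:${lo}..${bounds[i + 1]}`);
  }
  // everything bigger than the last boundary
  ranges.push(`size:>${bounds[bounds.length - 1]}`);
  return ranges;
};

// one sub-search per size range, each hopefully under the 1000 cap
const partitionedQueries = (bounds) =>
  sizeRanges(bounds).map((r) => `${BASE_QUERY} ${r}`);
```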
To minimize rate limit problems, it’s important to keep the number of concurrent requests down. Searching has a very strict limit, but getting data is much more generous. Because of the info returned in the headers by Octokit, I didn’t need to use qottle‘s rate limiting capabilities, but I did need it for concurrency management, using these two queues.
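The app itself uses qottle for this, but the two-queue idea can be sketched with a minimal hand-rolled promise queue: a strict single-slot queue for searches, and a more generous one for gets.

```javascript
// Tiny promise queue with a concurrency cap (a stand-in for qottle,
// just to show the shape of the two-queue setup)
const makeQueue = (concurrent) => {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= concurrent || !waiting.length) return;
    active++;
    const { fn, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next();
      });
  };
  return {
    add: (fn) =>
      new Promise((resolve, reject) => {
        waiting.push({ fn, resolve, reject });
        next();
      }),
  };
};

// searches are heavily rate limited, gets are much more generous
const searchQueue = makeQueue(1);
const getQueue = makeQueue(5);
```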
Using that approach, all the requests could be queued up without worrying about handling rate limits.
Using an iterator
This is a great way to abstract away the details of rate limiting and paging. The idea is that the entire processing is centered around a for await … of loop, with each iteration of the loop delivering a single result. It’s up to the iterator to figure out how long to wait between requests, and whether to fetch a fresh page. It also features a transformer, which processes each row before delivering the iteration value.
This is called from within the for await … of loop and is responsible for delivering the next item to it. The next item might simply be the next item from a page fetched previously, or it may mean going off to get more data, respecting the rate limit while doing so. It’s a bit of a read, because it’s written for general-purpose use so I can use it in other, unrelated projects.
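The shape of such an iterator can be sketched as an async generator; getPage, transform, and delayMs here are hypothetical stand-ins for the fetcher, transformer, and rate-limit wait described above, not the app's actual implementation.

```javascript
// Hides paging and waiting behind for await...of: each iteration
// delivers one transformed item, fetching new pages as needed.
async function* resultIterator({ getPage, transform = (x) => x, delayMs = 0 }) {
  let page = 1;
  while (true) {
    const items = await getPage(page++);
    // finished: an empty page, since the total count can't be trusted
    if (!items.length) return;
    for (const item of items) yield transform(item);
    // respect whatever pause the rate limit calls for before the next page
    if (delayMs) await new Promise((r) => setTimeout(r, delayMs));
  }
}
```

Consuming it is then just `for await (const item of resultIterator({ getPage })) { … }`, with no paging logic in sight.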
The iterator needs to know how to get pages, and whether to wait before getting them, so we pass over a fetcher for Octokit.
And something to untangle the results from Octokit.
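As a sketch of what untangling means: Octokit search responses nest the hits under data.items, while its paginate helper hands back a plain array, so a small helper can normalize either shape (an assumption about the shapes involved, not the app's actual code).

```javascript
// Normalize an Octokit response into a plain array of items
const untangle = (response) => {
  const data = response && response.data;
  // paginate-style: data is already an array of items
  if (Array.isArray(data)) return data;
  // search-style: items are nested under data.items
  if (data && Array.isArray(data.items)) return data.items;
  return [];
};
```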
Those are the main workarounds for dealing with some of the gotchas of the GitHub API.