A few years ago I released a visualizer of all the scripts that Google Apps Script developers have shared on Github. There’s a totally unscripted episode all about it here, and lots of detail on this site (search for scrviz). However, the github api, which is how I obtained the data was becoming more and more cumbersome to get data out of, so I stopped feeding it with updated data. I’ve decided to resurrect the resource, but with different strategies to defeat the github api limitations and quota, and share apps script on github data publicly on bigquery so that others might be able to use it.

In this article I’ll cover some of the data I’ve found and how the I extracted and cataloged the data. I’ll update the data regularily from now. In related articles I’ll cover the technical aspects and code.

The Apps Script on github data

Since Apps Script has its own development environment, what’s on github will only be a subset of global Apps Script activity. However, it will give an insight into the kinds of things it’s being used for.

Here’s the top line scorecard.

Looker

A number of reports are available in Google Looker – You can use the embed below to explore them, or find them here.

Hosting apps script on github data on bigquery

This bigquery public data set contains all the normalized data and reporting views. If you come up with interesting analyses, ping me and I’ll add them to the reports and bigquery views.

Currently you’ll find these tables and views

Some notes on the methodology

The github api and its client (octokit) are quite restrictive in the data you can retrieve and how often you can retrieve it. Ideally I would have just done a query on repos containing appsscript.json manifest files, and that would be it. However in addition to strict rate limits, with very long delays between query attempts, there are also other limitations.

  • The API only discovers files updated in the past year
  • The API only searches up to 1000 files and a restricted number of repos

You can see from the stats above that I was able to come up with a few strategies to get round those. Here are a some which might help if you are working with octokit for some other projects. I use both the graphql and rest endpoints for various parts of the data acquisition process.

Searching for manifest files.

The first step is look for manifest files. We can use the code search rest endpoint to return details about particular filenames, but the limitations mean we wouldn’t get them all with a single query. Instead I looked for appscript.json filenames, but filtered by size – this meant that a series of queries including size filters keeps below the file limitiations, but with the tradeoff of multiple queries. I found this set of size filters allowed me to get everything available.

    [
"<50",
"50..75",
"76..85",
"86..95",
"96..100",
"101..115",
"116..125",
"126..150",
"151..200",
"201..225",
"226..250",
"251..275",
"276..300",
"301..350",
"351..400",
"401..450",
"451..500",
"551..600",
"601..650",
"651..700",
"701..800",
">800",
]
size filters

Looking for older files

Of course this just gave me appscript manifests that had been updated in the last year, and once created they seldom change. The next strategy was to look at all the repositories owned by anyone who had a recent appsscript.json found in the first step. I found about 1400 developers in this phase. Mostly these repos were not apps script related, but I needed to look at them all to see if they were. There were around 88,000 repos and 59million files to examine.

Looking for inactive developers

As we can see from the stats I found about 2700 apps script developers (repository owners) in the end, so clearly only about half the owners had updated manifest files in the past year. Luckily I had data from my previous visualizer which had a the ids of about 2500 owners harvested a few years ago when the github api was more forgiving. I merged these with the 1400 recently active owners, and got to a total of 2700. Repeating the process of examining all their repos I ended up with about 7000 manifests (apps script projects) in 5000 repos. To get to this I had to analyze a total of 137,000 repos and 78 million files.

To build up a cumulative record of all apps script activity ongoing, I can use the owners table from bigquery to seed the queries with owners not found in the initial manifest search

The content

I downloaded and parsed the content from these 7000 manifests and loaded this into bigquery along with the repo and developer data. I found that the octokit graphql endpoint was better for this, as I could batch many queries in a single query to help defeat rate limit annoyances.

Strategies for deducing the scriptId

Normally the scriptId for a project is not available in the files pushed to github, but some developers who commit their projects to github also often use clasp (about 24%), which leaves a signature file in the repo which contains the scriptId. Gasgit does the same thing, so finding a .clasp.json or an info.json in the same folder as an appscript.json allowed me to assign a scriptId to a number of these projects.

Another issue is that many developers clone other repos, taking the claspid of the original repo with them – even though that’s not actually their version of the script. When a duplicate scriptId is detected across multiple repos, I use the one in repo with the earliest creation date. There are still a few other edge cases to sort out so the scriptId info should be taken with a heavy pinch of salt.

Libraries

All the library dependencies are parseable from the manifest file, and I wanted to be able to find out which developers had commited their libraries to github, and who was using them. For this I needed the scriptId deduced earlier which matched the libraryId mentioned in the manifest file. Another strategy to find scriptId for projects using neither clasp nor github, was to match the library dependency userSymbol to any repo names that I found. This was a clue that the matched repo project was probably that scriptId.

Using this info together, we can report not only the libraries used by projects, but also who created and maintains them (if their source code is on github)

Because this relies on the accuracy of the scriptid, library identification still needs some work to improve it.

Hiring developers

One important use for all of this is as a resource for finding developers to work on projects. Since the looker reports are live, clicking on some resource filters the rest of the page. For example if you were looking for someone hireable with a specific set of skills, in a particular location you could find developers easily and vist their repos to see what we’d they’d been up to.

Here’s an example of looking for hireable developers in Japan who’ve had experience of creating webapps.

Of course this assumes that their github profile accurately reflects their location and hireability. If you think you might benefit from this, get your profile up to date! And of course, make sure you commit your apps script code to github.

Run times

Running this from scratch takes an astonishly long time – over 24 hours. The octokit client handles throttling and rate limit issues – and the number of API calls per minute is severely restricted. Rate limit violations are penalized by waits of several minutes. Clearly ,developing under these conditions is challenging, so I’ve come up with a number of strategies to optimize this ongoing. As a comparison, when data is fully cached, it runs the whole thing in just over a minute. I’ll cover some of those strategies in future articles along with code samples.

Next steps

I’ll have missed developers that haven’t been active in many years, since my seed list only goes back to those that have posted a manifest since about 2015. In addition, manifests didn’t exist in the early days of apps script. I can’t recall exactly when they were introduced, and also some developers have them in their .gitignore file.

All this means that there’s probably a bunch more stuff I’d like to find, but without the manifest file we’ll miss the characteristics of the project. Nevertheless there’s definitely scope to catalog more Workspace activity on github, including searching package.json files for API references (so we can see who is using workspace APIS outside of Apps Script), and searching code for references to common Apps Script objects.

Finally, we can use pretty much the same techniques to find and catalog any other domains which have a signature file.

In the meantime please drill into the manifest reports and bigquery apps script on github data set. I’d love to see what you build.