I’ve had a lot of fun with this one: getting headless Puppeteer running on Cloud Run behind a GraphQL endpoint. It lets you do Puppeteer things such as web scraping and taking screenshots by making GraphQL requests, without the bother of hosting Puppeteer yourself.

In this article I’ll show you how to use it (I’ve called it gql-puppet) with Apps Script and Node examples. In future articles I’ll write up how to make and deploy your own server. All the code is on GitHub, with links at the end of the article.

What is puppeteer

Since you’re reading this you probably already know that puppeteer allows you to programmatically access the DOM of a web page. Normally you’d do this from a web app running in a browser, but puppeteer allows you to do it from a server-based app (such as Node or Apps Script). Behind the scenes, it runs a headless (no UI) version of Chrome.

What is GraphQL

If you follow this site, you may have noticed lots of articles about GraphQL. As a powerful alternative to the usual REST API approach, GraphQL is a strongly typed query language for an API. There is also a visual UI for it, GraphiQL, which allows you to test queries before incorporating them into your app. I’ve added GraphiQL to this implementation too, so you can even run gql-puppet without writing any code if you want.

Normally I use Apollo/Express as a GraphQL server, but this time I’m using Mercurius/Fastify, which I highly recommend if you’re starting out on your GQL journey, as it’s quite a bit lighter than the alternative. I’m also using Redis to manage caching and to track rate limits, API keys and usage.

What is Cloud Run

It’s a ‘serverless’ managed platform that allows you to run workloads on Google infrastructure. It’s not free, but it does have a free tier. I’m letting you use my GCP platform to try it out, but to protect against runaway costs you’ll need to email me to get an API key and the endpoint address, and there are also rate limits and usage tracking. I haven’t got round to creating a web app to allow you to apply for a key online yet, but if there’s demand, I’ll make one.

How to use gql-puppet

These examples will be in Apps Script, but I’ll give a couple of Node examples later as well. The GraphQL queries are exactly the same no matter the platform. Take a copy of this script or get the code from GitHub. It contains a number of examples of how to interact with the API.

Secrets

I’ll start with managing secrets. In Apps Script we use the PropertiesService, and in Node I use GCP Secret Manager. For more on integrating bash and Node with Secret Manager see Sharing secrets between Doppler, GCP and Kubernetes. If you are just using Apps Script, or you have some other preferred way of handling secrets in Node, there’s no need to worry about Secret Manager at this time.

Once you have applied for and received an endpoint and api key from me, you can store these in the Apps Script property store of the copy of the script you’ve taken. These are the keys you’ll need to create, substituting in your own values.
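One way to create those two properties is to run a small function once from the Apps Script editor. This is only a sketch: the property names are the ones the example code in this article reads, while storeSecrets and setSecretsOnce are names made up for illustration, and the values are placeholders for what you receive by email.

```javascript
// Property names match the ones read by getProp() in the examples in this article.
const API_KEY_PROP = "gql-puppet-api-key";
const ENDPOINT_PROP = "gql-puppet-endpoint";

// Hypothetical helper: writes both values to a property store.
// The store is passed in so the function can be tried outside Apps Script too.
const storeSecrets = (store, { apiKey, endpoint }) => {
  if (!apiKey || !endpoint) {
    throw new Error("both apiKey and endpoint are required");
  }
  store.setProperties({
    [API_KEY_PROP]: apiKey,
    [ENDPOINT_PROP]: endpoint,
  });
  return store;
};

// Run once from the Apps Script editor, then remove the real values from the code.
const setSecretsOnce = () =>
  storeSecrets(PropertiesService.getScriptProperties(), {
    apiKey: "YOUR-API-KEY", // placeholder
    endpoint: "https://YOURENDPOINT/graphql", // placeholder
  });
```

Passing the store in as an argument is just a convenience so the helper can be checked with a stand-in object. In Apps Script you would only call setSecretsOnce, or enter the same two properties by hand in the script properties section of the project settings.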

Puppeteer Page

The main component we’ll be starting off with is the Page. You provide a url, the page is rendered by gql-puppet, and then you can start playing with the content.

Here’s a query and payload to take a screenshot of this Wikipedia page.

const url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
const briefScreenshotPayload = {
  variables: {
    url,
  },
  query: `
    query ($url: URL!) {
      page (url: $url) {
        url { href }
        screenshot {
          mimeType
          base64Bytes
        }
      }
    }`,
};
brief screenshot

To execute and write to Drive

I’m using some reusable functions to do the work here – we’ll use them all over so this is the last time you’ll see them – all the code is in the script you’ve taken a copy of.

const testBriefScreenshot = () => {
  const { url, screenshot } = test({ payload: briefScreenshotPayload, prop: 'screenshot' })
  return toDrive("brief screenshot", url.href, screenshot)
}

//-- reusable functions used in all tests
const test = ({ payload, prop }) => {

  const getApiKey = () => getProp("gql-puppet-api-key")
  const getApiEndpoint = () => getProp("gql-puppet-endpoint")

  // the api key travels in a custom header
  const headers = {
    "x-gql-puppet-api-key": getApiKey()
  }

  // do the fetch
  const response = UrlFetchApp.fetch(getApiEndpoint(), {
    payload: JSON.stringify(payload),
    contentType: "application/json",
    muteHttpExceptions: true,
    headers,
    method: "POST"
  })

  // check we got some data
  const data = getData(response)

  if (prop && !data?.page?.[prop]) {
    throw new Error(`no data received for ${prop}`)
  }

  return data.page

};

// make sure we have a success code
const checkResponse = (response) => {
  const code = response.getResponseCode()
  if (code !== 200) {
    if (code === 429) {
      console.log('rate limit exceeded')
      console.log(response.getHeaders())
    }
    throw new Error('failed:' + response.getContentText())
  }
}

// extract and parse the gql response
const getData = (response) => {
  checkResponse(response)
  const { data, errors } = JSON.parse(response.getContentText())
  if (errors) {
    throw new Error('failed query:' + JSON.stringify(errors))
  }
  if (!data) {
    // the query itself is not in scope here, so report the raw response instead
    throw new Error('no data from query:' + response.getContentText())
  }
  return data
}


// api key & endpoint for cloud run
const getStore = () => PropertiesService.getScriptProperties()
const getProp = (prop) => {
  const value = getStore().getProperty(prop)
  if (isNU(value)) throw new Error('expected value for property ' + prop)
  return value
}

// write the decoded data to Drive and return the new file
const toDrive = (prop, url, data) => {
  const file = DriveApp.createFile(blobber(cleanerName(url), data))
  console.log('wrote', prop, 'to', file.getName())
  return file
}

const isUndefined = (value) => typeof value === typeof undefined
const isNull = (value) => value === null
const isNU = (value) => isNull(value) || isUndefined(value)

// make blobs from base64
const makeBlob = ({ base64Bytes, mimeType }) =>
  Utilities.newBlob(Utilities.base64Decode(base64Bytes), mimeType)

// turn b64 to blob and derive a name from url && mimetype
const blobber = (name, { base64Bytes, mimeType }) =>
  makeBlob({ base64Bytes, mimeType })
    .setName(`${name}.${mimeType.replace(/.*\//, "")}`)

// derive a name from url
const cleanerName = (name) => name.replace(/\//g, "-").replace(/\./g, "-")
execute and write to Drive

Here’s the screenshot

Using GraphiQL

There are many puppeteer options implemented in gql-puppet, and to find out what they are and to try out queries, you can use GraphiQL. The normal API endpoint is https://YOURENDPOINT/graphql, but you’ll find the visual endpoint for GraphiQL at https://YOURENDPOINT/graphiql.

Here’s an example of a simple query to find all the tables on a given web page and destructure the content into rows ready to write to a spreadsheet. In the GraphiQL headers section you’ll need to enter your API key as below. The selector “table” will pick up every table on the page. You can use any selector accepted by the DOM function document.querySelectorAll() to be as specific as you like in identifying the tables required.

Notice that the documentation window on the left hand side gives details of which queries and arguments are supported, and drilling down further will show which fields can be returned. The UI will validate your query as you construct it and (alt/space) will bring up the possible values you can enter at any stage.

Using GraphiQL is a great way to practise your queries before committing them to code, and it’s already built in to gql-puppet.

GraphiQL variables

Instead of hard coding the query arguments, I prefer to use the variables feature. Here’s the same query rewritten with variables. You’ll find this more convenient when you start calling the API from within your script.

Using Postman

Postman also understands GraphQL, so if you prefer you can use it to construct your queries. Use the regular ENDPOINT/graphql and remember to enter your api key in the Postman header.

Consuming data

In an earlier example, I showed how to write a screenshot to Drive. Let’s use a query similar to the one above to create a spreadsheet, with a sheet for each table on the page.

const url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

const scratchSsId = '1b6i5PEZ2IYStML3r9161u1dXioylYCp1stWvwB7k1qo'

const tablesPayload = {
  variables: {
    url,
    "selector": "table"
  },
  query: `
    query ($url: URL!, $selector: Value!) {
      page(url: $url) {
        url { href }
        tables(selector: $selector) {
          count
          tables {
            headers
            rows
          }
        }
      }
    }
  `
}

const testTables = () => {
  const { tables, url } = test({ payload: tablesPayload, prop: 'tables' })
  toSheet({ url, tables, id: scratchSsId })
}

const toSheet = ({ url, tables, id }) => {

  const ss = SpreadsheetApp.openById(id)
  return tables.tables.map((table, i) => {
    // one sheet per table, named after the url and the table index
    const name = `${cleanerName(url.href)}-${i}`
    const sheet = ss.getSheetByName(name) ||
      ss.insertSheet().setName(name)
    sheet.clearContents()
    const values = table.headers.concat(table.rows)
    // setValues needs a rectangular array, so pad short rows to the widest one
    const maxWidth = values.reduce((p, c) => Math.max(p, c.length), 0)
    const paddedValues = values.map(
      v => v.concat(Array.from({ length: maxWidth - v.length }))
    )
    sheet
      .getRange(1, 1)
      .offset(0, 0, paddedValues.length, maxWidth)
      .setValues(paddedValues)
    return sheet
  })

}
write tables from a web site to a spreadsheet
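The padding step in toSheet is worth a closer look, because setValues expects a rectangular array and scraped tables often have rows of different lengths. Here is that step pulled out as a standalone sketch. padRows is a hypothetical name, and padding with empty strings rather than undefined is an assumption on my part, not what the script above does.

```javascript
// Pads every row to the width of the widest row so the result is rectangular.
const padRows = (rows, filler = "") => {
  const maxWidth = rows.reduce((widest, row) => Math.max(widest, row.length), 0);
  return rows.map((row) =>
    row.concat(Array.from({ length: maxWidth - row.length }, () => filler))
  );
};

// Example: one full-width row and two shorter ones.
const padded = padRows([["a", "b", "c"], ["1", "2"], ["x"]]);
// padded is [["a","b","c"], ["1","2",""], ["x","",""]]
```

The original rows are left untouched because concat returns new arrays, so the same table data can be reused elsewhere after writing it to a sheet.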

Here’s the result

Create a pdf from a web site

This is similar to the screenshot method, except this time we’ll use a few options to set the margins, paper format and landscape mode. You’ll see some of the options gql-puppet supports in the GraphiQL documentation explorer, and find more detail here.

Now, let’s do the query in Apps Script and write it to Drive.


const pdfMarginsPayload = {
  variables: {
    url,
    "options": {
      "paperFormat": "letter",
      "landscape": true,
      "pdfMargin": {
        "top": 6,
        "bottom": 6,
        "left": 10,
        "right": 10
      }
    }
  },
  query: `
    query ($options: PdfOptionsInput, $url: URL!) {
      page(url: $url) {
        pdf(options: $options ) {
          base64Bytes
          mimeType
        }
      }
    }`
}

const testPdfMargins = () => {
  const { pdf } = test({ payload: pdfMarginsPayload, prop: 'pdf' })
  return toDrive("pdf-with-margins", "margins-" + url, pdf)
}
pdf with options

Here’s the result

There are a number of other examples in the example script for you to play with.

How does it all work?

Puppeteer can do many things, and it would be a life’s work to expose prebaked methods to access them all. Behind the scenes, gql-puppet is sending code to puppeteer which runs in the headless instance of Chrome that puppeteer manages – so you can send and execute any puppeteer supported code there to be evaluated.

Eval method

gql-puppet provides a way for you to send any code that would execute in the DOM for puppeteer to run on your behalf.

Let’s do a little bit of custom coding and have gql-puppet submit it for execution, capture the result and return it as a response to a gql query.

Custom code to get all the image links on a page

In this example, the code property in the query variables contains the text of a function that I’d like puppeteer to run on the web page specified by the url property. The arg property will be passed to the function when it runs – so in this case it’ll pick up the src property of all images on the page.

const evalPayload = {
  variables: {
    // the text of a function that puppeteer will run in the page
    code: `(selector) => {
      const elements = document.querySelectorAll (selector)
      return Array.from(elements)
        .map (element=>({
          src: element.src
        }))
    }
    `,
    arg: 'img',
    url
  },
  query: `
    query ($url: URL!, $arg: JSON, $code: String!) {
      page (url: $url) {
        eval (code: $code, arg: $arg) {
          result
        }
      }
    }
  `
}
const testEval = () => {
  // rename on destructure: eval is not a legal binding name in strict mode
  const { eval: evalResponse } = test({ payload: evalPayload, prop: 'eval' })
  console.log(evalResponse.result)
}
eval query

and a snippet of the result

[ { src: 'https://en.wikipedia.org/static/images/icons/wikipedia.png' },
{ src: 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-en.svg' },
{ src: 'https://en.wikipedia.org/static/images/mobile/copyright/wikipedia-tagline-en.svg' },
{ src: 'https://upload.wikimedia.org/wikipedia/en/thumb/1/1b/Semi-protection-shackle.svg/20px-Semi-protection-shackle.svg.png' },
......]
eval result

This eval method gives ultimate flexibility to run any puppeteer-targeted code from the comfort of GraphiQL, Postman or your own app. All you need is to be able to send a POST request.
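Writing the function as a string inside a template literal means your editor can’t check it for you. One possible alternative, sketched here, is to write a real function and serialize it with toString() when building the payload. makeEvalPayload and getLinks are hypothetical names of my own, and I’m assuming the eval query accepts any function text in the same form as the example above.

```javascript
// Hypothetical helper: builds an eval payload from a real function.
const makeEvalPayload = ({ url, fn, arg }) => ({
  variables: { url, arg, code: fn.toString() },
  query: `
    query ($url: URL!, $arg: JSON, $code: String!) {
      page (url: $url) {
        eval (code: $code, arg: $arg) {
          result
        }
      }
    }`,
});

// This function runs in the page, not in your script, so it can only use
// its argument and browser globals such as document.
const getLinks = (selector) =>
  Array.from(document.querySelectorAll(selector)).map((element) => ({
    href: element.href,
    text: element.textContent.trim(),
  }));

const linksPayload = makeEvalPayload({
  url: "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population",
  fn: getLinks,
  arg: "a[href]",
});
```

The payload can then be sent in the usual way, for example with test({ payload: linksPayload, prop: 'eval' }) in Apps Script. Because only the function text is sent, the function can’t refer to any variables defined in your own script; anything it needs has to go through arg.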

Prebaked evals

A number of code snippets are already prebaked into gql-puppet (for example tables, elements). It uses the exact same method as above to have puppeteer run code snippets, then formats the result it gets back. If you check out the plugins folder on GitHub you’ll see how these are written.

I’ll probably add some other common prebaked snippets over time, but you are welcome to contribute any of your own that you think might be useful to others via the GitHub repo for gql-puppet.

Rate limiting

Since Cloud Run is not free, I have to rate limit it to help defend against abuse – and of course it only works if you have a valid API key – which I’ll let you have to try out gql-puppet if you contact me.

A rate limit error returns an HTTP status code of 429, and the response headers contain something like this (the actual values will probably be different). Wait for the number of seconds given in retry-after before trying again.

'x-ratelimit-limit': '6',
'x-ratelimit-remaining': '0',
'x-ratelimit-reset': '60',
'retry-after': '60',
rate limit response headers
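If you want your script to wait and retry by itself, something along these lines might do. It is a sketch rather than part of the example script: retryDelayMs and fetchWithRetry are hypothetical names, I’m assuming retry-after is always a number of seconds as in the headers above, and the fetch and sleep functions are passed in so the logic doesn’t depend on Apps Script.

```javascript
// Returns how long to wait (in ms) before retrying, or null if no retry applies.
const retryDelayMs = (code, headers) => {
  if (code !== 429) return null;
  // header names may arrive in any case, so normalise them first
  const lower = Object.fromEntries(
    Object.entries(headers).map(([k, v]) => [k.toLowerCase(), v])
  );
  const seconds = Number(lower["retry-after"]);
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : null;
};

// fetcher returns a response with getResponseCode() and getHeaders();
// sleeper pauses for the given number of milliseconds.
const fetchWithRetry = ({ fetcher, sleeper, maxAttempts = 3 }) => {
  for (let attempt = 1; ; attempt++) {
    const response = fetcher();
    const delay = retryDelayMs(response.getResponseCode(), response.getHeaders());
    // give up retrying when it wasn't a rate limit error or attempts ran out
    if (delay === null || attempt >= maxAttempts) return response;
    sleeper(delay);
  }
};
```

In Apps Script the fetcher could be a function wrapping the UrlFetchApp.fetch call from the test function, and the sleeper could be Utilities.sleep. The last response is returned as is, so it still needs to go through checkResponse.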

Usage

You can query your usage like this.

Node fetching

You can use the exact same query payload examples in Node (or anything else) as shown in the Apps Script samples. The only change is the fetch method – I normally use got for fetching in Node, so an example query would look like this.

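// app is assumed to hold the endpoint url, and headers the same
// x-gql-puppet-api-key header used in the Apps Script examples
const result = await got
  .post(app, {
    json: payload,
    headers
  })
  .json();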
Node version
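If you’d rather not add a dependency, recent versions of Node (18 and later) have fetch built in. Here’s a rough equivalent, on the assumption that the server behaves the same as it does for the Apps Script calls. buildRequest and gqlFetch are names I’ve made up for this sketch.

```javascript
// Builds the arguments for fetch(); kept separate so it can be checked
// without making a network call.
const buildRequest = ({ endpoint, apiKey, payload }) => [
  endpoint,
  {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-gql-puppet-api-key": apiKey,
    },
    body: JSON.stringify(payload),
  },
];

// Sends the payload and returns the data part of the gql response.
const gqlFetch = async ({ endpoint, apiKey, payload }) => {
  const response = await fetch(...buildRequest({ endpoint, apiKey, payload }));
  if (!response.ok) {
    throw new Error(`failed: ${response.status} ${await response.text()}`);
  }
  const { data, errors } = await response.json();
  if (errors) throw new Error(`failed query: ${JSON.stringify(errors)}`);
  return data;
};
```

Calling it with one of the earlier payloads, for example await gqlFetch({ endpoint, apiKey, payload: briefScreenshotPayload }), should give you the same data.page object that the Apps Script test function returns.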

Hosting your own gql-puppet service

The code for gql-puppet is on GitHub, but you’ll probably need some help with deploying your own service to Cloud Run as there are various gotchas.

Here’s the end-to-end story on how to deploy this service to Cloud Run: Setting up a GraphQL server on Cloud Run

Links

Puppeteer: https://pptr.dev/

Apps Script demo scripts: take a copy or GitHub

gql-puppet: GitHub

GraphQL: https://graphql.org/

Setting up a GraphQL server on Cloud Run