Analyzing site content with GAS

One of the things I wanted to play around with was visualizing this site's content with d3. But first I needed to create something that would generate data from the site (any domain site, actually). It's easy enough to modify for non-domain sites, but I'm starting with domains, since that's what I have.

To do this we'll use a couple of script services.
  • Google Drive - results are either served up directly as JSON or JSONP, or written as a JSON file on Google Drive for later consumption.
  • Content service - to serve up either the data results or the file location results.
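As a rough sketch of the Content service side, here is one way the choice between JSON and JSONP could be made. The helper names formatResponse and toTextOutput are hypothetical and not taken from the original code. The only Apps Script calls used are ContentService.createTextOutput and setMimeType, and they are guarded so the block also runs outside Apps Script.

```javascript
// Hypothetical helper: decide how a result object should be serialized.
// If a callback name is supplied, wrap the JSON in a function call (JSONP).
function formatResponse(result, callback) {
  var json = JSON.stringify(result);
  if (callback) {
    return { body: callback + '(' + json + ');', mime: 'JAVASCRIPT' };
  }
  return { body: json, mime: 'JSON' };
}

// Hypothetical helper: hand the formatted text to the Content service.
// ContentService only exists inside Apps Script, so fall back to the raw
// text when the code is run anywhere else.
function toTextOutput(formatted) {
  if (typeof ContentService === 'undefined') {
    return formatted.body;
  }
  return ContentService.createTextOutput(formatted.body)
      .setMimeType(ContentService.MimeType[formatted.mime]);
}
```

A doGet handler could then end with something like return toTextOutput(formatResponse(result, e.parameter.callback)), assuming the callback name arrives as a URL parameter.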

Objective

Ultimately this data will be used for visualization, which I'll cover in a separate section. First of all I'm going to scrape the site, looking for and counting occurrences of specific tags and reporting them. That way we can generate some visualizations showing which topics are related and where to find them. The web app - tagsite - will take these URL arguments:

 parameter  example  purpose
 tagdomain  tagdomain=mcpher.com  the name of the domain to which the site is mapped
 tagsite  tagsite=share  the name of the site
 tagoutput  tagoutput=drive|rest  whether to write the result to Drive or return it as a REST response; the default is drive
 tagfile  tagfile=tagsite.json  the name of the file to write to Drive if tagoutput=drive
 callback  callback=somefunction  the name of a callback function; if specified, JSONP rather than JSON is returned
 tag descriptions  &d3=d3.js,d3js,d3&excel=excel,xl  a list of tag=synonym1,synonym2... pairs. For each tag given as a parameter, every occurrence of each of its synonyms is counted. A synonym can use regex syntax if needed
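To make the table concrete, here is a minimal sketch of how the parameters might be split into options and tag descriptions. It assumes the parameters arrive as a plain object of strings (as e.parameter does in an Apps Script doGet). The names RESERVED and parseParameters are hypothetical. Splitting on commas is also an assumption, and it means a synonym regex cannot itself contain a comma.

```javascript
// Parameter names treated as options rather than tags (taken from the table above).
var RESERVED = ['tagdomain', 'tagsite', 'tagoutput', 'tagfile', 'callback'];

// Hypothetical helper: split a parameter object into options and a tag map.
function parseParameters(params) {
  var options = { tagoutput: 'drive' };   // documented default
  var tagmap = [];
  Object.keys(params).forEach(function (key) {
    if (RESERVED.indexOf(key) !== -1) {
      options[key] = params[key];
    } else {
      // Anything that is not an option is a tag with a comma-separated synonym list.
      tagmap.push({ name: key, values: String(params[key]).split(',') });
    }
  });
  return { options: options, tagmap: tagmap };
}

// Example usage with the kind of arguments shown in the table.
var parsed = parseParameters({
  tagdomain: 'mcpher.com',
  tagsite: 'share',
  d3: 'd3js,d3.js,d3',
  excel: 'excel,xl'
});
```

Here parsed.options.tagoutput falls back to drive because no tagoutput was given, and parsed.tagmap holds one entry each for d3 and excel.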

An example

REST
Let's take an example (it does take a while to run - there's a lot of content). This will create some relationship data for each page on the site for the given tags, and return plain JSON.


https://script.google.com/macros/s/AKfycbz4Q0o4R3Kq9KubpgOSU5iy4eY6rcN2KcqGzo6GHi6hxZUM0bA/exec?tagdomain=mcpher.com&tagoutput=rest&tagsite=share&d3=d3js,d3.js,d3&vba=vba,vb&excel=excel,xl&gas=gas,script
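A URL like the one above can be assembled from its parts rather than typed by hand. This is only a sketch, and buildTagsiteUrl is a hypothetical helper name. Commas are deliberately left unencoded so the result looks the same as the hand-written URL.

```javascript
// Hypothetical helper: build a tagsite URL from options and tag synonym lists.
function buildTagsiteUrl(execUrl, options, tags) {
  var pairs = [];
  Object.keys(options).forEach(function (key) {
    pairs.push([key, options[key]]);
  });
  Object.keys(tags).forEach(function (key) {
    pairs.push([key, tags[key].join(',')]);
  });
  var query = pairs.map(function (pair) {
    // Encode each part, then restore commas for readability.
    return encodeURIComponent(pair[0]) + '=' +
        encodeURIComponent(pair[1]).replace(/%2C/g, ',');
  }).join('&');
  return execUrl + '?' + query;
}

// Example usage with the values from the REST example.
var restUrl = buildTagsiteUrl(
  'https://script.google.com/macros/s/AKfycbz4Q0o4R3Kq9KubpgOSU5iy4eY6rcN2KcqGzo6GHi6hxZUM0bA/exec',
  { tagdomain: 'mcpher.com', tagoutput: 'rest', tagsite: 'share' },
  { d3: ['d3js', 'd3.js', 'd3'], vba: ['vba', 'vb'], excel: ['excel', 'xl'], gas: ['gas', 'script'] }
);
```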


Results
You'll get back an array with one item for each page in the site. The first element looks like this. The counts are the number of times each synonym was encountered on a given page.
{
    "data": [
        {
            "parent": "gassites",
            "name": "gastags",
            "url": "https://sites.google.com/a/mcpher.com/share/Home/excelquirks/gassites/gastags",
            "tags": {
                "tagmap": [
                    {
                        "name": "gas",
                        "values": [
                            "gas",
                            "script"
                        ],
                        "counts": [
                            0,
                            1
                        ]
                    },
                    {
                        "name": "d3",
                        "values": [
                            "d3js",
                            "d3.js",
                            "d3"
                        ],
                        "counts": [
                            1,
                            1,
                            4
                        ]
                    },
                    {
                        "name": "vba",
                        "values": [
                            "vba",
                            "vb"
                        ],
                        "counts": [
                            0,
                            0
                        ]
                    },
                    {
                        "name": "excel",
                        "values": [
                            "excel",
                            "xl"
                        ],
                        "counts": [
                            2,
                            1
                        ]
                    }
                ]
            }
        },
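Counts in this shape could be produced along the following lines. This is a sketch under stated assumptions, not the original implementation: each synonym is treated as a case-insensitive regular expression, matches are non-overlapping, and no word boundaries are applied, so a short synonym such as d3 is also counted inside d3js. The name countSynonyms is hypothetical.

```javascript
// Hypothetical helper: count matches of each synonym of one tag in some page text.
// Assumption: synonyms are regular expressions, matched globally and ignoring case.
function countSynonyms(text, tag) {
  var counts = tag.values.map(function (synonym) {
    var matches = text.match(new RegExp(synonym, 'gi'));
    return matches ? matches.length : 0;   // match() returns null when nothing is found
  });
  return { name: tag.name, values: tag.values, counts: counts };
}

// Example usage on a small piece of text.
var sample = countSynonyms(
  'Excel and xl, then EXCEL again',
  { name: 'excel', values: ['excel', 'xl'] }
);
```

Because a dot in a regular expression matches any character, a synonym like d3.js would also match d3-js. Escape the dot if only the literal form should count.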

DRIVE
In this case, we want to do the same thing, but this time write the result to Google Drive.

https://script.google.com/macros/s/AKfycbz4Q0o4R3Kq9KubpgOSU5iy4eY6rcN2KcqGzo6GHi6hxZUM0bA/exec?tagdomain=mcpher.com&tagoutput=drive&tagfile=play.json&tagsite=share&d3=d3js,d3.js,d3&vba=vba,vb&excel=excel,xl&gas=gas,script

Results
What gets returned is a description of the Drive file. The "hosted" property is a link to the created JSON file, and is the one you should use for getting data into your web app. Here's the link to the live data.

{
    "data": [],
    "file": {
        "url": "https://docs.google.com/a/mcpher.com/file/d/0B92ExLh4POiZTFgwcWtXUG1qVU0/edit?usp=drivesdk",
        "name": "play.json",
        "id": "0B92ExLh4POiZTFgwcWtXUG1qVU0",
        "download": "https://docs.google.com/a/mcpher.com/uc?id=0B92ExLh4POiZTFgwcWtXUG1qVU0&export=download",
        "hosted": "https://googledrive.com/host/0B92ExLh4POiZTFgwcWtXUG1qVU0"
    }
}
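A response of this shape could be put together roughly as follows. This is a sketch that assumes the file is created with DriveApp.createFile(name, content, mimeType) and described through the file's getId, getName and getUrl methods. The original code may well have used a different service. The download and hosted URL patterns are simply copied from the sample above and are not verified against current Drive behaviour. The name writeResultToDrive is hypothetical, and the Drive object is passed in so the function can be tried outside Apps Script with a stand-in.

```javascript
// Hypothetical helper: write the result as a JSON file and describe the file.
// `drive` should behave like Apps Script's DriveApp.
function writeResultToDrive(drive, fileName, result, domain) {
  var file = drive.createFile(fileName, JSON.stringify(result), 'application/json');
  var id = file.getId();
  return {
    data: [],   // the data itself lives in the file, not in the response
    file: {
      url: file.getUrl(),
      name: file.getName(),
      id: id,
      // Assumed URL patterns, copied from the sample response.
      download: 'https://docs.google.com/a/' + domain + '/uc?id=' + id + '&export=download',
      hosted: 'https://googledrive.com/host/' + id
    }
  };
}
```

Inside Apps Script the call would look something like writeResultToDrive(DriveApp, 'play.json', result, 'mcpher.com').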

Dependencies

Normally I reference a shared library for GAS stuff (see Using the mcpher library in your code), but this is very straightforward and all the code is below. There are no library references needed.

main code


Now let's do something with the data - see Site data to sheets.
For help and more information, join our forum, follow the blog, or follow me on Twitter.

For more stuff, see my book - Going Gas. All formats are available now from O'Reilly, Amazon and all good bookshops. You can also read a preview on O'Reilly.




