One of the things I wanted to play around with was visualizing this site’s content with d3. But first I needed to create something that would generate data from the site (any domain site, actually). It’s easy enough to modify for non-domain sites, but I’m starting with domains, since that’s what I have.
To do this we’ll use a couple of script services.
- Google Drive – results will either be served up as JSON or JSONP, or written as a JSON file on Google Drive for later consumption.
- Content service – to serve up either data results, or file location results
Objective
Ultimately this data will be used for visualization. I’ll cover that in a separate section. First of all I’m going to scrape the site, looking for and counting occurrences of specific tags and reporting them. That way we can generate some visualizations showing which topics are related and where to find them. The web app – tagsite – will take these URL arguments:
parameter | example | purpose |
tagdomain | tagdomain=mcpher.com | the name of the domain to which the site is mapped |
tagsite | tagsite=share | the name of the site |
tagoutput | tagoutput=drive or tagoutput=rest | whether to write the result to Drive or return it as a REST response. The default is drive |
tagfile | tagfile=tagsite.json | name of file to write to drive if tagoutput=drive |
callback | callback=somefunction | name of a callback function. if specified then jsonp rather than json will be returned |
tag descriptions | &d3=d3.js,d3js,d3&excel=excel,xl | a list of tag=synonym1,synonym2… pairs. For each tag given as a parameter, every occurrence of each of its synonyms is counted. A synonym can use regex syntax if you need it |
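To make the tag parameters concrete, here is a minimal sketch of how they could be parsed and counted. This is not the actual tagsite code: the function names `parseTagParams` and `countTags` are made up for illustration, the case-insensitive matching is an assumption, and the `params` object simply stands in for whatever the web app receives as its URL arguments.

```javascript
// Parameter names used by the web app itself (from the table above).
// Anything else is treated as a tag description.
const RESERVED = ["tagdomain", "tagsite", "tagoutput", "tagfile", "callback"];

// Hypothetical helper: turn { d3: "d3js,d3.js,d3", tagsite: "share" } into
// [{ name: "d3", values: ["d3js", "d3.js", "d3"] }].
function parseTagParams(params) {
  return Object.keys(params)
    .filter((key) => !RESERVED.includes(key))
    .map((key) => ({
      name: key,
      values: params[key]
        .split(",")
        .map((s) => s.trim())
        .filter((s) => s.length > 0),
    }));
}

// Hypothetical helper: count matches of every synonym in a page's text.
// Each synonym is used as a regex source; the "i" flag is an assumption.
function countTags(text, tags) {
  return tags.map((tag) => ({
    name: tag.name,
    values: tag.values,
    counts: tag.values.map(
      (synonym) => (text.match(new RegExp(synonym, "gi")) || []).length
    ),
  }));
}

const tags = parseTagParams({
  tagsite: "share",
  tagoutput: "rest",
  d3: "d3js,d3.js,d3",
  excel: "excel,xl",
});
const tagmap = countTags("Charting Excel data with d3.js", tags);
console.log(JSON.stringify(tagmap, null, 2));
```

Because synonyms are treated as regular expressions, overlapping ones are counted independently: `d3` also matches inside `d3.js`, and the `.` in `d3.js` matches any character. Splitting on commas also means a synonym cannot itself contain a comma in this sketch.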
An example
REST
Let’s take an example (it does take a while to run – there’s a lot of content). This will create some relationship data for each page on the site for the given tags, and return straight JSON.
https://script.google.com/macros/s/AKfycbz4Q0o4R3Kq9KubpgOSU5iy4eY6rcN2KcqGzo6GHi6hxZUM0bA/exec?tagdomain=mcpher.com&tagoutput=rest&tagsite=share&d3=d3js,d3.js,d3&vba=vba,vb&excel=excel,xl&gas=gas,script
Results
You are going to get back an array with one item for each page in the web site. It starts with a first element like the one below. The counts are the number of times that each synonym is encountered on a given page.
{
  "data": [
    {
      "parent": "gassites",
      "name": "gastags",
      "url": "https://ramblings.mcpher.com/gas-and-sites/analyzing-site-content-with-gas/",
      "tags": {
        "tagmap": [
          { "name": "gas", "values": [ "gas", "script" ], "counts": [ 0, 1 ] },
          { "name": "d3", "values": [ "d3js", "d3.js", "d3" ], "counts": [ 1, 1, 4 ] },
          { "name": "vba", "values": [ "vba", "vb" ], "counts": [ 0, 0 ] },
          { "name": "excel", "values": [ "excel", "xl" ], "counts": [ 2, 1 ] }
        ]
      }
    },
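As a rough illustration of how this structure might be consumed, the sketch below collapses each page's tagmap into one total per tag. The `response` object is a trimmed copy of the element shown above, and `totalsByPage` is a hypothetical helper name, not part of tagsite.

```javascript
// Trimmed copy of the first element of the response shown above.
const response = {
  data: [
    {
      parent: "gassites",
      name: "gastags",
      tags: {
        tagmap: [
          { name: "gas", values: ["gas", "script"], counts: [0, 1] },
          { name: "d3", values: ["d3js", "d3.js", "d3"], counts: [1, 1, 4] },
          { name: "vba", values: ["vba", "vb"], counts: [0, 0] },
          { name: "excel", values: ["excel", "xl"], counts: [2, 1] },
        ],
      },
    },
  ],
};

// Hypothetical helper: sum the synonym counts of every tag, page by page.
function totalsByPage(pages) {
  return pages.map((page) => {
    const totals = {};
    for (const tag of page.tags.tagmap) {
      totals[tag.name] = tag.counts.reduce((sum, n) => sum + n, 0);
    }
    return { name: page.name, parent: page.parent, totals };
  });
}

const summary = totalsByPage(response.data);
console.log(summary[0].totals); // gas: 1, d3: 6, vba: 0, excel: 3
```

Summing across synonyms is only one possible choice. Where synonyms overlap (d3 also matches inside d3js), the total overstates the number of distinct mentions, so taking the maximum per tag might suit some visualizations better.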
DRIVE
In this case, we want to do the same thing, but this time write the result to Google Drive.
https://script.google.com/macros/s/AKfycbz4Q0o4R3Kq9KubpgOSU5iy4eY6rcN2KcqGzo6GHi6hxZUM0bA/exec?tagdomain=mcpher.com&tagoutput=drive&tagfile=play.json&tagsite=share&d3=d3js,d3.js,d3&vba=vba,vb&excel=excel,xl&gas=gas,script
Results
What gets returned is a description of the Drive file. The “hosted” property is a link to the created JSON file and is the one you should use for getting data into your web app.
{
  "data": [],
  "file": {
    "url": "https://docs.google.com/a/mcpher.com/file/d/0B92ExLh4POiZTFgwcWtXUG1qVU0/edit?usp=drivesdk",
    "name": "play.json",
    "id": "0B92ExLh4POiZTFgwcWtXUG1qVU0",
    "download": "https://docs.google.com/a/mcpher.com/uc?id=0B92ExLh4POiZTFgwcWtXUG1qVU0&export=download",
    "hosted": "https://googledrive.com/host/0B92ExLh4POiZTFgwcWtXUG1qVU0"
  }
}
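Both example requests above share the same query-string shape: the fixed tagsite parameters first, then one parameter per tag. If you want to generate such URLs rather than type them, something like the following sketch would do. `buildTagsiteUrl` is a hypothetical helper, not part of tagsite, and encoding each synonym separately so the commas stay literal is my own choice.

```javascript
// Hypothetical helper: assemble a tagsite request URL.
// execUrl - the published web app URL
// options - the fixed parameters (tagdomain, tagsite, tagoutput, ...)
// tags    - an object mapping each tag name to its array of synonyms
function buildTagsiteUrl(execUrl, options, tags) {
  const pairs = [];
  for (const [key, value] of Object.entries(options)) {
    pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(value));
  }
  for (const [tag, synonyms] of Object.entries(tags)) {
    // Encode each synonym on its own so the separating commas stay literal.
    const list = synonyms.map((s) => encodeURIComponent(s)).join(",");
    pairs.push(encodeURIComponent(tag) + "=" + list);
  }
  return execUrl + "?" + pairs.join("&");
}

const driveUrl = buildTagsiteUrl(
  "https://script.google.com/macros/s/AKfycbz4Q0o4R3Kq9KubpgOSU5iy4eY6rcN2KcqGzo6GHi6hxZUM0bA/exec",
  { tagdomain: "mcpher.com", tagoutput: "drive", tagfile: "play.json", tagsite: "share" },
  {
    d3: ["d3js", "d3.js", "d3"],
    vba: ["vba", "vb"],
    excel: ["excel", "xl"],
    gas: ["gas", "script"],
  }
);
console.log(driveUrl);
```

With these inputs the result is the same URL as the DRIVE example. Synonyms that use regex characters would come out percent-encoded, which the receiving end should decode as usual.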
Dependencies
Normally I reference a shared library for GAS stuff (see Using the mcpher library in your code), but this is very straightforward and needs no library references. All the code is below.
Main code here
Now let’s do something with the data – see Site data to sheets.