
Google Vision and OCR

This is the first step in Making sense of OCR – getting a PDF turned into a JSON map.


Mechanics

First of all, just load your PDF file somewhere in Cloud Storage, then I'll use this code to retrieve and OCR it. Notice I'll be referring to a secrets file in the code – this is where I store my credentials and other parameters. All the code will be on GitHub – I'll give you the repository later – but you'll need to make your own secrets file, and you'll also need to download a service account credentials JSON file with the ability to read and write the JSON files.
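The secrets file itself never goes on GitHub, but a minimal sketch of what it might look like is below – the file names, bucket name and structure here are just illustrative assumptions; all the rest of the code relies on is a getGcpCreds({mode}) method returning the credentials (and bucket) for a given run mode.

// private/visecrets.js - a sketch only: substitute your own buckets and service account key files
const modes = {
  // 'lv' is the default run mode used in index.js
  lv: {
    credentials: require('./service-account-lv.json'),
    bucket: 'your-ocr-bucket'
  },
  production: {
    credentials: require('./service-account-production.json'),
    bucket: 'your-ocr-bucket-production'
  }
};

module.exports = {
  // return the credentials package for the requested mode
  getGcpCreds: ({mode}) => {
    if (!modes[mode]) throw new Error(`no secrets defined for mode ${mode}`);
    return modes[mode];
  }
};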

Organization

The main topic of this segment is the ocr folder of the project. The source file is on a bucket in GCS, and will be processed like this:
node ocr --path='a.pdf'

This will create one or more files in the output folder, depending on the document complexity.
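The exact names depend on the destination prefix handed to the Vision API, but typically each output file covers a range of pages and ends up looking something like this (the bucket and folder here are just placeholders):

gs://your-ocr-bucket/a/output-1-to-1.json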

index.js

The mode is just a code used throughout each of the processes we'll be looking at to indicate which credentials and which version of the database and API to use (production, staging etc.).
const argv = require('yargs').argv;
const ocrServer = require('./ocrserver');
ocrServer.init({
  mode: process.env.FIDRUNMODE || 'lv',
  argv
});
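So, assuming your secrets file defines a 'production' mode, you could select it via the environment rather than editing any code – otherwise it falls back to the default 'lv' mode:

FIDRUNMODE=production node ocr --path='a.pdf'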

ocrserver.js

Each process follows roughly the same structure. There's not a lot going on here, but when we turn this into a Cloud Function later, there will be a bit more.
const ocrOrchestrate = require('./ocrorchestrate');
// wrapper to run the whole thing
const init = async ({mode, argv}) => {
  ocrOrchestrate.init({mode});
  await ocrOrchestrate.start({mode, argv});
};
module.exports = {
  init,
};

ocrorchestrate.js

There is some common code shared across each step. getPaths constructs standard GCS path URIs from the base source file for each step. In this case we are interested in the gs://bucketname/filename to use for the source data, and the folder to place the OCR results in (derived from the --path argument). It'll be covered separately in a section on the common code, and there's a sketch of it after the code below.

const ocrVision = require('./ocrvision');
const {getPaths} = require('../common/vihandlers');

// manages orchestration of vision api
const init = ({mode}) => {
  ocrVision.init({mode});
};

const start = async ({mode, argv}) => {
  // for convenience - there's a common way to get all the file names/paths
  // the bucket is specified in visecrets and the initial source path here
  const { path } = argv;
  const {gcsSourceUri, gcsContentUri} = getPaths({
    pathName: path,
    mode,
  });

  await ocrVision.start({
    gcsSourceUri,
    gcsContentUri,
  });
};

module.exports = {
  init,
  start,
};
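The getPaths helper will get proper coverage in the common code section, but a minimal sketch of what it might do is shown below – the URI layout and the bucket lookup are assumptions for illustration only.

// common/vihandlers.js - a sketch of getPaths only
const secrets = require('../private/visecrets');

// construct the standard gcs uris for a given source path and run mode
const getPaths = ({pathName, mode}) => {
  const {bucket} = secrets.getGcpCreds({mode});
  // use the file name without its extension as the output folder
  const folder = pathName.replace(/\.[^.]+$/, '');
  return {
    // where the source pdf lives
    gcsSourceUri: `gs://${bucket}/${pathName}`,
    // where the vision api will write its json results
    gcsContentUri: `gs://${bucket}/${folder}/`
  };
};

module.exports = {
  getPaths
};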

ocrvision.js

This does the work with the Vision API. Note it uses the Google long running operations workflow, which I covered previously here.
const secrets = require('../private/visecrets');
const vision = require('@google-cloud/vision').v1;

// does the vision annotation
let client = null;
const init = ({ mode }) => {
  client = new vision.ImageAnnotatorClient({
    credentials: secrets.getGcpCreds({mode}).credentials,
  });
};
const start = async ({gcsSourceUri, gcsContentUri}) => {
  const inputConfig = {
    mimeType: 'application/pdf',
    gcsSource: {
      uri: gcsSourceUri,
    },
  };
  const outputConfig = {
    gcsDestination: {
      uri: gcsContentUri
    },
  };
  const features = [{type: 'DOCUMENT_TEXT_DETECTION'}];
  const request = {
    requests: [
      {
        inputConfig: inputConfig,
        features: features,
        outputConfig: outputConfig,
      },
    ]
  };
  // OCR it
  console.log('starting ', features, ' on ', inputConfig, ' to ', outputConfig);
  const [operation] = await client.asyncBatchAnnotateFiles(request);
  const [filesResponse] = await operation.promise();
  const destinationUri =
    filesResponse.responses[0].outputConfig.gcsDestination.uri;
  // report where the results were written, then hand back the full response
  console.log('ocr results written to', destinationUri);
  return filesResponse;

};


module.exports = {
  init,
  start
};

The result will be a folder on Cloud Storage containing a collection of JSON files, with each file holding the analysis of multiple PDF pages. Since this initial example is small, there will only be one file, with one page in it.
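When we come to interpret those results we'll need to pull them back out of Cloud Storage. As a preview, here's a minimal sketch of how that might be done with the @google-cloud/storage client – the readOcrResults name and the prefix argument are my own placeholders, not part of the repository code.

// sketch: read back the vision json results from the output folder
const {Storage} = require('@google-cloud/storage');
const secrets = require('../private/visecrets');

const readOcrResults = async ({mode, prefix}) => {
  const {credentials, bucket} = secrets.getGcpCreds({mode});
  const storage = new Storage({credentials});
  // list every json file the vision api wrote under the output prefix
  const [files] = await storage.bucket(bucket).getFiles({prefix});
  // download and parse each one
  return Promise.all(files.map(async (file) => {
    const [content] = await file.download();
    return JSON.parse(content.toString());
  }));
};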

Structure of the OCR result

The result is long and complicated, so I won't reproduce it all here – it starts like this and goes on and on.
"responses": [{
 "fullTextAnnotation": {
 "pages": [{
 "property": {
 "detectedLanguages": [{
 "languageCode": "en",
 "confidence": 0.8
 }, {
 "languageCode": "es",
 "confidence": 0.05
 }]
 },
 "width": 792,
 "height": 612,
 "blocks": [{
 "boundingBox": {
 "normalizedVertices": [{
 "x": 0.5997475,
 "y": 0.01633987
 }, {
 "x": 0.7121212,
 "y": 0.01633987
 }, {
 "x": 0.7121212,
 "y": 0.1127451
 }, {
 "x": 0.5997475,
 "y": 0.1127451
 }]
 },
 "paragraphs": [{
 "boundingBox": {
 "normalizedVertices": [{
 "x": 0.5997475,
 "y": 0.01633987
 }, {

The slightly surprising thing to note is that the concepts of 'rows' and 'columns' don't exist. Instead there are blocks, paragraphs, symbols and bounding boxes to deal with in order to reconstruct a tabular-like format.
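To give an idea of the hierarchy, here's a minimal sketch of walking one response down to its characters – pages contain blocks, blocks contain paragraphs, paragraphs contain words, and words contain symbols:

// sketch: walk a single response's fullTextAnnotation down to its characters
const dumpParagraphs = (response) => {
  response.fullTextAnnotation.pages.forEach((page) => {
    page.blocks.forEach((block) => {
      block.paragraphs.forEach((paragraph) => {
        // each word is a collection of symbols (characters)
        const text = paragraph.words
          .map((word) => word.symbols.map((symbol) => symbol.text).join(''))
          .join(' ');
        console.log(text, paragraph.boundingBox.normalizedVertices);
      });
    });
  });
};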

Applying this back to the original document, it's these blocks and bounding boxes that need to be unravelled and interpreted.
Deciding what is on the same row is fairly complex, as the bounding box coordinates won't exactly match – the input is neither perfectly lined up nor free of skew. Columns are even more complex, as the spacing between words is the only clue as to whether something is a real column boundary or just two adjacent words. Additionally, if some rows have missing columns, as in this example, how do we know which column to assign the data to?
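One way in – and this is just a sketch of the idea rather than the eventual implementation – is to sort items by the top of their bounding box and treat anything within a small vertical tolerance of the current row as part of it:

// sketch: group ocr items into rows using the y coordinate of their bounding boxes
// each item is assumed to look like {text, boundingBox: {normalizedVertices: [...]}}
const groupIntoRows = (items, tolerance = 0.01) => {
  const topOf = (item) => Math.min(...item.boundingBox.normalizedVertices.map((v) => v.y || 0));
  const sorted = [...items].sort((a, b) => topOf(a) - topOf(b));
  return sorted.reduce((rows, item) => {
    const last = rows[rows.length - 1];
    if (last && Math.abs(topOf(item) - last.y) <= tolerance) {
      // close enough vertically - same row
      last.items.push(item);
    } else {
      // start a new row anchored at this item's top edge
      rows.push({y: topOf(item), items: [item]});
    }
    return rows;
  }, []);
};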
After doing all that, we next need to decide what type of data is in each block of text to be able to attach meaning to it – in other words, detecting that 'sean connery' is a name.
All of that will be revealed in the next few sections, where we look at how to normalize the data, and use various clues from the Natural Language API, the Knowledge Graph API and various other pieces to assign meaning to each of the data entities.
See Making sense of OCR for the next steps.

Related

This example is actually part of a much larger workflow which makes use of a range of APIs. If you found this useful, you may also like some of the pages below.
More Google Cloud Platform topics
  • Blistering fast file streaming between Drive and Cloud Storage using Cloud Run
    • Supercharging copying files between Drive and Cloud Storage from Apps Script with Cloud Run
  • Chunking promises using the Knowledge Graph API as an example
  • Cloud Storage and Apps Script
    • Enabling APIs and OAuth2
    • GcsStore examples
    • GcsStore overview - Google Cloud Storage and Apps Script
    • Google cloud storage and CORS
    • Setting up or creating a console project
    • Using the service account to enable access to cloud storage
  • Connecting to cockroachdb
  • Firebase auth for graphql clients
  • FTP server on Kubernetes with cloud storage and pubsub
  • Getting an API running in Kubernetes
    • Bringing up an ingress controller
    • Building your App ready for Kubernetes deployment
    • Creating a Kubernetes deployment
    • Creating a microservice on Kubernetes
    • Digging around on the Kubernetes cluster
    • Getting an ssl certificate for Kubernetes ingress
    • HTTPS ingress for Kubernetes service
    • Kubernetes ingress with cert-manager
    • Managing ssl for ingress certificates with cert-manager
  • Getting cockroachdb running on google cloud platform
  • Getting cockroachDB running with Kubernetes
  • Getting memcache up and running on Kubernetes
    • Creating a test app for memcache on Kubernetes
    • Exposing a memcache loadbalancer
    • Getting a simple app running on Kubernetes
    • Installing memcache with Kubernetes
    • Using mcrouter with memcached on Kubernetes
  • Google Cloud Run on Kubernetes
  • Google Video Intelligence API film labelling
  • Long running cloud platform operations and node apis
  • Making sense of OCR - Google Vision
    • Orchestrating APIS to structure and interpret OCR data
  • More cloud streaming
  • Orchestrating APIS to analyze OCR data
  • Queuing asynchronous tasks to defeat rate limits
  • Secure CockroachDB and Kubernetes
  • Securing Graphql with firebase login
  • Service account impersonation for Google APIS with Nodejs client
  • Sharing secrets between Doppler, GCP and Kubernetes
  • Stream content to Google Cloud Storage
  • Video transcription with Video Intelligence API
  • Your own free linux VM