Google Vision and OCR - Desktop Liberation

This is the first step in Making sense of Ocr – getting a pdf turned into a JSON map.


Mechanics

First, load your pdf file somewhere in cloud storage; I’ll then use this code to retrieve and ocr it. Notice that the code refers to a secrets file, which is where I store my credentials and other parameters. All the code will be on GitHub (I’ll give you the repository later), but you’ll need to make your own secrets file, and you’ll also need to download a credentials json file for a service account that can read from and write to that storage.
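
The secrets file itself isn’t shown in this article, so here is a minimal sketch of what one might look like. Only the call secrets.getGcpCreds({mode}).credentials comes from the code in this article; the bucket names, the 'staging' mode key, the file paths and the getConfig and readJson helpers are all assumptions made for illustration.

```javascript
// private/visecrets.js - hypothetical sketch, not the actual file from the project
const fs = require('fs');

// one entry per run mode; all names and paths here are placeholders
const configs = {
  lv: {
    bucketName: 'my-dev-bucket',
    credentialsFile: './private/dev-service-account.json',
  },
  staging: {
    bucketName: 'my-staging-bucket',
    credentialsFile: './private/staging-service-account.json',
  },
};

const getConfig = ({ mode }) => {
  const config = configs[mode];
  if (!config) throw new Error(`unknown mode: ${mode}`);
  return config;
};

// reads and parses a downloaded service account key file
const readJson = (file) => JSON.parse(fs.readFileSync(file, 'utf8'));

// returns { credentials }, the shape handed to the vision client later on
const getGcpCreds = ({ mode, load = readJson }) => ({
  credentials: load(getConfig({ mode }).credentialsFile),
});

module.exports = { getConfig, getGcpCreds };
```

Keeping the loader injectable means the key could later come from somewhere other than a local file without changing the callers.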

Organization

The project looks like this

and the main topic of this segment is the ocr folder, which looks like this

The source file is on a bucket in gcs, and will be processed like this

node ocr --path='a.pdf'

This will create one or more files (depending on the document complexity), like this

index.js

The mode is just a code used throughout each of the processes we’ll be looking at to indicate which credentials and which version of the database and api to use (production, staging etc…)

const argv = require('yargs').argv;
const ocrServer = require('./ocrserver');
ocrServer.init({
  mode: process.env.FIDRUNMODE || 'lv',
  argv
});

ocrserver.js

Each process follows roughly the same structure. There’s not a lot going on here, but when we turn this into a cloud function later, there will be a bit more.

const ocrOrchestrate = require('./ocrorchestrate');
// wrapper to run the whole thing
const init = async ({mode, argv}) => {
  ocrOrchestrate.init({mode});
  await ocrOrchestrate.start({mode, argv});
};
module.exports = {
  init,
};

ocrorchestrate.js

There is some common code shared across each step. getPaths constructs standard gcs path uris from the base source file for each step. In this case we are interested in the gs://bucketname/filename to use for the source data and the folder to place the ocr results in (derived from the --path argument). It’ll be covered separately in a section on the common code.

const ocrVision = require('./ocrvision');
const {getPaths} = require('../common/vihandlers');

// manages orchestration of vision api
const init = ({mode}) => {
  ocrVision.init({mode});
};

const start = async ({mode, argv}) => {
  // for convenience - there's a common way to get all the file names/paths
  // the bucket is specified in visecrets and the initial source path here
  const { path } = argv;
  const {gcsSourceUri, gcsContentUri} = getPaths({
    pathName: path,
    mode,
  });

  await ocrVision.start({
    gcsSourceUri,
    gcsContentUri,
  });
};

module.exports = {
  init,
  start,
};
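
Since getPaths is covered elsewhere, here is a rough sketch of the kind of thing it might do, so the two uris are less abstract. Everything in it is an assumption: the bucket lookup, the 'my-dev-bucket' name and the '/ocr/' folder suffix are placeholders, and the real helper in ../common/vihandlers may well differ.

```javascript
// hypothetical sketch of getPaths - not the real code from ../common/vihandlers
// bucket names per mode are placeholders (the project keeps its own in visecrets)
const buckets = { lv: 'my-dev-bucket' };

const getPaths = ({ pathName, mode }) => {
  const bucketName = buckets[mode];
  if (!bucketName) throw new Error(`no bucket configured for mode ${mode}`);
  // drop any leading slash, then drop the extension to name the results folder
  const cleanPath = pathName.replace(/^\/+/, '');
  const baseName = cleanPath.replace(/\.[^./]+$/, '');
  return {
    // the pdf to be processed
    gcsSourceUri: `gs://${bucketName}/${cleanPath}`,
    // folder prefix for the ocr results (the trailing slash marks it as a folder)
    gcsContentUri: `gs://${bucketName}/${baseName}/ocr/`,
  };
};
```

With these placeholder values, a --path of 'a.pdf' in 'lv' mode would give gs://my-dev-bucket/a.pdf as the source and gs://my-dev-bucket/a/ocr/ as the results folder.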

ocrvision.js

This does the work with the Vision API. Note it uses the Google long-running operation workflow, which I covered previously here.

const secrets = require('../private/visecrets');
const vision = require('@google-cloud/vision').v1;

// does the vision annotation
let client = null;
const init = ({ mode }) => {
  client =  new vision.ImageAnnotatorClient({
    credentials: secrets.getGcpCreds({mode}).credentials,
  });
};
const start = async ({gcsSourceUri, gcsContentUri}) => {
  const inputConfig = {
    mimeType: 'application/pdf',
    gcsSource: {
      uri: gcsSourceUri,
    },
  };
  const outputConfig = {
    gcsDestination: {
      uri: gcsContentUri
    },
  };
  const features = [{type: 'DOCUMENT_TEXT_DETECTION'}];
  const request = {
    requests: [
      {
        inputConfig: inputConfig,
        features: features,
        outputConfig: outputConfig,
      },
    ]
  };
  // OCR it
  console.log('starting ', features, ' on ', inputConfig, ' to ', outputConfig);
  const [operation] = await client.asyncBatchAnnotateFiles(request);
  const [filesResponse] = await operation.promise();
  const destinationUri =
    filesResponse.responses[0].outputConfig.gcsDestination.uri;
  return filesResponse;

};


module.exports = {
  init,
  start
};

The result will be a folder on cloud storage containing a collection of json files, with each file holding the analysis of multiple pdf pages. Since this initial example is small, there will only be one file, with one page in it.
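
How many pages go into each json file is governed by the batchSize field of the output config, which the request in ocrvision.js leaves unset. As a sketch, the request could be built with that value made explicit. The buildRequest and expectedFileCount helpers are illustrative names rather than part of the project, and the default of 20 is the API’s documented default as I understand it.

```javascript
// builds the asyncBatchAnnotateFiles request with an explicit batchSize
// (buildRequest is an illustrative helper, not part of the project code)
const buildRequest = ({ gcsSourceUri, gcsContentUri, batchSize = 20 }) => ({
  requests: [
    {
      inputConfig: {
        mimeType: 'application/pdf',
        gcsSource: { uri: gcsSourceUri },
      },
      features: [{ type: 'DOCUMENT_TEXT_DETECTION' }],
      outputConfig: {
        gcsDestination: { uri: gcsContentUri },
        // maximum number of page responses written to each output json file
        batchSize,
      },
    },
  ],
});

// rough estimate of how many output files to expect for a given page count
const expectedFileCount = ({ pageCount, batchSize = 20 }) =>
  Math.ceil(pageCount / batchSize);
```

A smaller batchSize gives more, smaller files, which can be handy if later steps process pages independently.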

Structure of the ocr result

The result is long and complicated, so I won’t reproduce it here – it starts like this and goes on and on.

"responses": [{
  "fullTextAnnotation": {
    "pages": [{
      "property": {
        "detectedLanguages": [{
          "languageCode": "en",
          "confidence": 0.8
        }, {
          "languageCode": "es",
          "confidence": 0.05
        }]
      },
      "width": 792,
      "height": 612,
      "blocks": [{
        "boundingBox": {
          "normalizedVertices": [{
            "x": 0.5997475,
            "y": 0.01633987
          }, {
            "x": 0.7121212,
            "y": 0.01633987
          }, {
            "x": 0.7121212,
            "y": 0.1127451
          }, {
            "x": 0.5997475,
            "y": 0.1127451
          }]
        },
        "paragraphs": [{
          "boundingBox": {
            "normalizedVertices": [{
              "x": 0.5997475,
              "y": 0.01633987
            }, {

The slightly surprising thing to note is that the concepts of ‘rows’ and ‘columns’ don’t exist. Instead there are blocks, paragraphs, words, symbols and bounding boxes to deal with in order to reconstruct a tabular format.
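
That hierarchy can be walked with a few nested loops. The sketch below assumes a result file has already been downloaded and parsed, and that each word carries a symbols array and a boundingBox with normalizedVertices, in the same shape as the fragment above. The name extractWords is illustrative and is not part of the project code.

```javascript
// flattens one parsed result file into a list of words with their bounding boxes
const extractWords = (result) => {
  const words = [];
  (result.responses || []).forEach((response, responseIndex) => {
    const pages = (response.fullTextAnnotation || {}).pages || [];
    pages.forEach((page) => {
      (page.blocks || []).forEach((block) => {
        (block.paragraphs || []).forEach((paragraph) => {
          (paragraph.words || []).forEach((word) => {
            words.push({
              // which entry of the responses array the word came from
              responseIndex,
              // a word's text is the concatenation of its symbols
              text: (word.symbols || []).map((symbol) => symbol.text).join(''),
              vertices: (word.boundingBox || {}).normalizedVertices || [],
            });
          });
        });
      });
    });
  });
  return words;
};
```

The output is a flat list of { responseIndex, text, vertices } objects, which is an easier starting point for working out rows and columns than the nested original.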

Applying this back to the original document, it’s something like this that needs to be unravelled and interpreted.

Deciding what is on the same row is fairly complex, as the bounding box coordinates won’t exactly match: the input is not neatly lined up and is also skewed. Columns are even more complex, as the spacing between words is the only clue as to whether something is a real column or just 2 words. Additionally, if some rows have missing columns, as in this example, how do we know which column to assign the data to?
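
To make the row problem concrete, here is one naive way to group items into rows: sort by vertical centre and start a new row whenever an item sits further than some tolerance from the current row. This is only a sketch under assumed inputs (items with a text and a vertices array, and an arbitrary tolerance of 0.01 in normalized units). It ignores skew completely and is not the approach developed in the later sections.

```javascript
// centre point of a bounding box (a missing coordinate is treated as 0)
const centre = (vertices) => ({
  x: vertices.reduce((sum, v) => sum + (v.x || 0), 0) / vertices.length,
  y: vertices.reduce((sum, v) => sum + (v.y || 0), 0) / vertices.length,
});

// naive row grouping: items whose vertical centres are within
// 'tolerance' of the current row's centre are treated as the same row
const groupIntoRows = (items, tolerance = 0.01) => {
  const sorted = items
    .map((item) => ({ ...item, centre: centre(item.vertices) }))
    .sort((a, b) => a.centre.y - b.centre.y);
  const rows = [];
  sorted.forEach((item) => {
    const row = rows[rows.length - 1];
    if (row && Math.abs(item.centre.y - row.y) <= tolerance) {
      row.items.push(item);
      // keep a running average so a small drift doesn't split the row
      row.y = row.items.reduce((sum, i) => sum + i.centre.y, 0) / row.items.length;
    } else {
      rows.push({ y: item.centre.y, items: [item] });
    }
  });
  // order each row left to right and keep just the text
  return rows.map((row) =>
    row.items.sort((a, b) => a.centre.x - b.centre.x).map((i) => i.text));
};
```

Choosing the tolerance is the weak point: too small and a skewed row splits in two, too large and neighbouring rows merge, which is exactly why this step needs more care than a single threshold.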

After doing all that, we next need to decide what type of data is in each block of text to be able to put meaning to the data – in other words detecting that ‘sean connery’ is a name.

All that will be revealed in the next few sections, where we look at how to normalize the data, and use various clues from the Natural Language API, the Knowledge Graph API and various other pieces to assign meaning to each of the data entities.

See Making sense of Ocr for next steps

This example is actually part of a much larger workflow which makes use of a range of APIs.