The Video Intelligence API allows you to analyze the content of videos. I covered basic labelling in Google Video Intelligence API film labelling. This section looks at how to get a transcript of a film. It turns out this is a little trickier than you might imagine, as there are a couple of gotchas which I’ll cover here. None of them are documented well (if at all), so it was pretty much trial and error to figure them out.
It turns out transcription doesn’t work with the stable version of the API; you have to use the beta version instead. At the time of writing this was v1p3beta.

Standalone Feature

Most of the analysis features can be run at the same time, in a single request. I discovered that Transcription needs to be run separately, otherwise you get back no results. So analyzing a film for each of the features I’m interested in needs these parameters.
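
As a rough sketch of what such parameters might look like: the feature names and the speechTranscriptionConfig fields follow the API's request format, but the choice of combined features, the language code, the speaker count and the makeRequests helper are my assumptions for illustration, not the original settings.

```javascript
// Hypothetical config objects, one per annotate request.
const configs = {
  // features that can happily share a single request (an assumed selection)
  combined: {
    features: ['LABEL_DETECTION', 'SHOT_CHANGE_DETECTION'],
  },
  // transcription gets a request all of its own
  transcription: {
    features: ['SPEECH_TRANSCRIPTION'],
    videoContext: {
      speechTranscriptionConfig: {
        languageCode: 'en-US', // assumed language
        enableAutomaticPunctuation: true,
        enableSpeakerDiarization: true,
        diarizationSpeakerCount: 2, // assumed guess at the number of voices
      },
    },
  },
};

// Hypothetical helper: build one request per config for the same video
const makeRequests = (inputUri) =>
  Object.keys(configs).map((name) => ({
    name,
    request: { inputUri, ...configs[name] },
  }));
```

Calling makeRequests with a Cloud Storage URI gives two separate requests, so the transcription never shares a call with the other features.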

Speaker Diarization

This is a great feature for me because it allows the allocation of lines to particular voices, which is ideal for assigning dialog to particular actors. However, the implementation is a little peculiar. The Speech Transcription response is an array of sentences, each containing a summary transcript and each word broken out with the timestamp of when it appears. There is also a speakerTag property, but it only appears in the very last item in the array, which is all the words and timestamps repeated for the whole dialog as a single entity, this time with a speaker tag attached to each word. In order to retain both the information on ‘sentences’ and the speaker tags, you have to match the first set of items with the very last one that contains the speaker tags by using the timestamps.

The code

Most of the wrapper for this has been covered in Google Video Intelligence API film labelling, so I’ll just focus on the speech transcription section here. The main steps are:
  • Do the annotation with a transcription feature (configs.transcription mentioned previously)
  • Convert the timestamps to usable segments
  • Split out the final dialog, which contains a repeat of all the words, this time with a speakerTag (numbered 1 to n) attached
  • Revisit the dialog split into sentences, attach the speaker tags by comparing timestamps, and further split the dialog if multiple speakers are detected in the same ‘sentence’


This is a general purpose function to do all flavours of annotation.

Long running operation

Annotation is a long running operation (these are covered in Long running cloud platform operations and node apis), so the request returns an operation that you have to wait on before any results are available.
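
A general purpose annotate function that waits for the operation to finish could be sketched like this. I'm assuming client is a VideoIntelligenceServiceClient from the beta namespace of the @google-cloud/video-intelligence package; it is passed in as an argument (my choice, not necessarily how the original wrapper did it) so the function can be tried out with a stub.

```javascript
// Sketch only: `client` is assumed to be a Video Intelligence client,
// `config` one of the per-feature parameter objects.
const annotate = async ({ client, inputUri, config }) => {
  const request = { inputUri, ...config };

  // annotateVideo resolves to an array whose first item is the operation
  const [operation] = await client.annotateVideo(request);

  // operation.promise() resolves once the long running operation completes
  const [result] = await operation.promise();

  // there is one entry per input video, and only one video was submitted
  return result.annotationResults[0];
};
```

Because the config is just spread into the request, the same function works for any flavour of annotation.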

Time measurement

The Video Intelligence API uses a time offset which consists of a (Long) number of seconds (note this is not a Number), along with a number of nanoseconds. Here’s how to convert it to seconds.
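
A minimal sketch of that conversion, assuming the seconds part may arrive as a Long-style object with a toNumber method, as a string, or as a plain number, and that a missing part means zero. The toSeconds name is mine.

```javascript
// Convert a { seconds, nanos } offset into a plain number of seconds.
const toSeconds = (offset) => {
  if (!offset) return 0;
  const { seconds = 0, nanos = 0 } = offset;

  // Long-style objects expose toNumber(); strings and numbers go through Number()
  const whole =
    seconds && typeof seconds.toNumber === 'function'
      ? seconds.toNumber()
      : Number(seconds);

  // nanos is a regular number, so scale it down and add it on
  return whole + nanos / 1e9;
};
```

Film offsets are nowhere near large enough for the conversion from Long to Number to lose precision, so a plain Number is fine from here on.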

Splitting the response

As previously mentioned, the response contains a number of dialog snippets followed by one entire summary dialog. Here’s how to split them out.
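
One way to do the split, sketched under a few assumptions of mine: speechTranscriptions is the array from the annotation result, only the first alternative of each item is of interest, and items with no words can be dropped. The splitTranscriptions name is hypothetical.

```javascript
// Separate the sentence snippets from the final summary dialog.
const splitTranscriptions = (speechTranscriptions = []) => {
  // take the first alternative of each item and drop any without words
  const dialogs = speechTranscriptions
    .map((t) => t.alternatives && t.alternatives[0])
    .filter((alt) => alt && alt.words && alt.words.length);

  return {
    // everything except the last item is a 'sentence'
    snippets: dialogs.slice(0, -1),
    // the last item repeats every word, with speaker tags attached
    summary: dialogs[dialogs.length - 1] || null,
  };
};
```

Note that this relies on diarization having been enabled; without it the last item is just another sentence.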

Tagging speakers

Once you have the summary dialog containing the speaker tags, you need to go back and assign these to the original dialogs, using a function like this
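
Here is a sketch of one possible approach rather than a definitive implementation. It assumes each word carries word, startTime and endTime, that the summary words also carry speakerTag, and that a word in a snippet has exactly the same timestamps as its repeat in the summary. The secs, wordKey and tagSpeakers names are my own.

```javascript
// compact offset conversion, repeated here so the sketch stands alone
const secs = ({ seconds = 0, nanos = 0 } = {}) =>
  (seconds && seconds.toNumber ? seconds.toNumber() : Number(seconds)) +
  nanos / 1e9;

// identify a word by its text plus its start and end time
const wordKey = (w) => `${secs(w.startTime)}|${secs(w.endTime)}|${w.word}`;

const tagSpeakers = (snippets, summary) => {
  // look-up of speaker tags from the summary dialog
  const tags = new Map(summary.words.map((w) => [wordKey(w), w.speakerTag]));
  const segments = [];

  snippets.forEach((snippet) => {
    let current = null;
    snippet.words.forEach((w) => {
      const key = wordKey(w);
      const speakerTag = tags.has(key) ? tags.get(key) : null;

      // start a new segment for each sentence, and whenever the voice changes
      if (!current || current.speakerTag !== speakerTag) {
        current = { speakerTag, start: secs(w.startTime), words: [] };
        segments.push(current);
      }
      current.words.push(w.word);
      current.end = secs(w.endTime);
    });
  });

  // add a readable transcript to each segment
  return segments.map((s) => ({ ...s, transcript: s.words.join(' ') }));
};
```

Each returned segment is either a whole sentence or the part of a sentence spoken by a single voice, with its own start and end time in seconds. Words that can't be matched in the summary end up with a speakerTag of null.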

