The Video Intelligence API allows you to analyze the content of videos. I covered basic labelling in Google Video Intelligence API film labelling. This section will look at how to get a transcript of a film. It turns out this is a little trickier than you might imagine, as there are a couple of gotchas which I’ll cover here. None of these things are documented well (if at all), so it was pretty much trial and error to figure them out.
It turns out this doesn’t work with the stable version – you have to use the beta version instead. At the time of writing this was v1p3beta1.
```javascript
const video = require('@google-cloud/video-intelligence').v1p3beta1;
```
Standalone Feature
Most of the analysis features can be run at the same time, but I discovered that transcription needs to be run separately, otherwise you get back no results. So analyzing a film for each of the features I’m interested in needs these parameters.
```javascript
const configs = {
  transcription: {
    features: ['SPEECH_TRANSCRIPTION'],
    videoContext: {
      speechTranscriptionConfig: {
        languageCode: 'en-US',
        // this will enable better formatting
        enableAutomaticPunctuation: true,
        enableSpeakerDiarization: true,
        // this is a problematic one - too many leads to false splitting -
        // too few means we miss splitting
        diarizationSpeakerCount: 6
      }
    }
  },
  all: {
    features: [
      'SHOT_CHANGE_DETECTION',
      'LABEL_DETECTION',
      'TEXT_DETECTION',
      'LOGO_RECOGNITION'
    ],
    videoContext: {}
  }
};
```
Speaker Diarization
This is a great feature for me because it allows the allocation of lines to particular voices – great for assigning dialog to particular actors. However, the implementation is a little peculiar. The speech transcription response is an array of sentences, each containing a summary transcript along with each word broken out with the timestamp of when it appears. There is also a speakerTag property – however it only appears in the very last item in the array, which repeats all the words and timestamps for the whole dialog as a single entity, but with a speaker tag attached to each word. In order to retain both the information on ‘sentences’ and the speaker tags, you have to match the first set of items with the very last one (the one containing the speaker tags) using the timestamps.
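To make that shape concrete, here’s a heavily trimmed mock of what comes back (hypothetical values, not real API output) – note that only the final entry carries speakerTag:

```javascript
// Hypothetical, heavily trimmed mock of a speechTranscriptions response -
// sentence-level entries first, then one final entry repeating every word
// with a speakerTag attached
const speechTranscriptions = [
  // sentence entries: transcript plus per-word timestamps, no speakerTag
  { alternatives: [{ transcript: 'Hello there.', confidence: 0.9, words: [
    { word: 'Hello', startTime: { seconds: '0', nanos: 0 } },
    { word: 'there.', startTime: { seconds: '0', nanos: 400000000 } }
  ]}]},
  // ...more sentence entries...
  // final entry: all the words again, this time tagged by speaker
  { alternatives: [{ transcript: '', words: [
    { word: 'Hello', startTime: { seconds: '0', nanos: 0 }, speakerTag: 1 },
    { word: 'there.', startTime: { seconds: '0', nanos: 400000000 }, speakerTag: 1 }
  ]}]}
];
```

The timestamps are the only reliable key for matching a word in a sentence entry to its tagged twin in the final entry.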
The code
Most of the wrapper for this has been covered in Google Video Intelligence API film labelling, so I’ll just focus on the speech transcription section here. The main steps are
- Do the annotation with a transcription feature (configs.transcription mentioned previously)
- Convert the timestamps to usable segments
- Split out the final dialog, which contains a repeat of all the words, this time with a speakerTag attached (1-n)
- Revisit the dialog split into sentences, attach the speaker tags by comparing timestamps, and further split the dialog if multiple speakers are detected in the same ‘sentence’
```javascript
const processTranscription = async ({ fileName, description, videoFile }) => {
  const gcsFile = `gs://${viBucket}/${videoFile}`;

  // do the annotation
  const annotationResult = await annotate({ featurePack: configs.transcription, description, gcsFile });
  const { annotations, runId, elapsed, runAt } = annotationResult;

  // get the data for this type
  const { speechTranscriptions, error } = annotations;

  // package up
  const cleanSpeech = speechTranscriptions.map(g => {
    const { languageCode, alternatives } = g;
    // start by interpreting the start and finish times - always use the first alternative
    const { transcript, confidence, words } = alternatives[0];
    return {
      transcript,
      confidence,
      segments: (words || []).map(w => ({
        startTime: getTimeOffset(w.startTime),
        endTime: getTimeOffset(w.endTime),
        word: w.word,
        speakerTag: w.speakerTag
      })),
      languageCode
    };
  });

  // now split out the diarization part
  const { speakerTagged, transcriptions } = splitTranscriptions({
    cleanSpeech,
    speechTranscriptionConfig: configs.transcription.videoContext.speechTranscriptionConfig
  });

  // attach speaker tags
  const taggedTranscriptions = tagSpeakers({ transcriptions, speakerTagged });

  // now split either at a natural break or if the speaker changes
  const mapSpeech = taggedTranscriptions.reduce((s, t) => {
    t.segments.forEach((w, wi) => {
      // if it's a new section, force a new item
      const { speakerTag, word, endTime } = w;
      const lastItem = wi && s[s.length - 1].speakerTag === speakerTag && s[s.length - 1];
      if (!lastItem) {
        s.push({
          startTime: w.startTime,
          confidence: t.confidence,
          languageCode: t.languageCode,
          words: [],
          speakerTag
        });
      }
      const item = s[s.length - 1];
      // add the word and update the finish time
      item.words.push(word);
      item.endTime = endTime;
    });
    return s;
  }, []).map(s => ({ ...s, description: s.words.join(' ') }));

  return {
    errorCode: error ? error.code : null,
    errorMessage: error ? error.message : 'success',
    description,
    runId,
    runAt,
    elapsed,
    gcsFile,
    fileName,
    transcript: mapSpeech
  };
};
```
Annotate
This is a general purpose function to do all flavours of annotation.
```javascript
const annotate = async ({ featurePack, description, gcsFile }) => {
  const startTime = new Date().getTime();
  const runId = startTime.toString(32);
  const runAt = new Date(startTime).toISOString();

  // type(s) of annotations
  const { features, videoContext } = featurePack;
  console.debug('....initializing', features.join(','));

  // add video context to this for speech
  const request = { features, videoContext, inputUri: gcsFile };
  console.log('....starting', runId, description, runAt);

  // the result of the long running operation will resolve here
  const operationResult = await doLong(request);

  // get the annotations
  const [annotations] = operationResult.annotationResults;
  const elapsed = new Date().getTime() - startTime;
  console.log('....annotation done after ', elapsed / 1000, features.join(','));

  return { annotations, runId, runAt, elapsed, gcsFile, description };
};
```
Long running operation
Annotation is a long running operation (these are covered in Long running cloud platform operations and node apis)
```javascript
// manage a long running annotation operation
// (viClient is a VideoIntelligenceServiceClient created from the beta namespace)
const doLong = async (request) => {
  // it's a long running operation
  const { result, error } = await till(viClient.annotateVideo(request));
  const [operation] = result;
  // console.debug('annotating', request, { error });

  // when done, retrieve the result
  const { result: oResult, error: oError } = await till(operation.promise());
  const [operationResult] = oResult;
  // console.debug('getting result', { oError });
  return operationResult;
};
```
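The `till` helper isn’t shown above – based on how it’s used, it’s a small wrapper that resolves any promise to a `{ result, error }` pair rather than throwing. A minimal sketch of my assumption of its behaviour:

```javascript
// A minimal sketch of the `till` helper used in doLong (an assumption based
// on its usage, not the original implementation): never throw, instead
// resolve to { result, error } so callers can destructure either outcome
const till = (promise) =>
  promise
    .then(result => ({ result, error: null }))
    .catch(error => ({ result: null, error }));
```

This keeps the long-running flow flat – `await till(...)` at each stage instead of nested try/catch blocks.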
Time measurement
The Video Intelligence API uses a time offset which consists of a (Long) number of seconds (note this is not a Number), along with a number of nanoseconds. Here’s how to convert it to seconds.
```javascript
const getTimeOffset = (timeOffset) => {
  if (!timeOffset) {
    console.log('missing timeoffset');
    return 0;
  }
  const { seconds, nanos } = timeOffset;
  // seconds is actually a Long object, and nanos is a Number
  return parseFloat(seconds || 0) + (nanos || 0) / 1e9;
};
```
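To see the conversion in action, here’s a quick check with stand-in values (a Long coerces to its decimal string, so a plain string stands in for one here; the function is repeated, lightly trimmed, so the example runs standalone):

```javascript
// getTimeOffset repeated from above (lightly trimmed) for a runnable example
const getTimeOffset = (timeOffset) => {
  if (!timeOffset) return 0;
  const { seconds, nanos } = timeOffset;
  // seconds coerces from its Long/string form; nanos is already a Number
  return parseFloat(seconds || 0) + (nanos || 0) / 1e9;
};

// a Long serializes to its decimal string, so '125' stands in for one
console.log(getTimeOffset({ seconds: '125', nanos: 250000000 })); // → 125.25
console.log(getTimeOffset(null)); // → 0
```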
Splitting the response
As previously mentioned, the response will contain one entire summary dialog, preceded by a number of dialog snippets. Here’s how to split them out.
```javascript
// if speaker diarization is on, then the very last item in the speech
// transcriptions will be a summary of all transcriptions with speaker tags attached
const splitTranscriptions = ({ cleanSpeech, speechTranscriptionConfig }) => {
  if (!speechTranscriptionConfig.enableSpeakerDiarization) {
    return { transcriptions: cleanSpeech };
  }
  // slice always returns an array, so check its length rather than its truthiness
  const speakerTagged = cleanSpeech.slice(-1);
  if (!speakerTagged.length) {
    console.error('....speaker diarization missing - skipping');
    return { transcriptions: cleanSpeech };
  }
  if (cleanSpeech.length < 2) {
    console.error('....speaker diarization item suspect - skipping');
    return { transcriptions: cleanSpeech };
  }
  return { transcriptions: cleanSpeech.slice(0, -1), speakerTagged };
};
```
Tagging speakers
Once you have the summary dialog containing the speaker tags, you need to go back and assign these to the original dialogs, using a function like this
```javascript
// attach speaker tags to transcriptions by matching word start times
const tagSpeakers = ({ transcriptions, speakerTagged }) =>
  speakerTagged
    ? transcriptions.map(f => ({
        ...f,
        segments: f.segments.map(g => {
          // speakerTagged holds a single summary item - find the matching word by timestamp
          const segment = speakerTagged[0].segments.find(s => s.startTime === g.startTime);
          if (!segment) {
            console.error('....couldnt find speakertag item for ', g);
          } else {
            g.speakerTag = segment.speakerTag;
          }
          return g;
        })
      }))
    : transcriptions;
```
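Applied to trimmed mock data (hypothetical values, with the function repeated and lightly trimmed so the example runs standalone), the tags carry over from the summary item to the sentence segments by start time:

```javascript
// tagSpeakers repeated from above (lightly trimmed) for a runnable example
const tagSpeakers = ({ transcriptions, speakerTagged }) =>
  speakerTagged
    ? transcriptions.map(f => ({
        ...f,
        segments: f.segments.map(g => {
          const segment = speakerTagged[0].segments.find(s => s.startTime === g.startTime);
          if (segment) g.speakerTag = segment.speakerTag;
          return g;
        })
      }))
    : transcriptions;

// hypothetical mock data - two words, tagged only in the summary item
const transcriptions = [{ segments: [
  { startTime: 0, word: 'Hello' },
  { startTime: 0.4, word: 'there.' }
]}];
const speakerTagged = [{ segments: [
  { startTime: 0, word: 'Hello', speakerTag: 1 },
  { startTime: 0.4, word: 'there.', speakerTag: 2 }
]}];

const tagged = tagSpeakers({ transcriptions, speakerTagged });
console.log(tagged[0].segments.map(s => s.speakerTag)); // → [ 1, 2 ]
```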
More
Since G+ is closed, you can now star and follow post announcements and discussions on github, here