Video transcription with Video Intelligence API

The Video Intelligence API allows you to analyze the content of videos. I covered basic labelling in Google Video Intelligence API film labelling. This section looks at how to get a transcript of a film. It turns out to be a little trickier than you might imagine, as there are a couple of gotchas which I’ll cover here. None of these are documented well (if at all), so it was pretty much trial and error to figure them out.

1 Beta capability

2 Standalone Feature

3 Speaker Diarization

4 The code

5 Annotate

6 Long running operation

7 Time measurement

8 Splitting the response

9 Tagging speakers


Beta capability

It turns out this doesn’t work with the stable version; you have to use a beta version instead. At the time of writing this was v1p3beta1.

const video = require('@google-cloud/video-intelligence').v1p3beta1

Standalone Feature

Most of the analysis features can be run at the same time. I discovered that transcription needs to be run separately, otherwise you get back no results. So analyzing a film for each of the features I’m interested in needs two sets of parameters, one for transcription and one for everything else.

  const configs = {
    transcription: {
      features: [
        'SPEECH_TRANSCRIPTION'
      ],
      videoContext: {
        speechTranscriptionConfig: {
          languageCode: 'en-US',
          // this will enable better formatting
          enableAutomaticPunctuation: true,
          enableSpeakerDiarization: true,
          // this is a problematic one - too many leads to false splitting - too few means we miss splitting
          diarizationSpeakerCount: 6
        }
      }
    },
    all: {
      features:  [
        'SHOT_CHANGE_DETECTION',
        'LABEL_DETECTION',
        'TEXT_DETECTION',
        'LOGO_RECOGNITION',
      ],
      videoContext: {
 
      }
    }
  };

Speaker Diarization

This is a great feature for me because it allocates lines to particular voices, which is ideal for assigning dialog to particular actors. However, the implementation is a little peculiar. The Speech Transcription response is an array of sentences, each containing a summary transcript and each word broken out with the timestamps of when it is spoken. There is also a speakerTag property, but it only appears in the very last item in the array, which repeats all the words and timestamps for the whole dialog as a single entity, this time with a speaker tag attached. To retain both the ‘sentence’ information and the speaker tags, you have to match the words in the earlier items with those in the last item, using the timestamps.
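To make that shape concrete, here is a mocked-up response reduced to the fields discussed above. The words and timings are invented for illustration, the offsets are shown as plain seconds rather than the API's own offset objects, and the layout simply follows the description in this section rather than any official schema.

```javascript
// Mock of a diarized response: two 'sentences', then the summary item.
// All values are made up; only the overall shape matters here.
const speechTranscriptions = [
  { alternatives: [{ transcript: 'Hello there.', words: [
    { word: 'Hello', startTime: 0.0, endTime: 0.4 },
    { word: 'there.', startTime: 0.4, endTime: 0.9 }
  ] }] },
  { alternatives: [{ transcript: 'Hi.', words: [
    { word: 'Hi.', startTime: 1.2, endTime: 1.5 }
  ] }] },
  // the very last item: every word again, this time with a speakerTag
  { alternatives: [{ words: [
    { word: 'Hello', startTime: 0.0, endTime: 0.4, speakerTag: 1 },
    { word: 'there.', startTime: 0.4, endTime: 0.9, speakerTag: 1 },
    { word: 'Hi.', startTime: 1.2, endTime: 1.5, speakerTag: 2 }
  ] }] }
];

// everything except the last item is a sentence
const sentences = speechTranscriptions.slice(0, -1);
const summary = speechTranscriptions[speechTranscriptions.length - 1];

// the summary should account for exactly the words in the sentences
const sentenceWordCount = sentences.reduce(
  (n, s) => n + s.alternatives[0].words.length, 0
);
console.log(sentenceWordCount === summary.alternatives[0].words.length); // true
```

Because the words are the same in both places, the start time of each word can serve as the key for carrying a speakerTag back to its sentence.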

The code

Most of the wrapper for this has been covered in Google Video Intelligence API film labelling, so I’ll just focus on the speech transcription section here. The main steps are:

Do the annotation with a transcription feature (configs.transcription mentioned previously)
Convert the timestamps to usable segments
Split out the final dialog, which contains a repeat of all the words, this time with a speakerTag (1-n) attached
Revisit the dialog split into sentences, attach the speaker tags by comparing timestamps, and further split the dialog if multiple speakers are detected in the same ‘sentence’

  const processTranscription = async ({ fileName, description, videoFile }) => {
    const gcsFile = `gs://${viBucket}/${videoFile}`;
    // do the annotation
    const annotationResult = await annotate ({
      featurePack: configs.transcription,
      description, gcsFile
    });
    const { annotations, runId, elapsed, runAt } = annotationResult;
 
    // get the data for this type
    const {
      speechTranscriptions,
      error
    } = annotations;
 
    // package up
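    // package up (tolerate a missing speechTranscriptions, for example when the API reported an error)
    const cleanSpeech = (speechTranscriptions || []).map(g => {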
 
      const {languageCode, alternatives} = g;
      // start by interpreting the start and finish times - always use the first alternative
      const {transcript, confidence, words} = alternatives[0];
 
      return {
        transcript,
        confidence,
        segments:(words || []).map(w=>{
          return {
            startTime: getTimeOffset(w.startTime),
            endTime: getTimeOffset(w.endTime),
            word: w.word,
            speakerTag: w.speakerTag
          }
        }),
        languageCode
      } 
    })
 
 
    // now split out the diarization part
    const { speakerTagged , transcriptions } = splitTranscriptions ({
      cleanSpeech,
      speechTranscriptionConfig: configs.transcription.videoContext.speechTranscriptionConfig
    });
 
 
    // attach speakertags
    const taggedTranscriptions = tagSpeakers ({transcriptions, speakerTagged});
 
    // now split either at natural break or if speaker changes
    
    const mapSpeech = taggedTranscriptions.reduce((s,t) => {
 
      t.segments.forEach((w,wi) => {
        // if its a new section, force a new item
        const {speakerTag, word, endTime} = w;
        const lastItem =  wi && s[s.length-1].speakerTag === speakerTag && s[s.length-1];
        if(!lastItem) {
          s.push({startTime: w.startTime, confidence: t.confidence, languageCode: t.languageCode, words:[], speakerTag})
        }
        const item = s[s.length-1];
        // add the word and update the finish time
        item.words.push(word);
        item.endTime = endTime;
      })
 
      return s;
    },[]).map(s => {
      return {...s, description: s.words.join(" ")};
    });
 
    const result = {
      errorCode: error ? error.code : null,
      errorMessage: error ? error.message : 'success',
      description,
      runId,
      runAt,
      elapsed,
      gcsFile,
      fileName,
      transcript: mapSpeech
    };
 
    return result;
  };
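The final reduce is the least obvious part of that function, so here is a standalone sketch of the same idea run against invented data. The name groupBySpeaker and the sample words are assumptions made for the illustration; the rule is the one used above, where a new item starts at the first word of each sentence or whenever the speaker changes.

```javascript
// Group words into runs: a new run starts at the first word of a sentence
// or when the speakerTag differs from the previous word's.
const groupBySpeaker = (taggedTranscriptions) =>
  taggedTranscriptions.reduce((runs, t) => {
    t.segments.forEach((w, wi) => {
      const last = runs[runs.length - 1];
      if (wi === 0 || last.speakerTag !== w.speakerTag) {
        runs.push({ startTime: w.startTime, speakerTag: w.speakerTag, words: [] });
      }
      // add the word to the current run and move its end time along
      const run = runs[runs.length - 1];
      run.words.push(w.word);
      run.endTime = w.endTime;
    });
    return runs;
  }, []).map(r => ({ ...r, description: r.words.join(' ') }));

// invented sample: speaker 2 interrupts inside the first sentence
const demo = [
  { segments: [
    { word: 'Hello', startTime: 0, endTime: 0.4, speakerTag: 1 },
    { word: 'there.', startTime: 0.4, endTime: 0.9, speakerTag: 1 },
    { word: 'Hi.', startTime: 1.2, endTime: 1.5, speakerTag: 2 }
  ] },
  { segments: [
    { word: 'Welcome.', startTime: 2.0, endTime: 2.6, speakerTag: 2 }
  ] }
];

const runs = groupBySpeaker(demo);
console.log(runs.map(r => `${r.speakerTag}: ${r.description}`));
// [ '1: Hello there.', '2: Hi.', '2: Welcome.' ]
```

Note that the last two runs stay separate even though the speaker is the same, because a sentence boundary always forces a new item.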

Annotate

This is a general purpose function to do all flavours of annotation.

  const annotate = async({ featurePack, description, gcsFile }) => {
    const startTime = new Date().getTime();
    const runId = startTime.toString(32);
    const runAt = new Date(startTime).toISOString();
    // type(s) of annotations
    const { features, videoContext } = featurePack;
    console.debug('....initializing', features.join(','));
    // add video context to this for speech
    const request = {
      features,
      videoContext,
      inputUri: gcsFile
    };
    console.log('....starting', runId, description, runAt);
    // the result of the long running operation will resolve here
    const operationResult = await doLong(request);
    // get the annotations
    const [annotations] = operationResult.annotationResults;
    const elapsed = new Date().getTime() - startTime;
    console.log('....annotation done after ', elapsed / 1000, features.join(','));
    return {
      annotations,
      runId,
      runAt,
      elapsed,
      gcsFile,
      description
    };
  };

Long running operation

Annotation is a long running operation (these are covered in Long running cloud platform operations and node apis).

  // manage a long running annotation operation
  const doLong = async (request) => {
    // kick off the long running operation
    const { result, error } = await till(viClient.annotateVideo(request));
    // fail here rather than on the destructuring below
    if (error) throw error;
    const [operation] = result;

    // when done, retrieve the result
    const { result: oResult, error: oError } = await till(operation.promise());
    if (oError) throw oError;
    const [operationResult] = oResult;

    return operationResult;
  };
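The till helper isn't defined in this post. Judging only from how it is used above, it turns a promise into one that always resolves to a { result, error } pair. The following is a guess at a minimal version written on that assumption, not necessarily the original helper.

```javascript
// Hypothetical stand-in for till: never rejects, always resolves to
// { result, error } so the caller can destructure both.
const till = (promise) =>
  Promise.resolve(promise)
    .then(result => ({ result, error: null }))
    .catch(error => ({ result: null, error }));

// usage: a rejection arrives as a value instead of an exception
till(Promise.reject(new Error('boom')))
  .then(({ result, error }) => console.log(result, error.message)); // null boom
```

The appeal of this pattern is that each await can be followed by a plain if on error, without wrapping every call in try/catch.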

Time measurement

The Video Intelligence API uses a time offset which consists of a (Long) number of seconds (note this is not a Number), along with a number of nanoseconds. Here’s how to convert it to seconds.

  const getTimeOffset = (timeOffset) => {
    if(!timeOffset) {
      console.log('missing timeoffset');
      return 0;
    }
    const { seconds, nanos } = timeOffset;
    // seconds is actually a Long object, and nano is a Number.
    const timeoffset = parseFloat(seconds || 0) + (nanos || 0) / 1e9;
    return timeoffset
  };
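As a quick check of the arithmetic, here is the same conversion run against hand-made offsets. fakeLong is an invented stand-in for the Long object (anything whose toString() yields the digits will do for parseFloat), and toSeconds just repeats the calculation so the snippet runs on its own.

```javascript
// Invented stand-in for a Long: parseFloat only needs its string form
const fakeLong = (n) => ({ toString: () => String(n) });

// same arithmetic as getTimeOffset, repeated so this runs standalone
const toSeconds = ({ seconds, nanos } = {}) =>
  parseFloat(seconds || 0) + (nanos || 0) / 1e9;

console.log(toSeconds({ seconds: fakeLong(83), nanos: 500000000 })); // 83.5
// seconds can be missing when the offset is under one second
console.log(toSeconds({ nanos: 250000000 })); // 0.25
```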

Splitting the response

As previously mentioned, the response will contain one entire summary dialog, preceded by a number of dialog snippets. Here’s how to split them out.

  // if speaker diarization is on, then the very last item in the speech transcriptions will be a summary of
  // all transcriptions with speaker tags attached
  const splitTranscriptions = ({cleanSpeech, speechTranscriptionConfig})=> {

    // diarization is off, so there is no summary item to split out
    if (!speechTranscriptionConfig.enableSpeakerDiarization) {
      return {
        transcriptions: cleanSpeech
      }
    }

    // need at least one sentence plus the summary item
    if (cleanSpeech.length < 2) {
      console.error('....speaker diarization item missing or suspect - skipping')
      return {
        transcriptions: cleanSpeech
      }
    }

    return {
      transcriptions: cleanSpeech.slice(0,-1),
      // kept as a one element array, which is what tagSpeakers expects
      speakerTagged: cleanSpeech.slice(-1)
    }
  };

Tagging speakers

Once you have the summary dialog containing the speaker tags, you need to go back and assign these to the original dialogs, using a function like this.

  // attach speaker tags to transcriptions
  const tagSpeakers =  ({transcriptions, speakerTagged}) => 
    speakerTagged 
      ? transcriptions.map(f => ({
          ...f,
          segments: f.segments.map(g => {
            const segment = speakerTagged.map(h=>h.segments.find(s=>s.startTime === g.startTime))[0];
            if(!segment) {
              console.error('....couldnt find speakertag item for ', g)
            } else {
              g.speakerTag = segment.speakerTag;
            }
            return g;
          })
        }))
      : transcriptions;
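For longer films, searching the whole summary for every word can get slow. One possible variation, offered as an alternative sketch rather than the approach used above, is to index the summary words by start time once and then look each word up. The name tagWithLookup and the sample data are invented for the illustration.

```javascript
// Alternative sketch: build a startTime -> speakerTag index once,
// then tag each word with a lookup. Returns new objects rather than
// mutating the input segments.
const tagWithLookup = ({ transcriptions, speakerTagged }) => {
  const tagByStart = new Map(
    speakerTagged
      .flatMap(h => h.segments)
      .map(s => [s.startTime, s.speakerTag])
  );
  return transcriptions.map(f => ({
    ...f,
    // speakerTag is undefined when no summary word starts at that time
    segments: f.segments.map(g => ({ ...g, speakerTag: tagByStart.get(g.startTime) }))
  }));
};

// invented sample data in the shape produced by the earlier steps
const transcriptions = [
  { transcript: 'Hello there.', segments: [
    { word: 'Hello', startTime: 0, endTime: 0.4 },
    { word: 'there.', startTime: 0.4, endTime: 0.9 }
  ] },
  { transcript: 'Hi.', segments: [
    { word: 'Hi.', startTime: 1.2, endTime: 1.5 }
  ] }
];
const speakerTagged = [{ segments: [
  { word: 'Hello', startTime: 0, endTime: 0.4, speakerTag: 1 },
  { word: 'there.', startTime: 0.4, endTime: 0.9, speakerTag: 1 },
  { word: 'Hi.', startTime: 1.2, endTime: 1.5, speakerTag: 2 }
] }];

const tagged = tagWithLookup({ transcriptions, speakerTagged });
console.log(tagged.map(t => t.segments.map(s => s.speakerTag)));
// [ [ 1, 1 ], [ 2 ] ]
```

One caveat: if two summary words ever shared a start time, the Map would keep the last of them, whereas find returns the first.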
