Automated video editing will very soon be ‘good enough’

Tuesday, 30 May 2017

A team of Stanford researchers have published a paper on automatic editing of dialogue scenes. Their system may not automatically edit well, but it can now create edits that the majority of people will see as ‘good enough.’ This means editors have new competition. As well as being technically proficient and able to handle all sorts of political and psychological situations, they will have baseline edits to improve on.

Computational Video Editing for Dialogue-Driven Scenes describes a system where different combinations of editing priorities (which kind of shots to favour, which kind of performances to prioritise) are defined as editing idioms. These idioms can then be applied to footage of dialogue scenes when accompanied by a script.

Identify clips

Their system takes a script formatted in an industry standard way, analyses multiple takes of multiple camera setups and divides ranges of each take into candidate clips. These clips are assigned automatically generated labels defining:

  • The name of the character speaking the line of script
  • The emotional sentiment of the line of script (ranging from negative to positive via neutral)
  • The number of people in the clip
  • The zoom level of the clip, i.e. the framing of the clip (long, wide, medium, closeup, extreme closeup)
  • Who is in the clip
  • The volume of the audio in the clip
  • The length of the clip (a portion of a much longer take, which reflects how quickly a given line is said)
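
To make the list of labels concrete, here is a minimal sketch of how one labelled clip could be represented. The class and field names (Clip, Zoom, line_index and so on) and the value ranges are my own assumptions for illustration; the paper may store this information quite differently.

```python
from dataclasses import dataclass
from enum import IntEnum


class Zoom(IntEnum):
    """Framing of a clip, ordered from widest to tightest (hypothetical encoding)."""
    LONG = 0
    WIDE = 1
    MEDIUM = 2
    CLOSEUP = 3
    EXTREME_CLOSEUP = 4


@dataclass(frozen=True)
class Clip:
    """One candidate clip: the range of a take that covers one line of script."""
    line_index: int      # which line of the script this clip covers
    take: str            # identifier of the take the clip was cut from
    speaker: str         # character speaking the line
    sentiment: float     # assumed scale: -1.0 (negative) to 1.0 (positive), 0.0 neutral
    visible: frozenset   # names of the characters in the clip
    zoom: Zoom           # framing of the clip
    volume: float        # audio volume, in arbitrary units
    duration: float      # clip length in seconds

    @property
    def num_people(self) -> int:
        """Number of people in the clip, derived from who is visible."""
        return len(self.visible)


example = Clip(
    line_index=0,
    take="take-03",
    speaker="ALICE",
    sentiment=-0.4,
    visible=frozenset({"ALICE", "BOB"}),
    zoom=Zoom.MEDIUM,
    volume=0.62,
    duration=2.8,
)
```

Ordering the framings as integers makes it easy to talk about a ‘large change in zoom level’ later on: it is just a large difference between two Zoom values.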

Editing idioms

The researchers then analysed multiple editing idioms (pieces of editing advice) and worked out what combination of clip styles would result in edits that match a given style:

Change zoom gradually: Avoid large changes in zoom level

Emphasize character: Avoid cutting away from an important character during short lines from the other characters. Favor two kinds of transitions: (1) transitions in which the length of both clips is long, and (2) transitions in which one of the clips is short and the important character is in the set of visible speakers for the other clip and both clips are from the same take.

Mirror position: Transition between 1-shots of performers that mirror one another’s horizontal positions on screen.

Peaks and valleys: Encourage close ups when the emotional intensity of lines is high, wide shots when the emotional intensity is low, and medium shots when it is in the middle.

Performance fast: Select the shortest clip for each line

Performance slow: Select the longest clip for each line

Performance loud: Select the loudest clip for each line

Performance quiet: Select the quietest clip for each line

Short lines: Avoid cutting away to a new take on short lines

Zoom consistent: Use a consistent zoom level throughout the scene

Zoom in/out: Either zoom in or zoom out over the scene
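
Some of these idioms are simple enough to sketch in a few lines of code. The following is my own illustration, not the researchers' implementation: the Clip tuple, the function names and the numeric zoom scale are all assumptions. The ‘performance’ idioms pick one candidate clip per line, while ‘change zoom gradually’ is expressed as a penalty on the cut between two clips.

```python
from typing import NamedTuple


class Clip(NamedTuple):
    """Hypothetical minimal clip record for this sketch."""
    take: str
    zoom: int        # assumed scale: 0 = long shot ... 4 = extreme closeup
    duration: float  # seconds
    volume: float    # arbitrary loudness units


def performance_fast(candidates):
    """Performance fast: select the shortest clip for a line."""
    return min(candidates, key=lambda c: c.duration)


def performance_loud(candidates):
    """Performance loud: select the loudest clip for a line."""
    return max(candidates, key=lambda c: c.volume)


def zoom_jump_penalty(prev, nxt):
    """Change zoom gradually: no penalty for moving at most one zoom
    level; one penalty point for every extra level jumped."""
    return max(0, abs(prev.zoom - nxt.zoom) - 1)


# Three candidate clips for the same line of script, from different takes.
line_candidates = [
    Clip(take="take1", zoom=1, duration=2.4, volume=0.6),
    Clip(take="take2", zoom=3, duration=1.9, volume=0.8),
    Clip(take="take3", zoom=2, duration=2.1, volume=0.5),
]

fastest = performance_fast(line_candidates)
loudest = performance_loud(line_candidates)
```

Here take2 is both the fastest and the loudest reading of the line, but cutting to it from the wide shot in take1 would jump two zoom levels and pick up a penalty of 1. That tension between idioms is why they need to be weighted against each other.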

Combine idioms to make a custom style

Using an application, the researchers showed how individual idioms (or pieces of editing advice) and specific instructions (‘start on a wide shot’ or ‘keep speaker visible’) can be combined to make an editing style. Each element can be given a weight ranging from ‘always follow instruction’ to ‘always do opposite of instruction.’
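
One way to picture weighted idioms is as a cost that an optimiser tries to minimise over the whole scene. The sketch below is an assumption on my part and should not be read as the paper's algorithm: each idiom is a function that returns a cost, each has a weight (positive to follow it, negative to do the opposite), and a simple dynamic programme picks one clip per line so that the weighted total is as low as possible. All names, the jump cut definition and the tiny data set are hypothetical.

```python
from typing import NamedTuple


class Clip(NamedTuple):
    """Hypothetical clip record: one candidate for one line of script."""
    name: str
    take: int
    zoom: str
    visible: frozenset
    speaker: str  # who speaks the line this clip covers


def start_wide(index, clip):
    """Cost 1 if the scene does not open on a wide shot."""
    return 1.0 if index == 0 and clip.zoom != "wide" else 0.0


def show_speaker(index, clip):
    """Cost 1 if the character speaking the line is not on screen."""
    return 0.0 if clip.speaker in clip.visible else 1.0


def avoid_jump_cut(prev, nxt):
    """Cost 1 for a cut between different takes with the same framing
    and the same people visible (an assumed definition of a jump cut)."""
    same_look = prev.zoom == nxt.zoom and prev.visible == nxt.visible
    return 1.0 if same_look and prev.take != nxt.take else 0.0


def best_edit(lines, clip_idioms, cut_idioms):
    """Choose one clip per line, minimising the weighted idiom costs."""
    def clip_cost(i, c):
        return sum(w * f(i, c) for f, w in clip_idioms)

    def cut_cost(a, b):
        return sum(w * f(a, b) for f, w in cut_idioms)

    # best[j] = (cost, path) of the cheapest edit ending on candidate j.
    best = [(clip_cost(0, c), [c]) for c in lines[0]]
    for i in range(1, len(lines)):
        new_best = []
        for c in lines[i]:
            cost, path = min(
                ((pc + cut_cost(pp[-1], c), pp) for pc, pp in best),
                key=lambda t: t[0],
            )
            new_best.append((cost + clip_cost(i, c), path + [c]))
        best = new_best
    return min(best, key=lambda t: t[0])[1]


AB, A, B = frozenset({"A", "B"}), frozenset({"A"}), frozenset({"B"})
lines = [
    [Clip("wide_t1", 1, "wide", AB, "A"), Clip("closeA_t2", 2, "closeup", A, "A")],
    [Clip("wide_t5", 5, "wide", AB, "B"), Clip("closeB_t3", 3, "closeup", B, "B"),
     Clip("closeA_t2", 2, "closeup", A, "B")],
    [Clip("closeA_t2", 2, "closeup", A, "A"), Clip("closeB_t3", 3, "closeup", B, "A")],
]

# The style: 'start with a wide shot, avoid jump cuts, show speaker'.
style_clip_idioms = [(start_wide, 1.0), (show_speaker, 1.0)]
style_cut_idioms = [(avoid_jump_cut, 1.0)]

edit = [c.name for c in best_edit(lines, style_clip_idioms, style_cut_idioms)]
```

With these weights the edit opens on the wide shot, cuts to B's closeup for B's line and then to A's closeup for A's line. Changing a weight and calling best_edit again is cheap, because the expensive labelling of the footage has already been done. Note that finite weights only express preferences; a true ‘always follow’ setting would need to be treated as a hard constraint.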

This UI mockup shows how an editing style can be built where the elements are ‘start with a wide shot, avoid jump cuts, show speaker’:


The paper comes with a demo video that explains the process and gives examples of a scene professionally edited and the same scene automatically edited using different editing styles.

To see more example videos and source footage visit the paper’s site at Stanford.

Time savings

The impetus behind developing the system was to save time, and to save the cost of hiring a professional editor.

For multiple dialogue scenes the researchers timed how long it took for a professional editor to review all footage and come up with an edited scene. As this method is at the research stage, the kind of analysis that the tools need to do on the video takes a long time. In the case of the scene shown in the demo and in the screenshot, a 27-line scene with 15 takes (of varying shot size and angle) amounting to 18 minutes of rushes took 3 hours and 20 minutes to analyse. The professional editor took 3 hours to come up with an edit.

The advantage came when changes needed to be made in editing style. The automated system could re-edit the scene in 3 seconds. It would take many times longer for an editor to re-edit a scene following new instructions. The analysis stage was done on a 3.1 GHz MacBook Pro with 16GB of RAM. With software and hardware improvements the time it takes to turn multiple takes into labelled clips will reduce significantly.

What does ‘good enough’ mean for editors and post production?

To me this method marks a tipping point. For productions with many hours of rushes, these kinds of automated pre-edits are good enough. Good enough to release (with a few minutes of tidying up) in some cases. Good enough to base production decisions on (such as ‘We can now strike this set’). Good enough so that a skilled editor can spend a short time tidying some of the automated edits and preparing them to be shared with the world.

Although the researchers haven't encoded the kind of editing idioms many good editors actually follow, the ones they have chosen will do for many situations. There are two possible reasons for this: the researchers don’t know these practices, or they don't yet have a way to detect elements of scripts and source footage that editors currently base their personal editing idioms on.

One of the great things about the job of being an editor is that it is hard for others to compare your editing abilities with other editors. Up until now, a person would have to look at all the footage and all the versions of the script for a given production to judge whether the editor got the best possible result. Even then, that judgement would only be one more person’s opinion.

Now an editor’s take can be compared with automated edits like the ones described in this paper. Their style will soon be able to be detected and encoded as an editing style for automated edits. Could I sell a plugin based on my editing idiom? 0.1% of receipts would be a big enough royalty for me!

The good news for editors who are worried about being replaced is that once your skills get to the level of ‘not obviously bad’ (the ability to do edits that aren't jarring, that flow from moment to moment and scene to scene) other factors take over: being the kind of person who fits into the wider organisation, being a person others can share a small space with for hours on end, and being a person who can judge the politics and psychology of situations with collaborators at all levels.

Who knows when this kind of technology will be available outside academia? For now it is worth bearing in mind that alongside the three researchers from Stanford University (Mackenzie Leake, Abe Davis and Maneesh Agrawala), the authorship of the paper was also shared with Anh Truong of Adobe Research.