Photoshopping Audio…

By now, we’re all familiar with the idea that images can be manipulated – “photoshopped” – to modify a depicted scene in some way (for example, Even if the Camera Never Lies, the Retouched Photo Might…). Vocal recordings can likewise be modified using audio processing techniques such as pitch shifting, while traditional audio editing techniques such as cutting and splicing can be used to reorder separately cut words and so create “spoken” sentences that have never actually been uttered.

But what if we could identify both the words spoken by an actor, and model their voice, so that we could edit out their mistakes, or literally put our own words in their mouths, by changing a written text that is then used to generate the soundtrack?

A new technique for editing audio, demonstrated by Adobe in late 2016, does exactly that. An audio track is used to generate both a speech generating model and a text track of the spoken words. This allows the text track to be edited, not just by rearranging the order of the originally spoken words, but also by inserting new words.

Not surprisingly, the technique could raise concern about the “evidential” quality of recorded speech.

EXERCISE: Read the contemporaneous report of the Adobe VoCo demonstration from the BBC News website, “Adobe Voco ‘Photoshop-for-voice’ causes concern”. What concerns are raised in the report? What other concerns, if any, do you think this sort of technology raises?

The technique was reported in more detail in a SIGGRAPH 2017 paper (Zeyu Jin, Gautham J. Mysore, Stephen DiVerdi, Jingwan Lu, and Adam Finkelstein. VoCo: Text-based Insertion and Replacement in Audio Narration. ACM Transactions on Graphics 36(4): 96, 13 pages, July 2017), which describes it as follows:

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state of the art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text to speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor’s own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.
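To make the idea of transcript-driven editing a little more concrete, here is a minimal Python sketch of the simpler, pre-VoCo case: cutting and pasting already-spoken words by editing the text. It is an illustration only, not the method from the paper. The `Word` class, the `render_edit` function and the toy sample values are all hypothetical, and the sketch assumes that word timings (an "alignment" between transcript and audio) are already available.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One aligned transcript word (hypothetical structure for this sketch)."""
    text: str
    start: int  # index of the word's first audio sample
    end: int    # index one past the word's last audio sample


def render_edit(samples, alignment, edited_text):
    """Rebuild audio for edited_text by reusing clips of words already spoken.

    Returns the spliced samples plus the list of words that were never
    spoken in the original recording and so cannot be produced by splicing.
    """
    # Keep the first recorded clip of each distinct word.
    clips = {}
    for word in alignment:
        clips.setdefault(word.text.lower(), samples[word.start:word.end])

    spliced, missing = [], []
    for token in edited_text.lower().split():
        if token in clips:
            spliced.extend(clips[token])
        else:
            missing.append(token)
    return spliced, missing


# Toy "recording": 12 samples covering the three words "never say that".
samples = list(range(12))
alignment = [Word("never", 0, 3), Word("say", 3, 6), Word("that", 6, 12)]

audio, missing = render_edit(samples, alignment, "say that never again")
print(audio)    # samples reordered to match the edited text
print(missing)  # words with no recorded clip
```

Reordering the text reorders the audio clips, but the word "again" ends up in `missing` because it was never recorded. That is exactly the gap the paper sets out to fill, by synthesizing the new word in a voice that matches the narration.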

Voice Capture and Modelling

A key part of the Adobe VoCo approach is the creation of a voice model that can be used to generate utterances that sound like the spoken words of the person whose voice has been modelled, a technique we might think of in terms of “voice capture and modelling”. As the algorithms improve, the technique is likely to become more widely available, as suggested by other companies developing demonstrations in this area.

For example, start-up company Lyrebird have already demonstrated a service that will model a human voice from one minute’s worth of voice capture, and allow you to create arbitrary utterances from text spoken using that voice.

Read more about Lyrebird in the Scientific American article New AI Tech Can Mimic Any Voice by Bahar Gholipour.

Lip Synching Video – When Did You Say That?

The ability to use captured voice models to generate narrated tracks works fine for radio, but what if you wanted to actually see the actor “speak” those words? By generating a facial model of a speaker, it is possible to use a video representation of an individual as a puppet whose facial movements can be acted by someone else, a technique described as facial re-enactment (Thies, Justus, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. “Real-time expression transfer for facial reenactment”, ACM Trans. Graph. 34, no. 6 (2015): 183-1).

Facial re-enactment involves morphing features or areas from one face onto corresponding elements of another, and then driving a view of the second face from motion capture of the first.

But what if we could generate a model of the face that allowed facial gestures, such as lip movements, to be captured at the same time as an audio track, and then use the audio (and lip capture) from one recording to “lipsync” the same actor speaking those same words in another setting?

The technique is described in Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman. “Synthesizing Obama: learning lip sync from audio,” ACM Transactions on Graphics (TOG) 36.4 (2017): 95. Audio and sparse mouth shape features from one video are associated using a neural network. The sparse mouth shape is then used to synthesize a texture for the mouth and lower region of the face that can be blended onto a second, stock video of the same person, with the jaw shapes aligned.
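The idea of blending a synthesized mouth region onto a stock video frame can be illustrated with a very simple per-pixel alpha blend. This is only a sketch of what "blending" means in general, and is not claimed to be the compositing method used in the paper. The `blend` function, the tiny 2×3 greyscale "frames" and the mask values are all made up for illustration.

```python
def blend(stock, patch, mask):
    """Per-pixel alpha blend of two equally sized greyscale frames.

    A mask value of 1.0 takes the pixel from the synthesized patch,
    0.0 keeps the stock pixel, and values in between mix the two.
    """
    return [
        [m * p + (1.0 - m) * s for s, p, m in zip(stock_row, patch_row, mask_row)]
        for stock_row, patch_row, mask_row in zip(stock, patch, mask)
    ]


# Toy frames: the stock frame is uniformly dark, the synthesized patch brighter.
stock = [[100.0, 100.0, 100.0],
         [100.0, 100.0, 100.0]]
patch = [[200.0, 200.0, 200.0],
         [200.0, 200.0, 200.0]]

# The mask leaves the top row (upper face) untouched and fades the
# synthesized texture in across the bottom row (mouth region).
mask = [[0.0, 0.0, 0.0],
        [0.0, 0.5, 1.0]]

frame = blend(stock, patch, mask)
print(frame)
```

A soft-edged mask like this avoids a visible seam where the synthesized region meets the original footage.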

For now, the approach is limited to transposing the words spoken by a person in one video recording onto a second video of the same person. As one of the researchers, Steven Seitz, is quoted as saying in Lip-syncing Obama: New tools turn audio clips into realistic video, “[y]ou can’t just take anyone’s voice and turn it into an Obama video. We very consciously decided against going down the path of putting other people’s words into someone’s mouth. We’re simply taking real words that someone spoke and turning them into realistic video of that individual.”


Digital Worlds – Distorted Reality