Archive for the 'Audio' Category

Photoshopping Audio…

By now, we’re all familiar with the idea that images can be manipulated – “photoshopped” – to modify a depicted scene in some way (for example, Even if the Camera Never Lies, the Retouched Photo Might…). Voices can be modified in a similar way: audio processing techniques such as pitch-shifting can alter how a voice sounds, and traditional audio editing techniques such as cutting and splicing can be used to reorder separately cut words and so create “spoken” sentences that have never actually been uttered.

But what if we could identify both the words spoken by an actor, and model their voice, so that we could edit out their mistakes, or literally put our own words in their mouths, by changing a written text that is then used to generate the soundtrack?

Adobe demonstrated a new technique for editing audio in late 2016 that does exactly that. An audio narration is used to generate both a model of the speaker’s voice and a text transcript of what was said. The transcript can then be edited, not just to rearrange the order of the originally spoken words, but also to insert new words that are then synthesised in the speaker’s voice.

Not surprisingly, the technique raises concerns about the “evidential” quality of recorded speech.

EXERCISE: Read the contemporaneous report of the Adobe VoCo demonstration from the BBC News website, “Adobe Voco ‘Photoshop-for-voice’ causes concern”. What concerns are raised in the report? What other concerns, if any, do you think this sort of technology raises?

The technique was reported in more detail in a SIGGRAPH 2017 paper – Zeyu Jin, Gautham J. Mysore, Stephen DiVerdi, Jingwan Lu, and Adam Finkelstein, “VoCo: Text-based Insertion and Replacement in Audio Narration”, ACM Transactions on Graphics 36(4): 96, 13 pages, July 2017 – which describes the technique as follows:

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state of the art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text to speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor’s own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.
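To get a feel for the text-based editing part of this (leaving aside the harder problem of synthesising new words in the narrator’s voice), the sketch below shows how cut-and-paste operations on a transcript can be applied directly to the waveform, assuming a word-level alignment of the narration is available. The alignment, timings and function names here are invented for illustration; this is not Adobe’s implementation.

```python
import numpy as np

# Hypothetical word-level alignment of a narration: (word, start_s, end_s).
# In a real system this would come from forced alignment of the audio against
# its transcript; the words and timings here are invented for illustration.
alignment = [
    ("the",   0.00, 0.18),
    ("quick", 0.18, 0.55),
    ("brown", 0.55, 0.90),
    ("fox",   0.90, 1.30),
]

def word_samples(narration, sr, start_s, end_s):
    """Slice out the samples corresponding to one aligned word."""
    return narration[int(start_s * sr):int(end_s * sr)]

def rearrange(narration, sr, alignment, new_order):
    """Rebuild the narration by concatenating aligned words in a new order."""
    pieces = [word_samples(narration, sr, alignment[i][1], alignment[i][2])
              for i in new_order]
    return np.concatenate(pieces)

# e.g. swap "quick" and "brown" in the spoken narration:
# edited = rearrange(narration, sr=22050, alignment=alignment, new_order=[0, 2, 1, 3])
```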

Voice Capture and Modelling

A key part of the Adobe VoCo approach is the creation of a voice model that can be used to generate utterances that sound like the spoken words of the person whose voice has been modelled, a technique we might think of in terms of “voice capture and modelling”. As the algorithms improve, the technique is likely to become more widely available, as suggested by other companies developing demonstrations in this area.

For example, start-up company Lyrebird have already demonstrated a service that will model a human voice from one minute’s worth of voice capture, and then allow you to generate arbitrary utterances in that voice from typed text.

Read more about Lyrebird in the Scientific American article New AI Tech Can Mimic Any Voice by Bahar Gholipour.

Lip Synching Video – When Did You Say That?

The ability to use captured voice models to generate narrated tracks works fine for radio, but what if you wanted to actually see the actor “speak” those words? By generating a facial model of a speaker, it is possible to use a video representation of an individual as a puppet whose facial movements are acted out by someone else, a technique described as facial re-enactment (Thies, Justus, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt, “Real-time expression transfer for facial reenactment”, ACM Trans. Graph. 34, no. 6 (2015), Article 183).

Facial re-enactment involves morphing features or areas from one face onto corresponding elements of another, and then driving a view of the second face from motion capture of the first.

But what if we could generate a model of the face that allowed facial gestures, such as lip movements, to be captured at the same time as an audio track, and then use the audio (and lip capture) from one recording to “lipsync” the same actor speaking those same words in another setting?

The technique is described in Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman, “Synthesizing Obama: learning lip sync from audio”, ACM Transactions on Graphics (TOG) 36.4 (2017): 95. The process works as follows: audio and sparse mouth shape features from one video are associated using a neural network. The sparse mouth shape is then used to synthesize a texture for the mouth and lower region of the face that can be blended onto a second, stock video of the same person, with the jaw shapes aligned.

For now, the approach is limited to transposing the spoken words from one video recording of a person onto a second video of the same person. As one of the researchers, Steven Seitz, is quoted in Lip-syncing Obama: New tools turn audio clips into realistic video: “[y]ou can’t just take anyone’s voice and turn it into an Obama video. We very consciously decided against going down the path of putting other people’s words into someone’s mouth. We’re simply taking real words that someone spoke and turning them into realistic video of that individual.”


Smart Hearing

As we have already seen, there are several enabling technologies that need to be in place in order to put together an effective mediated reality system. In a visual augmented reality system, this includes having some sort of device for recording the visual scene, tracking objects within it, rendering augmented features in the scene, and some means of displaying the resulting scene to the user. We reviewed a range of approaches for rendering augmented visual scenes in the post Taxonomies for Describing Mixed and Alternate Reality Systems, but how might we go about implementing an audio-based mediated reality?

In Noise Cancellation – An Example of Mediated Audio Reality?, we saw how headphone-based systems could be used to present a processed audio signal directly to a subject – a proximal form of mediation, much like a head mounted display – or how a speaker could be used to provide a more environmental form of mediation, more akin to a projection-based system in the visual sense.

Whilst the enabling technologies for video-based proximal AR systems are still, at best, at the clunky prototype stage, discreet solutions for realtime, everyday audio-based mediation already exist in the form of hearing aids, enhanced in recent years by advanced digital signal processing techniques.

The following promotional video shows how far hearing aids have developed in recent years, moving from simple amplifiers to complex devices that combine digital signal processing of the audio environment with connections to other audio-generating devices, such as phones, radios and televisions.

To manage the range of features offered by such devices, they are complemented by full featured remote control apps that allow the user to control what they hear, as well as how they hear it – audio hyper-reality:

The following video review of the Here “Active Listening” earbuds further demonstrates how “audio wearables” can provide a range of audio effects – and capabilities – that can augment the hearing of a wearer who does not necessarily suffer from hearing loss or some other form of hearing impairment. (If you’d rather read a review of the same device, the Vice Motherboard blog has one – These Earbuds Are Like Instagram Filters for Live Music.)

SAQ: What practical challenges face designers of in-ear, wirelessly connected audio devices?
Answer: I can think of two immediately: how is the wireless signal received (what sort of antenna is required?) and how is the device powered?

Customised frequency response profiles are also supported in some mobile phones. For example, top-end Samsung Android phones include a feature known as Adapt Sound that allows a user to calibrate their phone’s headphone output based on a frequency-based hearing test (video example).

Hearing aids are typically made up of several elements: an earpiece that transmits sound into the ear; a microphone that receives the sound; an amplifier that amplifies the sound; and a battery that powers the device. Digital hearing aids may also include remote control circuitry to allow the hearing aid to be controlled remotely; circuitry to support digital signal processing of the received sound; and even a wireless receiver capable of receiving and then replaying sound files or streams from a mobile phone or computer.

Digital hearing aids can be configured to tune the frequency response of the device to suit the needs of each individual user as the following video demonstrates.
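As a rough illustration of what such a fitting involves, the following sketch (in Python, using numpy) applies a per-band gain profile to a mono signal, boosting the frequency regions where a hypothetical hearing test showed reduced sensitivity. A real hearing aid does far more than this – dynamic range compression, feedback suppression and low-latency processing, for a start – so treat it only as a sketch of the frequency-shaping idea.

```python
import numpy as np

def shape_frequency_response(audio, sr, band_gains_db):
    """Apply a per-band gain profile to a mono signal.

    band_gains_db: list of (low_hz, high_hz, gain_db) tuples, e.g. boosting
    the speech-critical 1-6 kHz region where a (hypothetical) hearing test
    showed reduced sensitivity.
    """
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    for low_hz, high_hz, gain_db in band_gains_db:
        band = (freqs >= low_hz) & (freqs < high_hz)
        spectrum[band] *= 10 ** (gain_db / 20.0)   # convert dB gain to a linear factor
    return np.fft.irfft(spectrum, n=len(audio))

# Hypothetical fitting profile derived from a hearing test:
# profile = [(125, 1000, 0.0), (1000, 4000, 6.0), (4000, 8000, 12.0)]
# fitted = shape_frequency_response(mono_signal, sr=16000, band_gains_db=profile)
```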

Hearing aids come in a range of form factors – NHS Direct describes the following:

  • Behind-the-ear (BTE): rests behind the ear and sends sound into the ear through either an earmould or a small, soft tip (called an open fitting)
  • In-the-ear (ITE): sits in the ear canal and the shell of the ear
  • In-the-canal (ITC): working parts in the earmould, so the whole hearing aid fits inside the ear canal
  • Completely-in-the-canal (CIC): fits further into your ear canal than an ITC aid

Age UK further identify two forms of spectacle hearing aid systems – bone conduction devices and air conduction devices – that are suited to different forms of hearing loss:

With a conductive hearing loss there is some physical obstruction to conducting the sound through the outer ear, eardrum or middle ear (such as a wax blockage, or perforated eardrum). This can mean that the inner ear or nerve centre on that ear is in good shape, and by sending sound straight through the bone behind a patient’s ear the hearing loss can effectively be bypassed. Bone Conduction or “BC” spectacle hearing aids are ideal for this because a transducer is mounted in the arm of the glasses behind the ear that will transmit the sound through the bone to the inner ear instead of along the ear canal.

Sensorineural hearing loss occurs when the anatomical site responsible for the deficiency is the inner ear or further along the auditory pathway (such as age related loss or noise induced hearing loss). Delivering the sound via a route other than the ear canal will not help in these cases, so Air Conduction “AC” spectacle hearing aids are utilised with a traditional form of hearing aid discreetly mounted in the arm of the glasses and either an earmould or receiver with a soft dome in the ear canal.

The following video shows how the frames of digital hearing glasses can be used to package the components required to implement the hearing aid.

And the following promotional video shows in a little more detail how the glasses are put together – and how they are used in everyday life (with a full range of digital features included!).

EXERCISE: Read the following article from The Atlantic – “What My Hearing Aid Taught Me About the Future of Wearables”. What does the author think wearable devices need to offer to make the user want to wear them? How does the author’s experience of wearing a hearing aid colour his view of how wearable devices might develop in the near future?

Many people wear spectacles and/or hearing aids as part of their everyday life, “boosting” their perception of the reality around them in particular ways in order to compensate for less than perfect eyesight or hearing. Advances in hearing aids suggest that many hearing aid users may already be benefiting from reality augmentations that people without hearing difficulties might also value. And whilst wearing spectacles to correct for poor vision is commonplace, it is also possible to wear eyewear without a corrective function as a fashion item or accessory. Devices such as hearing spectacles already combine battery-powered, wirelessly connected audio enhancement with “passive” visual enhancement (corrective lenses). So might we start to see these sorts of device evolving into augmented reality headwear?

Can You Really Believe Your Ears?

In Even if the Camera Never Lies, the Retouched Photo Might… we saw how photos could be retouched to provide an improved version of a visual reality, and in the interlude activity on Cleaning Audio Tracks With Audacity we saw how a simple audio processing tool could be used to clean up a slightly noisy audio track. In this post, we’ll see how particular audio signals can be modified in real time, if we have access to them individually.

Audio tracks recorded for music, film, television or radio are typically multi-track affairs, with each audio source having its own microphone and its own recording track. This allows each track to be processed separately, and then mixed with the other tracks to produce the final audio track. In a similar way, many graphic designs, as well as traditional animations, are constructed of multiple independent, overlaid layers.
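In signal terms, the final mix is simply a weighted sum of the individual tracks, each of which can be processed on its own beforehand. A minimal sketch (in Python, with invented track names):

```python
import numpy as np

def mix(tracks, gains):
    """Mix separately recorded, equal-length mono tracks into a single signal."""
    out = np.zeros_like(tracks[0], dtype=float)
    for track, gain in zip(tracks, gains):
        out += gain * track        # each track can be processed/effected before this point
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out   # avoid clipping in the final mix

# final_mix = mix([vocal, guitar, drums], gains=[1.0, 0.8, 0.6])
```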

Conceptually, the challenge of augmented reality may be likened to adding an additional layer to the visual or audio scene. In order to achieve an augmented reality effect, we might need to separate out a “flat” source, such as a mixed audio track or a video image, into separate layers, one for each item of interest. The layer(s) corresponding to the item(s) of interest may then be augmented through the addition of an overlay layer onto each as required.

One way of thinking about visual augmented reality is to consider it in terms of inserting objects into the visual field, for example adding an animated monster into a scene, overlaying objects in some way, such as re-coloring or re-texturing them, or transforming them, for example by changing their shape.

EXERCISE: How might you modify an audio / sound based perceptual environment in each of these ways?

ANSWER: Inserted – add a new sound into the audio track, perhaps trying to locate it spatially in the stereo field. Overlaid – if you think of this in terms of texture, this might be like adding echo or reverb to a sound, although this is actually more like changing how we perceive the space the sound is located in. Transformed might be something like pitch-shifting the voice in real time, to make it sound higher pitched, or deeper. I’m not sure if things like noise cancellation would count as a “negative insertion” or a “cancelling overlay”?!
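As a concrete illustration of the “inserted” case, the sketch below adds a new mono sound into an existing stereo scene at a chosen time, using simple constant-power panning to place it somewhere in the stereo field. The function and parameter names are invented for illustration.

```python
import numpy as np

def insert_panned(scene_stereo, new_sound, sr, at_seconds, pan=0.0):
    """Insert a mono sound into a stereo scene.

    pan: -1.0 = hard left, 0.0 = centre, +1.0 = hard right – a crude way of
    locating the inserted sound spatially in the stereo field.
    """
    out = scene_stereo.astype(float).copy()            # shape (n_samples, 2)
    start = int(at_seconds * sr)
    end = min(start + len(new_sound), len(out))
    clip = new_sound[:end - start]
    left_gain = np.sqrt((1.0 - pan) / 2.0)             # constant-power pan law
    right_gain = np.sqrt((1.0 + pan) / 2.0)
    out[start:end, 0] += left_gain * clip
    out[start:end, 1] += right_gain * clip
    return out

# e.g. drop a monster's growl slightly to the listener's right, three seconds in:
# augmented = insert_panned(scene, growl, sr=44100, at_seconds=3.0, pan=0.4)
```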

When audio sources are recorded on separate tracks, adding additional effects to them becomes a simple matter. It also provides us with an opportunity to “improve” the audio track, just as we might “improve” a photograph by retouching it.

Consider, for example, the problem of a singer who can’t sing in tune (much like the model with a bad complexion that needs “fixing” to meet the demands of a fashion magazine…). Can we fix that?

Indeed we can – part of the toolbox in any recording studio will be something that can correct for pitch and help retune an out-of-tune vocal performance.

For an interesting read on Auto-Tune, an industry standard pitch correction tool, see The Mathematical Genius of Auto-Tune.
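To get a feel for what a pitch corrector is doing, here is a deliberately crude sketch that estimates the overall pitch of a short vocal phrase and shifts the whole phrase to the nearest equal-tempered note. It assumes the open source librosa library; a real tool such as Auto-Tune tracks and corrects the pitch note by note, in real time, while trying to preserve natural vibrato.

```python
import numpy as np
import librosa

def snap_to_nearest_semitone(vocal, sr):
    """Very crude 'auto-tune': shift a short phrase onto the nearest note."""
    f0 = librosa.yin(vocal, fmin=80, fmax=800, sr=sr)   # frame-wise pitch estimates (Hz)
    f0_hz = np.median(f0[np.isfinite(f0)])              # one representative pitch for the phrase
    midi = librosa.hz_to_midi(f0_hz)                    # continuous MIDI note number
    correction = np.round(midi) - midi                  # semitones to the nearest note
    return librosa.effects.pitch_shift(vocal, sr=sr, n_steps=float(correction))

# y, sr = librosa.load("wobbly_vocal.wav", sr=None, mono=True)
# tuned = snap_to_nearest_semitone(y, sr)
```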

But vocal performances can also be transformed in other ways: an actor might provide a voice performance that is then processed so that it sounds like a different person. For example, the MorphBox Voice Changer application allows you to create a range of voice profiles that can transform your voice into various other voice types.

Not surprisingly, as the computational power of smartphones increases, this sort of effect has made its way into novelty app form. Once again, it seems as if augmented reality novelty items are starting to appear all around us, even if we don’t necessarily think of them as such at first.

DO: if you have a smartphone, see if you can find a voice-modifying application for it. What features does it offer? To what extent might you class it as an augmented reality application, and why?

Diminished Audio Reality – Removing a Vocal from a Musical Jingle

In the post Noise Cancellation – An Example of Mediated Audio Reality? we saw how background or intrusive environmental noise could be removed using noise cancelling headphones. In this post, you’ll learn a simple trick for diminishing an audio reality by removing a vocal track from a musical jingle.

Noise cancellation may be thought of as adding the complement of everything that is not the desired signal component to an audio feed in order to remove the unwanted noise component. This same idea can be used as the basis of a crude attempt to remove a mono vocal signal from a stereo audio track: by creating our own inverted copy of the vocal component and mixing it back into the original, the vocal can be cancelled out.

SAQ: Describe an algorithm corresponding to the first part of the method suggested in the How to Remove Vocals from a Song Using Audacity video for removing a vocal track from a stereo music track. How does the algorithm compare to the algorithm you described for the noise cancelling system?

SAQ: The technique described in the video relies on the track having a mono vocal signal and stereo backing track. The simple technique also lost some of the bass when the vocals were removed. How was the algorithm modified to try to preserve the bass component? How does the modification preserve the bass component? 
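For reference, a minimal sketch of the basic channel-subtraction idea (with no attempt to preserve the bass) might look like the following, assuming a stereo WAV file in which the vocal is panned dead centre while the backing differs between the left and right channels:

```python
import numpy as np
from scipy.io import wavfile

def remove_centre_vocal(path_in, path_out):
    """Crude vocal remover: subtract one channel from the other so that any
    centre-panned (identical-in-both-channels) component - typically the lead
    vocal, but also the bass and kick drum - cancels out."""
    sr, stereo = wavfile.read(path_in)                 # stereo has shape (n_samples, 2)
    left = stereo[:, 0].astype(float)
    right = stereo[:, 1].astype(float)
    karaoke = left - right                             # common (centre) component cancels
    peak = np.max(np.abs(karaoke))
    if peak > 0:
        karaoke /= peak                                # normalise to avoid clipping
    wavfile.write(path_out, sr, karaoke.astype(np.float32))

# remove_centre_vocal("jingle.wav", "jingle_backing_only.wav")
```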

Recovering Audio from Video – But Not How You Might Expect…

In The Art of Sound – Algorithmic Foley Artists?, we saw how researchers from MIT’s CSAIL Lab were able to train a system to recreate the sound of a silently videoed object being hit by a drumstick, using a model based on video-plus-sound recordings of lots of different sorts of objects being struck. In this post, we’ll see another way of recovering audio information from a purely visual capture of a scene, also developed at CSAIL.

Fans of Hollywood thrillers or surveillance-themed TV series may be familiar with the idea of laser microphones, in which laser light projected onto and reflected from a window can be used to track the vibrations of the window pane and record the audio of people talking behind the window.

Once the preserve of surveillance agencies, such devices can today be cobbled together in your garage using components retrieved from commodity electronics devices.

The technique used by the laser microphone is based on measuring the vibrations caused by the sound waves you want to record. This suggests that if you can find other ways of tracking those vibrations, you should similarly be able to retrieve the audio. Which is exactly what the MIT CSAIL researchers did: by analysing video footage of objects that vibrated in sympathy (albeit minutely) with sounds in their environment, they were able to generate a recovered audio signal.

As the video shows, in the case of capturing a background musical track, whilst the recovered audio was not necessarily of the highest fidelity, by feeding it into another application – such as Shazam, an application capable of recognising music tracks – the researchers were at least able to identify the track automatically.

So not only can we create videos from still photographs, as described  in Hyper-reality Offline – Creating Videos from Photos, we can also recover audio from otherwise silent videos.

Interlude – Cleaning Audio Tracks With Audacity

Noise cancelling headphones remove background noise by comparing a desired signal to a perceived signal and removing the unwanted components. So for noisy situations where we don’t have access to the clean signal, are we stuck with just the noisy signal?

Not necessarily.

Audio editing tools like Audacity can also be used to remove constant background noise from an audio track by building a simple model of the noise component and then removing it from the audio track.

The following tutorial shows how a low level of background noise may be attenuated by generating a model of the baseline noise on a supposedly quiet part of an audio track and then removing it from the whole of the track. (The effect referred to as Noise Removal in the following video has been renamed Noise Reduction in more recent versions of Audacity.)
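The following sketch gives a flavour of this kind of two-pass approach: build a per-frequency noise profile from a supposedly “quiet” stretch of the recording, then attenuate any time-frequency bin in the full track that doesn’t rise clearly above that profile. It assumes numpy and scipy; Audacity’s actual Noise Reduction effect is considerably more sophisticated, smoothing its gain changes over time and frequency to reduce artefacts.

```python
import numpy as np
from scipy.signal import stft, istft

def reduce_noise(audio, noise_sample, sr, reduction_db=12.0):
    """Attenuate steady background noise using a noise profile."""
    nperseg = 1024
    _, _, noise_spec = stft(noise_sample, fs=sr, nperseg=nperseg)
    noise_profile = np.mean(np.abs(noise_spec), axis=1, keepdims=True)  # per-frequency noise level

    _, _, spec = stft(audio, fs=sr, nperseg=nperseg)
    floor_gain = 10 ** (-reduction_db / 20.0)          # how far to pull "noise only" bins down
    gain = np.where(np.abs(spec) > 2.0 * noise_profile, 1.0, floor_gain)
    _, cleaned = istft(spec * gain, fs=sr, nperseg=nperseg)
    return cleaned

# cleaned = reduce_noise(track, track[:sr // 2], sr)   # treat the first half second as noise-only
```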

SAQ: As the speaker records his test audio track, we see Audacity visualising the waveform in real time. To what extent might we consider this a form of augmented reality?

Other filters can be used to remove noise components with a different frequency profile such as the “pops” and “clicks” you might hear on a recording made from a vinyl record.

In each of the above examples, Audacity provides a visual representation of the audio waveform, creating a visual reality from an audio one. This reinforces, through visualisation, what the original problems were with the audio signals and the consequences of applying a particular audio effect when trying to clean them up.

DO: if you have a noisy audio file to hand and fancy trying to clean it up, why not try out the techniques shown in the videos above – or see if you can find any more related tutorials.

Noise Cancellation – An Example of Mediated Audio Reality?

Whilst it is tempting to focus on the realtime processing of visual imagery when considering augmented reality – notwithstanding the tricky problem, in magic lens approaches, of inserting a transparent display between the viewer and the physical scene – it may be that the real benefits of augmented reality will arise from the augmentation or realtime manipulation of another modality, such as sound.

EXERCISE: describe two or three examples of how audio may be used, or transformed, to alter a user’s perception or understanding of their current environment.

ANSWER: Car navigation systems augment spatial location with audio messages describing when to turn; audio guides in heritage settings let you listen to a story that “augments” a particular location; noise cancelling earphones transform the environment by subtracting, or tuning out, background noise; and modern digital hearing aids process the audio environment at a personal level in increasingly rich ways.

Noise Cancellation

As briefly described in Blurred Edges – Dual Reality, mediated reality is a general term describing systems in which information may be added to or subtracted from a real world scene. In many industrial and everyday settings, intrusive environmental noise may lead to an unpleasant work environment, or act as an obstacle to audio communication. In such situations, it might be convenient to remove the background noise and expose the subjects within it to a mediated audio reality.

Noise cancellation provides one such form of mediated reality, where the audio environment is actively “cleaned” of an otherwise intrusive noise component. Noise cancellation technology can be used to cancel out intrusive noise in noisy environments, such as cars or aircraft. By removing noisy components from the real world audio, noise cancellation may be thought of as producing a form of diminished reality, in the sense that environmental components have been removed rather than added, even though the overall salient signal-to-noise ratio may have increased.

Noise cancelled environments might also be considered as a form of hyper-reality, in the sense that no information other than that contained within, or derived from, the original signal is presented as part of the “augmented” experience.
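The physical principle being exploited is destructive interference: a sound added to a phase-inverted copy of itself cancels out. The idealised sketch below (numpy, with a synthetic 100 Hz hum standing in for engine noise) demonstrates the idea; a real system has to estimate the noise with a microphone and compensate for latency and the acoustic path between loudspeaker and ear, which is considerably harder.

```python
import numpy as np

sr = 44100
t = np.arange(sr) / sr                       # one second of sample times
hum = 0.5 * np.sin(2 * np.pi * 100 * t)      # steady 100 Hz hum standing in for engine noise
anti_hum = -hum                              # phase-inverted ("anti-noise") copy

residual = hum + anti_hum                    # what the listener would hear
print(np.max(np.abs(residual)))              # ~0.0 - the hum has been cancelled
```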

EXERCISE: watch the following videos that demonstrate the effect of noise cancelling headphones and that describe how they work, then answer the following questions:

  • how does “active” noise cancellation differ from passive noise cancellation?
  • what sorts of noise are active noise cancellation systems most effective at removing, and why?
  • what sort of system can be used to test or demonstrate the effectiveness of noise cancelling headphones?

Finally, write down an algorithm that describes, in simple terms, the steps involved in a simple noise cancelling system.

EXERCISE: Increasingly, top end cars may include some sort of noise cancellation system to reduce the effects of road noise. How might noise cancellation be used, or modified, to cancel noise in an enclosed environment where headphones are not typically worn, such as when sat inside a car?

Rather than presenting the mixed audio signal to a listener via headphones, under some circumstances speakers may be used to cancel the noise as experienced within a more open environment.

As well as improving the experience of someone listening to music in a noisy environment, noise cancellation techniques can also be useful as part of a hearing aid for hard of hearing users. One of the major aims of hearing aid manufacturers is to improve the audibility of speech – can noise cancellation help here?

EXERCISE: read the articles – and watch/listen to the associated videos – Noise Reduction Systems and Reverb Reduction produced by hearing aid manufacturer Sonic. What sorts of audio reality mediation are described?

It may seem strange to you to think of hearing aids as augmented, or more generally, mediated, reality devices, but their realtime processing and representation of the user’s current environment suggests this is exactly what they are!

In the next post on this theme, we will explore what sorts of physical device or apparatus can be used to mediate audio realities. But for now, let’s go back to the visual domain…

