As well as being a visual medium, film relies on a rich audio environment to communicate emotion and affect. In some cases it may not be possible to capture the sound associated with a particular action, either because of (literal) noise in the recording environment, or because the props themselves do not have the physical properties of the things they portray. For example, two wooden swords painted to look like metal for a sword fight would not sound like metal swords when they strike each other. And when a film is dubbed, with the original speech recording replaced by a post-production recording, any original sound effects also need to be replaced.
Foley artists add sounds to a film in post-production (that is, after the film has been shot). As foley artist John Roesch describes, “whatever we see on that screen, we are making the most honest representation thereof, sonically” (“Where the Sounds From the World’s Favorite Movies Are Born”, Wired, 0m42s).
One of the aims of the foley artist is to represent the sounds that the viewer expects to hear when watching a particular scene. As Roesch says of his approach, “when I look at a scene, I hear the sounds in my head” (0m48s). So can a visual analysis of the scene be used to identify material interactions and then automatically generate sounds corresponding to our expectations of what those interactions should sound like?
This question was recently asked by a group of MIT researchers (Owens, Andrew, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, and William T. Freeman. “Visually Indicated Sounds.” arXiv preprint arXiv:1512.08512 (2015)) and summarised in the MIT News article “Artificial intelligence produces realistic sounds that fool humans”.
“On many occasions, … sounds are not just statistically associated with the content of the images – the way, for example, that the sounds of unseen seagulls are associated with a view of a beach – but instead are directly caused by the physical interaction being depicted: you see what is making the sound. We call these events visually indicated sounds, and we propose the task of predicting sound from videos as a way to study physical interactions within a visual scene. To accurately predict a video’s held-out soundtrack, an algorithm has to know about the physical properties of what it is seeing and the actions that are being performed. This task implicitly requires material recognition…”
In their study, the team trained an algorithm using thousands of videos of a drumstick interacting with a wide variety of material objects, in an attempt to associate particular sounds with different materials, as well as with the mode of interaction (hitting, scraping, and so on).
The next step was to show the algorithm a silent video and see if it could generate an appropriate soundtrack, in effect acting as a synthetic foley artist (Visually-Indicated Sounds, MIT CSAIL).
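To give a flavour of what that train-then-predict pipeline involves, here is a minimal sketch in PyTorch. It is not the authors’ code: the class and function names (FrameToSound, train_step, predict_soundtrack, sound_library) and all of the dimensions are my own illustrative assumptions. Roughly speaking, though, it follows the shape of the paper’s approach: learn to regress audio features from video-frame features, then, given a silent clip, predict those features and use them to pick a plausible real recording.

import torch
import torch.nn as nn

class FrameToSound(nn.Module):
    # Toy model: a sequence of per-frame image features in, a sequence of
    # audio features (e.g. cochleagram-style bands) out. In the paper the
    # frame features come from a convolutional network; here we assume
    # they have already been computed.
    def __init__(self, frame_dim=512, hidden_dim=256, audio_dim=42):
        super().__init__()
        self.rnn = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, audio_dim)

    def forward(self, frame_feats):            # (batch, time, frame_dim)
        hidden, _ = self.rnn(frame_feats)      # (batch, time, hidden_dim)
        return self.head(hidden)               # predicted audio features

model = FrameToSound()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def train_step(frame_feats, true_audio_feats):
    # One training step: regress the predicted audio features onto the
    # features of the real soundtrack for one (video, audio) pair drawn
    # from the drumstick corpus.
    optimiser.zero_grad()
    loss = loss_fn(model(frame_feats), true_audio_feats)
    loss.backward()
    optimiser.step()
    return loss.item()

def predict_soundtrack(silent_frame_feats, sound_library):
    # At test time the model only sees (features of) a silent video.
    # sound_library is assumed to be a list of (feature_vector, waveform)
    # pairs; the predicted features are matched against it and the closest
    # recording returned, a simple stand-in for example-based synthesis.
    with torch.no_grad():
        pred = model(silent_frame_feats).mean(dim=1)    # (batch, audio_dim), averaged over time
    feats = torch.stack([f for f, _ in sound_library])  # (n, audio_dim)
    idx = torch.cdist(pred, feats).argmin(dim=1)        # nearest neighbour per clip
    return [sound_library[int(i)][1] for i in idx]

The retrieval step is the crudest simplification here: it treats each clip as a single averaged feature vector, whereas matching per time-window would let different parts of the clip borrow from different recordings.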
SAQ: to what extent do you think foley artists like John Roesch might be replaced by algorithms?
Answer: whilst the MIT demo is an interesting one, it is currently limited to a known object (the drumstick) interacting with an arbitrary object. The video showed how, even then, the algorithm occasionally misinterpreted the sort of interaction being demonstrated (for example, misjudging a hit). For a complete system, the algorithm would have to identify both materials involved in the interaction, as well as the sort of interaction, and then synthesise an appropriate sound. If the same sort of training method were used for this more general system, I think it is unlikely that a large enough corpus of training videos (material X interacting with material Y via interaction Z) could be created to provide a reliable training set. In addition, as foley artist John Roesch points out, “what you see is not necessarily what you get” (1m31s)…!
And here is a good example of the same thing done in reverse: the Dancing in the Street video dubbed with approximations of the actual sounds made by the dancers: https://www.youtube.com/watch?v=BHkhIjG0DKc