
Diminished Audio Reality – Removing a Vocal from a Musical Jingle

In the post Noise Cancellation – An Example of Mediated Audio Reality? we saw how background or intrusive environmental noise could be removed using noise cancelling headphones. In this post, you’ll learn a simple trick for diminishing an audio reality by removing a vocal track from a musical jingle.

Noise cancellation may be thought of as adding the complement of everything that is not the desired signal component to an audio feed in order to remove the unwanted noise component. This same idea can be used as the basis of a crude attempt to remove a mono vocal signal from a stereo audio track by creating our own inverse of the vocal track and then adding it to the original mix.

SAQ: Describe an algorithm corresponding to the first part of the method suggested in the How to Remove Vocals from a Song Using Audacity video for removing a vocal track from a stereo music track. How does the algorithm compare to the algorithm you described for the noise cancelling system?

SAQ: The technique described in the video relies on the track having a mono vocal signal and stereo backing track. The simple technique also lost some of the bass when the vocals were removed. How was the algorithm modified to try to preserve the bass component? How does the modification preserve the bass component? 
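As a rough illustration of the channel-inversion idea (a sketch of the general principle, not necessarily the exact steps followed in the video), the following Python snippet builds a synthetic stereo mix in which the "vocal" is identical in both channels, then inverts one channel and adds it to the other. The sample rate, tone frequencies and the function name remove_centre are all assumptions made for the example.

```python
import math

def remove_centre(left, right):
    """Invert the right channel and add it to the left, sample by sample.

    Anything identical in both channels (such as a centre-panned mono
    vocal) cancels out; the result is a single mono channel.
    """
    return [l + (-r) for l, r in zip(left, right)]

# Synthetic one-second test mix (all values are illustrative assumptions).
RATE = 8000
t = [n / RATE for n in range(RATE)]
vocal = [math.sin(2 * math.pi * 440 * x) for x in t]         # same in both channels
guitar = [0.5 * math.sin(2 * math.pi * 220 * x) for x in t]  # left channel only
keys = [0.5 * math.sin(2 * math.pi * 330 * x) for x in t]    # right channel only

left = [v + g for v, g in zip(vocal, guitar)]
right = [v + k for v, k in zip(vocal, keys)]

# What survives is the difference between the two backing parts: no vocal.
residue = remove_centre(left, right)
```

Note that anything else mixed equally into both channels cancels in exactly the same way, which is one reason a bass part that sits in the centre of the mix can disappear along with the vocal.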


Recovering Audio from Video – But Not How You Might Expect…

In The Art of Sound – Algorithmic Foley Artists?, we saw how researchers from MIT’s CSAIL Lab were able to train a system to try to recreate the sound of a silently videoed object being hit by a drumstick using a model based on video+sound recordings of lots of different sorts of objects being hit by a drumstick. In this post, we’ll see another way of recovering audio information from a purely visual capture of a visual scene, also developed at CSAIL.

Fans of Hollywood thrillers or surveillance-themed TV series may be familiar with the idea of laser microphones, in which laser light projected onto and reflected from a window can be used to track the vibrations of the window pane and record the audio of people talking behind the window.

Once the preserve of surveillance agencies, such devices can today be cobbled together in your garage using components retrieved from commodity electronics devices.

The technique used by the laser microphone is based on measuring vibrations caused by sound waves relating to the sound you want to record. Which suggests that if you can find other ways of tracking the vibrations, you should similarly be able to retrieve the audio. Which is exactly what the MIT CSAIL researchers did: by analysing video footage of objects that vibrated in sympathy (albeit minutely) to sounds in their environment, they were able to generate a recovered audio signal.

As the video shows, in the case of capturing a background musical track, whilst the audio was not necessarily the highest fidelity, by feeding the input into another application – such as Shazam, an application capable of recognising music tracks – the researchers were at least able to identify it automatically.

So not only can we create videos from still photographs, as described in Hyper-reality Offline – Creating Videos from Photos, we can also recover audio from otherwise silent videos.

Hyper-reality Offline – Creating Videos from Photos

In Mediating the Background and the Foreground – From Green Screen and Chroma-Key Effects to Virtual Sets we saw how green screen/chroma key effects could be used to mask out part of one image so that it could be composited with another. In this post, you’ll see how we can also generate animation effects from a single image.

Many of you will recognise the following effect from television documentaries, as well as screen savers or photo-stories:

Known as the Ken Burns effect, named after the documentary maker who made extensive use of the technique, it allows a moving image to be generated from a still photograph by panning and zooming across the image.
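One simple way to think about the pan and zoom is as a crop window that slides and shrinks (or grows) over the still image, one window per video frame. The sketch below is a minimal illustration under that assumption; the function names, the linear interpolation and the tiny 8×6 "image" are all invented for the example, and a real tool would also resize every crop to the output video size.

```python
def ken_burns_windows(start, end, frames):
    """Linearly interpolate a crop window (x, y, width, height) from start to end."""
    windows = []
    for i in range(frames):
        t = i / (frames - 1) if frames > 1 else 0.0
        windows.append(tuple(s + (e - s) * t for s, e in zip(start, end)))
    return windows

def crop(image, window):
    """Cut the (rounded) window out of an image stored as a list of pixel rows."""
    x, y, w, h = (int(round(v)) for v in window)
    return [row[x:x + w] for row in image[y:y + h]]

# A tiny 8x6 "photograph" whose pixel values are just their (row, column) position.
image = [[(r, c) for c in range(8)] for r in range(6)]

# Start on the full image, finish zoomed in on the bottom-right quarter.
windows = ken_burns_windows(start=(0, 0, 8, 6), end=(4, 3, 4, 3), frames=5)
frames = [crop(image, w) for w in windows]
```

Moving the window position gives the pan; changing its size gives the zoom.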

But what happens if you take a flat, static image, separate out the foreground and background elements, and then apply the effect, panning and zooming foreground and background elements differentially to create a “2.5D” parallax effect?

These views can be created from a single, flat image by cutting the foreground component out into its own layer, and then inpainting the background layer; when the foreground component moves relative to the background, the inpainted area hides the fact that that part of the original image was taken up by the foreground component.
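The layering idea can be sketched in a few lines of Python. In this toy example (the names, the single-row "image" and the depth_ratio parameter are assumptions for illustration), the foreground layer uses None for transparent pixels, the background layer is assumed to have been inpainted already, and the background is moved by only a fraction of the foreground's movement to give the parallax effect.

```python
def shift(row, dx, fill=None):
    """Shift a row of pixels right by dx (left if dx is negative), padding with fill."""
    n = len(row)
    if dx >= 0:
        return ([fill] * dx + row)[:n]
    return (row + [fill] * (-dx))[-n:]

def parallax_frame(background, foreground, pan, depth_ratio=0.5):
    """Composite one frame: the foreground moves by pan, the background by a fraction of it."""
    bg_dx = int(pan * depth_ratio)
    frame = []
    for bg_row, fg_row in zip(background, foreground):
        edge = bg_row[0] if bg_dx >= 0 else bg_row[-1]   # repeat the edge pixel as padding
        bg = shift(bg_row, bg_dx, fill=edge)
        fg = shift(fg_row, pan)                          # None marks transparent pixels
        frame.append([f if f is not None else b for f, b in zip(fg, bg)])
    return frame

# One-row example: letters are (already inpainted) background, '#' is the foreground object.
background = [list("abcdefgh")]
foreground = [[None, None, "#", "#", None, None, None, None]]

still = "".join(parallax_frame(background, foreground, pan=0)[0])
moved = "".join(parallax_frame(background, foreground, pan=2)[0])
```

In the moved frame the foreground has shifted by two pixels but the background by only one, and the background pixels that were originally hidden behind the object are now visible.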

The inpainting effect can be achieved by applying an image processing technique that works from the edge of a cropped area inwards, trying to predict what value each missing neighbouring pixel should be based on the actual set values of the surrounding pixels. More elaborate techniques allow for “content aware” fills, in which the patterns generated from the surrounding texture are used to fill in the missing area. The following video shows how to apply such a content aware effect in a popular photo-editing tool.
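A very crude version of the edge-inwards approach can be sketched as follows. This is only a toy illustration, assuming a greyscale image stored as a list of rows with None marking the missing pixels; each pass fills the missing pixels that touch at least one known pixel with the average of their known neighbours, so the hole shrinks one layer at a time. Production tools use far more sophisticated methods.

```python
def inpaint(image):
    """Fill None pixels layer by layer, working from the edge of the hole inwards."""
    img = [row[:] for row in image]
    height, width = len(img), len(img[0])
    while any(p is None for row in img for p in row):
        updates = {}
        for y in range(height):
            for x in range(width):
                if img[y][x] is not None:
                    continue
                # Collect the known values among the four direct neighbours.
                known = [img[ny][nx]
                         for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                         if 0 <= ny < height and 0 <= nx < width
                         and img[ny][nx] is not None]
                if known:
                    updates[(y, x)] = sum(known) / len(known)
        if not updates:
            break  # nothing known anywhere, so nothing can be predicted
        # Apply the whole layer at once so a pass only uses previously known values.
        for (y, x), value in updates.items():
            img[y][x] = value
    return img

# A flat grey 5x5 image with a 3x3 hole cut out of the middle.
holed = [[10 if y in (0, 4) or x in (0, 4) else None for x in range(5)]
         for y in range(5)]
filled = inpaint(holed)

# A single row with a gap between a dark pixel and a light pixel.
gradient = inpaint([[0, None, 20]])
```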

An extension of the technique – content aware crop – automatically inpaints whitespace around the edge of an image when changing the aspect ratio of an image, such as following a straightening of the horizon.

Developing algorithms for improved content aware fills is an active area of academic, as well as commercial, research (e.g. Pathak, Deepak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. “Context Encoders: Feature Learning by Inpainting.” arXiv preprint arXiv:1604.07379 (2016)).

Related techniques can be used to improve the quality of images, as demonstrated by the Magic Pony Technology company (MIT Technology Review – Artificial Intelligence Can Now Design Realistic Video and Game Imagery) or by deep learning neural networks. For example, a project by David Garcia shows how deep learning can “upscale 16×16 images by a 4x factor. The resulting 64×64 images display sharp features that are plausible based on the dataset that was used to train the neural net.” Here’s an example of what the networks can do (“the first column is the 16×16 input image, the second one is what you would get from a standard bicubic interpolation, the third is the output generated by the neural net, and on the right is the ground truth”):

srez_sample_output.png

Additional 2.5D effects can be created by animating both the foreground and background elements. Alternatively, by associating a mesh with particular points in a photo, translating those points appropriately results in the animation of the meshed element.

These effects are all based on the manipulation of pixels within a static image. But as you’ll see in another post, flat images can also be used as the basis for generating three dimensional models.

Tuning the colour palette of an image is another technique that can be used to make it feel hyper-real, or somehow sharper than the captured reality. Similar techniques can also be applied to video to create a stylised hyper-real video effect.

As you are perhaps beginning to realise, many mediated reality effects rely on a whole stack of technologies, techniques or other effects being available first. But this in turn means that many of today’s yet-to-be invented techniques are likely to be built from a novel combination of techniques that already exist, or that can be built on; and once those new techniques are identified, and tools built to implement them efficiently, they in turn will provide the basis for yet more techniques.

Mediating the Background and the Foreground – From Green Screen and Chroma-Key Effects to Virtual Sets

It may be hard to remember now, but the first digital cameras only started to appear on the shelves in 1990, to be replaced for many just a decade later by camera-replacing smartphones. Prior to that, cameras were film based, or produced “self-developing” Polaroid photos, printed by the camera itself. Many film based cameras required the film to be manually “wound on” between taking one photograph and the next. Failure to do this could result in a particular piece of the film being double exposed, with the result that two photographs could be superimposed. Such tricks were well known to photographers and film makers alike, and multiple exposure techniques, along with other tricks of the photographer’s trade, were widely used for creating otherwise impossible to record scenes.

In Behind the Scenes of Sports Broadcasting – Virtual Sets, Virtual Signage and Virtual Advertising, we saw how sports broadcasters could make use of virtual sets to enhance outside, on-location settings. Virtual sets are increasingly used by broadcasters for a wide range of other live television formats, such as news and politics, with digital objects often appearing in front of the presenters. This contrasts with a traditional green screen effect where the background behind the presenter is replaced. In film studios, virtual “backlots” may make extensive use of green screens to replace the need for unwieldy physical sets with digital ones that are rendered in post production.

So how do green screen, or “chroma key” effects work?

Green Screen Effects

Green screen style effects have a long history in film and television and were available long before digital green screen effects became available. The effect relies on producing a matte, or travelling matte, from an image that allows elements from two separate images to be combined in a single image, a process referred to as compositing. By making part of one image transparent, it can be layered on top of another background image.
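The matte-then-composite idea can be sketched digitally as follows. This is a minimal illustration only: the green-dominance test, its threshold and the tiny images are assumptions made for the example, and real keyers handle soft edges, colour spill and lighting variation far more carefully.

```python
def is_key_colour(pixel, dominance=1.3):
    """Treat a pixel as 'green screen' if green clearly dominates red and blue."""
    r, g, b = pixel
    return g > dominance * max(r, b, 1)

def make_matte(foreground):
    """Matte: True where the foreground pixel is kept, False where it is transparent."""
    return [[not is_key_colour(p) for p in row] for row in foreground]

def composite(foreground, background):
    """Layer the foreground over the background wherever the matte says 'keep'."""
    matte = make_matte(foreground)
    return [[f if keep else b for f, b, keep in zip(f_row, b_row, m_row)]
            for f_row, b_row, m_row in zip(foreground, background, matte)]

# Illustrative RGB colours (assumed values).
GREEN = (0, 255, 0)       # the green screen
SKIN = (220, 180, 150)    # the presenter
SKY = (90, 140, 230)      # the replacement background

studio = [[GREEN, SKIN, GREEN],
          [GREEN, SKIN, SKIN]]
backdrop = [[SKY, SKY, SKY],
            [SKY, SKY, SKY]]

result = composite(studio, backdrop)
```

This also shows why presenters avoid wearing the key colour: any pixel that passes the test becomes transparent, whether it belongs to the screen or not.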

So what happened next?

(The earlier parts of the above video also retrace the history of the green screen effect.)

Exercise: for a simple demonstration of how green screen compositing works, the post Simple Demo of Green Screen Principle in a Jupyter Notebook Using MyBinder links to an interactive activity using the python programming language (no coding/programming skills required!) showing how to add a background apparently behind a greenscreened television newscaster.

Even without digital technologies and the introduction of virtual digital objects, compositing multiple takes of a few human actors can be used to generate a visual scene that appears to include a cast of thousands:

The following showreel provides some examples of how the green screen effect has been put to use as a virtual backlot for movies:

Exercise: see if you can find behind the scenes footage of the visual effects – VFX – used to create one or two of your recent favourite films.

One problem with the chroma key approach is that much of the magic is done in post-production, and not in real time. But as you will know from watching TV weather reports, chroma-key effects can be used for real time mediation of video imagery. And increasingly, green screen techniques can be used to produce a virtual studio or virtual set in real time, with no post production required:

SAQ: What are the similarities and differences between virtual studio or virtual set and chroma key techniques?

The virtual set itself is a 3D digital model that is rendered around the human presenter(s).

An important part of the system is the ability to track the location of the camera, as well as physical objects within the set.

Replacing part of the visual scene using a chroma-key effect is a tried and trusted technique and, as work on virtual sets shows, can be used to support real-time mediated reality effects. Tracking objects within the set allows digital objects to be overlaid on those tracked physical objects. But object tracking in the form of motion capture, and the even more refined performance capture, can be used as the basis for far more elaborate visual effects, as we’ll see in another post.

But first, let’s step aside for a moment, and see how the notion of image layers can be used to transform a single photograph into a short video…

Behind the Scenes of Sports Broadcasting – Virtual Sets, Virtual Signage and Virtual Advertising

In the post Augmented TV Sports Coverage & Live TV Graphics, we saw how live TV graphics could be used to overlay sports events in order to highlight particular elements of the sports action.

One of the things you may have noticed in some of the broadcasts was that as well as live “telestrator” style effects, such as highlighting the trajectory of a ball, or participant tracking effects, many of the scenes also included on-pitch advertising. So was the pitch really painted with large adverts, or were they digital effects? The following showreel from Namadgi Systems (which in its full form demonstrates many of the effects shown in the previously mentioned post) suggests that the on-pitch adverts are, in fact, digital creations. Other vendors of similar services include Broadcast Virtual and BrandMagic.

So-called virtual advertising allows digitally rendered adverts to be embedded into live broadcast feeds in a way that makes the adverts appear as if they are situated on or near the field of play. As such, to the viewer of the broadcast, it may appear as if the advert would be visible to the spectators present at the event. In fact, it may be the case that the insert is an entirely digital creation, an overlay on top of some sort of distinguished marker or location (determined relative to an easily detected pitch boundary, for example), or a replacement of a static, easily recognised and masked local advert.

EXERCISE: Watch the following video and see how many different forms of virtual advertising you can detect.

So how many different ways of delivering mediated reality ads did you find?

The following marketing video from Supponor advertises their “digital billboard replacement” (DBRLive) product that is capable of identifying and tracking trackside or pitchside advertising hoardings and replacing them with custom adverts.

EXERCISE: what do you think are the advantages of using digital signage over fixed advertising billboards? What further advantages do “replacement” techniques such as DBRLive have over traditional digital signage? To what extent do you think DBRLive is a mediated reality application?

As well as transforming the perimeter, and even the playing area, with digital adverts, sports broadcasters often present a mediated view of the studio set inhabited by the host and selected pundits to provide continuity during breaks in the action, as the following corporate video from vizrt describes:

So how do virtual sets work and how do they compare with the “chroma key” effects used in TV and film production since the 1940s? We’ll need another post for that…

From Sports Tracking to Surveillance Tracking…

In the post Augmented TV Sports Coverage & Live TV Graphics, we saw how sports broadcasters increasingly make use of effects that highlight tracked elements in a sporting event, from the players in a football match to the ball they are playing with. So how else might we apply such tracking technologies?

According to Melvin Kranzberg’s first law of technology, “Technology is neither good nor bad; nor is it neutral”. In the sports context, we may be happy to think that cameras can be used to track – and annotate – each player’s every move. But what if we take such technological capabilities and apply them elsewhere?

EXERCISE: As well as being used to support referees making decisions about boundary line events, such as whether a tennis ball landed “in” or “out”, or whether a football crossed the goal line, how might virtual boundaries be used as part of a video surveillance system? To what extent could image tracking systems also be used as part of a video surveillance system?

One way of using virtual boundaries as part of a video based surveillance system might be to use them as virtual trip wires, where breaches of a virtual boundary or fence can be used to flag a warning about a possible physical security breach and perhaps start a detailed recording of the scene.
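A virtual tripwire of this kind can be reduced to a simple geometric test: did the line joining an object's previous and current tracked positions cross the tripwire segment? The sketch below assumes the tracker already supplies one (x, y) image position per frame; the function names and coordinates are invented for illustration, and only strict crossings count (merely touching the wire is ignored).

```python
def side(a, b, p):
    """Which side of the line a->b the point p lies on: 1, -1, or 0 if on the line."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return (cross > 0) - (cross < 0)

def crossed(wire_a, wire_b, prev, curr):
    """True if moving from prev to curr strictly crosses the wire segment."""
    return (side(wire_a, wire_b, prev) * side(wire_a, wire_b, curr) < 0
            and side(prev, curr, wire_a) * side(prev, curr, wire_b) < 0)

def tripwire_alerts(track, wire_a, wire_b):
    """Return the frame numbers at which the tracked object crossed the wire."""
    return [i for i in range(1, len(track))
            if crossed(wire_a, wire_b, track[i - 1], track[i])]

# A vertical tripwire and an object walking left to right across the scene.
wire = ((4, 0), (4, 10))
track = [(1, 5), (3, 5), (5, 5), (7, 5)]
alerts = tripwire_alerts(track, *wire)

# The same walk against a shorter wire that ends before the object's path.
short_wire = ((4, 0), (4, 3))
no_alerts = tripwire_alerts(track, *short_wire)
```

An alert frame could then be used to flag a warning or trigger a more detailed recording of the scene.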

ASIDE: The notion of virtual tripwires extends into other domains too. For example, for objects tracked using GPS, “geo-fences” can be defined that raise an alert when a tracked object enters, or leaves, a particular geographic area. The AIS ship identification system used to uniquely identify ships – and their locations – can be used as part of a geofenced application to raise an alert whenever a particular boat, such as a ferry, enters or leaves a port.
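As a rough sketch of how a geo-fence alert might work, the following Python code models the fence as a circle around a centre point and reports when a sequence of GPS fixes moves into or out of it. The great-circle (haversine) distance is a standard formula; the centre coordinates, radius and track are made-up values rather than a real port, and real geo-fences are often arbitrary polygons rather than circles.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, adequate for a sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def geofence_events(track, centre, radius_km):
    """Return (index, 'enter' or 'leave') each time the track crosses the fence."""
    events = []
    inside = None
    for i, (lat, lon) in enumerate(track):
        now_inside = haversine_km(lat, lon, *centre) <= radius_km
        if inside is not None and now_inside != inside:
            events.append((i, "enter" if now_inside else "leave"))
        inside = now_inside
    return events

# Hypothetical port location and a ferry track approaching from the north, then leaving.
port = (50.0, -1.0)
ferry_track = [(50.10, -1.0), (50.03, -1.0), (50.01, -1.0), (50.00, -1.0), (50.05, -1.0)]
events = geofence_events(ferry_track, port, radius_km=2.0)
```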

Video surveillance might also be used to track individuals through a videoed scene. For example, if a person of interest has been detected in a particular piece of footage, they might be automatically tracked through that scene. If multiple cameras cover the same area, persons of interest may be tracked across multiple video feeds, as described by Khan, Sohaib, Omar Javed, Zeeshan Rasheed, and Mubarak Shah. “Human tracking in multiple cameras.” In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 1, pp. 331-336. IEEE, 2001.

Where the environment is rather more constrained, such as an office block, tools such as the FXPAL DOTS Video Surveillance System allow for individuals to be tracked throughout the building. Optional filters also allow tracking or identification based on the colour of clothing, which may be meaningful in an environment where different colour uniforms or protective clothing are used to identify people by role – and perhaps by different access permission levels.

Once a hard computer science problem to solve, a wide variety of programming libraries and tools now support object identification and tracking. There are even JavaScript libraries available, such as tracking.js, that are capable of tracking objects and faces streamed from a laptop camera using code that runs just in your browser.

Tracking is one thing – but identification of tracked entities is another. In some situations, however, tracked entities may carry clearly seen identifiers – such as car number plates. Automatic Number Plate Recognition (ANPR) is now a mature technology and is widely deployed against moving, as well as stationary, vehicles.

With technology firmly in place for tracking objects, and perhaps even identifying them, analysts are now turning their attention to systems that are capable of automatically identifying different events, or behaviours, within a visual scene, a step up from the simple “threshold crossing” behaviours used to implement virtual tripwires.

Once behaviours have been automatically identified, the visual scene may be overlaid with a statement of, or interpretation of, those behaviours.

Many technologies are developed for a particular purpose, but that does not prevent them being adopted for other purposes. When new technologies emerge, there are often many opportunities for businesses and entrepreneurs to find ways of using those technologies either on their own or in combination with other technologies. However, there are also risks, not least that the technology is used for a harmful purpose, or for one that we do not approve of. More difficult is to try to predict what the consequences of using such technologies widely may be. As technologists, it’s our job to try to think critically about how emerging technologies may be used, whether for good, or evil, and contribute to debates about whether we want to approve the use of such technologies, or limit them in some way.

Augmented TV Sports Coverage & Live TV Graphics

In the post From Magic Lenses to Magic Mirrors and Back Again we saw how magic lenses allow users to look through a screen at a mediated view of the scene in front of them, and magic mirrors allow users to look at a mediated view of themselves. In this post, we will look at how a scene might be captured and then mediated in some way before being presented to a remote viewer in near-real-time. In particular, we will consider how live televised sporting events may be augmented to enhance the viewer’s understanding or appreciation of the event.

Ever since the early days of television, TV graphics have been used to overlay information – often in the “lower third” of the screen – to provide a mediated view of the scene being displayed. For example, one of the most commonly seen lower third effects is to display a banner giving the name and affiliation of a “talking head”, such as a politician being interviewed in a news programme.

But in recent years, realtime annotation of elements within the visual scene has become possible, providing the producers of sports television in particular with a very rich and powerful way of enhancing the way that a particular event is covered with live TV graphics.

EXERCISE: from your own experience, try to recall two or three examples of how “augmented reality” style effects can be used to enhance televised sporting events in a real-time or near-realtime way.

Educators often use questions to focus the attention of the learner onto a particular matter. For example, an educator reading an academic paper may identify things of interest (to them) that they want the learner to pick up on. The educator then needs to find a way of directing the attention of the learner to those points of interest. This is often what motivates the questions they set around a resource (its purpose is to help the students learn how to focus their attention on a resource and immediately reflect back why something in the paper might be interesting – by casting a question to which the item in the paper is the answer). When addressing a question, the learner also needs to appreciate that they are expected to answer the question in an academic way. More generally, when you read something, read it with a set of questions in mind that may have been raised by reading the abstract. You can also annotate the reading with questions which that part of the reading answers. Another trick is to spot when part of the reading answers a question or addresses a topic you didn’t fully understand: “Ah, so that means if this, then that…”. This is a simple trick, but a really powerful one nonetheless, and can help you develop your own self-learning skills.

EXERCISE: Read through the following abstract taken from a BBC R&D department white paper written in 2012 (Sports TV Applications of Computer Vision, originally published in ‘Visual Analysis of Humans: Looking at People’, Moeslund, T. B.; Hilton, A.; Krüger, V.; Sigal, L. (Eds.), Springer 2011):

This chapter focuses on applications of Computer Vision that help the sports broadcaster illustrate, analyse and explain sporting events, by the generation of images and graphics that can be incorporated in the broadcast, providing visual support to the commentators and pundits. After a discussion of simple graphics overlay on static images, systems are described that rely on calibrated cameras to insert graphics or to overlay content from other images. Approaches are then discussed that use computer vision to provide more advanced effects, for tasks such as segmenting people from the background, and inferring the 3D position of people and balls. As camera calibration is a key component for all but the simplest applications, an approach to real-time calibration of broadcast cameras is then presented. The chapter concludes with a discussion of some current challenges.

How might the techniques described be relevant to / relate to AR?

Now read through the rest of the paper, and try to answer the following questions as you do so:

  • what is a “free viewpoint”?
  • what is a “telestrator” – to what extent might you claim this is an example of AR?
  • what approaches were taken to providing “Graphics overlay on a calibrated camera image”? How does this compare with AR techniques? Is this AR?
  • what is Foxtrax and how does it work?
  • what effects are possible once you “segment people or other moving objects from the background”? What practical difficulties must be overcome when creating such an effect?
  • how might prior knowledge help when constructing tracking systems? What additional difficulties arise when tracking people?
  • how can environmental features/signals be used to help calibrate camera settings? what does it even mean to calibrate a camera?
  • what difficulties are associated with segmentation, identification and tracking?

The white paper also identifies the following challenges to “successfully applying computer vision techniques to applications in TV sports coverage”:

  • The environment in which the system is to be used is generally out of the control of the system developer, including aspects such as lighting, appearance of the background, clothing of the players, and the size and location of the area of interest. For many applications, it is either essential or highly desirable to use video feeds from existing broadcast cameras, meaning that the location and motion of the cameras is also outside the control of the system designer.
  • The system needs to fit in with existing production workflows, often needing to be used live or with a short turn-around time, or being able to be applied to a recording from a single camera.
  • The system must also give good value-for-money or offer new things compared to other ways of enhancing sports coverage. There are many approaches that may be less technically interesting than applying computer vision techniques, but nevertheless give significant added value, such as miniature cameras or microphones placed in a cricket stump, a ‘flying’ camera suspended on wires above a football pitch, or high frame-rate cameras for super-slow-motion.

To what extent do you think those sorts of issues apply more generally to augmented and mediated reality systems?

In the rest of this post, you will see some examples of how computer vision driven television graphics have been used in recent years. As you watch the videos, try to relate the techniques demonstrated to the issues raised in the white paper.

From 2004 to 2010, the BBC R&D department, in association with Red Bee Media, worked on a system known as Piero, now owned by Ericsson, that explored a wide range of augmentation techniques. Watch the following videos and see how many different sorts of “augmentation” effect you can identify. In each case, what sorts of enabling technology do you think are required in order to put together a system capable of generating such an effect?

In the US, SportVision provide a range of real-time enhancements for televised sports coverage. The following video demonstrates car and player tracking in motor-racing and football respectively, ball tracking in baseball and football (soccer), and a range of other “event” related enhancements, such as offside lines or player highlighting in football (soccer).

EXERCISE: watch the SportVision 2012 showreel on the SportVision website. How many different augmented reality style effects did you see demonstrated in the showreel?

For further examples, see the case studies published by vizrt.

Watching the videos, there are several examples of how items tracked in realtime can be visualised, either to highlight a particular object or feature (such as tracking a player, highlighting the position of a ball, puck, or car), or trace out the trajectory followed by the object (for example, highlighting in realtime the path followed by a ball).

Having seen some examples of the techniques in action, and perhaps started to ask yourself “how did they do that?”, skim back over the BBC white paper to see if any of the sections jump out at you in answer to your self-posed questions.

In the UK, Hawk-Eye Innovations is one of the most well known providers of such services to UK TV sports viewers.

The following video describes in a little more detail how the Hawk-Eye system can be used to enhance snooker coverage.

And how Hawk-Eye is used in tennis:

In much the same way as sportsmen compete on the field of play, so too do rival technology companies. In the 2010 Ashes series, Hawk-Eye founder Paul Hawkins suggested that a system provided by rivals VirtualEye could lead to inaccurate adjudications due to human operator error compared to the (at the time) more completely automated Hawk-Eye system (The Ashes 2010: Hawk-Eye founder claims rival system is not being so eagle-eyed).

The following video demonstrates how the Virtual Eye ball tracking software worked to highlight the path of a cricket ball as it is being bowled:

EXERCISE: what are the benefits to sports producers from using augmented reality style, realtime television graphics as part of their production?

The following video demonstrates how the SportVision Liveline effect can be used to help illustrate what’s actually happening in an America’s Cup yacht race, which can often be hard to follow for the casual viewer:

EXERCISE: To what extent might such effects be possible in a magic lens style application that could be used by a spectator actually witnessing a live sporting event?

EXERCISE: review some of the video graphics effects projects undertaken in recent years by the BBC R&D department. To what extent do the projects require: a) the modeling of the world with a virtual representation of it; b) the tracking of objects within the visual scene; c) the compositing of multiple video elements, or the introduction of digital objects within the visual scene?

As a quick review of the BBC R&D projects in this area suggests, the development of on-screen graphics that can track objects in real time may be complemented by the development of 3D models of the televised view so that it can be inspected from virtual camera positions that provide a view of the scene that is reconstructed from a model built up from the real camera positions.

Once again, though, there may be a blurring of reality: is the view actually taken from a virtual camera, or from a real one, such as a Spidercam?

As well as overlaying actual footage with digital effects, sports producers are also starting to introduce virtual digital objects into the studio to provide an augmented reality style view of the studio to the viewer at home.

3D graphics are increasingly being used in TV studios to dress other elements of the set. In addition, graphics are also being used to enhance TV sports through the use of virtual advertising. Both these approaches will be discussed in another post.

More generally, digital visual effects are used widely across film and television, as we shall also explore in a later post…

