Digital Worlds – The Blogged Uncourse

Originally published as Digital Worlds – Interactive Media and Game Design, a free learning resource on computer game design, development and culture, this blog was authored as part of an experimental approach to the production of online distance learning materials. Many of the resources presented in its first incarnation also found their way into a for-credit, formal education course from the UK’s Open University.

The blog was rebooted at the start of summer 2016 to act as a repository for short pieces relating to mixed and augmented reality, and related areas of media/reality distortion, as preparation for a unit on the subject in a forthcoming first level Open University course. Since then, it has morphed into a space where I can collect stories and examples of how representations of the physical world can be digitally captured, and audio, images and video media can in turn be manipulated in order to produce distorted re-presentations of the world that are perhaps indistinguishable from it…

Interlude – Enter the Land of Drawings…

One of the classic British children’s TV programmes from the 1970s was Simon in the Land of Chalk Drawings, a “meta-animation” in which the lead character, Simon, is able to enter the (animated) land of chalk drawings through his magic chalkboard.

On one reading, we can view the land of chalk drawings as a virtual reality experienced by Simon; on another, we can imagine the chalk board as a forerunner of an augmented reality colouring book.

“Drawn” and “real” worlds have also combined in other culturally significant creations, such as in the well-known Take on Me music video by Norwegian 80s pop group a-ha.

ACTIVITY: what other TV programmes or videos do you remember from the past that either hinted at, or might provide inspiration for, augmented or mixed reality effects and applications?

At the time, the Take on Me video was a masterpiece of video compositing. But as photo- and video-manipulation tools develop, and as augmented reality toolkits become ever more available, the ability to produce similarly styled videos may become commonplace.

For example, creating “pencil drawn” images from photos can be easily achieved using a range of filters in applications such as Adobe Photoshop:
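
To give a feel for how simple the underlying recipe can be, here is a minimal sketch of a pencil-sketch filter using the OpenCV library in Python. The recipe (greyscale, invert, blur, colour dodge) is one common approach and an assumption on my part; it is not the algorithm behind any particular Photoshop filter, and the input file name is a placeholder.

```python
# A minimal "pencil drawn" filter: greyscale, invert, blur, colour dodge.
import cv2

img = cv2.imread("photo.jpg")                      # hypothetical input image
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # drop colour information
inverted = 255 - grey                              # negative of the greyscale image
blurred = cv2.GaussianBlur(inverted, (21, 21), 0)  # soften the negative
# "Colour dodge" blend: dividing by the inverted blur brightens everything
# except the strong edges, which read as pencil strokes.
sketch = cv2.divide(grey, 255 - blurred, scale=256)
cv2.imwrite("photo_sketch.jpg", sketch)
```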

And the Pencil Sketch tool in Adobe After Effects will apply a similar effect to videos.

The Pencil Sketch effect is applied as the result of processing the video image directly. But a similar sort of end effect can also be created by applying a texture transformation to a motion-captured model.

By manipulating a model, rather than a video frame, we are no longer tied to purely re-presenting the captured video image. Instead, the capture can be manipulated and the performance transformed away from the original motions, as well as the original textures.

Photoshopping Audio…

By now, we’re all familiar with the idea that images can be manipulated – “photoshopped” – to modify a depicted scene in some way (for example, Even if the Camera Never Lies, the Retouched Photo Might…). Vocal representations can be modified using audio processing techniques such as pitch shifting, and traditional audio editing techniques such as cutting and splicing can be used to edit audio files and create “spoken” sentences that have never been uttered before by reordering separately cut words.
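
As a simple illustration of the first of those techniques, the following sketch pitch-shifts a recording using the librosa library; the input file name is a placeholder and the amount of shift is arbitrary.

```python
# Pitch shifting a voice recording without changing the speaking rate.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)        # hypothetical recording
# Raise the pitch by four semitones; the timing of the words is unchanged.
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
sf.write("speech_higher.wav", shifted, sr)
```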

But what if we could identify both the words spoken by an actor, and model their voice, so that we could edit out their mistakes, or literally put our own words in their mouths, by changing a written text that is then used to generate the soundtrack?

Adobe demonstrated a new technique for editing audio in late 2016 that does exactly that. An audio track is used to generate both a speech-generating model and a text transcript of the track. This allows the transcript to be edited, not just by rearranging the order of the originally spoken words, but also by inserting new words.

Not surprisingly, the technique raises concerns about the “evidential” quality of recorded speech.

EXERCISE: Read the contemporaneous report of the Adobe VoCo demonstration from the BBC News website “Adobe Voco ‘Photoshop-for-voice’ causes concern“. What concerns are raised in the report? What other concerns, if any, do you think this sort of technology raises?

The technique was reported in more detail in a SIGGRAPH 2017 paper:

The paper – Zeyu Jin, Gautham J. Mysore, Stephen DiVerdi, Jingwan Lu, and Adam Finkelstein, “VoCo: Text-based Insertion and Replacement in Audio Narration”, ACM Transactions on Graphics 36(4): 96, 13 pages, July 2017 – describes the technique as follows:

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state of the art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text to speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editors own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.
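
To make the transcript-driven cut, copy and paste part of that description concrete, here is a minimal sketch of how word-level timings (for example, from a forced aligner) let edits made in the text be replayed on the waveform. The timings, words and file names are invented for illustration; this is not Adobe’s VoCo code, and it does not attempt the voice-conversion step that synthesises genuinely new words.

```python
# Replay transcript edits on a waveform using word-level timings.
import numpy as np
import soundfile as sf

audio, sr = sf.read("narration.wav")   # hypothetical narration recording

# Hypothetical alignment of each spoken word to (start, end) times in seconds.
alignment = {
    "the":   (0.00, 0.20),
    "quick": (0.20, 0.55),
    "brown": (0.55, 0.90),
    "fox":   (0.90, 1.30),
}

def clip(word):
    start, end = alignment[word]
    return audio[int(start * sr):int(end * sr)]

# "Editing in the transcript": type a new word order, rebuild the waveform.
edited_transcript = ["brown", "fox", "the", "quick"]
edited_audio = np.concatenate([clip(w) for w in edited_transcript])
sf.write("narration_edited.wav", edited_audio, sr)
```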

Voice Capture and Modelling

A key part of the Adobe VoCo approach is the creation of a voice model that can be used to generate utterances that sound like the spoken words of the person whose voice has been modelled, a technique we might think of in terms of “voice capture and modelling”. As the algorithms improve, the technique is likely to become more widely available, as suggested by other companies developing demonstrations in this area.

For example, start-up company Lyrebird have already demonstrated a service that will model a human voice from one minute’s worth of voice capture, and allow you to create arbitrary utterances from text, spoken in that voice.

Read more about Lyrebird in the Scientific American article New AI Tech Can Mimic Any Voice by Bahar Gholipour.

Lip Synching Video – When Did You Say That?

The ability to use captured voice models to generate narrated tracks works fine for radio, but what if you wanted to actually see the actor “speak” those words? By generating a facial model of a speaker, it is possible to use a video representation of an individual as a puppet whose facial movements can be acted by someone else, a technique described as facial re-enactment (Thies, Justus, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. “Real-time expression transfer for facial reenactment“, ACM Trans. Graph. 34, no. 6 (2015): 183-1).

Facial re-enactment involves morphing features or areas from one face onto corresponding elements of another, and then driving a view of the second face from motion capture of the first.

But what if we could generate a model of the face that allowed facial gestures, such as lip movements, to be captured at the same time as an audio track, and then use the audio (and lip capture) from one recording to “lipsync” the same actor speaking those same words in another setting?

The technique, described in Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman, “Synthesizing Obama: learning lip sync from audio,” ACM Transactions on Graphics (TOG) 36.4 (2017): 95, works as follows: audio and sparse mouth shape features from one video are associated using a neural network. The sparse mouth shape is then used to synthesize a texture for the mouth and lower region of the face that can be blended onto a second, stock video of the same person, with the jaw shapes aligned.
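
As a toy illustration of the first stage of that pipeline – associating audio features with a sparse mouth shape – here is a minimal sketch using a small recurrent network in PyTorch. The feature choice (MFCCs), architecture and dimensions are my own assumptions for illustration, not the network described in the paper.

```python
# Toy audio-to-mouth-shape regressor: MFCC frames in, 2D mouth landmarks out.
import torch
import torch.nn as nn

class AudioToMouth(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_landmarks=18):
        super().__init__()
        self.rnn = nn.LSTM(n_mfcc, hidden, batch_first=True)
        # Each mouth shape is a set of 2D landmark coordinates.
        self.head = nn.Linear(hidden, n_landmarks * 2)

    def forward(self, mfcc_frames):            # (batch, time, n_mfcc)
        out, _ = self.rnn(mfcc_frames)
        return self.head(out)                  # (batch, time, n_landmarks * 2)

model = AudioToMouth()
dummy_audio_features = torch.randn(1, 100, 13)   # 100 frames of fake MFCCs
mouth_track = model(dummy_audio_features)
print(mouth_track.shape)                          # torch.Size([1, 100, 36])
```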

For now, the approach is limited to transposing the spoken words from one video recording of a person to a second video of the same person. As one of the researchers, Steven Seitz, is quoted in Lip-syncing Obama: New tools turn audio clips into realistic video, “[y]ou can’t just take anyone’s voice and turn it into an Obama video. We very consciously decided against going down the path of putting other people’s words into someone’s mouth. We’re simply taking real words that someone spoke and turning them into realistic video of that individual.”

Augmented Reality and Autonomous Vehicles – Enabled by the Same Technologies?

In Introducing Augmented Reality Apparatus – From Victorian Stage Effects to Head-Up Displays, we saw how the Pepper’s Ghost effect could be used to display information in a car using a head-up display projected onto a car windscreen as a driver aid. In this post, we’ll explore the extent to which digital models of the world that may be used to support augmented reality effects may also be used to support other forms of behaviour…

Constructing a 3D model of an object in the world can be achieved by measuring the object directly, or, as we have seen, by measuring the distance to different points on the object from a scanning device and then using these points to construct a model of the surface corresponding to the size and shape of the object. According to IEEE Spectrum’s report describing A Ride In Ford’s Self-Driving Car, “Ford’s little fleet of robocars … stuck to streets mapped to within two centimeters, a bit less than an inch. The car compared that map against real-time data collected from the lidar, the color camera behind the windshield, other cameras pointing to either side, and several radar sets—short range and long—stashed beneath the plastic skin. There are even ultrasound sensors, to help in parking and other up-close work.”

Whilst the domain of autonomous vehicles may seem to be somewhat distinct from the world of facial capture on the one hand, and augmented reality on the other, autonomous vehicles rely on having a model of the world around them. One of the techniques currently used to measure the distances to objects surrounding an autonomous vehicle is LIDAR, in which pulses of laser light are used to accurately measure the distance to nearby objects. But recognising visual imagery also has an important part to play in the control of autonomous and “AI-enhanced” vehicles.
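
As a rough illustration of the geometry involved, the following sketch turns a synthetic 2D lidar sweep into points the vehicle could map against: each return is a range computed from the time of flight of a laser pulse, converted to coordinates using the beam angle. All of the values are made up.

```python
# From time-of-flight readings to a 2D point cloud.
import numpy as np

C = 299_792_458.0                       # speed of light, m/s

def range_from_time_of_flight(t_seconds):
    # The pulse travels out and back, so halve the round-trip distance.
    return C * t_seconds / 2.0

# A fake sweep: one time-of-flight reading per degree over 90 degrees.
times = np.random.uniform(1e-8, 4e-7, size=90)      # roughly 1.5 m to 60 m
angles = np.deg2rad(np.arange(90))
ranges = range_from_time_of_flight(times)

# Polar (range, bearing) to Cartesian points the vehicle can map against.
points = np.stack([ranges * np.cos(angles), ranges * np.sin(angles)], axis=1)
print(points.shape)   # (90, 2)
```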

For example, consider the case of automatic lane detection:

Here, an optical view of the world is used as the basis for detecting lanes on a motorway. The video also shows how other vehicles in the scene can be detected and tracked, along with the range to them.
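
A minimal, classical version of lane detection can be sketched with OpenCV’s edge and line detectors, as below. The region-of-interest coordinates and thresholds are arbitrary assumptions, the input frame is hypothetical, and production systems are considerably more sophisticated than this.

```python
# Classical lane detection on a single frame: edges, road-shaped mask, lines.
import cv2
import numpy as np

frame = cv2.imread("frame.jpg")                      # hypothetical dash-cam frame
grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(grey, 50, 150)                     # find strong edges

# Keep only a trapezoidal region roughly covering the road ahead.
h, w = edges.shape
mask = np.zeros_like(edges)
roi = np.array([[(0, h), (w // 2 - 50, h // 2), (w // 2 + 50, h // 2), (w, h)]],
               dtype=np.int32)
cv2.fillPoly(mask, roi, 255)
edges = cv2.bitwise_and(edges, mask)

# Fit straight line segments to the surviving edge pixels and draw them.
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                        minLineLength=40, maxLineGap=100)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 3)
cv2.imwrite("frame_lanes.jpg", frame)
```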

A more recent video from Ford shows the model of the world perceived by the range of sensors on one of their autonomous vehicles.

Part of the challenge of proving autonomous vehicle technologies to regulators, as well as to development engineers, is the ability to demonstrate what the vehicle thinks it can see and what it might do next. To this end, augmented reality displays may be useful for presenting, in real time, a view of a vehicle’s situational awareness of the environment it currently finds itself in.

DO: See if you can find some further examples of the technologies used to demonstrate the operation of self-driving and autonomous vehicles. To what extent do these look like augmented reality views of the world? What sorts of digital models do the autonomous vehicles create? To what extent could such models be used to support augmented reality effects, and what effects might they be?

If, indeed, there is crossover between the technology stacks that underpin autonomous vehicles and augmented reality applications, computational devices developed to support autonomous vehicle operation may also be useful to augmented and mixed reality developers.

DO: read through the description of the NVIDIA DRIVE PX 2 system and software development kit. To what extent do the tools and capabilities described sound as if they may be useful as part of an augmented or mixed reality technology stack? See if you can find examples of augmented or mixed reality developers using such toolkits originally developed or marketed for autonomous vehicle use and share them in comments below.

Using Cameras to Capture Objects as Well as Images

In The Photorealistic Effect… we saw how textures from photos could be overlaid onto 3D digital models, and how digital models could be animated by human puppeteers: motion capture is used to track the movement of articulation points on the human actor, and this information is then used to actuate similarly located points on the digital character mesh. In 3D Models from Photos, we saw how textured 3D models could be “extruded” from a single photograph by associating points on them with a mesh and then deforming the mesh in 3D space. In this post, we’ll explore further how the digital models themselves can be captured by scanning actual physical objects, as well as by constructing models from photographic imagery.

We have already seen how markerless motion capture can be used to capture the motion of actors and objects in the real world in real time, and how video compositing techniques can be used to change the pictorial content of a digitally captured visual scene. But we can also use reality capture technologies to scan physical world objects, or otherwise generate three dimensional digital models of them.

Generating 3D Models from Photos

One way of generating a three-dimensional model is to take a base three-dimensional mesh model and map it onto appropriate points in a photograph.

The following example shows an application called Faceworx in which textures from a front facing portrait and a side facing portrait are mapped onto a morphable mesh. The Smoothie-3d application described in 3D Models from Photos uses a related approach.

3D Models from Multiple Photos

Another way in which photographic imagery can be used to generate 3D models is to use techniques from photogrammetry, defined by Wikipedia as “the science of making measurements from photographs, especially for recovering the exact positions of surface points”. By taking several photographs of the same object, identifying the same features in each of them, and then aligning the photographs, the differential distances between features can be used to model the three-dimensional character of the original object.

DO: read the description of how the PhotoModeler application works: PhotoModeler – how it works. Similar mathematical techniques (triangulation and trilateration) can also be used to calculate distances in a wide variety of other contexts, such as finding the location of a mobile phone based on the signal strengths of three or more cell towers with known locations.
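
For a flavour of the underlying maths, here is a minimal sketch of 2D trilateration with NumPy: subtracting one distance equation from the others turns the problem into a small linear system. The tower positions and the “unknown” location are synthetic.

```python
# Trilateration: recover a 2D position from distances to three known points.
import numpy as np

towers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # known positions
true_pos = np.array([3.0, 4.0])
d = np.linalg.norm(towers - true_pos, axis=1)                # "measured" ranges

# (x - xi)^2 + (y - yi)^2 = di^2 ; subtracting the equation for tower 0
# from the others cancels the quadratic terms, leaving a linear system.
A = 2 * (towers[1:] - towers[0])
b = (d[0] ** 2 - d[1:] ** 2
     + np.sum(towers[1:] ** 2, axis=1) - np.sum(towers[0] ** 2))
estimate, *_ = np.linalg.lstsq(A, b, rcond=None)
print(estimate)   # approximately [3. 4.]
```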

“Depth Cameras”

Peripheral devices such as Microsoft Kinect, the Intel RealSense camera and the Structure Sensor 3D scanner perceive depth directly as well as capturing photographic imagery.

In the case of Intel RealSense devices, three separate camera components work together to capture the imagery (a traditional optical camera) and the distance to objects in the field of view (an infra-red camera and a small infra-red laser projector).
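
As a simplified illustration of why two offset views recover depth, the sketch below applies the standard stereo relationship: depth is proportional to focal length times baseline, divided by disparity. The numbers are made up, and this is not Intel’s actual RealSense processing pipeline, which also uses the projected infra-red pattern to make feature matching reliable on textureless surfaces.

```python
# Depth from disparity: the farther an object, the smaller the shift between views.
focal_length_px = 700.0     # focal length expressed in pixels (assumed)
baseline_m = 0.05           # distance between the two sensors, in metres (assumed)

def depth_from_disparity(disparity_px):
    return focal_length_px * baseline_m / disparity_px

for disparity in (70.0, 35.0, 7.0):
    print(f"disparity {disparity:5.1f} px -> depth "
          f"{depth_from_disparity(disparity):.2f} m")
# disparity  70.0 px -> depth 0.50 m
# disparity  35.0 px -> depth 1.00 m
# disparity   7.0 px -> depth 5.00 m
```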

With their ability to capture distance-to-object measures as well as imagery, depth perceiving cameras represent an enabling technology that opens up a range of possibilities for application developers. For example, itseez3d is a tablet based application that works with the Structure Sensor to provide a simple 3D scanner application that can capture a 3D scan of a physical object as both a digital model and a corresponding texture.

Depth Perceiving Cameras and Markerless Mocap

Depth perceiving cameras can also be used to capture facial models, as the FaceShift markerless motion capture studio shows.

ACTIVITY: according to the FAQ for the FaceShift Studio application shown in the video below, what cameras can be used to provide inputs to the FaceShift application?

EXERCISE: try to find one or two recent examples of augmented or mixed reality applications that make use of depth sensitive cameras and share links to them in the comments below. To what extent do the examples require the availability of the depth information in order for them to work?

Interactive Dynamic Video

Another approach to using video captures to create interactive models is a technique developed by researchers Abe Davis, Justin G. Chen, and Fredo Durand at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), referred to as interactive dynamic video. In this technique, a few seconds (or minutes) of video are analysed to study the way a foreground object vibrates naturally, or when gently perturbed.

Rather than extracting a 3-dimensional model of the perturbed object, and then rendering that as a digital object, the object in the interactive video is perturbed by constructing a “pyramid” mesh over the pixels on the video image itself (Davis, A., Chen, J.G. and Durand, F., 2015. Image-space modal bases for plausible manipulation of objects in video. ACM Transactions on Graphics (TOG), 34(6), p.239). That is, there is no “freestanding” 3D model of the object that can be perturbed. Instead, it exists as a dynamic, interactive model within the visual scene within which it is situated. (For a full list of related papers, see the Interactive Dynamic Video website.)

SAQ: to what extent, if any, is interactive dynamic video an example of an augmented reality technique? Explain your reasoning.

Adding this technique to our toolbox, along with the ability to generate simple videos from still photographs as described in Hyper-reality Offline – Creating Videos from Photos, we see how it is increasingly possible to bring imagery alive simply through the manipulation of pixels, mapped as textures onto underlying structural meshes.

Interlude – Ginger Facial Rigging Model

Applications such as Faceshift, as mentioned in The Photorealistic Effect…, demonstrate how face meshes can be captured from human actors and used to animate digital heads.

Ginger is a browser-based facial rigging demo, originally from 2011 but since updated, that allows you to control the movements of a digital head.

If you enable the Follow on feature, the eyes and head will follow the motion of your mouse cursor about the screen. The demo is listed on the Google Chrome Experiments website and can be found here: https://sv-ginger.appspot.com. (The code, which builds on the three.js 3D javascript library, is available on Github: StickmanVentures/ginger.)

Recap – Enabling the Impossible

One of the recurring themes in this series of posts has been the extent to which particular augmented or mixed reality effects are impossible to achieve without the prior development of one or more enabling technologies.

The following video clip from Cinefix describing “The Top 10 VFX Innovations in the 21st Century” demonstrates how visual effects in blockbuster movies have evolved over several years as new techniques are invented, developed and then combined in new ways.

Here’s a quick breakdown of the top 10.

  • digital color-grading: recoloring films automatically to influence the mood of the film (see the sketch after this list);
  • fluid modelling/water effects: bulk volume mesh vs. droplet (particle-by-particle) models, combined into hybrid simulations;
  • AI powered crowd animation: individuals have their own characters and actions that are then played out;
  • motion capture as a basis for photo-realistic animation;
  • universal capture/markerless performance capture;
  • painted face marker capture;
  • digital backlot;
  • imocap – in-camera motion capture – motion capture data captured alongside principal photography;
  • intermeshing of 3D digital backlot, live capture and live rendering, virtual reality camera;
  • lightbox cage rig, compositing of human actor and digital world.
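
As promised above, here is a toy sketch of the first item in that list, digital colour grading: nudging shadows towards teal and highlights towards orange to change the mood of a frame. The particular curve is an arbitrary illustration, assuming OpenCV and a hypothetical input frame.

```python
# Toy colour grade: teal shadows, warm highlights.
import cv2
import numpy as np

frame = cv2.imread("frame.jpg").astype(np.float32) / 255.0   # hypothetical frame
b, g, r = cv2.split(frame)

luminance = 0.114 * b + 0.587 * g + 0.299 * r
shadows = 1.0 - luminance          # weight dark regions more heavily

# Teal tint in shadows (boost blue/green), warm tint in highlights (boost red).
b = np.clip(b + 0.08 * shadows, 0, 1)
g = np.clip(g + 0.04 * shadows, 0, 1)
r = np.clip(r + 0.08 * luminance, 0, 1)

graded = (cv2.merge([b, g, r]) * 255).astype(np.uint8)
cv2.imwrite("frame_graded.jpg", graded)
```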

DO: watch the video clip, noting what technologies were developed in order to achieve the effect or how pre-existing technologies were combined in novel ways to achieve the effect. To what extent might such technologies be used in a realtime mixed or augmented reality setting and for what purpose? What technical challenges would need to be overcome in order to use the techniques in such a way?

 

The Photorealistic Effect…

In Even if the Camera Never Lies, the Retouched Photo Might… we saw how photographic images may be manipulated using digital tools to create “hyperreal” imagery in which perceived “imperfections” in the real world artefact are removed. In this post, we’ll explore how digital tools can be used to create imagery that looks like a photograph but was created entirely from the mind of the artist.

As an artistic style, photorealism refers to artworks in which the artist uses a medium other than photography to try to create a representation of a scene that looks as if it was captured as a photograph using a camera. By extension, photorealism aims to (re)create something that looks like a photograph and, in so doing, capture a lifelike representation of the scene, whether the scene is imagined or a depiction of an actual physical reality.

DO: Look through the blog posts Portraits Of The 21st Century: The Most Photorealistic 3D Renderings Of Human Beings (originally posted as an imgur collection shared by Reddit user republicrats) and 15 CGI Artworks That Look Like Photographs. How many of the images included in those posts might you mistake for a real photograph?

According to digital artist and self-proclaimed “BlenderGuru” Andrew Price in his hour-long video tutorial Photorealism Explained, which describes some of the principles and tools that can be used in making photorealistic CGI (computer-generated imagery), there are four pillars to creating a photorealistic image – modelling, materials, lighting, post-processing:

  • photorealistic modelling – “matching the proportions and form of the real world object”;
  • photorealistic materials – “matching the shading and textures of real world materials”;
  • photorealistic lighting – “matching the color, direction and intensity of light seen in real life”;
  • photorealistic post-processing – “recreating imperfections from real life cameras”.

Photorealistic modelling refers to the creation of a digital model that is then textured and lit to create the digital image. Using techniques that will be familiar to 3D game developers, 3D mesh models may be constructed from scratch using open-source tools such as Blender or professional commercial tools.

[Video: Blender – Modeling a Human Head Basemesh (YouTube)]

The mesh-based models can also be transformed in a similar way to the manipulation of 2D photos mapped onto the nodes of a 2D mesh.

Underpinning the model may be a mesh containing many thousands of nodes encompassing thousands of polygons. Manipulating the nodes allows the model to be fully animated in a realistic way.

Once the model has been created, the next step is to apply textures to it. The textures may be created from scratch by the artist, or based on captures from the real world.

In fact, captures provide another way of creating digital models, by seeding them with data points captured from a high-resolution scan of a real-world model. In the following clip about the development of the digital actor “Digital Emily” (2008), we see how 3D scanning can be used to capture a face pulling multiple expressions, and from these captures construct a mesh, with overlaid textures grabbed from the real-world photographs, as the basis of the model.

Watch the full video – ReForm | Hollywood’s Digital Clones – for a more detailed discussion about “digital actors”. Among other things, the video describes the Lightstage X technology used to digitise human faces. Along with “Digital Emily”, the video introduces “Digital Ira”, from 2012. Whereas Emily took 30 mins to render each frame, Ira could be rendered at 30fps (30 renders per second).

Price’s third pillar refers to lighting. Lighting effects are typically based on computationally expensive algorithms, incorporated into the digital artist’s toolchain using professional tools such as Keyshot as well as forming part of more general toolsuites such as Blender. The development of GPUs – graphical processing units – capable of doing the mathematical calculations required in parallel and ever more quickly is one of the reasons why Digital Ira is a far more responsive actor than Digital Emily could be.

The following video reviews some of the techniques used to render photorealistic computer generated imagery.

Finally, we come to Price’s fourth pillar – post-processing – things like motion blur, glare/lens flare and depth of field effects, where the camera can only focus on items a particular distance away and everything else is out of focus. In other words, all the bits that are “wrong” with a photographic image. (A good example of this can be found in the blog post This Image Shows How Camera Lenses Beautify or Uglify Your Pretty Face, which shows the same portrait photograph taken using various different lenses; /via @CharlesArthur.) In professional photography, the photographer may use tools such as Photoshop to create images that are physically impossible to capture using a camera because of the physical properties of the camera; photo-manipulation is then used to create hyper-real images, closely based on reality but representing a fine-tuning of it. According to Price, to create images that are photorealistic using tools that create perfect depictions of a well-textured and well-lit accurate model in a modelled environment, we need to add back in the imperfections that the camera, at least, introduces into the captured scene. To imitate reality, it seems we need to model not just the (imagined) reality of the scene we want to depict, but also the reality of the device we claim to be capturing the depiction with.
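
A toy sketch of that fourth pillar is shown below: deliberately adding two camera “imperfections” (a vignette and a little sensor-style grain) back onto an otherwise clean render. The strengths are arbitrary and the input file is hypothetical.

```python
# Post-processing: add a vignette and film grain to a clean render.
import cv2
import numpy as np

img = cv2.imread("render.png").astype(np.float32) / 255.0   # hypothetical render
h, w = img.shape[:2]

# Vignette: darken pixels in proportion to their distance from the centre.
ys, xs = np.mgrid[0:h, 0:w]
dist = np.sqrt((xs - w / 2) ** 2 + (ys - h / 2) ** 2)
vignette = 1.0 - 0.5 * (dist / dist.max()) ** 2
img *= vignette[..., None]

# Grain: low-amplitude Gaussian noise, as a crude stand-in for sensor noise.
img += np.random.normal(0.0, 0.02, img.shape)

out = (np.clip(img, 0, 1) * 255).astype(np.uint8)
cv2.imwrite("render_imperfect.png", out)
```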

VideoRealistic Motion

In addition to the four pillars of photorealism described by Andrew Price when considering photorealistic still imagery, we might add another pillar for photorealistic moving pictures (maybe we should call this videorealistic motion!):

  • photorealistic motion – matching the way things move and react in real life.

When a digital model is used as the basis of an animated (video) scene, a question arises as to how to actually animate the head in a realistic way. Where the aim is to recreate human-like expressions or movements, the answer may simply be to use a person as a puppeteer, using motion capture to record an actor’s facial expressions and then using them to actuate the digital model. Such puppetry is now a commodity application, as the Faceshift markerless motion capture facial animation software demonstrates. (See From Motion Capture to Performance Capture – Sampling Movement in the Real World into the Digital Space for more discussion about motion capture.)
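
As a toy sketch of what such puppetry involves, the following code expresses a captured set of facial landmarks as a weighted mix of a character’s blendshapes and then uses those weights to pose the digital head. All of the data here is synthetic, and commercial systems such as Faceshift solve a far more constrained and robust version of this problem.

```python
# Toy facial puppetry: fit blendshape weights to captured landmarks.
import numpy as np

n_landmarks, n_blendshapes = 30, 5
rng = np.random.default_rng(0)

neutral = rng.normal(size=(n_landmarks, 2))                    # resting face
blendshapes = neutral + 0.1 * rng.normal(size=(n_blendshapes, n_landmarks, 2))

# Pretend the mocap system captured a frame that is 70% "shape 2" (say, a smile).
captured = neutral + 0.7 * (blendshapes[2] - neutral)

# Solve captured - neutral ~= sum_k w_k * (blendshape_k - neutral) for w.
deltas = (blendshapes - neutral).reshape(n_blendshapes, -1).T  # (2*L, K)
target = (captured - neutral).ravel()
weights, *_ = np.linalg.lstsq(deltas, target, rcond=None)
print(np.round(weights, 2))   # approximately [0. 0. 0.7 0. 0.]

# The same weights then actuate the blendshapes on the digital character.
posed_character = neutral + np.tensordot(weights, blendshapes - neutral, axes=1)
```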

With Hollywood film-makers regularly using virtual actors in their films, the next question to ask is whether such renderings will be possible in a “live” augmented reality context: will it be possible to sit a virtual Emily in your postulated Ikea sitting room and have her talk through the design options with you?

The following clip, which combines many of the techniques we have already seen, uses a 3D registration image within a physical environment as the location point for a digital actor animated using motion capture from a human actor.

In the same way that digital backlots now provide compelling visual recreations of background – as well as foreground – scenery, as we saw in Mediating the Background and the Foreground, it seems that now even the reality of the human actors may be subject to debate. By the end of the clip, I am left with the impression that I have no idea what’s real and what isn’t any more! But does this matter at all? If we can create photorealistic digital actors and digital backlots, does it change our relationship to the real world in any meaningful way? Or does it start to threaten our relationship with reality?

