WO2015052514A2 - Rendering composites/layers for video animations - Google Patents

Rendering composites/layers for video animations

Info

Publication number
WO2015052514A2
WO2015052514A2 (application PCT/GB2014/053028)
Authority
WO
WIPO (PCT)
Prior art keywords
layers
image
frames
rendering
versions
Prior art date
Application number
PCT/GB2014/053028
Other languages
French (fr)
Other versions
WO2015052514A3 (en)
Inventor
David Niall Cumming
Original Assignee
Digimania Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digimania Limited
Publication of WO2015052514A2 publication Critical patent/WO2015052514A2/en
Publication of WO2015052514A3 publication Critical patent/WO2015052514A3/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/50 Lighting effects
    • G06T15/503 Blending, e.g. for anti-aliasing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An apparatus (10) and a process for rendering layers (or composites) in the context of video animations are provided. The process involves activating a video rendering engine (12) to produce multiple different versions (Vm1 to Vmn) of an image of a frame; performing comparisons between the multiple versions to deduce desired layers (Cm1 to Cmn'); storing the deduced layers in a memory (25); and repeating these three steps for multiple successive frames (F). For example in a first version the surfaces can reflect light both diffusely and with specular highlights, and in a second version no specular reflections occur; from a comparison between these versions a layer showing just the specular reflections may be deduced. This provides a convenient way of generating layers. Multiple layers can subsequently be composited to produce frames.

Description

Rendering Composites/Layers for Video Animations
The present invention relates to an apparatus and a process for rendering composites or layers in the context of video animations.
A video stream consists of a sequence of discrete frames, each frame usually representing about 1/50 of a second of animation. In producing video animations, a known process is to prepare "composites", that is to say layers, each of which contains a subset of the visual elements that are to be shown in an image of a scene. Each composite or layer may correspond to an object, a character, background terrain, or a visual effect, or any combination of such features. The layers can be subsequently combined together to produce a full rendering of the image, by a known process referred to as compositing. This process is also applicable to an individual object, for example to separate the lighting and shadow components of the render. For example an object may be lit with a diffuse colour caused by ambient environmental light; or it may be lit by a specular component caused by reflective highlights from a specific light source. Combining the resulting layers would produce the required visual effect.
There are two main reasons why compositing is performed: to reduce the time taken to render an image; and to allow alterations of specific elements without having to re-render the entire image. As an example of the first, in a scene where one character is viewed against a static background, a single bitmap (or layer) could be rendered of the background, and a set of animated layers could be produced of just the character, and then overlaid on the background. This would save a considerable amount of rendering time, as the lighting and shadowing of the background would only have to be calculated and rendered once, not for each frame in the video. As an example of the second, the individual layers may be adjusted in a number of ways, for example to make the shadows darker, and then re-composited without the need to re-render the entire image. Furthermore, compositing also allows elements from different sources to be combined, for example 3-D characters can be composited onto 2-D hand-painted backgrounds, with the shadows of the characters falling realistically into the painted background.
However, many highly-optimised graphical rendering engines are not designed to output composite frames, i.e. a frame of a layer. They render on the graphics card, and their output is intended to be drawn unmodified on a computer monitor. Consequently they do not deal natively with features such as transparency.
According to the present invention there is provided a process for rendering layers for use in producing frames for a video animation, the process comprising:
- activating a video rendering engine to produce multiple different versions of an image of a frame;
- performing comparisons between the multiple versions to deduce desired layers;
- storing the deduced layers; and
- repeating these three steps for multiple successive frames.
The layers that are thus produced can be subjected to compositing, to be combined into a sequence of multiple frames of a video animation. The video rendering engine may be a rendering engine that can produce frames in real-time. For example a suitable rendering engine is Epic Games' Unreal Engine 3 (trade mark). However the process of the present invention does not necessitate that the frames are produced in real-time. Even when using a rendering engine that can under normal circumstances produce frames in real-time, when performing the present invention the process may not operate in real-time as it requires multiple different versions of an image to be rendered, and then compared, before the layers can be deduced. Nevertheless the process can be orders of magnitude quicker than known processes for producing layers for use in producing frames for a video animation. Furthermore, the invention enables the layers to be rendered while using a rendering engine that is not able to render such layers.
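By way of illustration only, the per-frame flow of such a process might be sketched as follows; this sketch does not form part of the original disclosure, and the engine, comparator and store objects, together with their method names, are assumed interfaces rather than the API of any particular rendering engine.

```python
# Illustrative sketch only: the engine, comparator and store interfaces are
# assumptions, not the API of any particular rendering engine.
def render_layers_for_animation(engine, comparator, store, frame_count, version_specs):
    """Produce a set of layers for every frame of an animation."""
    for frame_index in range(frame_count):
        # 1. Activate the rendering engine to produce multiple different
        #    versions of the same frame (e.g. no speculars, no shadows,
        #    background only, depth image), one per specification.
        versions = {name: engine.render(frame_index, spec)
                    for name, spec in version_specs.items()}
        # 2. Perform comparisons between the versions to deduce the layers.
        layers = comparator.deduce_layers(versions)
        # 3. Store the deduced layers; the intermediate versions can be discarded.
        store.save(frame_index, layers)
        # 4. Repeat for the next frame (the loop continues).
```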
The multiple different versions of the image are specified in accordance with the layers that are required. By way of example, if the image shows a character and background scenery, as would be viewed by a camera, some suitable versions of the image may be as follows:
(a) the character and the background scenery, with surfaces reflecting light both diffusely and with specular highlights;
(b) the character and the background scenery, with all surfaces being matt, so that no specular reflections occur;
(c) the character and the background scenery as in (a), but without shadows;
(d) the background scenery alone, with surfaces reflecting light both diffusely and with specular highlights;
(e) the background scenery alone, with all surfaces being matt, so that no specular reflections occur;
(f) the background scenery alone as in (d), but without shadows;
(g) a depth image representing the distance to the camera from the feature corresponding to each pixel within the image, for example on a greyscale;
etc. In some cases no more than 25 different versions are produced for each frame, and the number of different versions may be no more than 12. The more different versions are required, the slower the overall process will be, but the more different layers can be deduced. However, the speed of the process clearly also depends on the resolution of the image; and if desired the speed of the process can be increased by sharing the analysis between a number of computers. By way of example, the versions (a) and (b) enable a layer representing the effect of specular highlights to be deduced by subtraction; the versions (a) and (d) enable a layer representing the character alone to be deduced; the versions (b) and (e) enable a layer representing the character alone, without specular highlights, to be deduced; and these last two layers, by subtraction, enable a layer representing the specular highlights on the character alone to be deduced. These are given only by way of example, and would clearly be dependent on what features are expected to be present within the image.
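By way of illustration only, the subtractions just mentioned might be expressed as follows, on the assumption (not part of the original description) that each version is available as a floating-point RGB array with values in [0, 1]; the clipping to that range is likewise an assumption of the sketch.

```python
import numpy as np

# Sketch only: v_a, v_b, v_d, v_e are versions (a), (b), (d), (e) above,
# each assumed to be an H x W x 3 float array with values in [0, 1].
def specular_layer(v_a, v_b):
    # (a) minus (b): the diffuse content cancels, leaving the specular highlights.
    return np.clip(v_a - v_b, 0.0, 1.0)

def character_layer(v_a, v_d):
    # (a) minus (d): the background scenery cancels, leaving the character alone.
    return np.clip(v_a - v_d, 0.0, 1.0)

def character_specular_layer(v_a, v_b, v_d, v_e):
    # Character with highlights minus character without highlights gives the
    # specular highlights on the character alone.
    with_highlights = character_layer(v_a, v_d)
    without_highlights = np.clip(v_b - v_e, 0.0, 1.0)
    return np.clip(with_highlights - without_highlights, 0.0, 1.0)
```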
It will be appreciated that although it is easy to remove a single component from the rendering process, simply by telling the engine not to draw it, it is much harder to tell it to draw only that component in isolation, because of the complexity of altering the existing rendering pathway to perform the desired extraction precisely. For example the shadow of a tree falling on a character would not be rendered if the tree is not rendered. This problem is eliminated by the present invention.
The objects or characters in a scene may thus be assigned to specific layers, where for example layer 0 is a background layer, and layers 1 and upwards are considered as being on top of the previous layers. The image files corresponding to each layer provide the image details of the features in that layer, and also provide transparency where there are no features in that layer. These layers may be deduced in the way described above. The "depth image" mentioned above may be used to provide automatic masking, when an object allocated to a lower layer is partially or completely in front of something in a higher layer. For example if the depth image indicates that a pixel in a higher layer represents an element that is actually behind (i.e. further from the camera than) the element represented by a pixel in a lower layer, any such pixels from the higher layer may be treated as transparent.
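A minimal sketch of that automatic masking rule is given below; it assumes, purely for illustration, that each layer carries an RGBA image together with a per-pixel depth image in which smaller values are closer to the camera.

```python
import numpy as np

# Sketch only: higher_rgba is H x W x 4, the depth arrays are H x W,
# and smaller depth values mean closer to the camera (an assumption).
def auto_mask(higher_rgba, higher_depth, lower_depth):
    """Make higher-layer pixels transparent where the element they show is
    actually behind the element recorded in a lower layer."""
    masked = higher_rgba.copy()
    behind = higher_depth > lower_depth   # further from the camera
    masked[behind, 3] = 0.0               # alpha = 0, i.e. treated as transparent
    return masked
```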
In a second aspect, the present invention also provides an apparatus for rendering layers for use in producing frames for a video animation, the apparatus comprising:
- a video rendering engine;
- a controller for activating the video rendering engine to produce multiple different versions of an image of a frame;
- an image comparator for performing comparisons between the multiple versions to deduce desired layers;
- a data store for storing the deduced layers;
wherein the controller is arranged to activate the rendering engine repeatedly, so as to produce multiple successive frames.
The invention will now be further and more particularly described, by way of example only, and with reference to the accompanying drawings in which:
Figure 1 shows a schematic diagram of an apparatus of the invention;
Figures 2a to 2f show different images produced by the apparatus of figure 1 in accordance with the present invention;
Figures 3a to 3d show layers produced by the apparatus of figure 1 in accordance with the present invention; and
Figures 4a to 4k provide an alternative illustration of how layers are made.
Referring to figure 1, an apparatus 10 for rendering layers for use in producing video frames is shown as a schematic diagram. The apparatus 10 includes a video rendering engine 12 (such as the Epic Games' Unreal Engine 3™), and means 14 to provide data D representing features within a scene to the rendering engine 12. In normal use the rendering engine 12, in response to the data D, produces a sequence of successive video frames F1 to FN, the frames being produced in real-time, so the resulting video frames F1-FN can be displayed on a display screen 16 in real-time. This display mode is represented by broken lines. The apparatus 10 includes a controller 20 which can provide control signals 20a to the rendering engine 12 to change its mode of operation. The controller 20 may include a graphical user interface 21 whereby a user can provide instructions. Instead of simply rendering successive frames F1 to FN, the rendering engine 12 in response to the control signals 20a from the controller 20 is caused to render multiple different versions Vm1 to Vmn of the same frame Fm. These multiple versions Vm1 to Vmn are provided to a comparator module 22 which performs a series of comparisons, in response to instructions 20b from the controller 20, enabling it to generate a number of layers (or composites), Cm1 to Cmn' corresponding to the single frame Fm. When all the required layers Cm1 to Cmn' have been generated corresponding to frame Fm, they are stored in a memory 25, and the multiple versions Vm1-Vmn discarded; and the comparator module 22 provides a feedback signal 23 to the engine 12 to indicate that processing of that frame Fm has been completed. The rendering engine 12 can start to work on rendering multiple versions of the next frame Fm+1, in the same way.
It will be appreciated that once the layers C have been generated for all of the frames F1 to FN, and stored in the memory 25, so there is one set of layers Cm1 to Cmn' stored for each frame Fm, they can be supplied to an image processing module 26 (for example running Adobe After-Effects™) which combines the layers Cm1-Cmn' to form the required frame Fm, and repeats this to produce each of the frames F1 to FN, which can then be displayed on a video display screen 28. It will also be appreciated that one or more of the individual layers Cm1 to Cmn' may be processed separately, for example to modify the properties of a feature in that layer, e.g. to make the shadows darker, prior to being combined with the other layers.
Each frame F (or each version V) produced by the rendering engine 12 contains the data to represent the image, the image consisting of multiple pixels, for example 1024 x 768 pixels, and each pixel is associated with colour information. The rendering engine 12 can also produce a depth image representing the distance from the viewing point (e.g. a camera) to the feature corresponding to each pixel within the image, for example on a greyscale. The rendering engine 12 is designed to produce images representing the view of a scene from a particular viewpoint (which may be conceived as being a camera), where the scene may include a background, fixed features, and characters or objects that may be arranged to move or to be moved. In producing such images, the user would also provide instructions about the nature and source of illumination. The layers, C, are analogous to the frames F, in providing data associated with the pixels, this data representing the brightness of red, green and blue light associated with that pixel, but each pixel of a layer is also associated with a parameter alpha which refers to the transparency or opacity, for example with 1 representing 'totally opaque', and 0 representing 'totally transparent'.
For simplicity the production of the multiple versions V will be described in relation to a simple scene, showing a black cube and a black pyramid which are stationary, and a white sphere which may move. The black cube and the black pyramid are always present, but the black pyramid is closer to the viewpoint (the camera) than the black cube. In this example the white sphere is located between the black cube and the black pyramid, i.e. at an intermediate distance from the viewpoint.
Referring to figure 2a, this shows the rendered image with all the features activated: all the elements of the scene are shown, illuminated with diffuse lighting and with specular highlights from a specific light source, as instructed by the user. In accordance with the present invention the rendering engine 12 is then arranged to produce a variety of different versions V of this image. The first version, V1, is that shown in figure 2a. Referring to figure 2b, this shows the second version, V2, of the same image, differing from the version V1 in that the rendering engine 12 renders the image without the specular highlights. Referring to figure 2c, this shows a third version, V3, of the same image, differing from the version V1 in that the rendering engine 12 is instructed to render the image without shadows. Referring to figure 2d, this shows a fourth version, V4, which is a depth image, showing the variation in distance from the viewpoint or camera on a greyscale, closer features being shown darker. It will be appreciated that the versions V1 to V4 are all different versions of the same scene.
Referring to figure 2e, a fifth version, V5, shows the same view as the first version, V1, with the same illumination but without the white sphere. Referring to figure 2f, a sixth version V6 shows the same view as the first version, V1, but without the white sphere and without the black cube and black pyramid. Referring to figure 2g, a seventh version V7 corresponds to the fifth version, V5, but without the specular highlights.
It will be appreciated that the seven versions V1 to V7 are merely examples of some suitable versions that may be rendered. The user can arrange suitable versions, depending on what types of layers are desired. Once the requisite different versions V have been rendered, the desired layers C can be obtained by suitable comparisons.
In this example a first layer, C1, representing the background, is the same as the version V6; because this represents the background, the transparency or opacity parameter, alpha, is 1 for every pixel, representing 'totally opaque'. A second layer, C2, representing the black cube and the black pyramid (without highlights) can be obtained by calculating (for each pixel) the difference between V7 and V6; this layer C2 is shown in figure 3a, and the checked areas represent areas of the layer which are transparent, in which the parameter alpha is 0 for every pixel.
Considering the complete image, a layer C3 representing the white sphere can be obtained by calculating (for each pixel) the difference between V1 and V5; this layer C3 is shown in figure 3b. A layer C4 representing only the shadows can be obtained by calculating (for each pixel) the ratio between V1 and V3; this layer C4 is shown in figure 3c. A layer C5 representing only the specular lighting highlights can be obtained by calculating (for each pixel) the difference between V1 and V2; this layer C5 is shown in figure 3d.
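By way of illustration only, the comparisons just described might be written as follows; the array representation, the small constant guarding the division, and the threshold used to mark transparent pixels are assumptions of the sketch rather than part of the description.

```python
import numpy as np

EPS = 1e-6  # guards the division for fully black pixels (an assumption)

def sphere_layer_with_alpha(v1, v5, threshold=1e-3):
    # C3: the white sphere is the per-pixel difference between V1 and V5.
    # Where the two versions agree the pixel carries no sphere, so one simple
    # (assumed) rule marks it fully transparent (alpha = 0), the rest opaque.
    diff = np.clip(v1 - v5, 0.0, 1.0)
    alpha = (np.abs(v1 - v5).max(axis=-1) > threshold).astype(np.float32)
    return np.dstack([diff, alpha])              # H x W x 4 RGBA layer

def shadow_layer(v1, v3):
    # C4: shadows only, as the per-pixel ratio of the shadowed image V1 to the
    # shadow-free image V3 (values near 1 mean no shadow, lower means darker).
    return np.clip(v1 / (v3 + EPS), 0.0, 1.0)

def specular_layer(v1, v2):
    # C5: specular highlights only, as the per-pixel difference V1 minus V2.
    return np.clip(v1 - v2, 0.0, 1.0)
```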
Thus, a folder of multiple layers Cm1 to Cmn' is generated for each frame Fm, and is stored in the memory 25. The different layers may be named automatically to reflect the contents. For convenience the folder corresponding to a frame Fm may also store the first version V1, which indicates the expected appearance of the frame Fm assembled from the layers if no changes are made to the individual layers. When recombining the layers to form frames F, it will be appreciated that the layers C must be recombined in a manner which reflects the calculations that were used to obtain them. This may either use a hard-coded set of rules, or a customisable scheme.
By way of example, if the layer C4 representing only the shadows is calculated by considering the ratios between pixels in different versions, then when recreating the image the pixel values in the layer C4 will be multiplied by those of the underlying layer; while if the layers representing specular and diffuse components of the surface lighting are obtained by a subtraction process, then when recreating the image the pixel values from those layers must be added together. The depth image V4 may be stored as one of the layers C. It may be utilised by the image processing module 26 in the compositing process, for example it may be used to achieve depth-based effects such as a restricted depth of field for camera focus. As another example, the "ambient occlusion" (in which the texture of the surface affects the brightness, as any grooves or creases are darker) for an image can be determined by rendering (A) the diffusely lit, unshadowed object with no ambient occlusion, and (B) the diffusely lit, unshadowed object with ambient occlusion (which will look the same but with extra darkness in any creases etc.). Then the output (C), which is a layer representing the ambient occlusion, would be the colour values of (B) divided by those of (A) calculated pixel by pixel. So if a pixel is naturally of brightness 0.3 (in image A) but when occluded is of brightness 0.1 (in image B), we would store the output pixel (in image C) as 0.1 / 0.3 = 0.3333. It will be appreciated that the full, occluded layer (B) can subsequently be calculated from the non-occluded layer (A) and the occlusion layer (C) in post-process by multiplying A x C = 0.3 x 0.3333 = 0.1.
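The worked ambient-occlusion figures above can be reproduced with two per-pixel operations; the sketch below is illustrative only, and the function names and the guard against division by zero are assumptions.

```python
import numpy as np

EPS = 1e-6  # avoids division by zero for completely black pixels (assumption)

# Ratio-derived layers (shadows, occlusion) are multiplied back in when
# recombining; difference-derived layers (specular components) are added.
def occlusion_layer(unoccluded_a, occluded_b):
    # Layer (C) = (B) divided by (A), pixel by pixel: e.g. 0.1 / 0.3 = 0.3333.
    return occluded_b / (unoccluded_a + EPS)

def recombine(unoccluded_a, occlusion_c):
    # Post-process recombination: A x C recovers the occluded image (B),
    # e.g. 0.3 x 0.3333 = 0.1 (to within rounding).
    return unoccluded_a * occlusion_c
```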
Referring now to figures 4a to 4k, a somewhat more detailed process of producing a particular layer will be described.
Ambient Cut-out
In this first example the aim is to render an isolated and masked image of the white sphere (layer 2). This is initially done without shadows; the shadow layers are extracted later. It also only contains the ambient light reflections, as specular highlights are dealt with after the cut-out phase (whenever a separate specular layer is requested by the user).
In order to do this the first step is to render the depth of all layers below layer 2, i.e. layer 1 (the background is not included as it cannot visually occlude anything). This is done by turning off rendering of layers 2 and upwards, rendering the scene and then accessing the contents of the depth buffer. This provides the version of figure 4a, which is a depth image of the items in layer 1 , i.e. the black cube and the black pyramid.
Then we turn on rendering of objects in layer 2, turn everything else off and repeat. This provides the version of figure 4b, which is a depth image of the items in layer 2, i.e. the white sphere. Now, using a pixel-by-pixel operation comparing the depth values between these two images we can generate a new image, figure 4c, which is transparent for any pixel that is darker (closer) in the first image, figure 4a, than in the second image, figure 4b. (In the example shown, the black cube has been compared to a white background of the sphere, and so part of the black cube gives a transparent portion of the new image.) This new image, figure 4c, may be referred to as a depth mask.
Now we render layer 2 on its own over a transparent background plane. This produces the version shown in figure 4d, showing just the white sphere and a transparent background.
Finally, to achieve the final masked layer we simply multiply together the previous two images, the depth mask of figure 4c with the image of the sphere of figure 4d. Any transparency in either image will be present in the final layer, shown as figure 4e.
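By way of illustration only, the ambient cut-out just described might be expressed as below; treating the depth mask as an alpha value that multiplies the layer's own alpha is one possible reading of the multiplication step, and the array layout is an assumption of the sketch.

```python
import numpy as np

# Sketch only: depth arrays are H x W (smaller = closer to the camera),
# layer_rgba is an H x W x 4 float image with an alpha channel.
def depth_mask(depth_lower_layers, depth_this_layer):
    # Figure 4c: transparent wherever something in a lower layer is closer
    # to the camera than the object in this layer.
    mask = np.ones_like(depth_this_layer, dtype=np.float32)
    mask[depth_lower_layers < depth_this_layer] = 0.0   # 0 = transparent
    return mask

def masked_layer(layer_rgba, mask):
    # Figure 4e: multiply the layer of figure 4d by the depth mask of
    # figure 4c, so transparency in either input survives into the result.
    out = layer_rgba.copy()
    out[..., 3] *= mask
    return out
```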
Specularity
To extract the layer representing specularity, it is recalled that the white sphere (layer 2) has been rendered against a transparent background, as shown in figure 4d; in that example there was no specularity in the reflections. Now we arrange for it to be rendered again but with specularity activated within the engine; this produces the version shown in figure 4f, which looks very similar but has a brighter highlight.
Now the specularity can be isolated by subtracting the image of figure 4d from the image of figure 4f on a per-pixel basis; the specularity is shown as figure 4g, in which the contrast is exaggerated for visibility.
Finally the depth mask of figure 4c is multiplied with this to provide a masked specular image, which is the layer shown in figure 4h. This is intended to be added to the ambient component of a layer in post-processing, where it can only serve to brighten the underlying image of the white sphere.
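A short sketch of this specularity extraction is given below, under the same assumptions as before (float RGBA arrays, illustrative function names); it is not the implementation of the original description.

```python
import numpy as np

# Sketch only: both renders are H x W x 4 floats; mask is the depth mask above.
def masked_specular_layer(render_with_specular, render_ambient_only, mask):
    # Figure 4g: per-pixel subtraction isolates the specular highlight.
    specular = np.clip(render_with_specular[..., :3]
                       - render_ambient_only[..., :3], 0.0, 1.0)
    # Figure 4h: apply the depth mask so that only unoccluded highlights
    # remain; in post-processing this layer is added to the ambient layer,
    # so it can only brighten the underlying image.
    return specular * mask[..., None]
```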
Shadow pass
To extract the shadow pass, we render the full scene with only the objects in layer 2 (i.e. the sphere) allowed to cast shadows, so producing the version shown in figure 4i. Then the full scene is rendered with no shadows (in fact we do this once and re-use it for each layer). This produces the version shown in figure 4j.
Then finally, the version of figure 4i is divided on a per-pixel basis by the version of figure 4j to achieve the shadow layer shown as figure 4k, in which the contrast is again exaggerated for clarity. This shadow layer is intended to be multiplied by the non-shadowed image per-layer in post-processing.
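Again purely by way of illustration, the shadow pass might be computed and re-applied as follows; the small constant protecting the division is an assumption of the sketch.

```python
import numpy as np

EPS = 1e-6  # protects the division for fully black pixels (assumption)

def shadow_pass(scene_with_layer_shadows, scene_without_shadows):
    # Figure 4k: per-pixel division of figure 4i by figure 4j; values below
    # 1.0 mark pixels darkened by the shadows cast by the chosen layer.
    return np.clip(scene_with_layer_shadows / (scene_without_shadows + EPS),
                   0.0, 1.0)

def apply_shadow(layer_rgb, shadow):
    # Post-processing: multiply the non-shadowed layer by its shadow pass.
    return layer_rgb * shadow
```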
The other layers can be broken down into suitable layers in the same way, and can eventually be combined in Adobe After-Effects™, as previously described. The process above would be different for example if no specular layer was requested, but the most complex case has been described.
The initial image, as produced by the video rendering engine 12, may have the same resolution as that required in the output image to be shown on the video display screen 28, or a different resolution. For example the initial image may be of a higher resolution than that required in the output image. In that case the layers may optionally be anti-aliased to give a smoother-looking picture for a given image resolution. For opaque images such as shadows this can be achieved by standard super-sampling. For layers that include transparent regions an anti-aliasing algorithm has been developed to deal with edge pixels where part of a super-sample (i.e. the square of pixels within a higher resolution scene capture texture that will be averaged to give the smooth output image when reduced to low resolution) is composed of both transparent and opaque pixels. In this algorithm the contribution from each pixel of the high resolution image is weighted so that only non-transparent pixels are included in the average colour value obtained for the destination (low resolution) pixel. The alpha value of the destination pixel is not weighted however, so that any transparent pixel in the super-sample increases the transparency of the destination pixel. So for example if we are averaging a square of four source pixels where one is opaque (alpha = 1) and white and the other three are completely transparent (alpha = 0), the destination pixel in the output image would be white in colour with an alpha parameter of 0.25.
The rendering apparatus 10 also allows transparent or partially-transparent objects to be included in layers, whereby the output image itself also has transparency. An example would be if a window with blue glass was placed in layer 2 with a character behind it in layer 1. The output image produced for layer 2 would be semi-transparent and blue-coloured for every pixel that was part of the glass, so that the images could be overlaid in post-processing to give the correct final image while allowing the two images to be altered independently (e.g. changing the colour of the glass). However, this is not straightforward because the images (i.e. the versions V) that have been produced by the rendering engine 12 have no transparency information: the rendering engine 12 only provides colour information.
The solution to this is to produce a number of intermediate layers which are compared to deduce the transparency that needs to be added to the opaque images. This procedure is preferably only followed if a specific layer actually does contain an object with transparency, as it is a computationally slow procedure. Firstly, the objects in the layer being drawn are rendered by the rendering engine 12 with their transparency removed (so a blue window would be drawn as solid blue). This gives us a measure of the front surface colour of the objects, unaffected by anything behind them. Next the rendering engine 12 is arranged to render the same objects with their transparency, but rendered in front of a white background plane. This gives us an idea of how transparent the objects are because a more transparent object would be made brighter as the white surface behind would be more visible. Finally to create the layer which includes transparency data, we introduce enough alpha to each relevant pixel in the opaque (surface colour) image to bring its brightness up to match the brightness of the version rendered with the white background. This calculation is in a sense the reverse of the standard alpha equation which adjusts the perceived colour of a pixel based on a given alpha value.
Alpha = (BrightnessOnWhite - 1.0) / (BrightnessOpaque - 1.0)
This method is effective, but it cannot deal with multiple coloured transparency in the same layer (e.g. if there was a red window behind the blue window in the same layer, then it would come out blue, not purple). The workaround for this limitation is to separate the transparent objects into different layers.
An additional application for this method of generating layers is to mimic the functionality of chroma-keying (green-screening). A scene's characters can be rendered in a layer and this can be overlaid on top of video footage or still photos in post-processing to achieve the impression of CGI characters in a real-world setting. Auto-masking can be used to improve the impression of immersion as characters can be made to walk behind elements of the photo/video by placing similar-shaped objects into the 3D scene but assigning them a lower layer index than the characters.
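By way of illustration only, the alpha-recovery step described above might be written as follows; the clamping and the handling of the degenerate case where the opaque surface is itself pure white are assumptions introduced for the sketch.

```python
import numpy as np

EPS = 1e-6  # keeps the denominator away from zero (assumption)

def recover_alpha(brightness_on_white, brightness_opaque):
    # Alpha = (BrightnessOnWhite - 1.0) / (BrightnessOpaque - 1.0):
    # a fully transparent pixel over white shows pure white (alpha = 0),
    # while a fully opaque pixel is unchanged by the white backing (alpha = 1).
    denominator = np.minimum(brightness_opaque - 1.0, -EPS)  # strictly negative
    alpha = (brightness_on_white - 1.0) / denominator
    return np.clip(alpha, 0.0, 1.0)
```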

Claims

Claims
1. A process for rendering layers for use in producing frames for a video animation, the process comprising:
- activating a video rendering engine to produce multiple different versions of an image of a frame;
- performing comparisons between the multiple versions to deduce desired layers;
- storing the deduced layers; and
- repeating these three steps for multiple successive frames.
2. A process as claimed in claim 1 wherein the image shows a character and background scenery, as would be viewed by a camera, and the versions of the image include:
(a) the character and the background scenery, with surfaces reflecting light both diffusely and with specular highlights;
(b) the character and the background scenery, with all surfaces being matt, so that no specular reflections occur;
(c) the character and the background scenery as in (a), but without shadows;
(d) the background scenery alone, with surfaces reflecting light both diffusely and with specular highlights;
(e) the background scenery alone, with all surfaces being matt, so that no specular reflections occur; and
(f) the background scenery alone as in (d), but without shadows.
3. A process as claimed in claim 1 or claim 2 wherein the image is as would be viewed by a camera, and wherein the versions of the image include:
(g) a depth image representing the distance to the camera from the feature corresponding to each pixel within the image.
4. A process as claimed in any one of the preceding claims wherein no more than 25 different versions are produced for each frame.
5. A process as claimed in any one of the preceding claims, wherein the layers that are thus produced are subsequently subjected to compositing, being combined into a sequence of multiple frames of a video animation.
6. An apparatus for rendering layers for use in producing frames for a video animation, the apparatus comprising:
- a video rendering engine;
- a controller for activating the video rendering engine to produce multiple different versions of an image of a frame;
- an image comparator for performing comparisons between the multiple versions to deduce desired layers;
- a data store for storing the deduced layers;
wherein the controller is arranged to activate the rendering engine repeatedly, so as to produce multiple successive frames.
7. An apparatus as claimed in claim 6 in combination with a device for compositing layers, to produce a series of frames, and a display for displaying the series of frames as a video animation.
8. A process for rendering layers for use in producing frames for a video animation substantially as hereinbefore described with reference to, and as shown in, the accompanying drawings.
9. An apparatus for rendering layers for use in producing frames for a video animation substantially as hereinbefore described with reference to, and as shown in, the accompanying drawings.
PCT/GB2014/053028 2013-10-08 2014-10-08 Rendering composites/layers for video animations WO2015052514A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1317789.4 2013-10-08
GBGB1317789.4A GB201317789D0 (en) 2013-10-08 2013-10-08 Rendering composites for video animations

Publications (2)

Publication Number Publication Date
WO2015052514A2 true WO2015052514A2 (en) 2015-04-16
WO2015052514A3 WO2015052514A3 (en) 2015-06-25

Family

ID=49630370

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2014/053028 WO2015052514A2 (en) 2013-10-08 2014-10-08 Rendering composites/layers for video animations

Country Status (2)

Country Link
GB (1) GB201317789D0 (en)
WO (1) WO2015052514A2 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6064393A (en) * 1995-08-04 2000-05-16 Microsoft Corporation Method for measuring the fidelity of warped image layer approximations in a real-time graphics rendering pipeline
US9001128B2 (en) * 2011-05-06 2015-04-07 Danglesnort, Llc Efficient method of producing an animated sequence of images

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3054359A1 (en) * 2016-07-20 2018-01-26 Digitaline METHOD FOR INCRUSTATION OF SOURCE IMAGE IN TARGET IMAGE
CN112929682A (en) * 2021-01-21 2021-06-08 广州虎牙科技有限公司 Method, device and system for transparently processing image background and electronic equipment
CN112929682B (en) * 2021-01-21 2023-03-24 广州虎牙科技有限公司 Method, device and system for transparently processing image background and electronic equipment
CN112686922A (en) * 2021-01-26 2021-04-20 华南理工大学 Method for separating animation special effect and background content based on multi-scale motion information
CN112686922B (en) * 2021-01-26 2022-10-25 华南理工大学 Method for separating animation special effect and background content based on multi-scale motion information
CN114820895A (en) * 2022-03-11 2022-07-29 支付宝(杭州)信息技术有限公司 Animation data processing method, device, equipment and system

Also Published As

Publication number Publication date
GB201317789D0 (en) 2013-11-20
WO2015052514A3 (en) 2015-06-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14784355

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14784355

Country of ref document: EP

Kind code of ref document: A2