WO2024189314A1 - Real-time video inpainter and method


Info

Publication number
WO2024189314A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
input image
previous
input
pixels
Application number
PCT/GB2024/050562
Other languages
French (fr)
Inventor
Seyed Danesh
Michael Garbutt
Original Assignee
Pommelhorse Ltd
Application filed by Pommelhorse Ltd filed Critical Pommelhorse Ltd
Publication of WO2024189314A1 publication Critical patent/WO2024189314A1/en

Classifications

    • G06T5/77 Retouching; Inpainting; Scratch removal
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10024 Color image
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to a method, system, apparatus and computer program for inpainting an input video stream in real time. The input video stream includes a plurality of first input images indicative of an object and the method comprises, for each first input image of the input video stream: receiving a second input image that is based on the first input image, wherein the second input image comprises a first region of pixels indicative of the object; generating a reference image of the object such that the object in the reference image aligns with the object in the second input image; and generating an output image by inpainting one or more erroneous pixels in the first region of pixels of the second input image, wherein inpainting the one or more erroneous pixels is based on the reference image.

Description

REAL-TIME VIDEO INPAINTER AND METHOD
TECHNICAL FIELD
The present disclosure relates to techniques and methods for processing video streams in real time. In particular, the present disclosure relates to methods, electronic devices and systems for generating an output video stream based on at least one input video stream in real time.
BACKGROUND
'Inpainting' generally refers to the process of filling in missing pixels, or regions of missing pixels (e.g. holes), in an image. For example, in a 2D image made of pixels, the values of some pixels or regions of pixels may be unknown, and inpainting fills in those missing pixels by determining (e.g. estimating) their values.
Inpainting techniques may use one or a combination of the following information to estimate the values of missing pixels:
1. The values of other pixels in the image, such as neighbouring pixels to the missing pixel or hole.
2. Information about the scene captured in the image or objects in the scene, for example using a predefined model of the scene or the objects.
Classical inpainting approaches use only information 1. This information can be used to determine the value of a missing pixel, for example by averaging the neighbouring pixels. Alternatively, an approach can search the image for other areas that appear similar to the area containing the hole, and copy pixel values from those areas into the hole.
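By way of illustration only, a minimal sketch of the neighbour-averaging variant of the classical approach described above might look as follows (the function name, the use of NumPy and the greyscale simplification are assumptions, not part of the disclosure):

```python
import numpy as np

def inpaint_by_neighbour_average(image, mask, passes=50):
    """Fill masked (missing) pixels by repeatedly averaging their known neighbours.

    image: HxW float array (greyscale for simplicity)
    mask:  HxW bool array, True where the pixel value is missing
    """
    filled = image.copy()
    known = ~mask
    for _ in range(passes):
        # Sum the values and counts of the four direct neighbours of every pixel.
        padded = np.pad(filled * known, 1)
        counts = np.pad(known.astype(float), 1)
        neigh_sum = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                     padded[1:-1, :-2] + padded[1:-1, 2:])
        neigh_cnt = (counts[:-2, 1:-1] + counts[2:, 1:-1] +
                     counts[1:-1, :-2] + counts[1:-1, 2:])
        # Update missing pixels that have at least one known (or already filled) neighbour.
        update = mask & (neigh_cnt > 0)
        filled[update] = neigh_sum[update] / neigh_cnt[update]
        known = known | update
    return filled
```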
In recent years, approaches based on using information 2. have become increasingly popular. Such approaches are based on neural networks. A neural network is trained by looking at many pictures and images of different scene types. The neural network learns what typical images/objects look like, and makes an educated guess at the values of the missing pixels. In some ways this is very similar to the way a human guesses what is in a scene behind an occlusion.
One of the challenges with the existing inpainting approaches is that they are slow to inpaint the image to a suitable degree of quality. For example, the above described neural network approach would typically require a large network to hold a lot of general information about what scenes and objects look like, which increases the amount of time required to perform the inpainting. As such, existing approaches to inpainting are not suited to inpainting images of a video stream in real time.
SUMMARY OF THE DISCLOSURE
The present disclosure relates to an inpainter and inpainting method that are suited to inpainting images of a video stream in real time. The video stream may capture a scene in a video call setting, and include a human subject. For each current image of the video stream, the inpainter inpaints or fills in the missing pixels of the current image based on one or more previous images of the video stream. New images/frames will be received at a rate determined by the frame rate of the camera capturing the images/frames, for example 24Hz/24fps, or 60Hz/60fps. In order to inpaint images of the video stream in real time, the process of inpainting a frame must be completed within the inter-frame time interval of the video stream (for example in 42ms or faster for a frame rate of 24Hz/24fps, or in 16.7ms or faster for a frame rate of 60Hz/60fps).
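For illustration, the inter-frame time budget quoted above follows directly from the frame rate; the helper below is hypothetical:

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to inpaint one frame if the output is to keep up with the input."""
    return 1000.0 / fps

print(frame_budget_ms(24))  # ~41.7 ms per frame at 24 fps
print(frame_budget_ms(60))  # ~16.7 ms per frame at 60 fps
```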
The previous images of the video stream may provide historic or prior information about the scene, which enables the inpainter to determine how to inpaint the missing pixels in the current image. Each image may be a colour image/frame (e.g. an RGB image), a greyscale image/frame and/or a depth map image/frame. For example, a current depth map image of a scene may be inpainted using previous depth map images of the scene, and/or a current colour image of a scene may be inpainted using previous colour images of the scene.
The inpainter of the present disclosure uses an object-based approach. In particular, the inpainter is used to inpaint the pixels of a certain object within the current image. In a video call setting, the object may be the human-subject's head/face. The object may be any other rigid or partially/semi rigid object. Using one or more previous images of the video stream, the inpainter may determine a reference image of the object. Since the reference image is generated using previous images of the video stream, the reference image will contain pixel information that is suitable for inpainting any missing pixels of the object.
In a first aspect of the disclosure, there is provided a method of inpainting an input video stream in real time (i.e., within the inter-frame time interval of the input video), the input video stream comprising a plurality of first input images indicative of an object, the method comprising an inpainter, for each first input image of the input video stream: receiving a second input image that is based on the first input image, wherein the second input image comprises a first region of pixels indicative of the object; generating a reference image of the object such that the object in the reference image aligns with the object in the second input image; and generating an output image by inpainting one or more erroneous pixels in the first region of pixels of the second input image, wherein inpainting the one or more erroneous pixels is based on the reference image, and outputting each output image to generate an output video stream. In some cases, the first input image is identical to the second input image. One example is when the first input image is a depth image/frame that describes distances to objects within the image from a fixed reference. In this case, every instance of "second input image" may be replaced with the term "first input image" in the above two paragraphs, and the following paragraphs. For example, the first aspect of the disclosure would then be a method of inpainting an input video stream in real time, the input video stream comprising a plurality of first input images indicative of an object, the method comprising an inpainter, for each first input image of the input video stream: receiving the first input image, wherein the first input image comprises a first region of pixels indicative of the object; generating a reference image of the object such that the object in the reference image aligns with the object in the first input image; and generating an output image by inpainting one or more erroneous pixels in the first region of pixels of the first input image, wherein inpainting the one or more erroneous pixels is based on the reference image, and outputting each output image to generate an output video stream.
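A rough, non-limiting sketch of the per-frame flow of the first aspect is given below; the helper callables (detect_erroneous_mask, generate_reference) are placeholders assumed for illustration and stand in for the mask and reference generation described in the following paragraphs:

```python
def inpaint_with_reference(image, erroneous_mask, reference):
    """Replace the erroneous pixels of the (second) input image with the
    correspondingly located pixels of the aligned reference image (NumPy arrays)."""
    output = image.copy()
    output[erroneous_mask] = reference[erroneous_mask]
    return output

def inpaint_stream(frames, detect_erroneous_mask, generate_reference):
    """Per-frame flow of the first aspect; the two helper callables are assumptions
    of this sketch, not features defined by the disclosure."""
    for second_input_image in frames:
        erroneous_mask = detect_erroneous_mask(second_input_image)   # pixels to inpaint
        reference = generate_reference(second_input_image)           # aligned reference image
        yield inpaint_with_reference(second_input_image, erroneous_mask, reference)
```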
In other cases the second input image may be a transformation of the first input image, for example a colour or greyscale image that shows a visualisation of the object from a different perspective/position to that from which the first input image was captured.
The plurality of first input images may be indicative of an object from a first camera position, and wherein the method further comprises: receiving a first transformation matrix indicative of a first 3D transformation between the first camera position and a second camera position different to the first camera position; and for each first input image of the input video stream: transforming the first input image based on the first transformation matrix to generate the second input image, wherein the second input image is indicative of an estimated view of the object from the second camera position; and receiving the second input image at the inpainter.
The method may further comprise: for each first input image of the input video stream: estimating locations of the one or more erroneous pixels in the second input image based on the pose of the object in the second input image and the first transformation matrix; and generating a mask indicative of the locations of the one or more erroneous pixels.
Generating the reference image may comprise: accessing a database, wherein the database comprises a plurality of previous input images, wherein each previous input image corresponds to a previous first input image of the input video stream or a previous second input image received at the inpainter; selecting, from the database, a first previous input image of the plurality of previous input images; and transforming the selected first previous input image such that the object in the transformed first previous input image aligns with the object in the second input image. Transforming the selected first previous input image may comprise: generating a second transformation matrix indicative of a second 3D transformation between a pose of the object in the selected first previous input image and a pose of the object in the second input image; and applying the second 3D transformation to the selected first previous input image using the second transformation matrix.
Generating the second transformation matrix may comprise: generating an input pose matrix indicative of the pose of the object in the second input image; receiving, from the database, a previous pose matrix indicative of the pose of the object in the selected first previous input image; and generating the second transformation matrix based on a difference between the input pose matrix and the previous pose matrix.
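Assuming the input pose matrix and the previous pose matrix are expressed as 4x4 homogeneous matrices, the "difference" between them could be computed roughly as follows (a sketch, not the claimed implementation):

```python
import numpy as np

def second_transformation_matrix(input_pose: np.ndarray, previous_pose: np.ndarray) -> np.ndarray:
    """Return a 4x4 transform that maps the object pose in the selected previous
    input image onto the object pose in the second input image.

    Both poses are assumed to be 4x4 homogeneous object-to-camera matrices.
    """
    # Undo the previous pose, then apply the current pose.
    return input_pose @ np.linalg.inv(previous_pose)
```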
The transformed first previous input image may be the reference image.
Generating the reference image may further comprise: selecting, from the database, a second previous input image of the plurality of previous input images; transforming the selected second previous input image such that the object in the transformed second previous input image aligns with the object in the second input image; and combining the transformed first previous input image and the transformed second previous input image to create the reference image.
The second input image may be a colour or greyscale image and the reference image may be a colour or greyscale image, wherein the database further comprises a plurality of depth images corresponding to the plurality of previous input images, and wherein selecting the first previous input image further comprises selecting the corresponding depth image of the plurality of depth images, and wherein transforming the selected first previous input image uses the selected depth image corresponding to the first previous input image.
The second input image may be a depth image and the reference image may be a depth image, wherein the database further comprises a plurality of colour or greyscale images corresponding to the plurality of previous input images, and wherein selecting the first previous input image further comprises selecting the corresponding colour or greyscale image of the plurality of colour or greyscale images, and wherein transforming the first selected previous input image comprises: determining, using the selected colour or greyscale image, a transformation of the selected first previous input image to align the object in the selected first previous input image with the object in the second input image.
Transforming the selected first previous input image may further comprise: generating a third transformation matrix indicative of a 2D transformation that is suitable for aligning the object in the second input image with the object in the selected first previous input image; and applying the 2D transformation to the selected first previous input image using the third transformation matrix.
Generating the second transformation matrix may comprise: detecting one or more features of the object in the second input image and the one or more corresponding features of the object in the selected first previous input image; and generating the second transformation matrix based on differences between locations of the one or more features of the object in the second input image and the one or more corresponding features of the object in the selected first previous input image.
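One standard way of deriving such a transformation from matched features is a least-squares rigid alignment (the Kabsch algorithm); the sketch below assumes the features have already been detected, matched and lifted to 3D (e.g. using the depth image), none of which is specified here:

```python
import numpy as np

def rigid_transform_from_features(prev_pts: np.ndarray, curr_pts: np.ndarray) -> np.ndarray:
    """Estimate a rigid 4x4 transform aligning matched 3D feature points of the object
    in the selected previous input image with the same features in the second input image.

    prev_pts, curr_pts: Nx3 arrays of corresponding 3D feature locations.
    """
    prev_c, curr_c = prev_pts.mean(axis=0), curr_pts.mean(axis=0)
    H = (prev_pts - prev_c).T @ (curr_pts - curr_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                             # avoid a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = curr_c - R @ prev_c
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```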
Selecting the first previous input image comprises: determining which previous input image of the plurality of previous input images is most suitable for inpainting the first region of pixels in the second input image; and selecting the determined previous input image.
The selected first previous input image may be the previous input image that is most likely to include information required for inpainting the one or more erroneous pixels.
The second input image may be a transformation of the first input image, wherein determining which previous input image of the plurality of previous input images is most suitable for inpainting the first region of pixels in the second input image is based on: the transformation of a pose of an object in the second input image compared with a pose of the object in the first input image; and a pose of the object in each previous input image of the plurality of previous input images.
The object in the determined previous input image may have a pose that is most similar to an extrapolation of the transformation of the object in the first input image to the object in the second input image.
Determining which previous input image of the plurality of previous input images is most suitable for inpainting the first region of pixels in the second input image may comprise: determining a degree of similarity between the pose of the object in the previous input image and the pose of the object in the second input image; and selecting the previous input image that has the highest degree of similarity.
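A possible (illustrative) pose-similarity selection is sketched below; the combined translation-plus-rotation distance is one choice of similarity measure, not the one mandated by the disclosure:

```python
import numpy as np

def select_most_similar_previous(input_pose, previous_entries):
    """Pick the stored previous input image whose object pose is most similar to the
    pose of the object in the second input image.

    previous_entries: iterable of (image, pose) pairs, where each pose is a 4x4 matrix.
    """
    def pose_distance(a, b):
        dt = np.linalg.norm(a[:3, 3] - b[:3, 3])              # translation difference
        cos_angle = (np.trace(a[:3, :3].T @ b[:3, :3]) - 1.0) / 2.0
        d_rot = np.arccos(np.clip(cos_angle, -1.0, 1.0))      # rotation angle difference
        return dt + d_rot

    # Smallest distance corresponds to the highest degree of similarity.
    return min(previous_entries, key=lambda entry: pose_distance(entry[1], input_pose))
```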
Each previous input image stored in the database may be associated with a different predefined pose of the object, wherein selecting the first previous input image comprises: identifying which predefined pose is most suitable for inpainting the first region of pixels in the second input image; and selecting the previous input image that is associated with the identified predefined pose. The method may further comprise: after generating the reference image: replacing, in the database, the selected first previous input image with the first or the second input image in the database if the pose of the object in the first or the second input image has a higher degree of similarity to the identified predefined pose than the pose of the object in the selected first previous input image.
The method may further comprise generating at least one further reference image, wherein generating the output image comprises inpainting the one or more erroneous pixels based on correspondingly located pixels of the reference image and the at least one further reference image.
Generating the reference image may comprise: receiving, from a database, a point cloud indicative of a 3D model of the object, wherein the point cloud is based on one or more previous input images, each previous input image corresponding to a previous first input image of the video stream or a previous second input image received at the inpainter; and generating the reference image using the point cloud.
The method may further comprise: after generating the reference image: updating the point cloud in the database based on the first input image or the second input image.
The method may further comprise: for each first input image of the input video stream: identifying, in the output image, a second region of pixels within the first region of pixels, wherein the second region of pixels defines a feature of the object that is likely to visually change over time, and wherein at least some of the one or more erroneous pixels are located within the second region; and applying a blurring effect or an image correction to the second region of pixels.
Identifying the second region of pixels may comprise: identifying, in the output image, a plurality of regions of pixels within the first region, each region of the plurality of regions corresponding to a respective feature of the object; and for each region of the plurality of regions: determining if at least some of the one or more erroneous pixels are located within the region; and identifying the region as the second region if the region corresponds to a feature of the object that is likely to visually change over time.
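As an illustration only, a masked box blur such as the sketch below could serve as the blurring effect applied to the second region of pixels; the function name and the greyscale simplification are assumptions:

```python
import numpy as np

def blur_region(image: np.ndarray, region_mask: np.ndarray, passes: int = 3) -> np.ndarray:
    """Apply a simple box blur to the pixels inside region_mask only.

    image: HxW (greyscale) float array; region_mask: HxW bool array.
    A box blur is used purely for illustration; any blurring or image-correction
    filter could be substituted.
    """
    out = image.astype(float).copy()
    for _ in range(passes):
        padded = np.pad(out, 1, mode="edge")
        # 3x3 box average of every pixel.
        blurred = sum(padded[dy:dy + out.shape[0], dx:dx + out.shape[1]]
                      for dy in range(3) for dx in range(3)) / 9.0
        out[region_mask] = blurred[region_mask]
    return out
```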
The second input image may be the first input image. The plurality of first input images may be a plurality of first depth images indicative of a distance to the object from a first fixed reference, and the input video stream may further comprise a plurality of second depth images indicative of a distance to the object from a second fixed reference, wherein the method may further comprise a further inpainter: receiving a second depth image of the plurality of second depth images, wherein the second depth image comprises a second region of pixels indicative of the object; generating a further reference image of the object such that the object in the further reference image aligns with the object in the second input image; and generating a further output image by inpainting one or more erroneous pixels in the second region of pixels, wherein inpainting the one or more erroneous pixels is based on the further reference image.
The inpainter may generate the reference image using one or more previous input images that are stored in a database, and the further inpainter may generate the further reference image using one or more previous input images that are stored in the database, wherein the database stores a plurality of previous input images that comprise one or more previous first depth images and one or more previous second depth images.
The first input image and the second input image may be colour images, greyscale images or depth maps.
The object may be a human head.
In a second aspect of the disclosure, there is provided a computer-readable medium comprising computer executable instructions stored thereon, which when executed by one or more processors, cause the one or more processors to perform the method according to the first aspect.
In a third aspect of the disclosure, there is provided an electronic device configured to perform the method of the first aspect.
In a fourth aspect of the disclosure, there is provided a system comprising: a first electronic device configured to perform the first aspect; and a second electronic device in data communication with the first electronic device over a network, wherein the second electronic device is configured to receive the output video stream from the first electronic device over the network.
In a fifth aspect of the disclosure, there is provided a system comprising: a first electronic device configured to perform the first aspect; and a second electronic device in data communication with the first electronic device over a network, wherein the first electronic device is configured to receive the input video stream from the second electronic device over the network.
BRIEF DESCRIPTION OF THE DRAWINGS
Examples of the present disclosure are now described with reference to the accompanying drawings, in which:
Figure 1A shows a functional block diagram of a video processor according to an example of the present disclosure;
Figure 1B shows a functional block diagram of a video processor according to another example of the present disclosure;
Figure 1C shows a functional block diagram of a video processor according to another example of the present disclosure;
Figure 2 shows a system according to an example of the present disclosure;
Figure 3 shows an inpainter according to an example of the present disclosure;
Figure 4 shows a reference view store according to an example of the present disclosure;
Figure 5 shows a reference view database according to an example of the present disclosure;
Figure 6 shows an example of how the inpainters of the present disclosure may be used to inpaint colour images;
Figure 7 shows another example of how the inpainters of the present disclosure may be used to inpaint colour images;
Figure 8 shows an inpainter according to another example of the present disclosure;
Figure 9 shows an inpainter according to another example of the present disclosure;
Figure 10 shows an inpainter according to another example of the present disclosure;
Figure 11 shows a reference view store according to another example of the present disclosure;
Figure 12 shows an image stitcher according to another example of the present disclosure;
Figure 13 shows an inpainter according to another example of the present disclosure;
Figure 14 shows an abnormality manager according to another example of the present disclosure;
Figure 15 shows an example system for facilitating a video call;
Figure 16 shows a system according to another example of the present disclosure;
Figure 17 shows a system according to another example of the present disclosure;
Figure 18 shows a system according to another example of the present disclosure; and
Figure 19 shows a computing system according to an example of the present disclosure.
DETAILED DESCRIPTION
Figure 1A shows a video processor 105 according to an example of the present disclosure. The video processor 105 receives an image comprising a colour frame 101 and/or a depth frame/data 102. The colour frame 101 may be part of an input video stream captured by a (real) camera. The video processor 105 also receives transformation data 103. The depth data 102 may be considered as a depth image or depth map. The transformation data 103 may be considered as a transformation matrix. The video processor 105 is configured to generate an output image 104 based on the input image (the colour frame 101 and/or the depth data 102) and the transformation data 103. The output image 104 may show the scene captured in the input image from the perspective of a (virtual) camera that is in a different position relative to the real camera. The output image 104 may be considered as an estimation of the scene from the perspective of the virtual camera.
The colour frame 101 may be referred to from here on as the image 101. The image 101 includes a plurality of pixels. The image 101 has a size of a particular width of pixels XMAX in the horizontal direction and a particular height of pixels YMAX in the vertical direction. The image 101 can therefore be considered as a 2D grid of pixels having a coordinate system [0, XMAX], [0, YMAX]. Individual pixels in the image 101 can be uniquely identified by a pair of pixel coordinates (x, y). The x values are integers ranging from 0 to XMAX in the horizontal direction (or along an x axis of the image) and the y values are integers ranging from 0 to YMAX in the vertical direction (along a y axis of the image). Each pixel may be considered as a discrete location in the image 101 that is capable of displaying a colour (e.g. based on RGB values or a greyscale value). As such, each pixel is also associated with a pixel value V(x, y). If the image 101 is a colour image, each pixel value V may have a R (red) component, G (green) component and B (blue) component, which in combination determine the colour of the pixel. If the image 101 is a greyscale image, the pixel value V(x, y) may be a greyscale value which indicates the colour (in this case the grey level) of the pixel. The pixel coordinate (x, y) corresponds to the discrete location of the image grid in which the associated pixel value V(x, y) (i.e. colour) should be generated and displayed. Each pixel may be considered as having a pixel size made up of a pixel width in the x direction and a pixel height in the y direction. For simplicity, the pixel width and height may be considered as being one integer, such that each pixel occupies a 1x1 area of the 2D image grid.
The depth data 102 may at times be referred to from here on as the image 102 and includes a depth value d(x, y) for each pixel (x, y) of the image 101. The depth value d(x, y) indicates the depth of the respective pixel (x, y) of the image 101, e.g. from the perspective or view of the real camera. The depth value d(x, y) may be considered as a distance of the pixel (x, y) from the real camera. The depth data 102 may be considered as a depth frame including depth values d(x, y) arranged in a 2D grid, according to the pixel coordinate (x, y) to which the depth value corresponds. In other words, the depth data 102 may be considered as a depth map or depth image 102. Similarly to the image 101, the depth image 102 will have a size of a particular width of pixels XMAX in the horizontal direction and a particular height of pixels YMAX in the vertical direction. The depth image 102 can therefore be considered as a 2D grid of pixels having a coordinate system [0, XMAX], [0, YMAX]. Individual pixels in the depth image 102 can be uniquely identified by a pair of pixel coordinates (x, y). The x values are integers ranging from 0 to XMAX in the horizontal direction (or along an x axis of the image) and the y values are integers ranging from 0 to YMAX in the vertical direction (along a y axis of the image). Each pixel may be considered as a discrete location in the depth image 102 that is associated with a depth value d(x, y). The depth image 102 may be visually displayed as a greyscale image. Similarly to the image 101, the pixel coordinate (x, y) corresponds to the discrete location of the image grid in which the associated pixel value d(x, y) (i.e. depth) should be generated and displayed. Each pixel may be considered as having a pixel size made up of a pixel width in the x direction and a pixel height in the y direction. For simplicity, the pixel width and height may be considered as being one integer, such that each pixel occupies a 1x1 area of the 2D image grid.
The transformation data 103 is indicative of a 3D transformation between the position of the real camera and the position of a virtual camera relative to the scene. As such, the transformation data 103 indicates the desired change in view of the input video stream. The transformation data 103 may be represented as a mathematical function. The transformation data 103 may correspond to a 3D transformation matrix. The transformation data 103 may include parameters indicative of the change in the 3D translational and/or 3D rotational position between the real camera and the virtual camera. In particular, the transformation data 103 may include translation parameters indicative of a change in a 3D translational position between the real camera and the virtual camera in 3D space. The transformation data 103 may also include rotation parameters indicative of a change in a 3D rotational position between the real camera and the virtual camera in 3D space.
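Assuming the rotation parameters are Euler angles (in radians) and the translation parameters are offsets along the three axes, the transformation data could be assembled into a 4x4 homogeneous matrix roughly as follows (the rotation order is an arbitrary choice for this sketch):

```python
import numpy as np

def transformation_matrix(rx: float, ry: float, rz: float,
                          tx: float, ty: float, tz: float) -> np.ndarray:
    """Build a 4x4 homogeneous matrix from rotation parameters (radians, about the
    x, y and z axes) and translation parameters describing the change from the real
    camera to the virtual camera."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx      # Z-Y-X rotation order, assumed for this sketch
    T[:3, 3] = [tx, ty, tz]
    return T
```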
The video processor 105 is configured to generate an output image 104 based on the input image (i.e., the colour image/frame 101 and the depth image 102) and the transformation matrix 103. The output image 104 will show the scene captured in the input image, but from the perspective of the virtual camera instead of the perspective of the real camera. The output image 104 estimates the scene from the perspective of the virtual camera. For simplicity, the output image 104 may be considered as having the same size as the image 101, and therefore the same number of pixels.
The video processor 105 includes an image transformation module/unit 106 (i.e. an image transformer, also referred to as a 3D view transformer) which receives the image 101, depth image 102, and the transformation matrix 103. The image transformer 106 processes the image 101 based on the depth image 102 and the transformation matrix 103 to generate a transformed image 108. The image transformer 106 may also use camera parameters of the real camera to generate the transformed image 108, such as the focal length and/or the centre point of the real camera. The transformed image 108 shows the scene captured in the image 101, from the perspective of the virtual camera. The transformed image 108 may have the same size and dimensions as the image 101.
Due to the change in position between the real camera and the virtual camera, the transformed image 108 may include gaps or holes, i.e. regions of undefined pixels, or pixels without assigned values. Said pixels may be considered as a type of erroneous pixel, in particular missing pixels. The missing pixels may correspond to parts of the scene that the real camera was originally unable to see. As such, the video processor 105 further includes an inpainter 107, which may also be considered as a filler 107. The inpainter 107 is configured to inpaint the transformed image 108. The inpainter 107 receives the transformed image 108 and corrects the erroneous pixels. In particular, the inpainter 107 may assign values (i.e. colour) to the missing pixels. As described below, the inpainter 107 may inpaint the current image 101 based on previous images 101 of the video stream.
Figure 1B shows the video processor 105 of Figure 1A in more detail. The video processor 105 includes a depth inpainter 201. In some scenarios, the depth image 102 may not be accurate, or may be missing information (e.g. it may have incorrect depth values at some pixel locations and/or be missing depth values at some pixel locations). In other words, the depth image 102 may include one or more erroneous pixels, each of which may either be a missing pixel (e.g. a pixel that has no value) or an incorrect pixel (e.g. a pixel that has a value that is wrong). The depth inpainter 201 is configured to inpaint the depth image 102. In particular, the depth inpainter 201 receives the depth image 102, and corrects erroneous pixels, by generating and assigning values for missing pixels, or by changing the value of incorrect pixels. As a result, the depth inpainter 201 outputs a corrected depth image 206 (D), which corresponds to the depth image 102 with the erroneous pixels corrected. As such, the depth inpainter 201 may reduce inaccuracies in the depth data, and/or estimate and fill depth values that are missing from the depth data 102. As described below, the depth inpainter 201 may inpaint the current depth image 102 based on previous depth images 102.
In some examples, the depth inpainter 201 may receive the image 101 and use the pixels of the image 101 to generate the corrected depth image 206. However, the depth inpainter 201 may not perform any processing on the image 101, and therefore the image 205 (C) outputted by the depth inpainter 201 may be considered identical to the image 101. In other examples, the depth inpainter 201 may perform its function without the image 101, in which case the input image 101 is provided directly to the image transformer 106 (e.g. as shown in Figure 1A).
The image transformer 106 may include a pixel transformer 202. The pixel transformer 202 receives the image 205 (C) and the depth image (D) from the depth inpainter 201. The pixel transformer 202 also receives the transformation matrix 103. The pixel transformer 202 generates and outputs transformed pixel data 207 (TC) based on the transformation data 103 and the image 205 (C). In particular, for each pixel (x, y)c in the image 205, the pixel transformer 202 transforms the pixel coordinate (x, y)c into a new, transformed coordinate (x, y)Tc. The pixel transformer 202 transforms the pixel coordinate (x, y)c based on the depth value d(x, y)c that is associated with the pixel (x, y)c, and based on the transformation data 103. The pixel transformer 202 may also use the camera parameters of the real camera to transform the pixel coordinate (x, y)c, such as the focal length and/or the centre point of the real camera. The transformed coordinate (x, y)Tc corresponds to an estimate of the location of the pixel (x, y)c in the output image 104. The transformed pixel data 207 comprises each of the transformed coordinates (x, y)Tc. Each transformed coordinate (x, y)Tc is associated with the corresponding pixel value V(x, y)c from the image 205, which may be denoted as V(x, y)Tc and included in the transformed pixel data 207. It will be appreciated that the pixel (x, y)c and the pixel (x, y)Tc correspond to the same source pixel value, for example the same colour information position at (x, y)c in image 205, whilst (x, y)Tc indicates the pixel with the transformed coordinate value.
As described above, each pixel of the image 205 has an initial coordinate (x, y)c which indicates the location of the pixel in the image 205. Each pixel is also associated with a pixel value V(x, y)c (e.g. RGB or greyscale values) which determines the colour of the pixel. The transformed coordinate (x, y)Tc corresponds to where the pixel (x, y)c should appear in the image, had the image been captured from the position and perspective of the virtual camera instead of the real camera. In other words, the new coordinate (x, y)Tc determines where the pixel value V(x, y)c (or colour) of the original pixel (x, y)c should appear in the transformed image 108, from the perspective of the virtual camera. The pixel transformer 202 determines the new coordinate (x, y)Tc based on the depth value d(x, y)c that is associated with the pixel (x, y)c, and based on the transformation data 103. In particular, the pixel transformer 202 may determine the new coordinate (x, y)Tc by applying a transformation (e.g. a matrix multiplication) to the coordinate (x, y)c using the associated depth value d(x, y)c and the transformation data 103.
The transformation applied by the pixel transformer 202 to a pixel (x, y)c may be understood conceptually as follows. The pixel transformer 202 may begin with a pixel (x, y)c of the image 205. The coordinate (x, y)c indicates the location (i.e. the discrete position) of the pixel in the image 205, which was captured from the perspective of the real camera. The coordinate (x, y)c can be considered to then be transformed into a 3D coordinate (x, y, z)r. The 3D coordinate (x, y, z)r indicates the position of the pixel (x, y)c relative to the real camera. The 3D coordinate (x, y, z)r may be determined using the depth value d(x, y)c that is associated with the pixel (x, y)c, and the camera parameters of the real camera (e.g. focal length). Then, the 3D coordinate (x, y, z)r is transformed to generate a transformed 3D coordinate (x, y, z)v. The 3D coordinate (x, y, z)v indicates the position of the pixel (x, y)c relative to the virtual camera. The transformation may be conceptually considered as changing the origin of the 3D coordinate system from the position of the real camera to the position of the virtual camera. The transformation is performed using the transformation data 103. The 3D coordinate (x, y, z)v is then transformed back into the 2D coordinate system [0, XMAX], [0, YMAX] (i.e. the 2D grid which forms the image 205) to generate the transformed coordinate (x, y)Tc. The transformation back into 2D space may also be performed using the camera parameters, such as focal length. This transformation may also result in generating a new depth value d(x, y)Tc for each transformed pixel (x, y)Tc. As such, the pixel transformer 202 may also determine a new depth d(x, y)Tc for each transformed pixel (x, y)Tc, wherein the new depth indicates the distance of the pixel (x, y)Tc relative to the virtual camera. The pixel transformer 202 may output transformed depth data 208 (TD), which includes the new or updated depth values d(x, y)Tc for each transformed pixel (x, y)Tc.
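The conceptual 2D-to-3D-to-2D chain above can be sketched with a pinhole camera model as follows; reusing the real camera's intrinsics for the re-projection is a simplifying assumption of this sketch:

```python
import numpy as np

def transform_pixel(x: int, y: int, depth: float, T: np.ndarray,
                    fx: float, fy: float, cx: float, cy: float):
    """Transform one pixel (x, y)c with depth d(x, y)c into its transformed coordinate
    and new depth relative to the virtual camera, using a pinhole camera model.

    T is the 4x4 real-to-virtual camera transform (the transformation data 103);
    fx, fy, cx, cy are the real camera's focal lengths and centre point.
    """
    # 1. Back-project the pixel into 3D relative to the real camera: (x, y, z)r.
    zr = depth
    xr = (x - cx) * zr / fx
    yr = (y - cy) * zr / fy
    # 2. Move the point into the virtual camera's coordinate system: (x, y, z)v.
    xv, yv, zv, _ = T @ np.array([xr, yr, zr, 1.0])
    # 3. Re-project onto the 2D image grid to obtain (x, y)Tc and the new depth.
    x_tc = fx * xv / zv + cx
    y_tc = fy * yv / zv + cy
    return x_tc, y_tc, zv
```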
The image transformer 106 may further include an image builder 203. The image builder 203 receives the transformed pixel data 207 (TC) and the transformed depths 208 (TD) from the pixel transformer 202. The image builder 203 builds or generates the transformed image 108 (FTC) from the transformed pixel data 207 and the transformed depths 208. The transformed image 108 is formed of a plurality of pixels (x, y)FTc. The image builder 203 uses the transformed pixel data 207 (TC) and the transformed depths 208 (TD) to generate each pixel (x, y)FTc of the transformed image 108. The transformed image 108 may have the same size and dimensions as the image 205. The image builder 203 may also generate and output a transformed depth image, which indicates the depth of each pixel in the transformed image 108 (FTC). The transformed image 108 (and optionally the transformed depth image) is provided to the inpainter 204, which corresponds to the inpainter 107 of Figure 1A. The inpainter 204 will fill any missing/erroneous pixels in the transformed image 108.
Advantageously, the video processor 105 of the present invention does not require the building and rendering of a 3D image or mesh to transform the view of the image 101. As such, the video processor 105 is able to process an input video stream significantly faster and with higher computational efficiency.
Figure 1C shows a video processor 105-2 according to another example of the present disclosure. The video processor 105-2 corresponds to the video processor 105, with the following differences.
Each pixel (x, y)c of the image 205 occupies one unit area of the 2D pixel grid (i.e. a 1x1 square of the image). If the virtual camera is relatively closer to the scene in comparison to the real camera, then the pixel (x, y)c should occupy a relatively greater area of the transformed image 108 because it is now closer to the camera. If the virtual camera is relatively further from the scene in comparison to the real camera, then the pixel (x, y)c should occupy a relatively smaller area of the transformed image 108. Furthermore, if the virtual camera is angled relative to the scene in comparison to the real camera, then this may also change how much image area each pixel (x, y)c should occupy in the transformed image 108. For example, in the image 205, a surface may be facing the real camera directly. From the perspective of the virtual camera, the surface may be angled, e.g. at 45 degrees. Therefore, the total area of the transformed image 108 occupied by the surface is reduced. On a more granular scale, the pixels (x, y)c corresponding to the surface may each occupy approximately a factor of 0.7 of a unit pixel area. It will be appreciated that some parts of the scene may be relatively closer to the virtual camera and other parts of the scene may be relatively further from the virtual camera. As such, some pixels of the image 205 may be required to occupy greater areas of the transformed image 108, whilst other pixels of the image 205 may be required to occupy smaller areas of the transformed image 108.
The video processor 105-2 of Figure 1C further includes a spread coefficient estimator 3001. The spread coefficient estimator 3001 is configured to determine a spread coefficient S(x, y)Tc (also referred to as a scaling coefficient) for each pixel (x, y)Tc contained in the transformed pixel data 207 (i.e. for each pixel of the image 205). The spread coefficient S(x, y)Tc indicates a degree or amount of scaling to be applied to the pixel (x, y)Tc when generating the transformed image 108. In other words, the spread coefficient S(x, y)Tc indicates the amount of image area that the pixel (x, y)Tc should occupy in the transformed image 108. As described below, the spread coefficient S(x, y)Tc may include an x-component Sx(x, y)Tc which indicates the degree of scaling in the x-direction, and a y-component Sy(x, y)Tc which indicates the degree of scaling in the y-direction. The components Sx(x, y)Tc and Sy(x, y)Tc may be considered as factors that indicate how many multiples of the unit area (a 1x1 square) the pixel (x, y)Tc should occupy in the transformed image 108 in the x and y directions respectively. For example, Sx(x, y)Tc = 1 may indicate that the pixel still occupies a unit area in the image 108. Sx(x, y)Tc = 2 may indicate that the pixel occupies 2x (i.e. double) the unit area in the image 108 in the x-direction. Sx(x, y)Tc = 0.5 may indicate that the pixel occupies 0.5x (half) the unit area in the image 108 in the x direction. The same also applies for the y-direction.
The spread coefficient estimator 3001 determines each spread coefficient as follows. The spread coefficient estimator 3001 receives the image 205 (C) and the corrected depth data/image 206 (D). For each pixel (x, y)c, the estimator 3001 calculates a surface normal of the pixel. In particular, the estimator 3001 calculates an angle ax(x, y)c indicating the angle of the surface normal in the x-direction of the image 205, and an angle ay(x, y)c indicating the angle of the surface normal in the y-direction of the image 205. Each angle ax(x, y)c and ay(x, y)c provides the angle of the surface normal relative to the position of the real camera that captured the image 205. The angles may be calculated using any known technique. For example, the angles may be calculated based on the depth d(x, y)c associated with the pixel (x, y)c, and the depths of the neighbouring or adjacent pixels in the image 205. A curve fitting procedure may be used to determine the angles.
The spread coefficient estimator 3001 also receives the transformation data/matrix 103. Using the transformation data/matrix 103, the estimator 3001 calculates the surface normal of the corresponding transformed pixel (x, y)Tc. In particular, the estimator 3001 calculates the angles ax(x, y)Tc and ay(x, y)Tc of the surface normal of the pixel (x, y)Tc, but this time relative to the position of the virtual camera. The estimator 3001 may transform the angles ax(x, y)c and ay(x, y)c based on the transformation data 103 to generate the angles ax(x, y)Tc and ay(x, y)Tc. The transformation may be applied using similar principles to the transformation performed by the pixel transformer 202, e.g. using matrix multiplication.
In an alternative example, the estimator 3001 may calculate the surface normal of the corresponding transformed pixel (x, y)Tc based on the transformed pixel data 207 (TC) and the transformed depth data 208 (TD). In particular the estimator 3001 may calculate the angles ax(x, y)Tc and ay(x, y)Tc in a similar way to the angles ax(x, y)c and ay(x, y)c described above. For example, the angles may be calculated based on the depth d(x, y)Tc associated with the pixel (x, y)Tc, and the depths of neighbouring pixels according to the transformed depth data 208. As such, the estimator 3001 may calculate the angles ax(x, y)Tc and ay(x, y)Tc without requiring the transformation data 103.
The estimator 3001 then calculates the coefficient component Sx(x, y)Tc based on the difference between the angle ax(x, y)c and the angle ax(x, y)Tc. Furthermore, the estimator 3001 calculates the coefficient component Sy(x, y)Tc based on the difference between the angle ay(x, y)c and the angle ay(x, y)Tc. The angle differences may be scaled based on the corresponding depths d(x, y)c and d(x, y)Tc of the pixels. In general, a decrease in the angle may indicate that the pixel (x, y)Tc should occupy an area of the transformed image 108 that is larger than the unit area (1x1 pixel). An increase in the angle may indicate that the pixel (x, y)c should occupy an area of the transformed image 108 that is smaller than the unit area. Where it is described that a pixel occupies an area of an image, it will be appreciated that it is intended to mean that the pixel value (i.e. the colour) associated with that pixel should be assigned to the pixel(s) in said area of the image.
In an alternative example, the estimator 3001 may determine each spread coefficient S(x, y)Tc further based on the transformed pixel data 207, as follows. For a pixel (x, y)c of the image 205, the estimator 3001 may identify one or more pixels that are neighbouring or adjacent to the pixel (x, y)c in the image 205. For example, in the x-direction, the adjacent pixels (x-1, y)c and (x+1, y)c may be identified. Each of the pixels (x-1, y)c and (x+1, y)c has coordinates that are originally one unit pixel apart from the pixel (x, y)c in the image 205. The pixels (x-1, y)c, (x, y)c and (x+1, y)c will be transformed by the pixel transformer 202 to the corresponding transformed pixels (x-1, y)Tc, (x, y)Tc and (x+1, y)Tc. However, the transformed pixels (x-1, y)Tc, (x, y)Tc and (x+1, y)Tc may no longer have coordinates that are one pixel width apart. For example, the transformed pixels (x-1, y)Tc, (x, y)Tc and (x+1, y)Tc may be relatively closer or further apart. The differences between the coordinate of the pixel (x, y)Tc and the coordinates of the pixels (x-1, y)Tc and (x+1, y)Tc can be used to generate the spread coefficient Sx(x, y)Tc in the x-direction. The spread coefficient Sx(x, y)Tc may be proportional to the differences. For example, if the coordinates of the pixels (x-1, y)Tc and (x+1, y)Tc are relatively further apart from the coordinate (x, y)Tc, then this may indicate a proportionally larger spread coefficient Sx(x, y)Tc. If the coordinates of the pixels (x-1, y)Tc and (x+1, y)Tc are relatively closer to the coordinate (x, y)Tc (e.g. they occupy the same pixel width) then this may indicate a proportionally smaller spread coefficient Sx(x, y)Tc. It will be appreciated that the spread coefficient Sy(x, y)Tc can be determined in a similar way, by considering the adjacent pixels and distances of the transformed pixels in the y-direction.
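A minimal sketch of this neighbour-spacing alternative for the x-direction is given below (the function name and the use of NumPy are assumptions):

```python
import numpy as np

def spread_coefficients_x(transformed_x: np.ndarray) -> np.ndarray:
    """Estimate the x-direction spread coefficient Sx for every pixel from the spacing
    of the transformed x coordinates of its horizontal neighbours.

    transformed_x: HxW array of the transformed x coordinate of each source pixel.
    In the source image, horizontal neighbours are exactly one pixel apart, so half
    the distance between the left and right transformed neighbours approximates how
    many output pixels each source pixel should now cover.
    """
    left = np.roll(transformed_x, 1, axis=1)
    right = np.roll(transformed_x, -1, axis=1)
    sx = np.abs(right - left) / 2.0
    sx[:, 0] = np.abs(transformed_x[:, 1] - transformed_x[:, 0])      # image borders:
    sx[:, -1] = np.abs(transformed_x[:, -1] - transformed_x[:, -2])   # one-sided estimate
    return sx
```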
The estimator 3001 outputs coefficient data 3003 (EC). The coefficient data 3003 includes the pair of coefficient components Sx(x, y)Tc and Sy(x, y)Tc for each pixel (x, y)Tc of the transformed pixel data 207.
The image builder 3002 generally corresponds to the image builder 203 in Figure 1B, and performs a similar function. However, the image builder 3002 differs in that the image builder 3002 further receives the coefficient data 3003. The image builder 3002 generates the transformed image 108 further based on the coefficient data 3003. In particular, when generating a pixel (x, y)FTc of the image 108, the image builder 3002 identifies which pixels of the image 205 fall within that pixel area, based on the transformed coordinates (x, y)Tc and the corresponding scaling coefficients Sx(x, y)Tc and Sy(x, y)Tc. Then, the pixel (x, y)FTc is generated based on those identified pixels.
It will be appreciated that the image builder 3002 may use any appropriate image scaling and/or pixel blending techniques in order to determine each pixel value V(x, y)FTc based on the spread coefficients and the values of nearby pixels (x, y)Tc. For example, any appropriate algorithm may be used, such as nearest neighbour, bi-linear interpolation and cubic interpolation.
As such, the image builder 3002 generates the transformed image 108 whilst ensuring that each pixel (x, y)Tc from the transformed pixel data 207 occupies an area of the transformed image 108 that is appropriately scaled by the spread components Sx(x, y)Tc and Sy(x, y)Tc. In particular, the value V(x, y)Tc will occupy Sx(x, y)Tc times the unit area in the image 108 in the x direction, and Sy(x, y)Tc times the unit area in the y direction, whilst being centred on the pixel coordinate (x, y)Tc.
Figure 2 shows a system 1000 according to an example of the present disclosure. The system 1000 includes a video processor 105-5 according to an example of the present disclosure. The video processor 105-5 is configured to combine images from two video streams in order to form the final output image 104.
The video processor 105-5 includes a first channel "a" and a second channel "b". The first channel receives a frame 101a of a first input video stream captured by a first real camera (i.e. camera 901a), and corresponding depth data 102a. Furthermore, the second channel receives a frame 101b of a second input video stream. The second video input stream is captured by a second real camera (i.e. camera 901b). The second real camera is positioned to capture the same scene or environment as the first real camera, but at a different position or angle to the first real camera. For example, if the scene includes a human subject in a video call setting, the first real camera 901a may be positioned to capture more of the left side of the human's face and the second real camera may be positioned to capture more of the right side of the human's face. As such, the frames 101b of the second input stream correspond to the frames 101a of the first input stream, in the sense that they show the same scene but from a different angle or perspective. Moreover, it will be appreciated that the first and the second real cameras may capture images at the same frame rates, and approximately at the same points in time.
Each of the first channel and the second channel includes an image transformation module/unit 106a and 106b, respectively. The image transformation modules 106a and 106b function similarly to the image transformation module 106 described previously, and their description will not be repeated. Any additional / alternative functionality of those blocks is described below.
The image transformation module 106a receives first transformation data 103a, which is indicative of a transformation between the position of the first real camera and the position of the virtual camera relative to the scene. The first transformation data 103a may be characterised and function similarly to the transformation data 103 described previously. Consequently, the image transformation module/unit 106a generates a first transformed image 108a which shows the scene captured in the image 101a from the perspective of the virtual camera.
The image transformation module 106b cannot use the first transformation data 103a used in the first channel, because the image 101b is captured using the second real camera that is in a different position to the first real camera. As such, the video processor 105-5 includes a transform adjustment block 6001. The transform adjustment block 6001 receives the first transformation data 103a. The transform adjustment block 6001 processes or transforms the first transformation data 103a to generate second transformation data 103b. The second transformation data 103b is indicative of a transformation between the position of the second real camera 901b and the position of the same virtual camera relative to the scene. As such, the second transformation data 103b can be used by the second channel. The second transformation data 103b can be generated using camera metadata 6006 which is received by the transform adjustment block 6001. The camera metadata 6006 is indicative of the relative position between the first and second real cameras. The transform adjustment block 6001 may transform the transformation data 103a based on the camera metadata 6006 to generate the second transformation data 103b. Consequently, the image transformation module 106b is able to generate a second transformed image 108b, which shows the scene captured in the image 101b from the perspective of the same virtual camera. The first transformed image 108a and the second transformed image 108b are different estimates of the scene from the perspective of the virtual camera. The image transformation module 106b may use camera parameters of the second real camera to transform the pixel coordinate, such as the focal length and/or the centre point of the second real camera.
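Assuming the transformation data and the camera metadata are both expressed as 4x4 homogeneous matrices, the transform adjustment block could derive the second transformation data roughly as follows (the matrix conventions are assumptions of this sketch):

```python
import numpy as np

def adjust_transformation(T_a: np.ndarray, M_b_to_a: np.ndarray) -> np.ndarray:
    """Derive the second transformation data (103b) from the first transformation
    data (103a) and the camera metadata (6006).

    T_a:      4x4 transform from the first real camera's frame to the virtual camera.
    M_b_to_a: 4x4 transform from the second real camera's frame to the first real
              camera's frame (one possible reading of the camera metadata).
    """
    # A point expressed in camera-b coordinates is first moved into camera-a
    # coordinates, then into the virtual camera's coordinates.
    return T_a @ M_b_to_a
```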
Since the first real camera 901a and the second real camera 901b capture the same scene from different angles, some areas of the scene may be better visible to the second real camera 901b or the first real camera 901a. As such, the first transformed image 108a and the second transformed image 108b are combined by an image combiner 6004 to generate a more complete and higher quality combined image 6003.
It will be appreciated that although only two channels are shown in Figure 2, more than two channels are also envisaged. For example, additional channels may receive additional video streams from additional cameras and process the video streams in a similar way to the first and the second channels described above, and the image combiner 6004 may generate the image 6003 further based on the additional transformed pixel data generated by the additional channels (e.g. by selecting the pixel with the highest quality score, or blending the pixels proportionally to the quality score).
The first and second real cameras 901a and 901b may be set up as follows. For example, with a subject sitting in front and using a typical 24-28 inch display unit (e.g. monitor), it may be desired to position the virtual camera anywhere across a 40cm distance in the x direction. As such, the real cameras can be set up on either side of the 40cm space to capture the scene from different angles. As a result, the real cameras will capture sufficient image content to generate the final output image 104, whilst putting less burden on the inpainter 204.
It will be appreciated that the cameras can be arranged independently of one another and belong to different electronic devices (e.g., different webcam units or video conferencing cameras). For example, the cameras can be positioned in a meeting room to capture attendees from different angles.
It will be appreciated that the positioning of the first and second real cameras 901a and 901b may differ from a pair of stereo cameras. Stereo cameras may be placed intentionally close to one another, to capture the scene from almost the same angle in order to determine the depth of the scene. However, the first and second real cameras are preferably placed at a greater distance apart, in order to capture the same scene from different angles and/or positions, so that each camera may intentionally capture different views of the scene.
Figure 2 further shows how the outputs of the first and the second cameras 901a and 901b may be used to determine the corresponding depth data 102a and 102b for each channel. As described previously, the cameras 901a and 901b are spaced at a sufficient distance apart (e.g. at the extremities of the region in the scene where the virtual camera is likely to be placed), in order to capture more pixel data of the scene. As a result, although the cameras 901a and 901b capture the same scene from different angles, the images captured by the cameras 901a and 901b will not contain the same parts of the scene. Rather, there may be significant parts of the image 101a that are not present in the image 101b. As such, stereo depth estimation techniques may not be suitable to determine the complete depth data 102a and 102b from the outputs of the cameras 901a and 901b.
The system 1000 therefore includes a depth estimator 1001 which receives each of the images 101a and 101b. The depth estimator 1001 also receives the camera metadata 6006, which indicates the relative positions of the cameras 901a and 901b as described above. The depth estimator 1001 generates preliminary depth data/image 1002a and preliminary depth data/image 1002b based on the images 101a and 101b, and the camera metadata 6006.
Using the camera metadata 6006, the depth estimator 1001 identifies pairs of pixels of the images 101a and 101b that correspond to the same location in the scene. In particular, for each pixel of the image 101a, the depth estimator 1001 determines whether there is a corresponding pixel in the image 101b that captures the same location within the scene. The corresponding pixel may have a different coordinate to the pixel of the image 101a. The depth estimator 1001 makes this determination based on the camera metadata 6006. If a corresponding pair of pixels is found, then the depth estimator 1001 determines the depth of the pixel in the image 101a based on the difference between the coordinate of the pixel in the image 101a and the coordinate of the corresponding pixel in the image 101b. Furthermore, the depth of the corresponding pixel in the image 101b is also determined in a similar way. The depth estimator 1001 outputs preliminary depth data 1002a which includes the depths of the pixels of the image 101a for which a corresponding pixel was found. Furthermore, the depth estimator 1001 outputs preliminary depth data 1002b which includes the depths of the corresponding pixels of the image 101b.
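One conventional way to realise this determination, shown purely as an illustrative sketch, is the rectified-stereo relation depth = focal length x baseline / disparity, assuming the focal length and baseline can be derived from the camera metadata 6006; the rectification assumption, the function name and the NumPy dependency are not mandated by the disclosure.

```python
import numpy as np

def depth_from_correspondence(x_a, x_b, focal_length_px, baseline_m):
    """Estimate depth from corresponding pixel x-coordinates in images 101a/101b.

    Uses the classical rectified-stereo relation depth = f * B / disparity,
    where the disparity is the coordinate difference between the pixel in
    image 101a and its corresponding pixel in image 101b.
    """
    disparity = np.abs(np.asarray(x_a, dtype=np.float64) -
                       np.asarray(x_b, dtype=np.float64))
    depth = np.full_like(disparity, np.nan)   # NaN where disparity is zero
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth
```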
Since the cameras 901a and 901b have different views of the scene, not all pixels in the image 101a will have a corresponding pixel in the image 101b. As such, the preliminary depth data 1002a may be missing depth values for pixels in the image 101a, and the preliminary depth data/image 1002b may be missing depth values for pixels in the image 101b. Therefore, the system 1000 includes a first depth inpainter (also referred to as a depth hole filler) 1003a. The depth hole filler 1003a is configured to identify erroneous pixels in the preliminary depth data 1002a, in particular pixels in the preliminary depth data 1002a that have missing depth values, and determine depth values to fill, or inpaint, the missing depth values. In particular, the first depth hole filler 1003a receives the preliminary depth data 1002a and the image 101a. The first depth hole filler 1003a determines depth values to inpaint the missing depth values based on the known depth values indicated in the preliminary depth data 1002a. The first depth hole filler 1003a then outputs the first depth data/image 102a corresponding to the image 101a. Similarly, the second depth hole filler/inpainter 1003b is configured to identify erroneous pixels in the preliminary depth data 1002b, in particular pixels in the preliminary depth data 1002b that have missing depth values, and determine depth values to fill, or inpaint, the missing depth values, in a similar way to the first depth hole filler/inpainter 1003a. As such, the second depth hole filler 1003b outputs second depth data/image 102b corresponding to the image 101b. As described below, each of the depth inpainters 1003a and 1003b may inpaint the corresponding depth images 1002a / 1002b based on previous depth images 1002a / 1002b.
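An illustrative sketch of a depth hole filler such as 1003a/1003b is given below. It assumes missing depth values are encoded as NaN and propagates known depth values inwards from neighbouring pixels; this simple neighbourhood scheme is chosen for illustration only.

```python
import numpy as np

def fill_depth_holes(depth, max_iters=100):
    """Iteratively fill missing (NaN) depth values from neighbouring known
    depths, in the spirit of the depth hole fillers 1003a / 1003b."""
    filled = depth.astype(np.float64)
    for _ in range(max_iters):
        missing = np.isnan(filled)
        if not missing.any():
            break
        padded = np.pad(filled, 1, constant_values=np.nan)
        valid = ~np.isnan(padded)
        vals = np.where(valid, padded, 0.0)
        # Sum and count of the four axis-aligned neighbours of every pixel.
        nb_sum = (vals[:-2, 1:-1] + vals[2:, 1:-1] +
                  vals[1:-1, :-2] + vals[1:-1, 2:])
        nb_cnt = (valid[:-2, 1:-1].astype(np.float64) + valid[2:, 1:-1] +
                  valid[1:-1, :-2] + valid[1:-1, 2:])
        # Only fill missing pixels that have at least one known neighbour.
        update = missing & (nb_cnt > 0)
        filled[update] = nb_sum[update] / nb_cnt[update]
    return filled
```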
Advantageously, the system 1000 does not require any dedicated hardware (e.g. stereo camera pairs or time of flight units) in order to determine the depths of each pixel in the images 101a and 101b. Rather, the two-camera setup 901a and 901b can be used to determine the depth values, whilst achieving the advantages of the two-channel video processor 105-5.
Optionally, the video processor 105-5 may be modified to only use the colour frame and depth data from one source camera, for example 101a and 102a, and either ignore, or not receive, the colour frame and depth data from the other source camera, for example 101b and 102b. In such an arrangement, the depth estimator 1001 may use the depth data 1002a and 1002b from both cameras 901a and 901b, as shown in Fig. 2. Optionally, depth inpainting may be conducted on only one depth stream, for example 1002a, and the other depth inpainter 1003b may be omitted. In this case, the view transformer 105-5 may be configured in the same way as any of the view transformers shown in Fig. 1A, Fig. 1B or Fig. 1C (e.g. 105 or 105-2), or any other view transformer implementation that has been configured to transform views using a single camera stream.
Figure 3 shows an inpainter 301 according to an example of the present disclosure. The inpainter 301 is configured to inpaint an input video stream in real time. The inpainter 301 receives each input image 302 of the input video stream. Each input image 302 may be a colour image, greyscale image or a depth image/map. The input video stream shows a view of a scene comprising an object (e.g. a human face/head) from a first camera position. As such, each input image 302 shows the object from the first camera position. Moreover, each input image 302 will comprise a region of pixels corresponding to the object (i.e. that defines the object in the image).
The inpainter 301 estimates a pose of the object. A pose of the object is understood as the location and orientation of the object in the image, i.e. relative to a fixed reference, for example the orientation of the object relative to the position of the first real camera, or relative to a virtual camera position. The inpainter 301 generates a pose matrix 306 indicative of the pose of the object in the input image 302. In particular, an object detector 303 detects the object in the input image 302, and outputs information 304 indicative of the location of the object in the 2D image. An object pose estimator 305 then estimates the pose matrix 306 based on the information 304 and the input image 302. The pose matrix 306 indicates a translational position (i.e. location in 3D space) of the object that is shown in the input image 302. In particular, the pose matrix 306 indicates the translational position in each of the x, y and z planes, relative to the fixed reference (eg, the first real camera position). The pose matrix 306 also indicates a rotational position (i.e. orientation) of the object that is shown in the input image 302. In particular, the pose matrix 306 indicates the rotational position about each of the x, y and z axes, relative to the fixed reference first camera position. The pose matrix 306 may be considered as a 6D pose matrix, as understood by the person skilled in the art. For example, the pose matrix 306 may be a 4x4 matrix, made of a 3x3 rotational component indicative of the rotational position of the object relative to the fixed reference, and a translational component indicative of the translational position of the object relative to the fixed reference.
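By way of illustration only, a 4x4 pose matrix of the kind described above may be assembled as sketched below; the Euler-angle composition order and the NumPy dependency are arbitrary choices made for the example and are not fixed by the disclosure.

```python
import numpy as np

def rotation_from_euler(rx, ry, rz):
    """Rotation about the x, y and z axes (radians), composed as Rz @ Ry @ Rx.
    The composition order is an illustrative choice only."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return rot_z @ rot_y @ rot_x

def pose_matrix(rotation_3x3, translation_xyz):
    """Assemble a 4x4 pose matrix (such as the pose matrix 306) from a 3x3
    rotational component and a 3-vector translational component, both defined
    relative to the fixed reference."""
    pose = np.eye(4)
    pose[:3, :3] = np.asarray(rotation_3x3, dtype=np.float64)
    pose[:3, 3] = np.asarray(translation_xyz, dtype=np.float64)
    return pose
```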
The object detector 303 and the object pose estimator 305 may be implemented using any known techniques in the art. In some examples, the object pose estimator 305 may detect one or more dominant features of the object that have known locations relative to one another, and detect the pose based on the detected features. For example, if the object is a face or head, then the pose estimator 305 may detect the locations of features such as the eyes, mouth and/or nose. Then, the features may be compared to a model of what a face/head should look like, and the pose of the face/head determined based on the comparison. The object pose estimator 305 may be implemented using a neural network trained to achieve the above described functionality. The neural network may be used to perform the entire functionality of the object pose estimator 305, or may be used to perform an individual function of the pose estimator 305 (e.g. only to detect the features of the object).
The inpainter 301 is configured to generate a reference image 313 of the object. The reference image 313 is generated so that the object in the reference image 313 aligns with the corresponding object in the input image 302. The reference image 313 is generated based on one or more previous input images 302 previously received at the inpainter 301. In particular, a previous input image 309 (also referred to from hereon as an initial reference image 309, or a reference image 309) is obtained from a reference view store 308. The initial reference image 309 may be one of a number of previous input images that are available in the reference view store 308. The initial reference image 309 may be obtained at a time Time 1. The initial reference image 309 may show the object from a reference camera position that is different to the first camera position. A reference pose matrix 310 may also be received from the reference view store 308. The reference pose matrix 310 may be indicative of the pose of the object in the initial reference image 309, i.e. relative to the reference camera position. The reference pose matrix 310 may be characterised similarly to the pose matrix 306, and so is not described in detail.
The inpainter 301 may generate the reference image 313 by transforming the previous input image 309 to align the transformed previous input image with the input image 302 so that the object in the reference image 313 aligns with the object in the input image 302. For example, the previous input image 309 may be transformed so that the pose of the object in the transformed previous input image(s) (eg, the reference image 313) matches the pose of the object in the input image 302. A matched pose between the reference image 313 and the input image 302 does not necessarily mean that the pose of the object(s) in the reference image 313 and the input image 302 is identical, but that they are as close to being the same as the inpainter 301 can reasonably achieve (eg, the reference image 313 is generated in order to match the pose of the object in the reference image 313 as closely as possible to the pose of the object in the input image 302). A pose difference calculator 311 may receive the input pose matrix 306 and the reference pose matrix 310. The pose difference calculator 311 then generates a transformation matrix 316 based on a difference between the input pose matrix 306 and the reference pose matrix 310. The transformation matrix 316 is indicative of a 3D transformation between the pose of the object in the reference image 309 and the pose of the object in the input image 302. A 3D view transformer 312 applies the 3D transformation to the reference image 309 using the transformation matrix 316 to produce the aligned reference image 313. The 3D view transformer 312 may function similarly to the image transformation module 106 previously described, and so is not described in detail. However, any other type of 3D view transformer may be used. It will be appreciated that the reference images 309/313 may be colour images, greyscale images or depth images/maps, in correspondence with the type of the input image 302.
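A minimal sketch of the pose difference calculator 311 is given below, assuming each 4x4 pose matrix maps object coordinates into the coordinates of its respective (real or virtual) camera view; this convention, like the function name, is illustrative only.

```python
import numpy as np

def pose_difference(input_pose: np.ndarray, reference_pose: np.ndarray) -> np.ndarray:
    """Compute a transformation matrix (cf. matrix 316) that maps points posed
    as in the reference image 309 onto the pose of the object in the input
    image 302, given 4x4 object-to-camera pose matrices for each view."""
    return input_pose @ np.linalg.inv(reference_pose)
```

The 3D view transformer 312 would then apply this transformation to the initial reference image 309 to produce the aligned reference image 313.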
Alignment of the object in the reference image 313 and the input image 302 means that features in the reference image 313 (such as eyes, ears and/or a nose in the case of the object being a face) are aligned with the same features in the input image 302. The term "aligned" as used herein means that features are located at the same pixel positions in the image 302 and the reference image 313, or within an acceptable pixel distance such as within 10 pixels, or within 5 pixels, or within 3 pixels, or within 2 pixels. Therefore, an object in the reference image 313 that is aligned with the same object in the input image 302 can be found at a pixel position(s) in the reference image 313 that is within a predetermined acceptable distance of its pixel position(s) in the input image 302. As will be well understood by the skilled person, a 'feature' of an object in an image is a point in the image that is uniquely describable or traceable, such as eyes in a face, or the corner of a whiteboard, or a logo on a phone, etc.
An image stitcher 314 then inpaints or fills the input image 302 based on the reference image 313. In particular, the image stitcher 314 generates an output image 315 by inpainting or filling erroneous pixels (for example, missing pixels that do not have a value) based on the reference image 313. For example, the image stitcher 314 may identify pixels of the reference image 313 that correspond in location to the erroneous pixels of the input image 302. The image stitcher 314 may then fill/inpaint the erroneous pixels of the input image 302 using the correspondingly located pixels of the reference image 313. For example, the image stitcher 314 may assign the values of the corresponding pixels of the reference image 313 to the erroneous pixels of the input image 302 to provide a complete output image 315.
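A simplified sketch of this fill operation is shown below, assuming the erroneous pixels are identified by a boolean mask of the same height and width as the images; the names and the NumPy dependency are illustrative only.

```python
import numpy as np

def stitch(input_image, reference_image, erroneous_mask):
    """Image stitcher 314 (simplified): copy correspondingly located pixels of
    the aligned reference image 313 into the erroneous pixels of the input
    image 302 to produce the output image 315."""
    output = input_image.copy()
    output[erroneous_mask] = reference_image[erroneous_mask]
    return output
```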
The inpainter 301 outputs each output image 315 to generate an output video stream. The above-described functionality is performed for each input image of the input video stream. As such, the inpainter 301 is able to inpaint the images of the input video stream in real time.
Optionally, after generating the output image 315, the current input image 302 is received at the reference view store 308, e.g. at a time Time 2. The reference view store 308 may update the available reference images in the reference view store 308 (or the initial reference image 309) based on the current input image 302.
As discussed above, the input video stream may be a stream of colour (or greyscale) images. In other words, each input image 302 may be a colour (or greyscale) image. As such, the inpainter 301 can be used as the colour inpainter 204 of the video processor 105, where the inpainter 301 receives the transformed image 108 (FTC) as the input image 302. In some examples, if the input image 302 is an input colour (or greyscale) image, the inpainter 301 may also receive a corresponding input depth image/map that is indicative of the depth of each pixel of the input image 302. E.g. if the input colour image 302 is the transformed image 108 (FTC), the inpainter 301 may also receive the corresponding transformed depth image. In some examples, the object detector 303 and the object pose estimator 305 may perform their functions using the input depth image instead of the input colour image 302, or may use the input depth image in addition to the input colour image 302. In some examples, when generating the reference colour image 313, the inpainter 301 may further obtain a reference depth image from the reference view store 308 that is indicative of the depth of each pixel in the initial reference colour image 309 (eg, the previous input image 309 that is selected from the reference view store 308). The 3D view transformer 312 may transform the initial reference colour image 309 further based on the reference depth image, e.g. as described in connection with the image transformer 106. At Time 2, the reference view store 308 may further receive the input depth image corresponding to the input image 302, which may be stored or processed by the reference view store 308 to update the available reference images and corresponding reference depth images.
As discussed above, when the input frame 302 is a colour (or greyscale) image, the colour or greyscale image may be accompanied with a corresponding depth map image. The depth map information may be used in parts of the inpainter 301 to aid with view transformation and/or other functions. For example, depth map information/data may be stored and retrieved from the reference view store 308 and used in view transformation 312 as described above, to aid the colour image inpainting.
As discussed above, the input video stream may alternatively be a stream of depth images/maps. In other words, each input image 302 may be a depth image/map. As such, the inpainter 301 can be used as the depth inpainter 201 of the video processor 105, where the inpainter 301 receives the depth frame 102 as the input image 302. The inpainter 301 can also be used as the depth inpainters 1003a and 1003b, where the inpainter 301 receives the depth frame 1002a or 1002b as the input image 302. In some examples, if the input image 302 is an input depth image, the inpainter 301 may also receive a corresponding input colour (or greyscale) image. E.g. if the input image 302 is the depth image 102, 1002a or 1002b, the inpainter 301 may also receive the corresponding colour image 101, 101a or 101b. In some examples, the object detector 303 and the object pose estimator 305 may perform their functions using the input colour image instead of the input depth image 302, or may use the input colour image in addition to the input depth image 302. In some examples, when generating the reference depth image 313, the inpainter 301 may further obtain a reference colour image from the reference view store 308, where the initial reference depth image 309 (eg, the previous input image 309 that is selected from the reference view store 308) is indicative of the depth of each pixel in the reference colour image. The 3D view transformer 312 may transform the initial reference depth image 309 further based on the reference colour image, e.g. as described in connection with the image transformer 106. At Time 2, the reference view store 308 may further receive the input colour image, which may be stored or processed by the reference view store 308 to update the available reference depth images and corresponding reference colour images.
As discussed above, when the input frame 302 is a depth image/frame/map, the depth image/frame/map may be accompanied with a corresponding colour or greyscale image. The colour or greyscale image may be used in parts of the inpainter 301 to aid with pose estimation, image alignment and stitching and/or other functions. For example, as described above, colour or greyscale frames may be used by the object pose estimator 305 and/or the pose difference calculator 311 and/or stored in and retrieved from the reference view store 308 and/or transformed using the view transformer 312, and/or used by the image stitcher 314, to aid the depth image/frame/map inpainting.
In some examples, there may be multiple objects shown in the input video stream and therefore in the input image 302 (e.g. a hand in addition to a face/head). In such cases, multiple instances of the inpainter 301 may be arranged in parallel. Each inpainter may function to inpaint a different object shown in the input video stream.

Figure 4 shows an example implementation of the reference view store 308. The reference view store 308 comprises a database 403. The database 403 stores a plurality of previous input images 302 of the input video stream that were previously received at the inpainter 301. For each previous input image 302, the database 403 also stores a corresponding previous input pose matrix which is indicative of the pose of the object in the previous input image.
To obtain the initial reference image 309, a previous input image is selected from the database 403 as the initial reference image 309. A reference view selector 401 may request a list of the previous input images available in the database 403, e.g. via a communication line 404. The reference view selector 401 then selects the previous input image that is most suitable for inpainting the input image 302, in particular for inpainting the erroneous/missing pixels of the object in the input image 302. In one example, for each previous input image, the reference view selector 401 determines a degree of similarity between the pose of the object in the previous input image and the pose of the object in the input image 302. In particular, the reference view selector 401 may determine the degree of similarity based on a difference between the input pose matrix 306 and the previous input pose matrix corresponding to the previous input image. The reference view selector 401 then selects the previous input image in which the pose of the object has the highest degree of similarity with the pose of the object in the input image 302 (eg, the previous image that is most suitable for inpainting the input image 302 is the previous image in which the pose of the object has the highest degree of similarity with the pose of the object in the input image 302). The selected previous input image is provided as the initial reference image 309. Furthermore, the previous input pose matrix corresponding to the selected previous input image is provided as the reference pose matrix 310.
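An illustrative sketch of the reference view selector 401 is given below. The particular similarity measure, a weighted sum of the rotation angle and the translation distance between 4x4 poses, is one possible choice and is not mandated by the disclosure.

```python
import numpy as np

def pose_distance(pose_a, pose_b, rotation_weight=1.0, translation_weight=1.0):
    """Smaller distance corresponds to a higher degree of similarity between
    two 4x4 pose matrices."""
    relative_rotation = pose_a[:3, :3] @ pose_b[:3, :3].T
    # Geodesic angle between the two rotational components.
    angle = np.arccos(np.clip((np.trace(relative_rotation) - 1.0) / 2.0, -1.0, 1.0))
    shift = np.linalg.norm(pose_a[:3, 3] - pose_b[:3, 3])
    return rotation_weight * angle + translation_weight * shift

def select_reference(input_pose, previous_images, previous_poses):
    """Reference view selector 401 (sketch): pick the previous input image whose
    object pose is most similar to the pose in the current input image 302."""
    distances = [pose_distance(input_pose, pose) for pose in previous_poses]
    best = int(np.argmin(distances))
    return previous_images[best], previous_poses[best]
```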
Optionally, at Time 2, the reference view database 403 is updated to include the current input image 302 (and the current input pose matrix 306), so that the current input image 302 is available as a previous input image for future iterations (i.e. when the inpainter 301 receives the next input image 302). A database updater 402 may receive the current input image 302 and current input pose matrix 306. The database updater 402 may then communicate with the database 403, e.g. via communication lines 405, to update the database 403 with the current input image 302 and the current input pose matrix 306. For example, the database updater 402 may query the list of previous input images available in the database 403. The database updater 402 may then decide whether or not to update the database 403 to include the current input image 302 and current input pose matrix 306. The decision may be based on one or more factors relating to the current state of the database 403 (e.g. based on the number of previous input images currently in the database 403, the relative object pose matrix values in the database versus the current input pose matrix, an amount of time elapsed since those previous input images were stored in the database 403, or other factors described herein). In examples where the input image 302 is an input colour image, then for each previous input colour image in the database 403, the database 403 may also store a corresponding previous input depth image indicative of the depth of the pixels in the previous input colour image. The reference view selector 401 may provide the previous input depth image that corresponds to the selected previous input colour image as the reference depth image, which may then be used by the 3D view transformer 312 as discussed above. At Time 2, the database updater 402 may further receive the current input depth image corresponding to the current input colour image 302. The database updater 402 may store the current input depth image in the database 403, along with the current input colour image 302, so that the corresponding input depth image is available as a reference depth image for future iterations.
In examples where the input image 302 is an input depth image, then for each previous input image in the database 403, the database 403 may also store a corresponding previous input colour image, where the previous input depth image 302 is indicative of the depth of the pixels in the previous input colour image. The reference view selector 401 may provide the previous input colour image corresponding to the selected previous input depth image as the reference colour image, which may then be used by the 3D view transformer 312 as discussed above. At Time 2, the database updater 402 may further receive the current input colour image corresponding to the current input depth image 302. The database updater 402 may store the current input colour image in the database 403, along with the current input depth image 302, so that the corresponding input colour image is available as a reference colour image for future iterations.
Figure 5 shows an example implementation of the reference view database 403. The reference view database 403 stores a set of previous input images 403a, 403b, 403c, 403d. Each previous input image is associated with a different predefined pose of the object. The predefined poses preferably correspond to the most extreme possible or most extreme expected poses of the object in the input video stream, relative to the fixed reference. For example, in a video call scenario where the object is a human face/head, then the most extreme expected poses may be when the face/head is angled to the upper-left, upper-right, lower-left or lower-right of the image. As such, the image 403a is associated with a pose in which the face/head is angled to the upper-left of the image. The image 403b is associated with a pose in which the face/head is angled to the upper-right of the image. The image 403c is associated with a pose in which the face/head is angled to the lower-left of the image. The image 403d is associated with a pose in which the face/head is angled to the lower-right of the image. Each previous input image may be further stored in association with a predefined pose matrix, which is indicative of the associated predefined pose of the object. Advantageously, the database 403 holds a diverse range of reference views of the object, which will improve the inpainter's 301 ability to inpaint the input image 302. It will be appreciated that the database 403 may hold any number of previous input images associated with any number of predefined poses of the object. Where the object is a head/face, the four views described above may be sufficient to provide a diverse range of reference images of the object.
The reference view selector 401 selects the previous input image 403a-403d from the database 403 that is most suitable for inpainting the image 302. In some examples, the reference view selector 401 selects one of the previous input images 403a-403d as previously described above. For example, in the example input image 602a, the face/head is facing towards the bottom-left of the image. As such, the reference view selector 401 may select the previous input image stored in the database 403 in which the pose of the object is most similar to the pose of the object in the image 602a (e.g. the previous input image 403c may be selected), as described above.
In other examples, the reference view selector 401 may select the initial reference image 309 instead based on the predefined poses. The reference view selector 401 may request a list of the predefined poses, e.g. via a communication line 404. In particular, the reference view selector 401 may receive the predefined pose matrices from the database 403. The reference view selector 401 may identify which predefined pose has a highest degree of similarity to the pose of the object in the image 302. For example, for each predefined pose, the reference view selector 401 may determine a degree of similarity between the predefined pose and the pose of the object in the input image 302, by comparing the predefined pose matrix to the input pose matrix 306. The reference view selector 401 may then select the previous input image from the database that is associated with the identified predefined pose. For example, the previous image 403c is associated with a predefined pose in which the face/head is facing the bottom-left of the image. The reference view selector 401 may determine that in the image 602a, the pose of the face/head is most similar to the predefined pose in which the face/head is facing to the bottom left of the image. The reference view selector 401 may identify that the previous image 403c in the database 403 is associated with that predefined pose. The reference view selector 401 may therefore select the previous image 403c as the initial reference image 309. Furthermore, the previous input pose matrix associated with the selected previous input image is provided as the reference pose matrix 310. Optionally, if the previous input image is a previous colour input image, then a corresponding previous input depth image may also be provided from the database 403 as described above. Alternatively, if the previous input image is a previous depth input image, then a corresponding previous input colour image may also be provided from the database 403 as described above.
Optionally, at Time 2, the selected previous input image (e.g. 403c) is replaced in the database 403 by the current input image 302, if the pose of the object in the current input image 302 has a higher degree of similarity to the identified predefined pose than the pose of the object in the selected previous input image. For example, the database updater 402 may determine a first degree of similarity between the predefined pose that is associated with the selected previous input image (e.g. 403c) and the pose of the object in the selected previous input image, by comparing the predefined pose matrix with the previous input pose matrix corresponding to the selected previous input image. The database updater 402 may also determine a second degree of similarity between the predefined pose that is associated with the selected previous input image (e.g. 403c) and the pose of the object in the current input image 302, by comparing the predefined pose matrix with the current input pose matrix 306 corresponding to the current input image 302. The database updater 402 may then replace the selected previous input image (e.g. 403c) in the database 403 with the input image 302 if the second degree of similarity is higher than the first degree of similarity. Furthermore, the current input pose matrix 306 may be stored in the database 403 in association with the current input image 302. Optionally, if the current input image 302 is a colour input image, then the corresponding current input depth image may also be stored in the database 403 in association with the current input image 302, as described above. Alternatively, if the current input image 302 is a depth input image, then the corresponding current input colour image may also be stored in the database 403 in association with the current input image 302, as described above. Advantageously, over time, the database 403 maintains an up-to-date list of the most extreme poses of the object.
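A minimal sketch of this replacement decision is given below; the similarity measure used here, the negated Frobenius norm of the difference between 4x4 pose matrices, is an illustrative choice only and is not mandated by the disclosure.

```python
import numpy as np

def should_replace(predefined_pose, stored_pose, current_pose):
    """Database updater 402 (sketch): return True if the object pose in the
    current input image 302 is closer to the predefined pose than the pose in
    the stored previous input image, so the stored image should be replaced."""
    first_similarity = -np.linalg.norm(predefined_pose - stored_pose)
    second_similarity = -np.linalg.norm(predefined_pose - current_pose)
    return second_similarity > first_similarity
```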
Figure 6 illustrates how the inpainter 301 may be used to implement the colour inpainter 204 of the video processor 105 according to an example. Figure 6 shows the image transformation module 106, which receives the colour image 205 and outputs a transformed colour image 108 as previously described. As described previously, the image 205 is captured using a real camera. In the illustrated example, the pose 601a of the face/head in the image 205, from the perspective of the real camera, may be towards the right of the image 205. The transformed image 108 shows an estimate of the scene, including the face/head, from the perspective of a virtual camera that is in a different position to the real camera. The pose 602a of the face/head in the transformed image 108, from the perspective of the virtual camera, may be to the left of the image 108. The input image 302 received by the inpainter 301 corresponds to the transformed colour image 108, which shows the scene and object from the virtual camera position.
Since the image 205 has been captured by a real camera, the image 205 may not contain any erroneous pixels. However, since the image 108 corresponds to a new (estimated) virtual view of the scene, the image 108 will contain erroneous pixels that require inpainting, as described previously. In particular, in the present example, the right side of the face/head was not originally visible in the captured image 205. However, the image 108 requires visibility of the right side of the face/head, and so the image 108 is likely to contain erroneous pixels (particularly missing pixels) on the right side of the face/head, which can be inpainted using the inpainter 301 described herein. As described above, the reference view selector 401 is likely to select the previous input image 403c as the initial reference image 309, which provides visibility of the right side of the face/head, and is therefore suitable for inpainting the image 302.
In some examples, the inpainter 301 may further receive the transformation matrix 103. The reference view selector 401 may select a previous input image from the database 403 further based on the transformation matrix 103, to assist in determining the most suitable previous input image. In particular, rather than the most suitable previous input image being the previous input image that has an object pose that is most similar to the object pose in the input image 302, it may instead be some other previous input image. In particular, the most suitable previous input image may be the previous input image that is most likely to include information that is useful for inpainting the one or more erroneous pixels. If the input image 302 is a transformation of an image captured by a real camera (eg, the input image is a visualisation of an image captured by a virtual camera), the most suitable previous image may not be the previous image that includes an object pose that is most similar to the object pose in the input image 302. Instead, it may be a previous image that includes the most information about the erroneous pixels in the transformed input image that have resulted from the view transformation done by the transformer 106. For example, it may be the previous image that includes an object with a pose that is most similar to the object pose in the input image 302, but lying on an extrapolation of the transformation that was performed to create the input image 302 from the real camera image. In other words, if the image captured by the real camera includes an object with a pose pointing towards the top right of the image, and in the input image 302 the object has been transformed to a pose pointing towards the bottom left of the image, an extrapolation of that transformation would be a pose pointing even more towards the bottom left of the image. In this particular example, the most suitable previous image may be the previous image that includes an object with a pose that is most similar to the pose of the object in the input image 302, but that also points even more towards the bottom left of the image. This might be the most suitable previous frame because the object in that previous frame is more likely to include useful pixel information than a previous frame that includes an object with a pose that is similar to that of the object in the input image 302, but pointing a bit less towards the bottom left of the frame.
Therefore, the reference view selector may base the selection of the previous input image on the transformation of the pose of the object in the input frame 302 compared with the pose of the object in the frame captured by the real camera. This may be achieved by using the transformation matrix 103, or by receiving both the input frame 302 and the corresponding frame captured by the real camera. Although Figure 6 illustrates use of the inpainter 301 of Figure 5, it will be appreciated that any inpainter of the present disclosure may be used in combination with the features described above in connection with Figure 6. Moreover, the view transformer 106 may receive and transform a depth image (e.g. depth image 206) instead of a colour image, and an inpainter for inpainting an input depth image 302 may be used.
Figure 7 illustrates how the inpainter 301 may be used to implement the colour inpainter 204 of the video processor 105 according to another example. The example shown in Figure 7 is similar to the example of Figure 6, with the differences described below.
In the example of Figure 7, the pose matrix 306 is generated using the colour image 205 instead of input image 302/108. In particular, the object detector 303 detects the object in the colour image 205, and outputs information 304 indicative of the object in the colour image 205. The object pose estimator 305 then estimates an initial pose matrix 701 based on the information 304 and the colour image 205. The pose matrix 701 indicates the translational position (i.e. location) of the object and the rotational position (i.e. orientation) of the object in the colour image 205, i.e. relative to the fixed reference, for example the real camera position/perspective. A pose transformer 702 is then used to transform the initial pose matrix 701 using the transformation matrix 103 to generate the pose matrix 306. The transformation may be performed, for example, by performing a matrix multiplication between the matrix 701 and the transformation matrix 103. As a result of the transformation, the pose matrix 306 will indicate the pose of the object in the input image 302, i.e. relative to the virtual camera position/perspective. Any known technique may be used to perform the pose transformation.
In the example of Figure 7, the database updater 402 receives the colour image 205 and the initial pose matrix 701, instead of the input image 302 and the input pose matrix 306. The database updater 402 may also receive the depth image corresponding to the colour image 205 (e.g. the depth image 206), instead of the input depth image corresponding to the input colour image 302. The database updater 402 functions as described above, but instead using the colour image 205 and initial pose matrix 701 (and optionally the depth image corresponding to the colour image 205, e.g. the depth image 206). Advantageously, the colour image 205 will not have any erroneous or missing pixels since it is captured by a real camera. Therefore, the reference images obtained from the reference view store 308 will be of relatively higher quality, and therefore result in improved inpainting functionality.
Although Figure 7 illustrates features of the inpainter 301 of Figure 5, it will be appreciated that any inpainter of the present disclosure may be used in combination with the features described above in connection with Figure 7. Moreover, the view transformer 106 may receive and transform a depth image (e.g. depth image 206) instead of a colour image, and an inpainter for inpainting an input depth image 302 may be used.

Figure 8 shows a further variation of the example shown in Figure 7. In some scenarios, especially when inpainting an input colour image, the erroneous pixels are grouped in a contiguous region (i.e. a hole) in the image 302. However, the location of the hole may not always be apparent. For example, if the hole is located at an edge of the object, then it may not always be apparent which of the pixels in the hole region should correspond to a background of the scene, and which of the pixels belong to the object. In the illustrated example, it is expected to see more of the right side of the face/head as shown in the image 602a. However, it may not be clear whether that region of pixels should correspond to the background of the scene behind the face/head, or to the face/head itself.
In the present example, the inpainter 301 estimates the locations of the erroneous/missing pixels in the input image 302. A hole estimator 801 estimates the locations based on the pose matrix 306 and the transformation matrix 103. The hole estimator 801 then generates a mask 802 that is indicative of the locations of the erroneous/missing pixels. The image stitcher 314 can then identify the erroneous pixels based on the mask.
In some examples, the hole estimator 801 estimates the locations of the erroneous pixels (eg, the missing pixels) by detecting an edge of the object in the input image 302, and estimating a likelihood that a region of pixels of the object at the edge is missing from the input image 302.
In some examples, the hole estimator 801 estimates the locations of the erroneous pixels (eg, the missing pixels) based on the spread coefficients generated in the 3D view transformer 106, as described above. It has been identified that pixels at the edge of a hole have very large spread coefficients. As such, for each pixel of the image 302, the hole estimator 801 can check whether the corresponding spread coefficient is above a threshold (e.g. 4). If it is above the threshold, then that pixel can be identified as being at an edge of the hole. As such, the hole estimator 801 will be able to identify the hole, i.e. a region of erroneous pixels, and generate a mask 802 indicative of the locations of those pixels.
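An illustrative sketch of this spread-coefficient based hole estimation is shown below; the use of SciPy morphology to close and fill the detected edge contour is an implementation choice made for the example, not part of the disclosure.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_fill_holes

def estimate_hole_mask(spread_coefficients, threshold=4.0):
    """Hole estimator 801 (sketch): pixels whose spread coefficient exceeds the
    threshold are treated as hole edges; the enclosed region is then filled to
    give the mask 802 of erroneous pixel locations."""
    edges = spread_coefficients > threshold
    closed = binary_dilation(edges, iterations=1)   # close small gaps in the edge
    return binary_fill_holes(closed)
```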
The mask 802 may be generated using any other technique. In some examples, generating the mask 802 does not require use of the transformation matrix 103.
Although Figure 8 illustrates use of the inpainter 301 of Figure 5, it will be appreciated that any inpainter of the present disclosure may be used in combination with the features described above in connection with Figure 8. Moreover, the view transformer 106 may receive and transform a depth image (e.g. depth image 206) instead of a colour image, and an inpainter for inpainting an input depth image 302 may be used.

Figure 9 shows a variation of the inpainter 301 described herein. In the present example, the inpainter 301 generates an array of N reference images 313b. Each n'th reference image is generated so that the pose of the object in that reference image corresponds to, or matches, the pose of the object in the input image 302, such that the objects in the reference image and the input image 302 are aligned. The image stitcher 314 then generates the output image 315 by inpainting or filling in the identified erroneous pixels based on the array of reference images 313b. In particular, the image stitcher 314 inpaints or fills the identified erroneous pixels based on correspondingly located pixels in each of the N reference images 313b.
To generate the array of N reference images 313b, at time 1, the reference view selector 401 selects an array of N previous input images from the database 403. Each n'th previous input image preferably shows a different pose of the object. The reference view selector 401 then outputs the array of N previous input images as an array of N initial reference images 309b. The reference view selector 401 also outputs an array of N corresponding pose matrices 310b. Each n'th pose matrix 310b is indicative of the pose of the object in a corresponding n'th reference image 309b. The inpainter 301 includes N pose difference calculators 311b and N 3D view transformers 312b. Each n'th pose difference calculator 311b and n'th 3D view transformer 312b operates as described above, to align a corresponding n'th reference image 309b with the input image 302. Then, the set of N view transformers 312b outputs the array of N reference images 313b to the image stitcher 314.
The inpainter 301 of the present example may be particularly advantageous when inpainting a depth input image 302. In particular, the erroneous pixels in depth images have a tendency to be relatively more distributed, especially if the depth image has been estimated, e.g. by a depth estimator 1001 as shown in Figure 2. Therefore, a single historic view of the object may be unlikely to include sufficient pixel information to inpaint the erroneous pixels, because each possible reference image may include its own erroneous or missing pixels. However, by using a set of N reference images, inpainting of the input image 302 may be improved.
Figure 10 shows a variation of the inpainter 301 shown in Figure 9. Each n'th reference image in the array 313b may include its own erroneous or missing pixels. Therefore, in the example of Figure 10, the array of N reference images 313b is combined into a master reference image 1020. Combining the reference images 313b may include combining the hole-less or error-free regions of pixels of each reference image 313b to generate the master reference image 1020. As such, by combining the array of reference images 313b, a more complete master reference image 1020 is provided to the image stitcher 314, which has fewer holes or erroneous pixels than an individual reference image. The image stitcher 314 then inpaints or fills the input image 302 based on the master reference image 1020 to generate the output image 315. In particular, the image stitcher 314 inpaints or fills the erroneous pixels of the input image 302 based on the correspondingly located pixels in the master reference image 1020.

Figure 11 shows another example implementation of the reference view store 308. In the present example, the reference view store 308 comprises a point-cloud store 1105. The point cloud store 1105 stores a point cloud that is indicative of a 3D model of the object. The point cloud is based on one or more previous input images received at the inpainter 301. Each point or coordinate of the point cloud is defined relative to a world or reference coordinate system.
The inpainter 301 generates the reference image 313 using the point cloud. To generate the reference image 313, the point cloud store 1105 provides the point cloud 309c to the 3D view transformer 312c. The reference view store 308 also provides data 1104 indicative of the world coordinate system (e.g. an origin of the world coordinate system). The pose difference calculator 311 then generates the transformation matrix 316 based on a difference between the pose matrix 306 and the world coordinate system (in particular the origin of the world coordinate system). The transformation matrix 316 is indicative of a transformation between the pose of the object in the input image 302 (relative to the first camera position) and the pose of the object relative to the origin of the world coordinate system. The 3D view transformer 312c applies the 3D transformation to the point cloud 309c using the transformation matrix 316 to produce the aligned reference image 313. The 3D view transformer 312c may be considered as extracting the reference image 313 from the point cloud 309c based on the transformation matrix 316.
Optionally, after generating the output image 315, the point cloud is updated in the point cloud store 1105 based on the current input image 302. In particular, an image to point-cloud transformer 1101 receives the input image 302. The transformer 1101 generates a point cloud representation 1106 of the input image 302. The point cloud 1106 defines the pixels of the object as 3D coordinates, relative to the fixed reference (eg, the first real camera position). A point cloud transformer 1103 then transforms the point cloud 1106 to the world coordinate system to generate a transformed point cloud 1107. In particular, the points of the point cloud 1106 are mapped to the world coordinate system to generate the point cloud 1107. Then, the points in the point cloud 1107 are added or integrated into the point cloud 309c stored in the point cloud store 1105.
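An illustrative sketch of the image to point-cloud transformer 1101, the point cloud transformer 1103 and the integration into the point cloud store 1105 is given below, assuming a pinhole camera model with known intrinsics and a 4x4 camera-to-world transform; all names, the intrinsics and the NumPy dependency are illustrative assumptions.

```python
import numpy as np

def image_to_point_cloud(depth, fx, fy, cx, cy):
    """Image to point-cloud transformer 1101 (sketch): back-project a depth
    image into 3D points in the camera frame using a pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[~np.isnan(points).any(axis=1)]   # drop missing pixels

def to_world(points, camera_to_world):
    """Point cloud transformer 1103 (sketch): map camera-frame points into the
    world coordinate system using a 4x4 transform from the fixed reference."""
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return (homogeneous @ camera_to_world.T)[:, :3]

def integrate(stored_cloud_309c, new_points_1107):
    """Append the transformed points 1107 to the stored point cloud 309c."""
    return np.vstack([stored_cloud_309c, new_points_1107])
```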
Over time, the amount of storage required for the point cloud 309c may increase as it accumulates more points based on previous input images 302. To reduce the burden on storage requirements, the point cloud store 1105 may implement a pruning algorithm/technique. In particular, the point cloud store 1105 may delete or disregard any points or coordinates of the point cloud 309c that are within a threshold distance of one another, or that have been in the point cloud store for more than a set amount of time.

Figure 12 shows an example implementation of the image stitcher 314 when inpainting a depth image. The image stitcher 314 receives an input depth image 1202 which corresponds to the input image 302 of the inpainter 301, and a reference depth image 1206 which corresponds to the reference image 313 provided by the view transformer 312. The image stitcher 314 also receives an input colour image 1201 associated with the input depth image 1202, and a reference colour image 1205 associated with the reference depth image 1206. As described previously, the input depth image 1202 is indicative of the depth of each pixel in the colour image 1201, and the reference depth image 1206 is indicative of the depth of each pixel in the reference colour image 1205. The image stitcher 314 also receives a hole mask 1203, which may correspond to the hole mask 802 generated by the hole estimator 801.
In the present example, the image stitcher 314 further aligns the reference depth image 1206 with the input depth image 1202. The image stitcher 314 generates a transformation matrix 1213 which is indicative of a 2D transformation. The 2D transformation is for aligning one or more features of the object in the reference depth image 1206 with corresponding features of the object in the input depth image 1202. A feature detector 1215a receives the input colour image 1201, and detects features of the object in the input colour image 1201. The feature detector 1215a may detect dominant features of the object. For example, if the object is a face, the feature detector 1215a may detect facial features (e.g. mouth, eyes, nose etc.). The feature detector 1215a determines the locations of the detected features in the input colour image 1201, and provides information 1211 indicative of the detected features and locations to a (2D) transform estimator 1216. Similarly, a feature detector 1215b receives the reference colour image 1205, and detects the same features of the object but this time in the reference colour image 1205. The feature detector 1215b determines the locations of the detected features in the reference colour image 1205, and provides information 1212 indicative of the detected features and locations to a (2D) transform estimator 1216. Since the input colour image 1201 corresponds to the input depth image 1202, the feature detector 1215a may be considered as performing the function of detecting the features of the object in the input depth image 1202. Furthermore, since the reference colour image 1205 corresponds to the reference depth image 1206, the feature detector 1215b may be considered as performing the function of detecting the corresponding features of the object in the reference depth image 1206. The feature detectors 1215a and 1215b may be the same feature detector used in the object pose estimator 305.
The transform estimator 1216 then generates the transformation matrix 1213 based on differences between the locations of the features of the object in the input colour image 1201 and the locations of the corresponding features of the object in the reference colour image 1205. A 2D view transformer 1217 then applies the 2D transformation to the reference depth image 1206 using the transformation matrix 1213. As a result, the 2D view transformer 1217 outputs a reference depth image 1214 which is further aligned with the input depth image 1202. In particular, the dominant features of the object in the reference depth image 1214 will be aligned with the corresponding dominant features of the object in the input depth image 1202.
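A minimal sketch of the transform estimator 1216 is given below. It fits a least-squares affine transform to the matched feature locations (at least three non-collinear matches are assumed), which is one possible form of 2D transformation and is not mandated by the disclosure.

```python
import numpy as np

def estimate_2d_transform(input_points, reference_points):
    """Transform estimator 1216 (sketch): least-squares affine transform mapping
    feature locations in the reference image onto the corresponding feature
    locations in the input image.

    input_points, reference_points : (N, 2) arrays of matched (x, y) locations,
    with N >= 3 non-collinear matches. Returns a 3x3 matrix (cf. matrix 1213)
    for use with homogeneous pixel coordinates."""
    reference_points = np.asarray(reference_points, dtype=np.float64)
    input_points = np.asarray(input_points, dtype=np.float64)
    ref_h = np.hstack([reference_points, np.ones((len(reference_points), 1))])
    coeffs, *_ = np.linalg.lstsq(ref_h, input_points, rcond=None)   # shape (3, 2)
    transform = np.eye(3)
    transform[:2, :] = coeffs.T
    return transform
```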
A view inserter 1218 receives the reference depth image 1214, the input depth image 1202 and the mask 1203. The view inserter 1218 identifies the locations of the erroneous pixels (eg, missing pixels) of the object in the input depth image 1202 based on the locations indicated in the mask 1203. The view inserter 1218 then inpaints the erroneous pixels using the correspondingly located pixels of the reference depth image 1214. In particular, the view inserter 1218 may set or assign the values of the erroneous pixels to the values of the correspondingly located pixels in the reference depth image 1214. The view inserter 1218 outputs an inpainted or filled output image 1210, which may correspond to the output image 315.
It will be appreciated that the illustrated image stitcher 314 may also be used in the inpainter 301 when inpainting an input colour image as the input image 302. In such cases, both images 1201 and 1202 are the same image and may correspond to the input colour image 302. The images 1205 and 1206 are also the same image and may correspond to the reference colour image 313. The feature detector 1215a will detect the features of the object in the input colour image 1201/1202. The feature detector 1215b will detect the corresponding features of the object in the reference colour image 1205/1206. The transform estimator 1216 will generate the transformation matrix 1213 based on differences between locations of the features of the object in the input colour image 1201/1202 and the corresponding features of the object in the reference colour image 1205/1206. The 2D view transformer 1217 will apply the 2D transformation to the reference colour image 1205/1206 using the transformation matrix 1213 to output a reference colour image 1214 that is further aligned with the input colour image 1201/1202. The view inserter 1218 will inpaint or fill the input colour image 1202 based on the mask 1203 and the reference colour image 1214. As such, the image stitcher 314 may not require the corresponding depth images in order to achieve the above described function for an input colour image 302.
In some examples, the functionality of the image stitcher 314 described herein may be performed by an appropriately trained neural network.
In some examples, where an array of N reference images is generated, the image stitcher 314 may be adapted to perform the above described functionality for each reference image.

Figure 13 shows a variation of the inpainter 301 according to another example. In some scenarios, the object may have a combination of "stateful" and "stateless" features.
"Stateful" features may correspond to features of the object that are likely to change visual appearance over time. For example, if the object is a face/head, then stateful features may correspond to the eyes and mouth, which are likely to visually change over time. For example, eyes might look in different directions and therefore move relative to stateless features or might visually change in some other way, such as by closing. Stateless features may correspond to features of the object that are unlikely to change visual appearance over time. For example, if the object is a face/head, stateless features might be the nose and ears.
If the erroneous/missing pixels are in the location of a stateful feature of the object, then inpainting/filling those pixels using a reference image may result in an abnormal or unnatural looking image. In particular, the state of the feature in the reference image may be different to what the state should be in the output image. Therefore, using the pixels of the stateful feature in the reference image to inpaint the corresponding stateful feature in the input image may result in an unnatural looking output image. For example, if the stateful feature is an eye, then the visual appearance of the eye (i.e. its state) in the reference image may be different to what the visual appearance should be in the output image. Therefore, the pixels of the eye in the reference image may have an unnatural appearance in combination with the input image.
In the present example, the image stitcher 314 outputs an initial output image 1302, which may suffer from the above-mentioned abnormalities or unnatural appearances. The inpainter 301 includes an abnormality manager 1303 which receives the output image 1302. The abnormality manager 1303 is configured to detect one or more stateful features of the object that have an abnormal or unnatural appearance (i.e. one or more abnormal features of the object). Preferably, the abnormality manager 1303 identifies each stateful feature of the object in the image 1302, then classifies the stateful feature as an abnormal feature if pixels of the feature have been filled or inpainted by the image stitcher 314. The abnormality manager 1303 then processes the abnormal feature, for example by applying blurring to the abnormal feature. In one example, the abnormality manager 1303 may identify, in the output image 1302, a region of pixels that corresponds to a stateful feature of the object, e.g. that is likely to move relative to the object over time. Then, if at least some of the erroneous pixel locations are within the region of pixels of the stateful feature (i.e. if at least some of the pixels within the stateful feature region have been inpainted), then the stateful feature may be classified as an abnormal feature. The abnormality manager may then apply a blurring effect to the region of pixels corresponding to the abnormal feature.

Figure 14 shows an example implementation of the abnormality manager 1303 when the inpainter 301 is inpainting an input colour image 302. The abnormality manager 1303 receives a colour image 1412 which corresponds to the output image 1302 of the image stitcher 314. The abnormality manager 1303 also receives a mask 1411, which may correspond to the hole mask 802 previously described. A feature detector 1403 detects features of the object in the image 1412. For example, if the object is a face/head, then the regions of pixels corresponding to the main features of the face/head are detected (e.g. eyes, mouth, nose etc.). The feature detector 1403 outputs data 1404 indicative of the locations of the detected features (e.g. x and y coordinates). The feature detector 1403 may be implemented using an appropriately trained neural network, as will be well understood by the skilled person. A feature classifier 1405 uses the hole mask 1411 to determine whether the pixels of any detected feature have been filled or inpainted by the image stitcher 314. Since the hole mask 1411 is the same mask used by the image stitcher 314, the hole mask 1411 can be used to identify any detected features which have had their pixels filled or inpainted. Such features may be considered as inpainted features of the object. Then, the feature classifier 1405 determines whether the inpainted features are stateful or stateless features. This can be done by comparing the inpainted feature to a database 1406 which is indicative of the stateful features of the object. If an inpainted feature is stateful, then it may be classified as an abnormal feature. The classifier 1405 provides data 1408 indicative of the abnormal features of the object. An image filter 1409 then applies blurring to the regions of pixels in the colour image 1412 corresponding to the abnormal features. The filter 1409 outputs a final output image 1410 with the blurring applied. The output image 1410 may correspond to the output image 315.
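An illustrative sketch of the feature classifier 1405 and the image filter 1409 acting together is given below; the box blur and the dictionary-based representation of detected feature regions are simplifications made for the example only.

```python
import numpy as np

def blur_abnormal_features(image, hole_mask, feature_regions, stateful_names,
                           kernel=5):
    """Feature classifier 1405 + image filter 1409 (sketch): any stateful
    feature whose region overlaps the hole mask 1411 (i.e. was inpainted) is
    classified as abnormal and box-blurred.

    feature_regions : dict mapping a feature name to a boolean HxW region mask.
    stateful_names  : set of feature names regarded as stateful (e.g. eyes, mouth).
    """
    out = image.astype(np.float64)
    pad = kernel // 2
    padded = np.pad(out, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    # Box blur of the whole frame; only abnormal regions are copied back below.
    blurred = np.zeros_like(out)
    for dy in range(kernel):
        for dx in range(kernel):
            blurred += padded[dy:dy + out.shape[0], dx:dx + out.shape[1]]
    blurred /= kernel * kernel
    for name, region in feature_regions.items():
        if name in stateful_names and np.any(region & hole_mask):
            out[region] = blurred[region]
    return out
```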
It will be appreciated that the same abnormality manager of Figure 14 may be used when inpainting an input depth image 302.
In some examples, instead of blurring the abnormal feature, the image filter 1409 may instead apply an image correction to the abnormal feature. The image correction may be performed using an appropriately trained neural network or statistical model.
In alternative examples, the abnormality manager 1303 may detect abnormal features using an appropriately trained neural network or a statistical model. The abnormality manager 1303 may then blur or correct the abnormal features as described above.
As mentioned before, in some examples, the input image 302 received at the inpainter 301 may include multiple objects. For example, in a video call setting, the objects may include a user's face/head, hands, etc. Multiple instances of the inpainter 301 may be provided in parallel, where each inpainter detects and inpaints a respective one of the objects. For example, one instance of the inpainter may fill in the face, and another may fill in a hand. In some examples, where multiple inpainters are present, the inpainters may share the functional blocks or modules described herein in order to improve efficiency. For example, the multiple inpainters provided in parallel may share one or more of those blocks. Additionally or alternatively, where the video processor 105 uses both a depth inpainter and a colour inpainter, those inpainters may share functional blocks.
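Purely as an illustration of running one inpainter instance per detected object in parallel, the following sketch uses a Python thread pool; the inpainter callables are placeholders for instances of the inpainter 301 and are assumptions rather than disclosed code.

```python
# Illustrative only: one inpainter instance per object (e.g. face and hand),
# executed concurrently on the per-object image regions.
from concurrent.futures import ThreadPoolExecutor

def inpaint_objects(object_images, inpainters):
    """object_images and inpainters are parallel lists, one entry per object."""
    with ThreadPoolExecutor(max_workers=len(inpainters)) as pool:
        futures = [pool.submit(inp, img) for inp, img in zip(inpainters, object_images)]
        return [f.result() for f in futures]
```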
In some examples, before receiving the input image 302 at the inpainter 301, the input image 302 may be separated into a foreground image and a background image. The inpainter 301 may receive only the foreground image as the input image 302.
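A minimal sketch of such a foreground/background split is given below, assuming an H x W x 3 colour image and a binary segmentation mask that is already available (how the mask is obtained is outside the scope of this example).

```python
# Sketch of an optional foreground/background split upstream of the inpainter.
import numpy as np

def split_foreground(image: np.ndarray, fg_mask: np.ndarray):
    """image: H x W x 3 array; fg_mask: H x W boolean array (True = foreground)."""
    foreground = np.where(fg_mask[..., None], image, 0)  # passed on as the input image 302
    background = np.where(fg_mask[..., None], 0, image)
    return foreground, background
```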
Video calling is an increasingly popular method of human-to-human interaction. For example, reference is made to Figure 15, which shows a system 1100 which facilitates a video call between a first electronic device 1101-1 and a second electronic device 1101-2, via a network 1102. The first electronic device 1101-1 and the second electronic device 1101-2 each include a respective video capture device or camera unit 1103-1 and 1103-2. Each video capture device / camera unit comprises at least one camera for capturing video footage. The first camera unit 1103-1 captures video footage of a first scene including a first user 1110-1 and outputs a first video stream 1105-1. The second camera unit 1103-2 captures video footage of a second scene including a second user 1110-2 and outputs a second video stream 1105-2. The first electronic device 1101-1 and the second electronic device 1101-2 each include a respective transceiver 1107-1 and 1107-2. The first video stream 1105-1 is transmitted to the network 1102 via the first transceiver 1107-1, and received by the second electronic device 1101-2 via the second transceiver 1107-2. Furthermore, the second video stream 1105-2 is transmitted to the network 1102 via the second transceiver 1107-2, and received by the first electronic device 1101-1 via the first transceiver 1107-1. The first electronic device 1101-1 and the second electronic device 1101-2 each include a respective display 1104-1 and 1104-2. The first display 1104-1 receives the second video stream 1105-2, which is displayed by the display 1104-1. The second display 1104-2 receives the first video stream 1105-1, which is displayed on the display 1104-2.
The first video stream 1105-1 captured by the first camera unit 1103-1 will show a particular perspective or view of the first scene, based on the position of the camera unit 1103-1 within / relative to the scene. In particular, the images (i.e. frames) captured by the first camera unit 1103-1 will depend on the 3D translational and/or rotational position of the first camera unit 1103-1 within / relative to the scene. It is often the case that the first camera unit 1103-1 is positioned at some distance from the first display 1104-1. When the first user 1110-1 wants to engage directly with the second user 1110-2, the first user 1110-1 may look at the first display 1104-1 where they can see the second user 1110-2. In other words, the first user 1110-1 does not look directly at the first camera unit 1103-1. As such, from the perspective of the second user 1110-2 looking at the second display 1104-2, the first user 1110-1 will appear to be looking away from the second user 1110-2. It is impractical to position the physical camera unit 1103-1 over the display 1104-1 so that the first user 1110-1 appears to be looking directly at the camera unit 1103-1.
The real-time video processor(s) and video processing methods of the present disclosure can be used for processing an input video stream in real time. The video processor processes the input video stream to generate an output video stream. The output video stream will show a view of the scene that is different to the view that was captured by the original camera. In particular, the output video stream will show the scene from the perspective of a (virtual) camera that has a different position (e.g. a different 3D translational and/or rotational position) within / relative to the scene in comparison to the original camera that captured the input video stream. The output video stream may be considered as an estimation of what the captured video footage would look like, had the original camera been in the position of the virtual camera.
Advantageously, the virtual camera can have a position that corresponds to the position of the display 1104-1. The video processor may alter the first video stream 1105-1 so that when the first user 1110-1 is looking at the display 1104-1, the first user 1110-1 appears as though they are looking directly at the camera unit 1103-1 (and therefore at the second user 1110-2). This may improve engagement and the communication of visual cues between the first user 1110-1 and the second user 1110-2 during the video call.
Figure 16 shows a system 1200 according to an example of the present disclosure. The system 1200 corresponds to the system 1100 with the following differences.
As shown in Figure 16, the first electronic device 1101-1 includes the video processor 105. The video processor 105 is in a transmit path of the first electronic device 1101-1 (i.e. between the camera unit 1103-1 and the transceiver 1107-1). The video processor 105 receives the first video stream 1105-1. The video processor 105 also receives transformation data 103. The transformation data 103 is indicative of a (3D) transformation between the actual position of the first camera unit 1103-1, and the position of a virtual camera (not shown) within the first scene. As such, the transformation data 103 indicates the desired change in view or perspective of the first video stream 1105-1. The transformation data 103 may be represented as a mathematical function. The transformation data 103 may include parameters indicative of the change in the 3D translational and/or rotational position between the camera unit 1103-1 and the virtual camera. In particular, the transformation data 103 may include translation parameters indicative of a change in a 3D translational position between the first camera unit 1103-1 and the virtual camera in 3D space. The transformation data 103 may also include rotation parameters indicative of a change in a 3D rotational position between the first camera unit 1103-1 and the virtual camera in 3D space. The video processor 105 generates an output video stream 12001 based on the first video stream 1105-1 and the transformation data 103. In particular, for each individual frame (e.g. image) of the first video stream 1105-1, the video processor 105 applies a transformation to the image. The transformation transforms the image so that the image shows the scene from the perspective of the virtual camera instead of the first camera unit 1103-1. The transformed image may be considered as an estimation of the scene from the position of the virtual camera. The video processor 105 processes each image of the first video stream 1105-1 in real-time. The transformed images are outputted by the video processor 105 in real-time, to generate the output video stream 12001. Accordingly, the output video stream 12001 will show the video footage from the perspective of the virtual camera instead of the first camera unit 1103-1. The output video stream 12001 is then transmitted to the second electronic device 1101-2 via the transceiver 1107-1 and the network 1102. As such, at the second display 1104-2, the first user 1110-1 will appear to be looking directly at the second user 1110-2, when looking at the position of the virtual camera (e.g. at the first display 1104-1). Moreover, the output video stream 12001 will be free of erroneous/missing pixels or holes as a result of the inpainting techniques described herein.
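As a non-limiting sketch of how the rotation and translation parameters of the transformation data 103 might be packed into a homogeneous matrix and applied frame by frame, consider the following; the warp_to_virtual_view callable stands in for the video processor's actual reprojection step and is an assumption, not disclosed code.

```python
# Illustrative only: build a 4x4 real-to-virtual camera transform and apply it
# to each frame of the first video stream.
import numpy as np
from scipy.spatial.transform import Rotation

def build_camera_transform(rot_xyz_deg, translation_m) -> np.ndarray:
    """4x4 homogeneous transform from the real camera pose to the virtual camera pose."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", rot_xyz_deg, degrees=True).as_matrix()
    T[:3, 3] = translation_m
    return T

def process_stream(frames, transform, warp_to_virtual_view):
    """Apply the view change frame by frame, preserving real-time ordering."""
    for frame in frames:                                  # each frame of video stream 1105-1
        yield warp_to_virtual_view(frame, transform)      # estimated view from the virtual camera
```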
Figure 17 shows a system 1300 according to another example of the present disclosure. The system 1300 is similar to the system 1200, except that the video processor is in the receive path of the first electronic device 1101-1 (i.e. between the transceiver 1107-1 and the display 1104-1). The video processor 105 receives the second video stream 1105-2 and the transformation data 103. In the system 1300, the transformation data 103 is indicative of a transformation between the actual position of the second camera unit 1103-2, and the position of a virtual camera within the second scene. As such, the transformation data 103 indicates the desired change in view of the second video stream 1105-2. The video processor 105 generates an output video stream 1301 based on the second video stream 1105-2 and the transformation data 103, as already described above. Accordingly, the output video stream 1301 will show the video footage from the perspective of the virtual camera instead of the second camera unit 1103-2. The output video stream 1301 is then displayed on the first display 1104-1. As such, at the first display 1104-1, the second user 1110-2 will appear to be looking directly at the first user 1110-1, when looking at the position of the virtual camera (e.g. at the second display 1104-2).
Figure 18 shows a system 1400 according to another example of the present disclosure. The system 1400 is similar to the systems 1200 and 1300, except that the video processor 105 is not implemented in the first or second electronic devices. Rather, the video processor 105 may be implemented elsewhere in system 1400, for example elsewhere in the network 1102 (e.g. on a server, in the cloud, etc.). The video processor 105 may receive and process the first video stream 1105-1 to generate the output video stream 12001 (or receive and process the second video stream 1105-2 to generate the output video stream 1301) as described above.
The system 1400 may also include a third camera unit 1401. Like the first camera unit 1103-1, the third camera unit 1401 may also capture video footage of the first user 1110-1, but from a different position or angle to the first camera unit 1103-1. The third camera unit 1401 may output a third video stream 1402. The video processor 105 may generate the output video stream 12001 further based on the third video stream 1402. The third camera unit 1401 may be a physically separate device to the first camera unit 1103-1. The third camera unit 1401 may not be part of the first electronic device 1101-1, and may instead be independently connected to the network 1102. The third camera unit 1401 may belong to a different, third electronic device (not shown). As such, multiple independent camera units from different electronic devices can be used by the video processor 105 in order to generate the output video stream 12001.
The techniques of the present disclosure may be used in many applications and settings, including in work settings, education, entertainment and healthcare. For example, in a work setting, the systems of the present disclosure may allow two callers to have better perspectives of one another, improving interaction and engagement between the callers. In an education setting, a student may better see the education content being presented by a teacher by changing the virtual camera position. In an entertainment or streaming setting, a viewer (and many other viewers connected to the network) may view a single broadcaster whilst independently choosing a viewing position. In a healthcare setting, a physician may adjust the virtual camera position to better assess a patient.
As such, variations of the systems described herein are envisaged. It will be appreciated that either user 1110-1 or 1110-2 (or a third party) may determine the position of the virtual camera. Alternatively, the position of the virtual camera can be determined automatically. It will be appreciated that the camera units 1103-1 and 1103-2 may be part of the respective electronic devices 1101-1 and 1101-2, or physically external to the respective devices 1101-1 and 1101-2 (e.g. connected to the electronic device via a wired or wireless connection). It will be appreciated that the displays 1104-1 and 1104-2 may be optional.
It will be appreciated that the systems described above may use any of the video processors, and any of the inpainters, described herein.
The aspects of the present disclosure described in all of the above may be implemented by software, hardware or a combination of software and hardware. Any electronic device or server can include any number of processors, transceivers, displays, input devices (e.g. mouse, keyboard, touchscreen), and/or data storage units that are configured to enable the electronic device/server to execute the steps of the methods described herein. The functionality of the electronic device/server may be implemented by software comprising computer readable code, which when executed on the processor of any electronic device/server, performs the functionality described above. The software may be stored on any suitable computer readable medium, for example a non-transitory computer-readable medium, such as read-only memory, random access memory, CD-ROMs, DVDs, Blu-ray discs, magnetic tape, hard disk drives, solid state drives and optical drives. The computer-readable medium may be distributed over network-coupled computer systems so that the computer readable instructions are stored and executed in a distributed way.
Each of the functional modules/units described above and represented in the figures may be implemented by software, hardware or a combination of software and hardware. In one particular example, any one or more of the functional modules/units may comprise, or make use of, a trained neural network or artificial intelligence in order to perform the functionality described above. Furthermore, the apparatus/system is described as a series of functional modules/units only for the sake of understanding and clarity. In practice, the functionality of any or all of the functional modules/units may be combined, and conversely the functionality of any one of the functional modules/units may alternatively be split across two or more modules/units.
For example, Figure 2 shows two separate depth inpainters 1003a and 1003b, one for each depth frame 1002a and 1002b. In some examples, the depth inpainters 1003a and 1003b may be implemented entirely independently, for example each depth inpainter 1003a and 1003b may be implemented as shown in Figure 3. However, in other examples the depth inpainters 1003a and 1003b may share some resources. For example, they may each be implemented as shown in Figure 3, except that they may not include the reference view store 308. Instead, there may be a common reference view store 308 that the two depth inpainters 1003a and 1003b have access to.
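A minimal sketch of such resource sharing is given below, assuming a simple in-memory reference view store; the class and method names are illustrative and not taken from the disclosure.

```python
# Sketch of two depth inpainters sharing a common reference view store.
class ReferenceViewStore:
    def __init__(self):
        self._views = {}                      # pose key -> stored previous image

    def get(self, pose_key):
        return self._views.get(pose_key)

    def put(self, pose_key, image):
        self._views[pose_key] = image

class DepthInpainter:
    def __init__(self, store: ReferenceViewStore):
        self.store = store                    # shared, rather than one store per instance

shared_store = ReferenceViewStore()
inpainter_a = DepthInpainter(shared_store)    # for depth frame 1002a
inpainter_b = DepthInpainter(shared_store)    # for depth frame 1002b
```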
In the aspects described above, the image stitcher 314 may optionally identify one or more erroneous pixels in the image that it receives (for example, one or more erroneous pixels in the first region of pixels - which corresponds to an object in the image - of the second input image). This may be done in a number of ways. For example, the image stitcher 314 may receive a mask (for example the hole mask 802 from the hole estimate 801) that indicates where the erroneous pixels are, or it may identify erroneous pixels in the transformed image 313 for itself (for example, by looking for pixels that lack a pixel value), or it may receive identification information from any other suitable source, for example from some other analysis module/unit that may be within or without the inpainter 301. In a further example, the image stitcher 314 may receive a probability score map corresponding to the input image, the probability score map indicating a likelihood that each pixel is erroneous. The inpainter 301 may consider erroneous pixels to be any pixels with a likelihood of being erroneous that exceeds a predetermined threshold. Alternatively, it may combine the probability score with other scores that it determines itself, for example the similarity of a pixel value between the input image 302 and the reference image, to determine whether a pixel is erroneous.
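The following minimal sketch, assuming NumPy arrays, illustrates deriving an erroneous-pixel mask from a probability score map, optionally combined with a reference-similarity score; the 0.5 threshold and the equal weighting are illustrative choices, not values taken from the disclosure.

```python
# Illustrative thresholding of a probability score map (H x W) into an
# erroneous-pixel mask, optionally blended with reference dissimilarity.
import numpy as np

def erroneous_mask(prob_map: np.ndarray,
                   input_img: np.ndarray = None,
                   reference_img: np.ndarray = None,
                   threshold: float = 0.5) -> np.ndarray:
    score = prob_map.astype(float)
    if input_img is not None and reference_img is not None:
        diff = np.abs(input_img.astype(float) - reference_img.astype(float))
        if diff.ndim == 3:                    # collapse colour channels to one score per pixel
            diff = diff.mean(axis=-1)
        score = 0.5 * score + 0.5 * diff / (diff.max() + 1e-6)
    return score > threshold                  # True where a pixel is treated as erroneous
```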
Optionally, in any of the implementations described above, when the input image 302 is a colour or greyscale image, the inpainter 301 may include one or more additional functional modules/units to perform image balancing prior to the image stitcher. In particular, the additional functional module(s)/unit(s) may adjust the colour balance and/or white balance and/or contrast of the reference image 313 to better match the input image 302. Consequently, the appearance of the output image 315 may be improved. In an alternative, the image stitcher 314 may perform this functionality itself, rather than using a separate functional unit/module.
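One possible image-balancing step of this kind is sketched below, assuming an H x W x C colour reference image and input image as NumPy arrays: the per-channel mean and standard deviation of the reference image 313 are matched to those of the input image 302. This is a sketch of one balancing strategy, not the specific balancing mandated by the disclosure.

```python
# Illustrative per-channel mean/std matching of the reference image to the input image.
import numpy as np

def balance_reference(reference: np.ndarray, target: np.ndarray) -> np.ndarray:
    balanced = reference.astype(float).copy()
    for c in range(balanced.shape[-1]):       # per colour channel
        ref_c = balanced[..., c]
        tgt_c = target[..., c].astype(float)
        std = ref_c.std() + 1e-6
        balanced[..., c] = (ref_c - ref_c.mean()) / std * tgt_c.std() + tgt_c.mean()
    return np.clip(balanced, 0, 255)
```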
Figure 19 shows a non-limiting example of an electronic device suitable for performing any of the above described aspects of the present disclosure. Figure 19 shows a block diagram of an example computing system.
A computing system 9000 can be configured to perform any of the operations disclosed herein. The computing system 9000 includes one or more computing device(s) 9020. The one or more computing device(s) 9020 of the computing system 9000 comprise one or more processors 9040 and a memory 9060. The one or more processors 9040 can be any general-purpose processor(s) configured to execute a set of instructions (i.e., a set of executable instructions). For example, the one or more processors 9040 can be one or more general-purpose processors, one or more field programmable gate arrays (FPGAs), and/or one or more application specific integrated circuits (ASICs). In one example, the one or more processors 9040 include one processor. Alternatively, the one or more processors 9040 include a plurality of processors that are operatively connected. The one or more processors 9040 are communicatively coupled to the memory 9060 via an address bus 9080, a control bus 9100, and a data bus 9120. The memory 9060 can be a random-access memory (RAM), a read-only memory (ROM), a persistent storage device such as a hard drive, an erasable programmable read-only memory (EPROM), and/or the like. The one or more computing device(s) 9020 further comprise an I/O interface 9140 communicatively coupled to the address bus 9080, the control bus 9100, and the data bus 9120.
The memory 9060 can store information that can be accessed by the one or more processors 9040. For instance, the memory 9060 (e.g., one or more non-transitory computer-readable storage mediums, memory devices) can include computer-readable instructions (not shown) that can be executed by the one or more processors 9040 in order to perform the methods/processes described herein. The computer-readable instructions can be software written in any suitable programming language or can be implemented in hardware. Additionally, or alternatively, the computer-readable instructions can be executed in logically and/or virtually separate threads on the one or more processors 9040. For example, the memory 9060 can store instructions (not shown) that when executed by the one or more processors 9040 cause the one or more processors 9040 to perform operations such as any of the operations and functions for which the computing system 9000 is configured, as described herein. In addition, or alternatively, the memory 9060 can store data (not shown) that can be obtained, received, accessed, written, manipulated, created, and/or stored. In some implementations, the computing device(s) 9020 can obtain from and/or store data in one or more memory device(s) that are remote from the computing system 9000.
The computing system 9000 further comprises a storage unit 9160, a network interface 9180, an input controller 9200, and an output controller 9220. The storage unit 9160, the network interface 9180, the input controller 9200, and the output controller 9220 are communicatively coupled to the computing device(s) 9020 via the I/O interface 9140.
The storage unit 9160 is a computer readable medium, preferably a non-transitory computer readable medium or non-transitory machine readable medium, comprising one or more programs, the one or more programs comprising instructions which when executed by the one or more processors 9040 cause the computing system 9000 to perform the method steps of the present disclosure. Alternatively, the storage unit 9160 is a transitory computer readable medium. The storage unit 9160 can be a persistent storage device such as a hard drive, a cloud storage device, or any other appropriate storage device.
The network interface 9180 can be a Wi-Fi module, a network interface card, a Bluetooth module, and/or any other suitable wired or wireless communication device. In one example, the network interface 9180 is configured to connect to a network such as a local area network (LAN), or a wide area network (WAN), the Internet, or an intranet.

Claims

1. A method of inpainting an input video stream in real time, the input video stream comprising a plurality of first input images indicative of an object, the method comprising an inpainter, for each first input image of the input video stream: receiving a second input image that is based on the first input image, wherein the second input image comprises a first region of pixels indicative of the object; generating a reference image of the object such that the object in the reference image aligns with the object in the second input image; and generating an output image by inpainting one or more erroneous pixels in the first region of pixels of the second input image, wherein inpainting the one or more erroneous pixels is based on the reference image.
2. The method of claim 1, wherein the plurality of first input images are indicative of an object from a first camera position, and wherein the method further comprises: receiving a first transformation matrix indicative of a first 3D transformation between the first camera position and a second camera position different to the first camera position; and for each first input image of the input video stream: transforming the first input image based on the first transformation matrix to generate the second input image, wherein the second input image is indicative of an estimated view of the object from the second camera position; and receiving the second input image at the inpainter.
3. The method of claim 2, further comprising: estimating locations of the one or more erroneous pixels in the second input image based on the pose of the object in the second input image and the first transformation matrix; and generating a mask indicative of the locations of the one or more erroneous pixels.
4. The method of any preceding claim, wherein generating the reference image comprises: accessing a database, wherein the database comprises a plurality of previous input images, wherein each previous input image corresponds to a previous first input image of the input video stream or a previous second input image received at the inpainter; selecting, from the database, a first previous input image of the plurality of previous input images; and transforming the selected first previous input image such that the object in the transformed first previous input image aligns with the object in the second input image.
5. The method of claim 4, wherein transforming the selected first previous input image comprises: generating a second transformation matrix indicative of a second 3D transformation between a pose of the object in the selected first previous input image and a pose of the object in the second input image; and applying the second 3D transformation to the selected first previous input image using the second transformation matrix.
6. The method of claim 5, wherein generating the second transformation matrix comprises: generating an input pose matrix indicative of the pose of the object in the second input image; receiving, from the database, a previous pose matrix indicative of the pose of the object in the selected first previous input image; and generating the second transformation matrix based on a difference between the input pose matrix and the previous pose matrix.
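Purely as an editorial illustration of one way the "difference" between the pose matrices of claim 6 might be realised in homogeneous 4×4 form (the notation below is an assumption, not claim language):

```latex
T_{2} \;=\; P_{\mathrm{input}}\, P_{\mathrm{previous}}^{-1},
\qquad
P = \begin{pmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{pmatrix}
```

where R is a 3×3 rotation and t a translation of the object pose; applying T2 to the selected first previous input image would bring its object pose into alignment with the pose in the second input image.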
7. The method of any of claims 4 to 6, wherein the transformed first previous input image is the reference image.
8. The method of any of claims 4 to 6, wherein generating the reference image further comprises: selecting, from the database, a second previous input image of the plurality of previous input images; and transforming the selected second previous input image such that the object in the transformed second previous input image aligns with the object in the second input image; and combining the transformed first previous input image and the transformed second previous input image to create the reference image.
9. The method of any of claims 4 to 8, wherein the second input image is a colour or greyscale image and the reference image is a colour or greyscale image, and wherein the database further comprises a plurality of depth images corresponding to the plurality of previous input images, and wherein selecting the first previous input image further comprises selecting the corresponding depth image of the plurality of depth images, and wherein transforming the selected first previous input image uses the selected depth image corresponding to the first previous input image.
10. The method of any of claims 4 to 9, wherein the second input image is a depth image and the reference image is a depth image, and wherein the database further comprises a plurality of colour or greyscale images corresponding to the plurality of previous input images, and wherein selecting the first previous input image further comprises selecting the corresponding colour or greyscale image of the plurality of colour or greyscale images, and wherein transforming the selected first previous input image comprises: determining, using the selected colour or greyscale image, a transformation of the selected first previous input image to align the object in the selected first previous input image with the object in the second input image.
11. The method of any of claims 4 to 10, wherein transforming the selected first previous input image further comprises: generating a third transformation matrix indicative of a 2D transformation that is suitable for aligning the object in the second input image with the object in the selected first previous input image; and applying the 2D transformation to the selected first previous input image using the third transformation matrix.
12. The method of any of claims 4 to 11, wherein selecting the first previous input image comprises: determining which previous input image of the plurality of previous input images is most suitable for inpainting the first region of pixels in the second input image; and selecting the determined previous input image.
13. The method of any of claims 4 to 12, wherein the selected first previous input image is the previous input image that is most likely to include information required for inpainting the one or more erroneous pixels.
14. The method of claim 12 or claim 13, wherein the second input image is a transformation of the first input image, and wherein determining which previous input image of the plurality of previous input images is most suitable for inpainting the first region of pixels in the second input image is based on: the transformation of a pose of an object in the second input image compared with a pose of the object in the first input image; and a pose of the object in each previous input image of the plurality of previous input images.
15. The method of claim 14, wherein the object in the determined previous input image has a pose that is most similar to an extrapolation of the transformation of the object in the first input image to the object in the second input image.
16. The method of claim 12 or claim 13, wherein determining which previous input image of the plurality of previous input images is most suitable for inpainting the first region of pixels in the second input image comprises: determining a degree of similarity between the pose of the object in the previous input image and the pose of the object in the second input image; and selecting the previous input image that has the highest degree of similarity.
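As an editorial illustration only (not claim language), one way of scoring and selecting by pose similarity is sketched below, assuming 4×4 pose matrices as NumPy arrays; the Frobenius-norm measure is an assumed metric, as the claims do not prescribe a particular similarity measure.

```python
# Illustrative pose-similarity scoring and selection.
import numpy as np

def pose_similarity(pose_a: np.ndarray, pose_b: np.ndarray) -> float:
    return -float(np.linalg.norm(pose_a - pose_b))    # higher = more similar

def select_most_similar(previous_poses, input_pose) -> int:
    """Return the index of the previous input image whose pose best matches the input pose."""
    scores = [pose_similarity(p, input_pose) for p in previous_poses]
    return int(np.argmax(scores))
```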
17. The method of claim 12 or claim 13, wherein each previous input image stored in the database is associated with a different predefined pose of the object, wherein selecting the first previous input image comprises: identifying which predefined pose is most suitable for inpainting the first region of pixels in the second input image; and selecting the previous input image that is associated with the identified predefined pose.
18. The method of claim 17, further comprising: after generating the reference image: replacing, in the database, the selected first previous input image with the first or the second input image if the pose of the object in the first or the second input image has a higher degree of similarity to the identified predefined pose than the pose of the object in the selected first previous input image.
19. The method of any preceding claim, further comprising generating at least one further reference image, wherein generating the output image comprises inpainting the one or more erroneous pixels based on correspondingly located pixels of the reference image and the at least one further reference image.
20. The method of any preceding claim, further comprising: for each first input image of the input video stream: identifying, in the output image, a second region of pixels within the first region of pixels, wherein the second region of pixels defines a feature of the object that is likely to visually change over time, and wherein at least some of the one or more erroneous pixels are located within the second region; and applying a blurring effect or an image correction to the second region of pixels.
21. The method of claim 20, wherein identifying the second region of pixels comprises: identifying, in the output image, a plurality of regions of pixels within the first region, each region of the plurality of regions corresponding to a respective feature of the object; and for each region of the plurality of regions: determining if at least some of the one or more erroneous pixels are located within the region; and identifying the region as the second region if the region corresponds to a feature of the object that is likely to visually change over time.
22. The method of claim 1, wherein the second input image is the first input image.
23. The method of claim 22, wherein the plurality of first input images are a plurality of first depth images indicative of a distance to the object from a first fixed reference, and wherein the input video stream further comprises a plurality of second depth images indicative of a distance to the object from a second fixed reference, wherein the method further comprises, at a further inpainter: receiving a second depth image of the plurality of second depth images, wherein the second depth image comprises a second region of pixels indicative of the object; generating a further reference image of the object such that the object in the further reference image aligns with the object in the second input image; and generating a further output image by inpainting one or more erroneous pixels in the second region of pixels, wherein inpainting the one or more erroneous pixels is based on the further reference image.
24. The method of claim 23, wherein the inpainter generates the reference image using one or more previous input images that are stored in a database, and wherein the further inpainter generates the further reference image using one or more previous input images that are stored in the database, wherein the database stores a plurality of previous input images that comprise one or more previous first depth images and one or more previous second depth images.
25. The method of any of claims 1 to 21, wherein the first input image and the second input image are colour images, greyscale images or depth maps.
26. A computer-readable medium comprising computer executable instructions stored thereon, which when executed by one or more processors, cause the one or more processors to perform the method according to any preceding claim.
27. An electronic device configured to perform the method of any of claims 1 to 25.
PCT/GB2024/050562 2023-03-16 2024-03-01 Real-time video inpainter and method WO2024189314A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB202303850 2023-03-16
GB2303850.8 2023-03-16

Publications (1)

Publication Number Publication Date
WO2024189314A1 true WO2024189314A1 (en) 2024-09-19

Family

ID=90364550

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2024/050562 WO2024189314A1 (en) 2023-03-16 2024-03-01 Real-time video inpainter and method

Country Status (1)

Country Link
WO (1) WO2024189314A1 (en)
