CN116601674A - Dark flash normal camera - Google Patents

Dark flash normal camera

Info

Publication number
CN116601674A
Authority
CN
China
Prior art keywords
nir
image
color
images
map
Prior art date
Legal status
Pending
Application number
CN202180071805.3A
Other languages
Chinese (zh)
Inventor
Jason Lawrence
Supreeth Achar
Zhihao Xia
Current Assignee
Google LLC
Original Assignee
Google LLC
Priority date
Filing date
Publication date
Application filed by Google LLC
Priority claimed from PCT/US2021/072300 (published as WO2022099322A1)
Publication of CN116601674A

Landscapes

  • Image Processing (AREA)

Abstract

Techniques for estimating surface normals and reflectivity from poorly illuminated images include determining albedo and surface normal maps for performing image re-illumination using, in addition to RGB images of objects in a set of objects, images illuminated with Near Infrared (NIR) radiation, the NIR-illuminated images being captured from substantially the same viewing angle as the RGB images. In some implementations, a prediction engine takes as input a single RGB image and a single NIR image, and estimates the surface normals and reflectivity of the object.

Description

Dark flash normal camera
Cross Reference to Related Applications
The present application is a non-provisional of, and claims priority to, U.S. Provisional Patent Application No. 63/198,736, entitled "A DARK FLASH NORMAL CAMERA," filed on November 9, 2020, the contents of which are incorporated herein by reference in their entirety. The present application is also a non-provisional of, and claims priority to, U.S. Provisional Patent Application No. 63/198,836, entitled "A DARK FLASH NORMAL CAMERA," filed on November 16, 2020, the contents of which are incorporated herein by reference in their entirety.
Technical Field
The present description relates to performing re-illumination of images taken under poorly illuminated conditions in, for example, mobile photography and camera applications.
Background
Much photography and imaging is performed under poor, uncontrolled illumination, which results in low-quality images and degrades the performance of downstream image processing and computer vision algorithms. Controlling the visible light in an environment, or supplementing it with a flash, is often too difficult or too disruptive to be practical.
Disclosure of Invention
Embodiments described herein relate to estimating high-quality normal and albedo maps of a scene depicting a person (face and torso) under poor-quality lighting conditions by supplementing the available visible-spectrum illumination with a single near infrared (NIR) light source and camera, producing a so-called dark flash image. Embodiments herein take as input a single color image captured under arbitrary visible light and a single dark flash image captured under controlled NIR illumination at the same viewing angle, and compute a normal map and an albedo map of the scene. Because ground-truth normal maps of faces are difficult to capture, embodiments herein include a novel training technique that combines information from multiple noisy sources, in particular stereo and photometric shading cues. The performance of these embodiments is assessed across a range of subjects and lighting conditions.
In one general aspect, a method can include receiving image training data representing a plurality of color (RGB) images and a plurality of Near Infrared (NIR) images, each of the plurality of RGB images captured with a visible spectrum illumination source, each of the plurality of NIR images captured with a NIR illumination source, the plurality of RGB images and the plurality of NIR images including a subset of the plurality of RGB images and a subset of the plurality of NIR images, respectively, each subset of the plurality of RGB images and each subset of the plurality of NIR images including an image of a respective object in a set of objects through a plurality of illumination conditions. The method can further include generating a prediction engine based on the image training data, the prediction engine configured to generate an estimated surface normal map of the user and an estimated reflectance map of the user from a single RGB image of the user and a single NIR image of the user captured from a perspective that differs by less than a threshold perspective in a period of time that is less than a threshold period of time.
In another general aspect, a computer program product includes a non-transitory storage medium including code that, when executed by processing circuitry of a computing device, causes the processing circuitry to perform a method. The method can include receiving image training data representing a plurality of color (RGB) images and a plurality of Near Infrared (NIR) images, each of the plurality of RGB images captured with a visible spectrum illumination source, each of the plurality of NIR images captured with a NIR illumination source, the plurality of RGB images and the plurality of NIR images including a subset of the plurality of RGB images and a subset of the plurality of NIR images, respectively, each subset of the plurality of RGB images and each subset of the plurality of NIR images including an image of a respective object in a set of objects in a pose by a plurality of illumination conditions. The method can further include generating a prediction engine based on the image training data, the prediction engine configured to generate an estimated surface normal map of the user and an estimated reflectance map of the user from a single RGB image of the user and a single NIR image of the user captured from a perspective that differs by less than a threshold perspective in a period of time that is less than a threshold period of time.
In another general aspect, an electronic device includes a memory and a control circuit coupled to the memory. The control circuit can be configured to receive image training data representing a plurality of color (RGB) images and a plurality of Near Infrared (NIR) images, each of the plurality of RGB images captured with a visible spectrum illumination source, each of the plurality of NIR images captured with a NIR illumination source, the plurality of RGB images and the plurality of NIR images including a subset of the plurality of RGB images and a subset of the plurality of NIR images, respectively, each subset of the plurality of RGB images and each subset of the plurality of NIR images including an image of a respective object in a set of objects in a pose by a plurality of illumination conditions. The control circuit can be further configured to generate a prediction engine based on the image training data, the prediction engine configured to generate an estimated surface normal map of the user and an estimated reflectance map of the user from a single RGB image of the user and a single NIR image of the user captured from a perspective that differs by less than a threshold perspective in a period of time that is less than a threshold period of time.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
Drawings
FIG. 1 is a diagram illustrating an example electronic environment in which the improved techniques described herein may be implemented.
FIG. 2 is a diagram illustrating an example prediction engine configured to estimate surface normals and reflectivity in an image.
FIG. 3 is a diagram illustrating an example system for generating RGB and NIR images for training a prediction engine.
Fig. 4A, 4B, and 4C are diagrams illustrating example effects of using NIR images in addition to RGB images on angular errors.
FIG. 5 is a flowchart illustrating an example method of estimating surface normals and reflectivity in an image, according to a disclosed embodiment.
Fig. 6 is a diagram illustrating an example of a computer device and a mobile computer device that can be used to implement the described techniques.
FIG. 7 is a diagram illustrating an example of a distributed computer device that can be used to implement the described techniques.
Detailed Description
Some conventional approaches to re-illuminating poorly illuminated images include performing a "shape-from-shading" operation on the color image, which recovers shape from gradual variations of shading in the image. Other conventional approaches to re-illuminating poorly illuminated images include intrinsic image decomposition techniques, in which the image is decomposed into the product of a reflectance image and a shading image.
A technical problem with the above-described conventional approaches to re-illuminating a poorly illuminated image is that they are not well suited for estimating the maps of surface normals and surface albedo used in image re-illumination. For example, using any of these techniques to determine image re-illumination leads to estimation problems for surface normal and albedo maps that are, in practice, intractable. While these problems may become tractable given the availability of a ground-truth reflectivity map, such a ground-truth map may not be available in many mobile photography situations.
According to embodiments described herein, a solution to the above-described technical problem includes determining the albedo and surface normal maps used for performing image re-illumination using images illuminated with Near Infrared (NIR) radiation in addition to RGB images of objects in a set of objects, the NIR-illuminated images being captured from substantially the same perspective as the RGB images. In some implementations, the prediction engine takes as input a single RGB image and a single NIR image and estimates the surface normals and reflectivity of the object. In some embodiments, the reflectivity includes an albedo component and a specular component.
In some implementations, the improved technique includes receiving image training data representing a plurality of color (RGB) images and a plurality of Near Infrared (NIR) images, each of the plurality of RGB images captured with a white light illumination source, each of the plurality of NIR images captured with a NIR illumination source, the plurality of RGB images and the plurality of NIR images including a subset of the plurality of RGB images and a subset of the plurality of NIR images, respectively, each subset of the plurality of RGB images and each subset of the plurality of NIR images including images of a respective object of a set of objects in a plurality of poses under a plurality of illumination conditions; and generating a prediction engine based on the image training data, the prediction engine configured to generate an estimated surface normal map of the user and an estimated reflectivity map of the user from a single RGB image of the user and a single NIR image of the user, the single RGB image of the user and the single NIR image of the user being captured simultaneously from substantially the same perspective.
In some implementations, prior to generating the prediction engine, the improved technique includes, for each of a set of objects, performing a semantic segmentation operation on RGB images in a corresponding subset of the plurality of RGB images to generate a label image of the object, the label image having a specified number of categories into which each of the plurality of pixels of the RGB image is classified, the label image of the object being included in the image training data.
In some embodiments, the single RGB image of the user and the single NIR image of the user are captured substantially simultaneously. In some implementations, in such a scenario, the RGB illumination used in image capture need not be generated by a dedicated RGB illumination source, but can be arbitrary illumination.
In some embodiments, each subset of the plurality of RGB images and each subset of the plurality of NIR images are captured using an image capture device including a plurality of RGB illumination sources, a plurality of NIR illumination sources, and RGB and NIR illumination detectors, the plurality of RGB and NIR illumination sources being arranged in a geometric pattern around the RGB and NIR illumination detectors. In some embodiments, the plurality of RGB illumination sources and the plurality of NIR illumination sources are arranged at corners of a rectangle surrounding the RGB and NIR illumination detectors. In some embodiments, each of the plurality of illumination conditions includes one of the plurality of RGB illumination sources and one of the plurality of NIR illumination sources producing illumination, with all other illumination sources of the plurality of RGB illumination sources and all other illumination sources of the plurality of NIR illumination sources not producing illumination. In some implementations, the RGB and NIR illumination detectors include a first NIR camera, a second NIR camera, and an RGB camera that are offset by an amount less than a specified offset threshold. In some embodiments, the image capture device further includes an NIR dot projector configured to project a dot speckle pattern onto the object, the dot speckle pattern being staggered in time with illumination emitted by the NIR illumination sources of the plurality of NIR illumination sources.
In some implementations, the prediction engine includes a first branch configured to generate a surface normal map and a second branch configured to output a predicted reflectivity map. In some implementations, the prediction engine includes a neural network having a UNet encoder-decoder architecture with skip connections, the architecture including an encoder and a decoder; the encoder includes a set of blocks, each including a set of convolutional layers and a set of ReLU activation layers, and the decoder is configured to output the surface normal map in the first branch and the predicted reflectivity map in the second branch. In some implementations, generating the prediction engine includes generating a photometric loss based on a rendering of the estimated surface normal map and the estimated reflectivity map under an illumination condition of the plurality of illumination conditions. In some implementations, the estimated reflectivity map includes a diffuse reflection component and a specular reflection component, and generating the photometric loss includes generating the diffuse reflection component of the estimated reflectivity map using a Lambertian reflectivity model and generating the specular reflection component of the estimated reflectivity map using a Blinn-Phong bidirectional reflectance distribution function (BRDF). In some implementations, generating the photometric loss includes generating a binary shadow map based on a position of a light source used in generating an RGB image of the plurality of RGB images and a stereoscopic depth map, generating a rendered intensity map based on the estimated reflectivity map, and generating a Hadamard product of the binary shadow map and a difference between the rendered intensity map and the RGB image as the photometric loss. In some implementations, the improved techniques further include obtaining a stereoscopic depth map, performing a smoothing operation on the stereoscopic depth map to produce a smoothed stereoscopic depth map, and generating a stereoscopic loss based on the estimated surface normal map and gradients of the smoothed stereoscopic depth map. In some implementations, generating the stereoscopic loss includes generating an L1 norm of a difference between the estimated surface normal map and normals derived from the gradients of the smoothed stereoscopic depth map as an L1 vector loss, generating an inner product of the estimated surface normal map and the normals derived from the gradients of the smoothed stereoscopic depth map as an angular loss, and generating a difference between the L1 vector loss and the angular loss as the stereoscopic loss.
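The shadow map computation mentioned above is not spelled out in detail; a minimal sketch of one plausible approach, which tests the visibility of each surface point toward the calibrated light position by marching through the stereo depth map, is shown below. The function names, intrinsics handling, step count, and occlusion bias are illustrative assumptions, not values from this description.

```python
# Hypothetical sketch (not taken from this description) of computing a binary
# shadow map S_j from a stereo depth map and a calibrated point-light position
# by marching through the depth map toward the light.
import numpy as np

def unproject(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into camera-space 3D points (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def binary_shadow_map(depth, light_pos, fx, fy, cx, cy, n_steps=32, eps=5e-3):
    """Return S (H, W): 1 where the surface point sees the light, 0 in shadow."""
    h, w = depth.shape
    points = unproject(depth, fx, fy, cx, cy)                 # (H, W, 3)
    shadow = np.ones((h, w), dtype=np.float32)
    # March from each surface point toward the light and test for occlusion.
    for t in np.linspace(0.05, 0.95, n_steps):
        q = points + t * (light_pos[None, None, :] - points)  # sample along the ray
        z = np.clip(q[..., 2], 1e-6, None)
        u = np.clip((q[..., 0] / z) * fx + cx, 0, w - 1).astype(int)
        v = np.clip((q[..., 1] / z) * fy + cy, 0, h - 1).astype(int)
        # Occluded if the depth-map surface lies in front of the sample point.
        occluded = depth[v, u] < z - eps
        shadow[occluded] = 0.0
    return shadow
```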
A technical advantage of the disclosed embodiments is that, unlike conventional approaches, application of the improved techniques results in a tractable problem for determining the albedo and surface normal maps used in image re-illumination. Reducing the problem of generating such maps to a tractable one provides a robust solution to the image re-illumination problem described above when images are taken under poor illumination conditions.
Furthermore, this robust solution is provided even when no ground-truth data is available, owing to the novel prediction engine that uses NIR illumination in addition to conventional visible-spectrum illumination. Note that many mobile devices are able to take pictures using both visible-spectrum illumination and NIR illumination, and that these pictures are taken from the same orientation. In addition, such visible-spectrum and NIR illumination images may be acquired substantially simultaneously.
Fig. 1 is a diagram illustrating an example electronic environment 100 in which the above-described aspects may be implemented. The computer 120 is configured to train and operate a prediction engine configured to estimate surface normals and reflectivity from image data.
Computer 120 includes a network interface 122, one or more processing units 124, and memory 126. The network interface 122 includes, for example, an ethernet adapter, a token ring adapter, etc., for converting electrical and/or optical signals received from the network into electronic form for use by the computer 120. The set of processing units 124 includes one or more processing chips and/or components. Memory 126 includes volatile memory (e.g., RAM) and nonvolatile memory such as one or more ROMs, disk drives, solid state drives, and the like. The collection of processing units 124 and the memory 126 together form a control circuit configured and arranged to perform the various methods and functions as described herein.
In some implementations, one or more components of computer 120 can be or include a processor (e.g., processing unit 124) configured to process instructions stored in memory 126. Examples of such instructions as depicted in fig. 1 include an image acquisition manager 130, a semantic segmentation manager 140, and a prediction engine manager 150. Further, as illustrated in fig. 1, the memory 126 is configured to store various data, which is described with respect to a respective manager using such data.
The image acquisition manager 130 is configured to receive image training data 131. In some implementations, the image acquisition manager 130 receives the image training data 131 from the display device 170 through the network interface 122, i.e., through a network (such as network 190). In some implementations, the image acquisition manager 130 receives the image training data 131 from a local store (e.g., disk drive, flash drive, SSD, etc.).
In some implementations, the image acquisition manager 130 is further configured to crop and resize facial images from the image training data 131 to produce standard-sized portraits. By cropping each image and adjusting it to a standard size, training of the prediction engine becomes more robust because the face of each subject is located at approximately the same position.
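As an illustration of this cropping and resizing step, the sketch below assumes a face bounding box is available from some detector; the margin and output portrait size are illustrative choices rather than values specified here.

```python
# Illustrative sketch of cropping a face region and resizing it to a fixed
# portrait size. The bounding box is assumed to come from any face detector;
# the margin and output size are arbitrary example values.
from PIL import Image

def crop_and_resize(image, box, out_size=(768, 960), margin=0.25):
    """image: PIL.Image; box: (left, top, right, bottom) face bounding box."""
    left, top, right, bottom = box
    w, h = right - left, bottom - top
    # Expand the box so the portrait includes hair and shoulders.
    left = max(0, int(left - margin * w))
    right = min(image.width, int(right + margin * w))
    top = max(0, int(top - margin * h))
    bottom = min(image.height, int(bottom + margin * h))
    crop = image.crop((left, top, right, bottom))
    return crop.resize(out_size, Image.BILINEAR)
```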
The image training data 131 represents a set of facial images in the same pose under different lighting conditions. Different lighting conditions include light type (e.g., visible spectrum/RGB and NIR) and light position relative to the object. Example acquisition of images under such conditions is discussed in more detail with respect to fig. 3. In some implementations, the RGB color channels may be replaced by other color channels such as YUV, Y 'UV, YCbCr, Y' IQ.
As shown in fig. 1, the image training data 131 includes a plurality of RGB images 132(1), ..., 132(M), where M is the number of RGB images in the image training data 131. Each RGB image, e.g., image 132(1), includes pixel data 133(1) indicating the intensity and/or color per pixel.
As also shown in fig. 1, the image training data 131 includes a plurality of NIR images 134(1), ..., 134(N), where N is the number of NIR images in the image training data 131. Each NIR image, e.g., image 134(1), includes pixel data 135(1) indicating the intensity per pixel. In some embodiments, N = M.
In some embodiments, the image training data 131 includes an NIR annular light image 136. In some embodiments, the NIR annular light image 136 is captured using an NIR annular light source surrounding an NIR detector (e.g., a camera). The captured image has the same pose as that used in the images 132(1..M). In some embodiments, the NIR annular light image 136 is captured using an NIR light source that does not surround the NIR detector but is close to it, i.e., off to the side. In such embodiments, the NIR ring light may have a shape other than a ring, e.g., a small disc, a square, etc.
In some implementations, the image training data 131 includes a source depth map 138 that represents an image of a person as a set of distances from the capture device to the surface of the person. Such depth images 138 may be taken substantially simultaneously with, and in the same pose as, the RGB images 132(1..M) and/or the NIR images 134(1..N). In some implementations, the source depth map 138 is a stereoscopic depth map that is acquired using a stereoscopic dot projector when acquiring the RGB images; further details are discussed with respect to fig. 3.
The semantic segmentation manager 140 is configured to perform semantic segmentation operations on at least one of the RGB images 132(1..M) to produce semantic segmentation data 142. In some implementations, the semantic segmentation operation involves using a neural network with convolutional hidden layers and an output segmentation layer. In some implementations, the semantic segmentation operation involves an encoder/decoder structure in which the spatial resolution of the input is downsampled, producing lower-resolution feature maps that are learned to be effective at distinguishing between categories; the feature representation is then upsampled into a full-resolution segmentation map.
The semantic segmentation data 142 represents semantic segmentations that divide the image into a specified number of categories, namely a segmentation map 144. Such a segmentation map 144 may be used as an auxiliary input to a prediction engine to assist the engine in determining image shape and reflectivity. In some embodiments, the segmentation is a 6-class segmentation, i.e., there are six classes of pixels in each segmentation map.
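As an illustration of how such a 6-class segmentation map can serve as an auxiliary input, the sketch below one-hot encodes the label image and stacks it with the RGB and NIR channels; the channel layout and use of PyTorch are assumptions.

```python
# Hypothetical sketch: encode the 6-class segmentation map as one-hot planes and
# stack it with the RGB (3 channels) and NIR (1 channel) images to form the
# prediction engine's input tensor. The channel layout is an assumption.
import torch
import torch.nn.functional as F

def build_input(rgb, nir, labels, num_classes=6):
    """rgb: (3, H, W), nir: (1, H, W), labels: (H, W) int64 in [0, num_classes)."""
    onehot = F.one_hot(labels, num_classes).permute(2, 0, 1).float()  # (6, H, W)
    return torch.cat([rgb, nir, onehot], dim=0)                       # (10, H, W)
```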
The prediction engine manager 150 is configured to generate prediction engine data 153, which represents a prediction engine that generates surface normal and reflectivity maps for images illuminated with visible-spectrum and NIR illumination. In some implementations, the prediction engine manager 150 is configured to perform training operations on the image training data 131 and to implement a loss function tailored to minimize losses in the quantities relevant to generating surface normal and reflectivity maps from the visible spectrum/RGB image and the NIR image. As shown in fig. 1, the prediction engine manager 150 includes an encoder 151 and a decoder 152.
In some embodiments, the prediction engine includes a neural network having a unet encoder-decoder architecture with skip-stage connections, the unet encoder-decoder architecture including an encoder 151 and a decoder 152. In such an embodiment, encoder 151 includes a set of blocks, each of which includes a set of convolutional layers and a set of ReLU activation layers.
The encoder 151 is configured to take the image training data 131 as input to generate parameter values in the fully connected layer to be input to the decoder 152. Decoder 152 is configured to take as input the parameter values generated by encoder 151 and to generate prediction engine data 153. For example, encoder 151 receives RGB image 132 (1..m), NIR image 134 (1..n), semantic segmentation data 142, and source depth map 138, and generates intermediate quantities used by decoder 152 to generate an estimated surface normal (e.g., estimated normal data 156). The cost function in this case reflects, for example, the difference between the estimated normal data 156 and the surface normal of the same object and the pose obtained in another way, for example using a stereoscopic image of the object. More details about stereoscopic imaging are described with reference to fig. 3.
The prediction engine represented by prediction engine manager 150 is configured to generate not only the estimated surface normals of objects from RGB and NIR images, but also reflectivity maps of objects estimated from these images. To achieve such an estimation of the surface normal and reflectivity map, the decoder 152 is configured to output the surface normal map in the geometric branch and the reflectivity map in the reflectivity branch.
Prediction engine data 153 represents data generated by prediction engine manager 150 to produce surface normal and reflectance maps for images illuminated with visible and NIR illumination. As shown in FIG. 1, prediction engine data 153 includes geometric branch data 154, reflectivity branch data 155, and rendering data 161.
The geometric branch data 154 includes data related to estimating the surface normals of objects in the RGB and NIR images. The decoder 152 is configured to output normal data 156 representing an estimate of the surface normals as a vector map, i.e., a three-dimensional vector at each pixel classified as belonging to the surface of the object.
However, in training the prediction engine, the estimated normal data 156 is only one component of the prediction engine training. To achieve training, a loss function is required. In this case, the loss function is defined using the estimated normal data 156 and the reference data. As shown in fig. 1, the reference data includes stereo normal data 157; in some implementations, the stereo normal data 157 includes or is a source depth map 138. Prediction engine manager 150 is configured to generate stereo loss data 158 based on estimated normal data 156 and stereo normal data 157.
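The stereo normal data 157 can be derived from a depth map; a minimal sketch of one common way to do so, by smoothing the stereo depth and converting its spatial gradients into per-pixel normals, is shown below. The smoothing kernel, finite differences, and intrinsics handling are illustrative assumptions.

```python
# Illustrative sketch of converting a smoothed stereo depth map into per-pixel
# normals via its spatial gradients; the camera intrinsics and Gaussian blur
# size are assumptions, and the perspective correction is approximate.
import numpy as np
from scipy.ndimage import gaussian_filter

def normals_from_depth(depth, fx, fy, sigma=2.0):
    """depth: (H, W) stereo depth in camera z; returns (H, W, 3) unit normals."""
    depth = gaussian_filter(depth, sigma)            # smooth the stereo depth
    dz_dv, dz_du = np.gradient(depth)                # image-space gradients
    z = np.clip(depth, 1e-6, None)
    # Normal proportional to (-dz/dx, -dz/dy, 1) in camera coordinates,
    # using dz/dx ~= (dz/du) * fx / z for a perspective camera.
    nx = -dz_du * fx / z
    ny = -dz_dv * fy / z
    nz = np.ones_like(depth)
    n = np.stack([nx, ny, nz], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```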
The reflectivity branch data 155 includes data related to the reflectivity components estimated for the surfaces in the image of the person. For example, the reflectivity can include an albedo component and a specular reflectivity component. Albedo is defined as diffuse reflectance, i.e., the ratio of the diffusely reflected radiant flux emitted by the surface to the irradiance received by the surface; this is in contrast to the specularly reflected radiant flux reflected by the surface. Thus, as shown in fig. 1, the reflectivity branch data 155 includes estimated albedo data 159 and estimated specular reflectivity data 160.
The prediction engine uses the reflectivity branch data 155 to estimate another loss component for training the prediction engine: a photometric loss. The prediction engine estimates the photometric loss by applying a rendering model to the reflectivity component maps represented by the estimated albedo data 159 and the estimated specular reflectivity data 160 to produce diffuse and specular reflection image layers that, when added, produce an estimated image of the object. The photometric loss is defined as the difference between the estimated image and the corresponding input image 132(1..M) or 134(1..N). By optimizing the photometric and stereo losses, the prediction engine incorporates an image formation model that connects its output to images of the scene taken under known point illumination.
Thus, the rendering data 161 includes illumination model data 162 representing a reflectivity model connecting the output of the decoder 152 with images taken under known lighting conditions. For example, a reflectivity function f is introduced that gives the ratio of reflected light to incident light for a particular unit-length light vector l, view vector v, surface normal n, four-channel (RGB+NIR) albedo α, scalar specular reflection intensity ρ, and specular reflection exponent m as follows:

f(l, v, n) = α/π + ρ (n · h)^m,  (1)

where h = (l + v)/||l + v|| is the half vector. The intensity observed at a pixel due to a point light is given by:
I=f(l,v,n)(n·l)L, (2)
where L is the light intensity. Each pixel is not observed under enough distinct light directions to estimate all of the parameters in equation (1). In some embodiments, to address this issue, the specular reflection exponent is fixed at m = 30, based on previous measurements of human skin and our own observations, and only n, α, and ρ are estimated. The geometric quantities l and v and the light intensity L are determined by a calibration procedure.
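The rendering model of equations (1) and (2) can be evaluated per pixel as in the following sketch; the array shapes, broadcasting conventions, and clamping are illustrative assumptions.

```python
# Sketch of the rendering model of equations (1) and (2): a Lambertian diffuse
# term plus a Blinn-Phong specular lobe, evaluated per pixel. Array shapes and
# the use of NumPy are illustrative assumptions.
import numpy as np

def render_intensity(n, l, v, albedo, rho, L, m=30.0):
    """
    n, l, v : (H, W, 3) unit surface normals, light directions, view directions
    albedo  : (H, W, C) per-pixel albedo (C = 4 for RGB + NIR)
    rho     : (H, W, 1) specular reflection intensity
    L       : calibrated light intensity (scalar or broadcastable array)
    Returns the rendered intensity I = f(l, v, n) (n . l) L per equation (2).
    """
    h = l + v
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)            # half vector
    n_dot_l = np.clip(np.sum(n * l, axis=-1, keepdims=True), 0.0, None)
    n_dot_h = np.clip(np.sum(n * h, axis=-1, keepdims=True), 0.0, None)
    f = albedo / np.pi + rho * n_dot_h ** m                      # equation (1)
    return f * n_dot_l * L                                       # equation (2)
```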
In this context, the geometric branch predicts a normal map n̂ as the estimated normal data 156, and the reflectivity branch predicts an albedo map α̂ and a log-scale specular reflection intensity map log ρ̂ as the estimated albedo data 159 and specular reflectivity data 160, respectively. Rather than relying on ground-truth normal or reflectivity data to supervise the training of the prediction engine, the above-described stereo loss data 158 and photometric loss data 165 are combined.
The stereo loss represented by the stereo loss data 158 combines an L1 vector loss and an angular loss as follows:

E_stereo = ||n̂ − n_s||_1 − n̂ · n_s,  (3)

where n_s denotes the normal derived from the gradient of a smoothed version of the stereoscopic depth map.
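A sketch of the stereo loss of equation (3), combining the L1 vector term and the inner-product (angular) term, is shown below; the optional validity mask and the averaging over pixels are assumptions.

```python
# Sketch of the stereo loss in equation (3): an L1 vector term minus the
# per-pixel inner product between the predicted normals and the normals
# derived from the smoothed stereo depth. Masking and averaging are assumptions.
import torch

def stereo_loss(n_pred, n_stereo, valid=None):
    """n_pred, n_stereo: (B, 3, H, W) unit normals; valid: optional (B, 1, H, W) mask."""
    l1 = (n_pred - n_stereo).abs().sum(dim=1)        # L1 vector loss per pixel
    dot = (n_pred * n_stereo).sum(dim=1)             # angular (inner product) term
    per_pixel = l1 - dot                             # equation (3)
    if valid is not None:
        per_pixel = per_pixel * valid.squeeze(1)
        return per_pixel.sum() / valid.sum().clamp(min=1)
    return per_pixel.mean()
```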
The photometric loss represented by the photometric loss data 165 is calculated between each of the RGB images 132(1..M) and/or NIR images 134(1..N) and the image rendered according to equation (2) from the output of the prediction engine under the corresponding lighting condition, as follows:

E_photo = Σ_j || S_j ⊙ (Î_j − I_j) ||_1,  (4)

where I_j is the per-pixel color observed in the j-th image, Î_j is the corresponding rendered image, S_j is a binary shadow map calculated using the stereo depth map data 138 and the calibrated light positions, and ⊙ denotes the element-wise (Hadamard) product. A prior can be applied to the albedo map that encourages it to be nearly constant within a segment, as follows:

E_albedo = Σ_i Σ_{k ∈ N(i)} || α̂_i − α̂_k ||_1,  (5)

where N(i) represents a 5 × 5 neighborhood centered on pixel i. The prior may be applied to pixels classified as clothing, body, or arm in the semantic segmentation map 144.
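A sketch of the photometric loss of equation (4) and the albedo prior of equation (5) is shown below; the batched tensor layout, per-pixel averaging, handling of image borders by wrap-around, and the use of a precomputed prior mask are illustrative assumptions.

```python
# Sketch of the photometric loss of equation (4) and the albedo smoothness
# prior of equation (5). The rendered image is assumed to follow equation (2);
# border handling and normalization are simplifications.
import torch

def photometric_loss(rendered, observed, shadow):
    """rendered, observed: (B, C, H, W); shadow: (B, 1, H, W) binary map S_j."""
    return (shadow * (rendered - observed)).abs().mean()         # equation (4)

def albedo_smoothness_prior(albedo, prior_mask):
    """albedo: (B, C, H, W); prior_mask: (B, 1, H, W), 1 on clothing/body/arm pixels."""
    loss = albedo.new_zeros(())
    count = 0
    # Compare each pixel with every offset in its 5x5 neighborhood (equation (5)).
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            if dy == 0 and dx == 0:
                continue
            shifted = torch.roll(albedo, shifts=(dy, dx), dims=(2, 3))
            diff = (albedo - shifted).abs().sum(dim=1, keepdim=True)
            loss = loss + (diff * prior_mask).mean()
            count += 1
    return loss / count
```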
The total loss function is a weighted sum of the above-mentioned loss terms:

E = E_stereo + λ_p E_photo + λ_c E_albedo,  (6)

where the weights λ_p and λ_c are specified values, e.g., λ_p = 10 and λ_c = 50 in some embodiments.
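Combining the terms of equation (6) is then straightforward, as in the sketch below, which reuses the loss functions sketched above and the example weights λ_p = 10 and λ_c = 50.

```python
# Sketch of combining the loss terms of equation (6), reusing the helper
# functions sketched above; the weights follow the example values in the text.
def total_loss(n_pred, n_stereo, rendered, observed, shadow, albedo, prior_mask,
               lambda_p=10.0, lambda_c=50.0):
    e_stereo = stereo_loss(n_pred, n_stereo)
    e_photo = photometric_loss(rendered, observed, shadow)
    e_albedo = albedo_smoothness_prior(albedo, prior_mask)
    return e_stereo + lambda_p * e_photo + lambda_c * e_albedo   # equation (6)
```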
The components (e.g., modules, processing units 124) of the computer 120 can be configured to operate based on one or more platforms (e.g., one or more similar or different platforms), which can include one or more types of hardware, software, firmware, operating systems, runtime libraries, and so on. In some implementations, the components of the computer 120 can be configured to operate within a cluster of devices (e.g., a server farm). In such an implementation, the functions and processing of the components of the computer 120 can be distributed to several devices of the cluster of devices.
The components of computer 120 can be or can include any type of hardware and/or software configured to process attributes. In some implementations, one or more portions of the components shown in the components of the computer 120 in fig. 1 can be, or can include, hardware-based modules (e.g., a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), memory), firmware modules, and/or software-based modules (e.g., a computer code module, a set of computer-readable instructions capable of being executed on a computer). For example, in some implementations, one or more portions of the components of computer 120 can be or can include software modules configured to be executed by at least one processor (not shown). In some embodiments, the functionality of the components can be included in different modules and/or different components than illustrated in fig. 1, including combining the functionality illustrated as two components into a single component.
Although not shown, in some embodiments, components of computer 120 (or portions thereof) can be configured to operate within, for example, a data center (e.g., a cloud computing environment), a computer system, one or more server/host devices, and the like. In some implementations, components of computer 120 (or portions thereof) can be configured to operate within a network. Accordingly, components of computer 120 (or portions thereof) can be configured to function in various types of network environments that can include one or more devices and/or one or more server devices. For example, the network can be or can include a Local Area Network (LAN), a Wide Area Network (WAN), or the like. The network can be or can include a wireless network and/or be implemented using, for example, gateway devices, bridges, switches, and the like. The network can include one or more segments and/or can have portions based on various protocols, such as Internet Protocol (IP), and/or proprietary protocols. The network can include at least a portion of the internet.
In some implementations, one or more components of the computer 120 can be or include a processor configured to process instructions stored in memory. For example, the image acquisition manager 130 (and/or a portion thereof), the semantic segmentation manager 140 (and/or a portion thereof), and the prediction engine manager 150 (and/or a portion thereof) can be a combination of a processor and a memory configured to execute instructions associated with a process to implement one or more functions.
In some implementations, the memory 126 can be any type of memory, such as random access memory, disk drive memory, flash memory, and/or the like. In some implementations, the memory 126 can be implemented as more than one memory component (e.g., more than one RAM component or disk drive memory) associated with components of the computer 120. In some implementations, the memory 126 can be a database memory. In some implementations, the memory 126 can be or can include non-local memory. For example, the memory 126 can be or include memory shared by multiple devices (not shown). In some implementations, the memory 126 can be associated with a server device (not shown) within a network and configured to serve the components of the computer 120. As illustrated in fig. 1, the memory 126 is configured to store various data, including the image training data 131, the semantic segmentation data 142, and the prediction engine data 153.
FIG. 2 is a diagram illustrating an example prediction engine 200 configured to estimate surface normals and reflectivity in an image. As shown in fig. 2, the prediction engine 200 includes an encoder 220, a decoder 222, a geometric branch 230, a reflectivity branch 232, and a renderer 260. The prediction engine 200 shown in fig. 2 is a standard UNet with skip connections.
Encoder 220 accepts as input a label image 210(c) generated from the semantic segmentation map 144, as well as a single RGB image 210(a) and a single NIR image 210(b), as discussed with respect to fig. 1. In some embodiments, the RGB image 210(a) and the NIR image 210(b) are captured substantially simultaneously. For example, a phone camera may have both a visible-spectrum illumination source (i.e., the camera flash) and a near-infrared light source (a different flash on the phone) that are activated simultaneously by the user. Thus, the images/maps 210(a, b, c) represent the user in the same pose.
Encoder 220 and decoder 222 each consist of five blocks, each having three convolutional layers. The bottleneck has 256 channels. As shown in fig. 2, the output of decoder 222 is fed to two branches: a geometric branch 230 and a reflectivity branch 232. As described above, the geometric branch 230 estimates the surface normal map (i.e., predicted normal) 240, n̂, and the reflectivity branch 232 estimates the albedo map 242, α̂, and a log-scale specular reflection intensity map 244, log ρ̂. Both the geometric branch 230 and the reflectivity branch 232 have three convolutional layers with 32 channels and one final output layer.
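A simplified sketch of this architecture is shown below; the input layout (RGB + NIR + 6-class one-hot labels = 10 channels), the intermediate channel widths other than the 256-channel bottleneck, and the use of max pooling and bilinear upsampling are assumptions not specified in the text.

```python
# Simplified PyTorch sketch of the UNet described above: five encoder blocks of
# three conv layers each, a 256-channel bottleneck, a skip-connected decoder,
# and two heads with three 32-channel conv layers plus an output layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch):
    layers = []
    for i in range(3):                                  # three conv + ReLU layers
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

def head(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, out_ch, 3, padding=1))            # final output layer

class DarkFlashUNet(nn.Module):
    def __init__(self, in_ch=10, widths=(32, 64, 96, 128, 256)):
        super().__init__()
        self.enc = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.enc.append(conv_block(ch, w))
            ch = w
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = nn.ModuleList()
        for w in reversed(widths[:-1]):                 # skip-connected decoder
            self.dec.append(conv_block(ch + w, w))
            ch = w
        self.normal_head = head(ch, 3)                  # predicted normal map
        self.reflect_head = head(ch, 5)                 # 4-ch albedo + log specular

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < len(self.enc) - 1:
                skips.append(x)
                x = self.pool(x)
        for block, skip in zip(self.dec, reversed(skips)):
            x = self.up(x)
            x = block(torch.cat([x, skip], dim=1))
        normal = F.normalize(self.normal_head(x), dim=1)
        refl = self.reflect_head(x)
        return normal, refl[:, :4], refl[:, 4:]

# Example usage (input spatial size should be divisible by 16):
# net = DarkFlashUNet()
# normal, albedo, log_spec = net(torch.randn(1, 10, 256, 256))
```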
During training, prediction engine 200 uses prediction normal 240 and observed normal 242 to produce a stereo loss. The observed normals 242 can be taken from a stereoscopic depth map that is input with the images 210 (a, b, c) and acquired with the training data.
During training, prediction engine 200 uses a diffuse reflection layer 270, a specular reflection layer 272, a shadow map 280 (i.e., S_j in equation (4)) input into the prediction engine 200, and a one-light-at-a-time (OLAT) image 284 to produce the photometric loss. The prediction engine 200 generates layers 270 and 272 based on the light position 250 used in capturing the OLAT image 284 and the diffuse and specular reflection rendering model of equations (1) and (2) in the renderer 260. The prediction engine 200 applies the light position 250 and the renderer 260 to the predicted normal 240, the diffuse albedo map 242, and the specular reflection intensity map 244 to produce the diffuse reflection layer 270 and the specular reflection layer 272. The sum of the diffuse reflection layer 270 and the specular reflection layer 272 produces a rendered image 282, which, together with the OLAT image 284 and the shadow map 280, produces the photometric loss.
FIG. 3 is a diagram illustrating an example system 300 for generating RGB and NIR images for training a prediction engine. The system 300 includes a plurality of NIR illumination sources 310 (1, 2, 3, 4) and a plurality of RGB sources 320 (1, 2, 3, 4) and an image capture device 330.
As shown in fig. 3, a plurality of NIR illumination sources 310 (1, 2, 3, 4) and a plurality of RGB sources 320 (1, 2, 3, 4) are arranged at corners of a rectangle where the image capturing apparatus 330 is located at the center. In some arrangements, the number of NIR illumination sources is different from the number of RGB sources. In some embodiments, the NIR illumination source and the RGB source are arranged in different geometric patterns, e.g., at the vertices of a polygon, in a circle or ellipse, along a straight line, etc.
The image capture device 330, as shown in fig. 3, includes a pair of NIR detectors (e.g., cameras) 332 (1, 2), an NIR ring light 336 surrounding the NIR detector 332 (1), and an RGB detector 334. In some embodiments, NIR ring light 336 may be placed at another location near NIR detector 332 (1). In some embodiments, the NIR ring light 336 may be replaced with a NIR illumination source of a different shape, such as a disk shape. In some implementations, the image capture device 330 further includes a pair of NIR stereoscopic spot projectors configured to project a series of spots onto the object to generate a stereoscopic normal.
Ideally, the RGB image and the NIR image are taken simultaneously and from the same perspective, so that the training images of a subject all have the same pose. However, there may be small differences in the times at which the images are taken and in their viewing angles, as the sources may not be precisely timed or exactly co-located. In some embodiments, the time difference is less than a time threshold. In some implementations, the time threshold corresponds to a single frame of video (e.g., 1/60 second at 60 fps or 1/24 second at 24 fps).
As an example, the RGB detector 334 may be a 7.0 MP RGB camera operating at 66.67 fps, together with a pair of stereoscopic 2.8 MP NIR cameras operating at 150 fps. The RGB camera and one of the NIR cameras are co-located using a plate beam splitter and a light trap. The RGB and NIR cameras in this example have linear photometric responses, and all images can be downsampled by a factor of 2 in each dimension; a central crop covering the face at 960 × 768 resolution may be used. Visible-spectrum (RGB) illumination may be provided by 4 wide-angle LED spotlights placed at the corners of a rectangle of about 1.5 m × 0.8 m (wide × high) surrounding the cameras, about 1.1 m from the object. NIR illumination may be provided by 5 NIR spotlights: one adjacent to each of the visible lights, and a flash LED located near the reference NIR camera to produce the "dark flash" input. These NIR light sources are staggered in time with a projector that emits an NIR dot speckle pattern to aid in stereo matching. A microcontroller may coordinate the triggering of the lights and cameras to ensure that only one visible light source and one NIR light source are active at any time. All light sources can be calibrated for position and intensity and geometrically treated as point sources. The light intensity term L in equation (2) accounts for these calibrated intensities and colors. Note that the NIR and visible light sources are not co-located, and thus the light vectors l used in equation (2) differ slightly between these two cases.
FIGS. 4A, 4B, and 4C are graphs 400, 430, and 460, respectively, illustrating an example effect on the mean angular error (i.e., the angular term in equation (3)) of using NIR images in addition to RGB images, compared with a baseline prediction engine modified to take only a single RGB image as input. Curves sweeping the overexposure level (400), the color temperature difference (430), and the noise level (460) all show increasing error for the RGB-only baseline, whereas the RGB+NIR approach remains substantially stable as the exposure level, color temperature difference, and noise level increase.
FIG. 5 is a flow chart depicting an example method 500 of generating re-illumination via surface normals and reflectivity estimation in accordance with the improved techniques described above. The method 500 may be performed by a software architecture described in connection with fig. 1, which resides in the memory 126 of the computer 120 and is executed by the set of processing units 124.
At 502, the image acquisition manager 130 receives image training data (e.g., image training data 131) representing a plurality of color (RGB) images (e.g., images 132(1..M)) and a plurality of Near Infrared (NIR) images (e.g., images 134(1..N)), each of the plurality of RGB images being captured with a visible spectrum illumination source (e.g., RGB illumination sources 320(1, 2, 3, 4)), each of the plurality of NIR images being captured with a NIR illumination source (e.g., NIR illumination sources 310(1, 2, 3, 4)), the plurality of RGB images and the plurality of NIR images comprising a subset of the plurality of RGB images and a subset of the plurality of NIR images, respectively, each subset of the plurality of RGB images and each subset of the plurality of NIR images comprising images of a respective object in a set of objects in a pose under a plurality of illumination conditions.
At 504, the prediction engine manager 150 generates a prediction engine (e.g., prediction engine data 153) based on the image training data, the prediction engine configured to generate an estimated surface normal map (e.g., estimated normal data 156) of the user and an estimated reflectivity map (e.g., estimated albedo data 159 and/or specular reflectivity data 160) of the user from the single RGB image of the user and the single NIR image of the user captured from perspectives that differ by less than a threshold perspective in less than a threshold period of time.
In some embodiments, the threshold time period is less than 100 milliseconds, 10 milliseconds, 1 millisecond, or less. In some implementations, the threshold viewing angle corresponds to a rotation of less than 10 degrees, 5 degrees, 2 degrees, 1 degree, 0.5 degrees, or less about any axis through the user.
In some embodiments, the prediction engine described above may be applied to a stereo refinement application. Stereo methods are good at measuring coarse geometry, but often have difficulty recovering fine surface detail. This can be overcome by refining the stereo depth using accurate high-resolution normals, which are typically estimated using photometric methods. The normals produced by the method described herein can be used to refine the depth measurements produced by an NIR spatiotemporal stereo algorithm, which can be compared with smoothing the stereo depth using a standard bilateral filter. The normals generated by the improved techniques described herein result in higher-quality reconstructions, especially around the mouth, nose, and eyes, and better recovery of fine lines and wrinkles in the skin.
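The refinement algorithm itself is not spelled out here; one plausible sketch, which nudges the depth map so that its gradients agree with the predicted normals while staying close to the stereo measurement, is shown below. The objective, optimizer, weights, and iteration count are illustrative assumptions.

```python
# Hypothetical sketch (the text does not give the algorithm) of refining a
# stereo depth map with predicted normals: gradient descent on a data term that
# keeps the depth near the stereo measurement plus a term aligning the depth
# gradients with the gradients implied by the predicted normals.
import torch

def refine_depth(stereo_depth, normals, fx, fy, iters=200, lr=1e-2, w_normal=1.0):
    """stereo_depth: (H, W); normals: (3, H, W) unit normals in camera space."""
    depth = stereo_depth.clone().requires_grad_(True)
    # Target image-space depth gradients implied by the normals (approximate).
    nx, ny, nz = normals[0], normals[1], normals[2].clamp(min=1e-3)
    gx_target = -nx / nz * stereo_depth / fx
    gy_target = -ny / nz * stereo_depth / fy
    opt = torch.optim.Adam([depth], lr=lr)
    for _ in range(iters):
        gx = depth[:, 1:] - depth[:, :-1]
        gy = depth[1:, :] - depth[:-1, :]
        loss = ((depth - stereo_depth) ** 2).mean() \
             + w_normal * ((gx - gx_target[:, 1:]) ** 2).mean() \
             + w_normal * ((gy - gy_target[1:, :]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return depth.detach()
```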
In some implementations, the above-described prediction engine may be applied to illumination adjustment to improve the lighting in a portrait, for example, by adding a virtual fill light to brighten shadowed regions of the face. The normal and reflectivity maps estimated by the method described herein can be used to render the contribution of a virtual point light to the shadowed region within the field of view; these rendered contributions may be combined with the original RGB image. The model provided by the prediction engine can achieve a convincing effect, even producing realistic specular highlights along the nasolabial folds and the tip of the nose.
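As an illustration of this relighting application, the sketch below renders the contribution of a virtual point light from the predicted normal, albedo, and specular maps (reusing the render_intensity sketch above) and adds it to the original RGB image; the light placement, falloff, and masking are illustrative assumptions.

```python
# Hypothetical sketch of the illumination-adjustment application: render the
# contribution of a virtual point light from the predicted maps and add it to
# the original RGB image. Light intensity, falloff, and masking are assumptions.
import numpy as np

def add_virtual_fill_light(rgb, points, normals, albedo_rgb, rho, light_pos,
                           view_dir, intensity=0.5, mask=None):
    """
    rgb        : (H, W, 3) image in [0, 1]
    points     : (H, W, 3) camera-space surface points (e.g., from the depth map)
    normals    : (H, W, 3) predicted unit normals
    albedo_rgb : (H, W, 3) predicted albedo; rho: (H, W, 1) specular intensity
    light_pos  : (3,) virtual light position; view_dir: (3,) unit view direction
    """
    l = light_pos[None, None, :] - points
    dist = np.linalg.norm(l, axis=-1, keepdims=True)
    l = l / np.clip(dist, 1e-6, None)
    fill = render_intensity(normals, l, view_dir, albedo_rgb, rho,
                            L=intensity / np.clip(dist, 1e-6, None) ** 2)
    if mask is not None:                    # e.g., restrict to shadowed face pixels
        fill = fill * mask[..., None]
    return np.clip(rgb + fill, 0.0, 1.0)
```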
Fig. 6 illustrates an example of a general purpose computer device 600 and a general purpose mobile computer device 650, which may be used with the techniques described here. Computer device 600 is one example configuration of computer 120 of fig. 1 and 2.
As shown in FIG. 6, computing device 600 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 650 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
Computing device 600 includes a processor 602, memory 604, a storage device 606, a high-speed interface 608 connecting to memory 604 and high-speed expansion ports 610, and a low-speed interface 612 connecting to a low-speed bus 614 and the storage device 606. Each of the components 602, 604, 606, 608, 610, and 612 is interconnected using various buses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 602 is capable of processing instructions for execution within the computing device 600, including instructions stored in the memory 604 or on the storage device 606, to display graphical information for a GUI on an external input/output device, such as a display 616 coupled to the high-speed interface 608. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. In addition, multiple computing devices 600 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multiprocessor system).
Memory 604 stores information within computing device 600. In one implementation, the memory 604 is one or more volatile memory units. In another implementation, the memory 604 is one or more non-volatile memory units. Memory 604 may also be another form of computer-readable medium, such as a magnetic or optical disk.
The storage device 606 is capable of providing mass storage for the computing device 600. In one implementation, the storage device 606 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. The computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer-or machine-readable medium, such as the memory 604, the storage device 606, or memory on processor 602.
The high-speed controller 608 manages bandwidth-intensive operations of the computing device 600, while the low-speed controller 612 manages lower-bandwidth-intensive operations. Such allocation of functions is merely exemplary. In one embodiment, the high-speed controller 608 is coupled to the memory 604, the display 616 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 610, which can accept various expansion cards (not shown). In an embodiment, the low-speed controller 612 is coupled to the storage device 606 and to the low-speed expansion port 614. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
Computing device 600 may be implemented in a number of different forms, as shown. For example, it may be implemented as a standard server 620, or multiple times in such a server bank. It may also be implemented as part of a rack server system 624. In addition, it may be implemented in a personal computer such as a laptop computer 622. Alternatively, components from computing device 600 may be combined with other components in a mobile device (not shown), such as device 650. Each of such devices may contain one or more of computing devices 600, 650, and the entire system may be made up of multiple computing devices 600, 650 in communication with each other.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software, software applications, or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other types of devices can also be used to provide interaction with a user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.
The computing system can include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Many embodiments have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the specification.
It will also be understood that when an element is referred to as being on, connected to, electrically connected to, coupled to, or electrically coupled to another element, it can be directly on, connected to, or coupled to the other element, or one or more intervening elements may be present. In contrast, when an element is referred to as being directly on, directly connected to, or directly coupled to another element, there are no intervening elements present. Although the terms directly on, directly connected to, or directly coupled to may not be used throughout the detailed description, elements that are shown as being directly on, directly connected to, or directly coupled to can be referred to as such. The claims of the present application may be amended to recite exemplary relationships described in the specification or shown in the figures.
While certain features of the described embodiments have been illustrated as described herein, many modifications, substitutions, changes, and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the embodiments. It is to be understood that they have been presented by way of example only, and not limitation, and various changes in form and details may be made. Any portion of the apparatus and/or methods described herein may be combined in any combination, except mutually exclusive combinations. The embodiments described herein can include various combinations and/or sub-combinations of the functions, components, and/or features of the different embodiments described.
Furthermore, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be removed from the described flows, and other components may be added to or removed from the described systems. Accordingly, other embodiments are within the scope of the following claims.

Claims (20)

1. A method, comprising:
receiving image training data representing a plurality of color images, each color image of the plurality of color images captured with a visible spectrum illumination source, and a plurality of Near Infrared (NIR) images, each NIR image of the plurality of NIR images captured with a NIR illumination source; and
a prediction engine is generated based on the image training data, the prediction engine configured to generate an estimated surface normal map of a user and an estimated reflectivity map of the user from a color image of the user and a NIR image of the user captured from a perspective that differs by less than a threshold perspective in a period of time that is less than a threshold period of time.
2. The method of claim 1, wherein the plurality of color images and the plurality of NIR images comprise subsets of the plurality of color images and subsets of the plurality of NIR images, respectively, each subset of the plurality of color images and each subset of the plurality of NIR images comprising images of a respective object of a set of objects in a pose under each of a plurality of lighting conditions; and
wherein the method further comprises:
before generating the prediction engine, for each object in the set of objects, performing a semantic segmentation operation on a color image in a corresponding subset of the plurality of color images to produce a label image of the object, the label image having a specified number of categories into which each of a plurality of pixels of the color image is classified, the label image of the object being included in the image training data.
3. The method of claim 2, wherein each subset of the plurality of color images and each subset of the plurality of NIR images are captured using an image capture device comprising a plurality of color illumination sources, a plurality of NIR illumination sources, a color illumination detector, and a NIR illumination detector, the plurality of color illumination sources and the plurality of NIR illumination sources being arranged in a geometric pattern around the color illumination detector and the NIR illumination detector.
4. The method of claim 3, wherein the plurality of color illumination sources and the plurality of NIR illumination sources are arranged at corners of a rectangle surrounding the color illumination detector and the NIR illumination detector.
5. The method of claim 3, wherein each of the plurality of lighting conditions comprises one color illumination source of the plurality of color illumination sources and one NIR illumination source of the plurality of NIR illumination sources producing illumination while all other color illumination sources of the plurality of color illumination sources and all other NIR illumination sources of the plurality of NIR illumination sources produce no illumination.
6. The method of claim 3, wherein the color illumination detector and the NIR illumination detector comprise a first NIR camera, a second NIR camera, and a color camera, the first NIR camera and the color camera being offset by an amount less than a specified offset threshold.
7. The method of claim 3, wherein the image capture device further comprises a NIR spot projector configured to project a speckle pattern of spots onto the object, the speckle pattern being interleaved in time with illumination emitted by a NIR illumination source of the plurality of NIR illumination sources.
8. The method of claim 1, wherein the color image of the user and the NIR image of the user are captured substantially simultaneously.
9. The method of claim 1, wherein the prediction engine comprises a first branch configured to generate a surface normal map and a second branch configured to output a predicted reflectivity map.
10. The method of claim 9, wherein the prediction engine comprises a neural network having a UNet encoder-decoder architecture with skip connections, the UNet encoder-decoder architecture comprising an encoder and a decoder, the encoder comprising a set of blocks, each block in the set of blocks comprising a set of convolutional layers and a set of ReLU activation layers, the decoder being configured to output the surface normal map in the first branch and the predicted reflectivity map in the second branch.
11. The method of claim 9, wherein generating the prediction engine comprises:
the training operations on the prediction engine are supervised using stereo and light metric losses.
12. The method of claim 11, wherein generating the prediction engine further comprises:
the photometric loss is generated based on rendering from the estimated surface normal map and the estimated reflectance map under an illumination condition of the plurality of illumination conditions.
13. The method of claim 12, wherein the estimated reflectivity map includes a diffuse reflection component and a specular reflection component, and
wherein generating the photometric loss comprises:
generating the diffuse reflection component of the estimated reflectivity map using a Lambertian reflection model; and
generating the specular reflection component of the estimated reflectivity map using a Blinn-Phong bidirectional reflectance distribution function (BRDF).
14. The method of claim 12, wherein generating the photometric loss comprises:
generating a binary shadow map based on a stereoscopic depth map and a position of a light source used to generate a color image of the plurality of color images;
generating an observed intensity map based on the estimated reflectivity map; and
generating, as the photometric loss, a Hadamard product of the binary shadow map and a difference between the observed intensity map and the color image.
15. The method of claim 12, further comprising:
acquiring a stereoscopic depth map;
performing a smoothing operation on the stereoscopic depth map to generate a smoothed stereoscopic depth map; and
generating a stereoscopic loss based on the estimated surface normal map and the gradient of the smoothed stereoscopic depth map.
16. The method of claim 15, wherein generating the stereoscopic loss comprises:
generating, as an L1 vector loss, an L1 norm of a difference between the estimated surface normal map and the gradient of the smoothed stereoscopic depth map;
generating, as an angular loss, an inner product of the estimated surface normal map and the gradient of the smoothed stereoscopic depth map; and
generating, as the stereoscopic loss, a difference between the L1 vector loss and the angular loss.
17. The method of claim 1, further comprising:
generating, using the prediction engine, the estimated surface normal map of the user and the estimated reflectivity map of the user from the color image of the user and the NIR image of the user.
18. The method of claim 17, wherein the color image of the user is a single color image and the NIR image of the user is a single NIR image.
19. A computer program product comprising a non-transitory storage medium, the computer program product comprising code that, when executed by processing circuitry, causes the processing circuitry to perform a method comprising:
receiving image training data representing a plurality of color images, each color image of the plurality of color images captured with a visible spectrum illumination source, and a plurality of Near Infrared (NIR) images, each NIR image of the plurality of NIR images captured with a NIR illumination source; and
generating a prediction engine based on the image training data, the prediction engine being configured to generate an estimated surface normal map of a user and an estimated reflectivity map of the user from a single color image of the user and a single NIR image of the user, the single color image and the single NIR image being captured from perspectives that differ by less than a threshold perspective and within a period of time that is less than a threshold period of time.
20. An apparatus, the apparatus comprising:
a memory; and
a control circuit coupled to the memory, the control circuit configured to:
receive image training data representing a plurality of color images, each color image of the plurality of color images captured with a visible spectrum illumination source, and a plurality of Near Infrared (NIR) images, each NIR image of the plurality of NIR images captured with a NIR illumination source; and
generate a prediction engine based on the image training data, the prediction engine being configured to generate an estimated surface normal map of a user and an estimated reflectivity map of the user from a color image of the user and a NIR image of the user, the color image of the user and the NIR image of the user being captured from perspectives that differ by less than a threshold perspective and within a period of time that is less than a threshold period of time.
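
The sketches below are not part of the claims; they are hedged, illustrative readings of the claimed network and losses. Claims 9 and 10 recite a prediction engine built as a UNet encoder-decoder with skip connections, whose encoder blocks combine convolutional layers with ReLU activations and whose decoder feeds a first branch for the surface normal map and a second branch for the predicted reflectivity map. The following minimal PyTorch sketch follows that structure; the network depth, channel widths, four-channel RGB+NIR input, and four-channel reflectivity head are assumptions made for illustration, not details taken from the claims.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions, each followed by a ReLU activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class TwoBranchUNet(nn.Module):
    """Encoder-decoder with skip connections; the decoder output feeds a
    normal-map head and a reflectivity-map head (layer sizes are assumptions)."""
    def __init__(self, in_ch=4, base=32):  # 4 input channels: RGB + NIR (assumption)
        super().__init__()
        self.enc1 = ConvBlock(in_ch, base)
        self.enc2 = ConvBlock(base, base * 2)
        self.enc3 = ConvBlock(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec2 = ConvBlock(base * 4 + base * 2, base * 2)
        self.dec1 = ConvBlock(base * 2 + base, base)
        self.normal_head = nn.Conv2d(base, 3, 1)    # surface normal map (nx, ny, nz)
        self.reflect_head = nn.Conv2d(base, 4, 1)   # diffuse RGB + specular weight (assumption)

    def forward(self, rgb, nir):
        x = torch.cat([rgb, nir], dim=1)            # co-located color and NIR images
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(e3), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))  # skip connection
        normals = nn.functional.normalize(self.normal_head(d1), dim=1)
        reflectance = self.reflect_head(d1)
        return normals, reflectance
```

At inference (claims 17 and 18), a single color image and a single NIR image captured from essentially the same viewpoint would be passed as `rgb` and `nir`; the normalized three-channel output stands in for the estimated surface normal map and the remaining channels for the estimated reflectivity map.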
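Claims 12 through 14 describe the photometric loss: the scene is re-rendered from the estimated surface normal map and the estimated reflectivity map under a known illumination condition, with a Lambertian model for the diffuse reflection component, a Blinn-Phong BRDF for the specular reflection component, and a binary shadow map masking the comparison against the captured color image. A minimal sketch of that computation follows, assuming per-pixel unit light and view direction maps, an L1 comparison, and a single shininess exponent; the helper names and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def photometric_loss(normals, diffuse_albedo, specular_weight, shininess,
                     light_dir, view_dir, shadow_mask, observed_rgb):
    """Image tensors are (B, C, H, W); light_dir and view_dir hold per-pixel
    unit vectors with shape (B, 3, H, W); shadow_mask is a binary map derived
    from the light position and a stereoscopic depth map."""
    n_dot_l = (normals * light_dir).sum(dim=1, keepdim=True).clamp(min=0.0)
    # Lambertian diffuse term of the estimated reflectivity map.
    diffuse = diffuse_albedo * n_dot_l
    # Blinn-Phong specular term: half vector between light and view directions.
    half_vec = F.normalize(light_dir + view_dir, dim=1)
    n_dot_h = (normals * half_vec).sum(dim=1, keepdim=True).clamp(min=0.0)
    specular = specular_weight * n_dot_h.pow(shininess)
    rendered = diffuse + specular                   # observed intensity map
    # Hadamard product of the shadow map with the rendering error (claim 14).
    residual = shadow_mask * (rendered - observed_rgb)
    return residual.abs().mean()
```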
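Claims 15 and 16 describe the stereoscopic loss: the estimated surface normal map is compared against normals derived from the gradient of a smoothed stereoscopic depth map, using an L1 vector term and an angular (inner product) term whose difference forms the loss. The sketch below makes the same tensor-layout assumptions as above; converting depth gradients to normals with an orthographic approximation is an additional assumption made for illustration, not something stated in the claims.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth):
    """Approximate surface normals from the gradient of a smoothed stereoscopic
    depth map of shape (B, 1, H, W), assuming an orthographic projection."""
    dz_dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]
    dz_dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
    dz_dx = F.pad(dz_dx, (0, 1, 0, 0))  # pad width back to W
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1))  # pad height back to H
    ones = torch.ones_like(depth)
    return F.normalize(torch.cat([-dz_dx, -dz_dy, ones], dim=1), dim=1)

def stereo_loss(pred_normals, smoothed_depth, w_angle=1.0):
    stereo_normals = normals_from_depth(smoothed_depth)
    # L1 vector loss between the two normal fields (claim 16).
    l1_vector = (pred_normals - stereo_normals).abs().sum(dim=1).mean()
    # Angular loss: the inner product is large when the normals agree,
    # so it is subtracted from the L1 term to form the stereoscopic loss.
    inner = (pred_normals * stereo_normals).sum(dim=1).mean()
    return l1_vector - w_angle * inner
```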
CN202180071805.3A 2020-11-09 2021-11-09 Dark flash normal camera Pending CN116601674A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/198,736 2020-11-09
US202063198836P 2020-11-16 2020-11-16
US63/198,836 2020-11-16
PCT/US2021/072300 WO2022099322A1 (en) 2020-11-09 2021-11-09 A dark flash normal camera

Publications (1)

Publication Number Publication Date
CN116601674A true CN116601674A (en) 2023-08-15

Family

ID=87590354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180071805.3A Pending CN116601674A (en) 2020-11-09 2021-11-09 Dark flash normal camera

Country Status (1)

Country Link
CN (1) CN116601674A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination