WO2021173637A1

WO2021173637A1 - Differentiable pipeline for simulating depth scan sensors

Info

Publication number: WO2021173637A1
Application number: PCT/US2021/019370
Authority: WO
Inventors: Benjamin PLANCHE; Rajat Vikram SINGH
Original assignee: Siemens Aktiengesellschaft; Siemens Corporation
Priority date: 2020-02-25
Filing date: 2021-02-24
Publication date: 2021-09-02

Abstract

System and method to simulate a real depth scan sensor include a differentiable 3D simulation module that receives input parameters related to a virtual scene to be scanned, a virtual emitter and a virtual camera of the depth scan sensor. A differentiable shadow mapping uses ray traces from the virtual camera and the virtual emitter and virtual images of the virtual camera virtual emitter are generated. A differentiable stereo-matching module generates a disparity map between the virtual images based on a pixel block extraction centered around a pixel of interest, and further based on a matching cost function of candidate pixel blocks of the emitter's virtual image. A simulated depth map is computed from the disparity map, wherein for each pixel, a depth value is a ratio of the disparity value from the disparity map to sensor focal length and a baseline distance between the virtual emitter and the virtual camera.

Description

DIFFERENTIABLE PIPELINE FOR SIMULATING DEPTH SCAN

SENSORS

TECHNICAL FIELD

[0001] This application relates to machine learning applied to depth scan models supporting computer vision applications. More particularly, this application relates to simulating depth scan sensors and generating realistic depth scans from virtual 3D scenes.

BACKGROUND

[0002] Since the development of low-cost commodity depth sensors (e.g., Microsoft Kinect, Structure IO, etc.), these devices have been integrated into a variety of vision-based applications. In industry, structured-light depth sensors are commonly used for the automated recognition and analysis of manufactured objects, for the surveillance and monitoring of production lines, etc.

[0003] Recent progress in computer vision has been dominated by deep neural networks trained over large amount of accurately labeled data. Collecting and annotating such datasets is however a tedious and, in some contexts, impossible task. Hence a recent surge in approaches that rely solely on synthetic data generated from 3D models for their training. Such an approach is especially common in the industry, as computer aided design (CAD) models of target objects are usually available whereas real images would be too tedious and costly to capture and annotate. Synthetic color images can usually be realistically rendered from CAD models, leveraging the advances in computer graphics (i.e., using 3D graphics engines). However, little effort has been put into the generation of realistic depth images, i.e., scans representing the distance between the depth sensor and elements in the scene, commonly used in the industry for computer vision applications. [0004] Structured-light depth sensors measure the depth information by projecting a pattern onto the scene (typically using an infrared projector), capturing the result with a dedicated camera/sensor, and finally inferring the depth by computing the disparity via stereo-matching between the original pattern image and the captured version. To avoid the tedious task of capturing a large number of real depth scan images and labeling that data for the training of machine-learning applications, noiseless synthetic depth images can easily be rendered from CAD models using any recent 3D graphics engine (most computer graphics tool compute depth maps as an intermediary step to rendering realistic color images). The challenge resides in applying proper noise to the images in order to simulate the visual quality of real depth sensors. Otherwise, the simulated depth scan images will not accurately resemble the real scans, leaving a realism gap between real sensors and simulated sensors.

[0005] Early works related to generation of realistic depth scans focused on learning the statistical distribution of the sensor noise from real scan samples, in order to apply this noise distribution to the synthetically rendered images. However, such solutions fail to model complex noise sources impacting depth sensors (shadow noise, reflectance noise, etc.). Recent work to bridge the realism gap involves CNN-based methods in which a mapping is learned from rendered to real data, directly in the image domain. Mostly based on unsupervised conditional generative adversarial networks (GANs) or style-transfer solutions, these methods still need a set of real samples to learn their mapping.

[0006] US Patent 10,901,740 describes a method in which a deep-learning-based post processing step is added to a simulation pipeline. Given real scans from the target depth sensor, a convolutional neural network is trained to edit the synthetic images to make them look more similar to their real equivalents. However, like earlier simulation methods, this post-processing step can only learn and apply statistical noise. By operating only on projected 2D images, the neural network does not have the capability to learn noise distributions linked to the properties of the original 3D scene (e.g., distance, orientation, and reflectance of the scene surfaces).

[0007] Another area of work related to depth scan sensors involves efforts to improve accuracy in the design for the next evolution of such devices. Producers of such sensors have been focused on engineering light patterns and optimizing the visual properties of their sensors in order to improve their accuracy. For instance, light patterns projected by depth sensors should typically possess criteria such as (1) high contrast to make the pattern reflection in the scene easier to capture and process, (2) a color range which would neither be too absorbed nor too reflected by common materials, and which would not be interfered with other possible light sources in the scene (e.g., infrared patterns are commonly used for indoor applications, but would fair poorly outdoor due to interfering infrared light from the sun), (3) distinct patterns that avoid pattern repetition to facilitate the stereo-matching process (i.e., stereo-matching algorithms typically work by searching correspondences between block patterns in the target images). Other visual properties of a sensor can also impact its end accuracy, such as intrinsic parameters (focal length, frame size), baseline distance between the pattern emitter and the pattern sensor, etc. To optimize all these parameters when developing a new sensor, or improving an existing one, requires expert knowledge and tedious iterations involving researchers and engineers, taking a significant amount of time resources for the solution.

SUMMARY

[0008] A system and method are disclosed to simulate a real depth scan sensor in a virtual scene using a differentiable pipeline. A differentiable 3D simulation module receives input parameters related to a virtual scene to be scanned, a virtual emitter and a virtual camera of the depth scan sensor. A differentiable shadow mapping uses ray traces from the virtual camera and virtual emitter to generate virtual images of the virtual camera virtual emitter. A differentiable stereo-matching module generates a disparity map between the virtual images based on a pixel block extraction centered around a pixel of interest, and further based on a matching cost function of candidate pixel blocks of the emitter's virtual image. A simulated depth map is computed from the disparity map, where for each pixel, a depth value is a ratio of the disparity value from the disparity map to sensor focal length and a baseline distance between the virtual emitter and the virtual camera. Parameters for the differentiable pipeline can be optimized by backpropagation of a loss value for the difference between the simulated depth map and a real depth scan image generated by the actual depth scan sensor. The differential pipeline may be deployed as a synthetic image generator in an adversarial training network.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] Non-limiting and non-exhaustive embodiments of the present embodiments are described with reference to the following FIGURES, wherein like reference numerals refer to like elements throughout the drawings unless otherwise specified.

[0010] FIG. 1 shows an example of a differentiable pipeline for simulating real depth sensors in accordance with embodiments of this disclosure.

[0011] FIG. 2 shows an example of a differentiable ray tracer to simulate the projection and capture of the depth scan sensor light pattern in accordance with embodiments of this disclosure.

[0012] FIG. 3 shows an example of shadow mapping logic applied in the differentiable 3D simulation in accordance with embodiments of this disclosure. [0013] FIGs. 4 and 5 show an example of a differentiable stereo block-matching process in accordance with embodiments of this disclosure.

[0014] FIG. 6 shows an example of an implementation of the pipeline 100 as a training data generator in accordance with embodiments of this disclosure.

[0015] FIG. 7 shows an example of a computer environment within which embodiments of the disclosure may be implemented.

DETAILED DESCRIPTION

[0016] A differentiable pipeline simulates different mechanisms of real depth sensors to render realistic depth maps. This novel pipeline relies on differentiable operations at each step of pattern projection and capture, noise application, and stereo-matching. Like an artificial neural network, the pipeline is differentiable end-to-end and can therefore be fine-tuned and optimized over a training dataset to better achieve the task of generating realistic depth images. Given a small set of real images from a target depth sensor, the pipeline can be optimized while missing or approximated parameters can be inferred.

[0017] Machine learning from synthetic data for scalable recognition processes is an objective for computer vision systems. To achieve this objective, a system and method is disclosed herein which deploys an adversarial training framework with a generative model which is a near-exact simulation of the target depth sensors. Such a machine learning network is easier to train as its parameters directly map to real phenomena and are easily constrainable.

[0018] The differentiable pipeline can also optimize parameters for an optimized design of a depth sensor without reliance on expert hand-crafted models. [0019] FIG. 1 shows an example of a differentiable pipeline for simulating real depth sensors in accordance with embodiments of this disclosure. Pipeline 100 combines simulation with machine learning. When some real target images are available, the simulation's parameters can be fine-tuned to improve realism of the simulation. Each stage applies differentiable functions for optimizing parameters. A differentiable 3D simulation 110 receives inputs 101 and generates a virtual scene 111, a pattern projection ray trace 112, and a pattern capture ray trace 113, from which a virtual camera image and a virtual emitter image is derived. A statistical noise application 120 uses a noise model 123 to generate an augmented captured image 122 from the virtual camera image 121. Differentiable stereo matching 130 stage applies block matching algorithms 132 to match augmented captured image 122 to original pattern image 131 (corresponding with virtual emitter image) to produce depth map 142.

[0020] In an embodiment for simulating a real depth scan sensor, the inputs 101 to the pipeline should be provided by the user once for each new structure-light depth sensor to be simulated. Inputs 101 can include intrinsic parameters of the sensor (focal length, frame size, etc.), baseline distance between the pattern emitter and the camera of the sensor, image of the light pattern(s) used by the sensor, statistical noise model impairing the sensor (e.g., noise as a function to the distance to its focal center). If some of these parameters are missing, but there is access to real images from the target sensor of a known scene/object, the proper values for these parameters can be inferred/optimized through an optimization procedure later described below. As examples of enoding of inputs 101, the sensor light pattern may be encoded as an image, or as a parameterizable function returning an image (e.g., a function that returns a series of dots/stripes to form the light pattern). As an image format, such an input parameter may then be processed by the differentiable pipeline 100 to edit the image according to pixel values (e.g., increasing contrast, editing the pattern at the pixel level). For the parameterizable function format, the function parameters may be optimized via controlling the thickness, color, and other characteristics of the dots/stripes.

[0021] For every new scene-object combination to be simulated, the inputs 101 provided by the user may include 3D/CAD file describing its 3D geometry (i.e., mesh data), material/reflectance information of the object. As before, the reflectance information can be inferred by the pipeline if the input information is missing.

[0022] For each depth image to be rendered, the following inputs 101 provided by the user may include position of the scene/object in the virtual 3D world, position of the sensor in the virtual 3D world, optional 3D clutter / additional objects to populate the scene.

[0023] Using the aforementioned inputs, the virtual scene and sensor are instantiated and properly placed into the simulated environment. In this differentiable simulation, the environment is defined as follows. The scene with one or more objects is defined by its 3D mesh (surface discretized into triangles) and the reflectance value of its surface to instantiate the virtual scene 111. The virtual depth scan sensor is decomposed into a pattern emitter and a capturing sensor (both parametrized with the provided intrinsic values, e.g., focal length f). The pattern emitter is simulated as a spotlight with the provided pattem(s) as light cookie(s). The pattern sensor is simulated as a camera sensitive to the emitter’s light. Both elements are separated by a baseline distance b provided as an input 101. Virtual emitter is configured to generate a virtual pattern project ray trace 112, and virtual camera generates a virtual pattern capture ray trace 113.

[0024] The virtual environment is passed to a differentiable ray tracer to simulate the projection and capture of the light pattern, as illustrated in FIG. 2. Using differentiable tools, virtual rays can be casted from the camera of virtual sensor 210 toward the scene to measure the surfaces seen by the camera. An example of a differentiable ray tracing technique is differentiable Monte Carlo ray tracing through edge sampling (Li et al, 2018). For each sub-surface (e.g., triangle) seen by the camera, another virtual ray is casted toward the pattern emitter to evaluate (a) if this sub surface receives light from the pattern emitter (or if it is in the shadow of another surface); and (b) if so, which part of the pattern image is projected onto that triangular surface. Factoring the distance to the emitter and the reflectance of the sub-surface, its light/color as seen by the virtual camera can be inferred.

[0025] More formally, the process can be defined as follows. For each pixel p_c in camera viewpoint 201, ray is projected that starts at the camera’s focal center and intersects that pixel p_c, ray being continued into the virtual 3D scene, until it intersects (or not) an 3D object in that scene. Point V represents an intersection point of the ray with any virtual object in the virtual scene. Pattern emitter pixel coordinates p_e in emitter viewpoint 211 are computed by tracing in reverse from point V back to the virtual emitter source. Because the virtual emitter is a pattern emitter (i.e., a single-point light source but with an image pattern placed in front so that the pattern is projected onto the scene by the light), if this virtual ray from V to the emitter’s light point actually reaches the emitter without any collision, pixel p_e represents the pixel of the pattern image that the virtual ray intersects. Determining pixel p_e may be according to the intrinsic and extrinsic values of the emitter to define the transformation matrix that would project any 3D point of the scene into the emitter’s viewport 2D coordinate system. Light value of a pixel, denoted as / = L(p_e), is measured according to L of sensor light pattern 220. Emitter viewpoint 211 represents shows what the virtual emitter would “see” if it were also a camera. The emitter would see the sensor light pattern 220 undistorted (i.e., identical patterns in 220 and 211), as cast over the virtual objects 22 le, 222e in the virtual 3D scene. Camera viewpoint 201 depicts the sensor light pattern 220 cast over the virtual objects 221c, 222c, from the distorted view of the virtual camera due to bias b. Color value C of a pixel p_cis computed according to the following relationship:

where: L is light vector (ray from emitter to V);

N is scene/object normal surface vector at V; a is scene/object albedo at V.

The pixel color value may be either in RGB form if the light pattern L itself is in color, or in grayscale/light intensity (e.g., for binary/grayscale patterns). The above defined pixel color C for each pixel p_c takes into account the contribution of the pattern emitter through the terms L (vector from pattern emitter to V) and / (color/light intensity from the part of the pattern that has hit V).

[0026] Repeating the ray-casting procedure for each pixel, a complete image of the projected pattern as seen by the virtual camera of sensor 301 can be obtained, shown as virtual image 121 in FIG. 1.

[0027] FIG. 3 shows an example of shadow mapping logic applied in the differentiable 3D simulation in accordance with embodiments of this disclosure. Virtual image 301 corresponds to the light pattern from virtual camera viewpoint, with pixel regions 321c, 322c corresponding to light pattern on surface of virtual objects 321, 322. Virtual image 311 corresponds to the light pattern from virtual emitter viewpoint, with pixel region 32 le, 322e corresponding to light pattern on surface of virtual objects 321, 322. With reference to the process described above with respect to FIG. 2, the following computation for shadowing is performed on a condition that a pattern occlusion is present. As shown in FIG. 3, point V on object 321 has a depth of z(V) and point V_coi on object 322 has a depth of z(V_col), where V_col represents a first point of collision when projecting a ray from pe back to V, and z(V) > z(V_col). Here, a simulation error for pixel p_c of camera viewport 301 is detected since it should appear in the shadow of object 322, shown as pixel region 322c, and there is a mismatch to corresponding pixel p_e of emitter viewport 311, which appears within pixel region 322e corresponding to intersection point V_coi on the surface of virtual object 322. The light vector L is blocked by another scene element before reaching V (occlusion), resulting in V not receiving any direct light from the emitter. This is caused by the baseline distance b separating the emitter and camera of real structured-light depth sensors. In fact, the real depth scan devices suffer from capture failure in some scene regions because of pattern occlusion/shadowing such as shown in FIG. 3.

[0028] As a corrective measure for the pipeline simulation, in response to the detected occlusion, a soft shadow mapping process is initiated to correct the simulation error. To make the shadow mapping differentiable, a sigmoid gate replaces a conditional probability distribution of the soft shadow mapping. More specifically, a ray is projected from pixel p_e back into the 3D scene until intersecting point F_coi. A soft shadow map Ms is computed according to the following relationship:

where s is a sigmoid operator applied to soften discontinuity of a hard condition for pixel occlusion (i.e., a "hard" step function for either shadow =1 if occluded, else shadow =0). The softer sigmoid function provides a softer slope that is differentiable, such that soft map Ms contains values from 0 to 1.

Color C of camera pixel p_c can then be updated according to the following relationship: C'iV_c) = (1 - M_s(p_c)) C{p_c) where Ms = 1 if pixel p_c is in the occlusion shadow, and Ms=0 if pixel p_c receives light directly from virtual emitter. By applying the soft map value to the color value C, the actual color of pixel P_c takes into account the possibility the intersection point V may be occluded from the virtual emitter. Note that while the 3D simulation is presented here as pixel per pixel for clarity, the simulation is performed using matrix operations that are leveraged to cover the whole image at once.

[0029] In summary, the soft shadow mapping as described above extends differentiable ray tracing to virtual scenes with light sources that emit patterns and images to implement a soft differentiable version of shadow mapping.

[0030] In an embodiment, statistical noise 120 may optionally be applied by pipeline 100 between the 3D simulation stage 110 and stereo matching stage 130. To do so, the user has obtained a statistical model 123 of the noise for the real sensor. Alternatively, the user has access to real images, from which the noise model can be learned and applied to the captured pattern image by pipeline 100. In an aspect, the virtual camera image 121 generated by ray tracing as described above is received by statistical noise module 120, and statistical noise 123 is applied to using the well-known reparameterization trick to keep this stage 120 of the pipeline 100 differentiable.

[0031] FIGs. 4 and 5 show an example of a differentiable stereo block-matching process in accordance with embodiments of this disclosure. With occlusions taken care of in the previous stage, the original pattern image and the captured pattern image can be compared to determine the depth of the scene. This operation consists of comparing stereo-images to measure their disparity map, from which the depth map can be inferred. As part of a differentiable pipeline, the analysis includes a differentiable aspect. While other works have proposed differentiable solutions to regress disparity maps from stereo images, they relate to CNN-based solutions that are learning based and not simulation based, requiring a training phase to properly perform the task. In an embodiment of this disclosure, a novel differentiable implementation applied to disparity regression is presented that can work in data-agnostic settings, with data-based optimization that is only optional.

[0032] As shown in FIG. 4, the captured stereo images are the pattern images as seen from the virtual camera viewpoint 401 and the virtual emitter viewpoint 411, which are based on the sensor light pattern 420 from virtual sensor 410. For an embodiment in which optional statistical noise 123 is applied, camera viewpoint image 401 would include such noise in the form of noisy pixels having altered light intensity and/or color. However, for illustrative simplicity, the noise is omitted in FIG. 4. Also, the object regions 321c, 322c, 321e, 322e enhanced in FIG. 3 are omitted in FIG. 4 to reflect how stereo matching is applied based entirely on the extraction of the sensor light pattern, without boundaries drawn for virtual objects in the virtual 3D scene.

[0033] In FIG. 5, a block matching algorithm is illustrated for the 3D matching stage. In an embodiment, images are unfolded (i.e., split into overlapping pixel blocks), and the blocks from one image (e.g., the virtual camera image) are compared to blocks in the other image (e.g., the virtual emitter image) to find the best equivalent according to a matching score. The difference of indices in the matched blocks gives a discrepancy map that can be converted into the depth map. Referring to FIG. 5, a pixel block extraction for camera pixel p_cof camera viewpoint 501 is shown. For each pixel in the camera viewpoint image 501, its equivalent in the emitter viewpoint image 511 is to be found. For robustness, the pixel block centered on the camera pixel p_c is identified and the best pixel block match in the emitter image is sought. Since the focal planes for the camera and emitter are aligned, only a block in the horizontal epipolar stripe need to be considered. Furthermore, the horizontal search range is reduced by taking into account the properties of the sensor and of the scene. Depth and disparity values are linked according to a formulation. Taking into account the depth range of the sensor or of the scene, the disparity range is limited accordingly. Stereo-data is defined by sensor original pattern image I₀ (i.e., emitter view 511) and the captured pattern image I_c after projection onto the scene (i.e., camera view 501). The following steps are executed for each pixel p_c in I_c.

(1) A k x k pixel block B_c is extracted, denoted by:

block ^size·

(2) Block candidates from I₀ are extracted. To limit their number, two hypotheses are leveraged: (a) I_c and I₀ share the same focal plane (c.f. camera and emitter aligned along baseline axis) so the candidates he in the horizontal epipolar stripe; (b) the effective range of disparity values is tied to the sensor and scene properties, i.e., the disparity range [d_min ; d_max\ = f-b being the effective depth value range as shown by ray trace

-^zmax

depths Z_min and Z_max in FIG. 4. Therefore, the list S₀ of block candidates for the stripe can be defined as:

(3) Matching costs C are computed, denoted by: C_{c d} = E(E_C, B_{o d}), e.g., with L correlation function.

The correlation function is illustrated graphically in FIG. 5 between pixel block 502 for camera view and block candidates 512 for emitter view. Based on the disparity range, the block candidates 512 are compared to pixel block 502. A matching score/cost is determined for each block candidate. Considering all the pixels in the image, a disparity cost volume 521 of the size of the image times the disparity range is determined (i.e., the cost calculation is repeated for all image pixels).

(4) To reduce the disparity cost volume in a differentiable manner, a softargmax operator is applied, denoted by: d_c = softargmax C_{c d} d with softargmax x = , b 6 1 temperature parameter controlling the pickiness

of the operator. An example of a disparity map after reduction 522 is shown in FIG. 5.

(5) The predicted disparity d to depth value z conversion is computed, denoted by: z = f b d where: z is depth (mm) b is baseline distance (mm) f is focal length (pixels) d is disparity (pixels)

An example of a depth map 523 is shown in FIG. 5, with small errors E linked to suboptimal sensor parameters (e.g., scarce pattern, pattern with ambiguities, inadequate block size for stereo matching, self-occlusion, etc.). Such errors can be corrected through optimization of the parameters using backpropagation through the differentiable pipeline, as described later below.

[0034] The stereo block-matching algorithm steps above is performed over the whole image at once, leveraging matrix operations (e.g., unfolding operations to extract the pixel blocks and stripes). Stereo block-matching algorithms typically rely on an additional refinement to achieve sub-pixel disparity accuracy. However, most refinement solutions are based on global optimization operations that are not directly differentiable. To improve the accuracy of the differentiable pipeline 100, without trading off its differentiability, the following steps are performed.

[0035] Let n_sub be an hyperparameter representing the pixel fraction to consider during refinement. A set of emitter view image versions {I_0,iji-1 ^are generated, shifting the pixels by

— — in each I_{o i}. The aforementioned stereo block-matching process is performed between I_c and ⁿsub each I_{o i} , aggregating the resulting cost volumes and regressing the refined disparity map accordingly.

[0036] Unlike CNN-based methods that leverage local and down-sampling operations, pipeline 100 has an important computational footprint. For example, applied to the whole image at once, step (2) of the stereo matching stage 130 implies the generation of a tensor of dimensions h x w x D x k x k with h x w image size, and D = d_max d_min number of effective disparity values. To optimize the process, the following solutions are implemented.

[0037] Since block-matching cost values are computed independently for each pixel p_c in I_c, it is possible to distribute their computation over several devices (GPUs), splitting I_c and I₀ into several regions (partially overlapping to ensure proper consideration of all block candidates). [0038] Unlike real-life scenarios, simulation pipeline 100 has access to the ground-truth depth values from the 3D scene. Knowing the effective depth range [z_min ; z_max] in the target scene, disparity D is limited accordingly: [d_min ; d_max\

[0039] Returning to FIG. 1 , backpropagation 150 is applied for optimization one or more input parameters 101 of a real sensor. If real images of known scenes captured by the target depth sensor are available along with the corresponding 3D virtual scene, the parameters of pipeline 100 can be optimized to render more accurate/realistic synthetic data. To start the optimization, the user specifies which parameters should be considered fixed (e.g., the pattern image, the baseline distance, the intrinsic parameters, etc.) and which parameters should be inferred/optimized (e.g., the statistical noise model, the intensity of the projected light pattern, etc.). Similar to how neural networks are trained to perform their task, the pipeline can then be optimized as follows:

For each available real image: step (1): Using the corresponding 3D scene and current pipeline parameters, render a synthetic depth image 142. step (2): Compare the real image 141 and synthetic image 142, computing their distance £ as the loss that the pipeline should minimize. step (3): Backpropagate the loss through the pipeline and update the target parameters accordingly (e.g., gradient descent). step (4): Repeat the training over the dataset until convergence.

[0040] FIG. 6 shows an example of an implementation of the pipeline 100 as a training data generator in accordance with embodiments of this disclosure. The differentiable property of pipeline 100 can be leveraged to fine-tune parameters to render more realistic depth images, or images that are more useful to the training of specific recognition models. Unsupervised techniques, e.g. adversarial frameworks, can be applied to that end. The framework shown in FIG. 6 relates to an adversarial training scheme with a generative network G for generating synthetic images XG, a network T for performing task prediction data y, and a discriminator network D for detecting real versus fake predictions d. Network G has an objective to optimize non-fixed parameters to confuse network T, which maximizes a loss function value £r(y, yo), with y=T(xo; qt). The loss £ value represents training loss that is related to the recognition task and the network T has to minimize at training time in order to optimize its parameters qt through gradient descent. The overall objective of the training scheme is denoted by: minma x

with (x_G,y_G) = G(m; 0_C).

However, even though network G is built to render realistic depth images, its parameters 0_G should be constrained to keep the generated depth images relevant. Otherwise, network G could converge toward the generation of images much too challenging and irrelevant (e.g., if the object/camera 3D positions are part of 0_G, the pipeline could learn to place the 3D models out of the visual field of the virtual sensor, returning empty images that T could not recognize).

[0041] Constraints for each parameter can simply be provided by the user, as range of valid values (e.g., for the reflectance factor of objects) or as additional loss functions to prune away irrelevant images (e.g., fixing a minimal visibility ratio for objects in the generated images). However, if a small set of real depth images are available, those can be used to automatically constrain the simulation parameters in an unsupervised manner. The discriminator network D is trained against the simulation pipeline 100 deployed as network G according to the following secondary objective:

with L_D(9_G, e_D ) = E_Xr log D(x_r; 0_D) + E_{m Xc} [log(l — D(G(m; 0_G); 0_D)]

By evaluating if the synthetic images look similar to the real ones (if available), the discriminator D automatically constrains the parameters of network G (e.g., network G would then not converge toward the generation of empty images, because it would be easy for D to predict that empty images are not from the real set). Training of adversarial network 600 is easier than previous adversarial schemes for recognition models, since network 600 relies on a generative model with pipeline 100 that is a near-exact simulation of the target depth sensors. The generative network parameters 0_G directly map to real phenomena and are easily constrainable.

[0042] FIG. 7 shows an example of a computer environment within which embodiments of the disclosure may be implemented. A computing device 710 includes a processor 715 and memory 711 (e.g., a non-transitory computer readable media) on which is stored various computer applications, modules or executable programs. In an embodiment, computing device includes one or more of the following modules: a differentiable 3D simulation module 701, a noise application module 702, differentiable stereo matching module 705, loss function module 706, and adversarial training module 707, which execute functionality of respective stages of pipeline 100, such as - differentiable 3D simulation 110, noise application 120, differentiable stereo matching 130, and loss functions 140 shown in FIG. 1, and adversarial training shown in FIG. 6.

[0043] As shown in FIG. 7, as an alternative computer implementation of local modules 701, 702, 705, 706, 707, one or more of differentiable 3D simulation module 741, a noise application module 742, differentiable stereo matching module 745, loss function module 746, and adversarial training module 747 may be deployed as cloud-based or web-based operations, or as a divided operation shared by local modules 701, 702, 705, 706, 707 and web-based modules 741, 742, 745, 746, 747.

[0044] Local storage 721 may store inputs 101, data for intermediate results, and values of loss function 140 during pipeline 100 processing. Some or all of training data for neural networks of pipeline 100 may be kept in local storage 721.

[0045] A network 760, such as a local area network (LAN), wide area network (WAN), or an internet based network, connects, via network interface 722, training data 751 to modules 701, 702, 705, 706 of computing device 710 and to cloud based modules 741, 742, 745, 746.

[0046] User interface module 714 provides an interface between modules 701, 702, 705, 706, 707 and user interface 730 devices, such as display device 731 and user input device 732. GUI engine 713 drives the display of an interactive user interface on display device 731, allowing a user to receive visualizations of analysis results and assisting user entry of inputs, learning objectives, and parameter/domain constraints for pipeline 100 via modules 701, 741, 702, 742, 705, 745, 706, 746, 707, 747.

[0047] Computer readable medium instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.

[0048] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer readable medium instructions.

[0049] The program modules, applications, computer-executable instructions, code, or the like depicted in FIG. 7 as being stored in the system memory 711 are merely illustrative and not exhaustive and that processing described as being supported by any particular module may alternatively be distributed across multiple modules or performed by a different module. In addition, various program module(s), script(s), plug-in(s), Application Programming Interface(s) (API(s)), or any other suitable computer-executable code hosted locally on the computer system 710, remote network devices storing modules 741, 742, 745, 746 and/or hosted on other computing device(s) accessible via one or more of the network(s) 760, may be provided to support functionality provided by the program modules, applications, or computer-executable code and/or additional or alternate functionality. Further, functionality may be modularized differently such that processing described as being supported collectively by the collection of program modules depicted in FIG. 7 may be performed by a fewer or greater number of modules, or functionality described as being supported by any particular module may be supported, at least in part, by another module. In addition, program modules that support the functionality described herein may form part of one or more applications executable across any number of systems or devices in accordance with any suitable computing model such as, for example, a client-server model, a peer- to-peer model, and so forth. In addition, any of the functionality described as being supported by any of the program modules depicted in FIG. 7 may be implemented, at least partially, in hardware and/or firmware across any number of devices.

[0050] It should further be appreciated that the computer system 710 may include alternate and/or additional hardware, software, or firmware components beyond those described or depicted without departing from the scope of the disclosure. More particularly, it should be appreciated that software, firmware, or hardware components depicted as forming part of the computer system 710 are merely illustrative and that some components may not be present or additional components may be provided in various embodiments. While various illustrative program modules have been depicted and described as software modules stored in system memory 711, it should be appreciated that functionality described as being supported by the program modules may be enabled by any combination of hardware, software, and/or firmware. It should further be appreciated that each of the above-mentioned modules may, in various embodiments, represent a logical partitioning of supported functionality. This logical partitioning is depicted for ease of explanation of the functionality and may not be representative of the structure of software, hardware, and/or firmware for implementing the functionality. Accordingly, it should be appreciated that functionality described as being provided by a particular module may, in various embodiments, be provided at least in part by one or more other modules. Further, one or more depicted modules may not be present in certain embodiments, while in other embodiments, additional modules not depicted may be present and may support at least a portion of the described functionality and/or additional functionality. Moreover, while certain modules may be depicted and described as sub-modules of another module, in certain embodiments, such modules may be provided as independent modules or as sub-modules of other modules.

[0051] Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the functionality and/or processing capabilities described with respect to a particular device or component may be performed by any other device or component. Further, while various illustrative implementations and architectures have been described in accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. In addition, it should be appreciated that any operation, element, component, data, or the like described herein as being based on another operation, element, component, data, or the like can be additionally based on one or more other operations, elements, components, data, or the like. Accordingly, the phrase “based on,” or variants thereof, should be interpreted as “based at least in part on.”

[0052] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Claims

CLAIMS What is claimed is:

1. A system for generating synthetic depth scans from a virtual scene, comprising: a processor; and a non-transitory memory having stored thereon modules executed by the processor, the modules comprising: a differentiable 3D simulation module configured to: receive input parameters related to generating a virtual scene to be scanned, a virtual emitter and a virtual camera to simulate a depth scan sensor to perform a virtual scan of the virtual scene; and execute a differentiable shadow mapping using ray traces from the virtual camera and the virtual emitter and computing a soft shadow map with a sigmoid operator for color correction of a pixel; generate a first virtual image representing the virtual camera viewpoint and a second virtual image representing the virtual emitter viewpoint based on the differentiable shadow mapping; and a differentiable stereo-matching module that produces a synthetic depth scan and configured to: generate a disparity map between the first and second virtual images, the disparity map based on a pixel block extraction from the first virtual image, the pixel block centered around a pixel of interest, and further based on a matching cost function of candidate pixel blocks of the second virtual image, and generate a depth map from the disparity map, wherein for each pixel, a depth value is a ratio of the disparity value from the disparity map to sensor focal length and a baseline distance between the virtual emitter and the virtual camera.

2. The system of claim 1, wherein the modules further comprise: a synthetic noise module configured to apply statistical noise to the first virtual image using a reparameterization trick.

3. The system of claim 1, wherein the differentiable stereo matching module processes at once the first and second virtual images as a whole.

4. The system of claim 1, wherein differentiable stereo matching module is further configured to perform a sub-pixel disparity refinement using a hyperparameter representing a pixel fraction, wherein the candidate block pixels of the second virtual image are shifted by the hyperparameter.

5. The system of claim 1, wherein the modules further comprise: a loss function module configured to measure a loss between the depth map and a real image captured by the depth scan sensor of the actual scene, wherein the loss is used for a backpropagation of a differentiable pipeline to optimize the input parameters used by the differentiable 3D simulation module and the differentiable stereo matching module.

6. The system of claim 1, wherein a series of depth maps are generated and used as synthetic depth images for training data in an adversarial training network, the network comprising: task prediction network to predict task prediction data, and a discriminator network to detecting real versus fake predictions, wherein the task prediction network maximizes a loss function value representing a training loss related to a recognition task for optimizing parameters through a gradient descent.

7. A method for generating a synthetic depth scan from a virtual scene, comprising: receiving input parameters related to generating a virtual scene to be scanned, a virtual emitter and a virtual camera to simulate a depth scan sensor to perform a virtual scan of the virtual scene; and executing a differentiable shadow mapping using ray traces from the virtual camera and the virtual emitter and computing a soft shadow map with a sigmoid operator for color correction of a pixel; generating a first virtual image representing the virtual camera viewpoint and a second virtual image representing the virtual emitter viewpoint based on the differentiable shadow mapping; generating a disparity map between the first and second virtual images, the disparity map based on a pixel block extraction from the first virtual image, the pixel block centered around a pixel of interest, and further based on a matching cost function of candidate pixel blocks of the second virtual image; and generating a depth map from the disparity map, wherein for each pixel, a depth value is a ratio of the disparity value from the disparity map to sensor focal length and a baseline distance between the virtual emitter and the virtual camera.

8. The method of claim 7, further comprising: applying statistical noise to the first virtual image using a reparameterization trick.

9. The method of claim 7, wherein the first and second virtual images are processed as a whole.

10. The method of claim 7, further comprising: performing a sub-pixel disparity refinement using a hyperparameter representing a pixel fraction, wherein the candidate block pixels of the second virtual image are shifted by the hyperparameter.

11. The method of claim 7, further comprising: measuring a loss between the depth map and a real image captured by the depth scan sensor of the actual scene, wherein the loss is used for a backpropagation of a differentiable pipeline to optimize the input parameters.

12. The method of claim 7, wherein a series of depth maps are generated and used as synthetic depth images for training data in an adversarial training network, the network comprising: task prediction network to predict task prediction data, and a discriminator network to detecting real versus fake predictions, wherein the task prediction network maximizes a loss function value representing a training loss related to a recognition task for optimizing parameters through a gradient descent.