WO2018080533A1 - Real-time generation of synthetic data from structured light sensors for 3d object pose estimation - Google Patents

Real-time generation of synthetic data from structured light sensors for 3d object pose estimation Download PDF

Info

Publication number
WO2018080533A1
WO2018080533A1 (PCT/US2016/059668)
Authority
WO
WIPO (PCT)
Prior art keywords
pattern images
rendering
pattern
projector
simulation platform
Prior art date
Application number
PCT/US2016/059668
Other languages
French (fr)
Inventor
Ziyan Wu
Kai Ma
Benjamin PLANCHE
Shanhui Sun
Vivek Kumar Singh
Stefan Kluckner
Terrance Chen
Jan Ernst
Original Assignee
Siemens Aktiengesellschaft
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Priority to PCT/US2016/059668 priority Critical patent/WO2018080533A1/en
Publication of WO2018080533A1 publication Critical patent/WO2018080533A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/521 Depth or shape recovery from laser ranging, e.g. using interferometry; from the projection of structured light
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Definitions

  • the present invention relates generally to methods, systems, and apparatuses associated with the generation of synthetic data from structured light sensors.
  • the disclosed methods, systems, and apparatuses may be applied to 3D object pose estimation and similar applications.
  • an imaging sensor (e.g., a video camera) is used to acquire a single image of an object illuminated with a particular structured-light pattern.
  • an algorithm needs to be trained using deep learning techniques and large amounts of labeled data from the 3D sensor.
  • it is usually very difficult to collect large amounts of real data on-site with the target objects and accurately label their ground-truth poses. This becomes even more difficult if training requires data with expected background variations.
  • Embodiments of the present invention address and overcome one or more of the above shortcomings and drawbacks, by providing methods, systems, and apparatuses related to generating realistic synthetic 3D single-shot structured light sensor data with 3D models in realtime.
  • the techniques described herein generate depth images from 3D models of objects replicating realistic capture scenarios, thereby facilitating robust pose retrieval. These techniques utilize a computational framework to synthetically generate depth images from 3D models, incorporating realistic capture environments.
  • a data generation pipeline is described herein which is designed to be insensitive to the choice of retrieval algorithm or feature space or intermediate representation.
  • a computer-implemented method for generating synthetic data from structured light sensors for 3D object pose estimation includes using a simulation platform to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints.
  • effects may include, for example, a radial and tangential lens distortion effect, a lens scratch and gain effect, a motion blur effect, and an independent and identically distributed random noise effect.
  • depth maps are generated by matching the pattern images with a raw projector pattern using a block- matching process.
  • the method further includes performing smoothing and trimming process on the depth maps according to a measurement range corresponding to sensor specifications associated with a simulated camera used by the simulation platform in rendering the pattern images.
  • a hole-filling operation may also be performed during the smoothing and trimming process to reduce a proportion of missing data in the depth maps.
  • the simulation platform simulates surface characteristics of the 3D model using a predetermined list of corresponding materials during rendering of the pattern images.
  • the simulation platform may model spot light projectors projecting a desired high resolution pattern during rendering of the pattern images. These spot light projectors may include, for example, a red channel spot light projector, a blue channel spot light projector, and a green channel spot light projector.
  • the simulation platform may additionally (or alternatively) use background 3D models for depth data simulation when rendering the pattern images of the 3D model.
  • the simulation platform simulates one or more ambient light sources when rendering the pattern images of the 3D model.
  • the block-matching process used in the aforementioned method applies a Sum of Absolute Difference block-matching process.
  • the block-matching process may further include setting a maximum disparity number according to the pattern images; and converting sub-pixel disparity values in the pattern images into depth values based on (a) intrinsic parameters of a simulated camera used by the simulation platform in rendering the pattern images; (b) intrinsic parameters of a simulated projector used by the simulation platform in rendering the pattern images; and/or (c) a baseline distance value.
  • a calibration process may be performed on a real structured light sensor to obtain the intrinsic parameters of the simulated camera used by the simulation platform in rendering the pattern images, as well as the intrinsic parameters of the simulated projector used by the simulation platform in rendering the pattern images.
  • an article of manufacture generating synthetic data from structured light sensors for 3D object pose estimation includes a non-transitory, tangible computer-readable medium holding computer-executable instructions for performing the aforementioned method, with or without the various additional features discussed above.
  • a system for generating synthetic data from single- shot structured light sensors for 3D object pose estimation includes a simulation platform, compute shaders and a block matching component.
  • the simulation platform is configured to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints.
  • One of the compute shaders is configured to add one or more effects to the plurality of pattern images.
  • the block matching component is configured to generate a plurality of depth maps by matching the plurality of pattern images with a raw projector pattern.
  • the other compute shader is configured to perform a smoothing and trimming process on the plurality of depth maps according to a measurement range corresponding to sensor specifications associated with the simulated camera used in rendering the plurality of pattern images.
  • FIG. 1 shows an end-to-end pipeline for generating synthetic depth scans, as may be employed in some embodiments
  • FIG. 2 provides a list of the types of noise impairing structured light sensors
  • FIG. 3 provides an example classification of 3D imaging techniques and their corresponding parameters, which may be applied to the pipeline shown in FIG. 1 according to some embodiments;
  • FIG. 4 shows an example of the rendering and reconstruction results for a phase-shift multi-shot structured light sensor
  • FIG. 5 shows comparisons between the synthetic images generated by the pipeline discussed herein and images generated with state of the art techniques
  • FIG. 6 provides a comparison of depth contour maps from various sources
  • FIG. 7 shows the structure of an example Convolutional Neural Network (CNN) that is employed in some embodiments;
  • FIG. 8 illustrates an example method that applies the pipeline shown in FIG. 1 to generate synthetic data from single-shot structured light sensors, according to some embodiments;
  • FIG. 9 provides an example of a parallel processing memory architecture that may be utilized to perform computations related to execution of the pipeline discussed herein, according to some embodiments of the present invention
  • the techniques described herein provide an end-to-end pipeline that can accurately simulate the target, environment, sensor, and analytical processing modules, thereby achieving realistic synthetic depth data.
  • FIG. 1 shows an end-to-end pipeline 100 for generating synthetic depth scans, as may be employed in some embodiments.
  • This pipeline 100 can be defined as a sequence of procedures directly inspired by the underlying mechanisms performed by the sensors that are being simulated. These mechanisms may include, for example, pattern projection and capture, followed by stereo-matching between the acquired image and original pattern, and scan reconstruction.
  • the table in FIG. 2 summarizes the results of this analysis, listing the different kinds of noise impairing structured light sensors, and their sources and characteristics. This table shows how each step of the sensing process introduces its own artifacts.
  • Time of Flight (ToF) sensors also share many of the listed noise types.
  • One can observe that several types of noise are related to lighting and surface material properties (Axial Noise, Lateral Noise, Specular Surface, Non- specular Surface and Structural Noise) or to the sensor structure (Shadow Noise), impacting the projection and capture of the pattern(s). Further errors and approximations may then be introduced during the block-matching and hole-filling operations leading to the final depth output.
  • the synthetic data generation pipeline shown in FIG. 1 is built to take into account these behaviors, by using the proper rendering parameters and applying the same reconstruction procedure.
  • Simulation Platform 105 is used to reproduce the realistic pattern projection and capture mechanism.
  • Simulation Parameters 110 allow the Simulation Platform 105 to behave like a large panel of depth sensors.
  • FIG. 3 provides an example classification of 3D imaging techniques and their corresponding parameters, which may be applied to the pipeline 100 shown in FIG. 1 in some embodiments.
  • any kind of pattern can be provided as an image asset for the projection in order to adapt to the single-shot or multi-shot depth sensing device one wants to simulate.
  • the intrinsic parameters as well as the extrinsic parameters of the camera and projector are adjustable.
  • the Simulation Platform 105 performs a calibration of real structured light sensors in order to obtain their intrinsic and extrinsic parameters, as well as the reconstruction of the corresponding pattern(s) with the help of an extra red-green-blue (RGB) camera.
  • the pipeline can automatically generate a square binary version of it, followed by other different reference versions for later use in the block matching procedure, according to the image resolution of the camera.
  • once the Simulation Parameters 110 are obtained, they can be used as input to the Simulation Platform 105 in order to initialize the simulation.
  • the 3D model is then specified, for example, as a pre-existing CAD file or a file in a similar 3D modeling format.
  • the material(s) associated with the 3D model may be specified in some embodiments. Although not all 3D models come with realistic textures, the quality of the synthetic results highly depends on characteristics such as reflectance. Thus, the quality of the end-result will depend heavily on the amount and quality of material information provided to the simulation. Given also a list of viewpoints (in different formats such as projection matrices), the platform performs each pattern capture and projection, simulating realistic illumination sources and shadows, taking into account surface and material characteristics. In addition to the object model, the 3D scene is thus populated with additional items which enhance the robustness of the data collection capabilities of the system. These items include a spot light projector and a perspective camera.
  • the spot light projector uses the desired high resolution pattern as light cookie.
  • the perspective camera is configured with the intrinsic and extrinsic parameters of the real sensor, separated from the projector by the provided baseline distance in the horizontal plane of the simulated device.
  • additional light sources (e.g., ambient light) may be added to the 3D scene to simulate the effect of environmental illuminations.
  • other 3D models (e.g., ground, occluding objects, etc.) may be added to ornament the simulated scene.
  • the exact light projection/capture which the real devices are based on may be reproduced, thereby obtaining out of the 3D engine a captured image with the chosen resolution, similar to the intermediate output of the devices (e.g. the infrared captures from Microsoft Kinect or Occipital Structure).
  • the simulation pipeline shown in FIG. 1 can generally be applied to model any type of light sensor including, without limitation, single-shot structured light sensors, multi-shot structured light sensors, and time-of-flight sensors.
  • FIG. 4 shows an example of the rendering and reconstruction results for a phase-shift multi-shot structured light sensor, displaying the artifacts caused by too-fine vertical structures and reflections.
  • Images (a) depict renderings of projected patterns under realistic lighting and surface materials.
  • Image (b) is a color rendering of the target object (presented here in grey scale) and image (c) shows the ideal depth data.
  • Image (d) is the reconstructed depth map generated with the pipeline 100 shown in FIG. 1.
  • the intermediate results generated by the Simulation Platform 105 are then used as input into a Pre-Processing stage 115.
  • This stage 115 feeds the simulation results into a compute shader layer to bring them closer to the quality of real captures, which is degraded by imaging sensor noise.
  • the Pre-Processing stage 115 may add noise such as radial and tangential lens distortion, lens scratch and grain, motion blur, and independent and identically distributed random noise.
  • the technique for adding noise may vary according to the particular type(s) of noise being added; however, in general, any technique generally known in the art for adding noise may be employed by the Pre-Processing stage 115.
  • the rendered picture is then matched with its reference pattern image at Reconstruction stage 120, in order to extract the depth information from their disparity map.
  • the pattern emitted by the projector and the resulting capture from the sensor are here used as the stereo stimuli, with these two virtual eyes (the projector and the camera) being separated by the horizontal baseline distance.
  • given the baseline distance b, the depth value z is a direct function of the disparity d and of the focal length f in pixels.
  • the disparity map is computed by applying a block-matching process using small Sum of Absolute Differences (SAD) windows to find the correspondences - a simple but efficient method for pictures with highly-textured or unique patterns.
  • Correspondences may be computed by sliding the window horizontally (if the camera and projector are configured horizontally) or vertically (if the camera and projector are configured vertically), along the epipolar line (given the horizontal alignment of the two viewpoints, ensured by a rectification step).
  • the function value of SAD for the location (x; y) on the image captured by the camera may be computed by:
  • the disparity value d can be computed by:
  • each disparity value is an integer.
  • the results may be refined by interpolating between the closest matching block and its neighbors, thereby achieving a sub- pixel fine tuning.
  • the horizontal pixel-wise search range can be reduced by taking into account the characteristics of the simulated device. Indeed, based on Equation (1), the possible disparity range is directly bound to the operational depth range of the sensor. Given the minimum and maximum values the device can output, we can limit the sliding to the corresponding pixel range.
  • the depth maps undergo post-processing through another compute shader layer, where they will be smoothed and trimmed according to the measurement range from the sensor's specifications. Imitating once more the operations done by the real devices, a hole-filling step is performed to reduce the missing data proportion.
  • FIG. 5 shows comparisons between the synthetic images generated by the pipeline discussed herein and images generated with a state of the art technique known as BlenSor.
  • the pipeline's implementation of the complete image acquisition process provides a great deal of benefit in terms of quality.
  • BlenSor's synthetic image preserves all fine details which would be smoothed out by the sensor's block-matching, up to the window size. Incongruous artifacts can also be seen at the edge of surfaces, which cannot be found in real captured data.
  • FIG. 6 provides a comparison of depth contour maps from various sources. These sources include, respectively from left to right: pure depth buffer from the 3D engine, BlenSor's simulated depth data, BlenSor's noisy data (Perlin noise added by their simulator), synthetic data from the pipeline discussed herein, and captured real data at roughly the same pose. Note that there are deviations between the real chair and its CAD model. The depth contour map of BlenSor's synthetic data is very similar to the unrealistically pure depth buffer from the used 3D engine. Data generated by the pipeline described herein has contour maps much closer to those from real scans.
  • the types of backgrounds supported by the pipeline may include, without limitation, static synthetic background based on predefined geometry, dynamic synthetic background based on predefined geometry and motion, synthetic background with large amounts of random primitive shapes, and real captured background (e.g. from public datasets).
  • a core feature of the Simulation Platform 105 shown in FIG. 1 is robust support for a variety of camera poses.
  • a six-degree-of-freedom (6-DOF) camera pose recognition problem from a single 2.5D image is reformulated as an image retrieval problem.
  • N_p camera poses are discretized, the synthetic 2.5D image for each pose is generated using the pipeline, and each picture is encoded via a (low dimension) image representation with its corresponding camera pose.
  • a camera pose database for pose retrieval problems may be constructed. Given an unseen image, its image representation may be used to get its index in the saved database.
  • a K-nearest neighbor search (e.g., using a KD-Tree) is used to fulfill the indexing step.
  • camera pose recognition is performed using case-specific computer-crafted representations generated by a Convolutional Neural Network (CNN), and a bootstrapping scheme taking advantage of the other elements of the pipeline discussed herein.
  • a custom LeNet structure may be utilized in order to learn discriminative 2.5D depth image representations.
  • the proposed CNN structure is illustrated in FIG. 7.
  • the output layer of the network may be used as image representation.
  • the aforementioned definition of a discriminative Euclidean distance is enforced, using a loss function over all the CNN weights w:
  • L_triplet is the triplet loss function and L_pairwise is the pairwise loss function.
  • the last term is the regularization one to avoid overfitting.
  • a triplet is defined as (p_i, p_i^+, p_i^-), with p_i one camera pose sampling point, p_i^+ a camera pose defined as close to p_i, and p_i^- another camera pose defined as far from p_i.
  • a pair is defined as (p_i, p_i') with p_i one camera pose sampling point and p_i' its perturbation in terms of pose and noise conditions. Given f(·) the CNN-generated representation and m a margin, the triplet loss function is defined as follows:
  • L_pairwise is a Euclidean distance loss function.
  • a pair is defined as (p_i, p_i'), where p_i is one camera pose sampling point and p_i' is its perturbation in terms of pose and noise conditions.
  • the implementation is based on the Caffe framework generally known in the art.
  • the network weights w may be optimized by stochastic gradient descent on mini-batches.
  • the 6-DOF pose estimation problem of 3D objects may be addressed by considering an image retrieval pipeline, taking advantage of the power of emerging deep neural networks (such as CNNs) methodologies.
  • deep neural networks can theoretically handle extremely large amounts of data
  • the bootstrapping strategy is commonly used in training them to achieve better performance and efficiency.
  • the triplet sampling space may be extremely large, hence a quantity of available training data much higher than what a CNN can handle for each epoch. Therefore, to efficiently train the CNN, the bootstrapping may be performed after the first set of epochs.
  • the bootstrapping module captures an error case from the train/validation set, the input and queried neighbor may be labeled as p_i and p_i^-.
  • the module may find another random sample p_i^+ from the dataset and form a new triplet, correcting the model.
  • FIG. 8 illustrates an example method 800 that applies the pipeline shown in FIG. 1 to generate synthetic data from structured light sensors, according to some embodiments.
  • a simulation platform (see FIG. 1) is used to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints.
  • the simulation platform is designed to provide robust capabilities for rendering the images under a variety of conditions.
  • the simulation platform simulates surface characteristics of the 3D model using a predetermined list of corresponding materials during rendering of the pattern images.
  • the simulation platform models one or more spot light projectors projecting a desired high resolution pattern during rendering of the pattern images.
  • Colored pattern projectors can be modeled by three overlapping spot light projectors in the simulation environment. Each projector projects the pattern in a single color channel.
  • the aforementioned spot light projectors may include a red channel spot light projector, a blue channel spot light projector, and a green channel spot light projector.
  • Additional light sources (e.g., ambient light) may also be simulated when rendering the pattern images.
  • a compute shader layer adds one or more effects to the pattern images rendered at step 805. These effects may include, for example, one or more of a radial and tangential lens distortion effect, a lens scratch and gain effect, a motion blur effect. Additionally, independent and identically distributed random noise may be added to the image to simulate the imaging sensor noise.
  • a plurality of depth maps is generated by matching the pattern images with a raw projector pattern using a block-matching process. In some embodiments, this block-matching process applies a sum of absolute difference block-matching process. During the block-matching process, in one embodiment, a maximum disparity number is set according to the pattern images.
  • the sub-pixel disparity values may be generated based on that number. Then, the sub-pixel disparity values may be converted into real depth values based on, for example, the intrinsic and extrinsic parameters of the simulated camera and sensor used by the simulation platform in rendering pattern images, as well as a baseline distance value.
  • a smoothing and trimming process is performed on the depth maps according to a measurement range corresponding to sensor specifications associated with the simulated camera used by the simulation platform in rendering the pattern images.
  • a hole-filling operation is performed during the smoothing and trimming process to reduce a proportion of missing data in the depth maps.
  • the pipeline discussed herein may be applied in a variety of ways.
  • the pipeline is integrated into a diagnostics system applicable to mechanical devices (e.g., railroad cars).
  • a user uses a portable computing device with a structured light sensor to acquire images of the mechanical device. Based on these images, individual parts may be identified based on the image recognition functionality provided by the pipeline. Once a part is identified, the diagnostics system may provide relevant information to the user. This information could include, for example, the name and model number of each particular part in the image. The user, in conjunction with the diagnostic system, may then use this information to perform maintenance on the mechanical device. For example, based on the image recognition results provided by the pipeline, the diagnostics system may automatically place an order for a new replacement part or send a request to a 3D printer to enable printing of the replacement part.
  • FIG. 9 provides an example of a parallel processing memory architecture 900 that may be utilized to perform computations related to execution of the pipeline discussed herein, according to some embodiments of the present invention.
  • This architecture 900 may be used in embodiments of the present invention where NVIDIATM CUDA (or a similar parallel computing platform) is used.
  • the architecture includes a host computing unit ("host") 905 and a graphics processing unit (GPU) device ("device") 910 connected via a bus 915 (e.g., a PCIe bus).
  • the host 905 includes the central processing unit, or "CPU" (not shown in FIG. 9), and host memory 925 accessible to the CPU.
  • the device 910 includes the graphics processing unit (GPU) and its associated memory 920, referred to herein as device memory.
  • the device memory 920 may include various types of memory, each optimized for different memory usages. For example, in some embodiments, the device memory includes global memory, constant memory, and texture memory.
  • a kernel comprises parameterized code configured to perform a particular function.
  • the parallel computing platform is configured to execute these kernels in an optimal manner across the architecture 900 based on parameters, settings, and other selections provided by the user. Additionally, in some embodiments, the parallel computing platform may include additional functionality to allow for automatic processing of kernels in an optimal manner with minimal input provided by the user.
  • the architecture 900 of FIG. 9 may be used to parallelize training of a deep neural network. For example, in some embodiments, the operations of the simulation platform may be partitioned such that multiple kernels simulate different configurations simultaneously (e.g., different viewpoints, lighting, textures, materials, effects, etc.).
  • the deep learning network itself may be implemented such that various operations performed with training and use of the network are done in parallel.
  • the device 910 includes one or more thread blocks 930 which represent the computation unit of the device 910.
  • the term thread block refers to a group of threads that can cooperate via shared memory and synchronize their execution to coordinate memory accesses.
  • threads 940, 945 and 950 operate in thread block 930 and access shared memory 935.
  • thread blocks may be organized in a grid structure. A computation or series of computations may then be mapped onto this grid. For example, in embodiments utilizing CUDA, computations may be mapped on one-, two-, or three-dimensional grids. Each grid contains multiple thread blocks, and each thread block contains multiple threads. For example, in FIG. 9, the thread blocks 930 are organized in a two-dimensional grid structure with m+1 rows and n+1 columns.
  • threads in different thread blocks of the same grid cannot communicate or synchronize with each other.
  • thread blocks in the same grid can run on the same multiprocessor within the GPU at the same time.
  • the number of threads in each thread block may be limited by hardware or software constraints.
  • pipeline operations may be configured in various manners to optimize use of the parallel computing platform. For example, in some embodiments, processing of different viewpoints by the simulation platform, operations performed by the compute shaders, or operations associated with the block matching process may be partitioned over thread blocks automatically by the parallel computing platform software.
  • the individual thread blocks can be selected and configured to optimize training of the deep learning network. For example, in one embodiment, each thread block is assigned a subset of training data with overlapping values.
  • registers 955, 960, and 965 represent the fast memory available to thread block 930. Each register is only accessible by a single thread. Thus, for example, register 955 may only be accessed by thread 940. Conversely, shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Thus, shared memory 935 is designed to be accessed, in parallel, by each thread 940, 945, and 950 in thread block 930. Threads can access data in shared memory 935 loaded from device memory 920 by other threads within the same thread block (e.g., thread block 930). The device memory 920 is accessed by all blocks of the grid and may be implemented using, for example, Dynamic Random- Access Memory (DRAM).
  • Each thread can have one or more levels of memory access.
  • each thread may have three levels of memory access.
  • each thread 940, 945, 950 can read and write to its corresponding registers 955, 960, and 965.
  • Registers provide the fastest memory access to threads because there are no synchronization issues and the register is generally located close to a multiprocessor executing the thread.
  • each thread 940, 945, 950 in thread block 930 may read and write data to the shared memory 935 corresponding to that block 930.
  • the time required for a thread to access shared memory exceeds that of register access due to the need to synchronize access among all the threads in the thread block.
  • the shared memory is typically located close to the multiprocessor executing the threads.
  • the third level of memory access allows all threads on the device 910 to read and/or write to the device memory.
  • Device memory requires the longest time to access because access must be synchronized across the thread blocks operating on the device.
  • the processing of each seed point is coded such that it primarily utilizes registers and shared memory and only utilizes device memory as necessary to move data in and out of a thread block.
  • the embodiments of the present disclosure may be implemented with any combination of hardware and software.
  • standard computing platforms (e.g., servers, desktop computers, etc.)
  • the embodiments of the present disclosure may be included in an article of manufacture (e.g., one or more computer program products) having, for example, computer-readable, non-transitory media.
  • the media may have embodied therein computer readable program code for providing and facilitating the mechanisms of the embodiments of the present disclosure.
  • the article of manufacture can be included as part of a computer system or sold separately.
  • An executable application comprises code or machine readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input.
  • An executable procedure is a segment of code or machine readable instruction, sub-routine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
  • a graphical user interface comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions.
  • the GUI also includes an executable procedure or executable application.
  • the executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user.
  • the processor under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Optics & Photonics (AREA)
  • Image Analysis (AREA)

Abstract

A computer-implemented method for generating synthetic data from structured light sensors for 3D object pose estimation includes using a simulation platform to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints. One or more effects are added to the plurality of pattern images and a plurality of depth maps are generated by matching the plurality of pattern images with a raw projector pattern using a block-matching process.

Description

REAL-TIME GENERATION OF SYNTHETIC DATA FROM STRUCTURED LIGHT SENSORS FOR 3D OBJECT POSE ESTIMATION
TECHNICAL FIELD
[1] The present invention relates generally to methods, systems, and apparatuses associated with the generation of synthetic data from structured light sensors. The disclosed methods, systems, and apparatuses may be applied to 3D object pose estimation and similar applications.
BACKGROUND
[2] In single-shot structured light imaging applications, an imaging sensor (e.g., a video camera) is used to acquire a single image of an object illuminated with a particular structured-light pattern. Based on the acquired data, the shape and distance of the object may be reconstructed. In order to enable a mobile device with a structured light 3D sensor to recognize an object and estimate its 3D pose, an algorithm needs to be trained using deep learning techniques and large amounts of labeled data from the 3D sensor. In real world scenarios, it is usually very difficult to collect large amounts of real data on-site with the target objects and accurately label their ground-truth poses. This becomes even more difficult if training requires data with expected background variations.
[3] With a 3D rendering engine and the computer-aided design (CAD) models of the target objects, synthetic depth data from a simulated sensor with accurate ground-truth pose information can easily be obtained. Realistic 3D sensor data simulation with CAD data is crucial for 3D object recognition and pose recognition applications based on deep learning, where large amounts of accurately labeled data are required. However, the synthetic data generated by current environmental simulation platforms fails to capture sensor characteristics such as quantization effects, lens distortions, and structured noise, which may severely affect the performance of 3D object pose estimation algorithms trained on such data due to the large gap between synthetic data and real sensor data.
[4] Accordingly, it is desired to provide techniques for generating realistic synthetic 3D single-shot structured light sensor data with 3D models in real-time.
SUMMARY
[5] Embodiments of the present invention address and overcome one or more of the above shortcomings and drawbacks, by providing methods, systems, and apparatuses related to generating realistic synthetic 3D single-shot structured light sensor data with 3D models in real-time. Briefly, the techniques described herein generate depth images from 3D models of objects replicating realistic capture scenarios, thereby facilitating robust pose retrieval. These techniques utilize a computational framework to synthetically generate depth images from 3D models, incorporating realistic capture environments. Additionally, a data generation pipeline is described herein which is designed to be insensitive to the choice of retrieval algorithm, feature space, or intermediate representation.
[6] According to some embodiments, a computer-implemented method for generating synthetic data from structured light sensors for 3D object pose estimation includes using a simulation platform to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints. Next, one or more effects are added to the pattern images. These effects may include, for example, a radial and tangential lens distortion effect, a lens scratch and gain effect, a motion blur effect, and an independent and identically distributed random noise effect. Then, depth maps are generated by matching the pattern images with a raw projector pattern using a block-matching process. In some embodiments, the method further includes performing a smoothing and trimming process on the depth maps according to a measurement range corresponding to sensor specifications associated with a simulated camera used by the simulation platform in rendering the pattern images. A hole-filling operation may also be performed during the smoothing and trimming process to reduce a proportion of missing data in the depth maps.
[7] Various enhancements, refinements, or other modifications may be made to how the simulation is implemented in different embodiments. For example, in some embodiments the simulation platform simulates surface characteristics of the 3D model using a predetermined list of corresponding materials during rendering of the pattern images. The simulation platform may model spot light projectors projecting a desired high resolution pattern during rendering of the pattern images. These spot light projectors may include, for example, a red channel spot light projector, a blue channel spot light projector, and a green channel spot light projector. The simulation platform may additionally (or alternatively) use background 3D models for depth data simulation when rendering the pattern images of the 3D model. In some embodiments, the simulation platform simulates one or more ambient light sources when rendering the pattern images of the 3D model.
[8] In some embodiments, the block-matching process used in the aforementioned method applies a Sum of Absolute Difference block-matching process. The block-matching process may further include setting a maximum disparity number according to the pattern images; and converting sub-pixel disparity values in the pattern images into depth values based on (a) intrinsic parameters of a simulated camera used by the simulation platform in rendering the pattern images; (b) intrinsic parameters of a simulated projector used by the simulation platform in rendering the pattern images; and/or (c) a baseline distance value. Additionally, a calibration process may be performed on a real structured light sensor to obtain the intrinsic parameters of the simulated camera used by the simulation platform in rendering the pattern images, as well as the intrinsic parameters of the simulated projector used by the simulation platform in rendering the pattern images.
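For illustration, the following Python sketch shows how the disparity search range and the disparity-to-depth conversion described above could be derived from the simulated camera intrinsics, the baseline distance, and the sensor's operational depth range; all numeric values are illustrative assumptions rather than parameters from this disclosure.

```python
import numpy as np

# Illustrative sensor parameters (assumed values, not from this disclosure).
fx = 572.0                 # focal length of the simulated camera, in pixels
baseline = 0.075           # camera-projector baseline distance, in meters
z_min, z_max = 0.4, 4.0    # operational depth range from the sensor specifications

# The maximum disparity number follows from the minimum measurable depth,
# since z = fx * baseline / d (see Equation (1) in the detailed description).
max_disparity = int(np.ceil(fx * baseline / z_min))
min_disparity = int(np.floor(fx * baseline / z_max))

def disparity_to_depth(disparity):
    """Convert sub-pixel disparity values into depth values, masking invalid pixels."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.zeros_like(disparity)
    valid = disparity > 0
    depth[valid] = fx * baseline / disparity[valid]
    depth[(depth < z_min) | (depth > z_max)] = 0.0   # trim to the measurement range
    return depth

print(min_disparity, max_disparity)               # pixel-wise search range for block matching
print(disparity_to_depth([[12.0, 0.0, 53.6]]))    # example sub-pixel disparities
```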
[9] According to another aspect of the present invention, an article of manufacture for generating synthetic data from structured light sensors for 3D object pose estimation includes a non-transitory, tangible computer-readable medium holding computer-executable instructions for performing the aforementioned method, with or without the various additional features discussed above.
[10] According to other embodiments, a system for generating synthetic data from single-shot structured light sensors for 3D object pose estimation includes a simulation platform, compute shaders, and a block matching component. The simulation platform is configured to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints. One of the compute shaders is configured to add one or more effects to the plurality of pattern images. The block matching component is configured to generate a plurality of depth maps by matching the plurality of pattern images with a raw projector pattern. Finally, the other compute shader is configured to perform a smoothing and trimming process on the plurality of depth maps according to a measurement range corresponding to sensor specifications associated with the simulated camera used in rendering the plurality of pattern images.
[11] Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments that proceeds with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[12] The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
[13] FIG. 1 shows an end-to-end pipeline for generating synthetic depth scans, as may be employed in some embodiments;
[14] FIG. 2 provides a list of the types of noise impairing structured light sensors;
[15] FIG. 3 provides an example classification of 3D imaging techniques and their corresponding parameters, which may be applied to the pipeline shown in FIG. 1 according to some embodiments;
[16] FIG. 4 shows an example of the rendering and reconstruction results for a phase-shift multi-shot structured light sensor;
[17] FIG. 5 shows comparisons between the synthetic images generated by the pipeline discussed herein and images generated with state of the art techniques;
[18] FIG. 6 provides a comparison of depth contour maps from various sources;
[19] FIG. 7 shows the structure of an example Convolutional Neural Network (CNN) that is employed in some embodiments;
[20] FIG. 8 illustrates an example method that applies the pipeline shown in FIG. 1 to generate synthetic data from single-shot structured light sensors, according to some embodiments;
[21] FIG. 9 provides an example of a parallel processing memory architecture that may be utilized to perform computations related to execution of the pipeline discussed herein, according to some embodiments of the present invention.
DETAILED DESCRIPTION
[22] Systems, methods, and apparatuses are described herein which relate generally to various techniques for generating realistic synthetic 3D structured light sensor data with CAD models in real-time. Automatic pose recognition from depth images has recently seen a surge in approaches that rely solely on synthetically generated data from 3D models. However, generating realistic data for robust recognition is a difficult problem as it involves modeling vital factors such as sensor noise, material reflectance, and surface geometry. The techniques described herein utilize a framework that synthetically generates realistic data from 3D models, comprehensively including the training data preparation step used in existing state-of-the-art pose retrieval algorithms. Compared to existing approaches such as statistically modeling the sensor noise, or simulating the reconstruction pipeline solely based on geometry information, the techniques described herein provide an end-to-end pipeline that can accurately simulate the target, environment, sensor, and analytical processing modules, thereby achieving realistic synthetic depth data.
[23] FIG. 1 shows an end-to-end pipeline 100 for generating synthetic depth scans, as may be employed in some embodiments. This pipeline 100 can be defined as a sequence of procedures directly inspired by the underlying mechanisms performed by the sensors that are being simulated. These mechanisms may include, for example, pattern projection and capture, followed by stereo-matching between the acquired image and original pattern, and scan reconstruction.
[24] To realistically generate synthetic depth data, we need first to understand the causes behind the various kinds of noise one can find in the scans generated by real structured light sensors. The table in FIG. 2 summarizes the results of this analysis, listing the different kinds of noise impairing structured light sensors, and their sources and characteristics. This table shows how each step of the sensing process introduces its own artifacts. Time of Flight (ToF) sensors also share many of the listed noise types. One can observe that several types of noise are related to lighting and surface material properties (Axial Noise, Lateral Noise, Specular Surface, Non- specular Surface and Structural Noise) or to the sensor structure (Shadow Noise), impacting the projection and capture of the pattern(s). Further errors and approximations may then be introduced during the block-matching and hole-filling operations leading to the final depth output. The synthetic data generation pipeline shown in FIG. 1 is built to take into account these behaviors, by using the proper rendering parameters and applying the same reconstruction procedure.
[25] Simulation Platform 105 is used to reproduce the realistic pattern projection and capture mechanism. Simulation Parameters 110 allow the Simulation Platform 105 to behave like a large panel of depth sensors. FIG. 3 provides an example classification of 3D imaging techniques and their corresponding parameters, which may be applied to the pipeline 100 shown in FIG. 1 in some embodiments. In general, any kind of pattern can be provided as an image asset for the projection in order to adapt to the single-shot or multi-shot depth sensing device one wants to simulate. Moreover, the intrinsic parameters as well as the extrinsic parameters of the camera and projector are adjustable.
[26] In some embodiments, the Simulation Platform 105 performs a calibration of real structured light sensors in order to obtain their intrinsic and extrinsic parameters, as well as the reconstruction of the corresponding pattern(s) with the help of an extra red-green-blue (RGB) camera. Once the original pattern is obtained, the pipeline can automatically generate a square binary version of it, followed by other different reference versions for later use in the block matching procedure, according to the image resolution of the camera. Once the Simulation Parameters 110 are obtained, they can be used as input to the Simulation Platform 105 in order to initialize the simulation. The 3D model is then specified, for example, as a pre-existing CAD file or a file in a similar 3D modeling format.
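As an illustration of the reference-pattern preparation mentioned above, the short Python sketch below thresholds a stand-in recovered pattern into a square binary version and resizes it to candidate camera resolutions; the pattern itself, the threshold, and the resolutions are placeholders rather than values from this disclosure.

```python
import numpy as np
import cv2

# Stand-in for the pattern recovered during calibration: a pseudo-random dot
# image (a real pipeline would use the reconstructed projector pattern).
rng = np.random.default_rng(0)
recovered = (rng.random((480, 480)) * 255).astype(np.uint8)

# Square binary version of the recovered pattern.
_, binary_pattern = cv2.threshold(recovered, 127, 255, cv2.THRESH_BINARY)

# Reference versions matched to the simulated camera resolutions, for later
# use by the block-matching procedure.
camera_resolutions = [(640, 480), (1280, 960)]
references = {
    res: cv2.resize(binary_pattern, res, interpolation=cv2.INTER_NEAREST)
    for res in camera_resolutions
}
```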
[27] In addition to the Simulation Parameters 110 discussed above, the material(s) associated with the 3D model may be specified in some embodiments. Although not all 3D models come with realistic textures, the quality of the synthetic results highly depends on characteristics such as reflectance. Thus, the quality of the end-result will depend heavily on the amount and quality of material information provided to the simulation.
[28] Given also a list of viewpoints (in different formats such as projection matrices), the platform performs each pattern capture and projection, simulating realistic illumination sources and shadows, taking into account surface and material characteristics. In addition to the object model, the 3D scene is thus populated with additional items which enhance the robustness of the data collection capabilities of the system. These items include a spot light projector and a perspective camera. The spot light projector uses the desired high resolution pattern as a light cookie. The perspective camera is configured with the intrinsic and extrinsic parameters of the real sensor, separated from the projector by the provided baseline distance in the horizontal plane of the simulated device. In some embodiments, additional light sources (e.g., ambient light) are added to the 3D scene to simulate the effect of environmental illuminations. Additionally (or alternatively), other 3D models (e.g. ground, occluding objects, etc.) may be added to ornament the simulated scene.
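The scene setup described in paragraphs [27]-[28] can be summarized, purely for illustration, by a small configuration structure such as the Python sketch below; all field names and values are assumptions introduced here, not an interface defined by this disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VirtualSensorRig:
    """Illustrative description of one simulated capture setup."""
    pattern_path: str                                      # projector pattern used as the light cookie
    camera_intrinsics: Tuple[float, float, float, float]   # fx, fy, cx, cy (pixels)
    resolution: Tuple[int, int]                            # capture resolution (width, height)
    baseline_m: float                                      # horizontal camera-projector separation
    ambient_light: float = 0.1                             # optional environmental illumination level
    background_models: List[str] = field(default_factory=list)  # ground, occluders, ...

rig = VirtualSensorRig(
    pattern_path="projector_pattern.png",
    camera_intrinsics=(572.0, 572.0, 320.0, 240.0),
    resolution=(640, 480),
    baseline_m=0.075,
    background_models=["ground_plane.obj"],
)
# The simulation platform would then issue one pattern projection and capture
# per viewpoint (e.g., per projection matrix) against this rig.
```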
[29] By using a virtual light projector using the provided pattern(s), and a virtual camera with the proper optical characteristics, the exact light projection/capture which the real devices are based on may be reproduced, thereby obtaining out of the 3D engine a captured image with the chosen resolution, similar to the intermediate output of the devices (e.g. the infrared captures from Microsoft Kinect or Occipital Structure).
[30] The simulation pipeline shown in FIG. 1 can generally be applied to model any type of light sensor including, without limitation, single-shot structured light sensors, multi-shot structured light sensors, and time-of-flight sensors. FIG. 4 shows an example of the rendering and reconstruction results for a phase-shift multi-shot structured light sensor, displaying the artifacts caused by too-fine vertical structures and reflections. Images (a) depict renderings of projected patterns under realistic lighting and surface materials. Image (b) is a color rendering of the target object (presented here in grey scale) and image (c) shows the ideal depth data. Image (d) is the reconstructed depth map generated with the pipeline 100 shown in FIG. 1.
[31] The intermediate results generated by the Simulation Platform 105, captured in real time by the virtual camera, are then used as input into a Pre-Processing stage 115. This stage 115 feeds the simulation results into a compute shader layer to bring them closer to the quality of real captures, which is degraded by imaging sensor noise. The Pre-Processing stage 115 may add noise such as radial and tangential lens distortion, lens scratch and grain, motion blur, and independent and identically distributed random noise. The technique for adding noise may vary according to the particular type(s) of noise being added; however, in general, any technique generally known in the art for adding noise may be employed by the Pre-Processing stage 115.
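The following Python/OpenCV sketch illustrates how such a pre-processing layer could degrade a rendered capture; the kernel size and noise amplitudes are illustrative assumptions, and the radial/tangential lens distortion would in practice be applied by remapping pixels with the calibrated distortion coefficients (omitted here for brevity).

```python
import numpy as np
import cv2

def add_sensor_effects(img, rng=None):
    """Degrade a rendered capture (uint8, grayscale) with simple sensor effects."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = img.astype(np.float32)

    # Motion blur: convolution with a short horizontal line kernel.
    kernel = np.full((1, 5), 1.0 / 5.0, dtype=np.float32)
    out = cv2.filter2D(out, -1, kernel)

    # Lens scratch/grain approximated as low-amplitude multiplicative speckle.
    out *= 1.0 + 0.02 * rng.standard_normal(out.shape).astype(np.float32)

    # Independent and identically distributed additive noise.
    out += 2.0 * rng.standard_normal(out.shape).astype(np.float32)

    return np.clip(out, 0, 255).astype(np.uint8)

rendered = (np.random.default_rng(1).random((480, 640)) * 255).astype(np.uint8)
noisy = add_sensor_effects(rendered)
```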
[32] Relying on the principles of stereo vision, the rendered picture is then matched with its reference pattern image at Reconstruction stage 120, in order to extract the depth information from their disparity map. The pattern emitted by the projector and the resulting capture from the sensor are here used as the stereo stimuli, with these two virtual eyes (the projector and the camera) being separated by the horizontal baseline distance. Given the baseline distance b, the depth value z is a direct function of the disparity d:

z = \frac{b \cdot f}{d} \quad (1)

where f is the focal length in pixels:

f = \begin{cases} f_x & \text{horizontal stereo} \\ f_y & \text{vertical stereo} \end{cases} \quad (2)
[33] During the Reconstruction stage 120, the disparity map is computed by applying a block-matching process using small Sum of Absolute Differences (SAD) windows to find the correspondences - a simple but efficient method for pictures with highly-textured or unique patterns. Correspondences may be computed by sliding the window horizontally (if the camera and projector are configured horizontally) or vertically (if the camera and projector are configured vertically), along the epipolar line (given the horizontal alignment of the two viewpoints, ensured by a rectification step). The function value of SAD for the location (x, y) on the image captured by the camera may be computed by:

\mathrm{SAD}(x, y; x_t, y_t) = \sum_{i=-w}^{w} \sum_{j=-w}^{w} \left| I_s(x+i, y+j) - I_t(x_t+i, y_t+j) \right| \quad (3)

where w is the window size, I_s is the image from the camera, and I_t is the pattern image. The matched location (x_t, y_t) on the pattern image can be obtained by:

(x_t, y_t) = \arg\min_{(x_t, y_t)} \mathrm{SAD}(x, y; x_t, y_t) \quad (4)

The disparity value d can be computed by:

d = \begin{cases} |x - x_t| & \text{horizontal stereo} \\ |y - y_t| & \text{vertical stereo} \end{cases} \quad (5)
[34] Based on pixel offsets, each disparity value is an integer. The results may be refined by interpolating between the closest matching block and its neighbors, thereby achieving a sub-pixel fine tuning. To improve the performance, the horizontal pixel-wise search range can be reduced by taking into account the characteristics of the simulated device. Indeed, based on Equation (1), the possible disparity range is directly bound to the operational depth range of the sensor. Given the minimum and maximum values the device can output, we can limit the sliding to the corresponding pixel range.
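A brute-force NumPy sketch of this matching step is given below, combining the SAD cost of Equation (3), the range-limited search discussed above, a parabolic sub-pixel refinement, and the depth conversion of Equation (1); the window size, focal length, and baseline used in the toy example are assumptions for illustration only.

```python
import numpy as np

def sad_block_match(cam_img, pattern_img, d_min, d_max, w=4):
    """Dense disparity by SAD block matching along horizontal epipolar lines.

    cam_img, pattern_img: rectified images of identical size (float32).
    d_min, d_max: disparity search range derived from the sensor's depth range.
    w: half window size, i.e. blocks are (2w+1) x (2w+1) pixels.
    """
    h, width = cam_img.shape
    disparity = np.zeros((h, width), np.float32)
    for y in range(w, h - w):
        for x in range(w + d_max, width - w):
            block = cam_img[y - w:y + w + 1, x - w:x + w + 1]
            costs = np.array([
                np.abs(block - pattern_img[y - w:y + w + 1,
                                           x - d - w:x - d + w + 1]).sum()
                for d in range(d_min, d_max + 1)])
            best = int(np.argmin(costs))
            d = float(d_min + best)
            # Sub-pixel refinement: fit a parabola through the best cost and
            # its two neighbours (skipped at the ends of the search range).
            if 0 < best < len(costs) - 1:
                c_l, c_0, c_r = costs[best - 1], costs[best], costs[best + 1]
                denom = c_l - 2.0 * c_0 + c_r
                if denom > 1e-6:
                    d += 0.5 * (c_l - c_r) / denom
            disparity[y, x] = d
    return disparity

# Toy usage: a random dot pattern observed with a constant disparity of 12 px.
rng = np.random.default_rng(0)
pattern = (rng.random((64, 128)) > 0.85).astype(np.float32)
cam = np.zeros_like(pattern)
cam[:, 12:] = pattern[:, :-12]
disp = sad_block_match(cam, pattern, d_min=5, d_max=20)
valid = disp > 0
depth = np.where(valid, 572.0 * 0.075 / np.where(valid, disp, 1.0), 0.0)  # Equation (1)
```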
[35] Continuing with reference to FIG. 1, during a Post-Processing stage 125, the depth maps undergo post-processing through another compute shader layer, where they will be smoothed and trimmed according to the measurement range from the sensor's specifications. Imitating once more the operations done by the real devices, a hole-filling step is performed to reduce the missing data proportion.
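The post-processing step could, for instance, be sketched as follows in Python/OpenCV, with median smoothing, range trimming, and a simple dilation-based hole filling; the measurement range and iteration count are assumptions rather than values from this disclosure.

```python
import numpy as np
import cv2

def post_process(depth, z_min=0.4, z_max=4.0, fill_iterations=3):
    """Smooth, trim, and hole-fill a depth map (float32, meters, 0 = missing)."""
    depth = depth.astype(np.float32)

    # Smoothing (a median filter keeps depth discontinuities reasonably sharp).
    depth = cv2.medianBlur(depth, 5)

    # Trimming to the simulated sensor's measurement range.
    depth[(depth < z_min) | (depth > z_max)] = 0.0

    # Hole filling: repeatedly replace missing pixels with the maximum of their
    # valid neighbours, imitating the in-filling behaviour of the real devices.
    kernel = np.ones((3, 3), np.uint8)
    for _ in range(fill_iterations):
        dilated = cv2.dilate(depth, kernel)
        depth = np.where(depth == 0.0, dilated, depth)
    return depth

raw = np.full((480, 640), 1.5, np.float32)
raw[200:220, 300:330] = 0.0          # simulated shadow/occlusion hole
clean = post_process(raw)
```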
[36] FIG. 5 shows comparisons between the synthetic images generated by the pipeline discussed herein and images generated with a state of the art technique known as BlenSor. As shown in FIG. 5, the pipeline's implementation of the complete image acquisition process provides a great deal of benefit in terms of quality. In particular, BlenSor's synthetic image preserves all fine details which would be smoothed out by the sensor's block-matching, up to the window size. Incongruous artifacts can also be seen at the edges of surfaces, which cannot be found in real captured data.
[37] FIG. 6 provides a comparison of depth contour maps from various sources. These sources include, respectively from left to right: the pure depth buffer from the 3D engine, BlenSor's simulated depth data, BlenSor's noisy data (Perlin noise added by their simulator), synthetic data from the pipeline discussed herein, and captured real data at roughly the same pose. Note that there are deviations between the real chair and its CAD model. The depth contour map of BlenSor's synthetic data is very similar to the unrealistically pure depth buffer from the used 3D engine. Data generated by the pipeline described herein has contour maps much closer to those from real scans.
[38] Background is one of the major elements one would expect in real depth images.
Most existing synthetic depth data generation pipelines choose to ignore background addition (e.g., by alpha-compositing), which may cause significant discrepancy between synthetic data and real data and bias the learner. Background modeling is hence another key component in the pipeline discussed herein. The types of backgrounds supported by the pipeline may include, without limitation, static synthetic background based on predefined geometry, dynamic synthetic background based on predefined geometry and motion, synthetic background with large amounts of random primitive shapes, and real captured background (e.g. from public datasets).
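As an illustration of the last two background options (random primitive shapes and pre-captured background maps), the sketch below composites a background depth map behind an object's depth map in 2.5D; geometry-based backgrounds would instead be placed directly in the simulated 3D scene before rendering. All sizes and depth values here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 480, 640

# Foreground: object depth map with 0 where the object is absent.
object_depth = np.zeros((h, w), np.float32)
object_depth[180:300, 260:380] = 1.2

# Background: a far plane plus a few random box-shaped primitives
# (a real captured background map could be used here in the same way).
background = np.full((h, w), 3.5, np.float32)
for _ in range(20):
    y, x = rng.integers(0, h - 60), rng.integers(0, w - 60)
    background[y:y + 60, x:x + 60] = rng.uniform(1.5, 3.0)

# The closer surface wins at every pixel; missing object pixels fall back to
# the background so the synthetic scan no longer floats in an empty scene.
composite = np.where(object_depth > 0,
                     np.minimum(object_depth, background),
                     background)
```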
[39] A core feature of the Simulation Platform 105 shown in FIG. 1 is robust support for a variety of camera poses. According to some embodiments of the present invention, a six-degree-of-freedom (6-DOF) camera pose recognition problem from a single 2.5D image is reformulated as an image retrieval problem. During the training stage, N_p camera poses are discretized, the synthetic 2.5D image for each pose is generated using the pipeline, and each picture is encoded via a (low dimension) image representation with its corresponding camera pose. In this way, a camera pose database for pose retrieval problems may be constructed. Given an unseen image, its image representation may be used to get its index in the saved database. In some embodiments, a K-nearest neighbor search (e.g., using a KD-Tree) is used to fulfill the indexing step.
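The retrieval step can be illustrated with the short sketch below, which indexes descriptor vectors for the discretized poses in a KD-tree and recovers the pose of a query by nearest-neighbour search; the descriptors here are random stand-ins for the learned image representations discussed next.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n_poses, dim = 5000, 32

# Database built offline: one descriptor and one 6-DOF pose label per synthetic view.
database_descriptors = rng.standard_normal((n_poses, dim)).astype(np.float32)
database_poses = rng.standard_normal((n_poses, 6)).astype(np.float32)
tree = cKDTree(database_descriptors)

# Online query: encode the unseen image and look up its nearest neighbour.
query_descriptor = database_descriptors[123] + 0.01 * rng.standard_normal(dim)
_, index = tree.query(query_descriptor, k=1)
estimated_pose = database_poses[index]
```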
[40] In the camera pose recognition framework discussed above, two components play an important role in ensuring successful indexing: the 2.5D image simulation and the image representation. A data simulation process is considered fitting when it minimizes the quality gap between synthetic and acquired depth data. On the other hand, a fitting image representation should carry discriminative pose information, resulting in successful searches. "Discriminativeness" for pose recognition can be defined in terms of the distance between two image representations, which should be small when the two camera poses are close to each other and correspondingly large when they are far apart.
[41] In some embodiments, camera pose recognition is performed using case-specific, computer-crafted representations generated by a Convolutional Neural Network (CNN), together with a bootstrapping scheme taking advantage of the other elements of the pipeline discussed herein. A custom LeNet structure may be utilized in order to learn discriminative 2.5D depth image representations. The proposed CNN structure is illustrated in FIG. 7, and the output layer of the network may be used as the image representation. To guide the computer in its learning, the aforementioned definition of a discriminative Euclidean distance is enforced using a loss function over all the CNN weights $w$:

$$\mathcal{L}(w) = L_{triplet} + L_{pairwise} + \lambda\,\lVert w \rVert_2^2,$$

where $L_{triplet}$ is the triplet loss function, $L_{pairwise}$ is the pairwise loss function, and the last term is the regularization term used to avoid overfitting. A triplet is defined as $(p_i, p_i^+, p_i^-)$, with $p_i$ one camera pose sampling point, $p_i^+$ a camera pose defined as close to $p_i$, and $p_i^-$ another camera pose defined as far from $p_i$. A pair is defined as $(p_i, p_i')$, with $p_i$ one camera pose sampling point and $p_i'$ its perturbation in terms of pose and noise conditions. Given $f(\cdot)$ the CNN-generated representation and $m$ a margin, the triplet loss function is defined as

$$L_{triplet} = \sum_{(p_i,\,p_i^+,\,p_i^-)} \max\!\left(0,\; 1 - \frac{\lVert f(p_i) - f(p_i^-)\rVert_2^2}{\lVert f(p_i) - f(p_i^+)\rVert_2^2 + m}\right),$$

and $L_{pairwise}$ is a Euclidean distance loss over the pairs,

$$L_{pairwise} = \sum_{(p_i,\,p_i')} \lVert f(p_i) - f(p_i')\rVert_2^2.$$

The implementation is based on the Caffe framework generally known in the art, and the network weights $w$ may be optimized by stochastic gradient descent on mini-batches.
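The following NumPy sketch illustrates these two loss terms for a single triplet and a single pair, using the margin-based formulation written above. It is illustrative only, not the Caffe implementation referenced in the embodiment, and the weighting constants `lam` and `m` are assumed hyperparameters.

```python
import numpy as np

def triplet_loss(f_anchor, f_close, f_far, m=0.01):
    """Ratio-margin triplet loss: pushes the far pose at least a margin
    farther (in descriptor space) than the close pose."""
    d_pos = np.sum((f_anchor - f_close) ** 2)
    d_neg = np.sum((f_anchor - f_far) ** 2)
    return max(0.0, 1.0 - d_neg / (d_pos + m))

def pairwise_loss(f_pose, f_perturbed):
    """Euclidean loss pulling a pose and its perturbed/noisy rendering
    toward the same descriptor."""
    return np.sum((f_pose - f_perturbed) ** 2)

def total_loss(triplets, pairs, weights, lam=1e-4, m=0.01):
    """triplets: list of (f_i, f_i_plus, f_i_minus) descriptors;
    pairs: list of (f_i, f_i_prime) descriptors; weights: flattened CNN weights."""
    l_t = sum(triplet_loss(a, p, n, m) for a, p, n in triplets)
    l_p = sum(pairwise_loss(a, b) for a, b in pairs)
    return l_t + l_p + lam * np.sum(weights ** 2)
```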
[42] As mentioned above, the 6-DOF pose estimation problem for 3D objects may be addressed with an image retrieval pipeline, taking advantage of the power of emerging deep neural network (such as CNN) methodologies. Although deep neural networks can theoretically handle extremely large amounts of data, in practice a bootstrapping strategy is commonly used when training them to achieve better performance and efficiency. In some instances, the triplet sampling space may be extremely large, making the quantity of available training data much higher than what a CNN can handle in each epoch. Therefore, to efficiently train the CNN, bootstrapping may be performed after the first set of epochs. Once the bootstrapping module captures an error case from the train/validation set, the input and the queried neighbor are labeled as $p_i$ and $p_i^-$. The module may then find another random sample $p_i^+$ from the dataset and form a new triplet, correcting the model.
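The bootstrapping step described above can be sketched as follows; the `encode`, `retrieve_nearest`, and `pose_distance` helpers are assumed to exist (they are not defined by the embodiments), and the error threshold is an illustrative parameter.

```python
import numpy as np

def mine_hard_triplets(val_images, val_poses, encode, retrieve_nearest,
                       pose_distance, threshold, train_images, rng=np.random):
    """Form new triplets from retrieval failures on a validation set.

    Assumed helpers: encode(img) -> descriptor;
    retrieve_nearest(desc) -> (pose, db_image);
    pose_distance(p, q) -> scalar pose error.
    """
    new_triplets = []
    for img, pose in zip(val_images, val_poses):
        retrieved_pose, retrieved_img = retrieve_nearest(encode(img))
        if pose_distance(pose, retrieved_pose) > threshold:      # error case
            anchor, negative = img, retrieved_img                # p_i and p_i^-
            positive = train_images[rng.randint(len(train_images))]  # random p_i^+
            new_triplets.append((anchor, positive, negative))
    return new_triplets
```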
[43] FIG. 8 illustrates an example method 800 that applies the pipeline shown in FIG. 1 to generate synthetic data from structured light sensors, according to some embodiments. Starting at step 805, a simulation platform (see FIG. 1) is used to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints. As discussed above, the simulation platform is designed to provide robust capabilities for rendering the images under a variety of conditions. Thus, in some embodiments, the simulation platform simulates surface characteristics of the 3D model using a predetermined list of corresponding materials during rendering of the pattern images. In other embodiments, the simulation platform models one or more spot light projectors projecting a desired high resolution pattern during rendering of the pattern images. Colored pattern projectors can be modeled by three overlapping spot light projectors in the simulation environment, each projecting a pattern with a single color channel. Thus, in some embodiments, the aforementioned spot light projectors may include a red channel spot light projector, a blue channel spot light projector, and a green channel spot light projector.
Additional light sources (e.g. ambient light) can be added to the simulated environment to simulate the effect of environmental illuminations. This effect can be further refined by using background 3D models for depth data simulation.
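As a small illustration of the three-projector decomposition, the sketch below splits a colored projector pattern into three single-channel patterns, one per simulated spot light; the rendering-engine calls that would consume these textures are engine-specific and therefore omitted.

```python
import numpy as np

def split_pattern_per_channel(rgb_pattern):
    """Split an (H, W, 3) RGB projector pattern into three single-channel
    patterns, one for each simulated spot light projector (red, green, blue)."""
    channels = {}
    for i, name in enumerate(("red", "green", "blue")):
        single = np.zeros_like(rgb_pattern)
        single[..., i] = rgb_pattern[..., i]   # keep only one color channel
        channels[name] = single
    return channels
```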
[44] Continuing with reference to FIG. 8, at step 810, a compute shader layer adds one or more effects to the pattern images rendered at step 805. These effects may include, for example, one or more of a radial and tangential lens distortion effect, a lens scratch and gain effect, and a motion blur effect. Additionally, independent and identically distributed random noise may be added to the image to simulate imaging sensor noise. Next, at step 815, a plurality of depth maps is generated by matching the pattern images with a raw projector pattern using a block-matching process. In some embodiments, this block-matching process applies a sum of absolute difference block-matching process. During the block-matching process, in one embodiment, a maximum disparity number is set according to the pattern images, and sub-pixel disparity values are then generated based on that number. The sub-pixel disparity values may then be converted into real depth values based on, for example, the intrinsic and extrinsic parameters of the simulated camera and projector used by the simulation platform in rendering the pattern images, as well as a baseline distance value.
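The block matching and disparity-to-depth conversion of step 815 can be sketched as follows for a rectified image/pattern pair. This is a brute-force reference version (the embodiments run the search in a compute shader), and the block size, maximum disparity, focal length, and baseline are illustrative parameters.

```python
import numpy as np

def sad_block_match(captured, pattern, block=9, max_disp=64):
    """Brute-force Sum-of-Absolute-Differences block matching between the
    simulated camera image and the raw projector pattern (rectified pair)."""
    captured = captured.astype(np.float32)
    pattern = pattern.astype(np.float32)
    h, w = captured.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            ref = captured[y - half:y + half + 1, x - half:x + half + 1]
            best, best_d = np.inf, 0
            for d in range(min(max_disp, x - half) + 1):
                cand = pattern[y - half:y + half + 1, x - d - half:x - d + half + 1]
                cost = np.abs(ref - cand).sum()
                if cost < best:
                    best, best_d = cost, d
            disp[y, x] = best_d
    return disp

def disparity_to_depth(disp, focal_px, baseline_m):
    """Convert disparity (pixels) to metric depth using the simulated
    camera/projector intrinsics and the baseline distance."""
    depth = np.zeros_like(disp)
    valid = disp > 0
    depth[valid] = focal_px * baseline_m / disp[valid]
    return depth
```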
[45] Returning to FIG. 8, at step 820, a smoothing and trimming process is performed on the depth maps according to a measurement range corresponding to sensor specifications associated with the simulated camera used by the simulation platform in rendering the pattern images. In some embodiments, during step 820, a hole-filling operation is performed during the smoothing and trimming process to reduce a proportion of missing data in the depth maps.
[46] The pipeline discussed herein may be applied in a variety of ways. For example, in some embodiments, the pipeline is integrated into a diagnostics system applicable to mechanical devices (e.g., railroad cars). A user uses a portable computing device with a structured light sensor to acquire images of the mechanical device. Based on these images, individual parts may be identified based on the image recognition functionality provided by the pipeline. Once a part is identified, the diagnostics system may provide relevant information to the user. This information could include, for example, the name and model number of each particular part in the image. The user, in conjunction with the diagnostic system, may then use this information to perform maintenance on the mechanical device. For example, based on the image recognition results provided by the pipeline, the diagnostics system may automatically place an order for a new replacement part or send a request to a 3D printer to enable printing of the replacement part.
[47] FIG. 9 provides an example of a parallel processing memory architecture 900 that may be utilized to perform computations related to execution of the pipeline discussed herein, according to some embodiments of the present invention. This architecture 900 may be used in embodiments of the present invention where NVIDIA™ CUDA (or a similar parallel computing platform) is used. The architecture includes a host computing unit ("host") 905 and a graphics processing unit (GPU) device ("device") 910 connected via a bus 915 (e.g., a PCIe bus). The host 905 includes the central processing unit, or "CPU" (not shown in FIG. 9) and host memory 925 accessible to the CPU. The device 910 includes the graphics processing unit (GPU) and its associated memory 920, referred to herein as device memory. The device memory 920 may include various types of memory, each optimized for different memory usages. For example, in some embodiments, the device memory includes global memory, constant memory, and texture memory.
[48] Parallel portions of a deep learning application may be executed on the architecture
900 as "device kernels" or simply "kernels." A kernel comprises parameterized code configured to perform a particular function. The parallel computing platform is configured to execute these kernels in an optimal manner across the architecture 900 based on parameters, settings, and other selections provided by the user. Additionally, in some embodiments, the parallel computing platform may include additional functionality to allow for automatic processing of kernels in an optimal manner with minimal input provided by the user.
[49] The processing required for each kernel is performed by a grid of thread blocks (described in greater detail below). Using concurrent kernel execution, streams, and synchronization with lightweight events, the architecture 900 of FIG. 9 (or similar architectures) may be used to parallelize training of a deep neural network. For example, in some embodiments, the operations of the simulation platform may be partitioned such that multiple kernels simulate different configurations simultaneously (e.g., different viewpoints, lighting, textures, materials, effects, etc.). In other embodiments, the deep learning network itself may be implemented such that various operations performed during training and use of the network are done in parallel.
[50] The device 910 includes one or more thread blocks 930 which represent the computation unit of the device 910. The term thread block refers to a group of threads that can cooperate via shared memory and synchronize their execution to coordinate memory accesses. For example, in FIG. 9, threads 940, 945 and 950 operate in thread block 930 and access shared memory 935. Depending on the parallel computing platform used, thread blocks may be organized in a grid structure. A computation or series of computations may then be mapped onto this grid. For example, in embodiments utilizing CUDA, computations may be mapped on one-, two-, or three-dimensional grids. Each grid contains multiple thread blocks, and each thread block contains multiple threads. For example, in FIG. 9, the thread blocks 930 are organized in a two-dimensional grid structure with m+1 rows and n+1 columns. Generally, threads in different thread blocks of the same grid cannot communicate or synchronize with each other. However, thread blocks in the same grid can run on the same multiprocessor within the GPU at the same time. The number of threads in each thread block may be limited by hardware or software constraints. To address this limitation, pipeline operations may be configured in various manners to optimize use of the parallel computing platform. For example, in some
embodiments, processing of different viewpoints by the simulation platform, operations performed by the compute shaders, or operations associated with the block matching process may be partitioned over thread blocks automatically by the parallel computing platform software. However, in other embodiments, the individual thread blocks can be selected and configured to optimize training of the deep learning network. For example, in one embodiment, each thread block is assigned a subset of training data with overlapping values.
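As a minimal illustration of mapping a per-pixel pipeline operation onto a two-dimensional grid of thread blocks, the sketch below uses the Numba CUDA bindings for Python; it is not the implementation of the embodiments and simply adds precomputed sensor noise to a depth map, one pixel per thread (a CUDA-capable GPU is assumed).

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_sensor_noise(depth, noise, out):
    # each thread handles one pixel of the depth map
    x, y = cuda.grid(2)
    if x < depth.shape[0] and y < depth.shape[1]:
        out[x, y] = depth[x, y] + noise[x, y]

depth = np.random.rand(480, 640).astype(np.float32)
noise = (0.001 * np.random.randn(480, 640)).astype(np.float32)
out = np.zeros_like(depth)

threads_per_block = (16, 16)
blocks_per_grid = ((depth.shape[0] + 15) // 16, (depth.shape[1] + 15) // 16)
add_sensor_noise[blocks_per_grid, threads_per_block](depth, noise, out)
```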
[51] Continuing with reference to FIG. 9, registers 955, 960, and 965 represent the fast memory available to thread block 930. Each register is only accessible by a single thread. Thus, for example, register 955 may only be accessed by thread 940. Conversely, shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Thus, shared memory 935 is designed to be accessed, in parallel, by each thread 940, 945, and 950 in thread block 930. Threads can access data in shared memory 935 loaded from device memory 920 by other threads within the same thread block (e.g., thread block 930). The device memory 920 is accessed by all blocks of the grid and may be implemented using, for example, Dynamic Random-Access Memory (DRAM).
[52] Each thread can have one or more levels of memory access. For example, in the architecture 900 of FIG. 9, each thread may have three levels of memory access. First, each thread 940, 945, 950, can read and write to its corresponding registers 955, 960, and 965.
Registers provide the fastest memory access to threads because there are no synchronization issues and the register is generally located close to a multiprocessor executing the thread.
Second, each thread 940, 945, 950 in thread block 930, may read and write data to the shared memory 935 corresponding to that block 930. Generally, the time required for a thread to access shared memory exceeds that of register access due to the need to synchronize access among all the threads in the thread block. However, like the registers in the thread block, the shared memory is typically located close to the multiprocessor executing the threads. The third level of memory access allows all threads on the device 910 to read and/or write to the device memory. Device memory requires the longest time to access because access must be synchronized across the thread blocks operating on the device. Thus, in some embodiments, the processing of each data element (e.g., each pixel or viewpoint handled by the pipeline) is coded such that it primarily utilizes registers and shared memory and only utilizes device memory as necessary to move data in and out of a thread block.
[53] The embodiments of the present disclosure may be implemented with any combination of hardware and software. For example, aside from the parallel processing architecture presented in FIG. 9, standard computing platforms (e.g., servers, desktop computers, etc.) may be specially configured to perform the techniques discussed herein. In addition, the embodiments of the present disclosure may be included in an article of manufacture (e.g., one or more computer program products) having, for example, computer-readable, non-transitory media. The media may have embodied therein computer readable program code for providing and facilitating the mechanisms of the embodiments of the present disclosure. The article of manufacture can be included as part of a computer system or sold separately.
[54] While various aspects and embodiments have been disclosed herein, other aspects and embodiments will be apparent to those skilled in the art. The various aspects and
embodiments disclosed herein are for purposes of illustration and are not intended to be limiting, with the true scope and spirit being indicated by the following claims.
[55] An executable application, as used herein, comprises code or machine readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input. An executable procedure is a segment of code or machine readable instruction, sub-routine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.

[56] A graphical user interface (GUI), as used herein, comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions. The GUI also includes an executable procedure or executable application. The executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user. The processor, under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.
[57] The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to one or more executable instructions or device operation without user direct initiation of the activity.
[58] The system and processes of the figures are not exclusive. Other systems, processes and menus may be derived in accordance with the principles of the invention to accomplish the same objectives. Although this invention has been described with reference to particular embodiments, it is to be understood that the embodiments and variations shown and described herein are for illustration purposes only. Modifications to the current design may be
implemented by those skilled in the art, without departing from the scope of the invention. As described herein, the various systems, subsystems, agents, managers and processes can be implemented using hardware components, software components, and/or combinations thereof. No claim element herein is to be construed under the provisions of 35 U.S.C. 112, sixth paragraph, unless the element is expressly recited using the phrase "means for."

Claims

1. A computer-implemented method for generating synthetic data from structured light sensors for 3D object pose estimation, the method comprising: using a simulation platform to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints; adding one or more effects to the plurality of pattern images; and generating a plurality of depth maps by matching the plurality of pattern images with a raw projector pattern using a block-matching process.
2. The method of claim 1, wherein the simulation platform simulates surface characteristics of the 3D model using a predetermined list of corresponding materials during rendering of the plurality of pattern images.
3. The method of claim 1, wherein the simulation platform models one or more spot light projectors projecting a desired high resolution pattern during rendering of the plurality of pattern images.
4. The method of claim 3, wherein the one or more spot light projectors comprise a red channel spot light projector, a blue channel spot light projector, and a green channel spot light projector.
5. The method of claim 1, wherein the simulation platform uses one or more background 3D models for depth data simulation when rendering the plurality of pattern images of the 3D model.
6. The method of claim 1, wherein the simulation platform simulates one or more ambient light sources when rendering the plurality of pattern images of the 3D model.
7. The method of claim 1, wherein the one or more effects comprise a radial and tangential lens distortion effect, a lens scratch and gain effect, a motion blur effect, and an independent and identically distributed random noise effect.
8. The method of claim 1, wherein the block-matching process applies a Sum of Absolute Difference block-matching process.
9. The method of claim 8, wherein the block-matching process further comprises:
setting a maximum disparity number according to the plurality of pattern images;
converting sub-pixel disparity values in the plurality of pattern images into depth values based on (a) intrinsic parameters of a simulated camera used by the simulation platform in rendering the plurality of pattern images; (b) intrinsic parameters of a simulated projector used by the simulation platform in rendering the plurality of pattern images; and (c) a baseline distance value.
10. The method of claim 9, further comprising:
performing a calibration process on a real structured light sensor to obtain (a) the intrinsic parameters of the simulated camera used by the simulation platform in rendering the plurality of pattern images; (b) the intrinsic parameters of the simulated projector used by the simulation platform in rendering the plurality of pattern images.
11. The method of claim 1, further comprising:
performing a smoothing and trimming process on the plurality of depth maps according to a measurement range corresponding to sensor specifications associated with a simulated camera used by the simulation platform in rendering the plurality of pattern images.
12. The method of claim 11, wherein a hole-filling operation is performed during the smoothing and trimming process to reduce a proportion of missing data in the plurality of depth maps.
13. An article of manufacture for generating synthetic data from structured light sensors for 3D object pose estimation, the article of manufacture comprising a non-transitory, tangible computer-readable medium holding computer-executable instructions for performing a method comprising:
rendering a plurality of pattern images of a 3D model using a simulated camera and a simulated projector, wherein each pattern image corresponds to a distinct viewpoint;
adding one or more effects to the plurality of pattern images; and
generating a plurality of depth maps by matching the plurality of pattern images with a raw projector pattern.
14. The article of manufacture of claim 13, wherein rendering the plurality of pattern images comprises simulating surface characteristics of the 3D model using a predetermined list of corresponding materials during rendering of the plurality of pattern images.
15. The article of manufacture of claim 13, wherein rendering the plurality of pattern images comprises modeling one or more spot light projectors projecting a desired high resolution pattern during rendering of the plurality of pattern images.
16. The article of manufacture of claim 13, wherein generation of the plurality of depth maps comprises:
setting a maximum disparity number according to the plurality of pattern images;
converting sub-pixel disparity values in the plurality of pattern images into depth values based on (a) intrinsic parameters of the simulated camera used in rendering the plurality of pattern images; (b) intrinsic parameters of the simulated projector used in rendering the plurality of pattern images; and (c) a baseline distance value.
17. The article of manufacture of claim 13, further comprising:
performing a calibration process on a real structured light sensor to obtain (a) intrinsic parameters of the simulated camera used in rendering the plurality of pattern images and (b) intrinsic parameters of the simulated projector used in rendering the plurality of pattern images.
18. The article of manufacture of claim 13, further comprising:
performing a smoothing and trimming process on the plurality of depth maps according to a measurement range corresponding to sensor specifications associated with the simulated camera used in rendering the plurality of pattern images.
19. The article of manufacture of claim 18, wherein a hole-filling operation is performed during the smoothing and trimming process to reduce a proportion of missing data in the plurality of depth maps.
20. A system for generating synthetic data from single-shot structured light sensors for 3D object pose estimation, the system comprising: a simulation platform configured to render a plurality of pattern images of a 3D model corresponding to a plurality of viewpoints; a first compute shader configured to add one or more effects to the plurality of pattern images; a block matching component configured to generate a plurality of depth maps by matching the plurality of pattern images with a raw projector pattern; and a second compute shader configured to perform a smoothing and trimming process on the plurality of depth maps according to a measurement range corresponding to sensor specifications associated with a simulated camera used in rendering the plurality of pattern images.
PCT/US2016/059668 2016-10-31 2016-10-31 Real-time generation of synthetic data from structured light sensors for 3d object pose estimation WO2018080533A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2016/059668 WO2018080533A1 (en) 2016-10-31 2016-10-31 Real-time generation of synthetic data from structured light sensors for 3d object pose estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2016/059668 WO2018080533A1 (en) 2016-10-31 2016-10-31 Real-time generation of synthetic data from structured light sensors for 3d object pose estimation

Publications (1)

Publication Number Publication Date
WO2018080533A1 true WO2018080533A1 (en) 2018-05-03

Family

ID=57321432

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2016/059668 WO2018080533A1 (en) 2016-10-31 2016-10-31 Real-time generation of synthetic data from structured light sensors for 3d object pose estimation

Country Status (1)

Country Link
WO (1) WO2018080533A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101911447B1 (en) 2018-05-09 2018-10-24 권오성 Apparatus for generating 3D structure interpretation
CN109191453A (en) * 2018-09-14 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for generating image category detection model
EP3678099A1 (en) * 2019-01-02 2020-07-08 Cognata Ltd. System and method for generating large simulation data sets for testing an autonomous driver
US10901740B2 (en) 2017-08-08 2021-01-26 Siemens Aktiengesellschaft Synthetic depth image generation from cad data using generative adversarial neural networks for enhancement
WO2023277789A1 (en) * 2021-07-02 2023-01-05 Ams-Osram Asia Pacific Pte. Ltd. Calibration method
WO2023277907A1 (en) * 2021-06-30 2023-01-05 Hewlett-Packard Development Company, L.P. Synthetic images for object detection
CN117710449A (en) * 2024-02-05 2024-03-15 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BENJAMIN PLANCHE ET AL: "DepthSynth: Real-Time Realistic Synthetic Data Generation from CAD Models for 2.5D Recognition", 27 February 2017 (2017-02-27), XP055366521, Retrieved from the Internet <URL:https://arxiv.org/pdf/1702.08558.pdf> *
MICHAEL LANDAU: "Optimal 6D Object Pose Estimation Optimal 6D Object Pose Estimation with Commodity Depth Sensors with Commodity Depth Sensors Author", August 2016 (2016-08-01), XP055366530, Retrieved from the Internet <URL:https://libra2.lib.virginia.edu/downloads/3484zg91x?filename=Landau_Michael_2016_PhD.pdf> [retrieved on 20170421] *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10901740B2 (en) 2017-08-08 2021-01-26 Siemens Aktiengesellschaft Synthetic depth image generation from cad data using generative adversarial neural networks for enhancement
KR101911447B1 (en) 2018-05-09 2018-10-24 권오성 Apparatus for generating 3D structure interpretation
CN109191453A (en) * 2018-09-14 2019-01-11 北京字节跳动网络技术有限公司 Method and apparatus for generating image category detection model
EP3678099A1 (en) * 2019-01-02 2020-07-08 Cognata Ltd. System and method for generating large simulation data sets for testing an autonomous driver
US11100371B2 (en) 2019-01-02 2021-08-24 Cognata Ltd. System and method for generating large simulation data sets for testing an autonomous driver
US11694388B2 (en) 2019-01-02 2023-07-04 Cognata Ltd. System and method for generating large simulation data sets for testing an autonomous driver
WO2023277907A1 (en) * 2021-06-30 2023-01-05 Hewlett-Packard Development Company, L.P. Synthetic images for object detection
WO2023277789A1 (en) * 2021-07-02 2023-01-05 Ams-Osram Asia Pacific Pte. Ltd. Calibration method
CN117710449A (en) * 2024-02-05 2024-03-15 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method
CN117710449B (en) * 2024-02-05 2024-04-16 中国空气动力研究与发展中心高速空气动力研究所 NUMA-based real-time pose video measurement assembly line model optimization method

Similar Documents

Publication Publication Date Title
CN111986307B (en) 3D object reconstruction using a light grid representation
WO2018080533A1 (en) Real-time generation of synthetic data from structured light sensors for 3d object pose estimation
US20200057831A1 (en) Real-time generation of synthetic data from multi-shot structured light sensors for three-dimensional object pose estimation
US20210065440A1 (en) Dynamically estimating light-source-specific parameters for digital images using a neural network
Handa et al. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM
EP3742113B1 (en) System and method for marking images for three-dimensional image generation
CN113012293A (en) Stone carving model construction method, device, equipment and storage medium
US11663775B2 (en) Generating physically-based material maps
Riegler et al. Connecting the dots: Learning representations for active monocular depth estimation
US10210618B1 (en) Object image masking using depth cameras or three-dimensional (3D) models
Varol et al. Monocular 3D reconstruction of locally textured surfaces
US10169891B2 (en) Producing three-dimensional representation based on images of a person
US9147279B1 (en) Systems and methods for merging textures
US20200057778A1 (en) Depth image pose search with a bootstrapped-created database
WO2020075252A1 (en) Information processing device, program, and information processing method
CN111161398A (en) Image generation method, device, equipment and storage medium
CN116391206A (en) Stereoscopic performance capture with neural rendering
US11645800B2 (en) Advanced systems and methods for automatically generating an animatable object from various types of user input
US8948498B1 (en) Systems and methods to transform a colored point cloud to a 3D textured mesh
CN115035224A (en) Method and apparatus for image processing and reconstructed image generation
US20210056247A1 (en) Pose detection of objects from image data
CN113068017A (en) Enhancing video throughput of real scenes
US9734579B1 (en) Three-dimensional models visual differential
CN112487893A (en) Three-dimensional target identification method and system
Güssefeld et al. Are reflectance field renderings appropriate for optical flow evaluation?

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16795496

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16795496

Country of ref document: EP

Kind code of ref document: A1