WO2020086217A1 - Learning keypoints and matching RGB images to CAD models - Google Patents
- Publication number
- WO2020086217A1 (PCT/US2019/053827)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cad
- branch
- network
- depth
- images
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroïds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
Definitions
- This application relates to image learning and matching techniques.
- the technology described herein is particularly well-suited for, but not limited to, matching RGB (red-green-blue) images of objects to corresponding computer-aided design (CAD) models.
- RGB red-green-blue
- CAD computer-aided design
- RGB (red-green-blue) images or data refers to data that can be combined in various proportions to obtain any color in the visible spectrum, for example, to represent colors on a computer display.
- Various tasks, such as identifying parts in an assembly of parts, require RGB images to be captured. But capturing the RGB data to identify parts can be an arduous procedure. For example, in some cases, the data may need to be manually captured and manually annotated at many different assembly sites of interest, for instance sites for assembling different train engines or turbines. Capturing and annotating RGB data in this manner, or in similar manners, can be cumbersome (or in some cases, impossible) and can result in errors.
- CAD computer-aided design
- Embodiments of the invention address and overcome one or more of the described-herein shortcomings by providing methods, systems, and apparatuses that learn RGB (red-green-blue) images and depth images rendered from CAD models so as to match RGB images of objects to corresponding CAD models.
- a system described herein can be trained to identify an RGB image, for instance a pose or category defined by the RGB image, at test time using a database created from depth images rendered from CAD models.
- a quadruplet neural network is configured to learn keypoint locations and respective descriptors associated with each keypoint location.
- the quadruplet neural network includes a CAD domain and a picture domain.
- the CAD domain includes a first branch of the network and a second branch of the network.
- the CAD domain is configured to train on pairs of CAD depth images rendered from CAD models of CAD objects, so as to learn viewpoint-invariant features of the CAD objects.
- the picture domain includes a third branch of the network and a fourth branch of the network.
- the picture domain is configured to train on pairs of RGB images of RGB objects, so as to learn modality-invariant features of the RGB objects.
- the first branch is configured to train on first CAD depth images rendered from the CAD models of the CAD objects
- the second branch is configured to train on second CAD depth images rendered from the CAD models of the CAD objects.
- each first CAD depth image corresponds to a respective second CAD depth image so as to define the pairs of CAD depth images, such that the first CAD depth image in a respective pair includes a particular CAD object in a first pose, and the second CAD depth image in the respective pair defines the particular CAD object in a second pose that is different than the first pose.
- the picture domain includes a third branch of the network and a fourth branch of the network.
- the picture domain is configured to train on pairs of RGB images of RGB objects, so as to learn modality-invariant features of the RGB objects.
- the picture domain can be configured to train on an RGB image and its corresponding depth image, which can be collectively referred to as RGB-D images.
- each RGB depth image corresponds to a respective RGB image so as to define the pairs of RGB images, such that the depth image in a respective RGB-D pair includes a particular RGB object in a particular pose, and the RGB image in the respective pair also includes the particular RGB object in the particular pose.
- FIG. 1 is a block diagram of a quadruplet convolutional neural network according to embodiments of the present disclosure.
- FIG. 2 is another block diagram that shows an example of pose estimation that can be performed by the network shown in FIG. 1 during an example testing phase, according to embodiments of this disclosure.
- FIG. 3 is another block diagram that shows an example of making keypoint predictions that can be performed by a network according to embodiments of this disclosure.
- FIGs. 4A-C depict example keypoint predictions that can be made in accordance with the example depicted in FIG. 3.
- FIG. 5 shows an example of a computing environment within which embodiments of the disclosure may be implemented.
- Embodiments of the invention address and overcome one or more of the described-herein shortcomings or technical problems by providing methods, systems, and apparatuses that learn how to match RGB (red-green-blue) images of an object to respective depth images rendered from computer-aided design (CAD) models.
- the RGB images may be captured by consumer-grade cameras.
- Embodiments described herein can also estimate a given object’s pose and/or category.
- a quadruplet neural network includes a CAD domain and a picture domain.
- the CAD domain includes a first branch of the network and a second branch of the network.
- the CAD domain is configured to train on pairs of CAD depth images rendered from CAD models of CAD objects, so as to learn viewpoint-invariant features of the CAD objects.
- the first branch is configured to train on first CAD depth images rendered from the CAD models of the CAD objects
- the second branch is configured to train on second CAD depth images rendered from the CAD models of the CAD objects.
- each first CAD depth image corresponds to a respective second CAD depth image so as to define the pairs of depth images, such that the first CAD depth image in a respective pair includes a particular CAD object in a first pose, and the second CAD depth image in the respective pair defines the particular CAD object in a second pose that is different than the first pose.
- the picture domain includes a third branch of the network and a fourth branch of the network.
- the picture domain is configured to train on pairs of RGB images of RGB objects, so as to learn modality-invariant features of the RGB objects.
- the picture domain can be configured to train on an RGB image and its corresponding depth image, which can be collectively referred to as RGB-D images.
- RGB-D refers to a pair of images that includes an RGB image and a depth image that are from the same viewpoint at the same time.
- each RGB depth image corresponds to a respective RGB image so as to define the pairs of RGB images, such that the RGB depth image in a respective pair includes a particular RGB object in a particular pose, and the RGB image in the respective pair also includes the particular RGB object in the particular pose.
- the disclosed methods and systems present an improvement to the functionality of the computer used to perform such a computer-based task.
- the disclosed methods and systems can solve 3D object pose estimation problems without relying on annotated RGB images.
- readily available CAD models of objects are rich sources of data that can provide a large number of synthetically rendered depth images.
- These rendered CAD depth images can be used to match RGB images for object estimation, thereby eliminating the use of real-world textures for CAD models or explicit 3D pose annotations for RGB images, which can be expensive.
- methods and systems disclosed herein can learn how to select keypoints and enforce viewpoint and modality invariance across RGB images and CAD model renderings. Further, after training, methods and systems described herein can reliably estimate the pose of objects in RGB images, and can generalize to object instances that were not seen during training.
- CAD models of objects are often available, and can be used to generate synthetically rendered depth images.
- the objects may define parts for an assembly or an assembly itself.
- such images can form the basis of a representation of parts for a given assembly.
- matching these representations of objects or parts to RGB images of the objects or parts can be burdensome or unreliable. In particular, for example, it can be difficult to obtain RGB images having accurate three-dimensional (3D) pose annotations.
- RGB images can be compared to CAD models.
- a system or network is trained to match an RGB image at test time to an object in a database created from depth images rendered from CAD models.
- a neural network or system can learn viewpoint-invariant and modality-invariant features.
- the neural network 100 can include various branches, for instance a first branch 102a, a second branch 102b, a third branch 102c, and a fourth branch 102d.
- the neural network 100 can include a CAD domain or portion 101 that trains on CAD models, for instance CAD depth images rendered from CAD models.
- the CAD domain 101 can be configured to train on pairs of CAD depth images rendered from CAD models of CAD objects, so as to learn viewpoint-invariant features of the CAD objects.
- the CAD domain 101 includes the first branch 102a and the second branch 102b.
- the neural network 100 can further include a picture (or RGB-D) domain or portion 103 that trains on RGB images.
- the picture domain 103 can be configured to train on pairs of RGB images of RGB objects, so as to learn modality-invariant features of the RGB objects.
- the picture domain 103 includes the third branch 102c and the fourth branch 102d.
- the neural network 100 can learn viewpoint-invariant features using rendered CAD depth images, for instance a pair of rendered CAD depth images, that include pose perturbation from a CAD model.
- the neural network 100 can learn modality-invariant features from an aligned pair of images, for instance RGB images.
- a modality-invariant feature refers to a feature that is reflected independently of the modality through which it has been apprehended.
- a modality-invariant feature or object is represented the same in a CAD model and an RGB image.
- a view-invariant feature refers to a feature that is reflected independently of the pose of the object that defines the features.
- a modality-invariant feature can refer to a transformation of the input or parts of the input (e.g., an RGB image or a real/synthetic depth image) that has the same value if the underlying object (and/or reference point of the object) is the same, regardless of whether the modality is RGB or depth.
- a pose-invariant feature can refer to a feature that has the same value if the underlying object (and/or reference point of the object) is the same, regardless of the perspective in which the object is seen.
- the neural network 100 is trained during a training stage.
- the neural network 100 can be trained using CAD models of one or more objects, for instance a CAD model of a three-dimensional (3D) object.
- the first branch 102a is configured to train on first CAD depth images 104a rendered from CAD models of CAD objects
- the second branch 102b is configured to train on second CAD depth images 104b rendered from CAD models of CAD objects
- the third branch 102c is configured to train on third depth images 110a, for instance RGB depth images that have a corresponding RGB image
- the fourth branch 102d is configured to train on RGB images 110b.
- each first CAD depth image 104a corresponds to a respective second CAD depth image 104b so as to define the pairs of depth images, such that the first CAD depth image 104a in a respective pair includes a particular CAD object in a first pose, and the second CAD depth image in the respective pair defines the particular CAD object in a second pose that is different than the first pose.
- each depth image 110a corresponds to a respective RGB image 110b so as to define the pairs of RGB-D images, such that the RGB depth image 110a in a respective pair includes a particular RGB object in a particular pose, and the RGB image 110b in the respective pair also includes the particular RGB object in the particular pose.
- a CAD model of a given object is sampled so as to generate depth images of the object.
- the object can be sampled with virtual camera poses from a view-sphere around the object, such that the depth images are of various perspectives of the object.
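- As a rough sketch of this sampling step, the code below places virtual camera centers on a view-sphere and builds a look-at pose for each; the Fibonacci-spiral spacing, the sphere radius, and the OpenCV-style camera convention are assumptions made for illustration, not details taken from the disclosure.

```python
import numpy as np

def sample_viewsphere(n_views: int, radius: float = 2.0) -> np.ndarray:
    """Sample roughly uniform camera centers on a sphere around the object."""
    idx = np.arange(n_views) + 0.5
    phi = np.arccos(1.0 - 2.0 * idx / n_views)      # polar angle
    theta = np.pi * (1.0 + 5.0 ** 0.5) * idx        # golden-angle azimuth
    return radius * np.stack([np.sin(phi) * np.cos(theta),
                              np.sin(phi) * np.sin(theta),
                              np.cos(phi)], axis=1)

def look_at(cam_center: np.ndarray,
            target: np.ndarray = np.zeros(3),
            up: np.ndarray = np.array([0.0, 0.0, 1.0])) -> np.ndarray:
    """Build a 4x4 world-to-camera extrinsic matrix looking at the object center."""
    z = target - cam_center
    z /= np.linalg.norm(z)
    x = np.cross(z, up)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])        # rows are the camera axes
    t = -R @ cam_center
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Each pose would then be handed to an off-screen renderer to produce one depth image.
poses = [look_at(c) for c in sample_viewsphere(200)]
```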
- the training stage includes multiple training instances.
- the depth images are sampled to create a pair of depth images.
- the pair of depth images includes the first depth image 104a of a CAD object 106, and the second depth image 104b of the object 106.
- the example object 106 is a bed, though it will be understood that the object 106 can be any object having a CAD model as desired, and all such objects are contemplated as being within the scope of this disclosure.
- the first depth image 104a is input to the first branch 102a of the neural network 100
- the second depth image 104b is input to the second branch 102b of the neural network 100.
- the first depth image 104a defines a first pose of the object 106
- the second depth image 104b defines a second pose of the object 106 that is different than the first pose of the object 106.
- the pose of a given object refers to the position and orientation of the object.
- the inputs to the first and second branches 102a and 102b during an example training instance are from the same modality (e.g., CAD depth images), but include different poses of the same object.
- a training stage may include multiple, for instance hundreds or thousands, training instances, such that multiple pairs of depth images are input into the first and second branches 102a and 102b of the network 100.
- the pairs of depth images that are input to the first and second branches 102a and 102b during the training stage may be of the same object.
- at least one pair, for instance multiple pairs, of the CAD depth images that are input to the first and second branches 102a and 102b may be of a different object with respect to other pairs that are input to the first and second branches 102a and 102b.
- pictures for instance RGB images
- a dataset is available that includes RGB images that do not have annotations that indicate the respective poses of the object in the RGB images.
- the dataset is sampled so as to generate pairs of images of one or more objects, for instance an RGB object 108.
- a pair of RGB images can include a first RGB or RGB depth image 110a of the object 108 and a second RGB image 110b of the object 108.
- RGB-D refers to a pair of images that includes a depth image and its corresponding RGB image that are from the same viewpoint at the same time.
- the depth image can be an image channel in which each pixel relates to distance between an image plane and the corresponding object in the RGB image.
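- For concreteness, lifting a 2D keypoint to a 3D location using the depth channel can look like the sketch below; the pinhole intrinsics K and the metric units of the depth map are assumptions.

```python
import numpy as np

def backproject(kp2d: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift 2D keypoints (N, 2) to 3D camera coordinates using the depth channel.

    Illustrative only: assumes a pinhole camera with intrinsics K and a depth map
    given in the same metric units wanted for the 3D keypoint locations.
    """
    u, v = kp2d[:, 0], kp2d[:, 1]
    z = depth[v.astype(int), u.astype(int)]         # per-keypoint depth value
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)
```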
- Each of the RGB images in a pair of RGB images can include the object 108 in the same pose.
- the depth image 110a can include the object 108 in a particular pose
- the second RGB image 110b can also include the object 108 in the particular pose.
- the RGB depth image 110a of a given pair that is input to the third branch 102c can include a specific RGB image with depth indications
- the second RGB image 110b of the given pair that is input to the fourth branch 102d can be the same specific RGB image without the depth indications.
- the inputs to the third and fourth branches 102c and 102d during an example training instance are from different modalities (e.g., RGB and depth), but include the same pose of the same object.
- a training stage may include multiple training instances, such that multiple pairs of RGB images are input into the third and fourth branches 102c and 102d of the network 100.
- the pairs of RGB images that are input to the third and fourth branches 102c and 102d during the training stage may be of the same object.
- at least one pair, for instance multiple pairs, of the RGB images that are input to the third and fourth branches 102c and 102d may be of a different object with respect to other pairs that are input to the third and fourth branches 102c and 102d.
- the pairs of RGB images that are input to the third and fourth branches 102c and 102d may include the same object or different objects as included in the pairs of depth images that are input to the first and second branches 102a and 102b.
- the quadruplet network 100 can learn view-invariant and modality-invariant features in two different pairs of branches, and, in cases in which there are no 3D pose annotations, the learned knowledge can be transferred internally in the network 100.
- an abundance of rendered CAD depth pairs can be taken advantage of in the first and second branches 102a and 102b to learn view-invariant features and keypoint generation.
- the third branch 102c can leverage the knowledge learned from the first and second branches 102a and 102b, and attempt to transfer the knowledge to the fourth branch 102d.
- each branch of the quadruplet network 100 can use a backbone convolutional neural network (CNN) to learn features or representations of the objects.
- CNN backbone convolutional neural network
- the first branch 102a can use a first CNN 112a to learn first local features 114a of the CAD objects (e.g., object 106) rendered in the first depth images 104a.
- the second branch 102b can use a second CNN 112b to learn second local features 114b of the CAD objects rendered in the second depth images 104b.
- the third branch 102c can use a third CNN 112c to learn local features of the RGB objects in the depth images 110a
- the fourth branch 102d can use a fourth CNN 112d to learn local features of the RGB objects in the RGB images 110b.
- feature weights and biases are shared between the first CNN 112a, the second CNN 112b, and the third CNN 112c.
- the weights and biases of the networks 112a-c are shared.
- the trainable parameters for weights and biases are shared between the first, second, and third CNN 112a, 112b, and 112c.
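- A minimal PyTorch-style sketch of this weight sharing is shown below: one backbone is reused for the two CAD depth branches and the real-depth branch, while the RGB branch is given its own parameters (the disclosure only states that the first three CNNs share weights). The ResNet-18 trunk, the single-channel depth input, and the global feature head are simplifications assumed here for brevity; the disclosure itself also works with local, per-keypoint features.

```python
import torch.nn as nn
import torchvision.models as models

class QuadrupletBackbone(nn.Module):
    """Sketch of the four-branch feature extractors with shared depth weights."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared backbone for branches one through three (an assumption: ResNet-18 trunk).
        self.shared_cnn = models.resnet18(weights=None)
        self.shared_cnn.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                          padding=3, bias=False)   # 1-channel depth input
        self.shared_cnn.fc = nn.Linear(self.shared_cnn.fc.in_features, feat_dim)
        # Fourth branch: RGB input with its own weights.
        self.rgb_cnn = models.resnet18(weights=None)
        self.rgb_cnn.fc = nn.Linear(self.rgb_cnn.fc.in_features, feat_dim)

    def forward(self, cad_depth_a, cad_depth_b, rgbd_depth, rgb):
        f1 = self.shared_cnn(cad_depth_a)   # branch 1: rendered CAD depth, pose A
        f2 = self.shared_cnn(cad_depth_b)   # branch 2: rendered CAD depth, pose B
        f3 = self.shared_cnn(rgbd_depth)    # branch 3: real depth from the RGB-D pair
        f4 = self.rgb_cnn(rgb)              # branch 4: RGB image
        return f1, f2, f3, f4
```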
- the first CNN 112a can generate the first local features 114a of the first CAD depth image 104a
- the second CNN 112b can generate second local features 114b of the second CAD depth image 104b.
- local feature representations within a given pair are aligned with one another by applying a triplet loss on the respective local feature representations that are extracted from the first and second branches 102a and 102b.
- the local features 114a that are extracted from the first branch 102a can be aligned with the local features 114b that are extracted from the second branch 102b, so as to generate aligned local features.
- the aligned local features 122 can be generated by applying a triplet loss to the local features 114a and 114b.
- the triplet loss method involves using the known camera poses of the rendered pairs of depth images from the CAD domain 101 and sampling training keypoint triplets (anchor-positive-negative). Specifically, for a randomly selected keypoint as an anchor, the closest keypoint in 3D from the paired image is found and used as a positive. The keypoint in 3D that is farther away is selected as the negative. It will be understood that the positive example is actually the projection of the anchor’s 2D location to the paired depth image.
- the triplet loss then optimizes the representation such that the feature distance between the anchor and positive is smaller than the feature distance between the anchor and negative plus a certain margin. Traditionally, the margin hyper-parameter is manually defined as a constant throughout the training procedure.
- the 3D information can be leveraged, and the margin can be defined as the difference between the 3D anchor-negative distance and the 3D anchor-positive distance. It is recognized herein that this may ensure that the learned feature distances are proportional to the 3D distances between the examples.
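- A sketch of such a triplet loss with a 3D-distance-driven margin is given below (PyTorch); the tensor shapes and the mean reduction are assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(f_anchor, f_pos, f_neg,
                                 p3d_anchor, p3d_pos, p3d_neg) -> torch.Tensor:
    """Triplet loss whose margin is the gap between the anchor-negative and
    anchor-positive 3D distances, so feature distances track 3D distances.

    f_* are (N, D) feature descriptors; p3d_* are (N, 3) keypoint locations in 3D.
    """
    d_feat_pos = F.pairwise_distance(f_anchor, f_pos)
    d_feat_neg = F.pairwise_distance(f_anchor, f_neg)
    d_3d_pos = (p3d_anchor - p3d_pos).norm(dim=1)
    d_3d_neg = (p3d_anchor - p3d_neg).norm(dim=1)
    margin = d_3d_neg - d_3d_pos            # adaptive, per-triplet margin
    return F.relu(d_feat_pos - d_feat_neg + margin).mean()
```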
- the triplet loss might only affect the backbone CNN during training and not the keypoint proposal network (KPN), which is now discussed.
- Each branch of the quadruplet network 100 can also use a respective keypoint proposal network (KPN) to generate keypoint proposals.
- KPN keypoint proposal network
- the first branch 102a can include a first KPN 116a for the first depth images 104a
- the second branch 102b can include a second KPN 116b for the second depth images 104b
- the third branch 102c can include a third KPN 116c for the RGB-D images 110a
- the fourth branch 102d can include a fourth KPN 116d for the RGB image 110b.
- a keypoint generally refers to an important area or region of an object or image.
- the keypoint can refer to a particular location and a neighborhood around this location.
- feature weights and biases are shared between the first KPN 116a, the second KPN 116b, and the third KPN 116c.
- the weights and biases of the networks 116a-c are shared.
- the trainable parameters for weights and biases are shared between the first, second, and third KPN 116a, 116b, and 116c.
- a keypoint proposal includes a two-dimensional (2D) location, a 2D displacement vector, a spatial extension (2D region), a score, and a feature representation (e.g., feature representations 118 and 120) extracted from its 2D region.
- the 2D displacement vector can be regressed independently and applied on the 2D location to get the final position of a particular keypoint. This may allow various keypoints to shift around their initial positions during training and optimize toward a respective precise keypoint localization.
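- Purely as an illustration of what one proposal carries and how the displacement refines the 2D location, a hypothetical container might look like the following; the field names and the (x, y, w, h) region encoding are illustrative, not taken from the disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KeypointProposal:
    """One proposal emitted by a KPN head (field names are illustrative)."""
    location: np.ndarray      # (2,) initial 2D position in the image
    displacement: np.ndarray  # (2,) regressed offset applied to the location
    region: np.ndarray        # (4,) spatial extent, e.g. (x, y, w, h)
    score: float              # predicted keypoint probability
    descriptor: np.ndarray    # feature representation pooled from the 2D region

    def refined_location(self) -> np.ndarray:
        # Final keypoint position = initial location shifted by the displacement.
        return self.location + self.displacement
```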
- the local feature representation 118 of the first depth image 104a can include the first depth image 104a and indications of keypoints 124 overlaid on the first depth image 104a.
- the local feature representation 120 of the second depth image 104b can include the second depth image 104b and indications of keypoints 126 overlaid on the second depth image 104b.
- the local feature representations can refer to the transformation of the RGB or depth images into the learned feature space (via 112a-c), which can then (e.g., separately for each candidate keypoint) be pooled with the respective region from the KPN.
- the KPN is trained to generate keypoints on rendered depth images that are optimal toward estimating the relative poses defined by respective images within a given pair of images.
- the first KPN 116a generates keypoints 124 associated with the first depth image 104a
- the second KPN 116b generates keypoints 126 associated with the second depth image 104b.
- Each of the generated keypoints 124 and 126 define a respective 3D location.
- associations or correspondences can be established between keypoints 124 and keypoints 126, as further described below.
- embodiments are described herein that optimize the selection of a set of keypoints based on a pose estimation objective.
- the keypoints can be transferred from the depth images to the RGB domain 103, which includes the third and fourth branches 102c and 102d.
- keypoint predictions are learned, for instance by the KPN 116a and KPN 116b, so that corresponding sets of keypoints 124 and 126 can be generated that are optimal for estimating the relative pose between the object in the first depth image 104a and the object in the second depth image 104b within a pair of depth images.
- a relative pose of the object 106 in the first branch 102a as compared to the pose of the object 106 in the second branch 102b can be estimated.
- a correspondence or association between the sets of the keypoints is established in 3D space.
- a rotation R and translation t can be estimated based on the first set of keypoints 124 and the second set of keypoints 126.
- the correspondence or association between the keypoints of the different images can define the rotation R and translation t.
- the keypoints 124 of the first depth image 104a can be projected onto the second depth image 104b when the first depth image 104a is rotated about a rotation axis according to the rotation R , and translated along a translation axis according to the translation t.
- the rotation R and translation t can be defined such that keypoints 126 of the second depth image 104b can be projected onto the first depth image 104a when the second depth image 104b is rotated in accordance with the rotation R, and translated in accordance with the translation t.
- the rotation R and translation t can define how the pose of the object 106 in one of the first and second depth images 104a and 104b relates to the pose of the object 106 in the other of the first and second depth images 104a and 104b.
- the object 106 from one of the branches 102a and 102b, in particular the keypoints from that branch, can be rotated and translated in accordance with the estimated relative pose.
- the keypoints from the one of the branches 102a and 102b can be projected onto the depth image from the other one of the branches 102a and 102b, so as to define a projection representation 128.
- the keypoints 124 of the first depth image 104a are rotated and translated in accordance with the rotation R and translation t, respectively, and projected onto the local feature representation 120.
- the projection representation 128 can include the local feature representation 120 generated by the second branch 102b and the keypoints 124 generated by the first branch 102a overlaid on the local feature representation 120, and thus the second depth image 104b.
- the keypoints 124 define projected keypoints that originate from a different depth image than the depth image onto which they are projected, and the keypoints 126 define original keypoints that are not rotated or translated.
- the projection representation 128 can include the local feature representation 118 generated by the first branch 102a and the keypoints 126 generated by the second branch 102b overlaid on the local feature representation 118, and thus the first depth image 104a, such that the keypoints 126 define the projected keypoints and the keypoints 124 define the original keypoints.
- the system 100 determines whether there is any misalignment between the original keypoints and the projected keypoints. If there is misalignment between corresponding keypoints, the system 100 can compute an error, for instance a re-projection error, which can be used to penalize the initial keypoint predictions, at 132. In an example, a relative pose estimation loss is formulated at 130, which weighs each correspondence separately based on its re-projection error. An objective is that the KPNs will be trained to produce the sets of keypoints that can estimate the optimal relative pose between the two depth images within a given pair of depth images. For example, referring again to the example depicted in FIG. 3, the keypoints 124 and 126 can define initial keypoint predictions.
- w_i = score_i^A + score_i^B for each i-th correspondence.
- score_i^A and score_i^B are the predicted keypoint scores that belong to correspondence i from the first and second branches 102a and 102b, respectively. Given the set of correspondences and their weights, it is recognized herein that there is a closed-form solution for estimating the rotation R and the translation t, which depends on w_i. It is further recognized herein that, in some examples, correspondences with a high re-projection error should have low weights, and thus a low predicted keypoint probability. Conversely, in some cases,
- correspondences with a low re-projection error should have high weights, and thus a high predicted keypoint probability.
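- Equation (1) itself is not reproduced in this text, so the sketch below shows only one plausible form of such a weighted re-projection loss under stated assumptions: branch-A keypoints are assumed to be available in 3D, the intrinsics K of the second view are assumed known, and a squared pixel error is used.

```python
import torch

def weighted_reprojection_loss(kp3d_a, kp2d_b, scores_a, scores_b, R, t, K) -> torch.Tensor:
    """One plausible weighted relative-pose loss over keypoint correspondences.

    Branch-A keypoints (N, 3) are transformed by the estimated (R, t), projected with
    intrinsics K (3, 3), and compared against the corresponding branch-B 2D keypoints
    (N, 2); each correspondence i is weighted by w_i = score_i^A + score_i^B.
    """
    w = scores_a + scores_b                     # per-correspondence weights
    p_cam = kp3d_a @ R.T + t                    # transform into the second view
    proj = p_cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]             # perspective divide
    err = (uv - kp2d_b).pow(2).sum(dim=1)       # squared re-projection error
    return (w * err).sum() / w.sum()
```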
- the correspondences are weighted as part of the optimization.
- an objective is to optimize the loss function (equation 1) with respect to the estimated keypoint probabilities.
- the derivative of the loss with respect to the weights w is calculated so that the predicted keypoint probabilities can be optimized.
- computing an estimation for the rotation R includes computing a singular value decomposition (SVD).
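- The closed-form step referred to here is commonly implemented as a weighted Kabsch (Procrustes) solve via SVD; the NumPy sketch below is a standard stand-in, since the exact formulation is not spelled out in this text.

```python
import numpy as np

def weighted_rigid_transform(P: np.ndarray, Q: np.ndarray, w: np.ndarray):
    """Closed-form weighted estimate of R, t with R @ P[i] + t ≈ Q[i].

    P, Q are (N, 3) corresponding 3D keypoints; w is the (N,) correspondence weights.
    """
    w = w / w.sum()
    mu_p = (w[:, None] * P).sum(axis=0)
    mu_q = (w[:, None] * Q).sum(axis=0)
    Pc, Qc = P - mu_p, Q - mu_q
    H = (w[:, None] * Pc).T @ Qc                 # 3x3 weighted covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_q - R @ mu_p
    return R, t
```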
- the learned keypoint predictions can be transferred between branches of the network 100.
- the learned viewpoint-invariant features and keypoint proposals from the first and second branches 102a and 102b can be transferred to the picture domain 103, which includes the third and fourth branches 102c and 102d.
- the backbone CNN and the KPN can be forced to generate outputs that are as similar as possible.
- objective functions can be employed that compare the outputs of the third and fourth branches 102c and 102d during a forward pass of the network 100.
- a Euclidean loss can attempt to align the global features between the backbone CNNs 112c and 112d.
- Global features can refer to the output of the CNNs 112a-d.
- the global features can be vectors in a n-dim vector space.
- aligning refers to a loss that is imposed during optimization that is low if two vectors are close in Euclidean distance (e.g., ||v0 - v1||).
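- In code, such an alignment term can be as simple as the squared Euclidean distance between the two branches' global feature vectors; the sketch below assumes f_depth and f_rgb are the pooled outputs of CNNs 112c and 112d for the same RGB-D pair.

```python
import torch

def global_alignment_loss(f_depth: torch.Tensor, f_rgb: torch.Tensor) -> torch.Tensor:
    """Euclidean loss pulling the RGB branch's global features toward the depth branch's."""
    return (f_rgb - f_depth).pow(2).sum(dim=1).mean()
```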
- a keypoint consistency constraint can be enforced, which may require the KPN 116d from the fourth branch 102d to produce the same output as the KPN 116c from the third branch 102c.
- the keypoint consistency constraint is formulated as a cross-entropy loss, which is equivalent to a log loss with binary labels of {0, 1}: L_consistency = -Σ_i [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ] (3)
- y_i are the keypoint predictions from the third branch 102c, which serve as the ground-truth
- ŷ_i are the keypoint predictions from the fourth branch 102d.
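- A sketch of this consistency term is shown below, assuming both KPNs output per-location logits over the same grid and that the third branch's predictions are treated as fixed soft targets.

```python
import torch
import torch.nn.functional as F

def keypoint_consistency_loss(kpn_depth_logits: torch.Tensor,
                              kpn_rgb_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy consistency: the RGB branch's KPN is pushed to reproduce
    the keypoint probabilities predicted by the depth (third) branch."""
    with torch.no_grad():
        target = torch.sigmoid(kpn_depth_logits)   # branch-3 predictions as soft ground truth
    return F.binary_cross_entropy_with_logits(kpn_rgb_logits, target)
```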
- various formulations may be used in the system 100 so as to match a given RGB image with a rendered depth match, and so as to estimate poses of an object.
- the learned keypoints can be used to extract local features from a set of rendered depth images (e.g., see FIGs. 4A-C).
- a repository or database of local features can be created.
- the repository can be indexed by the 3D location associated with a local feature.
- the repository or database may describe the representation of a given object from multiple viewpoints using learned keypoints and modality-invariant features.
- the 2D keypoints and their features can be extracted from the given RGB image 105, at 140.
- the 2D keypoints can be matched to respective local features in the repository, for instance keypoints from an example CAD image 107, so as to establish respective 2D to 3D correspondences or associations, at 142. Then, the 6-degree-of-freedom object pose in the given RGB image 105 can be estimated, at 144.
- the pose of the object can be estimated using various techniques, such as RANSAC for example.
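- A possible end-to-end sketch of this test-time step using OpenCV follows; the brute-force descriptor matching, the PnP-RANSAC solver, and the 8-pixel inlier threshold are assumptions rather than details from the disclosure, which names RANSAC as one example technique.

```python
import numpy as np
import cv2

def estimate_pose_from_rgb(query_desc, query_kp2d, db_desc, db_kp3d, K):
    """Match RGB keypoint descriptors against the CAD-rendered database and recover
    a 6-degree-of-freedom pose from the 2D-3D correspondences with RANSAC.

    db_kp3d[i] is the 3D location stored in the repository with descriptor db_desc[i].
    """
    # Nearest-neighbour matching in descriptor space (brute force for brevity).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.match(query_desc.astype(np.float32), db_desc.astype(np.float32))
    if len(matches) < 4:
        return None
    pts2d = np.array([query_kp2d[m.queryIdx] for m in matches], dtype=np.float64)
    pts3d = np.array([db_kp3d[m.trainIdx] for m in matches], dtype=np.float64)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None,
                                                 reprojectionError=8.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec
```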
- a computing environment 500 includes a computer system 510 that may include a communication mechanism such as a system bus 521 or other communication mechanism for communicating information within the computer system 510.
- the computer system 510 further includes one or more processors 520 coupled with the system bus 521 for processing the information.
- the processors 520 may include one or more central processing units
- CISC Complex Instruction Set Computer
- FPGA Field-programmable gate array
- SoC System-on-a-Chip
- DSP digital signal processor
- processor(s) 520 may have any suitable microarchitecture design that includes any number of constituent components such as, for example, registers, multiplexers, arithmetic logic units, cache controllers for controlling read/write operations to cache memory, branch predictors, or the like.
- the microarchitecture design of the processor may be capable of supporting any of a variety of instruction sets.
- a processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between.
- a user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof.
- a user interface comprises one or more display images enabling user interaction with a processor or other device.
- the system bus 521 may include at least one of a system bus, a memory bus, an address bus, or a message bus, and may permit exchange of information (e.g., data (including computer-executable code), signaling, etc.) between various components of the computer system 510.
- information e.g., data (including computer-executable code), signaling, etc.
- the system bus 521 may include, without limitation, a memory bus or a memory controller, a peripheral bus, an accelerated graphics port, and so forth.
- the system bus 521 may be associated with any suitable bus architecture including, without limitation, an Industry Standard Architecture (ISA), a Micro Channel Architecture (MCA), an Enhanced ISA (EISA), a Video Electronics Standards Association (VESA) architecture, an Accelerated Graphics Port (AGP) architecture, a Peripheral Component Interconnects (PCI) architecture, a PCI-Express architecture, a Personal Computer Memory Card International Association (PCMCIA) architecture, a Universal Serial Bus (USB) architecture, and so forth.
- ISA Industry Standard Architecture
- MCA Micro Channel Architecture
- EISA Enhanced ISA
- VESA Video Electronics Standards Association
- AGP Accelerated Graphics Port
- PCI Peripheral Component Interconnects
- PCMCIA Personal Computer Memory Card International Association
- USB Universal Serial Bus
- system memory 530 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 520.
- a basic input/output system 533 (BIOS) containing the basic routines that help to transfer information between elements within computer system 510, such as during start-up, may be stored in the ROM 531.
- RAM 532 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 520.
- System memory 530 may additionally include, for example, operating system 534, application programs 535, and other program modules 536.
- Application programs 535 may also include a user portal for development of the application program, allowing input parameters to be entered and modified as necessary.
- the operating system 534 may be loaded into the memory 530 and may provide an interface between other application software executing on the computer system 510 and hardware resources of the computer system 510. More specifically, the operating system 534 may include a set of computer-executable instructions for managing hardware resources of the computer system 510 and for providing common services to other application programs (e.g., managing memory allocation among various application programs). In certain example embodiments, the operating system 534 may control execution of one or more of the program modules depicted as being stored in the data storage 540.
- the operating system 534 may include any operating system now known or which may be developed in the future including, but not limited to, any server operating system, any mainframe operating system, or any other proprietary or non-proprietary operating system.
- the computer system 510 may also include a disk/media controller 543 coupled to the system bus 521 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 541 and/or a removable media drive 542 (e.g., floppy disk drive, compact disc drive, tape drive, flash drive, and/or solid state drive).
- Storage devices 540 may be added to the computer system 510 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
- Storage devices 541 , 542 may be external to the computer system 510.
- the computer system 510 may also include a field device interface 565 coupled to the system bus 521 to control a field device 566, such as a device used in a production line.
- the computer system 510 may include a user input interface or GUI 561 , which may comprise one or more input devices, such as a keyboard, touchscreen, tablet and/or a pointing device, for interacting with a computer user and providing information to the processors 520.
- the computer system 510 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 520 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 530. Such instructions may be read into the system memory 530 from another computer readable medium of storage 540, such as the magnetic hard disk 541 or the removable media drive 542.
- the magnetic hard disk 541 and/or removable media drive 542 may contain one or more data stores and data files used by embodiments of the present disclosure.
- the data store 540 may include, but are not limited to, databases (e.g., relational, object-oriented, etc.), file systems, flat files, distributed data stores in which data is stored on more than one node of a computer network, peer-to-peer network data stores, or the like.
- the data stores may store various types of data such as, for example, skill data, sensor data, or any other data generated in accordance with the embodiments of the disclosure.
- Data store contents and data files may be encrypted to improve security.
- the processors 520 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 530.
- hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
- Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 521. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
- Computer readable medium instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- LAN local area network
- WAN wide area network
- Internet Service Provider for example, AT&T, MCI, Sprint, EarthLink, MSN, GTE, etc.
- programmable logic circuitry may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
- FPGA field-programmable gate arrays
- PLA programmable logic arrays
- the computing environment 500 may further include the computer system 510 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 580.
- the network interface 570 may enable communication, for example, with other remote devices 580 or systems and/or the storage devices 541 , 542 via the network 571.
- Remote computing device 580 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 510.
- computer system 510 may include modem 672 for establishing communications over a network 571 , such as the Internet. Modem 672 may be connected to system bus 521 via user network interface 570, or via another appropriate mechanism.
- Network 571 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections.
- the network 571 may be wired, wireless or a combination thereof.
- Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art.
- Wireless connections may be implemented using Wi-Fi, WiMAX, Bluetooth, infrared, cellular networks, satellite, or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 571.
- program modules, applications, computer- executable instructions, code, or the like depicted in FIG. 5 as being stored in the system memory 530 are merely illustrative and not exhaustive and that processing described as being supported by any particular module may alternatively be distributed across multiple modules or performed by a different module.
- various program module(s), script(s), plug-in(s), Application Programming Interface(s) (API(s)), or any other suitable computer-executable code hosted locally on the computer system 510, the remote device 580, and/or hosted on other computing device(s) accessible via one or more of the network(s) 571 may be provided to support functionality provided by the program modules, applications, or computer-executable code depicted in FIG.
- functionality may be modularized differently such that processing described as being supported collectively by the collection of program modules depicted in FIG. 5 may be performed by a fewer or greater number of modules, or functionality described as being supported by any particular module may be supported, at least in part, by another module.
- program modules that support the functionality described herein may form part of one or more applications executable across any number of systems or devices in accordance with any suitable computing model such as, for example, a client-server model, a peer- to-peer model, and so forth.
- any of the functionality described as being supported by any of the program modules depicted in FIG. 5 may be implemented, at least partially, in hardware and/or firmware across any number of devices.
- the computer system 510 may include alternate and/or additional hardware, software, or firmware components beyond those described or depicted without departing from the scope of the disclosure. More particularly, it should be appreciated that software, firmware, or hardware components depicted as forming part of the computer system 510 are merely illustrative and that some components may not be present or additional components may be provided in various embodiments. While various illustrative program modules have been depicted and described as software modules stored in system memory 530, it should be appreciated that functionality described as being supported by the program modules may be enabled by any combination of hardware, software, and/or firmware. It should further be appreciated that each of the above-mentioned modules may, in various embodiments, represent a logical partitioning of supported functionality.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the Figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Biodiversity & Conservation Biology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
A neural network or system can be configured to learn keypoint locations and respective descriptors associated with each keypoint location. The network can include a CAD domain and a picture or RGB-D domain. The CAD domain can include a first branch of the network and a second branch of the network. The CAD domain can be configured to train on pairs of depth images rendered from CAD models of CAD objects, so as to learn viewpoint-invariant features of the CAD objects. The picture domain can include a third branch of the network and a fourth branch of the network. The picture domain can be configured to train on pairs of images of objects, for instance a depth image and its corresponding RGB image, so as to learn modality-invariant features of the objects. At test time, the network can identify an RGB image, for instance a pose or category defined by the RGB image, using a database created from depth images rendered from CAD models.
Description
LEARNING KEYPOINTS AND MATCHING RGB IMAGES TO CAD MODELS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Application Serial No. 62/750,847 filed October 26, 2018, the disclosure of which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
[0002] This application relates to image learning and matching techniques. The technology described herein is particularly well-suited for, but not limited to, matching RGB (red-green-blue) images of objects to corresponding computer-aided design (CAD) models.
BACKGROUND
[0003] RGB (red-green-blue) images or data refers to data that can be combined in various proportions to obtain any color in the visible spectrum, for example, to represent colors on a computer display. Various tasks, such as identifying parts in an assembly of parts, require RGB images to be captured. But capturing the RGB data to identify parts can be an arduous procedure. For example, in some cases, the data may need to be manually captured and manually annotated at many different assembly sites of interest, for instance sites for assembling different train engines or turbines. Capturing and annotating RGB data in this manner, or in similar manners, can be cumbersome (or in some cases, impossible) and can result in errors. In contrast to RGB images, however, computer-aided design (CAD) models of parts and assemblies are typically readily available, and such models can be the source of numerous synthetically rendered depth images. Various technical challenges arise, however, in attempting to use CAD models to perform various tasks such as identifying a part that is captured in an RGB image.
SUMMARY
[0004] Embodiments of the invention address and overcome one or more of the described-herein shortcomings by providing methods, systems, and apparatuses that learn RGB (red-green-blue) images and depth images rendered from CAD models so as to match RGB images of objects to corresponding CAD models. In particular, a system described herein can be trained to identify an RGB image, for instance a pose or category defined by the RGB image, at test time using a database created from depth images rendered from CAD models.
[0005] In an example aspect, a quadruplet neural network is configured to learn keypoint locations and respective descriptors associated with each keypoint location. The quadruplet neural network includes a CAD domain and a picture domain. The CAD domain includes a first branch of the network and a second branch of the network. The CAD domain is configured to train on pairs of CAD depth images rendered from CAD models of CAD objects, so as to learn viewpoint-invariant features of the CAD objects. The picture domain includes a third branch of the network and a fourth branch of the network. The picture domain is configured to train on pairs of RGB images of RGB objects, so as to learn modality-invariant features of the RGB objects.
[0006] In particular, continuing with the example aspect, the first branch is configured to train on first CAD depth images rendered from the CAD models of the CAD objects, and the second branch is configured to train on second CAD depth images rendered from the CAD models of the CAD objects. Further, each first CAD depth image corresponds to a respective second CAD depth image so as to define the pairs of CAD depth images, such that the first CAD depth image in a respective pair includes a particular CAD object in a first pose, and the second CAD depth image in the respective pair defines the particular CAD object in a second pose that is different than the first pose. The picture domain includes a third branch of the network and a fourth branch of the network. The picture domain is configured to train on pairs of RGB images of RGB objects, so as to learn modality-invariant features of the RGB objects. In particular, the picture domain can be configured to train on an RGB image and its corresponding depth image, which
can be collectively referred to as RGB-D images. Thus, each RGB depth image corresponds to a respective RGB image so as to define the pairs of RGB images, such that the depth image in a respective RGB-D pair includes a particular RGB object in a particular pose, and the RGB image in the respective pair also includes the particular RGB object in the particular pose.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
[0008] FIG. 1 is a block diagram of a quadruplet convolutional neural network according to embodiments of the present disclosure.
[0009] FIG. 2 is another block diagram that shows an example of pose estimation that can be performed by the network shown in FIG. 1 during an example testing phase, according to embodiments of this disclosure.
[0010] FIG. 3 is another block diagram that shows an example of making keypoint predictions that can be performed by a network according to embodiments of this disclosure.
[0011] FIGs. 4A-C depict example keypoint predictions that can be made in accordance with the example depicted in FIG. 3.
[0012] FIG. 5 shows an example of a computing environment within which
embodiments of the disclosure may be implemented.
DETAILED DESCRIPTION
[0013] Embodiments of the invention address and overcome one or more of the described-herein shortcomings or technical problems by providing methods, systems,
and apparatuses that learn how to match RGB (red-green-blue) images of an object to respective depth images rendered from computer-aided design (CAD) models. The RGB images may be captured by consumer-grade cameras. Embodiments described herein can also estimate a given object’s pose and/or category.
[0014] In an example embodiment, a quadruplet neural network includes a CAD domain and a picture domain. The CAD domain includes a first branch of the network and a second branch of the network. The CAD domain is configured to train on pairs of CAD depth images rendered from CAD models of CAD objects, so as to learn viewpoint-invariant features of the CAD objects. In particular, the first branch is configured to train on first CAD depth images rendered from the CAD models of the CAD objects, and the second branch is configured to train on second CAD depth images rendered from the CAD models of the CAD objects. Further, each first CAD depth image corresponds to a respective second CAD depth image so as to define the pairs of depth images, such that the first CAD depth image in a respective pair includes a particular CAD object in a first pose, and the second CAD depth image in the respective pair defines the particular CAD object in a second pose that is different than the first pose. Continuing with the example embodiment, the picture domain includes a third branch of the network and a fourth branch of the network. The picture domain is configured to train on pairs of RGB images of RGB objects, so as to learn modality-invariant features of the RGB objects. In particular, the picture domain can be configured to train on an RGB image and its corresponding depth image, which can be collectively referred to as RGB-D images. As used herein, unless otherwise specified, RGB-D refers to a pair of images that includes an RGB image and a depth image that are from the same viewpoint at the same time. In particular, each RGB depth image corresponds to a respective RGB image so as to define the pairs of RGB images, such that the RGB depth image in a respective pair includes a particular RGB object in a particular pose, and the RGB image in the respective pair also includes the particular RGB object in the particular pose.
[0015] The disclosed methods and systems present an improvement to the functionality of the computer used to perform such a computer-based task.
Furthermore, the disclosed methods and systems can solve 3D object pose estimation
problems without relying on annotated RGB images. To do so, it is recognized herein that readily available CAD models of objects are rich sources of data that can provide a large number of synthetically rendered depth images. These rendered CAD depth images can be used to match RGB images for object estimation, thereby eliminating the use of real-world textures for CAD models or explicit 3D pose annotations for RGB images, which can be expensive. In particular, methods and systems disclosed herein can learn how to select keypoints and enforce viewpoint and modality invariance across RGB images and CAD model renderings. Further, after training, methods and systems described herein can reliably estimate the pose of objects in RGB images, and can generalize to object instances that were not seen during training.
[0016] It is recognized herein that CAD models of objects are often available, and can be used to generate synthetically rendered depth images. The objects may define parts for an assembly or an assembly itself. In some examples, such images can form the basis of a representation of parts for a given assembly. It is also recognized herein, however, that matching these representations of objects or parts to RGB images of the objects or parts can be burdensome or unreliable. In particular, for example, it can be difficult to obtain RGB images having accurate three-dimensional (3D) pose
annotations. Furthermore, there can be a significant appearance difference between RGB images and CAD models. In various example embodiments, however, a system or network is trained to match an RGB image at test time to an object in a database created from depth images rendered from CAD models.
[0017] Referring initially to FIG.1 , a neural network or system, for instance a quadruplet convolutional neural network or system 100, can learn viewpoint-invariant and modality-invariant features. The neural network 100 can include various branches, for instance a first branch 102a, a second branch 102b, a third branch 102c, and a fourth branch 102d. The neural network 100 can include a CAD domain or portion 101 that trains on CAD models, for instance CAD depth images rendered from CAD models. The CAD domain 101 can be configured to train on pairs of CAD depth images rendered from CAD models of CAD objects, so as to learn viewpoint-invariant features of the CAD objects. In accordance with the illustrated example, the CAD domain 101 includes the first branch 102a and the second branch 102b. The neural network 100
can further include a picture (or RGB-D) domain or portion 103 that trains on RGB images. The picture domain 103 can be configured to train on pairs of RGB images of RGB objects, so as to learn modality-invariant features of the RGB objects. In accordance with the illustrated example, the picture domain 103 includes the third branch 102c and the fourth branch 102d.
[0018] As described further herein, the neural network 100 can learn viewpoint-invariant features using rendered CAD depth images, for instance a pair of depth images rendered from a CAD model with a pose perturbation between them. The neural network 100 can learn modality-invariant features from an aligned pair of images, for instance an RGB image aligned with its corresponding depth image. As used herein, unless otherwise specified, a modality-invariant feature refers to a feature that is reflected independently of the modality through which it has been apprehended. By way of example, a modality-invariant feature or object is represented the same in a CAD model and an RGB image. Similarly, a view-invariant feature refers to a feature that is reflected independently of the pose of the object that defines the feature. Thus, a modality-invariant feature can refer to a transformation of the input or parts of the input (e.g., an RGB image or a real/synthetic depth image) that has the same value if the underlying object (and/or reference point of the object) is the same, regardless of whether the modality is RGB or depth. Similarly, a pose-invariant feature can refer to a feature that has the same value if the underlying object (and/or reference point of the object) is the same, regardless of the perspective in which the object is seen.
[0019] With continuing reference to FIG. 1 , the neural network 100 is trained during a training stage. The neural network 100 can be trained using CAD models of one or more objects, for instance a CAD model of a three-dimensional (3D) object. In particular, in accordance with the illustrated example, the first branch 102a is configured to train on first CAD depth images 104a rendered from CAD models of CAD objects, the second branch 102b is configured to train on second CAD depth images 104b rendered from CAD models of CAD objects, the third branch 102c is configured to train on third depth images 110a, for instance RGB depth images that have a corresponding RGB image, and the fourth branch 102d is configured to train on RGB images 110b. Further, each first CAD depth image 104a corresponds to a respective second CAD depth image
104b so as to define the pairs of depth images, such that the first CAD depth image 104a in a respective pair includes a particular CAD object in a first pose, and the second CAD depth image in the respective pair defines the particular CAD object in a second pose that is different than the first pose. In contrast, each depth image 110a corresponds to a respective RGB image 110b so as to define the pairs of RGB-D images, such that the RGB depth image 110a in a respective pair includes a particular RGB object in a particular pose, and the RGB image 110b in the respective pair also includes the particular RGB object in the particular pose.
[0020] During an example training stage, a CAD model of a given object is sampled so as to generate depth images of the object. In particular, the object can be sampled with virtual camera poses from a view-sphere around the object, such that the depth images are of various perspectives of the object. The training stage includes multiple training instances. In an example training instance, the depth images are sampled to create a pair of depth images. As shown, the pair of depth images includes the first depth image 104a of a CAD object 106, and the second depth image 104b of the object 106. The example object 106 is a bed, though it will be understood that the object 106 can be any object having a CAD model as desired, and all such objects are
contemplated as being within the scope of this disclosure.
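The following is a minimal, illustrative sketch of how virtual camera poses might be sampled on a view-sphere around an object; the function names, the fixed radius, and the downstream depth renderer are assumptions for the example and are not part of the original disclosure.

```python
# Illustrative sketch (not from the original disclosure): sampling virtual
# camera poses on a view-sphere around a CAD object. A separate, assumed
# renderer would consume each pose to produce a synthetic depth image.
import numpy as np

def look_at(camera_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Build a world-to-camera rotation whose z-axis points from the camera to the target."""
    forward = target - camera_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    # Rows are the camera axes expressed in world coordinates.
    return np.stack([right, true_up, forward])

def sample_view_sphere(num_views, radius=2.0, seed=0):
    """Yield (R, t) camera poses roughly uniformly distributed on a sphere."""
    rng = np.random.default_rng(seed)
    for _ in range(num_views):
        azimuth = rng.uniform(0.0, 2.0 * np.pi)
        elevation = np.arcsin(rng.uniform(-1.0, 1.0))  # uniform over the sphere
        cam_pos = radius * np.array([
            np.cos(elevation) * np.cos(azimuth),
            np.cos(elevation) * np.sin(azimuth),
            np.sin(elevation),
        ])
        R = look_at(cam_pos)
        t = -R @ cam_pos  # world-to-camera translation
        yield R, t
```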
[0021] In accordance with the illustrated example, the first depth image 104a is input to the first branch 102a of the neural network 100, and the second depth image 104b is input to the second branch 102b of the neural network 100. Further, the first depth image 104a defines a first pose of the object 106, and the second depth image 104b defines a second pose of the object 106 that is different than the first pose of the object 106. As used herein, unless otherwise specified, the pose of a given object refers to the position and orientation of the object. Thus, in accordance with the illustrated example, the inputs to the first and second branches 102a and 102b during an example training instance are from the same modality (e.g., CAD depth images), but include different poses of the same object. A training stage may include multiple, for instance hundreds or thousands, training instances, such that multiple pairs of depth images are input into the first and second branches 102a and 102b of the network 100. The pairs of depth images that are input to the first and second branches 102a and 102b during the
training stage may be of the same object. Alternatively, at least one pair, for instance multiple pairs, of the CAD depth images that are input to the first and second branches 102a and 102b may be of a different object with respect to other pairs that are input to the first and second branches 102a and 102b.
[0022] With continuing reference to FIG. 1 , pictures, for instance RGB images, can also be obtained during the example training stage and input into the third and fourth branches 102c and 102d of the neural network 100. In some cases, a dataset is available that includes RGB images that do not have annotations that indicate the respective poses of the object in the RGB images. During an example training stage, the dataset is sampled so as to generate pairs of images of one or more objects, for instance an RGB object 108. For example, a pair of RGB images can include a first RGB or RGB depth image 110a of the object 108 and a second RGB image 110b of the object 108. As used herein, unless otherwise specified, RGB-D refers to a pair of images that includes a depth image and its corresponding RGB image that are from the same viewpoint at the same time. The depth image can be an image channel in which each pixel relates to distance between an image plane and the corresponding object in the RGB image. Each of the RGB images in a pair of RGB images can include the object 108 in the same pose. In particular, the depth image 110a can include the object 108 in a particular pose, and the second RGB image 110b can also include the object 108 in the particular pose. Further, the RGB depth image 110a of a given pair that is input to the third branch 102c can include a specific RGB image with depth indications, and the second RGB image 110b of the given pair that is input to the fourth branch 102d can be the same specific RGB image without the depth indications. Thus, in accordance with the illustrated example, the inputs to the third and fourth branches 102c and 102d during an example training instance are from different modalities (e.g., RGB and depth), but include the same pose of the same object.
[0023] A training stage may include multiple training instances, such that multiple pairs of RGB images are input into the third and fourth branches 102c and 102d of the network 100. The pairs of RGB images that are input to the third and fourth branches 102c and 102d during the training stage may be of the same object. Alternatively, at least one pair, for instance multiple pairs, of the RGB images that are input to the third
and fourth branches 102c and 102d may be of a different object with respect to other pairs that are input to the third and fourth branches 102c and 102d. The pairs of RGB images that are input to the third and fourth branches 102c and 102d may include the same object or different objects as included in the pairs of depth images that are input to the first and second branches 102a and 102b.
[0024] During training, the quadruplet network 100 can learn view-invariant and modality-invariant features in two different pairs of branches, and, in cases in which there are no 3D pose annotations, the learned knowledge can be transferred internally in the network 100. In particular, an abundance of rendered CAD depth pairs can be taken advantage of in the first and second branches 102a and 102b to learn view-invariant features and keypoint generation. Then, the third branch 102c can leverage the knowledge learned from the first and second branches 102a and 102b, and attempt to transfer the knowledge to the fourth branch 102d.
[0025] Still referring to FIG. 1, each branch of the quadruplet network 100 can use a backbone convolutional neural network (CNN) to learn features or representations of the objects. In an example training instance, the first branch 102a can use a first CNN 112a to learn first local features 114a of the CAD objects (e.g., object 106) rendered in the first depth images 104a. The second branch 102b can use a second CNN 112b to learn second local features 114b of the CAD objects rendered in the second depth images 104b. By way of further example, in a training instance, the third branch 102c can use a third CNN 112c to learn local features of the RGB objects in the depth images 110a, and the fourth branch 102d can use a fourth CNN 112d to learn local features of the RGB objects in the RGB images 110b. In accordance with the illustrated example, weights and biases are shared between the first CNN 112a, the second CNN 112b, and the third CNN 112c. In particular, in some cases, the trainable parameters for weights and biases are shared between the first, second, and third CNNs 112a, 112b, and 112c.
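As a rough sketch of how the described weight sharing can be realized in practice, the same backbone module instance can simply be reused for the first, second, and third branches, so that one set of trainable parameters serves all three; the specific layer configuration below is an assumption, not the patented architecture.

```python
# Illustrative sketch (assumed architecture): weight sharing between branches
# is realized by reusing one module instance, so branches 1-3 update a single
# set of parameters while the RGB branch keeps its own.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """A small stand-in for the per-branch feature CNN (112a-c in the figure)."""
    def __init__(self, in_channels, feat_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.features(x)

class QuadrupletSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared_depth_cnn = Backbone(in_channels=1)   # branches 1-3 share this
        self.rgb_cnn = Backbone(in_channels=3)            # branch 4 has its own weights

    def forward(self, cad_depth_a, cad_depth_b, real_depth, rgb):
        f1 = self.shared_depth_cnn(cad_depth_a)
        f2 = self.shared_depth_cnn(cad_depth_b)
        f3 = self.shared_depth_cnn(real_depth)
        f4 = self.rgb_cnn(rgb)
        return f1, f2, f3, f4
```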
[0026] In particular, in an example training instance, the first CNN 112a can generate the first local features 114a of the first CAD depth image 104a, and the second CNN 112b can generate second local features 114b of the second CAD depth image 104b.
In some cases, local feature representations within a given pair are aligned with one
another by applying a triplet loss on the respective local feature representations that are extracted from the first and second branches 102a and 102b. For example, the local features 114a that are extracted from the first branch 102a can be aligned with the local features 114b that are extracted from the second branch 102b, so as to generate aligned local features. The aligned local features 122 can be generated by applying a triplet loss to the local features 114a and 114b.
[0027] The triplet loss method involves using the known camera poses of the rendered pairs of depth images from the CAD domain 101 and sampling training keypoint triplets (anchor-positive-negative). Specifically, for a randomly selected keypoint used as an anchor, the closest keypoint in 3D from the paired image is found and used as the positive. A keypoint that is farther away in 3D is selected as the negative. It will be understood that the positive example is actually the projection of the anchor's 2D location to the paired depth image. The triplet loss then optimizes the representation such that the feature distance between the anchor and the positive is smaller than the feature distance between the anchor and the negative plus a certain margin. Traditionally, the margin hyper-parameter is manually defined as a constant throughout the training procedure.
In some examples, however, the 3D information can be leveraged, and the margin can be defined to be the difference of the 3D distance between the anchor and negative as compared to the 3D distance between the anchor and positive. It is recognized herein that this may ensure that the learned feature distances are proportional to the 3D distances between the examples. In some examples, the triplet loss might only affect the backbone CNN during training and not the keypoint proposal network (KPN), which is now discussed.
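A minimal sketch of a triplet loss with such a 3D-distance-based margin is shown below; the function signature and tensor shapes are assumptions for illustration.

```python
# Illustrative sketch (an assumption, not the exact patented loss): a triplet
# loss whose margin is the gap between the anchor-negative and anchor-positive
# 3D distances, so feature distances stay roughly proportional to 3D distances.
import torch
import torch.nn.functional as F

def adaptive_margin_triplet_loss(f_anchor, f_pos, f_neg, p3d_anchor, p3d_pos, p3d_neg):
    """Feature tensors are (N, D); 3D point tensors are (N, 3)."""
    d_ap = F.pairwise_distance(f_anchor, f_pos)            # feature distance anchor-positive
    d_an = F.pairwise_distance(f_anchor, f_neg)            # feature distance anchor-negative
    margin = (torch.norm(p3d_anchor - p3d_neg, dim=1)      # 3D distance to the negative ...
              - torch.norm(p3d_anchor - p3d_pos, dim=1))   # ... minus 3D distance to the positive
    return F.relu(d_ap - d_an + margin).mean()
```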
[0028] Each branch of the quadruplet network 100 can also use a respective keypoint proposal network (KPN) to generate keypoint proposals. For example, the first branch 102a can include a first KPN 116a for the first depth images 104a, the second branch 102b can include a second KPN 116b for the second depth images 104b, the third branch 102c can include a third KPN 116c for the RGB-D images 110a, and the fourth branch 102d can include a fourth KPN 116d for the RGB image 110b. A keypoint generally refers to an important area or region of an object or image. The keypoint can refer to a particular location and a neighborhood around this location. In accordance
with the illustrated example, weights and biases are shared between the first KPN 116a, the second KPN 116b, and the third KPN 116c. In particular, in some cases, the trainable parameters for weights and biases are shared between the first, second, and third KPNs 116a, 116b, and 116c.
[0029] In various examples, a keypoint proposal includes a two-dimensional (2D) location, a 2D displacement vector, a spatial extension (2D region), a score, and a feature representation (e.g., feature representations 118 and 120) extracted from its 2D region. The 2D displacement vector can be regressed independently and applied on the 2D location to get the final position of a particular keypoint. This may allow various keypoints to shift around their initial positions during training and optimize toward a respective precise keypoint localization. The local feature representation 118 of the first depth image 104a can include the first depth image 104a and indications of keypoints 124 overlayed on the first depth image 104a. Similarly, the local feature representation 120 of the second depth image 104b can include the second depth image 104b and indications of keypoints 126 overlayed on the second depth image 104b. Thus, the local feature representations can refer to the transformation of the RGB or depth images into the learned feature space (via 112a-c), which can then (e.g., separately for each candidate keypoint) be pooled with the respective region from the KPN.
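For illustration only, the pieces of information carried by a single keypoint proposal can be collected in a simple structure such as the following; the field names are hypothetical.

```python
# Illustrative sketch (hypothetical field names): the contents of one keypoint
# proposal as described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class KeypointProposal:
    location_2d: np.ndarray      # (2,) initial 2D location in the image
    displacement: np.ndarray     # (2,) regressed offset applied to the location
    region: tuple                # (x, y, width, height) spatial extent used for pooling
    score: float                 # predicted keypoint probability / weight
    descriptor: np.ndarray       # pooled local feature representation

    @property
    def refined_location(self) -> np.ndarray:
        # Final keypoint position after applying the displacement vector.
        return self.location_2d + self.displacement
```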
[0030] In various embodiments, the KPN is trained to generate keypoints on rendered depth images that are optimal toward estimating the relative poses defined by respective images within a given pair of images. In accordance with the illustrated example, the first KPN 116a generates keypoints 124 associated with the first depth image 104a, and the second KPN 116b generates keypoints 126 associated with the second depth image 104b. Each of the generated keypoints 124 and 126 defines a respective 3D location. Using the 3D locations of the keypoints 124 and 126, associations or correspondences can be established between keypoints 124 and keypoints 126, as further described below. Furthermore, embodiments are described herein that optimize the selection of a set of keypoints based on a pose estimation objective. The keypoints can be transferred from the depth images to the RGB domain 103, which includes the third and fourth branches 102c and 102d.
[0031] Referring to FIGs. 1 and 3, keypoint predictions are learned, for instance by the KPN 116a and KPN 116b, so that corresponding sets of keypoints 124 and 126 can be generated that are optimal for estimating the relative pose between the object in the first depth image 104a and the object in the second depth image 104b within a pair of depth images. Thus, a relative pose of the object 106 in the first branch 102a as compared to the pose of the object 106 in the second branch 102b can be estimated. In particular, for example, based on a first set of keypoints 124 and a second set of keypoints 126, a correspondence or association between the sets of the keypoints is established in 3D space. In particular, at 130, a rotation R and translation t can be estimated based on the first set of keypoints 124 and the second set of keypoints 126. The correspondence or association between the keypoints of the different images can define the rotation R and translation t. For example, the keypoints 124 of the first depth image 104a can be projected onto the second depth image 104b when the first depth image 104a is rotated about a rotation axis according to the rotation R , and translated along a translation axis according to the translation t. Alternatively, in some cases, the rotation R and translation t can be defined such that keypoints 126 of the second depth image 104b can be projected onto the first depth image 104a when the second depth image 104b is rotated in accordance with the rotation R, and translated in accordance with the translation t. Thus, the rotation R and translation t can define how the pose of the object 106 in one of the first and second depth images 104a and 104b relates to the pose of the object 106 in the other of the first and second depth images 104a and 104b.
[0032] Referring in particular to FIG. 3, the object 106 from one of the branches 102a and 102b, in particular the keypoints from one of the branches 102a, can be rotated and translated in accordance with the estimated relative pose. After the rotation and translation, the keypoints from the one of the branches 102a and 102b can be projected onto the depth image from the other one of the branches 102a and 102b, so as to define a projection representation 128. For example, as shown in FIG. 3, the keypoints 124 of the first depth image 104a are rotated and translated in accordance with the rotation R and translation t, respectively, and projected onto the local feature
representation 120 of the second depth image 104b, so as to generate the projection representation 128. As shown in FIG. 3, the projection representation 128 can include
the local feature representation 120 generated by the second branch 102b and the keypoints 124 generated by the first branch 102a overlaid on the local feature representation 120, and thus the second depth image 104b. In this case, the keypoints 124 define projected keypoints that are projected onto a different depth image than the depth image from which they were generated, and the keypoints 126 define original keypoints that are not rotated or translated. Alternatively, the projection representation 128 can include the local feature representation 118 generated by the first branch 102a and the keypoints 126 generated by the second branch 102b overlaid on the local feature representation 118, and thus the first depth image 104a, such that the keypoints 126 define the projected keypoints and the keypoints 124 define the original keypoints.
[0033] In some examples, the system 100 determines whether there is any misalignment between the original keypoints and the projected keypoints. If there is misalignment between corresponding keypoints, the system 100 can compute an error, for instance a re-projection error, which can be used to penalize the initial keypoint predictions, at 132. In an example, a relative pose estimation loss is formulated at 130, which weights each correspondence separately based on its re-projection error. An objective is that the KPNs will be trained to produce the sets of keypoints that can estimate the optimal relative pose between the two depth images within a given pair of depth images. For example, referring again to the example depicted in FIG. 3, the keypoints 124 and 126 can define initial keypoint predictions. As shown, the keypoints 124 define the projected keypoints on the projection representation 128, and the keypoints 126 define the original keypoints on the projection representation 128. Thus, any misalignment (re-projection error) between the projected keypoints 124 and the corresponding original keypoints 126 in the projection representation 128 can be used to penalize the initial keypoint predictions. The penalty can be seen from the loss function for (R, t), which is related to the difference between a re-projected matched keypoint p and its matched counterpart q. For example, in some examples, if \( \|(R\, p_i + t) - q_i\|_2 \) is zero for a particular matched pair i, then the error is zero, and the match is perfect. If the error is non-zero, the re-projection differs from the matched counterpart and is corrected
(penalized).
[0034] As described above, in various examples, a relative pose of a given object in the first branch 102a as compared to the pose of the object in the second branch 102b is determined or estimated. The relative pose can define the rotation R and/or translation t. The relative pose can be determined based on the weighted orthogonal Procrustes problem. For a weighted set of corresponding points, the weighted orthogonal Procrustes problem can be solved to find a rigid transformation for optimal alignment. In particular, given a set of keypoints that correspond to another set of keypoints, the weighted orthogonal Procrustes problem can be solved so as to find the rotation R and translation t for which the re-projection error of the weighted
correspondences is minimal. For example, two sets of corresponding points can be represented as \( \mathcal{Q} = \{q_1, q_2, \ldots, q_n\} \) and \( \mathcal{P} = \{p_1, p_2, \ldots, p_n\} \). A function for the relative pose, which can also be referred to as a loss function, can be represented as:
\[ (R, t) = \underset{R \in SO(d),\; t \in \mathbb{R}^d}{\arg\min} \; \sum_{i=1}^{n} w_i \, \big\| (R\, p_i + t) - q_i \big\|_2^2 \qquad (1) \]
With respect to equation 1, \( w_i = score_i^A + score_i^B \) for each ith correspondence. Further, in accordance with an example, \( score_i^A \) and \( score_i^B \) are the predicted keypoint scores that belong to correspondence i from the first and second branches 102a and 102b, respectively. Given the set of correspondences and their weights, it is recognized herein that there is a closed-form solution for estimating the rotation R and the translation t, which depend on the weights \( w_i \). It is further recognized herein that, in some examples, correspondences with a high re-projection error should have low weights, and thus a low predicted keypoint probability. Conversely, in some cases, correspondences with a low re-projection error should have high weights, and thus a high predicted keypoint probability. In some examples, the correspondences are weighted as part of the optimization. The correspondences can be weighted with the sum of the predicted scores of their keypoints p and q (e.g., \( w_i = score_i^A + score_i^B \)), and part of the optimization objective can be to predict the individual keypoint scores for p and q so that good correspondences are realized.
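A minimal sketch of the closed-form solution to the weighted orthogonal Procrustes problem of equation (1), using weighted centroids and an SVD, is shown below; the helper name and array layout are assumptions.

```python
# Illustrative sketch of the closed-form weighted Procrustes (Kabsch-style)
# solve for equation (1). Variable names and shapes are assumptions.
import numpy as np

def weighted_procrustes(P, Q, w):
    """Find R, t minimizing sum_i w_i * ||(R p_i + t) - q_i||^2.

    P, Q: (n, 3) corresponding 3D keypoints; w: (n,) non-negative weights.
    """
    w = w / w.sum()
    p_bar = (w[:, None] * P).sum(axis=0)          # weighted centroid of P
    q_bar = (w[:, None] * Q).sum(axis=0)          # weighted centroid of Q
    P0, Q0 = P - p_bar, Q - q_bar
    H = (w[:, None] * P0).T @ Q0                  # weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_bar - R @ p_bar
    return R, t
```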
[0035] In various examples, an objective is to optimize the loss function (equation 1) with respect to the estimated keypoint probabilities. To do so, in an example, the derivative of the loss f with respect to the weights w, \( \partial f / \partial w \), is calculated; this derivative computation is referred to herein as equation 2.
[0036] In an example, computing an estimation for the rotation R includes computing a singular value decomposition (SVD). In some cases, the gradient computation can be approximated by using the first-order Taylor series approximation \( f(w_0) = f(w) + \frac{\partial f}{\partial w}(w_0 - w) \). The terms in the approximation can be rearranged, and the original derivative computation (equation 2) can be substituted, such that:
\[ \frac{\partial f}{\partial w} \approx \frac{f(w_0) - f(w)}{w_0 - w}, \]
where \( w_0 = w + \mathrm{randu}(-a, a) \), and a is a small fraction, such as 0.0001 for example. This formulation of \( \partial f / \partial w \) can allow the derivative to be computed for each correspondence, based on the re-projection error. Further, during training of the network 100, the keypoint-specific gradient can be back-propagated. In various examples, \( \partial f / \partial w_i \) is the gradient, and the index i indicates a specific keypoint.
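Continuing the hypothetical helpers above, the described finite-difference approximation of the per-correspondence gradient can be sketched as follows; the loss here is the weighted re-projection error of equation (1).

```python
# Illustrative sketch (continuing the hypothetical helpers above): a finite-
# difference approximation of the gradient of the re-projection loss with
# respect to each correspondence weight w_i.
import numpy as np

def reprojection_loss(P, Q, w):
    R, t = weighted_procrustes(P, Q, w)           # from the earlier sketch
    residuals = (P @ R.T + t) - Q
    return float((w * (residuals ** 2).sum(axis=1)).sum())

def weight_gradient(P, Q, w, a=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    w0 = w + rng.uniform(-a, a, size=w.shape)     # small random perturbation of the weights
    f, f0 = reprojection_loss(P, Q, w), reprojection_loss(P, Q, w0)
    return (f0 - f) / (w0 - w)                    # per-correspondence gradient estimate
```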
[0037] Referring again to FIG. 1 , the learned keypoint predictions can be transferred between branches of the network 100. The learned viewpoint-invariant features and keypoint proposals from the first and second branches 102a and 102b can be
transferred to the fourth branch 102d, using the third branch 102c as a bridge. To do so, the weights between the first branch 102a, the second branch 102b, and the third branch 102c are shared. Further, the outputs of the third and fourth branches 102c and 102d can be compared and penalized based on any misalignments. Thus, without being bound by theory, the backbone CNN and the KPN can be forced to generate outputs that are as similar as possible.
[0038] To perform the transfer of keypoint learning from rendered depth images to the RGB domain, in some cases, objective functions can be employed that compare the outputs of the third and fourth branches 102c and 102d during a forward pass of the network 100. In particular, at 131, a Euclidean loss can attempt to align the global features between the backbone CNNs 112c and 112d. Global features can refer to the output of the CNNs 112a-d. The global features can be vectors in an n-dimensional vector space. As used in this context, aligning refers to a loss that is imposed during optimization that is low if two vectors are close in Euclidean distance (e.g., \( \|v_0 - v_1\| \approx 0 \)) and large if the two vectors are far from each other in Euclidean distance. This can also be referred to as an “L2-loss.” Then, at 133, a keypoint consistency constraint can be enforced, which may require the KPN 116d from the fourth branch 102d to produce the same output as the KPN 116c from the third branch 102c.
[0039] In an example embodiment, the keypoint consistency constraint is formulated as a cross-entropy loss, which is equivalent to a log loss with binary labels of {0, 1}:
\[ L = -\sum_i \big[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big] \qquad (3) \]
[0040] With respect to equation 3, \( y_i \) is the ground-truth label and \( \hat{y}_i \) is the prediction. In some cases, it is desirable to enforce the consistency constraint, and therefore the loss can be defined as:
\[ L_{kp} = -\sum_i \big[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big] \qquad (4) \]
[0041] With respect to equation 4, \( y_i \) are the keypoint predictions from the third branch 102c, which serve as the ground-truth, and \( \hat{y}_i \) are the keypoint predictions from the fourth branch 102d. Thus, any misalignment between the keypoint predictions of the two branches (e.g., branches 102c and 102d) is penalized. Further, the fourth branch 102d may be forced to imitate the outputs of the third branch 102c.
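A minimal sketch of such a consistency loss, with the third branch's keypoint scores treated as soft ground-truth labels for the fourth branch, might look as follows; detaching the third-branch scores is an assumption about the training setup.

```python
# Illustrative sketch of a consistency loss in the spirit of equation (4): the
# third branch's keypoint scores act as soft labels for the fourth branch.
import torch
import torch.nn.functional as F

def keypoint_consistency_loss(scores_branch3, scores_branch4):
    """Both inputs are keypoint probabilities in [0, 1] with the same shape."""
    target = scores_branch3.detach()              # treated as ground truth (assumed)
    return F.binary_cross_entropy(scores_branch4, target)
```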
[0042] Thus, as described above, various formulations may be used in the system 100 so as to match a given RGB image with a rendered depth match, and so as to estimate poses of an object. The learned keypoints can be used to extract local features from a set of rendered depth images (e.g., see FIGs. 4A-C). In doing so, a repository or database of local features can be created. The repository can be indexed by the 3D location associated with a local feature. The repository or database may describe the representation of a given object from multiple viewpoints using learned keypoints and modality-invariant features. During an example test time, referring to FIG. 2, given a query RGB image 105, the 2D keypoints and their features can be extracted from the given RGB image 105, at 140. The 2D keypoints can be matched to respective local features in the repository, for instance keypoints from an example CAD image 107, so as to establish respective 2D to 3D correspondences or associations, at
142. Then, the 6-degree-of-freedom object pose in the given RGB image 105 can be estimated, at 144. The pose of the object can be estimated using various techniques, such as RANSAC for example.
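As an illustrative sketch of this test-time procedure (not the patented pipeline verbatim), query descriptors can be matched against the repository by nearest neighbour and the 6-degree-of-freedom pose recovered with a RANSAC-based PnP solve; the repository layout and the camera intrinsics K are assumptions.

```python
# Illustrative sketch: match query keypoint descriptors against the CAD-derived
# repository, then recover a 6-DoF pose with a RANSAC-based PnP solve.
import numpy as np
import cv2

def estimate_pose(query_kp_2d, query_desc, repo_points_3d, repo_desc, K):
    """query_kp_2d: (n, 2); query_desc: (n, d); repo_points_3d: (m, 3); repo_desc: (m, d)."""
    # Nearest-neighbour descriptor matching establishes 2D-to-3D correspondences.
    dists = np.linalg.norm(query_desc[:, None, :] - repo_desc[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    pts_2d = query_kp_2d.astype(np.float64)
    pts_3d = repo_points_3d[nearest].astype(np.float64)

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, distCoeffs=None, reprojectionError=4.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                    # rotation vector to 3x3 matrix
    return R, tvec, inliers
```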
[0043] Thus, in accordance with various embodiments, techniques are described herein to learn how to predict a set of useful keypoints on rendered depth images of objects, and then transfer the knowledge to RGB images of objects, which can be captured by typically available consumer-grade cameras. Without being bound by theory, such a keypoint prediction process can lead to increased matching accuracy between rendered depth and RGB images, which in turn can lead to increased accuracy for tasks such as object pose estimation and spare part identification. When estimating the 3D pose of objects through feature matching, the generation of repeatable and well-placed keypoints can increase the pose accuracy and also reduce the runtime of the matching procedure. It is recognized herein that the definition of “well-placed” keypoints is ambiguous. Consequently, described herein is an optimization scheme that informs the keypoint generation for the particular task. That is, in some cases, the aim is to estimate the set of keypoints in each image which, when matched to other images, produces the optimal 3D object pose. In practice, in various examples, the optimal learned keypoint locations avoid noise and clutter, are well spread out in the image, and produce discriminative features that can be matched reliably in other images.
[0044] It is further recognized herein that learning the described keypoint generation process directly on the RGB images can be challenging, as real RGB data with 3D pose annotations, for tasks such as part identification in an assembly of parts, are often scarce. To avoid the time-consuming process of collecting and manually annotating RGB data, among other reasons, CAD models of these objects/parts are leveraged, so as to generate a large number of synthetically rendered depth images with 3D pose annotations. The images are then used to learn the desired keypoint generation. In some cases, the goal is to actually learn the keypoints on the RGB images. Therefore, in some cases, this knowledge is transferred from the CAD domain to the RGB domain.
[0045] FIG. 5 illustrates an example of a computing environment within which embodiments of the present disclosure may be implemented. A computing environment 500 includes a computer system 510 that may include a communication mechanism such as a system bus 521 or other communication mechanism for communicating information within the computer system 510. The computer system 510 further includes one or more processors 520 coupled with the system bus 521 for processing the information.
[0046] The processors 520 may include one or more central processing units
(CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as described herein is a device for executing machine- readable instructions stored on a computer readable medium, for performing tasks and may comprise any one or combination of, hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device. A processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general purpose computer. A processor may include any type of suitable processing unit including, but not limited to, a central processing unit, a microprocessor, a Reduced Instruction Set Computer (RISC) microprocessor, a
Complex Instruction Set Computer (CISC) microprocessor, a microcontroller, an
Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array
(FPGA), a System-on-a-Chip (SoC), a digital signal processor (DSP), and so forth.
Further, the processor(s) 520 may have any suitable microarchitecture design that includes any number of constituent components such as, for example, registers, multiplexers, arithmetic logic units, cache controllers for controlling read/write operations to cache memory, branch predictors, or the like. The microarchitecture design of the processor may be capable of supporting any of a variety of instruction sets. A processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between. A user
interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof. A user interface comprises one or more display images enabling user interaction with a processor or other device.
[0047] The system bus 521 may include at least one of a system bus, a memory bus, an address bus, or a message bus, and may permit exchange of information (e.g., data (including computer-executable code), signaling, etc.) between various
components of the computer system 510. The system bus 521 may include, without limitation, a memory bus or a memory controller, a peripheral bus, an accelerated graphics port, and so forth. The system bus 521 may be associated with any suitable bus architecture including, without limitation, an Industry Standard Architecture (ISA), a Micro Channel Architecture (MCA), an Enhanced ISA (EISA), a Video Electronics Standards Association (VESA) architecture, an Accelerated Graphics Port (AGP) architecture, a Peripheral Component Interconnects (PCI) architecture, a PCI-Express architecture, a Personal Computer Memory Card International Association (PCMCIA) architecture, a Universal Serial Bus (USB) architecture, and so forth.
[0048] Continuing with reference to FIG. 5, the computer system 510 may also include a system memory 530 coupled to the system bus 521 for storing information and instructions to be executed by processors 520. The system memory 530 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 531 and/or random access memory (RAM) 532. The RAM 532 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM). The ROM 531 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM). In addition, the system memory 530 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 520. A basic input/output system 533 (BIOS) containing the basic routines that help to transfer information between elements within computer system 510, such as during start-up, may be stored in the ROM 531. RAM 532 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 520. System memory 530 may additionally include, for example, operating
system 534, application programs 535, and other program modules 536. Application programs 535 may also include a user portal for development of the application program, allowing input parameters to be entered and modified as necessary.
[0049] The operating system 534 may be loaded into the memory 530 and may provide an interface between other application software executing on the computer system 510 and hardware resources of the computer system 510. More specifically, the operating system 534 may include a set of computer-executable instructions for managing hardware resources of the computer system 510 and for providing common services to other application programs (e.g., managing memory allocation among various application programs). In certain example embodiments, the operating system 534 may control execution of one or more of the program modules depicted as being stored in the data storage 540. The operating system 534 may include any operating system now known or which may be developed in the future including, but not limited to, any server operating system, any mainframe operating system, or any other proprietary or non-proprietary operating system.
[0050] The computer system 510 may also include a disk/media controller 543 coupled to the system bus 521 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 541 and/or a removable media drive 542 (e.g., floppy disk drive, compact disc drive, tape drive, flash drive, and/or solid state drive). Storage devices 540 may be added to the computer system 510 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire). Storage devices 541 , 542 may be external to the computer system 510.
[0051] The computer system 510 may also include a field device interface 565 coupled to the system bus 521 to control a field device 566, such as a device used in a production line. The computer system 510 may include a user input interface or GUI 561 , which may comprise one or more input devices, such as a keyboard, touchscreen, tablet and/or a pointing device, for interacting with a computer user and providing information to the processors 520.
[0052] The computer system 510 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 520 executing one
or more sequences of one or more instructions contained in a memory, such as the system memory 530. Such instructions may be read into the system memory 530 from another computer readable medium of storage 540, such as the magnetic hard disk 541 or the removable media drive 542. The magnetic hard disk 541 and/or removable media drive 542 may contain one or more data stores and data files used by
embodiments of the present disclosure. The data store 540 may include, but are not limited to, databases (e.g., relational, object-oriented, etc.), file systems, flat files, distributed data stores in which data is stored on more than one node of a computer network, peer-to-peer network data stores, or the like. The data stores may store various types of data such as, for example, skill data, sensor data, or any other data generated in accordance with the embodiments of the disclosure. Data store contents and data files may be encrypted to improve security. The processors 520 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 530. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
[0053] As stated above, the computer system 510 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term“computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 520 for execution. A computer readable medium may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 541 or removable media drive 542. Non-limiting examples of volatile media include dynamic memory, such as system memory 530. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 521. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
[0054] Computer readable medium instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar
programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example,
programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0055] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer readable medium instructions.
[0056] The computing environment 500 may further include the computer system 510 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 580. The network interface 570 may enable communication, for example, with other remote devices 580 or systems and/or the storage devices 541 , 542 via the network 571. Remote computing device 580 may be a personal computer (laptop or desktop), a mobile device, a server, a
router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 510. When used in a networking environment, computer system 510 may include modem 672 for establishing communications over a network 571 , such as the Internet. Modem 672 may be connected to system bus 521 via user network interface 570, or via another appropriate mechanism.
[0057] Network 571 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of
connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 510 and other computers (e.g., remote computing device 580). The network 571 may be wired, wireless or a
combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art.
Wireless connections may be implemented using Wi-Fi, WiMAX, and Bluetooth, infrared, cellular networks, satellite or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 571.
[0058] It should be appreciated that the program modules, applications, computer- executable instructions, code, or the like depicted in FIG. 5 as being stored in the system memory 530 are merely illustrative and not exhaustive and that processing described as being supported by any particular module may alternatively be distributed across multiple modules or performed by a different module. In addition, various program module(s), script(s), plug-in(s), Application Programming Interface(s) (API(s)), or any other suitable computer-executable code hosted locally on the computer system 510, the remote device 580, and/or hosted on other computing device(s) accessible via one or more of the network(s) 571 , may be provided to support functionality provided by the program modules, applications, or computer-executable code depicted in FIG. 5 and/or additional or alternate functionality. Further, functionality may be modularized differently such that processing described as being supported collectively by the collection of program modules depicted in FIG. 5 may be performed by a fewer or
greater number of modules, or functionality described as being supported by any particular module may be supported, at least in part, by another module. In addition, program modules that support the functionality described herein may form part of one or more applications executable across any number of systems or devices in accordance with any suitable computing model such as, for example, a client-server model, a peer- to-peer model, and so forth. In addition, any of the functionality described as being supported by any of the program modules depicted in FIG. 5 may be implemented, at least partially, in hardware and/or firmware across any number of devices.
[0059] It should further be appreciated that the computer system 510 may include alternate and/or additional hardware, software, or firmware components beyond those described or depicted without departing from the scope of the disclosure. More particularly, it should be appreciated that software, firmware, or hardware components depicted as forming part of the computer system 510 are merely illustrative and that some components may not be present or additional components may be provided in various embodiments. While various illustrative program modules have been depicted and described as software modules stored in system memory 530, it should be appreciated that functionality described as being supported by the program modules may be enabled by any combination of hardware, software, and/or firmware. It should further be appreciated that each of the above-mentioned modules may, in various embodiments, represent a logical partitioning of supported functionality. This logical partitioning is depicted for ease of explanation of the functionality and may not be representative of the structure of software, hardware, and/or firmware for implementing the functionality. Accordingly, it should be appreciated that functionality described as being provided by a particular module may, in various embodiments, be provided at least in part by one or more other modules. Further, one or more depicted modules may not be present in certain embodiments, while in other embodiments, additional modules not depicted may be present and may support at least a portion of the described functionality and/or additional functionality. Moreover, while certain modules may be depicted and described as sub-modules of another module, in certain embodiments, such modules may be provided as independent modules or as sub-modules of other modules.
[0060] Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the
functionality and/or processing capabilities described with respect to a particular device or component may be performed by any other device or component. Further, while various illustrative implementations and architectures have been described in
accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. In addition, it should be appreciated that any operation, element, component, data, or the like described herein as being based on another operation, element, component, data, or the like can be additionally based on one or more other operations, elements, components, data, or the like. Accordingly, the phrase “based on,” or variants thereof, should be interpreted as “based at least in part on.”
[0061] Although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the disclosure is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as illustrative forms of implementing the embodiments. Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments could include, while other embodiments do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment.
[0062] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a
module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Claims
1. A quadruplet neural network configured to learn keypoints, the quadruplet neural network comprising:
a computer-aided design (CAD) domain comprising a first branch of the network and a second branch of the network, the CAD domain configured to train on pairs of CAD depth images rendered from CAD models of CAD objects, so as to learn viewpoint-invariant features of the CAD objects; and
a picture domain comprising a third branch of the network and a fourth branch of the network, the picture domain configured to train on pairs of images of objects, so as to learn modality-invariant features of the objects.
2. The quadruplet network as recited in claim 1 , wherein:
the first branch is configured to train on first CAD depth images rendered from the CAD models of the CAD objects;
the second branch is configured to train on second CAD depth images rendered from the CAD models of the CAD objects;
the third branch is configured to train on third depth images; and
the fourth branch is configured to train on RGB images,
wherein each first CAD depth image corresponds to a respective second CAD depth image so as to define the pairs of CAD depth images, such that the first CAD depth image in a respective pair includes a particular CAD object in a first pose, and the second CAD depth image in the respective pair defines the particular CAD object in a second pose that is different than the first pose.
3. The quadruplet neural network as recited in claim 2, wherein each third depth image corresponds to a respective RGB image so as to define the pairs of images of objects, such that the third depth image in a respective pair includes a particular object in a particular pose, and the RGB image in the respective pair also includes the particular object in the particular pose.
4. The quadruplet neural network as recited in claim 2, wherein:
the first branch includes a first convolutional neural network (CNN) configured to learn features of the first CAD depth images;
the second branch includes a second CNN configured to learn features of the second CAD depth images;
the third branch includes a third CNN configured to learn features of the third depth images; and
trainable feature parameters are shared between the first CNN, the second CNN, and the third CNN.
5. The quadruplet neural network as recited in claim 4, wherein the fourth branch includes a fourth CNN configured to learn features of the RGB images, and the network is further configured to align the features learned by the third CNN with the features from the fourth CNN, thereby also aligning the features learned by the first and second CNN with the fourth CNN.
6. The quadruplet neural network as recited in claim 5, wherein:
the first branch includes a first keypoint proposal network (KPN) configured to generate and learn keypoints based on the features of the first CAD depth images generated by the first CNN;
the second branch includes a second KPN configured to generate and learn keypoints based on the features of the second CAD depth images generated by the second CNN;
the third branch includes a third KPN configured to generate and learn keypoints based on the features of the third depth images; and
trainable parameters are shared between the first KPN, the second KPN, and the third KPN.
7. The quadruplet neural network as recited in claim 6, wherein the network is further configured to estimate a relative pose of the first CAD depth images as compared to respective second CAD depth images, based on the keypoints generated by the first KPN and the keypoints generated by the second KPN.
8. The quadruplet neural network as recited in claim 6, wherein the fourth branch further includes a fourth KPN configured to perform in accordance with a keypoint consistency constraint, such that the fourth KPN generates keypoints that imitate the keypoints that the third KPN generates.
9. The quadruplet neural network as recited in claim 6, wherein the network further comprises a database configured to store representations of CAD objects, the representations based on the learned keypoints.
10. The quadruplet neural network as recited in claim 9, the network further configured to extract two-dimensional (2D) keypoints from a new RGB image defining an RGB object, and match the 2D keypoints from the new RGB image to a select representation of the representations of CAD objects stored in the database, so as to estimate a pose of the RGB object.
11. The quadruplet neural network as recited in claim 2, wherein the third depth images and RGB images that train the third and fourth branches, respectively, lack any pose annotations.
12. A method comprising:
obtaining a pair of CAD depth images rendered from a computer-aided design (CAD) model of a first object, the pair of CAD depth images including a first CAD depth image that defines the first object in a first pose, and a second CAD depth image that defines the first object in a second pose that is different than the first pose;
obtaining a pair of pictures of a second object that is different than the first object, the pair of pictures including a third depth image of the second object in a third pose, and an RGB image of the second object that is also in the third pose;
training a network, using the pair of CAD depth images and the pair of pictures, wherein the training comprises:
inputting the first CAD depth image and the second CAD depth image into a first and second branch of the network, respectively, so as to learn viewpoint-invariant features of the first object; and
inputting the third depth image and the RGB image into a third and fourth branch of the network, respectively, so as to learn modality-invariant features of the second object.
13. The method as recited in claim 12, wherein obtaining the pair of CAD depth images comprises:
sampling virtual camera poses around the first object so as to render the pair of CAD depth images.
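Claim 13 leaves the sampling and rendering details open. The sketch below places virtual cameras on a hemisphere around the CAD model and builds a look-at pose for each; `render_depth` is a hypothetical hook standing in for whatever off-screen depth renderer is available, and the hemisphere sampling is only one possible scheme.

```python
# Sample virtual camera poses around a CAD model and render depth views.
import numpy as np

def look_at(cam_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Build a 4x4 camera-to-world pose looking from cam_pos toward target."""
    forward = target - cam_pos
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, true_up, -forward
    pose[:3, 3] = cam_pos
    return pose

def sample_depth_views(cad_mesh, render_depth, n_views=20, radius=1.5):
    """render_depth(mesh, pose) is a user-supplied depth renderer (hypothetical)."""
    depths = []
    for i in range(n_views):
        # Roughly even azimuths on the upper hemisphere, random elevation.
        azimuth = 2.0 * np.pi * i / n_views
        elevation = np.deg2rad(np.random.uniform(10, 80))
        cam_pos = radius * np.array([
            np.cos(elevation) * np.cos(azimuth),
            np.cos(elevation) * np.sin(azimuth),
            np.sin(elevation),
        ])
        depths.append(render_depth(cad_mesh, look_at(cam_pos)))
    return depths
```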
14. The method as recited in claim 12, wherein training the network further comprises:
extracting first local features from the first branch;
extracting second local features from the second branch; and
applying a triplet loss on the first and second local features so as to align the first and second local features with each other.
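A minimal sketch of the triplet loss of claim 14, assuming local features have already been extracted at corresponding locations in the first and second branches; how negatives are mined is not fixed by the claim, so random shuffling is used here purely for illustration.

```python
# Triplet loss over corresponding local features from branches 1 and 2.
import torch

def local_triplet_loss(feats_a, feats_b, margin: float = 0.2):
    """feats_a, feats_b: (N, D) local features at N corresponding locations."""
    anchor, positive = feats_a, feats_b
    # Shuffle branch-2 features to obtain non-corresponding negatives.
    perm = torch.randperm(feats_b.shape[0], device=feats_b.device)
    negative = feats_b[perm]
    return torch.nn.functional.triplet_margin_loss(
        anchor, positive, negative, margin=margin)
```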
15. The method as recited in claim 12, the method further comprising training the network toward a pose estimation objective, wherein the pose estimation objective minimizes a re-projection error when keypoints generated by the first branch, after being rotated and translated in accordance with a relative pose, are compared to keypoints generated by the second branch.
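A sketch of the pose estimation objective of claim 15, assuming the relative pose between the two rendered views is known from the sampled camera poses; comparing the transformed keypoints in 3D rather than after projection onto the image plane is a simplification made here for brevity.

```python
# Pose-consistency objective: branch-1 keypoints, transformed by the relative
# pose, should coincide with branch-2 keypoints.
import torch

def pose_consistency_loss(kp_branch1, kp_branch2, R_rel, t_rel):
    """kp_branch1, kp_branch2: (N, 3) predicted keypoints;
    R_rel: (3, 3) rotation; t_rel: (3,) translation between the two views."""
    kp1_in_view2 = kp_branch1 @ R_rel.T + t_rel
    return torch.nn.functional.mse_loss(kp1_in_view2, kp_branch2)
```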
16. The method as recited in claim 12, the method further comprising sharing weights between the first, second, and third branches of the network.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
US201862750847P | 2018-10-26 | 2018-10-26 |
US62/750,847 | 2018-10-26 | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020086217A1 (en) | 2020-04-30 |
Family
ID=68296697
Family Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---|
PCT/US2019/053827 WO2020086217A1 (en) | 2018-10-26 | 2019-09-30 | Learning keypoints and matching rgb images to cad models |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020086217A1 (en) |
Non-Patent Citations (6)
Title |
---|
GEORGIOS GEORGAKIS ET AL: "End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching", 22 February 2018 (2018-02-22), XP055522293, Retrieved from the Internet <URL:https://arxiv.org/pdf/1802.07869v1.pdf> [retrieved on 20181108] * |
GEORGIOS GEORGAKIS ET AL: "Matching RGB Images to CAD Models for Object Pose Estimation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 November 2018 (2018-11-18), XP081051521 * |
HASHEMI JORDAN ET AL: "Cross-modality pose-invariant facial expression", 2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), IEEE, 27 September 2015 (2015-09-27), pages 4007 - 4011, XP032827180, DOI: 10.1109/ICIP.2015.7351558 * |
LIU AN-AN ET AL: "Multi-Domain and Multi-Task Learning for Human Action Recognition", IEEE TRANSACTIONS ON IMAGE PROCESSING, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 28, no. 2, 28 September 2018 (2018-09-28), pages 853 - 867, XP011701949, ISSN: 1057-7149, [retrieved on 20181022], DOI: 10.1109/TIP.2018.2872879 * |
OUYANG SHUXIN ET AL: "A survey on heterogeneous face recognition: Sketch, infra-red, 3D and low-resolution", IMAGE AND VISION COMPUTING, ELSEVIER, GUILDFORD, GB, vol. 56, 26 September 2016 (2016-09-26), pages 28 - 48, XP029834699, ISSN: 0262-8856, DOI: 10.1016/J.IMAVIS.2016.09.001 * |
ZAKI HASAN F ET AL: "Viewpoint invariant semantic object and scene categorization with RGB-D sensors", AUTONOMOUS ROBOTS, KLUWER ACADEMIC PUBLISHERS, DORDRECHT, NL, vol. 43, no. 4, 5 July 2018 (2018-07-05), pages 1005 - 1022, XP036738495, ISSN: 0929-5593, [retrieved on 20180705], DOI: 10.1007/S10514-018-9776-8 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3905194A1 (en) * | 2020-04-30 | 2021-11-03 | Siemens Aktiengesellschaft | Pose estimation method and apparatus |
WO2021219835A1 (en) * | 2020-04-30 | 2021-11-04 | Siemens Aktiengesellschaft | Pose estimation method and apparatus |
CN111814633A (en) * | 2020-06-29 | 2020-10-23 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for detecting display scene |
CN111814633B (en) * | 2020-06-29 | 2023-06-27 | 北京百度网讯科技有限公司 | Display scene detection method, device, equipment and storage medium |
CN113688748A (en) * | 2021-08-27 | 2021-11-23 | 武汉大千信息技术有限公司 | Fire detection model and method |
CN113688748B (en) * | 2021-08-27 | 2023-08-18 | 武汉大千信息技术有限公司 | Fire detection model and method |
CN114707642A (en) * | 2022-03-28 | 2022-07-05 | 深圳真视科技有限公司 | A network structure, method, device and medium for key point model quantification |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US12266211B2 (en) | Forgery detection of face image | |
WO2020086217A1 (en) | Learning keypoints and matching rgb images to cad models | |
US9785855B2 (en) | Coarse-to-fine cascade adaptations for license plate recognition with convolutional neural networks | |
CN111512344A (en) | Generating synthetic depth images from CAD data using enhanced generative antagonistic neural networks | |
WO2020146119A1 (en) | Compositional model for text recognition | |
WO2017079529A1 (en) | Universal correspondence network | |
US9158963B2 (en) | Fitting contours to features | |
US9202138B2 (en) | Adjusting a contour by a shape model | |
EP4097604B1 (en) | Training of differentiable renderer and neural network for query of 3d model database | |
WO2023082687A1 (en) | Feature detection method and apparatus, and computer device, storage medium and computer program product | |
Zhang et al. | End-to-end learning of self-rectification and self-supervised disparity prediction for stereo vision | |
Zhou et al. | USuperGlue: an unsupervised UAV image matching network based on local self-attention | |
Xia et al. | MapSAM: adapting segment anything model for automated feature detection in historical maps | |
CN113239799B (en) | Training method, recognition method, device, electronic equipment and readable storage medium | |
Fennir et al. | Using gans for domain adaptive high resolution synthetic document generation | |
Masuda et al. | Scale‐preserving shape reconstruction from monocular endoscope image sequences by supervised depth learning | |
US20230145498A1 (en) | Image reprojection and multi-image inpainting based on geometric depth parameters | |
WO2020014170A1 (en) | Matching rgb images to cad models | |
US20240005585A1 (en) | Articulated part extraction from sprite sheets using machine learning | |
CN115205487A (en) | Monocular camera face reconstruction method and device | |
Sheng et al. | View adjustment: helping users improve photographic composition | |
Singh | Modular tracking framework: a unified approach to registration based tracking | |
US20240346680A1 (en) | Method, device, and computer program product for determining camera pose for an image | |
US20230325992A1 (en) | Recommending objects for image composition using geometry-and-lighting aware search and efficient user interface workflows | |
Chen et al. | Aberration aware feature detection and description |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19790964; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19790964; Country of ref document: EP; Kind code of ref document: A1