WO2019125453A1

WO2019125453A1 - Training a convolutional neural network using taskirrelevant data

Info

Publication number: WO2019125453A1
Application number: PCT/US2017/067766
Authority: WO
Inventors: Varun MANJUNATHA; Georgios Georgakis; Kuan-Chuan Peng; Ziyan Wu; Jan Ernst
Original assignee: Siemens Aktiengesellschaft
Priority date: 2017-12-21
Filing date: 2017-12-21
Publication date: 2019-06-27

Abstract

Examples of techniques for training a convolutional neural network are disclosed. In one example implementation according to aspects of the present disclosure, a computer-implemented method includes receiving, by a processing device, a taskirrelevant image pair comprising an irrelevant depth image and an irrelevant RGB image. The method further includes feeding, by the processing device, the irrelevant depth image into the source CNN. The method further includes feeding, by the processing device, the irrelevant RGB image into a target CNN. The method further includes performing, by the processing device, a first Euclidean loss to encourage features of the irrelevant depth image and features of the irrelevant RGB image to be similar. The method further includes training, by the processing device, the source CNN based at least in part on the first Euclidean loss between the features of the irrelevant depth image and the features of the irrelevant RGB image.

Description

TRAINING A CONVOLUTIONAL NEURAL NETWORK USING TASK-

IRRELEVANT DATA

BACKGROUND

[0001] The present invention generally relates to machine learning systems, and more specifically, to training a convolutional neural network using task-irrelevant data.

[0002] Information that is useful to solve practical tasks often exists in different domains, in which the information is captured by various sensors. As used herein, a “domain” can refer to either a modality or a dataset. For example, in one scenario, a 3-D layout of a room can be captured by a depth sensor or inferred from RGB (red-blue- green) images. In real-world scenarios, however, access to data from certain domains(s) is often limited.

SUMMARY

[0003] Embodiments of the present invention are directed to a computer-implemented method for training a convolutional neural network. A non-limiting example of the computer-implemented method includes receiving, by a processing device, a task- irrelevant image pair comprising an irrelevant depth image and an irrelevant RGB image. The method further includes feeding, by the processing device, the irrelevant depth image into the source CNN. The method further includes feeding, by the processing device, the irrelevant RGB image into a target CNN. The method further includes performing, by the processing device, a first Euclidean loss to encourage features of the irrelevant depth image and features of the irrelevant RGB image to be similar. The method further includes training, by the processing device, the source CNN based at least in part on the first Euclidean loss between the features of the irrelevant depth image and the features of the irrelevant RGB image. [0004] Embodiments of the present invention are directed to a system. A non-limiting example of the system includes a memory comprising computer readable instructions and a processing device for executing the computer readable instructions for performing a method for training a convolutional neural network. A non-limiting example of the method includes receiving, by the processing device, a task-irrelevant image pair comprising an irrelevant depth image and an irrelevant RGB image. The method further includes feeding, by the processing device, the irrelevant depth image into the source CNN. The method further includes feeding, by the processing device, the irrelevant RGB image into a target CNN. The method further includes performing, by the processing device, a first Euclidean loss to encourage features of the irrelevant depth image and features of the irrelevant RGB image to be similar. The method further includes training, by the processing device, the source CNN based at least in part on the first Euclidean loss between the features of the irrelevant depth image and the features of the irrelevant RGB image.

[0005] Embodiments of the invention are directed to a computer program product. A non-limiting example of the computer program product includes a computer readable storage medium having program instructions embodied therewith. The program instructions are executable by a processing device to cause the processor to perform a method for training a convolutional neural network. A non-limiting example of the method includes receiving, by the processing device, a task-irrelevant image pair comprising an irrelevant depth image and an irrelevant RGB image. The method further includes feeding, by the processing device, the irrelevant depth image into the source CNN. The method further includes feeding, by the processing device, the irrelevant RGB image into a target CNN. The method further includes performing, by the processing device, a first Euclidean loss to encourage features of the irrelevant depth image and features of the irrelevant RGB image to be similar. The method further includes training, by the processing device, the source CNN based at least in part on the first Euclidean loss between the features of the irrelevant depth image and the features of the irrelevant RGB image. [0006] Additional technical features and benefits are realized through the techniques of the present invention. Embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed subject matter. For a better understanding, refer to the detailed description and to the drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

[0007] The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:

[0008] FIG. 1 depicts a block diagram of a processing system for implementing the techniques described herein according to aspects of the present disclosure;

[0009] FIG. 2 depicts a processing system for training a convolutional neural network using task-irrelevant data, according to aspects of the present disclosure;

[0010] FIG. 3 depicts a base image, a near image, and a distant image used in training a convolutional neural network, according to aspects of the present disclosure;

[0011] FIG. 4A depicts a block diagram of a source convolutional neural network to be trained, according to aspects of the present disclosure;

[0012] FIG. 4B depicts a block diagram of a technique to extract noise-agnostic representations for a source convolutional neural network, according to aspects of the present disclosure;

[0013] FIG. 4C depicts a block diagram of a task-irrelevant training technique for a source convolutional neural network, according to aspects of the present disclosure; [0014] FIG. 4D depicts a block diagram of training a source convolutional neural network, according to aspects of the present disclosure; and

[0015] FIG. 5 depicts a flow diagram of a method for training a convolutional neural network using task-irrelevant data, according to aspects of the present disclosure.

[0016] The diagrams depicted herein are illustrative. There can be many variations to the diagram or the operations described therein without departing from the spirit of the invention. For instance, the actions can be performed in a differing order or actions can be added, deleted or modified. Also, the term“coupled” and variations thereof describes having a communications path between two elements and does not imply a direct connection between the elements with no intervening elements/connections between them. All of these variations are considered a part of the specification.

[0017] In the accompanying figures and following detailed description of the disclosed embodiments, the various elements illustrated in the figures are provided with two or three digit reference numbers. With minor exceptions, the leftmost digit(s) of each reference number correspond to the figure in which its element is first illustrated.

DETAILED DESCRIPTION

[0018] Various embodiments of the invention are described herein with reference to the related drawings. Alternative embodiments of the invention can be devised without departing from the scope of this invention. Various connections and positional relationships (e.g., over, below, adjacent, etc.) are set forth between elements in the following description and in the drawings. These connections and/or positional relationships, unless specified otherwise, can be direct or indirect, and the present invention is not intended to be limiting in this respect. Accordingly, a coupling of entities can refer to either a direct or an indirect coupling, and a positional relationship between entities can be a direct or indirect positional relationship. Moreover, the various tasks and process steps described herein can be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein. [0019] The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms“comprises,”“comprising,” “includes,”“including,”“has,”“having,”“contains” or“containing,” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, a process, a method, an article, or an apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.

[0020] Additionally, the term“exemplary” is used herein to mean“serving as an example, instance or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms“at least one” and“one or more” may be understood to include any integer number greater than or equal to one, i.e. one, two, three, four, etc. The terms“a plurality” may be understood to include any integer number greater than or equal to two, i.e. two, three, four, five, etc. The term“connection” may include both an indirect“connection” and a direct“connection.”

[0021] The terms“about,”“substantially,”“approximately,” and variations thereof, are intended to include the degree of error associated with measurement of the particular quantity based upon the equipment available at the time of filing the application. For example,“about” can include a range of ± 8% or 5%, or 2% of a given value.

[0022] For the sake of brevity, conventional techniques related to making and using aspects of the invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details. [0023] It is understood in advance that the present disclosure is capable of being implemented in conjunction with any other type of computing environment now known or later developed. For example, FIG. 1 illustrates a block diagram of a processing system 100 for implementing the techniques described herein. In examples, processing system 100 has one or more central processing units (processors) 121 a, 121 b, 121 c, etc. (collectively or generically referred to as processor(s) 121 and/or as processing device(s)). In aspects of the present disclosure, each processor 121 can include a reduced instruction set computer (RISC) microprocessor. Processors 121 are coupled to system memory (e.g., random access memory (RAM) 124) and various other components via a system bus 133. Read only memory (ROM) 122 is coupled to system bus 133 and may include a basic input/output system (BIOS), which controls certain basic functions of processing system 100.

[0024] Further illustrated are an input/output (I/O) adapter 127 and a communications adapter 126 coupled to system bus 133. I/O adapter 127 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 123 and/or a tape storage drive 125 or any other similar component. I/O adapter 127, hard disk 123, and tape storage device 125 are collectively referred to herein as mass storage 134. Operating system 140 for execution on processing system 100 may be stored in mass storage 134. A network adapter 126 interconnects system bus 133 with an outside network 136 enabling processing system 100 to communicate with other such systems.

[0025] A display (e.g., a display monitor) 135 is connected to system bus 133 by display adaptor 132, which may include a graphics adapter to improve the performance of graphics intensive applications and a video controller. In one aspect of the present disclosure, adapters 126, 127, and/or 232 may be connected to one or more I/O busses that are connected to system bus 133 via an intermediate bus bridge (not shown). Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the

Peripheral Component Interconnect (PCI). Additional input/output devices are shown as connected to system bus 133 via user interface adapter 128 and display adapter 132. A keyboard 129, mouse 130, and speaker 131 may be interconnected to system bus 133 via user interface adapter 128, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit.

[0026] In some aspects of the present disclosure, processing system 100 includes a graphics processing unit 137. Graphics processing unit 137 is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. In general, graphics processing unit 137 is very efficient at manipulating computer graphics and image processing, and has a highly parallel structure that makes it more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel.

[0027] Thus, as configured herein, processing system 100 includes processing capability in the form of processors 121, storage capability including system memory (e.g., RAM 124), and mass storage 134, input means such as keyboard 129 and mouse 30, and output capability including speaker 131 and display 135. In some aspects of the present disclosure, a portion of system memory (e.g., RAM 124) and mass storage 134 collectively store an operating system to coordinate the functions of the various components shown in processing system 100.

[0028] Turning now to an overview of technologies that are more specifically relevant to aspects of the invention, the present disclosure describes techniques for training a convolutional neural network using task-irrelevant data. According to the techniques disclosed herein, domain adaptation can be performed using deep neural networks. In computer vision tasks, there often arises a situation in which data available at the testing time is unavailable during the training time. For example, it may be desired to recognize parts of a machine during a testing time using both the RGB and depth representations but commonly only a rendered depth representation of the part is available during the training time, while the RGB representation is not. Synthesizing RGB images for their usage during the training time is very challenging. [0029] Existing domain adaptation approaches project features from the source domain to the target domain (or project both domains to a common domain). However, none of the existing approaches mimic feature outputs of a target domain without task-relevant target domain data at training time.

[0030] The present techniques address this problem by using synthetic RGB features instead of the actual RGB images of the part themselves. This is done by training a neural network that receives depth images during the training time and outputs features that a hypothetical neural network would have produced when trained on RGB images of the very same scene. The present disclosure describes techniques for learning synthetic RGB features using deep neural network-based domain adaptation. In particular, the present techniques are particularly trained to be robust to small changes in viewpoint and noise.

[0031] Aspects of the present disclosure can utilize machine learning functionality to accomplish the various operations described herein. More specifically, the present techniques can incorporate and utilize rule- based decision making and AI reasoning to accomplish the various operations described herein. The phrase“machine learning” broadly describes a function of electronic systems that learn from data. A machine learning system, engine, or module can include a trainable machine learning algorithm that can be trained, such as in an external cloud environment, to learn functional relationships between inputs and outputs that are currently unknown, and the resulting model can be used to determine surrogate computer program usage. In one or more embodiments, machine learning functionality can be implemented using an artificial neural network (ANN) having the capability to be trained to perform a currently unknown function. In machine learning and cognitive science, ANNs are a family of statistical learning models inspired by the biological neural networks of animals, and in particular the brain. ANNs can be used to estimate or approximate systems and functions that depend on a large number of inputs. Convolutional neural networks (CNN) are a class of deep, feed-forward ANN that are particularly useful at analyzing visual imagery. [0032] ANNs can be embodied as so-called“neuromorphic” systems of interconnected processor elements that act as simulated“neurons” and exchange“messages” between each other in the form of electronic signals. Similar to the so-called“plasticity” of synaptic neurotransmitter connections that carry messages between biological neurons, the connections in ANNs that carry electronic messages between simulated neurons are provided with numeric weights that correspond to the strength or weakness of a given connection. The weights can be adjusted and tuned based on experience, making ANNs adaptive to inputs and capable of learning. For example, an ANN for handwriting recognition is defined by a set of input neurons that can be activated by the pixels of an input image. After being weighted and transformed by a function determined by the network's designer, the activation of these input neurons are then passed to other downstream neurons, which are often referred to as“hidden” neurons. This process is repeated until an output neuron is activated. The activated output neuron determines which character was read.

[0033] As used herein, a source modality refers to the modality that abstract features are learned and to be transferred from. A target modality refers to the modality that the abstract features are to be transferred to. Task-relevant data is the data directly applicable and related to an end objective. For example, if the task is classifying images of cats and dogs, any image containing either a cat or a dog is considered as task-relevant data. Task- irrelevant data is the data not directly applicable and has no direct relation to the end objective. For example, if the task is classifying images of cats and dogs, any image not containing either a cat or a dog is considered as task-irrelevant data. A source CNN is a CNN that takes source modality images as input, and a target CNN is a CNN that takes target modality images as input.

[0034] The target CNN is trained on a large dataset from the target modality (e.g., ImageNet), and the source CNN is able to mimic the feature outputs of the target CNN using source modality data only. One way to accomplish this is by minimizing Euclidean loss between source CNN features (i.e., input with the source data) and target CNN features (i.e., input with the target data). The present techniques increase robustness with respect to variance in pose and noise in the images. Furthermore, since the depth images are themselves rendered from CAD models, there is a subtle domain shift between depth data captured by a depth sensor (e.g., MICROSOFT KINECT) and depth data that is rendered from the 3D CAD models. To alleviate these problems, a unified architecture is described that makes the network not only produce target modality features but is also robust to noise and adapts to rendered depth data.

[0035] Example embodiments of the disclosure include or yield various technical features, technical effects, and/or improvements to technology. Example embodiments of the disclosure provide for training a source convolutional neural network using a task- irrelevant image pair that includes an irrelevant depth image and an irrelevant RGB image. The irrelevant depth image is fed into the source CNN and the irrelevant RGB image is fed into a target CNN. A Euclidean loss is then performed on the features of the task-irrelevant image pair to train the source CNN such that the depth features will be close to the RGB features after training. These aspects of the disclosure constitute technical features that yield the technical effect of training the source CNN to mimic target modality features from source modality data. As a result of these technical features and technical effects, a CNN in accordance with example embodiments of the disclosure represents can be trained without task-relevant RGB data. It should be appreciated that the above examples of technical features, technical effects, and improvements to technology of example embodiments of the disclosure are merely illustrative and not exhaustive.

[0036] FIG. 2 depicts a processing system 200 for training a convolutional neural network using task-irrelevant data, according to aspects of the present disclosure. The processing system includes a processing device 202, a memory 204, a CNN engine 210, a triplet loss engine 212, and a Euclidean loss engine 214.

[0037] The various components, modules, engines, etc. described regarding FIG. 2 can be implemented as instructions stored on a computer-readable storage medium, as hardware modules, as special-purpose hardware (e.g., application specific hardware, application specific integrated circuits (ASICs), application specific special processors (ASSPs), field programmable gate arrays (FPGAs), as embedded controllers, hardwired circuitry, etc.), or as some combination or combinations of these. According to aspects of the present disclosure, the engine(s) described herein can be a combination of hardware and programming. The programming can be processor executable instructions stored on a tangible memory, and the hardware can include the processing device 202 for executing those instructions. Thus a system memory (e.g., the memory 204) can store program instructions that when executed by the processing device 202 implement the engines described herein. Other engines can also be utilized to include other features and functionality described in other examples herein.

[0038] The CNN engine 210 trains a CNN using task-irrelevant data. The processing system 200 receives images 220, including task-relevant images and task-irrelevant images. The task-relevant images can include, for example, base images, near images, distant images, noisy images, and/or noiseless images while the task- irrelevant images can include, for example, irrelevant depth images and/or irrelevant RGB images. The CNN engine 210 translates an image map to a feature vector. To train the CNN, the CNN engine 210 uses triplet loss and Euclidean loss techniques to process the images 220.

[0039] In particular, the triplet loss engine 212 ensures that the Euclidean distance between features from two similar images is lower by some margin than the Euclidean distance between features from dissimilar images. As described herein with reference to FIG. 4A, the triplet loss can be performed on a base image, a near image, and a distant image.

[0040] The Euclidean loss engine 214 minimizes the Euclidean distance between features of images. For example, as described herein with reference to FIG. 4B, the Euclidean loss engine 214 minimizes the Euclidean distance between the features of a noisy image and the features of a noiseless image to make the CNN robust to noise. In another example, as described herein with reference to FIG. 4C, the Euclidean loss engine 214 minimizes the Euclidean distance between the features of an irrelevant depth image and the features of an irrelevant RGB image.

[0041] Accordingly, the processing system 200 provides for mimicking target modality features for six degrees of freedom pose estimation from source modality data (i.e., task- irrelevant data/images) given source-target image pairs. The present techniques learn target modality features for six degrees of freedom pose estimation using source modality data that is invariant to slight variations in pose of the source modality data. The present techniques also learn target modality features for six degrees of freedom pose estimation based on source modality data that is robust to noise in the source modality data. In applications such as domain fusion for six degrees of freedom pose estimation in which one of the modalities is missing, the present techniques can index and retrieve images based on the other modality which is available.

[0042] FIG. 3 depicts a base image 301, a near image 302, and a distant image 303 used in training a convolutional neural network, according to aspects of the present disclosure. For a given CAD model of a part, it is possible to render two images (i.e., the base image 301 and the near image 302) of the part from two slightly different viewpoints. These two rendered images (taken from the two slightly different viewpoints) are similar not only in the human visual space but also in the feature space of the source CNN. These two rendered images (i.e., the base image 301 and the near image 302) are different, both in visual and in the feature space, to a third rendered image (e.g., the distant image 303) taken from a different camera viewpoint than the viewpoint of the base image 301 and the near image 302. Whereas the base image 301 and the near image 302 are taken from slightly different viewpoints (e.g., two locations near in proximity), the distant image 303 is taken from an altogether different viewpoint (e.g., a location not near in proximity as compared to the two locations from which the base image 301 and the near image 302 are captured).

[0043] The base image 301, the near image 302, and the distant image 303 are fed into a source CNN as illustrated in FIG. 4A. In particular, FIG. 4A depicts a block diagram of a source CNN 400 to be trained, according to aspects of the present disclosure. In this example, the source CNN 400 uses a triplet loss 410 to encourage similar feature outputs between the base image 301 and the near image 302 and dissimilar feature outputs between the base image 301 and the distant image 303.

[0044] More particularly, in the example of FIG. 4A, two images from very similar viewpoints and one image from a different viewpoint are passed into the source CNN 400 and a triplet loss is applied. The triplet loss ensures that the Euclidean distance between features from the two similar images (i.e., the base image 301 and the near image 302) is lower by some margin than the Euclidean distance between features from the two dissimilar images (i.e., the base image 301 and the distant image 303). Each of the three images (i.e., the base image 301, the near image 302, and the distant image 303) are passed into three copies of the same source CNN (i.e., the source CNN 400) and a weight sharing mechanism applies the same parameters to the source CNN 400.

[0045] FIG. 4B depicts a block diagram of a technique to extract noise-agnostic representations for the source CNN 400, according to aspects of the present disclosure. The source CNN 400 can be susceptible to noise, which reduces the accuracy of the source CNN 400. In order to make the source CNN 400 robust to noise, two images of a rendered depth scene are considered. One image (i.e., the noisy image 404) is corrupted by noise, while the other image (i.e., the noiseless image 405) is free from noise. Each of these images is passed into the source CNN 400 and a Euclidean loss 412 is performed. The Euclidean loss 412 minimizes the Euclidean distance between the corresponding feature representations. This encourages the source CNN 400 to provide the same output representation whether or not an image has noise. Each of the two images (i.e., the noisy image 404 and the noiseless image 405) are passed into two copies of the same source CNN (i.e., the source CNN 400) and a weight sharing mechanism applies the same parameters to the source CNN 400.

[0046] FIG. 4C depicts a block diagram of a task-irrelevant training technique for the source CNN 400, according to aspects of the present disclosure. To encourage the source CNN 400 to mimic the features of the target CNN, an irrelevant depth image 406 and an irrelevant RGB image 407 are fed into the source CNN 400 and the target CNN 420 respectively. As used herein, the term“irrelevant” in terms of the depth image 406 and the RGB image 407 indicates that the images are unrelated to a task of interest. For example, the task of interest may relate to a particular part rendered in a CAD model. Accordingly, the irrelevant depth image 406 and the irrelevant RGB image 407 relate to something other than the task of interest (e.g., a scene of an office, a different part, etc.). The irrelevant images 406, 407 can be referred to collectively as a task-irrelevant image pair.

[0047] Each of the images of previous examples (e.g., the images 301-303, 404-405) passed into the source CNN 400 are rendered depth images. Similarly, the irrelevant depth image 406 is a rendered depth image that represents the depth domain (e.g., for the source CNN 400). However, the irrelevant RGB image 407 is an RGB domain image that represents the RGB domain for the target CNN 420.

[0048] The irrelevant depth image 406 and the irrelevant RGB image 407 are RGB depth pairs of task- irrelevant images. That is, the irrelevant depth image 406 is a depth domain image of an object, scene, etc., and the irrelevant RGB image 407 is an RGB domain image of the same object, scene, etc. Euclidean loss 414 is performed on the features of the two images 406, 407 to minimize the Euclidean distance between the features of the two images 406, 407. This enables the source CNN 400 to be trained based on mapping or correlation between the depth and RGB domains from the images 406, 407 respectively. This enables the source CNN 400 to be trained to mimic the target domain when task-relevant RGB data is unavailable. For example, since an RGB equivalent image of a rendered CAD scene is not available, the task-irrelevant data (e.g., the RGB depth pairs of the images 406, 407) are used to mimic the missing task-relevant data. The task-irrelevant data image pairs 406, 407 can be of any object, scene, etc. and can be obtained through publically available datasets. [0049] Unlike in the examples of FIG. 4A and 4B, each of the two images (i.e., the irrelevant depth image 406 and the irrelevant RGB image 407) are passed respectively into the source CNN 400 and the target CNN 420. Different modalities apply different parameters to the source CNN 400 and the target CNN 420.

[0050] As shown in FIG. 4D, the techniques of FIGS. 4A, 4B, and 4C can be combined to train the source CNN 400. In the example of FIG. 4D, the weights of the target CNN 420 are fixed during training, while the weights of the source CNN 400 are shared. Because the source CNN 400 consumes both rendered data and real depth data, its weights are adjusted to handle both types of data (rendered and depth). The resulting network can take in rendered CAD depth images and produce RGB features that the target CNN 420 would have produced if it has been fed with RGB images.

[0051] The techniques described herein can be applied in sensor fusion. One such example scenario in which it is desired to retrieve a part from a database is as follows. At the time of querying the database to retrieve the part, a sensor is used, which provides both RGB and depth images of the part. However, at the time that the database was constructed, the RGB image is not available— only the depth image rendered from a CAD model of the part is available. Thus, only a depth feature from a depth CNN is available. Accordingly, the present techniques are applied while creating the database. The rendered depth image is fed into the trained source CNN (e.g., the trained source CNN 400), which mimics the RGB representation of a hypothetical RGB image of the same object. Both representations of the part are then indexed in the database.

[0052] This provides redundancy to improve retrieval of part information from the database. Furthermore, at the time the database is queried to retrieve the part, if only one of the two representations is available (e.g., only RGB and not depth, or vice versa), the present techniques can still retrieve the part based on the available representation.

[0053] FIG. 5 depicts a flow diagram of a method 500 for training a source convolutional neural network using task-irrelevant data, according to aspects of the present disclosure. The method 500 can be performed by any suitable processing system (e.g., the processing system 100 or the processing system 200), by any suitable processing device (e.g., the processor 121, the processing device 202), or by any suitable combinations thereof.

[0054] At block 502, the CNN engine 210 receives a task-irrelevant image pair comprising an irrelevant depth image and an irrelevant RGB image. At block 504, the CNN engine 210 feeds the irrelevant depth image into the source CNN (e.g., the source CNN 400). At block 506, the CNN engine 210 feeds the irrelevant RGB image into a target CNN (e.g., the target CNN 420). In examples, the source CNN is in a depth domain, and wherein the target CNN is in an RGB domain.

[0055] At block 508, the Euclidean loss engine 214 performs a Euclidean loss to enforce the similarity of the two input features between the irrelevant depth image and the irrelevant RGB image. Performing the first Euclidean loss can include minimizing a Euclidean distance between the features of the irrelevant depth image and the features of the irrelevant RGB image. At block 510, the CNN engine 210 trains the source CNN based at least in part on the Euclidean loss between the features of the irrelevant depth image and the features of the irrelevant RGB image.

[0056] In examples, the source CNN can also be trained to be robust to noisy input providing for feature similarity. In a feature similarity example, encouraging similar feature outputs between the base and near images and dissimilar feature outputs between the base and distant images can be accomplished. In such cases, the method 500 includes receiving, by the processing device, a base image, a near image, and a distant image. The method 500 further includes performing, by the processing device, a triplet loss to determine whether a Euclidean distance between features of the base image and the near image is lower than the Euclidean distance between features of the base image and the distant image. The source CNN can then be trained based at least in part on the triplet loss. [0057] In an example of increasing the robustness against noisy input, the method 500 includes receiving, by the processing device, a noisy image and a noiseless image. The method 500 then includes performing, by the processing device, a second Euclidean loss to minimize a Euclidean distance between a feature representation of the noisy image and a feature representation of the noiseless image. The source CNN can then be trained based at least in part on the second Euclidean loss.

[0058] Additional processes also may be included, and it should be understood that the processes depicted in FIG. 5 represent illustrations, and that other processes may be added or existing processes may be removed, modified, or rearranged without departing from the scope and spirit of the present disclosure.

[0059] The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions. [0060] The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments described herein.

Claims

CLAIMS What is claimed is:

1. A computer-implemented method for training a source convolutional neural network (CNN), the method comprising: receiving, by a processing device, a task-irrelevant image pair comprising an irrelevant depth image and an irrelevant RGB image; feeding, by the processing device, the irrelevant depth image into the source

CNN; feeding, by the processing device, the irrelevant RGB image into a target CNN; performing, by the processing device, a first Euclidean loss to encourage features of the irrelevant depth image and features of the irrelevant RGB image to be similar; and training, by the processing device, the source CNN based at least in part on the first Euclidean loss between the features of the irrelevant depth image and the features of the irrelevant RGB image.

2. The computer-implemented method of claim 1, wherein the source CNN is in a depth domain, and wherein the target CNN is in an RGB domain.

3. The computer-implemented method of claim 1, wherein performing the first Euclidean loss comprises minimizing a Euclidean distance between the features of the irrelevant depth image and the features of the irrelevant RGB image.

4. The computer-implemented method of claim 1, further comprising: receiving, by the processing device, a base image, a near image, and a distant image; and performing, by the processing device, a triplet loss to determine whether a Euclidean distance between features of the base image and the near image is lower than the Euclidean distance between features of the base image and the distant image.

5. The computer-implemented method of claim 4, wherein training the source CNN is based at least in part on the triplet loss.

6. The computer-implemented method of claim 1, further comprising: receiving, by the processing device, a noisy image and a noiseless image; and performing, by the processing device, a second Euclidean loss to minimize a Euclidean distance between a feature representation of the noisy image and a feature representation of the noiseless image.

7. The computer-implemented method of claim 6, wherein training the source CNN is based at least in part on the second Euclidean loss.

8. A system comprising: a memory comprising computer readable instructions; and a processing device for executing the computer readable instructions for performing a method for training a source convolutional neural network (CNN), the method comprising: receiving, by the processing device, a task-irrelevant image pair comprising an irrelevant depth image and an irrelevant RGB image; feeding, by the processing device, the irrelevant depth image into the source CNN; feeding, by the processing device, the irrelevant RGB image into a target

CNN; performing, by the processing device, a first Euclidean loss to encourage features of the irrelevant depth image and features of the irrelevant RGB image to be similar; and training, by the processing device, the source CNN based at least in part on the first Euclidean loss between the features of the irrelevant depth image and the features of the irrelevant RGB image.

9. The system of claim 8, wherein the source CNN is in a depth domain, and wherein the target CNN is in an RGB domain.

10. The system of claim 8, wherein performing the first Euclidean loss comprises minimizing a Euclidean distance between the features of the irrelevant depth image and the features of the irrelevant RGB image.

11. The system of claim 8, wherein the method further comprises: receiving, by the processing device, a base image, a near image, and a distant image; and performing, by the processing device, a triplet loss to determine whether a Euclidean distance between features of the base image and the near image is lower than the Euclidean distance between features of the base image and the distant image.

12. The system of claim 11 , wherein training the source CNN is based at least in part on the triplet loss.

13. The system of claim 8, wherein the method further comprises: receiving, by the processing device, a noisy image and a noiseless image; and performing, by the processing device, a second Euclidean loss to minimize a Euclidean distance between a feature representation of the noisy image and a feature representation of the noiseless image.

14. The system of claim 13, wherein training the source CNN is based at least in part on the second Euclidean loss.

15. A computer program product comprising: a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method for training a source convolutional neural network (CNN), the method comprising: receiving, by the processing device, a task-irrelevant image pair comprising an irrelevant depth image and an irrelevant RGB image; feeding, by the processing device, the irrelevant depth image into the source CNN; feeding, by the processing device, the irrelevant RGB image into a target

16. The computer program product of claim 15, wherein the source CNN is in a depth domain, and wherein the target CNN is in an RGB domain.

17. The computer program product of claim 15, wherein performing the first Euclidean loss comprises minimizing a Euclidean distance between the features of the irrelevant depth image and the features of the irrelevant RGB image.

18. The computer program product of claim 15, wherein the method further comprises: receiving, by the processing device, a base image, a near image, and a distant image; and performing, by the processing device, a triplet loss to determine whether a Euclidean distance between features of the base image and the near image is lower than the Euclidean distance between features of the base image and the distant image.

19. The computer program product of claim 18, wherein training the source CNN is based at least in part on the triplet loss.

20. The computer program product of claim 15, wherein the method further comprises: receiving, by the processing device, a noisy image and a noiseless image; and performing, by the processing device, a second Euclidean loss to minimize a Euclidean distance between a feature representation of the noisy image and a feature representation of the noiseless image.