WO2019240900A1 - Attention loss based deep neural network training - Google Patents


Info

Publication number
WO2019240900A1
Authority
WO
WIPO (PCT)
Prior art keywords
dnn
input image
attention
attention map
level label
Application number
PCT/US2019/031507
Other languages
French (fr)
Inventor
Rajib MONDAL
Rajat Vikram SINGH
Kuan-Chuan Peng
Ziyan Wu
Jan Ernst
Original Assignee
Siemens Aktiengesellschaft
Siemens Corporation
Application filed by Siemens Aktiengesellschaft and Siemens Corporation
Publication of WO2019240900A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present techniques relate to neural networks. More specifically, the techniques relate to attention loss based deep neural network (DNN) training.
  • a neural network may include a plurality of processing elements arranged in layers. Interconnections are made between successive layers in the neural network.
  • a neural network may have an input layer, an output layer, and any appropriate number of intermediate layers. The intermediate layers may allow solution of nonlinear problems by the neural network.
  • a layer of a neural network may generate an output signal which may be determined based on a weighted sum of any input signals the layer receives. The input signals to a layer of a neural network may be provided from the neural network input, or from the output of any other layer of the neural network.
  • a system can include a processor to receive an input image.
  • the processor can also receive a pixel level label corresponding to the input image.
  • the processor can also determine, by a DNN, a probable class of the input image.
  • the processor can also determine an attention map of the DNN corresponding to the probable class of the input image.
  • the processor can also determine an attention loss of the DNN based on the attention map and the pixel level label.
  • the processor can also update weights of the DNN based on the attention loss.
  • a method can include receiving, by a processor, an input image.
  • the method can also include receiving, by the processor, a pixel level label corresponding to the input image.
  • the method can also include determining by a DNN a probable class of the input image.
  • the method can also include determining an attention map of the DNN corresponding to the probable class of the input image.
  • the method can also include determining an attention loss of the DNN based on the attention map and the pixel level label.
  • the method can also include updating weights of the DNN based on the attention loss.
  • a computer program product may include a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method including receiving an input image.
  • the method can also include receiving a pixel level label corresponding to the input image.
  • the method can also include determining by a DNN a probable class of the input image.
  • the method can also include determining an attention map of the DNN corresponding to the probable class of the input image.
  • the method can also include determining an attention loss of the DNN based on the attention map and the pixel level label.
  • the method can also include updating weights of the DNN based on the attention loss.
  • FIG. 1 is a block diagram of an example computer system for use in conjunction with attention loss based deep neural network training
  • FIG. 2 is a block diagram of an example system for attention loss based deep neural network training
  • FIG. 3 is a process flow diagram of an example method for attention loss based deep neural network training
  • FIGs. 4A-B illustrate example input images and associated multiple bounding boxes for embodiments of attention loss based deep neural network training
  • Fig. 5 illustrates an example mask and inverse mask for embodiments of attention loss based deep neural network training
  • Fig. 6 illustrates an example attention map and scaled attention map for embodiments of attention loss based deep neural network training.
  • Embodiments of attention loss based deep neural network (DNN) training are provided, with exemplary embodiments being discussed below in detail.
  • the weights of a DNN, which are used to determine a weighted sum, may be determined based on a training process.
  • a DNN may be trained by feeding the DNN a succession of known input patterns (including but not limited to images) and comparing the output of the DNN to a corresponding expected output pattern (e.g., a class of an object depicted in an input image).
  • the DNN may learn by measuring a difference between the expected output pattern and the output pattern that was produced by the current state of the DNN for the corresponding input pattern.
  • the weights of the DNN may be adjusted based on the measured difference.
  • DNN training may be an iterative process, requiring a relatively large number of input patterns to be sequentially fed into the DNN.
  • when the weights of a DNN are set to appropriate levels by the training, an input pattern at the input layer of the DNN may successively propagate through the intermediate layers of the DNN, to give a correct corresponding output pattern for the input pattern.
  • a DNN may provide visual recognition and classification of objects in images that have been captured by sensors (e.g., a camera), including but not limited to red-green-blue (RGB) images.
  • Such a DNN may be trained to classify objects in input images into a predetermined number of classes.
  • a DNN may be trained based on two loss determinations: a classification loss that compares an actual class of an input image with the DNN class prediction for the input image; and an attention loss that is determined based on an inverse mask of a bounding box annotation of the input image, and an attention map of the DNN for the input image.
  • the weights of the DNN are updated via successive training iterations based on the classification loss and the attention loss via backpropagation.
  • a DNN may classify objects, such as parts for a manufacturing process, that may be relatively visually similar.
  • a bounding box on an image, which may encompass the object that is being classified within an input image, may be used to guide the attention of a DNN during training for the classification task.
  • a single bounding box covering the entire object may include background noise along with any distinguishing features of the object.
  • Multiple smaller bounding boxes, each corresponding to one or more relevant features of the image, may be used to guide the attention of a DNN during training for image classification.
  • Each of the multiple bounding box annotations may closely correspond to a relatively small area containing a relevant feature on the image in order to guide the attention of the DNN. Provision of multiple bounding boxes allows guiding of the attention of the DNN to multiple relevant features while ignoring background noise, which may improve classification performance of the DNN, especially for an object having relevant features that are relatively far apart in the image.
  • Multiple bounding boxes may be annotated on training input images by, for example, an expert regarding the objects that are being classified by the DNN.
  • the multiple bounding boxes may be included in a pixel level label corresponding to the input image, and may be pixel level annotations on the image, e.g., an annotation corresponding to each pixel indicating whether the pixel is inside or outside of a bounding box.
  • the pixel level label including multiple bounding boxes may be compared to an attention map of the DNN.
  • the attention map highlights any areas of the input image that were used by the DNN to determine the class of the input image; e.g., any portions of the input image that support the class determination by the DNN.
  • the attention map may be generated by any appropriate attention map generator, including but not limited to Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++.
  • a difference between the pixel level label including the multiple bounding box annotations and the attention map may give the attention loss of the DNN.
  • an input image may be fed to the DNN, and a probable class of the input image may be determined by the DNN.
  • the determined probable class and the actual class (e.g., an image level label) of the image are compared to determine the classification loss.
  • a signal is set wherein the actual class value is 1, and any other class values are 0.
  • a gradient may be backpropagated to the DNN, and global average pooling may be performed on the gradient to determine a weight vector.
  • the weight vector may be used to determine a weighted sum of the feature maps that are output by the DNN for the input image to determine an attention map of the DNN.
  • An inverse mask of the pixel level label of the input image may be determined, and the attention map may be elementwise scaled with the inverse mask.
  • the scaled attention map may therefore omit any portions of the attention map that are covered by the inverse mask, i.e., that are inside a bounding box, such that a penalty is only incurred for attention to pixels that are outside of a bounding box.
  • the attention loss may then be calculated by summing up values in the scaled attention map.
  • the classification loss and the attention loss may be used to train the DNN to focus on relevant areas in input images and make classification decisions accordingly.
  • the computer system 100 can be an electronic, computer framework comprising and/or employing any number and combination of computing devices and networks utilizing various communication technologies, as described herein.
  • the computer system 100 can be easily scalable, extensible, and modular, with the ability to change to different services or reconfigure some features independently of others.
  • the computer system 100 may be, for example, a server, desktop computer, laptop computer, tablet computer, or smartphone.
  • computer system 100 may be a cloud computing node.
  • Computer system 100 may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system.
  • program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types.
  • Computer system 100 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer system storage media including memory storage devices.
  • the computer system 100 has one or more central processing units (CPU(s)) 101a, 101b, 101c, etc. (collectively or generically referred to as processor(s) 101).
  • the processors 101 can be a single-core processor, multi-core processor, computing cluster, or any number of other configurations.
  • the processors 101, also referred to as processing circuits, are coupled via a system bus 102 to a system memory 103 and various other components.
  • the system memory 103 can include a read only memory (ROM) 104 and a random access memory (RAM) 105.
  • the ROM 104 is coupled to the system bus 102 and may include a basic input/output system (BIOS), which controls certain basic functions of the computer system 100.
  • the RAM is read-write memory coupled to the system bus 102 for use by the processors 101.
  • the system memory 103 provides temporary memory space for operations of said instructions during operation.
  • the system memory 103 can include random access memory (RAM), read only memory, flash memory, or any other suitable memory systems.
  • the computer system 100 comprises an input/output (I/O) adapter 106 and a communications adapter 107 coupled to the system bus 102.
  • the I/O adapter 106 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 108 and/or any other similar component.
  • the I/O adapter 106 and the hard disk 108 are collectively referred to herein as a mass storage 110.
  • Software 111 for execution on the computer system 100 may be stored in the mass storage 110.
  • the mass storage 110 is an example of a tangible storage medium readable by the processors 101, where the software 111 is stored as instructions for execution by the processors 101 to cause the computer system 100 to operate, such as is described herein below with respect to the various Figures. Examples of a computer program product and the execution of such instructions are discussed herein in more detail.
  • the communications adapter 107 interconnects the system bus 102 with a network 112, which may be an outside network, enabling the computer system 100 to communicate with other such systems.
  • a portion of the system memory 103 and the mass storage 110 collectively store an operating system, which may be any appropriate operating system, to coordinate the functions of the various components shown in FIG. 1.
  • Additional input/output devices are shown as connected to the system bus 102 via a display adapter 115 and an interface adapter 116.
  • the adapters 106, 107, 115, and 116 may be connected to one or more I/O buses that are connected to the system bus 102 via an intermediate bus bridge (not shown).
  • the computer system 100 includes processing capability in the form of the processors 101, storage capability including the system memory 103 and the mass storage 110, input means such as the keyboard 121 and the mouse 122, and output capability including the speaker 123 and the display 119.
  • the communications adapter 107 can transmit data using any suitable interface or protocol, such as the internet small computer system interface, among others.
  • the network 112 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others.
  • An external computing device may connect to the computer system 100 through the network 112.
  • an external computing device may be an external Webserver or a cloud computing node.
  • the block diagram of Fig. 1 is not intended to indicate that the computer system 100 is to include all of the components shown in Fig. 1. Rather, the computer system 100 can include any appropriate fewer or additional components not illustrated in Fig. 1 (e.g., additional memory components, embedded controllers, modules, additional network interfaces, etc.). Further, the embodiments described herein with respect to computer system 100 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.
  • Fig. 2 is a block diagram of an example system 200 for attention loss based DNN training.
  • System 200 may be implemented in conjunction with any appropriate computer system, such as computer system 100 of Fig. 1.
  • System 200 includes an input image 201, which is used for training DNNs 203 A-B (also referred to as DNN 203 A-B).
  • DNNs 203 A-B may include any appropriate number of layers.
  • DNNs 203 A-B share their weights, and the weights are updated together.
  • the input image 201 has an associated image level label 202.
  • DNNs 203 A-B receive the input image 201, and the various layers of DNNs 203 A-B determine feature maps 204A-B based on the input image 201.
  • the feature maps 204A-B are provided to classifiers 205A-B.
  • the classifiers 205A-B determine, from a predetermined number of classes, a probable class of an object depicted in the input image 201 based on the feature maps 204A-B.
  • Classifier 205A outputs the probable class determination to cross-entropy (CE) classification loss module 206.
  • CE classification loss module 206 also receives the image level label 202, which gives the actual class of the input image 201.
  • the CE classification loss module 206 determines a difference between the class determination from classifier 205A and the image level label 202, and outputs a classification loss based on the determined difference to summation module 212.
  • Classifier 205B provides the class determination to an attention map generator 208.
  • the attention map generator 208 may be Grad-CAM or Grad-CAM++ in some embodiments.
  • the attention map generator 208 also receives decision information 207 from DNN 203B regarding any areas of the input image 201 that were used by the DNN 203B to determine the feature maps 204B.
  • the decision information 207 from the DNN 203B may be received from a last layer of the DNN 203B in some embodiments.
  • the attention map generator 208 determines an attention map 209 of the input image 201, which is input to attention loss module 211.
  • Attention loss module 211 also receives pixel level label 210, which corresponds to input image 201. Pixel level label 210 may include multiple bounding boxes, each corresponding to a relevant feature of the object in the input image 201.
  • the attention loss module 211 determines an attention loss based on the attention map 209 and the pixel level label 210, and outputs an attention loss to summation module 212.
  • the determination of the attention loss by attention loss module 211 may be based on application of an inverse mask of the pixel level label 210 to the attention map 209.
  • the summation module 212 sums the classification loss from CE classification loss module 206, and the attention loss from attention loss module 211, and provides gradient backpropagation and weight update signal 213 to the DNN 203 A-B.
  • the weights of the DNN 203 A-B are updated based on the gradient backpropagation and weight update signal 213.
  • a next input image 201 having a respective corresponding image level label 202 and pixel level label 210, may then be used for further training of the updated DNN 203 A-B.
  • Fig. 2 is discussed in further detail below with respect to Fig. 3.
  • the block diagram of Fig. 2 is not intended to indicate that the system 200 is to include all of the components shown in Fig. 2. Rather, the system 200 can include any appropriate fewer or additional components not illustrated in Fig. 2 (e.g., additional modules, signals, computer systems, processors, memory components, embedded controllers, computer networks, network interfaces, data inputs, etc.).
  • the embodiments described herein with respect to system 200 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.
  • Fig. 3 is a process flow diagram of an example method 300 for attention loss based DNN training.
  • Method 300 is discussed with reference to Fig. 2, and may be implemented in conjunction with any appropriate computer system, such as computer system 100 of Fig. 1.
  • training data corresponding to a class to be identified by a DNN, such as DNN 203 A-B, is generated.
  • the training data may include input images, such as input image 201, having associated image level labels, such as image level label 202, and associated pixel level labels, such as pixel level label 210.
  • Each input image may be an image of an object belonging to a particular class; the DNN 203 A-B may be trained to classify objects into any appropriate number of classes.
  • the image level label corresponding to an input image gives an actual class of the object.
  • the pixel level label may include annotations including multiple bounding boxes. Each of the multiple bounding boxes may correspond to a relevant feature of the object that may be used by the DNN to determine the class of the object.
  • the training data that is generated in block 301 may be generated by, for example, an expert in the objects that the DNN is being trained to classify. Examples of input images corresponding to different classes are shown in Fig. 4A, and examples of multiple bounding boxes corresponding to the input images are shown in Fig. 4B, which are discussed in further detail below.
  • an input image 201 corresponding to a class that the DNN 203 A-B is being trained to recognize is provided to the DNN 203 A-B.
  • the DNN 203 A-B determines feature maps 204A-B of the input image 201, and provides the feature maps 204A-B to classifiers 205A-B.
  • the classifiers 205A-B determine a probable class, selected from a predetermined number of classes, for the input image 201 based on the feature maps 204A-B.
  • CE classification loss module 206 receives the determined class from the classifier 205A and the image level label 202.
  • the image level label 202 gives the actual class of the input image 201.
  • the CE classification loss module 206 determines a difference between the determined class from the classifier 205A and the image level label 202, and outputs a classification loss to summation module 212 based on the determined difference.
  • if the DNN 203 A-B correctly classified the input image 201, the classification loss may be zero; however, if the DNN 203 A-B incorrectly classified the input image 201, the classification loss may be greater than zero.
  • attention map generator 208 determines an attention map 209 based on the determined class from the classifier 205B and decision information 207 from the DNN 203B.
  • the attention map generator 208 may be Grad-CAM or Grad-CAM++ in some embodiments.
  • Decision information 207 from DNN 203B indicates any areas of the input image 201 that were used by DNN 203B to determine the feature maps 204B.
  • the decision information 207 from the DNN 203B may be received from a last layer of the DNN 203B in some embodiments.
  • An example attention map 209 is discussed in further detail below with respect to Fig. 6.
  • the attention loss module 211 receives the attention map 209 and the pixel level label 210.
  • the pixel level label 210 may include the multiple bounding boxes that were annotated on the input image 201 in block 301 of method 300.
  • the attention loss module 211 may determine an inverse mask of the pixel level label 210.
  • the attention loss module 211 may apply the inverse mask to the attention map 209 in order to determine the attention loss.
  • the attention loss module 211 outputs the determined attention loss to the summation module 212.
  • An example of application of an inverse mask to the attention map as may be performed in block 305 is illustrated below with respect to Fig. 6.
  • the attention loss calculation of block 305 by attention loss module 211 may penalize the DNN 203 A-B for focusing attention on features of the input image 201 that are outside of the inverse mask of the pixel level label 210 (i.e., pixels that are outside of a bounding box), but may not penalize attention inside the inverse mask (i.e., pixels that are inside of a bounding box).
  • the attention loss calculation may include:
  • AttMap_scaled = AttMap ∘ Mask_inv (Eq. 2);
  • the attention map 209 may be elementwise scaled with the inverse mask, as denoted by the Hadamard operator ∘ in Eq. 2.
  • AttMap_scaled is a map that omits any region of the attention map located inside of a bounding box in the pixel level label 210.
  • the attention loss may then be calculated by summing up all elements, i.e., Loss_att = Σ_i Σ_j AttMap_scaled(i, j), wherein i and j are variables that step through each pixel in the scaled attention map (a small worked example of this calculation follows at the end of this list).
  • the penalty for each pixel may be determined based on the attention focused on the pixel by the DNN 203 A-B, and whether the pixel is located in the penalty area (e.g., outside of a bounding box).
  • the summation module 212 sums the classification loss from CE classification loss module 206 and the attention loss from attention loss module 211, and outputs a gradient backpropagation and weight update signal 213 for the input image 201 to the DNN 203 A-B.
  • the weights of the DNN 203 A-B are updated based on the gradient backpropagation and weight update signal 213.
  • blocks 302-307 may be repeated with additional input images 201 until the training of DNN 203 A-B is determined to be completed. Each additional input image may have a respective image level label 202 and pixel level label 210.
  • the training of DNN 203 A-B may be determined to be completed based on, for example, the DNN 203 A-B achieving a classification accuracy threshold corresponding to relatively low classification losses and attention losses for subsequent images.
  • the process flow diagram of Fig. 3 is not intended to indicate that the operations of the method 300 are to be executed in any particular order, or that all of the operations of the method 300 are to be included in every case. Additionally, the method 300 can include any suitable number of additional operations.
  • Fig. 4A illustrates example input images 400A-B
  • Fig. 4B illustrates example multiple bounding boxes 401/402 that are annotated onto the example input images 400A-B.
  • the input images 400A-B of Fig. 4A may each correspond to separate input images 201 that are used to train a DNN 203A-B according to method 300 of Fig. 3.
  • Bounding boxes 401/402 as shown in Fig. 4B may be included in respective pixel level labels 210 corresponding to the input images 400 A-B.
  • the general features of two objects that are to be classified by the DNN 203 A-B may be similar, e.g. both objects may be rotationally symmetrical.
  • multiple bounding box annotations (bounding boxes 401 corresponding to image 400A in Fig. 4B, and bounding boxes 402 corresponding to image 400B in Fig. 4B) that cover relevant features may be generated in block 301 of method 300.
  • Figs. 4A-B are shown for illustrative purposes only.
  • an input image may show any appropriate object
  • there may be any appropriate number of bounding boxes
  • each bounding box may have any appropriate shape, size, and location within the mask.
  • Fig. 5 illustrates an example mask 500A and inverse mask 500B for embodiments of attention loss based DNN training.
  • Mask 500A includes multiple bounding boxes 501, corresponding to bounding boxes 401 as shown in image 400 A of Fig. 4B.
  • Inverse mask 500B is the inverse of mask 500A.
  • Inverse mask 500B also includes bounding boxes 502, and may be used by attention loss module 211 of Fig. 2.
  • Areas in the attention map 209 corresponding to image 400A of Fig. 4A that are located outside of a bounding box 502 may be used to determine the attention loss in block 305 of method 300 of Fig. 3, while areas inside of a bounding box 502 (e.g., the black areas of inverse mask 500B) may not be used to determine the attention loss in block 305 of method 300 of Fig. 3.
  • the inverse mask 500B may be used to scale the attention map 209, which gives an attention map that is only non-zero outside of the mask.
  • the attention loss that is determined in block 305 of method 300 of Fig. 3 may be the summation over all values in the scaled attention map.
  • Fig. 5 is shown for illustrative purposes only.
  • a mask and an inverse mask as are shown in Fig. 5 may each include any appropriate number of bounding boxes, each having any appropriate shape, size, and location within the mask.
  • Fig. 6 illustrates an example attention map 600 A, and an example scaled attention map 600B, for attention loss based DNN training.
  • Attention map 600A may correspond to attention map 209 of Fig. 2.
  • Area 601 indicates an area of the image that was used by the DNN 203 A-B to classify the object in the image.
  • Scaled attention map 600B illustrates application of inverse mask 602, which corresponds to a bounding box 502 as shown in inverse mask 500B of Fig. 5, to the attention map 600A.
  • the penalty area 603, which is outside of the inverse mask 602 in image 600B, is used to calculate the attention loss by attention loss module 211 in block 305 of Fig. 3.
  • While a single attention area and bounding box are shown in attention map 600A and scaled attention map 600B, this is for illustrative purposes only; an attention map such as attention map 600A may include any appropriate number of attention areas, which may be scaled as shown in scaled attention map 600B based on any appropriate inverse mask including any appropriate number of bounding boxes, each bounding box having any appropriate shape, size, and location.
  • the present disclosure may be a system, a method, apparatus, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field- programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that
  • the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration can be implemented by special purpose hardware- based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
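
As a small worked example of the attention loss calculation of Eq. 2 above (the numbers are invented purely for illustration and are not taken from the patent), the snippet below scales a toy attention map elementwise with an inverse mask and sums the result, so that only attention falling outside a bounding box contributes to the loss:

```python
import numpy as np

# Toy 4x4 attention map: the DNN attends strongly to the top-left region
att_map = np.array([[0.9, 0.8, 0.1, 0.0],
                    [0.7, 0.6, 0.1, 0.0],
                    [0.1, 0.1, 0.2, 0.1],
                    [0.0, 0.0, 0.1, 0.3]])

# Inverse mask: 0 inside the (top-left) bounding box, 1 outside it
mask_inv = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1],
                     [1, 1, 1, 1],
                     [1, 1, 1, 1]], dtype=float)

att_map_scaled = att_map * mask_inv       # elementwise (Hadamard) product, Eq. 2
attention_loss = att_map_scaled.sum()     # sum over i and j -> 1.1 for these numbers
print(attention_loss)                     # attention inside the box incurs no penalty
```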

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

Examples of techniques for attention loss based deep neural network (DNN) training are described herein. An aspect includes receiving an input image. Another aspect includes receiving a pixel level label corresponding to the input image. Another aspect includes determining by a DNN a probable class of the input image. Another aspect includes determining an attention map of the DNN corresponding to the probable class of the input image. Another aspect includes determining an attention loss of the DNN based on the attention map and the pixel level label. Yet another aspect includes updating weights of the DNN based on the attention loss.

Description

ATTENTION LOSS BASED DEEP NEURAL NETWORK TRAINING
BACKGROUND
[0001] The present techniques relate to neural networks. More specifically, the techniques relate to attention loss based deep neural network (DNN) training.
[0002] A neural network may include a plurality of processing elements arranged in layers. Interconnections are made between successive layers in the neural network. A neural network may have an input layer, an output layer, and any appropriate number of intermediate layers. The intermediate layers may allow solution of nonlinear problems by the neural network. A layer of a neural network may generate an output signal which may be determined based on a weighted sum of any input signals the layer receives. The input signals to a layer of a neural network may be provided from the neural network input, or from the output of any other layer of the neural network.
SUMMARY
[0003] According to an embodiment described herein, a system can include a processor to receive an input image. The processor can also receive a pixel level label corresponding to the input image. The processor can also determine, by a DNN, a probable class of the input image. The processor can also determine an attention map of the DNN corresponding to the probable class of the input image. The processor can also determine an attention loss of the DNN based on the attention map and the pixel level label. The processor can also update weights of the DNN based on the attention loss.
[0004] According to another embodiment described herein, a method can include receiving, by a processor, an input image. The method can also include receiving, by the processor, a pixel level label corresponding to the input image. The method can also include determining by a DNN a probable class of the input image. The method can also include determining an attention map of the DNN corresponding to the probable class of the input image. The method can also include determining an attention loss of the DNN based on the attention map and the pixel level label. The method can also include updating weights of the DNN based on the attention loss.
[0005] According to another embodiment described herein, a computer program product may include a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method including receiving an input image. The method can also include receiving a pixel level label corresponding to the input image. The method can also include determining by a DNN a probable class of the input image. The method can also include determining an attention map of the DNN corresponding to the probable class of the input image. The method can also include determining an attention loss of the DNN based on the attention map and the pixel level label. The method can also include updating weights of the DNN based on the attention loss.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] Fig. 1 is a block diagram of an example computer system for use in conjunction with attention loss based deep neural network training;
[0007] Fig. 2 is a block diagram of an example system for attention loss based deep neural network training;
[0008] Fig. 3 is a process flow diagram of an example method for attention loss based deep neural network training;
[0009] Figs. 4A-B illustrate example input images and associated multiple bounding boxes for embodiments of attention loss based deep neural network training;
[0010] Fig. 5 illustrates an example mask and inverse mask for embodiments of attention loss based deep neural network training; and [0011] Fig. 6 illustrates an example attention map and scaled attention map for embodiments of attention loss based deep neural network training.
DETAILED DESCRIPTION
[0012] Embodiments of attention loss based deep neural network (DNN) training are provided, with exemplary embodiments being discussed below in detail. The weights of a DNN, which are used to determine a weighted sum, may be determined based on a training process. A DNN may be trained by feeding the DNN a succession of known input patterns (including but not limited to images) and comparing the output of the DNN to a corresponding expected output pattern (e.g., a class of an object depicted in an input image). The DNN may learn by measuring a difference between the expected output pattern and the output pattern that was produced by the current state of the DNN for the corresponding input pattern. The weights of the DNN may be adjusted based on the measured difference. DNN training may be an iterative process, requiring a relatively large number of input patterns to be sequentially fed into the DNN. When the weights of a DNN are set to appropriate levels by the training, an input pattern at the input layer of the DNN may successively propagate through the intermediate layers of the DNN, to give a correct corresponding output pattern for the input pattern.
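
As an informal illustration of the iterative training process just described (not part of the patent disclosure), the following minimal PyTorch-style sketch feeds a succession of known input patterns to a DNN, measures the difference between the produced and expected outputs, and adjusts the weights accordingly; the model, data loader, and hyperparameters are hypothetical placeholders:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    """Feed known input patterns to the DNN and adjust its weights iteratively."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()            # measures expected-vs-produced difference
    for _ in range(epochs):
        for images, labels in train_loader:      # succession of known input patterns
            outputs = model(images)              # propagate through the DNN layers
            loss = criterion(outputs, labels)    # difference from the expected output
            optimizer.zero_grad()
            loss.backward()                      # backpropagate the measured difference
            optimizer.step()                     # adjust the weights
    return model
```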
[0013] A DNN may provide visual recognition and classification of objects in images that have been captured by sensors (e.g., a camera), including but not limited to red-green-blue (RGB) images. Such a DNN may be trained to classify objects in input images into a predetermined number of classes. In some embodiments, a DNN may be trained based on two loss determinations: a classification loss that compares an actual class of an input image with the DNN class prediction for the input image; and an attention loss that is determined based on an inverse mask of a bounding box annotation of the input image, and an attention map of the DNN for the input image. The weights of the DNN are updated via successive training iterations based on the classification loss and the attention loss via backpropagation. [0014] In, for example, an industrial process, a DNN may classify objects, such as parts for a manufacturing process, that may be relatively visually similar. A bounding box on an image, which may encompass the object that is being classified within an input image, may be used to guide the attention of a DNN during training for the classification task. However, for widely distributed features on an image, a single bounding box covering the entire object may include background noise along with any distinguishing features of the object. Multiple smaller bounding boxes, each corresponding to one or more relevant features of the image, may be used to guide the attention of a DNN during training for image classification. Each of the multiple bounding box annotations may closely correspond to a relatively small area containing a relevant feature on the image in order to guide the attention of the DNN. Provision of multiple bounding boxes allows guiding of the attention of the DNN to multiple relevant features while ignoring background noise, which may improve classification performance of the DNN, especially for an object having relevant features that are relatively far apart in the image.
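
As a purely illustrative sketch of how multiple small bounding boxes might be turned into the pixel level label and inverse mask discussed in the following paragraphs (the helper name, box coordinates, and image size are assumptions, not taken from the patent):

```python
import numpy as np

def pixel_level_label(image_shape, boxes):
    """Rasterize multiple small bounding boxes into a binary pixel level label.

    image_shape: (height, width)
    boxes:       iterable of (x1, y1, x2, y2) pixel coordinates, each box
                 tightly enclosing one relevant feature of the object
    """
    mask = np.zeros(image_shape, dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0           # pixel lies inside a bounding box
    return mask

# Hypothetical example: two small boxes around relevant features of a 224x224 image
mask = pixel_level_label((224, 224), [(30, 40, 70, 80), (150, 60, 190, 100)])
inverse_mask = 1.0 - mask                  # 1 outside the boxes, 0 inside them
```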
[0015] Multiple bounding boxes may be annotated on training input images by, for example, an expert regarding the objects that are being classified by the DNN. The multiple bounding boxes may be included in a pixel level label corresponding to the input image, and may be pixel level annotations on the image, e.g., an annotation
corresponding to each pixel indicating whether the pixel is inside or outside of a bounding box. The pixel level label including multiple bounding boxes may be compared to an attention map of the DNN. The attention map highlights any areas of the input image that were used by the DNN to determine the class of the input image; e.g., any portions of the input image that support the class determination by the DNN. The attention map may be generated by any appropriate attention map generator, including but not limited to Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++. A difference between the pixel level label including the multiple bounding box annotations and the attention map may give the attention loss of the DNN. [0016] In some embodiments, during training of a DNN, an input image may be fed to the DNN, and a probable class of the input image may be determined by the DNN. The determined probable class and the actual class (e.g., an image level label) of the image are compared to determine the classification loss. In some embodiments, a signal is set wherein the actual class value is 1, and any other class values are 0. A gradient may be backpropagated to the DNN, and global average pooling may be performed on the gradient to determine a weight vector. The weight vector may be used to determine a weighted sum of the feature maps that are output by the DNN for the input image to determine an attention map of the DNN. An inverse mask of the pixel level label of the input image may be determined, and the attention map may be elementwise scaled with the inverse mask. The scaled attention map may therefore omit any portions of the attention map that are covered by the inverse mask, i.e., that are inside a bounding box, such that a penalty is only incurred for attention to pixels that are outside of a bounding box. The attention loss may then be calculated by summing up values in the scaled attention map. The classification loss and the attention loss may be used to train the DNN to focus on relevant areas in input images and make classification decisions accordingly.
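
The following hedged PyTorch sketch shows one possible realization of the attention-map and attention-loss computation described in the preceding paragraph: a one-hot class signal is backpropagated, the gradients are global-average-pooled into a weight vector, the weighted sum of the feature maps gives the attention map, and the map is elementwise scaled with the inverse mask before being summed. The function signature, tensor shapes, upsampling step, and use of a ReLU are illustrative assumptions rather than requirements of the disclosure:

```python
import torch
import torch.nn.functional as F

def attention_loss(feature_maps, logits, target_class, inverse_mask):
    """Grad-CAM-style masked attention loss (illustrative sketch).

    feature_maps: (1, C, h, w) activations of a late convolutional layer,
                  still attached to the graph that produced `logits`
    logits:       (1, num_classes) classifier output
    target_class: int, the probable class determined by the DNN
    inverse_mask: (H, W) tensor that is 1 outside bounding boxes, 0 inside
    """
    # One-hot signal: the target class value is 1, all other class values are 0
    one_hot = torch.zeros_like(logits)
    one_hot[0, target_class] = 1.0

    # Backpropagate the signal to obtain gradients w.r.t. the feature maps;
    # create_graph=True keeps the result differentiable so the attention loss
    # can itself be backpropagated into the DNN weights
    grads = torch.autograd.grad(logits, feature_maps, grad_outputs=one_hot,
                                retain_graph=True, create_graph=True)[0]

    # Global average pooling of the gradients gives a per-channel weight vector
    weights = grads.mean(dim=(2, 3), keepdim=True)            # (1, C, 1, 1)

    # Weighted sum of the feature maps yields the attention map
    att_map = F.relu((weights * feature_maps).sum(dim=1))     # (1, h, w)

    # Upsample to the label resolution and scale elementwise with the inverse mask
    att_map = F.interpolate(att_map.unsqueeze(1), size=inverse_mask.shape,
                            mode="bilinear", align_corners=False)[0, 0]
    scaled = att_map * inverse_mask            # Hadamard product, Eq. 2

    # Summing the scaled map penalizes only attention outside the bounding boxes
    return scaled.sum()
```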
[0017] Turning now to FIG. 1, a computer system 100 is generally shown in accordance with an embodiment. The computer system 100 can be an electronic, computer framework comprising and/or employing any number and combination of computing devices and networks utilizing various communication technologies, as described herein. The computer system 100 can be easily scalable, extensible, and modular, with the ability to change to different services or reconfigure some features independently of others. The computer system 100 may be, for example, a server, desktop computer, laptop computer, tablet computer, or smartphone. In some examples, computer system 100 may be a cloud computing node. Computer system 100 may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system 100 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
[0018] As shown in FIG. 1, the computer system 100 has one or more central processing units (CPU(s)) 101a, 101b, 101c, etc. (collectively or generically referred to as processor(s) 101). The processors 101 can be a single-core processor, multi-core processor, computing cluster, or any number of other configurations. The processors 101, also referred to as processing circuits, are coupled via a system bus 102 to a system memory 103 and various other components. The system memory 103 can include a read only memory (ROM) 104 and a random access memory (RAM) 105. The ROM 104 is coupled to the system bus 102 and may include a basic input/output system (BIOS), which controls certain basic functions of the computer system 100. The RAM is read-write memory coupled to the system bus 102 for use by the processors 101. The system memory 103 provides temporary memory space for operations of said instructions during operation. The system memory 103 can include random access memory (RAM), read only memory, flash memory, or any other suitable memory systems.
[0019] The computer system 100 comprises an input/output (I/O) adapter 106 and a communications adapter 107 coupled to the system bus 102. The I/O adapter 106 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 108 and/or any other similar component. The I/O adapter 106 and the hard disk 108 are collectively referred to herein as a mass storage 110.
[0020] Software 111 for execution on the computer system 100 may be stored in the mass storage 110. The mass storage 110 is an example of a tangible storage medium readable by the processors 101, where the software 111 is stored as instructions for execution by the processors 101 to cause the computer system 100 to operate, such as is described herein below with respect to the various Figures. Examples of a computer program product and the execution of such instructions are discussed herein in more detail. The communications adapter 107 interconnects the system bus 102 with a network 112, which may be an outside network, enabling the computer system 100 to communicate with other such systems. In one embodiment, a portion of the system memory 103 and the mass storage 110 collectively store an operating system, which may be any appropriate operating system, to coordinate the functions of the various components shown in FIG. 1.
[0021] Additional input/output devices are shown as connected to the system bus 102 via a display adapter 115 and an interface adapter 116. In one embodiment, the adapters 106, 107, 115, and 116 may be connected to one or more I/O buses that are connected to the system bus 102 via an intermediate bus bridge (not shown). A display 119 (e.g., a screen or a display monitor) is connected to the system bus 102 by a display adapter 115, which may include a graphics controller to improve the performance of graphics intensive applications and a video controller. A keyboard 121, a mouse 122, a speaker 123, etc. can be interconnected to the system bus 102 via the interface adapter 116, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit. Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI). Thus, as configured in Fig. 1, the computer system 100 includes processing capability in the form of the processors 101, storage capability including the system memory 103 and the mass storage 110, input means such as the keyboard 121 and the mouse 122, and output capability including the speaker 123 and the display 119.
[0022] In some embodiments, the communications adapter 107 can transmit data using any suitable interface or protocol, such as the internet small computer system interface, among others. The network 112 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others.
An external computing device may connect to the computer system 100 through the network 112. In some examples, an external computing device may be an external Webserver or a cloud computing node.
[0023] It is to be understood that the block diagram of Fig. 1 is not intended to indicate that the computer system 100 is to include all of the components shown in Fig. 1. Rather, the computer system 100 can include any appropriate fewer or additional components not illustrated in Fig. 1 (e.g., additional memory components, embedded controllers, modules, additional network interfaces, etc.). Further, the embodiments described herein with respect to computer system 100 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.
[0024] Fig. 2 is a block diagram of an example system 200 for attention loss based DNN training. System 200 may be implemented in conjunction with any appropriate computer system, such as computer system 100 of Fig. 1. System 200 includes an input image 201, which is used for training DNNs 203 A-B (also referred to as DNN 203 A-B). DNNs 203 A-B may include any appropriate number of layers. DNNs 203 A-B share their weights, and the weights are updated together. The input image 201 has an associated image level label 202. DNNs 203 A-B receive the input image 201, and the various layers of DNNs 203 A-B determine feature maps 204A-B based on the input image 201. The feature maps 204A-B are provided to classifiers 205A-B. The classifiers 205A-B determine, from a predetermined number of classes, a probable class of an object depicted in the input image 201 based on the feature maps 204A-B. Classifier 205A outputs the probable class determination to cross-entropy (CE) classification loss module 206. CE classification loss module 206 also receives the image level label 202, which gives the actual class of the input image 201. The CE classification loss module 206 determines a difference between the class determination from classifier 205A and the image level label 202, and outputs a classification loss based on the determined difference to summation module 212.
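By way of a non-limiting illustration only, the classification loss produced by CE classification loss module 206 may be computed as in the following Python sketch, which assumes a PyTorch-style implementation; the names logits and image_level_label are hypothetical and are used only for explanation:

    import torch.nn.functional as F

    def classification_loss(logits, image_level_label):
        # logits: (N, num_classes) raw class scores from classifier 205A
        # image_level_label: (N,) integer indices giving the actual class of each input image 201
        # Cross-entropy compares the predicted class distribution with the actual class;
        # it approaches zero for confident correct classifications and grows with misclassification.
        return F.cross_entropy(logits, image_level_label)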
[0025] Classifier 205B provides the class determination to an attention map generator 208. The attention map generator 208 may be Grad-CAM or Grad-CAM++ in some embodiments. The attention map generator 208 also receives decision information 207 from DNN 203B regarding any areas of the input image 201 that were used by DNN 203B to determine the feature maps 204B. The decision information 207 from the DNN 203B may be received from a last layer of the DNN 203B in some embodiments.
The attention map generator 208 determines an attention map 209 of the input image 201, which is input to attention loss module 211. Attention loss module 211 also receives pixel level label 210, which corresponds to input image 201. Pixel level label 210 may include multiple bounding boxes, each corresponding to a relevant feature of the object in the input image 201. The attention loss module 211 determines an attention loss based on the attention map 209 and the pixel level label 210, and outputs the attention loss to summation module 212. The determination of the attention loss by attention loss module 211 may be based on application of an inverse mask of the pixel level label 210 to the attention map 209. The summation module 212 sums the classification loss from CE classification loss module 206 and the attention loss from attention loss module 211, and provides a gradient backpropagation and weight update signal 213 to the DNN 203 A-B.
The weights of the DNN 203 A-B are updated based on the gradient backpropagation and weight update signal 213. A next input image 201, having a respective corresponding image level label 202 and pixel level label 210, may then be used for further training of the updated DNN 203 A-B. Fig. 2 is discussed in further detail below with respect to Fig. 3.
[0026] It is to be understood that the block diagram of Fig. 2 is not intended to indicate that the system 200 is to include all of the components shown in Fig. 2. Rather, the system 200 can include fewer components, or any appropriate additional components not illustrated in Fig. 2 (e.g., additional modules, signals, computer systems, processors, memory components, embedded controllers, computer networks, network interfaces, data inputs, etc.). Further, the embodiments described herein with respect to system 200 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.
[0027] Fig. 3 is a process flow diagram of an example method 300 for attention loss based DNN training. Method 300 is discussed with reference to Fig. 2, and may be implemented in conjunction with any appropriate computer system, such as computer system 100 of Fig. 1. In block 301, training data corresponding to a class to be identified by a DNN, such as DNN 203 A-B, is generated. The training data may include input images, such as input image 201, having associated image level labels, such as image level label 202, and associated pixel level labels, such as pixel level label 210. Each input image may be an image of an object belonging to a particular class; the DNN 203 A-B may be trained to classify objects into any appropriate number of classes. The image level label corresponding to an input image gives an actual class of the object. The pixel level label may include annotations including multiple bounding boxes. Each of the multiple bounding boxes may correspond to a relevant feature of the object that may be used by the DNN to determine the class of the object. The training data that is generated in block 301 may be generated by, for example, an expert in the objects that the DNN is being trained to classify. Examples of input images corresponding to different classes are shown in Fig. 4A, and examples of multiple bounding boxes corresponding to the input images are shown in Fig. 4B, which are discussed in further detail below.
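For explanatory purposes only, one possible organization of the training data generated in block 301 is sketched below in Python; the field names and the (x_min, y_min, x_max, y_max) box format are assumptions made for illustration and are not required by the embodiments described herein:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class BoundingBox:
        # One annotated bounding box covering a relevant feature of the object
        x_min: int
        y_min: int
        x_max: int
        y_max: int

    @dataclass
    class TrainingSample:
        image_path: str                      # input image 201
        image_level_label: int               # actual class of the depicted object (202)
        bounding_boxes: List[BoundingBox]    # pixel level label 210 (multiple boxes per image)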
[0028] In block 302, an input image 201 corresponding to a class that the DNN 203 A-B is being trained to recognize is provided to the DNN 203 A-B. The DNN 203 A-B determines feature maps 204A-B of the input image 201, and provides the feature maps 204A-B to classifiers 205A-B. The classifiers 205A-B determine a probable class, selected from a predetermined number of classes, for the input image 201 based on the feature maps 204A-B.
[0029] In block 303, CE classification loss module 206 receives the determined class from the classifier 205A and the image level label 202. The image level label 202 gives the actual class of the input image 201. The CE classification loss module 206 determines a difference between the determined class from the classifier 205A and the image level label 202, and outputs a classification loss to summation module 212 based on the determined difference. In some embodiments, if the DNN 203 A-B correctly classified the input image 201, the classification loss may be zero; however, if the DNN 203 A-B incorrectly classified the input image 201, the classification loss may be greater than zero.
[0030] In block 304, attention map generator 208 determines an attention map 209 based on the determined class from the classifier 205B and decision information 207 from the DNN 203B. The attention map generator 208 may be Grad-CAM or Grad-CAM++ in some embodiments. Decision information 207 from DNN 203B indicates any areas of the input image 201 that were used by DNN 203B to determine the feature maps 204B. The decision information 207 from the DNN 203B may be received from a last layer of the DNN 203B in some embodiments. An example attention map 209 is discussed in further detail below with respect to Fig. 6.
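As context for block 304, the following Python sketch outlines how a Grad-CAM style attention map may be derived from the feature maps 204B and decision information 207; it is a simplified sketch assuming a PyTorch-style implementation, and the names feature_maps, class_score, and output_size are hypothetical:

    import torch
    import torch.nn.functional as F

    def grad_cam(feature_maps, class_score, output_size):
        # feature_maps: (N, K, H, W) activations of the last convolutional layer (204B)
        # class_score:  (N,) score of the probable class from classifier 205B
        # Gradients of the class score with respect to the feature maps serve as the
        # decision information 207; create_graph=True keeps the graph so that a loss
        # computed on the attention map can itself be backpropagated (blocks 305-307).
        grads = torch.autograd.grad(class_score.sum(), feature_maps, create_graph=True)[0]
        # Channel weights: global average pooling of the gradients
        weights = grads.mean(dim=(2, 3), keepdim=True)                    # (N, K, 1, 1)
        # Weighted sum of the feature maps, rectified to keep positive evidence only
        cam = F.relu((weights * feature_maps).sum(dim=1, keepdim=True))   # (N, 1, H, W)
        # Upsample to the input image resolution and normalize to [0, 1]
        cam = F.interpolate(cam, size=output_size, mode="bilinear", align_corners=False)
        return cam / cam.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)

Grad-CAM++ differs mainly in how the channel weights are derived from higher-order gradients; either generator may be substituted in this sketch.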
[0031] In block 305, the attention loss module 211 receives the attention map 209 and the pixel level label 210. The pixel level label 210 may include the multiple bounding boxes that were annotated on the input image 201 in block 301 of method 300. The attention loss module 211 may determine an inverse mask of the pixel level label 210.
An example of an inverse mask is discussed below in further detail with respect to Fig. 5. The attention loss module 211 may apply the inverse mask to the attention map 209 in order to determine the attention loss. The attention loss module 211 outputs the determined attention loss to the summation module 212. An example of application of an inverse mask to the attention map as may be performed in block 305 is illustrated below with respect to Fig. 6.
[0032] The attention loss calculation of block 305 by attention loss module 211 may penalize the DNN 203 A-B for focusing attention on features of the input image 201 that are not masked out by the inverse mask of the pixel level label 210 (i.e., pixels that are outside of a bounding box), but may not penalize attention on pixels that are masked out by the inverse mask (i.e., pixels that are inside of a bounding box). In some embodiments, the attention loss calculation may include:
Mask_inv = 1 - Mask (Eq. 1);
AttMap_scaled = AttMap ∘ Mask_inv (Eq. 2); and
AttLoss = Σ_i Σ_j AttMap_scaled(i, j) (Eq. 3).
The attention map 209 may be elementwise scaled with the inverse mask, as denoted by the Hadamard operator ∘ in Eq. 2. Thus, AttMap_scaled is a map that leaves out any region located inside of a bounding box in the pixel level label 210. The attention loss may then be calculated by summing up all elements, Σ_i Σ_j AttMap_scaled(i, j), as in Eq. 3, wherein i and j are variables that step through each pixel in the scaled attention map. The penalty for each pixel may be determined based on the attention focused on the pixel by the DNN 203 A-B, and whether the pixel is located in the penalty area (e.g., outside of a bounding box).
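A minimal Python sketch of Eqs. 1-3, assuming the attention map 209 and a binary mask of the pixel level label 210 are available as PyTorch-style tensors of identical shape (the names att_map and mask are hypothetical), may read:

    def attention_loss(att_map, mask):
        # att_map: (N, 1, H, W) attention map 209, values in [0, 1]
        # mask:    (N, 1, H, W) binary mask of the pixel level label 210, 1 inside any bounding box
        mask_inv = 1.0 - mask                  # Eq. 1: inverse mask (1 outside the bounding boxes)
        att_map_scaled = att_map * mask_inv    # Eq. 2: Hadamard (elementwise) product
        # Eq. 3: sum over all pixels; only attention falling outside the boxes contributes
        return att_map_scaled.sum()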
[0033] In block 306, the summation module 212 sums the classification loss from CE classification loss module 206 and the attention loss from attention loss module 211, and outputs a gradient backpropagation and weight update signal 213 for the input image 201 to the DNN 203 A-B. In block 307, the weights of the DNN 203 A-B are updated based on the gradient backpropagation and weight update signal 213.

[0034] In block 308, blocks 302-307 may be repeated with additional input images 201 until the training of DNN 203 A-B is determined to be completed. Each additional input image may have a respective image level label 202 and pixel level label 210. The training of DNN 203 A-B may be determined to be completed based on, for example, the DNN 203 A-B achieving a classification accuracy threshold corresponding to relatively low classification losses and attention losses for subsequent images.
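Putting blocks 302-308 together, the per-image update may be sketched as follows; this is a simplified, non-limiting Python sketch that assumes a PyTorch-style model, an optimizer holding the shared weights of DNN 203 A-B and classifiers 205A-B, and the helper functions classification_loss, grad_cam, and attention_loss outlined above, with data_loader and accuracy_threshold being hypothetical names:

    def train(dnn, classifier, optimizer, data_loader, accuracy_threshold):
        # Because DNNs 203A and 203B share weights, a single module can serve both branches.
        while True:
            correct, total = 0, 0
            for image, image_level_label, mask in data_loader:                   # block 302
                feature_maps = dnn(image)                                        # feature maps 204A-B
                logits = classifier(feature_maps)                                # classifiers 205A-B
                cls_loss = classification_loss(logits, image_level_label)        # block 303
                class_score = logits.amax(dim=1)                                 # score of the probable class
                att_map = grad_cam(feature_maps, class_score, image.shape[-2:])  # block 304
                att_loss = attention_loss(att_map, mask)                         # block 305
                loss = cls_loss + att_loss                                       # block 306: summation module 212
                optimizer.zero_grad()
                loss.backward()              # gradient backpropagation and weight update signal 213
                optimizer.step()                                                 # block 307
                correct += (logits.argmax(dim=1) == image_level_label).sum().item()
                total += image_level_label.numel()
            if correct / total >= accuracy_threshold:                            # block 308: stop when trained
                break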
[0035] The process flow diagram of Fig. 3 is not intended to indicate that the operations of the method 300 are to be executed in any particular order, or that all of the operations of the method 300 are to be included in every case. Additionally, the method 300 can include any suitable number of additional operations.
[0036] Fig. 4A illustrates example input images 400A-B, and Fig. 4B illustrates example multiple bounding boxes 401/402 that are annotated onto the example input images 400A-B. The input images 400A-B of Fig. 4A may each correspond to separate input images 201 that are used to train a DNN 203A-B according to method 300 of Fig.
3. Bounding boxes 401/402 as shown in Fig. 4B may be included in respective pixel level labels 210 corresponding to the input images 400 A-B. As shown in Fig. 4A, in some embodiments, the general features of two objects that are to be classified by the DNN 203 A-B may be similar, e.g. both objects may be rotationally symmetrical. To distinguish the two objects in different settings, multiple bounding box annotations (bounding boxes 401 corresponding to image 400A in Fig. 4B, and bounding boxes 402 corresponding to image 400B in Fig. 4B) that cover relevant features may be generated in block 301 of method 300.
[0037] Figs. 4A-B are shown for illustrative purposes only. For example, an input image may show any appropriate object, the multiple bounding boxes may have any appropriate number, and each bounding box may have any appropriate shape, size, and location within the input image.

[0038] Fig. 5 illustrates an example mask 500A and inverse mask 500B for embodiments of attention loss based DNN training. Mask 500A includes multiple bounding boxes 501, corresponding to bounding boxes 401 as shown in image 400A of Fig. 4B. Inverse mask 500B is the inverse of mask 500A. Inverse mask 500B also includes bounding boxes 502, and may be used by attention loss module 211 of Fig. 2 to determine the attention loss, in order to ensure that only areas outside of a bounding box incur a penalty. Areas in the attention map 209 corresponding to image 400A of Fig. 4A that are located outside of a bounding box 502 (e.g., the white area of inverse mask 500B) may be used to determine the attention loss in block 305 of method 300 of Fig. 3, while areas inside of a bounding box 502 (e.g., the black areas of inverse mask 500B) may not be used to determine the attention loss in block 305 of method 300 of Fig. 3.
The inverse mask 500B may be used to scale the attention map 209, which gives an attention map that is only non-zero outside of the mask. The attention loss that is determined in block 305 of method 300 of Fig. 3 may be the summation over all values in the scaled attention map.
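As a concrete, non-limiting Python sketch of how the mask 500A and the inverse mask 500B may be rasterized from the multiple bounding boxes of a pixel level label, assuming boxes given as (x_min, y_min, x_max, y_max) pixel coordinates (an assumption made only for illustration):

    import numpy as np

    def boxes_to_masks(boxes, height, width):
        # boxes: iterable of (x_min, y_min, x_max, y_max) bounding boxes (e.g., 401 or 402)
        mask = np.zeros((height, width), dtype=np.float32)     # mask 500A: 1 inside any bounding box
        for x_min, y_min, x_max, y_max in boxes:
            mask[y_min:y_max, x_min:x_max] = 1.0
        mask_inv = 1.0 - mask                                  # inverse mask 500B: 1 outside all boxes
        return mask, mask_inv

Scaling the attention map 209 elementwise with mask_inv then yields a map that is non-zero only in the penalty area, and summing its values gives the attention loss of block 305.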
[0039] Fig. 5 is shown for illustrative purposes only. For example, a mask and an inverse mask as are shown in Fig. 5 may each include any appropriate number of bounding boxes, each having any appropriate shape, size, and location within the mask.
[0040] Fig. 6 illustrates an example attention map 600A, and an example scaled attention map 600B, for attention loss based DNN training. Attention map 600A may correspond to attention map 209 of Fig. 2. Area 601 indicates an area of the image that was used by the DNN 203 A-B to classify the object in the image. Scaled attention map 600B illustrates application of inverse mask 602, which corresponds to a bounding box 502 as shown in inverse mask 500B of Fig. 5, to the attention map 600A. The penalty area 603, which is outside of the inverse mask 602 in image 600B, is used to calculate the attention loss by attention loss module 211 in block 305 of Fig. 3. While a single attention area and bounding box are shown in attention map 600A and scaled attention map 600B, this is for illustrative purposes only; an attention map such as attention map 600A may include any appropriate number of attention areas, which may be scaled as shown in scaled attention map 600B based on any appropriate inverse mask including any appropriate number of bounding boxes, each bounding box having any appropriate shape, size, and location.
[0041] Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the
functionality and/or processing capabilities described with respect to a particular system, system component, device, or device component may be performed by any other system, device, or component. Further, while various illustrative implementations and architectures have been described in accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. In addition, it should be appreciated that any operation, element, component, data, or the like described herein as being based on another operation, element, component, data, or the like may be additionally based on one or more other operations, elements, components, data, or the like. Accordingly, the phrase “based on,” or variants thereof, should be interpreted as “based at least in part on.”
[0042] The present disclosure may be a system, a method, apparatus, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
[0043] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
[0044] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
[0045] Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0046] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[0047] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a
programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[0048] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0049] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, apparatus, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
[0050] The descriptions of the various embodiments of the present techniques have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

CLAIMS
What is claimed is:
1. A system, comprising a processor configured to: receive an input image; receive a pixel level label corresponding to the input image; determine, by a deep neural network (DNN), a probable class of the input image; determine an attention map of the DNN corresponding to the probable class of the input image; determine an attention loss of the DNN based on the attention map and the pixel level label; and update weights of the DNN based on the attention loss.
2. The system of claim 1, wherein the probable class comprises a class that the DNN is being trained to recognize using the input image, and wherein the probable class is one of a predetermined number of classes that are recognized by the DNN.
3. The system of claim 1, wherein the attention map highlights one or more areas of the input image that were used by the DNN to determine the probable class.
4. The system of claim 3, wherein the pixel level label comprises a plurality of bounding boxes, each bounding box corresponding to a respective feature of an object depicted in the input image.
5. The system of claim 4, wherein determining the attention loss comprises: determining an inverse mask of the pixel level label; and scaling the attention map according to the inverse mask, wherein scaling the attention map according to the inverse mask comprises determining a number of pixels in the scaled attention map that are highlighted in the attention map and that are not inside a bounding box in the pixel level label.
6. The system of claim 1, wherein the attention map is determined based on feature maps and decision information corresponding to the input image from the DNN by one of Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++.
7. The system of claim 1, further configured to: receive an image level label corresponding to the input image; determine a classification loss based on a difference between the image level label and the probable class; and update the weights of the DNN based on a sum of the classification loss and the attention loss.
8. A computer-implemented method, comprising: receiving, by a processor of a computer, an input image; receiving, by the processor, a pixel level label corresponding to the input image; determining, by a deep neural network (DNN), a probable class of the input image; determining an attention map of the DNN corresponding to the probable class of the input image; determining an attention loss of the DNN based on the attention map and the pixel level label; and updating weights of the DNN based on the attention loss.
9. The computer-implemented method of claim 8, wherein the probable class comprises a class that the DNN is being trained to recognize using the input image, and wherein the probable class is one of a predetermined number of classes that are recognized by the DNN.
10. The computer-implemented method of claim 8, wherein the attention map highlights one or more areas of the input image that were used by the DNN to determine the probable class.
11. The computer-implemented method of claim 10, wherein the pixel level label comprises a plurality of bounding boxes, each bounding box corresponding to a respective feature of an object depicted in the input image.
12. The computer-implemented method of claim 11, wherein determining the attention loss comprises: determining an inverse mask of the pixel level label; and scaling the attention map according to the inverse mask, wherein scaling the attention map according to the inverse mask comprises determining a number of pixels in the scaled attention map that are highlighted in the attention map and that are not inside a bounding box in the pixel level label.
13. The computer-implemented method of claim 8, wherein the attention map is determined based on feature maps and decision information corresponding to the input image from the DNN by one of Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++.
14. The computer-implemented method of claim 8, further comprising: receiving an image level label corresponding to the input image; determining a classification loss based on a difference between the image level label and the probable class; and updating the weights of the DNN based on a sum of the classification loss and the attention loss.
15. A computer program product comprising: a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method comprising:
receiving an input image; receiving a pixel level label corresponding to the input image; determining, by a deep neural network (DNN), a probable class of the input image; determining an attention map of the DNN corresponding to the probable class of the input image; determining an attention loss of the DNN based on the attention map and the pixel level label; and updating weights of the DNN based on the attention loss.
16. The computer program product of claim 15, wherein the probable class comprises a class that the DNN is being trained to recognize using the input image, and wherein the probable class is one of a predetermined number of classes that are recognized by the DNN.
17. The computer program product of claim 15, wherein the attention map highlights one or more areas of the input image that were used by the DNN to determine the probable class.
18. The computer program product of claim 17, wherein the pixel level label comprises a plurality of bounding boxes, each bounding box corresponding to a respective feature of an object depicted in the input image.
19. The computer program product of claim 18, wherein determining the attention loss comprises: determining an inverse mask of the pixel level label; and scaling the attention map according to the inverse mask, wherein scaling the attention map according to the inverse mask comprises determining a number of pixels in the scaled attention map that are highlighted in the attention map and that are not inside a bounding box in the pixel level label.
20. The computer program product of claim 15, wherein the attention map is determined based on feature maps and decision information corresponding to the input image from the DNN by one of Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++.
PCT/US2019/031507 2018-06-12 2019-05-09 Attention loss based deep neural network training WO2019240900A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201862683844P 2018-06-12 2018-06-12
US201862683791P 2018-06-12 2018-06-12
US62/683,791 2018-06-12
US62/683,844 2018-06-12

Publications (1)

Publication Number Publication Date
WO2019240900A1 true WO2019240900A1 (en) 2019-12-19

Family

ID=66625399

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/031507 WO2019240900A1 (en) 2018-06-12 2019-05-09 Attention loss based deep neural network training

Country Status (1)

Country Link
WO (1) WO2019240900A1 (en)


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHENXI LIU ET AL: "Attention Correctness in Neural Image Captioning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 31 May 2016 (2016-05-31), XP081396672 *
JIANFENG WANG ET AL: "Face Attention Network: An Effective Face Detector for the Occluded Faces", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 20 November 2017 (2017-11-20), XP081290095 *
KUNPENG LI ET AL: "Tell Me Where to Look: Guided Attention Inference Network", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 27 February 2018 (2018-02-27), pages 9215 - 9223, XP055604361, ISBN: 978-1-5386-6420-9, DOI: 10.1109/CVPR.2018.00960 *
TINGTING QIAO ET AL: "Exploring Human-like Attention Supervision in Visual Question Answering", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 September 2017 (2017-09-19), XP080817283 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046674B (en) * 2019-12-20 2024-05-31 科大讯飞股份有限公司 Semantic understanding method and device, electronic equipment and storage medium
CN111046674A (en) * 2019-12-20 2020-04-21 科大讯飞股份有限公司 Semantic understanding method and device, electronic equipment and storage medium
CN111259940A (en) * 2020-01-10 2020-06-09 杭州电子科技大学 Target detection method based on space attention map
CN111259940B (en) * 2020-01-10 2023-04-07 杭州电子科技大学 Target detection method based on space attention map
CN111538761A (en) * 2020-04-21 2020-08-14 中南大学 Click rate prediction method based on attention mechanism
US11475304B2 (en) * 2020-05-12 2022-10-18 International Business Machines Corporation Variational gradient flow
CN111783446A (en) * 2020-05-26 2020-10-16 华为技术有限公司 Method and device for processing sequence
WO2021238289A1 (en) * 2020-05-26 2021-12-02 华为技术有限公司 Sequence processing method and apparatus
CN111582409A (en) * 2020-06-29 2020-08-25 腾讯科技(深圳)有限公司 Training method of image label classification network, image label classification method and device
CN111582409B (en) * 2020-06-29 2023-12-26 腾讯科技(深圳)有限公司 Training method of image tag classification network, image tag classification method and device
CN112308129A (en) * 2020-10-28 2021-02-02 中国科学院宁波材料技术与工程研究所 Plant nematode data automatic labeling and classification identification method based on deep learning
CN112329659A (en) * 2020-11-10 2021-02-05 平安科技(深圳)有限公司 Weak supervision semantic segmentation method based on vehicle image and related equipment thereof
CN112329659B (en) * 2020-11-10 2023-08-29 平安科技(深圳)有限公司 Weak supervision semantic segmentation method based on vehicle image and related equipment thereof
CN112749667B (en) * 2021-01-15 2023-04-07 中国科学院宁波材料技术与工程研究所 Deep learning-based nematode classification and identification method
CN112749667A (en) * 2021-01-15 2021-05-04 中国科学院宁波材料技术与工程研究所 Deep learning-based nematode classification and identification method
CN113487506B (en) * 2021-07-06 2023-08-29 杭州海康威视数字技术股份有限公司 Attention denoising-based countermeasure sample defense method, device and system
CN113487506A (en) * 2021-07-06 2021-10-08 杭州海康威视数字技术股份有限公司 Countermeasure sample defense method, device and system based on attention denoising
CN113507466A (en) * 2021-07-07 2021-10-15 浙江大学 Method and system for defending backdoor attack by knowledge distillation based on attention mechanism

Similar Documents

Publication Publication Date Title
WO2019240900A1 (en) Attention loss based deep neural network training
WO2019240964A1 (en) Teacher and student based deep neural network training
CN108710885B (en) Target object detection method and device
US11586851B2 (en) Image classification using a mask image and neural networks
CN111435461B (en) Antagonistic input recognition using reduced accuracy deep neural networks
US20220058451A1 (en) Identifying a type of object in a digital image based on overlapping areas of sub-images
CN110706262B (en) Image processing method, device, equipment and storage medium
US10539881B1 (en) Generation of hotspot-containing physical design layout patterns
CN110349138B (en) Target object detection method and device based on example segmentation framework
CN113361593B (en) Method for generating image classification model, road side equipment and cloud control platform
US11195024B1 (en) Context-aware action recognition by dual attention networks
CN111767750A (en) Image processing method and device
CN112883818A (en) Text image recognition method, system, device and storage medium
CN112712036A (en) Traffic sign recognition method and device, electronic equipment and computer storage medium
US11423262B2 (en) Automatically filtering out objects based on user preferences
CN113901998A (en) Model training method, device, equipment, storage medium and detection method
US11295211B2 (en) Multi-scale object detection with a trained neural network
CN111783777B (en) Image processing method, apparatus, electronic device, and computer readable medium
US20210209414A1 (en) Defect detection using multiple models
US11562235B2 (en) Activation function computation for neural networks
CN112907575A (en) Face quality evaluation method and device and electronic equipment
US20230062313A1 (en) Generating 2d mapping using 3d data
CN115272705A (en) Method, device and equipment for training salient object detection model
CN115375657A (en) Method for training polyp detection model, detection method, device, medium, and apparatus
US11354793B2 (en) Object detection with missing annotations in visual inspection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19725592

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19725592

Country of ref document: EP

Kind code of ref document: A1