WO2019240900A1 - Attention loss based deep neural network training - Google Patents


Info

Publication number
WO2019240900A1
Authority
WO
WIPO (PCT)
Prior art keywords
dnn
input image
attention
attention map
level label
Application number
PCT/US2019/031507
Other languages
French (fr)
Inventor
Rajib MONDAL
Rajat Vikram SINGH
Kuan-Chuan Peng
Ziyan Wu
Jan Ernst
Original Assignee
Siemens Aktiengesellschaft
Siemens Corporation
Application filed by Siemens Aktiengesellschaft and Siemens Corporation
Publication of WO2019240900A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present techniques relate to neural networks. More specifically, the techniques relate to attention loss based deep neural network (DNN) training.
  • a neural network may include a plurality of processing elements arranged in layers. Interconnections are made between successive layers in the neural network.
  • a neural network may have an input layer, an output layer, and any appropriate number of intermediate layers. The intermediate layers may allow solution of nonlinear problems by the neural network.
  • a layer of a neural network may generate an output signal which may be determined based on a weighted sum of any input signals the layer receives. The input signals to a layer of a neural network may be provided from the neural network input, or from the output of any other layer of the neural network.
  • a system can include a processor to receive an input image.
  • the processor can also receive a pixel level label corresponding to the input image.
  • the processor can also determine, by a DNN, a probable class of the input image.
  • the processor can also determine an attention map of the DNN corresponding to the probable class of the input image.
  • the processor can also determine an attention loss of the DNN based on the attention map and the pixel level label.
  • the processor can also update weights of the DNN based on the attention loss.
  • a method can include receiving, by a processor, an input image.
  • the method can also include receiving, by the processor, a pixel level label corresponding to the input image.
  • the method can also include determining by a DNN a probable class of the input image.
  • the method can also include determining an attention map of the DNN corresponding to the probable class of the input image.
  • the method can also include determining an attention loss of the DNN based on the attention map and the pixel level label.
  • the method can also include updating weights of the DNN based on the attention loss.
  • a computer program product may include a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method including receiving an input image.
  • the method can also include receiving a pixel level label corresponding to the input image.
  • the method can also include determining by a DNN a probable class of the input image.
  • the method can also include determining an attention map of the DNN corresponding to the probable class of the input image.
  • the method can also include determining an attention loss of the DNN based on the attention map and the pixel level label.
  • the method can also include updating weights of the DNN based on the attention loss.
  • FIG. 1 is a block diagram of an example computer system for use in conjunction with attention loss based deep neural network training
  • FIG. 2 is a block diagram of an example system for attention loss based deep neural network training
  • FIG. 3 is a process flow diagram of an example method for attention loss based deep neural network training
  • FIGs. 4A-B illustrate example input images and associated multiple bounding boxes for embodiments of attention loss based deep neural network training
  • Fig. 5 illustrates an example mask and inverse mask for embodiments of attention loss based deep neural network training
  • Fig. 6 illustrates an example attention map and scaled attention map for embodiments of attention loss based deep neural network training.
  • Embodiments of attention loss based deep neural network (DNN) training are provided, with exemplary embodiments being discussed below in detail.
  • the weights of a DNN, which are used to determine a weighted sum, may be determined based on a training process.
  • a DNN may be trained by feeding the DNN a succession of known input patterns (including but not limited to images) and comparing the output of the DNN to a corresponding expected output pattern (e.g., a class of an object depicted in an input image).
  • the DNN may learn by measuring a difference between the expected output pattern and the output pattern that was produced by the current state of the DNN for the corresponding input pattern.
  • the weights of the DNN may be adjusted based on the measured difference.
  • DNN training may be an iterative process, requiring a relatively large number of input patterns to be sequentially fed into the DNN.
  • when the weights of a DNN are set to appropriate levels by the training, an input pattern at the input layer of the DNN may successively propagate through the intermediate layers of the DNN, to give a correct corresponding output pattern for the input pattern.
  • a DNN may provide visual recognition and classification of objects in images that have been captured by sensors (e.g., a camera), including but not limited to red-green-blue (RGB) images.
  • Such a DNN may be trained to classify objects in input images into a predetermined number of classes.
  • a DNN may be trained based on two loss determinations: a classification loss that compares an actual class of an input image with the DNN class prediction for the input image; and an attention loss that is determined based on an inverse mask of a bounding box annotation of the input image, and an attention map of the DNN for the input image.
  • the weights of the DNN are updated via successive training iterations based on the classification loss and the attention loss via backpropagation.
  • a DNN may classify objects, such as parts for a manufacturing process, that may be relatively visually similar.
  • a bounding box on an image, which may encompass the object that is being classified within an input image, may be used to guide the attention of a DNN during training for the classification task.
  • a single bounding box covering the entire object may include background noise along with any distinguishing features of the object.
  • Multiple smaller bounding boxes, each corresponding to one or more relevant features of the image, may be used to guide the attention of a DNN during training for image classification.
  • Each of the multiple bounding box annotations may closely correspond to a relatively small area containing a relevant feature on the image in order to guide the attention of the DNN. Provision of multiple bounding boxes allows guiding of the attention of the DNN to multiple relevant features while ignoring background noise, which may improve classification performance of the DNN, especially for an object having relevant features that are relatively far apart in the image.
  • Multiple bounding boxes may be annotated on training input images by, for example, an expert regarding the objects that are being classified by the DNN.
  • the multiple bounding boxes may be included in a pixel level label corresponding to the input image, and may be pixel level annotations on the image, e.g., an annotation corresponding to each pixel indicating whether the pixel is inside or outside of a bounding box.
  • the pixel level label including multiple bounding boxes may be compared to an attention map of the DNN.
  • the attention map highlights any areas of the input image that were used by the DNN to determine the class of the input image; e.g., any portions of the input image that support the class determination by the DNN.
  • the attention map may be generated by any appropriate attention map generator, including but not limited to Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++.
  • a difference between the pixel level label including the multiple bounding box annotations and the attention map may give the attention loss of the DNN.
  • an input image may be fed to the DNN, and a probable class of the input image may be determined by the DNN.
  • the determined probable class and the actual class (e.g., an image level label) of the image are compared to determine the classification loss.
  • a signal is set wherein the actual class value is 1, and any other class values are 0.
  • a gradient may be backpropagated to the DNN, and global average pooling may be performed on the gradient to determine a weight vector.
  • the weight vector may be used to determine a weighted sum of the feature maps that are output by the DNN for the input image to determine an attention map of the DNN.
  • An inverse mask of the pixel level label of the input image may be determined, and the attention map may be elementwise scaled with the inverse mask.
  • the scaled attention map may therefore omit any portions of the attention map that are covered by the inverse mask, i.e., that are inside a bounding box, such that a penalty is only incurred for attention to pixels that are outside of a bounding box.
  • the attention loss may then be calculated by summing up values in the scaled attention map.
  • the classification loss and the attention loss may be used to train the DNN to focus on relevant areas in input images and make classification decisions accordingly.
  • the computer system 100 can be an electronic, computer framework comprising and/or employing any number and combination of computing devices and networks utilizing various communication technologies, as described herein.
  • the computer system 100 can be easily scalable, extensible, and modular, with the ability to change to different services or reconfigure some features independently of others.
  • the computer system 100 may be, for example, a server, desktop computer, laptop computer, tablet computer, or smartphone.
  • computer system 100 may be a cloud computing node.
  • Computer system 100 may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system.
  • program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types.
  • Computer system 100 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote computer system storage media including memory storage devices.
  • the computer system 100 has one or more central processing units (CPU(s)) 101a, 101b, 101c, etc. (collectively or generically referred to as processor(s) 101).
  • the processors 101 can be a single-core processor, multi-core processor, computing cluster, or any number of other configurations.
  • the processors 101, also referred to as processing circuits, are coupled via a system bus 102 to a system memory 103 and various other components.
  • the system memory 103 can include a read only memory (ROM) 104 and a random access memory (RAM) 105.
  • the ROM 104 is coupled to the system bus 102 and may include a basic input/output system (BIOS), which controls certain basic functions of the computer system 100.
  • the RAM is read-write memory coupled to the system bus 102 for use by the processors 101.
  • the system memory 103 provides temporary memory space for operations of said instructions during operation.
  • the system memory 103 can include random access memory (RAM), read only memory, flash memory, or any other suitable memory systems.
  • the computer system 100 comprises an input/output (I/O) adapter 106 and a communications adapter 107 coupled to the system bus 102.
  • the I/O adapter 106 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 108 and/or any other similar component.
  • the I/O adapter 106 and the hard disk 108 are collectively referred to herein as a mass storage 110.
  • Software 111 for execution on the computer system 100 may be stored in the mass storage 110.
  • the mass storage 110 is an example of a tangible storage medium readable by the processors 101, where the software 111 is stored as instructions for execution by the processors 101 to cause the computer system 100 to operate, such as is described herein below with respect to the various Figures. Examples of a computer program product and the execution of such instructions are discussed herein in more detail.
  • the communications adapter 107 interconnects the system bus 102 with a network 112, which may be an outside network, enabling the computer system 100 to communicate with other such systems.
  • a portion of the system memory 103 and the mass storage 110 collectively store an operating system, which may be any appropriate operating system, to coordinate the functions of the various components shown in FIG. 1.
  • Additional input/output devices are shown as connected to the system bus 102 via a display adapter 115 and an interface adapter 116.
  • the adapters 106, 107, 115, and 116 may be connected to one or more I/O buses that are connected to the system bus 102 via an intermediate bus bridge (not shown).
  • the computer system 100 includes processing capability in the form of the processors 101, storage capability including the system memory 103 and the mass storage 110, input means such as the keyboard 121 and the mouse 122, and output capability including the speaker 123 and the display 119.
  • the communications adapter 107 can transmit data using any suitable interface or protocol, such as the internet small computer system interface, among others.
  • the network 112 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others.
  • An external computing device may connect to the computer system 100 through the network 112.
  • an external computing device may be an external Webserver or a cloud computing node.
  • the block diagram of Fig. 1 is not intended to indicate that the computer system 100 is to include all of the components shown in Fig. 1. Rather, the computer system 100 can include any appropriate fewer or additional components not illustrated in Fig. 1 (e.g., additional memory components, embedded controllers, modules, additional network interfaces, etc.). Further, the embodiments described herein with respect to computer system 100 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.
  • Fig. 2 is a block diagram of an example system 200 for attention loss based DNN training.
  • System 200 may be implemented in conjunction with any appropriate computer system, such as computer system 100 of Fig. 1.
  • System 200 includes an input image 201, which is used for training DNNs 203 A-B (also referred to as DNN 203 A-B).
  • DNNs 203 A-B may include any appropriate number of layers.
  • DNNs 203 A-B share their weights, and the weights are updated together.
  • the input image 201 has an associated image level label 202.
  • DNNs 203 A-B receive the input image 201, and the various layers of DNNs 203 A-B determine feature maps 204A-B based on the input image 201.
  • the feature maps 204A-B are provided to classifiers 205A-B.
  • the classifiers 205A-B determine, from a predetermined number of classes, a probable class of an object depicted in the input image 201 based on the feature maps 204A-B.
  • Classifier 205A outputs the probable class determination to cross-entropy (CE) classification loss module 206.
  • CE classification loss module 206 also receives the image level label 202, which gives the actual class of the input image 201.
  • the CE classification loss module 206 determines a difference between the class determination from classifier 205A and the image level label 202, and outputs a classification loss based on the determined difference to summation module 212.
  • Classifier 205B provides the class determination to an attention map generator 208.
  • the attention map generator 208 may be Grad-CAM or Grad-CAM++ in some embodiments.
  • the attention map generator 208 also receives decision information 207 from DNN 203B regarding any areas of the input image 201 that were used by the DNN 203B to determine the feature maps 204B.
  • the decision information 207 from the DNN 203B may be received from a last layer of the DNN 203B in some embodiments.
  • the attention map generator 208 determines an attention map 209 of the input image 201, which is input to attention loss module 211.
  • Attention loss module 211 also receives pixel level label 210, which corresponds to input image 201. Pixel level label 210 may include multiple bounding boxes, each corresponding to a relevant feature of the object in the input image 201.
  • the attention loss module 211 determines an attention loss based on the attention map 209 and the pixel level label 210, and outputs an attention loss to summation module 212.
  • the determination of the attention loss by attention loss module 211 may be based on application of an inverse mask of the pixel level label 210 to the attention map 209.
  • the summation module 212 sums the classification loss from CE classification loss module 206, and the attention loss from attention loss module 211, and provides gradient backpropagation and weight update signal 213 to the DNN 203 A-B.
  • the weights of the DNN 203 A-B are updated based on the gradient backpropagation and weight update signal 213.
  • a next input image 201 having a respective corresponding image level label 202 and pixel level label 210, may then be used for further training of the updated DNN 203 A-B.
  • Fig. 2 is discussed in further detail below with respect to Fig. 3.
  • the block diagram of Fig. 2 is not intended to indicate that the system 200 is to include all of the components shown in Fig. 2. Rather, the system 200 can include any appropriate fewer or additional components not illustrated in Fig. 2 (e.g., additional modules, signals, computer systems, processors, memory components, embedded controllers, computer networks, network interfaces, data inputs, etc.).
  • the embodiments described herein with respect to system 200 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.
  • Fig. 3 is a process flow diagram of an example method 300 for attention loss based DNN training.
  • Method 300 is discussed with reference to Fig. 2, and may be implemented in conjunction with any appropriate computer system, such as computer system 100 of Fig. 1.
  • training data corresponding to a class to be identified by a DNN, such as DNN 203 A-B, is generated.
  • the training data may include input images, such as input image 201, having associated image level labels, such as image level label 202, and associated pixel level labels, such as pixel level label 210.
  • Each input image may be an image of an object belonging to a particular class; the DNN 203 A-B may be trained to classify objects into any appropriate number of classes.
  • the image level label corresponding to an input image gives an actual class of the object.
  • the pixel level label may include annotations including multiple bounding boxes. Each of the multiple bounding boxes may correspond to a relevant feature of the object that may be used by the DNN to determine the class of the object.
  • the training data that is generated in block 301 may be generated by, for example, an expert in the objects that the DNN is being trained to classify. Examples of input images corresponding to different classes are shown in Fig. 4A, and examples of multiple bounding boxes corresponding to the input images are shown in Fig. 4B, which are discussed in further detail below.
  • an input image 201 corresponding to a class that the DNN 203 A-B is being trained to recognize is provided to the DNN 203 A-B.
  • the DNN 203 A-B determines feature maps 204A-B of the input image 201, and provides the feature maps 204A-B to classifiers 205A-B.
  • the classifiers 205A-B determine a probable class, selected from a predetermined number of classes, for the input image 201 based on the feature maps 204A-B.
  • CE classification loss module 206 receives the determined class from the classifier 205A and the image level label 202.
  • the image level label 202 gives the actual class of the input image 201.
  • the CE classification loss module 206 determines a difference between the determined class from the classifier 205A and the image level label 202, and outputs a classification loss to summation module 212 based on the determined difference.
  • if the DNN 203 A-B correctly classified the input image 201, the classification loss may be zero; however, if the DNN 203 A-B incorrectly classified the input image 201, the classification loss may be greater than zero.
  • attention map generator 208 determines an attention map 209 based on the determined class from the classifier 205B and decision information 207 from the DNN 203B.
  • the attention map generator 208 may be Grad-CAM or Grad-CAM++ in some embodiments.
  • Decision information 207 from DNN 203B indicates any areas of the input image 201 that were used by DNN 203B to determine the feature maps 204B.
  • the decision information 207 from the DNN 203B may be received from a last layer of the DNN 203B in some embodiments.
  • An example attention map 209 is discussed in further detail below with respect to Fig. 6.
  • the attention loss module 211 receives the attention map 209 and the pixel level label 210.
  • the pixel level label 210 may include the multiple bounding boxes that were annotated on the input image 201 in block 301 of method 300.
  • the attention loss module 211 may determine an inverse mask of the pixel level label 210.
  • the attention loss module 211 may apply the inverse mask to the attention map 209 in order to determine the attention loss.
  • the attention loss module 211 outputs the determined attention loss to the summation module 212.
  • An example of application of an inverse mask to the attention map as may be performed in block 305 is illustrated below with respect to Fig. 6.
  • the attention loss calculation of block 305 by attention loss module 211 may penalize the DNN 203 A-B for focusing attention on features of the input image 201 that are outside of the inverse mask of the pixel level label 210 (i.e., pixels that are outside of a bounding box), but may not penalize attention inside the inverse mask (i.e., pixels that are inside of a bounding box).
  • the attention loss calculation may include:
  • AttMap_scaled = AttMap ∘ Mask_inv (Eq. 2);
  • the attention map 209 may be elementwise scaled with the inverse mask, as denoted by the Hadamard operator ∘ in Eq. 2.
  • AttMap_scaled is a map that omits any region of the attention map located inside of a bounding box in the pixel level label 210.
  • the attention loss may then be calculated by summing up all elements, i.e., Loss_att = Σ_i Σ_j AttMap_scaled(i, j), wherein i and j are variables that step through each pixel in the scaled attention map (a small worked example of this calculation follows at the end of this list).
  • the penalty for each pixel may be determined based on the attention focused on the pixel by the DNN 203 A-B, and whether the pixel is located in the penalty area (e.g., outside of a bounding box).
  • the summation module 212 sums the classification loss from CE classification loss module 206 and the attention loss from attention loss module 211, and outputs a gradient backpropagation and weight update signal 213 for the input image 201 to the DNN 203 A-B.
  • the weights of the DNN 203 A-B are updated based on the gradient backpropagation and weight update signal 213.
  • blocks 302-307 may be repeated with additional input images 201 until the training of DNN 203 A-B is determined to be completed. Each additional input image may have a respective image level label 202 and pixel level label 210.
  • the training of DNN 203 A-B may be determined to be completed based on, for example, the DNN 203 A-B achieving a classification accuracy threshold corresponding to relatively low classification losses and attention losses for subsequent images.
  • the process flow diagram of Fig. 3 is not intended to indicate that the operations of the method 300 are to be executed in any particular order, or that all of the operations of the method 300 are to be included in every case. Additionally, the method 300 can include any suitable number of additional operations.
  • Fig. 4A illustrates example input images 400A-B
  • Fig. 4B illustrates example multiple bounding boxes 401/402 that are annotated onto the example input images 400A-B.
  • the input images 400A-B of Fig. 4A may each correspond to separate input images 201 that are used to train a DNN 203A-B according to method 300 of Fig. 3.
  • Bounding boxes 401/402 as shown in Fig. 4B may be included in respective pixel level labels 210 corresponding to the input images 400 A-B.
  • the general features of two objects that are to be classified by the DNN 203 A-B may be similar, e.g. both objects may be rotationally symmetrical.
  • multiple bounding box annotations (bounding boxes 401 corresponding to image 400A in Fig. 4B, and bounding boxes 402 corresponding to image 400B in Fig. 4B) that cover relevant features may be generated in block 301 of method 300.
  • Figs. 4A-B are shown for illustrative purposes only.
  • an input image may show any appropriate object
  • there may be any appropriate number of bounding boxes
  • each bounding box may have any appropriate shape, size, and location within the mask.
  • Fig. 5 illustrates an example mask 500A and inverse mask 500B for embodiments of attention loss based DNN training.
  • Mask 500A includes multiple bounding boxes 501, corresponding to bounding boxes 401 as shown in image 400 A of Fig. 4B.
  • Inverse mask 500B is the inverse of mask 500A.
  • Inverse mask 500B also includes bounding boxes 502, and may be used by attention loss module 211 of Fig. 2.
  • Areas in the attention map 209 corresponding to image 400A of Fig. 4A that are located outside of a bounding box 502 may be used to determine the attention loss in block 305 of method 300 of Fig. 3, while areas inside of a bounding box 502 (e.g., the black areas of inverse mask 500B) may not be used to determine the attention loss in block 305 of method 300 of Fig. 3.
  • the inverse mask 500B may be used to scale the attention map 209, which gives an attention map that is only non-zero outside of the mask.
  • the attention loss that is determined in block 305 of method 300 of Fig. 3 may be the summation over all values in the scaled attention map.
  • Fig. 5 is shown for illustrative purposes only.
  • a mask and an inverse mask as are shown in Fig. 5 may each include any appropriate number of bounding boxes, each having any appropriate shape, size, and location within the mask.
  • Fig. 6 illustrates an example attention map 600 A, and an example scaled attention map 600B, for attention loss based DNN training.
  • Attention map 600A may correspond to attention map 209 of Fig. 2.
  • Area 601 indicates an area of the image that was used by the DNN 203 A-B to classify the object in the image.
  • Scaled attention map 600B illustrates application of inverse mask 602, which corresponds to a bounding box 502 as shown in inverse mask 500B of Fig. 5, to the attention map 600A.
  • the penalty area 603, which is outside of the inverse mask 602 in image 600B, is used to calculate the attention loss by attention loss module 211 in block 305 of Fig. 3.
  • While a single attention area and bounding box are shown in attention map 600A and scaled attention map 600B, this is for illustrative purposes only; an attention map such as attention map 600A may include any appropriate number of attention areas, which may be scaled as shown in scaled attention map 600B based on any appropriate inverse mask including any appropriate number of bounding boxes, each bounding box having any appropriate shape, size, and location.
  • the present disclosure may be a system, a method, apparatus, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field- programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that
  • the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration can be implemented by special purpose hardware- based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
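
As a small worked example of the attention loss calculation of Eq. 2 above (the numbers are invented purely for illustration and are not taken from the patent), the snippet below scales a toy attention map elementwise with an inverse mask and sums the result, so that only attention falling outside a bounding box contributes to the loss:

```python
import numpy as np

# Toy 4x4 attention map: the DNN attends strongly to the top-left region
att_map = np.array([[0.9, 0.8, 0.1, 0.0],
                    [0.7, 0.6, 0.1, 0.0],
                    [0.1, 0.1, 0.2, 0.1],
                    [0.0, 0.0, 0.1, 0.3]])

# Inverse mask: 0 inside the (top-left) bounding box, 1 outside it
mask_inv = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1],
                     [1, 1, 1, 1],
                     [1, 1, 1, 1]], dtype=float)

att_map_scaled = att_map * mask_inv       # elementwise (Hadamard) product, Eq. 2
attention_loss = att_map_scaled.sum()     # sum over i and j -> 1.1 for these numbers
print(attention_loss)                     # attention inside the box incurs no penalty
```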

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

Examples of techniques for attention loss based deep neural network (DNN) training are described herein. An aspect includes receiving an input image. Another aspect includes receiving a pixel level label corresponding to the input image. Another aspect includes determining by a DNN a probable class of the input image. Another aspect includes determining an attention map of the DNN corresponding to the probable class of the input image. Another aspect includes determining an attention loss of the DNN based on the attention map and the pixel level label. Yet another aspect includes updating weights of the DNN based on the attention loss.

Description

ATTENTION LOSS BASED DEEP NEURAL NETWORK TRAINING
BACKGROUND
[0001] The present techniques relate to neural networks. More specifically, the techniques relate to attention loss based deep neural network (DNN) training.
[0002] A neural network may include a plurality of processing elements arranged in layers. Interconnections are made between successive layers in the neural network. A neural network may have an input layer, an output layer, and any appropriate number of intermediate layers. The intermediate layers may allow solution of nonlinear problems by the neural network. A layer of a neural network may generate an output signal which may be determined based on a weighted sum of any input signals the layer receives. The input signals to a layer of a neural network may be provided from the neural network input, or from the output of any other layer of the neural network.
SUMMARY
[0003] According to an embodiment described herein, a system can include a processor to receive an input image. The processor can also receive a pixel level label corresponding to the input image. The processor can also determine, by a DNN, a probable class of the input image. The processor can also determine an attention map of the DNN corresponding to the probable class of the input image. The processor can also determine an attention loss of the DNN based on the attention map and the pixel level label. The processor can also update weights of the DNN based on the attention loss.
[0004] According to another embodiment described herein, a method can include receiving, by a processor, an input image. The method can also include receiving, by the processor, a pixel level label corresponding to the input image. The method can also include determining by a DNN a probable class of the input image. The method can also include determining an attention map of the DNN corresponding to the probable class of the input image. The method can also include determining an attention loss of the DNN based on the attention map and the pixel level label. The method can also include updating weights of the DNN based on the attention loss.
[0005] According to another embodiment described herein, a computer program product may include a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method including receiving an input image. The method can also include receiving a pixel level label corresponding to the input image. The method can also include determining by a DNN a probable class of the input image. The method can also include determining an attention map of the DNN corresponding to the probable class of the input image. The method can also include determining an attention loss of the DNN based on the attention map and the pixel level label. The method can also include updating weights of the DNN based on the attention loss.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] Fig. 1 is a block diagram of an example computer system for use in conjunction with attention loss based deep neural network training;
[0007] Fig. 2 is a block diagram of an example system for attention loss based deep neural network training;
[0008] Fig. 3 is a process flow diagram of an example method for attention loss based deep neural network training;
[0009] Figs. 4A-B illustrate example input images and associated multiple bounding boxes for embodiments of attention loss based deep neural network training;
[0010] Fig. 5 illustrates an example mask and inverse mask for embodiments of attention loss based deep neural network training; and [0011] Fig. 6 illustrates an example attention map and scaled attention map for embodiments of attention loss based deep neural network training.
DETAILED DESCRIPTION
[0012] Embodiments of attention loss based deep neural network (DNN) training are provided, with exemplary embodiments being discussed below in detail. The weights of a DNN, which are used to determine a weighted sum, may be determined based on a training process. A DNN may be trained by feeding the DNN a succession of known input patterns (including but not limited to images) and comparing the output of the DNN to a corresponding expected output pattern (e.g., a class of an object depicted in an input image). The DNN may learn by measuring a difference between the expected output pattern and the output pattern that was produced by the current state of the DNN for the corresponding input pattern. The weights of the DNN may be adjusted based on the measured difference. DNN training may be an iterative process, requiring a relatively large number of input patterns to be sequentially fed into the DNN. When the weights of a DNN are set to appropriate levels by the training, an input pattern at the input layer of the DNN may successively propagate through the intermediate layers of the DNN, to give a correct corresponding output pattern for the input pattern.
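
As an informal illustration of the iterative training process just described (not part of the patent disclosure), the following minimal PyTorch-style sketch feeds a succession of known input patterns to a DNN, measures the difference between the produced and expected outputs, and adjusts the weights accordingly; the model, data loader, and hyperparameters are hypothetical placeholders:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    """Feed known input patterns to the DNN and adjust its weights iteratively."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()            # measures expected-vs-produced difference
    for _ in range(epochs):
        for images, labels in train_loader:      # succession of known input patterns
            outputs = model(images)              # propagate through the DNN layers
            loss = criterion(outputs, labels)    # difference from the expected output
            optimizer.zero_grad()
            loss.backward()                      # backpropagate the measured difference
            optimizer.step()                     # adjust the weights
    return model
```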
[0013] A DNN may provide visual recognition and classification of objects in images that have been captured by sensors (e.g., a camera), including but not limited to red-green-blue (RGB) images. Such a DNN may be trained to classify objects in input images into a predetermined number of classes. In some embodiments, a DNN may be trained based on two loss determinations: a classification loss that compares an actual class of an input image with the DNN class prediction for the input image; and an attention loss that is determined based on an inverse mask of a bounding box annotation of the input image, and an attention map of the DNN for the input image. The weights of the DNN are updated via successive training iterations based on the classification loss and the attention loss via backpropagation. [0014] In, for example, an industrial process, a DNN may classify objects, such as parts for a manufacturing process, that may be relatively visually similar. A bounding box on an image, which may encompass the object that is being classified within an input image, may be used to guide the attention of a DNN during training for the classification task. However, for widely distributed features on an image, a single bounding box covering the entire object may include background noise along with any distinguishing features of the object. Multiple smaller bounding boxes, each corresponding to one or more relevant features of the image, may be used to guide the attention of a DNN during training for image classification. Each of the multiple bounding box annotations may closely correspond to a relatively small area containing a relevant feature on the image in order to guide the attention of the DNN. Provision of multiple bounding boxes allows guiding of the attention of the DNN to multiple relevant features while ignoring background noise, which may improve classification performance of the DNN, especially for an object having relevant features that are relatively far apart in the image.
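
As a purely illustrative sketch of how multiple small bounding boxes might be turned into the pixel level label and inverse mask discussed in the following paragraphs (the helper name, box coordinates, and image size are assumptions, not taken from the patent):

```python
import numpy as np

def pixel_level_label(image_shape, boxes):
    """Rasterize multiple small bounding boxes into a binary pixel level label.

    image_shape: (height, width)
    boxes:       iterable of (x1, y1, x2, y2) pixel coordinates, each box
                 tightly enclosing one relevant feature of the object
    """
    mask = np.zeros(image_shape, dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0           # pixel lies inside a bounding box
    return mask

# Hypothetical example: two small boxes around relevant features of a 224x224 image
mask = pixel_level_label((224, 224), [(30, 40, 70, 80), (150, 60, 190, 100)])
inverse_mask = 1.0 - mask                  # 1 outside the boxes, 0 inside them
```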
[0015] Multiple bounding boxes may be annotated on training input images by, for example, an expert regarding the objects that are being classified by the DNN. The multiple bounding boxes may be included in a pixel level label corresponding to the input image, and may be pixel level annotations on the image, e.g., an annotation
corresponding to each pixel indicating whether the pixel is inside or outside of a bounding box. The pixel level label including multiple bounding boxes may be compared to an attention map of the DNN. The attention map highlights any areas of the input image that were used by the DNN to determine the class of the input image; e.g., any portions of the input image that support the class determination by the DNN. The attention map may be generated by any appropriate attention map generator, including but not limited to Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++. A difference between the pixel level label including the multiple bounding box annotations and the attention map may give the attention loss of the DNN. [0016] In some embodiments, during training of a DNN, an input image may be fed to the DNN, and a probable class of the input image may be determined by the DNN. The determined probable class and the actual class (e.g., an image level label) of the image are compared to determine the classification loss. In some embodiments, a signal is set wherein the actual class value is 1, and any other class values are 0. A gradient may be backpropagated to the DNN, and global average pooling may be performed on the gradient to determine a weight vector. The weight vector may be used to determine a weighted sum of the feature maps that are output by the DNN for the input image to determine an attention map of the DNN. An inverse mask of the pixel level label of the input image may be determined, and the attention map may be elementwise scaled with the inverse mask. The scaled attention map may therefore omit any portions of the attention map that are covered by the inverse mask, i.e., that are inside a bounding box, such that a penalty is only incurred for attention to pixels that are outside of a bounding box. The attention loss may then be calculated by summing up values in the scaled attention map. The classification loss and the attention loss may be used to train the DNN to focus on relevant areas in input images and make classification decisions accordingly.
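
The following hedged PyTorch sketch shows one possible realization of the attention-map and attention-loss computation described in the preceding paragraph: a one-hot class signal is backpropagated, the gradients are global-average-pooled into a weight vector, the weighted sum of the feature maps gives the attention map, and the map is elementwise scaled with the inverse mask before being summed. The function signature, tensor shapes, upsampling step, and use of a ReLU are illustrative assumptions rather than requirements of the disclosure:

```python
import torch
import torch.nn.functional as F

def attention_loss(feature_maps, logits, target_class, inverse_mask):
    """Grad-CAM-style masked attention loss (illustrative sketch).

    feature_maps: (1, C, h, w) activations of a late convolutional layer,
                  still attached to the graph that produced `logits`
    logits:       (1, num_classes) classifier output
    target_class: int, the probable class determined by the DNN
    inverse_mask: (H, W) tensor that is 1 outside bounding boxes, 0 inside
    """
    # One-hot signal: the target class value is 1, all other class values are 0
    one_hot = torch.zeros_like(logits)
    one_hot[0, target_class] = 1.0

    # Backpropagate the signal to obtain gradients w.r.t. the feature maps;
    # create_graph=True keeps the result differentiable so the attention loss
    # can itself be backpropagated into the DNN weights
    grads = torch.autograd.grad(logits, feature_maps, grad_outputs=one_hot,
                                retain_graph=True, create_graph=True)[0]

    # Global average pooling of the gradients gives a per-channel weight vector
    weights = grads.mean(dim=(2, 3), keepdim=True)            # (1, C, 1, 1)

    # Weighted sum of the feature maps yields the attention map
    att_map = F.relu((weights * feature_maps).sum(dim=1))     # (1, h, w)

    # Upsample to the label resolution and scale elementwise with the inverse mask
    att_map = F.interpolate(att_map.unsqueeze(1), size=inverse_mask.shape,
                            mode="bilinear", align_corners=False)[0, 0]
    scaled = att_map * inverse_mask            # Hadamard product, Eq. 2

    # Summing the scaled map penalizes only attention outside the bounding boxes
    return scaled.sum()
```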
[0017] Turning now to FIG. 1, a computer system 100 is generally shown in accordance with an embodiment. The computer system 100 can be an electronic, computer framework comprising and/or employing any number and combination of computing devices and networks utilizing various communication technologies, as described herein. The computer system 100 can be easily scalable, extensible, and modular, with the ability to change to different services or reconfigure some features independently of others. The computer system 100 may be, for example, a server, desktop computer, laptop computer, tablet computer, or smartphone. In some examples, computer system 100 may be a cloud computing node. Computer system 100 may be described in the general context of computer system executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system 100 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
[0018] As shown in FIG. 1, the computer system 100 has one or more central processing units (CPU(s)) 101a, 101b, 101c, etc. (collectively or generically referred to as processor(s) 101). The processors 101 can be a single-core processor, multi-core processor, computing cluster, or any number of other configurations. The processors 101, also referred to as processing circuits, are coupled via a system bus 102 to a system memory 103 and various other components. The system memory 103 can include a read only memory (ROM) 104 and a random access memory (RAM) 105. The ROM 104 is coupled to the system bus 102 and may include a basic input/output system (BIOS), which controls certain basic functions of the computer system 100. The RAM is read-write memory coupled to the system bus 102 for use by the processors 101. The system memory 103 provides temporary memory space for operations of said instructions during operation. The system memory 103 can include random access memory (RAM), read only memory, flash memory, or any other suitable memory systems.
[0019] The computer system 100 comprises an input/output (I/O) adapter 106 and a communications adapter 107 coupled to the system bus 102. The I/O adapter 106 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 108 and/or any other similar component. The I/O adapter 106 and the hard disk 108 are collectively referred to herein as a mass storage 110.
[0020] Software 111 for execution on the computer system 100 may be stored in the mass storage 110. The mass storage 110 is an example of a tangible storage medium readable by the processors 101, where the software 111 is stored as instructions for execution by the processors 101 to cause the computer system 100 to operate, such as is described herein below with respect to the various Figures. Examples of a computer program product and the execution of such instructions are discussed herein in more detail. The communications adapter 107 interconnects the system bus 102 with a network 112, which may be an outside network, enabling the computer system 100 to communicate with other such systems. In one embodiment, a portion of the system memory 103 and the mass storage 110 collectively store an operating system, which may be any appropriate operating system, to coordinate the functions of the various components shown in FIG. 1.
[0021] Additional input/output devices are shown as connected to the system bus 102 via a display adapter 115 and an interface adapter 116. In one embodiment, the adapters 106, 107, 115, and 116 may be connected to one or more I/O buses that are connected to the system bus 102 via an intermediate bus bridge (not shown). A display 119 (e.g., a screen or a display monitor) is connected to the system bus 102 by a display adapter 115, which may include a graphics controller to improve the performance of graphics intensive applications and a video controller. A keyboard 121, a mouse 122, a speaker 123, etc. can be interconnected to the system bus 102 via the interface adapter 116, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit. Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI). Thus, as configured in Fig. 1, the computer system 100 includes processing capability in the form of the processors 101, storage capability including the system memory 103 and the mass storage 110, input means such as the keyboard 121 and the mouse 122, and output capability including the speaker 123 and the display 119.
[0022] In some embodiments, the communications adapter 107 can transmit data using any suitable interface or protocol, such as the internet small computer system interface, among others. The network 112 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others.
An external computing device may connect to the computer system 100 through the network 112. In some examples, an external computing device may be an external Webserver or a cloud computing node.
[0023] It is to be understood that the block diagram of Fig. 1 is not intended to indicate that the computer system 100 is to include all of the components shown in Fig. 1. Rather, the computer system 100 can include any appropriate fewer or additional components not illustrated in Fig. 1 (e.g., additional memory components, embedded controllers, modules, additional network interfaces, etc.). Further, the embodiments described herein with respect to computer system 100 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.
[0024] Fig. 2 is a block diagram of an example system 200 for attention loss based DNN training. System 200 may be implemented in conjunction with any appropriate computer system, such as computer system 100 of Fig. 1. System 200 includes an input image 201, which is used for training DNNs 203 A-B (also referred to as DNN 203 A-B). DNNs 203 A-B may include any appropriate number of layers. DNNs 203 A-B share their weights, and the weights are updated together. The input image 201 has an associated image level label 202. DNNs 203 A-B receive the input image 201, and the various layers of DNNs 203 A-B determine feature maps 204A-B based on the input image 201. The feature maps 204A-B are provided to classifiers 205A-B. The classifiers 205A-B determine, from a predetermined number of classes, a probable class of an object depicted in the input image 201 based on the feature maps 204A-B. Classifier 205A outputs the probable class determination to cross-entropy (CE) classification loss module 206. CE classification loss module 206 also receives the image level label 202, which gives the actual class of the input image 201. The CE classification loss module 206 determines a difference between the class determination from classifier 205A and the image level label 202, and outputs a classification loss based on the determined difference to summation module 212.
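By way of a non-limiting illustration only, the classification loss produced by CE classification loss module 206 may be computed as in the following Python sketch, which assumes a PyTorch-style implementation; the names logits and image_level_label are hypothetical and are used only for explanation:

    import torch.nn.functional as F

    def classification_loss(logits, image_level_label):
        # logits: (N, num_classes) raw class scores from classifier 205A
        # image_level_label: (N,) integer indices giving the actual class of each input image 201
        # Cross-entropy compares the predicted class distribution with the actual class;
        # it approaches zero for confident correct classifications and grows with misclassification.
        return F.cross_entropy(logits, image_level_label)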
[0025] Classifier 205B provides the class determination to an attention map generator 208. The attention map generator 208 may be Grad-CAM or Grad-CAM++ in some embodiments. The attention map generator 208 also receives decision information 207 from DNN 203B regarding any areas of the input image 201 that were used by DNN 203B to determine the feature maps 204B. The decision information 207 from the DNN 203B may be received from a last layer of the DNN 203B in some embodiments.
The attention map generator 208 determines an attention map 209 of the input image 201, which is input to attention loss module 211. Attention loss module 211 also receives pixel level label 210, which corresponds to input image 201. Pixel level label 210 may include multiple bounding boxes, each corresponding to a relevant feature of the object in the input image 201. The attention loss module 211 determines an attention loss based on the attention map 209 and the pixel level label 210, and outputs the attention loss to summation module 212. The determination of the attention loss by attention loss module 211 may be based on application of an inverse mask of the pixel level label 210 to the attention map 209. The summation module 212 sums the classification loss from CE classification loss module 206 and the attention loss from attention loss module 211, and provides a gradient backpropagation and weight update signal 213 to the DNN 203 A-B.
The weights of the DNN 203 A-B are updated based on the gradient backpropagation and weight update signal 213. A next input image 201, having a respective corresponding image level label 202 and pixel level label 210, may then be used for further training of the updated DNN 203 A-B. Fig. 2 is discussed in further detail below with respect to Fig. 3.
[0026] It is to be understood that the block diagram of Fig. 2 is not intended to indicate that the system 200 is to include all of the components shown in Fig. 2. Rather, the system 200 can include fewer components, or any appropriate additional components not illustrated in Fig. 2 (e.g., additional modules, signals, computer systems, processors, memory components, embedded controllers, computer networks, network interfaces, data inputs, etc.). Further, the embodiments described herein with respect to system 200 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various embodiments.
[0027] Fig. 3 is a process flow diagram of an example method 300 for attention loss based DNN training. Method 300 is discussed with reference to Fig. 2, and may be implemented in conjunction with any appropriate computer system, such as computer system 100 of Fig. 1. In block 301, training data corresponding to a class to be identified by a DNN, such as DNN 203 A-B, is generated. The training data may include input images, such as input image 201, having associated image level labels, such as image level label 202, and associated pixel level labels, such as pixel level label 210. Each input image may be an image of an object belonging to a particular class; the DNN 203 A-B may be trained to classify objects into any appropriate number of classes. The image level label corresponding to an input image gives an actual class of the object. The pixel level label may include annotations including multiple bounding boxes. Each of the multiple bounding boxes may correspond to a relevant feature of the object that may be used by the DNN to determine the class of the object. The training data that is generated in block 301 may be generated by, for example, an expert in the objects that the DNN is being trained to classify. Examples of input images corresponding to different classes are shown in Fig. 4A, and examples of multiple bounding boxes corresponding to the input images are shown in Fig. 4B, which are discussed in further detail below.
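For explanatory purposes only, one possible organization of the training data generated in block 301 is sketched below in Python; the field names and the (x_min, y_min, x_max, y_max) box format are assumptions made for illustration and are not required by the embodiments described herein:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class BoundingBox:
        # One annotated bounding box covering a relevant feature of the object
        x_min: int
        y_min: int
        x_max: int
        y_max: int

    @dataclass
    class TrainingSample:
        image_path: str                      # input image 201
        image_level_label: int               # actual class of the depicted object (202)
        bounding_boxes: List[BoundingBox]    # pixel level label 210 (multiple boxes per image)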
[0028] In block 302, an input image 201 corresponding to a class that the DNN 203 A-B is being trained to recognize is provided to the DNN 203 A-B. The DNN 203 A-B determines feature maps 204A-B of the input image 201, and provides the feature maps 204A-B to classifiers 205A-B. The classifiers 205A-B determine a probable class, selected from a predetermined number of classes, for the input image 201 based on the feature maps 204A-B.
[0029] In block 303, CE classification loss module 206 receives the determined class from the classifier 205A and the image level label 202. The image level label 202 gives the actual class of the input image 201. The CE classification loss module 206 determines a difference between the determined class from the classifier 205A and the image level label 202, and outputs a classification loss to summation module 212 based on the determined difference. In some embodiments, if the DNN 203 A-B correctly classified the input image 201, the classification loss may be zero; however, if the DNN 203 A-B incorrectly classified the input image 201, the classification loss may be greater than zero.
[0030] In block 304, attention map generator 208 determines an attention map 209 based on the determined class from the classifier 205B and decision information 207 from the DNN 203B. The attention map generator 208 may be Grad-CAM or Grad-CAM++ in some embodiments. Decision information 207 from DNN 203B indicates any areas of the input image 201 that were used by DNN 203B to determine the feature maps 204B. The decision information 207 from the DNN 203B may be received from a last layer of the DNN 203B in some embodiments. An example attention map 209 is discussed in further detail below with respect to Fig. 6.
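As context for block 304, the following Python sketch outlines how a Grad-CAM style attention map may be derived from the feature maps 204B and decision information 207; it is a simplified sketch assuming a PyTorch-style implementation, and the names feature_maps, class_score, and output_size are hypothetical:

    import torch
    import torch.nn.functional as F

    def grad_cam(feature_maps, class_score, output_size):
        # feature_maps: (N, K, H, W) activations of the last convolutional layer (204B)
        # class_score:  (N,) score of the probable class from classifier 205B
        # Gradients of the class score with respect to the feature maps serve as the
        # decision information 207; create_graph=True keeps the graph so that a loss
        # computed on the attention map can itself be backpropagated (blocks 305-307).
        grads = torch.autograd.grad(class_score.sum(), feature_maps, create_graph=True)[0]
        # Channel weights: global average pooling of the gradients
        weights = grads.mean(dim=(2, 3), keepdim=True)                    # (N, K, 1, 1)
        # Weighted sum of the feature maps, rectified to keep positive evidence only
        cam = F.relu((weights * feature_maps).sum(dim=1, keepdim=True))   # (N, 1, H, W)
        # Upsample to the input image resolution and normalize to [0, 1]
        cam = F.interpolate(cam, size=output_size, mode="bilinear", align_corners=False)
        return cam / cam.amax(dim=(2, 3), keepdim=True).clamp(min=1e-8)

Grad-CAM++ differs mainly in how the channel weights are derived from higher-order gradients; either generator may be substituted in this sketch.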
[0031] In block 305, the attention loss module 211 receives the attention map 209 and the pixel level label 210. The pixel level label 210 may include the multiple bounding boxes that were annotated on the input image 201 in block 301 of method 300. The attention loss module 211 may determine an inverse mask of the pixel level label 210.
An example of an inverse mask is discussed below in further detail with respect to Fig. 5. The attention loss module 211 may apply the inverse mask to the attention map 209 in order to determine the attention loss. The attention loss module 211 outputs the determined attention loss to the summation module 212. An example of application of an inverse mask to the attention map as may be performed in block 305 is illustrated below with respect to Fig. 6.
[0032] The attention loss calculation of block 305 by attention loss module 211 may penalize the DNN 203 A-B for focusing attention on features of the input image 201 that are not masked out by the inverse mask of the pixel level label 210 (i.e., pixels that are outside of a bounding box), but may not penalize attention on pixels that are masked out by the inverse mask (i.e., pixels that are inside of a bounding box). In some embodiments, the attention loss calculation may include:
Mask_inv = 1 - Mask (Eq. 1);
AttMap_scaled = AttMap ∘ Mask_inv (Eq. 2); and
AttLoss = Σ_i Σ_j AttMap_scaled(i, j) (Eq. 3).
The attention map 209 may be elementwise scaled with the inverse mask, as denoted by the Hadamard operator ∘ in Eq. 2. Thus, AttMap_scaled is a map that leaves out any region located inside of a bounding box in the pixel level label 210. The attention loss may then be calculated by summing up all elements, Σ_i Σ_j AttMap_scaled(i, j), as in Eq. 3, wherein i and j are variables that step through each pixel in the scaled attention map. The penalty for each pixel may be determined based on the attention focused on the pixel by the DNN 203 A-B, and whether the pixel is located in the penalty area (e.g., outside of a bounding box).
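A minimal Python sketch of Eqs. 1-3, assuming the attention map 209 and a binary mask of the pixel level label 210 are available as PyTorch-style tensors of identical shape (the names att_map and mask are hypothetical), may read:

    def attention_loss(att_map, mask):
        # att_map: (N, 1, H, W) attention map 209, values in [0, 1]
        # mask:    (N, 1, H, W) binary mask of the pixel level label 210, 1 inside any bounding box
        mask_inv = 1.0 - mask                  # Eq. 1: inverse mask (1 outside the bounding boxes)
        att_map_scaled = att_map * mask_inv    # Eq. 2: Hadamard (elementwise) product
        # Eq. 3: sum over all pixels; only attention falling outside the boxes contributes
        return att_map_scaled.sum()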
[0033] In block 306, the summation module 212 sums the classification loss from CE classification loss module 206 and the attention loss from attention loss module 211, and outputs a gradient backpropagation and weight update signal 213 for the input image 201 to the DNN 203 A-B. In block 307, the weights of the DNN 203 A-B are updated based on the gradient backpropagation and weight update signal 213.

[0034] In block 308, blocks 302-307 may be repeated with additional input images 201 until the training of DNN 203 A-B is determined to be completed. Each additional input image may have a respective image level label 202 and pixel level label 210. The training of DNN 203 A-B may be determined to be completed based on, for example, the DNN 203 A-B achieving a classification accuracy threshold corresponding to relatively low classification losses and attention losses for subsequent images.
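Putting blocks 302-308 together, the per-image update may be sketched as follows; this is a simplified, non-limiting Python sketch that assumes a PyTorch-style model, an optimizer holding the shared weights of DNN 203 A-B and classifiers 205A-B, and the helper functions classification_loss, grad_cam, and attention_loss outlined above, with data_loader and accuracy_threshold being hypothetical names:

    def train(dnn, classifier, optimizer, data_loader, accuracy_threshold):
        # Because DNNs 203A and 203B share weights, a single module can serve both branches.
        while True:
            correct, total = 0, 0
            for image, image_level_label, mask in data_loader:                   # block 302
                feature_maps = dnn(image)                                        # feature maps 204A-B
                logits = classifier(feature_maps)                                # classifiers 205A-B
                cls_loss = classification_loss(logits, image_level_label)        # block 303
                class_score = logits.amax(dim=1)                                 # score of the probable class
                att_map = grad_cam(feature_maps, class_score, image.shape[-2:])  # block 304
                att_loss = attention_loss(att_map, mask)                         # block 305
                loss = cls_loss + att_loss                                       # block 306: summation module 212
                optimizer.zero_grad()
                loss.backward()              # gradient backpropagation and weight update signal 213
                optimizer.step()                                                 # block 307
                correct += (logits.argmax(dim=1) == image_level_label).sum().item()
                total += image_level_label.numel()
            if correct / total >= accuracy_threshold:                            # block 308: stop when trained
                break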
[0035] The process flow diagram of Fig. 3 is not intended to indicate that the operations of the method 300 are to be executed in any particular order, or that all of the operations of the method 300 are to be included in every case. Additionally, the method 300 can include any suitable number of additional operations.
[0036] Fig. 4A illustrates example input images 400A-B, and Fig. 4B illustrates example multiple bounding boxes 401/402 that are annotated onto the example input images 400A-B. The input images 400A-B of Fig. 4A may each correspond to separate input images 201 that are used to train a DNN 203A-B according to method 300 of Fig.
3. Bounding boxes 401/402 as shown in Fig. 4B may be included in respective pixel level labels 210 corresponding to the input images 400 A-B. As shown in Fig. 4A, in some embodiments, the general features of two objects that are to be classified by the DNN 203 A-B may be similar, e.g. both objects may be rotationally symmetrical. To distinguish the two objects in different settings, multiple bounding box annotations (bounding boxes 401 corresponding to image 400A in Fig. 4B, and bounding boxes 402 corresponding to image 400B in Fig. 4B) that cover relevant features may be generated in block 301 of method 300.
[0037] Figs. 4A-B are shown for illustrative purposes only. For example, an input image may show any appropriate object, the multiple bounding boxes may have any appropriate number, and each bounding box may have any appropriate shape, size, and location within the input image.

[0038] Fig. 5 illustrates an example mask 500A and inverse mask 500B for embodiments of attention loss based DNN training. Mask 500A includes multiple bounding boxes 501, corresponding to bounding boxes 401 as shown in image 400A of Fig. 4B. Inverse mask 500B is the inverse of mask 500A. Inverse mask 500B also includes bounding boxes 502, and may be used by attention loss module 211 of Fig. 2 to determine the attention loss, in order to ensure that only areas outside of a bounding box incur a penalty. Areas in the attention map 209 corresponding to image 400A of Fig. 4A that are located outside of a bounding box 502 (e.g., the white area of inverse mask 500B) may be used to determine the attention loss in block 305 of method 300 of Fig. 3, while areas inside of a bounding box 502 (e.g., the black areas of inverse mask 500B) may not be used to determine the attention loss in block 305 of method 300 of Fig. 3.
The inverse mask 500B may be used to scale the attention map 209, which gives an attention map that is only non-zero outside of the mask. The attention loss that is determined in block 305 of method 300 of Fig. 3 may be the summation over all values in the scaled attention map.
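As a concrete, non-limiting Python sketch of how the mask 500A and the inverse mask 500B may be rasterized from the multiple bounding boxes of a pixel level label, assuming boxes given as (x_min, y_min, x_max, y_max) pixel coordinates (an assumption made only for illustration):

    import numpy as np

    def boxes_to_masks(boxes, height, width):
        # boxes: iterable of (x_min, y_min, x_max, y_max) bounding boxes (e.g., 401 or 402)
        mask = np.zeros((height, width), dtype=np.float32)     # mask 500A: 1 inside any bounding box
        for x_min, y_min, x_max, y_max in boxes:
            mask[y_min:y_max, x_min:x_max] = 1.0
        mask_inv = 1.0 - mask                                  # inverse mask 500B: 1 outside all boxes
        return mask, mask_inv

Scaling the attention map 209 elementwise with mask_inv then yields a map that is non-zero only in the penalty area, and summing its values gives the attention loss of block 305.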
[0039] Fig. 5 is shown for illustrative purposes only. For example, a mask and an inverse mask as are shown in Fig. 5 may each include any appropriate number of bounding boxes, each having any appropriate shape, size, and location within the mask.
[0040] Fig. 6 illustrates an example attention map 600A, and an example scaled attention map 600B, for attention loss based DNN training. Attention map 600A may correspond to attention map 209 of Fig. 2. Area 601 indicates an area of the image that was used by the DNN 203 A-B to classify the object in the image. Scaled attention map 600B illustrates application of inverse mask 602, which corresponds to a bounding box 502 as shown in inverse mask 500B of Fig. 5, to the attention map 600A. The penalty area 603, which is outside of the inverse mask 602 in image 600B, is used to calculate the attention loss by attention loss module 211 in block 305 of Fig. 3. While a single attention area and bounding box are shown in attention map 600A and scaled attention map 600B, this is for illustrative purposes only; an attention map such as attention map 600A may include any appropriate number of attention areas, which may be scaled as shown in scaled attention map 600B based on any appropriate inverse mask including any appropriate number of bounding boxes, each bounding box having any appropriate shape, size, and location.
[0041] Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the
functionality and/or processing capabilities described with respect to a particular system, system component, device, or device component may be performed by any other system, device, or component. Further, while various illustrative implementations and architectures have been described in accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. In addition, it should be appreciated that any operation, element, component, data, or the like described herein as being based on another operation, element, component, data, or the like may be additionally based on one or more other operations, elements, components, data, or the like. Accordingly, the phrase “based on,” or variants thereof, should be interpreted as “based at least in part on.”
[0042] The present disclosure may be a system, a method, apparatus, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
[0043] The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
[0044] Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
[0045] Computer readable program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
[0046] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
[0047] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a
programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
[0048] The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
[0049] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, apparatus, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
[0050] The descriptions of the various embodiments of the present techniques have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

CLAIMS
What is claimed is:
1. A system, comprising a processor configured to: receive an input image; receive a pixel level label corresponding to the input image; determine, by a deep neural network (DNN), a probable class of the input image; determine an attention map of the DNN corresponding to the probable class of the input image; determine an attention loss of the DNN based on the attention map and the pixel level label; and update weights of the DNN based on the attention loss.
2. The system of claim 1, wherein the probable class comprises a class that the DNN is being trained to recognize using the input image, and wherein the probable class is one of a predetermined number of classes that are recognized by the DNN.
3. The system of claim 1, wherein the attention map highlights one or more areas of the input image that were used by the DNN to determine the probable class.
4. The system of claim 3, wherein the pixel level label comprises a plurality of bounding boxes, each bounding box corresponding to a respective feature of an object depicted in the input image.
5. The system of claim 4, wherein determining the attention loss comprises: determining an inverse mask of the pixel level label; and scaling the attention map according to the inverse mask, wherein scaling the attention map according to the inverse mask comprises determining a number of pixels in the scaled attention map that are highlighted in the attention map and that are not inside a bounding box in the pixel level label.
6. The system of claim 1, wherein the attention map is determined based on feature maps and decision information corresponding to the input image from the DNN by one of Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++.
7. The system of claim 1, further configured to: receive an image level label corresponding to the input image; determine a classification loss based on a difference between the image level label and the probable class; and update the weights of the DNN based on a sum of the classification loss and the attention loss.
8. A computer-implemented method, comprising: receiving, by a processor of a computer, an input image; receiving, by the processor, a pixel level label corresponding to the input image; determining, by a deep neural network (DNN), a probable class of the input image; determining an attention map of the DNN corresponding to the probable class of the input image; determining an attention loss of the DNN based on the attention map and the pixel level label; and updating weights of the DNN based on the attention loss.
9. The computer-implemented method of claim 8, wherein the probable class comprises a class that the DNN is being trained to recognize using the input image, and wherein the probable class is one of a predetermined number of classes that are recognized by the DNN.
10. The computer-implemented method of claim 8, wherein the attention map highlights one or more areas of the input image that were used by the DNN to determine the probable class.
11. The computer-implemented method of claim 10, wherein the pixel level label comprises a plurality of bounding boxes, each bounding box corresponding to a respective feature of an object depicted in the input image.
12. The computer-implemented method of claim 11, wherein determining the attention loss comprises: determining an inverse mask of the pixel level label; and scaling the attention map according to the inverse mask, wherein scaling the attention map according to the inverse mask comprises determining a number of pixels in the scaled attention map that are highlighted in the attention map and that are not inside a bounding box in the pixel level label.
13. The computer-implemented method of claim 8, wherein the attention map is determined based on feature maps and decision information corresponding to the input image from the DNN by one of Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++.
14. The computer-implemented method of claim 8, further comprising: receiving an image level label corresponding to the input image; determining a classification loss based on a difference between the image level label and the probable class; and updating the weights of the DNN based on a sum of the classification loss and the attention loss.
15. A computer program product comprising: a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method comprising:
receiving an input image; receiving a pixel level label corresponding to the input image; determining, by a deep neural network (DNN), a probable class of the input image; determining an attention map of the DNN corresponding to the probable class of the input image; determining an attention loss of the DNN based on the attention map and the pixel level label; and updating weights of the DNN based on the attention loss.
16. The computer program product of claim 15, wherein the probable class comprises a class that the DNN is being trained to recognize using the input image, and wherein the probable class is one of a predetermined number of classes that are recognized by the DNN.
17. The computer program product of claim 15, wherein the attention map highlights one or more areas of the input image that were used by the DNN to determine the probable class.
18. The computer program product of claim 17, wherein the pixel level label comprises a plurality of bounding boxes, each bounding box corresponding to a respective feature of an object depicted in the input image.
19. The computer program product of claim 18, wherein determining the attention loss comprises: determining an inverse mask of the pixel level label; and scaling the attention map according to the inverse mask, wherein scaling the attention map according to the inverse mask comprises determining a number of pixels in the scaled attention map that are highlighted in the attention map and that are not inside a bounding box in the pixel level label.
20. The computer program product of claim 15, wherein the attention map is determined based on feature maps and decision information corresponding to the input image from the DNN by one of Gradient-weighted Class Activation Mapping (Grad-CAM) or Grad-CAM++.
PCT/US2019/031507 2018-06-12 2019-05-09 Attention loss based deep neural network training WO2019240900A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201862683844P 2018-06-12 2018-06-12
US201862683791P 2018-06-12 2018-06-12
US62/683,791 2018-06-12
US62/683,844 2018-06-12

Publications (1)

Publication Number Publication Date
WO2019240900A1 true WO2019240900A1 (en) 2019-12-19

Family

ID=66625399

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/031507 WO2019240900A1 (en) 2018-06-12 2019-05-09 Attention loss based deep neural network training

Country Status (1)

Country Link
WO (1) WO2019240900A1 (en)


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHENXI LIU ET AL: "Attention Correctness in Neural Image Captioning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 31 May 2016 (2016-05-31), XP081396672 *
JIANFENG WANG ET AL: "Face Attention Network: An Effective Face Detector for the Occluded Faces", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 20 November 2017 (2017-11-20), XP081290095 *
KUNPENG LI ET AL: "Tell Me Where to Look: Guided Attention Inference Network", 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 27 February 2018 (2018-02-27), pages 9215 - 9223, XP055604361, ISBN: 978-1-5386-6420-9, DOI: 10.1109/CVPR.2018.00960 *
TINGTING QIAO ET AL: "Exploring Human-like Attention Supervision in Visual Question Answering", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 September 2017 (2017-09-19), XP080817283 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046674B (en) * 2019-12-20 2024-05-31 科大讯飞股份有限公司 Semantic understanding method and device, electronic equipment and storage medium
CN111046674A (en) * 2019-12-20 2020-04-21 科大讯飞股份有限公司 Semantic understanding method and device, electronic equipment and storage medium
CN111259940A (en) * 2020-01-10 2020-06-09 杭州电子科技大学 Target detection method based on space attention map
CN111259940B (en) * 2020-01-10 2023-04-07 杭州电子科技大学 Target detection method based on space attention map
CN111538761A (en) * 2020-04-21 2020-08-14 中南大学 Click rate prediction method based on attention mechanism
US11475304B2 (en) * 2020-05-12 2022-10-18 International Business Machines Corporation Variational gradient flow
CN111783446A (en) * 2020-05-26 2020-10-16 华为技术有限公司 Method and device for processing sequence
WO2021238289A1 (en) * 2020-05-26 2021-12-02 华为技术有限公司 Sequence processing method and apparatus
CN111582409A (en) * 2020-06-29 2020-08-25 腾讯科技(深圳)有限公司 Training method of image label classification network, image label classification method and device
CN111582409B (en) * 2020-06-29 2023-12-26 腾讯科技(深圳)有限公司 Training method of image tag classification network, image tag classification method and device
CN112308129A (en) * 2020-10-28 2021-02-02 中国科学院宁波材料技术与工程研究所 Plant nematode data automatic labeling and classification identification method based on deep learning
CN112329659A (en) * 2020-11-10 2021-02-05 平安科技(深圳)有限公司 Weak supervision semantic segmentation method based on vehicle image and related equipment thereof
CN112329659B (en) * 2020-11-10 2023-08-29 平安科技(深圳)有限公司 Weak supervision semantic segmentation method based on vehicle image and related equipment thereof
CN112749667B (en) * 2021-01-15 2023-04-07 中国科学院宁波材料技术与工程研究所 Deep learning-based nematode classification and identification method
CN112749667A (en) * 2021-01-15 2021-05-04 中国科学院宁波材料技术与工程研究所 Deep learning-based nematode classification and identification method
CN113487506B (en) * 2021-07-06 2023-08-29 杭州海康威视数字技术股份有限公司 Attention denoising-based countermeasure sample defense method, device and system
CN113487506A (en) * 2021-07-06 2021-10-08 杭州海康威视数字技术股份有限公司 Countermeasure sample defense method, device and system based on attention denoising
CN113507466A (en) * 2021-07-07 2021-10-15 浙江大学 Method and system for defending backdoor attack by knowledge distillation based on attention mechanism

Similar Documents

Publication Publication Date Title
WO2019240900A1 (en) Attention loss based deep neural network training
WO2019240964A1 (en) Teacher and student based deep neural network training
CN108710885B (en) Target object detection method and device
US11586851B2 (en) Image classification using a mask image and neural networks
CN111435461B (en) Antagonistic input recognition using reduced accuracy deep neural networks
US20220058451A1 (en) Identifying a type of object in a digital image based on overlapping areas of sub-images
CN110706262B (en) Image processing method, device, equipment and storage medium
US10539881B1 (en) Generation of hotspot-containing physical design layout patterns
CN110349138B (en) Target object detection method and device based on example segmentation framework
CN113361593B (en) Method for generating image classification model, road side equipment and cloud control platform
US11195024B1 (en) Context-aware action recognition by dual attention networks
CN111767750A (en) Image processing method and device
CN112883818A (en) Text image recognition method, system, device and storage medium
CN112712036A (en) Traffic sign recognition method and device, electronic equipment and computer storage medium
US11423262B2 (en) Automatically filtering out objects based on user preferences
CN113901998A (en) Model training method, device, equipment, storage medium and detection method
US11295211B2 (en) Multi-scale object detection with a trained neural network
CN111783777B (en) Image processing method, apparatus, electronic device, and computer readable medium
US20210209414A1 (en) Defect detection using multiple models
US11562235B2 (en) Activation function computation for neural networks
CN112907575A (en) Face quality evaluation method and device and electronic equipment
US20230062313A1 (en) Generating 2d mapping using 3d data
CN115272705A (en) Method, device and equipment for training salient object detection model
CN115375657A (en) Method for training polyp detection model, detection method, device, medium, and apparatus
US11354793B2 (en) Object detection with missing annotations in visual inspection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19725592

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19725592

Country of ref document: EP

Kind code of ref document: A1