WO2020097461A1

WO2020097461A1 - Convolutional neural networks with reduced attention overlap

Info

Publication number: WO2020097461A1
Application number: PCT/US2019/060469
Authority: WO
Inventors: Lezi WANG; Kuan-Chuan Peng; Ziyan Wu
Original assignee: Siemens Aktiengesellschaft; Siemens Corporation
Priority date: 2018-11-08
Filing date: 2019-11-08
Publication date: 2020-05-14

Abstract

The ability of a particular convolutional neural network model to predict labels or classes of test data can be referred to as the model's generalization ability. For example, it can be said that a model that performs well at predicting labels or classes is a model that generalizes well, or simply, that it is a good model. Current approaches to training and testing CNN models result in CNN models that lack generalization capabilities. Embodiments of the invention improve generalization, among addressing other issues, of CNN models by reducing attention overlap between target and confused classes during training.

Description

CONVOLUTIONAL NEURAL NETWORKS WITH REDUCED ATTENTION OVERLAP

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application Serial No. 62/757,203 filed November 8, 2018, the disclosure of which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

[0002] This application relates to neural networks. The technology described herein is particularly well-suited for, but not limited to, computer vision tasks that include attention maps.

BACKGROUND

[0003] Convolutional Neural Networks (CNNs) and other deep networks have enabled unprecedented breakthroughs in a variety of medical imaging and computer vision tasks, such as image classification, object detection, semantic segmentation, image captioning, and the like. CNNs typically include a plurality of processing elements arranged in layers. For example, a CNN often includes an input layer, an output layer, and any appropriate number of intermediate layers between the input layer and the output layer. Each layer can be composed of neurons that are connected to neurons of other layers, for instance adjacent layers. Each connection can have a numerical weight associated with it that signifies its importance. A particular layer of a CNN may generate an output that is determined based on a weighted sum of any inputs that the particular layer receives. The inputs to a layer of a CNN are provided from the input to the CNN, or from the output of another layer of the CNN.

l [0004] Neural networks are typically trained with training data, then tested with test data. In an example classification problem, during training, a learning algorithm takes a collection of labeled examples as inputs and produces a model, for instance a CNN model. During testing, the model can take an unlabeled example as input and either directly output a label (or class) or output a number (e.g., probability or class score) that can be used to deduce the label. In a typical classification problem, a label is a member of a finite set of classes, and thus the label can refer to a class, classification, category, or the like. Similarly, regression can refer to the problem of predicting a real-valued label, which can also be referred to as a target, given an unlabeled example. A common regression example is predicting a house price valuation based on house features such as location, number of bedrooms, number of bathrooms, and the like. During training for an example regression problem, a collection of labeled examples is used as inputs into a regression learning algorithm to produce a model, for instance a CNN model. During testing, the model can take unlabeled examples (test data) as inputs and output a respective target. The ability of a particular CNN model to predict labels or classes of test data can be referred to as the model’s generalization ability.

For example, it can be said that a model that performs well at predicting labels or classes is a model that generalizes well, or simply, that the model is a good model. Current approaches to training and testing CNN models result in CNN models that lack generalization capabilities.

SUMMARY

[0005] Embodiments of the invention address and overcome one or more of the described-herein shortcomings by providing methods, systems, and apparatuses that improve generalization of CNN models by reducing attention overlap between target and confused classes.

[0006] In an example aspect, a CNN is configured to classify images. The CNN includes an input layer configured to receive an image, and an output layer configured to generate class scores associated with the image. Based on the class scores, the CNN can classify the received image. The class scores include a target class score associated with a correct classification of the image, and a confused class score associated with an incorrect classification of the image. The CNN further includes a plurality of intermediate layers connected between the input layer and the output layer.

A select intermediate layer of the plurality of intermediate layers is configured to generate a target attention map associated with the target class score, and a confused attention map associated with the confused class score. The CNN further includes an attention overlap module configured to compare the target attention map to the confused attention map so as to determine a region (an overlapping region) in which the target attention map and the confused attention map overlap with each other. The attention overlap module can be further configured to reduce the overlapping region so as to improve a generalization ability of the convolutional neural network.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:

[0008] FIG. 1 is a block diagram of a convolutional neural network (CNN) model or system in accordance with various example embodiments.

[0009] FIG. 2 is a flow diagram for training a CNN, for instance the CNN depicted in FIG. 1 , in accordance with an example embodiment.

[0010] FIG. 3 shows an example of a computing environment within which embodiments of the disclosure may be implemented.

DETAILED DESCRIPTION

[0011] Embodiments of the invention address and overcome one or more of the described-herein shortcomings or technical problems by providing methods, systems, and apparatuses that reduce attention overlap between target and confused classes, so as to improve the generalization ability of various CNN models.

[0012] In an image classification example, it is observed herein that greater than 70% of failures of an example CNN model result from the CNN attending to the same image regions to determine that the image is in the target (correct) class or a confused (incorrect) class. That is, there can be significant overlaps between the attention of the ground-truth class and the false positives. It is further recognized herein that false (incorrect) classifications can stem from patterns across classes that confuse the model. Further still, it is recognized herein that eliminating these confusions can lead to better model discriminability. In light of the above-mentioned observed example, among others, embodiments are described herein for reducing attention overlap so as to provide discriminative attention, which can improve generalization. Generalization in this context generally refers to how well a particular model can classify new examples (e.g., images or other test data) in the future.

[0013] In an example, a category-oriented attention map, or an attention map that is specific to a particular classification or category, is generated that is trainable. In particular, a CNN can be trained so as to minimize a soft-lntersection-over-Union (sloU) loss in order to reduce the regions in which multiple confused categories show a high activation response. As used herein, unless otherwise specified, class activation maps and attention maps can be used interchangeably, without limitation. A given class activation map or attention map can indicate which regions in a given image were relevant to a classification of the image. By way of example, regions that are relied upon in a given classification can indicate a high activation response in a class activation or attention map.

[0014] In accordance with various embodiments, class activation maps are generated. The maps can provide a visual indication of the attention of a given model, and thus can also be referred to as attention maps, as described above. The attention maps can also be used to compute the SloU loss. In accordance with various examples described herein, the SloU loss is differentiable and can be minimized so as to decrease confusion between the target and other classes. Furthermore, various embodiments described herein can be applied to CNN models, for instance any CNN, without changing the network architecture of the respective CNN, so as to improve the generalization of the respective CNN.

[0015] Referring now to FIG. 1 , an example system or CNN model 100 can be configured to learn and classify data, such as images for example, in accordance with various example embodiments. The CNN 100 includes a plurality of layers, for instance an input layer 102a configured to receive an image, an output layer 103b configured to generate class scores associated with the image, and a plurality of intermediate layers connected between the input layer 102a and the output layer 103b. In particular, the intermediate layers and the input layer 102a can define a plurality of convolutional layers 102. The intermediate layers can further include one or more fully connected layers 103. The convolution layers 102 can include the input layer 102a configured to receive training and test data, such as images. The convolutional layers 102 can further include a final convolutional or last feature layer 102c, and one or more intermediate or second convolutional layers 102b disposed between the input layer 102a and the final convolutional layer 102c. It will be understood that the illustrated CNN model 100 is simplified for purposes of example. In particular, for example, CNN models may include any number of layers as desired, in particular any number of intermediate layers, and all such models are contemplated as being within the scope of this disclosure.

[0016] The fully connected layers 103, which can include a first layer 103a and a second or output layer 103b, include connections between layers that are fully connected. For example, a neuron in the first layer 103a may communicate its output to every neuron in the second layer 103b, such that each neuron in the second layer 103b will receive input from every neuron in the first layer 103a. It will again be understood that the CNN model is simplified for purposes of explanation, and that the CNN model 100 is not limited to the number of illustrated fully connected layers 103. In contrast to the fully connected layers, the convolutional layers 102 may be locally connected, such that, for example, the neurons in the intermediate layer 102b might be connected to a limited number of neurons in the final convolutional layer 102c. The convolutional layers 102 can also be configured to share connections strengths associated with the strength of each neuron. [0017] Still referring to FIG. 1 , the input layer 102a can be configured to receive inputs 104, for instance an image 104, and the output layer 103b can be configured to return an output 106. The output 106 can include a classification associated with the input 104. For example, the output 106 can include an output vector that indicates a plurality of class scores 108 for various classes or categories. Thus, the output layer 103b can be configured to generate class scores 108 associated with the image 104. The class scores 108 can include a target class score 108a associated with a correct classification of the image 104. The class scores 108 can further include one or more confused or incorrect class scores, for instance a confused class score 108b, associated with an incorrect classification of the image 104. In the illustrated example, the target class score 108a corresponds to the“bed” classification, which is the correct label for the example image 104. Further, in the illustrated example, the example confused class score 108b corresponds to the“sofa” classification, which is an incorrect or confused label for the example image 104. Although the confused score 108b represents the incorrect classification, in some cases, the confused score 108b can be near or higher than the target class score 108a for a particular input (test data). In such a case, the CNN model may predict an incorrect classification for the particular input.

[0018] In an example, the class scores 108 include class scores other than the confused class score 108b that are also associated with incorrect classifications of the image 104. In some cases, the confused class score 108b defines a classification probability that is higher than a probability defined by any of the other class scores associated with incorrect classifications. Thus, an attention map associated with the confused class score 108b may be a candidate for reducing the attention overlap with an attention map associated with the target class score 108a, as further described herein.

[0019] The input 104 is referred to also referred to as the image 104 for purposes of example, but embodiments are not so limited. The input 104 can include a vectorized input. By way of example, the input 104 can be a medical image (e.g., 2D or 3D) that is classified with respect to healthy and diseased structures. By way of further example, without limitation the input 104 can be an industrial image, for instance an image that includes a part that is classified so as to identify the part for an assembly. It will be understood that the CNN model 100 can provide visual recognition and classification of various objects and/or images captured by various sensors or cameras, and all such objects and images are contemplated as being within the scope of this disclosure. The processing of each layer of the CNN 100 may be considered a spatially invariant template or basis projection. If the input 104 is first decomposed into multiple channels, such as the red, green, and blue (RGB) channels of a color image, then the CNN 100 that is trained on that input 104 may be considered three-dimensional (3D), with two spatial dimensions along the axes of the image and a third dimension capturing color information. The outputs of the convolutional connections can be considered to form a feature map in the subsequent layer. For example, the convolutional layers 102a-c may be considered feature maps, wherein each element of the respective feature map receives input from a range of neurons in the previous channel and from each of the multiple channels. The values in the feature map may be further processed with a non- linearity, such as a rectification. Values from adjacent neurons may be further pooled, which can correspond to down sampling, and may provide additional local invariance and dimensionality reduction.

[0020] In an example, the input 104 is an image, such that that each input to a convolutional layer 102 is an input image. The convolutional layers 102 can include one or more filters that are applied to respective input data to generate the respective feature map. The input 104 may be convolved with filters that are set during the training stage of the CNN 100. In some cases, the outputs of the layers 102 detail the image structures (e.g., features) that best matched the filters of the respective layer, thereby identifying those image structures. In an example, each of the layers 102 in the CNN 100 detect image structures in an escalating manner, such that the deeper layers detect features having greater complexity. By way of example, the layer 102b may detect edges, and the layer 102c that follows the layer 102b, and thus is deeper than the layer 102b, may detect other object attributes, such as curvature and texture.

[0021] Turning now to generating attention maps, in accordance with various embodiments, category-specific attention maps are generated and improved upon during training of a given CNN. With respect to a specific layer 102, for instance any or all of the layers 102 of the CNN 100, a category-specific attention map ( Att^c ) can be generated by solving equation (1 ) below.

With respect to equation (1 ), and referring additionally to FIG. 1 , the CNN 100 can generate a class score Y^c (class scores 108 in FIG. 1 ) for the input 104, wherein the class score corresponds to a particular classification or class C. By way of example, referring to FIG. 1 , and without limitation, given the illustrated input 104, the CNN 100 can generate the target class score 108a that corresponds to the“bed” class. By way of further example, given the illustrated input 104, the CNN 100 can generate the class score 108b that corresponds to the“sofa” class 108b. A gradient of the class score Y^c can be computed by taking the derivative of the class score with respect to the feature map A^k . In particular, given the class score Y^c for the class C and the feature map A^k in the /c-th channel, the class-specific gradient can be determined by computing the partial

3Y^C

derivative Thus, in equation (1 ), the gradient of the class score is represented as

3Y^C

By way of example, and without limitation, the gradient can be determined with respect to the feature map of the final convolutional layer 102c. Continuing with that example, k can additionally, or alternatively, be considered as the feature index for the

3Y^C

final convolutional layer 102c. Thus, the class-specific gradient

can indicate how the class score changes in response to a change of a filter or weight in the layer 102c. That is, the impact of the filter or weight on the output 106, in particular the impact the filter or weight has on classification decision within the output 106, can be determined. In some cases, the higher the gradient, the higher the particular filter or weight has on the score, and thus the classification decision.

[0022] Still referring to equation (1 ), w represents the channel-wise importance.

That is, can indicate the importance of the feature map A^k in the /c-th channel, which can be determined by solving equation (2) below.

With respect to equation (2), Z represents the number of pixels in the feature map A^k, and (i,y) represents the position of a pixel, such that

represents the pixel in position

( i,j ) of feature map A^k , and— _F represents the pixel importance. Thus, the channel- d^Aij

wise importance or weight w can be a global average of the pixel importance.

[0023] Referring to equations (1 ) and (2), the ReLu operation preserves the positive values and sets the negative as zero. Without being bound by theory, the ReLu operation is applied so that there is no impact from negative gradients, because an objective, in some cases, is change (train) the CNN 100 so that a given class score is increased. A positive gradient at a specific location can imply that increasing the pixel intensity A^k will have a positive impact on the class score Y^c. As shown in equation (2), the positive gradients can be averaged to obtain the channel-wise importance.

[0024] Thus, with reference to equation (1 ), in accordance with an example embodiment, a weighted combination of the feature maps is calculated for a particular layer 102, for instance the final convolutional layer 102c, to generate the category- oriented attention map Att^c, which incorporates both pixel and channel wise

importance. One or more of the layers of the CNN system 100, for instance one or more of the convolutional layers 102, can generate a plurality of features maps, and can compute a weighted combination of the features maps so as to generate a target attention map ( Att ^*) and a confused attention map (Att^conf). In particular, in

accordance with the example represented by equation (1 ), the (g) operation represents an element wise multiplication between gradient ^ and feature map A*, which captures the pixel-wise importance. Thus, one or more of the layers of the CNN system 100, for instance one or more of the convolutional layers 102, can be configured to perform an element wise multiplication between each feature map and a gradient of its respective class score so as to define the pixel-wise importance, wherein the gradient of the class score can be defined as a derivative of the class score with respect to the feature map. As an example, referring to FIG. 1 , the target attention map can be associated with the target class score 108a, and the confused attention map can be associated with the confused class score 108b. The ReLu operation can be applied to the combination of feature maps because, in some cases, there is only an interest in features that have a positive influence on the class of interest (e.g., target 108a). In particular, one or more of the convolutional layers 102 can be configured to apply a rectified linear unit ( ReLu ) activation function or operation on the pixel-wise importance. Consequently, in some cases, only regions having a positive impact on the class score are kept, such that the different attention maps from different layers each have a high activation response on the same target object.

[0025] Turning now to removing attention overlaps, in some cases, the computed attention maps can provide a visualization of the neuron’s attention, and can also be utilized to update a given CNN, for example, due to its differentiability. Given attention maps of a target class (Att^*), which can also be referred to herein as the target attention map, and attention maps of a confused class (Att^conf), which can also be referred to herein as the confused attention map, the attention overlap can be measured by computing the intersection-over-union (loU) of the attention maps. Referring to FIG. 1 , as an example, the CNN 100 can include an attention overlap module 1 10. The attention overlap module can be configured to compare the target attention map to the confused attention map so as to determine region in which the target attention map and the confused attention map overlap with each other. The attention overlap module 1 10 can be further configured, for instance during training of the CNN 100, to reduce the region in which the target attention map and the confused attention map overlap with each other. In some cases, the attention overlap module 1 10 is further configured to reduce the region in which the target attention map and the confused attention map overlap with each other so as to improve a generalization ability of the CNN 100, wherein the generalization ability is defined as a percentage of test data that is correctly classified by the CNN 100. For example, the attention overlap module 1 10 can reduce the overlapping region of the confused and target attention maps to increase the percentage of test data (e.g., future inputs or images) that is correctly classified by the CNN 100.

[0026] The module 1 10 can be configured to receive or generate attention maps from any of the layers, for instance any of all of the convolutional layers 102. In an example, the attention overlap module 1 10 operates on the maps associated with a select one of the intermediate layers of the CNN 100. For example, the select intermediate layer can be the final convolutional layer 102c, such that the target attention map is from the final convolutional layer 102c and the confused attention map is from the from the final convolutional layer 102c. In some cases, the confused class that is selected is the confused class that has the confused class score 108b that is closest to, or higher than, the class score 108a of the target class as compared to the other confused classes. The module 1 10 can compare the attention maps to determine their loU. In particular, the area of intersection of the two attention maps (the attention map that corresponds to the target class having the target class score 108b and the attention map that corresponds to the confused class having the confused class score 108b) can be divided by their combined area (or area of union), so as generate their loU.

[0027] It is recognized herein, however, that the traditional and above-described loU is computed on attention maps having binary values, which are not differentiable. Thus, in accordance with various embodiments, instead of computing a typical loU, the attention overlay module 1 10 receives or computes attention maps that include real values, and uses them to compute a soft Intersection-of-Union (sloU) loss:

With respect to equation (3), Attl_j ^and Att^°^nf represent the (i,y) pixel in the target and confused attention maps Att^* and Att^conf , respectively. In accordance with the example, the sloU is differentiable and its value is within a range from 0 to 1. By way of example, two attention maps, for instance the maps that correspond to the target class 108a and the confused class 108b, that include respective high responses at the same locations have a high sloU. Minimizing the sloU can rectify the attention of the CNN 100, thereby reducing attention overlap between target and confused categories. Thus, the attention overlap module 1 10 can be further configured to quantify and reduce the region in which the target attention map and the confused attention map overlap with each other. For example, the attention overlap module 1 10 can be further configured to reduce the region by computing a soft-intersection-over-union associated with the target attention map and the confused attention map, wherein the soft-intersection-over-union is differentiable and defines a value between 0 and 1.

[0028] In an example, referring again to FIG. 1 , the sloU is implemented in an add- on module (attention overlap module 1 10) during training of the CNN 100, without changing the network architecture of the CNN 100. Thus, in some cases, the attention overlap module 1 10 can be added to any CNN without changing the architecture of the CNN. As shown, the attention overlay module 1 10, and thus the sloU, can be applied to the last feature layer of a given CNN, for instance the layer 102c of the CNN model 100. Additionally, or alternatively, the sloU can be computed for any other layers of a given CNN, for instance the layers 102a and 102b of the CNN 100. Thus, the attention of a given CNN can be checked and/or modified in different scales, for instance any scale as desired. In particular, the select intermediate layer with which the attention overlap module 1 10 operates can be one of the convolutional layers 102c that is disposed closer to the output layer 103b than the other convolutional layers 102, such that the select intermediate layer defines the final feature layer. In some cases, for example, the plurality of intermediate layers further includes at least one fully connected layer that is adjacent to the final feature layer, such that no other convolutional layers 102 are between the final feature layer and the at least one fully connected layer. Alternatively, or additionally, the select intermediate layer can be disposed closer to the input layer 102a that the other intermediate layers. Alternatively still, or additionally, the attention overlap mode 1 10 can operate with the input layer 102a.

[0029] Without being bound by theory, by applying the attention of the ground-truth class to the attention of the most confusing class to the attention overlap module 1 10, wherein the most confusing class can be defined as the class with non-ground truth class with the highest classification probability, the two attentions can be forced to be separated. In particular, during training, the overlapping regions of the attention maps can be quantified and minimized by the attention overlap module 1 10. The overlapping regions are penalized by explicitly reducing the overlap between the target and a confusing class, for instance the most confusing class. More particularly, the SloU is differentiable, and thus can be minimized. [0030] Referring now to FIG. 2, an example method for training a CNN, for instance the CNN system 100 is shown. At 202, the CNN receives an image. At 204, the CNN determines class scores for the image. The class scores can correspond to a predetermined number of classifications that are associated with the image, such that each class score corresponds to a respective classification in the predetermined number of classifications. At 206, the CNN identifies a target class score. The target class score is associated with a correct classification of the image. At 208, the CNN identifies a confused class score. The confused class score is associated with an incorrect classification of the image. At 210, the CNN generates a target attention map associated with the target class score, and a confused attention map associated with the confused class score. At 212, the CNN determines a region in which the target attention map and the confused attention map overlap with each other. Further, at 212, the CNN reduces the overlapping region so as to improve a generalization ability of the convolutional neural network. Reducing the overlapping region can further include computing a soft-intersection-over-union associated with the target attention map and the confused attention map, wherein the soft-intersection-over-union is differentiable and defines a value between 0 and 1. The training of the CNN can further include quantifying the region in which the target attention map and the confused attention map overlap with each other.

[0031] With continuing reference to FIG. 2, the class scores that are determined at 202 can further include other class scores associated with incorrect classifications of the image. Further, the confused class score that is identified at 208 can define a classification probability that is higher than a probability defined by any of the other class scores, thought it will be understood that other confused class scores can be identified at 208, as desired.

[0032] FIG. 3 illustrates an example of a computing environment within which embodiments of the present disclosure may be implemented. A computing environment 500 includes a computer system 510 that may include a communication mechanism such as a system bus 521 or other communication mechanism for communicating information within the computer system 510. The computer system 510 further includes one or more processors 520 coupled with the system bus 521 for processing the information.

[0033] The processors 520 may include one or more central processing units (CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as described herein is a device for executing machine- readable instructions stored on a computer readable medium, for performing tasks and may comprise any one or combination of, hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device. A processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general purpose computer. A processor may include any type of suitable processing unit including, but not limited to, a central processing unit, a microprocessor, a Reduced Instruction Set Computer (RISC) microprocessor, a

Complex Instruction Set Computer (CISC) microprocessor, a microcontroller, an

Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a System-on-a-Chip (SoC), a digital signal processor (DSP), and so forth. Further, the processor(s) 520 may have any suitable microarchitecture design that includes any number of constituent components such as, for example, registers, multiplexers, arithmetic logic units, cache controllers for controlling read/write operations to cache memory, branch predictors, or the like. The microarchitecture design of the processor may be capable of supporting any of a variety of instruction sets. A processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between. A user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof. A user interface comprises one or more display images enabling user interaction with a processor or other device. [0034] The system bus 521 may include at least one of a system bus, a memory bus, an address bus, or a message bus, and may permit exchange of information (e.g., data (including computer-executable code), signaling, etc.) between various

components of the computer system 510. The system bus 521 may include, without limitation, a memory bus or a memory controller, a peripheral bus, an accelerated graphics port, and so forth. The system bus 521 may be associated with any suitable bus architecture including, without limitation, an Industry Standard Architecture (ISA), a Micro Channel Architecture (MCA), an Enhanced ISA (EISA), a Video Electronics Standards Association (VESA) architecture, an Accelerated Graphics Port (AGP) architecture, a Peripheral Component Interconnects (PCI) architecture, a PCI-Express architecture, a Personal Computer Memory Card International Association (PCMCIA) architecture, a Universal Serial Bus (USB) architecture, and so forth.

[0035] Continuing with reference to FIG. 3, the computer system 510 may also include a system memory 530 coupled to the system bus 521 for storing information and instructions to be executed by processors 520. The system memory 530 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 531 and/or random access memory (RAM) 532. The RAM 532 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM). The ROM 531 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM). In addition, the system memory 530 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 520. A basic input/output system 533 (BIOS) containing the basic routines that help to transfer information between elements within computer system 510, such as during start-up, may be stored in the ROM 531 . RAM 532 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 520. System memory 530 may additionally include, for example, operating system 534, application programs 535, and other program modules 536. Application programs 535 may also include a user portal for development of the application program, allowing input parameters to be entered and modified as necessary. [0036] The operating system 534 may be loaded into the memory 530 and may provide an interface between other application software executing on the computer system 510 and hardware resources of the computer system 510. More specifically, the operating system 534 may include a set of computer-executable instructions for managing hardware resources of the computer system 510 and for providing common services to other application programs (e.g., managing memory allocation among various application programs). In certain example embodiments, the operating system 534 may control execution of one or more of the program modules depicted as being stored in the data storage 540. The operating system 534 may include any operating system now known or which may be developed in the future including, but not limited to, any server operating system, any mainframe operating system, or any other proprietary or non-proprietary operating system.

[0037] The computer system 510 may also include a disk/media controller 543 coupled to the system bus 521 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 541 and/or a removable media drive 542 (e.g., floppy disk drive, compact disc drive, tape drive, flash drive, and/or solid state drive). Storage devices 540 may be added to the computer system 510 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire). Storage devices 541 , 542 may be external to the computer system 510.

[0038] The computer system 510 may also include a field device interface 565 coupled to the system bus 521 to control a field device 566, such as a device used in a production line. The computer system 510 may include a user input interface or GUI 561 , which may comprise one or more input devices, such as a keyboard, touchscreen, tablet and/or a pointing device, for interacting with a computer user and providing information to the processors 520.

[0039] The computer system 510 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 520 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 530. Such instructions may be read into the system memory 530 from another computer readable medium of storage 540, such as the magnetic hard disk 541 or the removable media drive 542. The magnetic hard disk 541 and/or removable media drive 542 may contain one or more data stores and data files used by

embodiments of the present disclosure. The data store 540 may include, but are not limited to, databases (e.g., relational, object-oriented, etc.), file systems, flat files, distributed data stores in which data is stored on more than one node of a computer network, peer-to-peer network data stores, or the like. The data stores may store various types of data such as, for example, skill data, sensor data, or any other data generated in accordance with the embodiments of the disclosure. Data store contents and data files may be encrypted to improve security. The processors 520 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 530. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.

[0040] As stated above, the computer system 510 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term“computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 520 for execution. A computer readable medium may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 541 or removable media drive 542. Non-limiting examples of volatile media include dynamic memory, such as system memory 530. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 521. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.

[0041] Computer readable medium instructions for carrying out operations of the present disclosure may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar

programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example,

programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.

[0042] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer readable medium instructions.

[0043] The computing environment 500 may further include the computer system 510 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 580. The network interface 570 may enable communication, for example, with other remote devices 580 or systems and/or the storage devices 541 , 542 via the network 571. Remote computing device 580 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 510. When used in a networking environment, computer system 510 may include modem 672 for establishing communications over a network 571 , such as the Internet. Modem 672 may be connected to system bus 521 via user network interface 570, or via another appropriate mechanism.

[0044] Network 571 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of

connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 510 and other computers (e.g., remote computing device 580). The network 571 may be wired, wireless or a

combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art.

Wireless connections may be implemented using Wi-Fi, WiMAX, and Bluetooth, infrared, cellular networks, satellite or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 571.

[0045] It should be appreciated that the program modules, applications, computer- executable instructions, code, or the like depicted in FIG. 3 as being stored in the system memory 530 are merely illustrative and not exhaustive and that processing described as being supported by any particular module may alternatively be distributed across multiple modules or performed by a different module. In addition, various program module(s), script(s), plug-in(s), Application Programming Interface(s) (API(s)), or any other suitable computer-executable code hosted locally on the computer system 510, the remote device 580, and/or hosted on other computing device(s) accessible via one or more of the network(s) 571 , may be provided to support functionality provided by the program modules, applications, or computer-executable code depicted in FIG. 3 and/or additional or alternate functionality. Further, functionality may be modularized differently such that processing described as being supported collectively by the collection of program modules depicted in FIG. 3 may be performed by a fewer or greater number of modules, or functionality described as being supported by any particular module may be supported, at least in part, by another module. In addition, program modules that support the functionality described herein may form part of one or more applications executable across any number of systems or devices in accordance with any suitable computing model such as, for example, a client-server model, a peer- to-peer model, and so forth. In addition, any of the functionality described as being supported by any of the program modules depicted in FIG. 3 may be implemented, at least partially, in hardware and/or firmware across any number of devices.

[0046] It should further be appreciated that the computer system 510 may include alternate and/or additional hardware, software, or firmware components beyond those described or depicted without departing from the scope of the disclosure. More particularly, it should be appreciated that software, firmware, or hardware components depicted as forming part of the computer system 510 are merely illustrative and that some components may not be present or additional components may be provided in various embodiments. While various illustrative program modules have been depicted and described as software modules stored in system memory 530, it should be appreciated that functionality described as being supported by the program modules may be enabled by any combination of hardware, software, and/or firmware. It should further be appreciated that each of the above-mentioned modules may, in various embodiments, represent a logical partitioning of supported functionality. This logical partitioning is depicted for ease of explanation of the functionality and may not be representative of the structure of software, hardware, and/or firmware for implementing the functionality. Accordingly, it should be appreciated that functionality described as being provided by a particular module may, in various embodiments, be provided at least in part by one or more other modules. Further, one or more depicted modules may not be present in certain embodiments, while in other embodiments, additional modules not depicted may be present and may support at least a portion of the described functionality and/or additional functionality. Moreover, while certain modules may be depicted and described as sub-modules of another module, in certain embodiments, such modules may be provided as independent modules or as sub-modules of other modules.

[0047] Although specific embodiments of the disclosure have been described, one of ordinary skill in the art will recognize that numerous other modifications and alternative embodiments are within the scope of the disclosure. For example, any of the functionality and/or processing capabilities described with respect to a particular device or component may be performed by any other device or component. Further, while various illustrative implementations and architectures have been described in

accordance with embodiments of the disclosure, one of ordinary skill in the art will appreciate that numerous other modifications to the illustrative implementations and architectures described herein are also within the scope of this disclosure. In addition, it should be appreciated that any operation, element, component, data, or the like described herein as being based on another operation, element, component, data, or the like can be additionally based on one or more other operations, elements, components, data, or the like. Accordingly, the phrase“based on,” or variants thereof, should be interpreted as“based at least in part on.”

[0048] Although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the disclosure is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as illustrative forms of implementing the embodiments. Conditional language, such as, among others,“can,”“could,”“might,” or“may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments could include, while other embodiments do not include, certain features, elements, and/or steps. Thus, such conditional language is not generally intended to imply that features, elements, and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements, and/or steps are included or are to be performed in any particular embodiment.

[0049] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

Claims

CLAIMS What is claimed is:

1. A convolutional neural network system configured to classify images, the convolutional neural network comprising:

an input layer configured to receive an image;

an output layer configured to generate class scores associated with the image, the class scores comprising a target class score associated with a correct classification of the image, and a confused class score associated with an incorrect classification of the image;

a plurality of intermediate layers connected between the input layer and the output layer, a select intermediate layer of the plurality of intermediate layers configured to generate a target attention map associated with the target class score, and a confused attention map associated with the confused class score; and

an attention overlap module configured to:

compare the target attention map to the confused attention map so as to determine a region in which the target attention map and the confused attention map overlap with each other; and

reduce the region in which the target attention map and the confused attention map overlap with each other.

2. The convolutional neural network system as recited in claim 1 , wherein the attention overlap module is further configured to reduce the region in which the target attention map and the confused attention map overlap with each other so as to improve a generalization ability of the convolutional neural network, the generalization ability defined as a percentage of test data that is correctly classified by the convolutional neural network.

3. The convolutional neural network system as recited in claim 1 , wherein the attention overlap module is further configured to quantify the region in which the target attention map and the confused attention map overlap with each other.

4. The convolutional neural network system as recited in claim 1 , wherein the attention overlap module is further configured to reduce the region by computing a soft- intersection-over-union associated with the target attention map and the confused attention map, wherein the soft-intersection-over-union is differentiable and defines a value between 0 and 1 .

5. The convolutional neural network system as recited in claim 1 , wherein the class scores further comprise other class scores associated with incorrect classifications of the image, and the confused class score defines a classification probability that is higher than a probability defined by any of the other class scores.

6. The convolutional neural network system as recited in claim 1 , wherein the plurality of intermediate layers comprise a plurality of convolutional layers, and the select intermediate layer is one of the convolutional layers, the select intermediate layer disposed closer to the output layer than the other convolutional layers, such that the select intermediate layer defines a final feature layer.

7. The convolutional neural network system as recited in claim 6, wherein the plurality of intermediate layers further comprise at least one fully connected layer, the at least one fully connected layer adjacent to the final feature layer, such that no other convolutional layers are between the final feature layer and the at least one fully connected layer.

8. The convolutional neural network system as recited in claim 1 , wherein the plurality of intermediate layers comprises a plurality of convolutional layers, and the select intermediate layer is one of the convolutional layers, the select intermediate layer disposed closer to the input layer than the other intermediate layers.

9. The convolutional neural network system as recited in claim 1 , wherein the select intermediate layer is further configured to:

generate a plurality of feature maps; and

compute a weighted combination of the feature maps so as to generate the target attention map and the confused attention map.

10. The convolutional neural network system as recited in claim 9, wherein the select intermediate layer is further configured to perform an element wise multiplication between each feature map and a gradient of its respective class score so as to define a pixel-wise importance, the gradient of the class score defined as a derivative of the class score with respect to the feature map.

1 1 . The convolutional neural network system as recited in claim 10, wherein the select intermediate layer is further configured to apply a rectified linear unit activation function on the pixel-wise importance.

12. A method of training a convolutional neural network, the method comprising: receiving an image;

determining class scores for the image, the class scores corresponding to a predetermined number of classifications that are associated with the image, such that each class score corresponds to a respective classification in the predetermined number of classifications;

identifying a target class score, the target class score associated with a correct classification of the image;

identifying a confused class score, the confused class score associated with an incorrect classification of the image; generating a target attention map associated with the target class score, and a confused attention map associated with the confused class score;

determining a region in which the target attention map and the confused attention map overlap with each other; and

reducing the region in which the target attention map and the confused attention map overlap with each other.

13. The method as recited in claim 12, wherein the reducing the region further comprises improving a generalization ability of the convolutional neural network, the generalization ability defined as a percentage of test data that is correctly classified by the convolutional neural network, such that improving the generalization ability comprises increasing the percentage of the test that is correctly classified by the convolution neural network.

14. The method as recited in claim 12, the method further comprising:

quantifying the region in which the target attention map and the confused attention map overlap with each other.

15. The method as recited in claim 12, wherein reducing the region further comprises computing a soft-intersection-over-union associated with the target attention map and the confused attention map, wherein the soft-intersection-over-union is differentiable and defines a value between 0 and 1.

16. The method as recited in claim 12, wherein the class scores further comprise other class scores associated with incorrect classifications of the image, and the confused class score defines a classification probability that is higher than a probability defined by any of the other class scores.

17. The method as recited in claim 12, the method further comprising: generating a plurality of feature maps; and

computing a weighted combination of the feature maps so as to generate the target attention map and the confused attention map.

18. The method as recited in claim 17, the method further comprising performing an element wise multiplication between each feature map and a gradient of its respective class score so as to define a pixel-wise importance, the gradient of the class score defined as a derivative of the class score with respect to the feature map.

19. The method as recited in claim 18, the method further comprising applying a rectified linear unit activation function on the pixel-wise importance.

20. A computer program product comprising:

a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processing device to cause the processing device to perform a method comprising:

receiving an image;

identifying a confused class score, the confused class score associated with an incorrect classification of the image;

generating a target attention map associated with the target class score, and a confused attention map associated with the confused class score;

determining a region in which the target attention map and the confused attention map overlap with each other; and reducing the region in which the target attention map and the confused attention map overlap with each other, so as to improve the generalization ability of a convolutional neural network.