WO2021194490A1 - Method and system for improved attention map guidance for visual recognition in images - Google Patents

Method and system for improved attention map guidance for visual recognition in images

Info

Publication number
WO2021194490A1
Authority
WO
WIPO (PCT)
Prior art keywords
neural network
produce
image
attention
attention map
Prior art date
Application number
PCT/US2020/024807
Other languages
French (fr)
Inventor
Kunpeng LI
Kuan-Chuan Peng
Ziyan Wu
Jan Ernst
Original Assignee
Siemens Aktiengesellschaft
Siemens Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft and Siemens Corporation
Priority to PCT/US2020/024807
Publication of WO2021194490A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/06 - Recognition of objects for industrial automation

Definitions

  • This application relates to image processing. More particularly, the application relates to machine recognition of objects in images.
  • Machine vision is a term that encompasses technology and processes for extracting useful information from captured images.
  • the information may be used for any number of goals such as automated inspection or analysis of components in an industrial system, by way of non-limiting example.
  • machine vision may be used for automatic inspection, process control, and/or robot guidance.
  • it may be necessary to identify certain objects of interest in a captured image.
  • decisions must be made as to which portions of the captured image contain information related to the object of interest. Separating the pixels in an image containing the object of interest is called segmentation. Segmentation is a non-trivial task as images containing an object may represent the object from different poses or perspectives.
  • background objects in the image make identification of objects difficult for machine-based recognition techniques.
  • Attention maps specify regions of an image that are relevant to an object to be classified in the image.
  • the machine learning network may learn features of the object and recognize these features in regions of the image.
  • the classification of the image may then be performed efficiently by focusing processing efforts on the specific regions of the image that are most likely to produce a correct classification.
  • In order to generate attention maps using deep neural networks (DNN), it is necessary to train the model with a large amount of annotated data.
  • Complex issues of object identification such as segmentation and anomaly localization require accurate annotations, which may be accomplished, for example, with bounding boxes; however, these bounding boxes must be accurate down to the pixel level. Establishing bounding boxes at this level of detail is time-consuming and the results are not scalable.
  • a method for generating an improved attention map in an image includes processing an input image in a trained neural network to produce a classification loss, back-propagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network, applying the weight vector to the input image to produce an attention map, passing the attention map through the neural network to produce an attention loss, and using the classification loss and the attention loss to provide supervision to the neural network to produce an improved attention map.
  • the method may further include generating, by the processing of the input image by the neural network, a score vector representing probabilities that the input image contains a classifier from a plurality of classifiers.
  • the classification loss may be calculated by comparing the score vector to a ground truth vector containing the actual presence of each classifier of the plurality of classifiers, and setting a signal in which pixels corresponding to the target classifier have a value of one and all other pixels have a value of zero to produce a gradient.
  • the gradient is backpropagated through the neural network to the last convolution layer in the neural network, and global average pooling may be performed on the gradient to produce the weight vector.
  • An improved attention map may be generated by calculating a weighted sum of a set of feature maps associated with the last convolutional layer to produce the improved attention map.
  • a segmentation mask for the input image is created based on the attention map, wherein regions of the input image identified by the attention map as a target object are masked prior to passing the attention map through the neural network.
  • the masked regions of the input image may be selected based on pixels in the region exceeding a pre-selected threshold placed on the attention map.
  • Retraining the network to produce an improved attention map may include selecting the threshold placed on the attention map to minimize a prediction score for the target object when the masked input image is passed through the neural network, the attention loss comprising the prediction score.
  • the classification loss and the attention loss may identify a target object in the image by using the classification loss and attention loss to supervise machine learning performed by the neural network.
  • generating an improved attention map in an image includes collecting a training dataset for a neural network, the collected training dataset containing a minority percentage of training images that include a low-level segmentation mask label, training a neural network with the collected dataset to produce a trained neural network, processing an input image in the trained neural network to produce a classification loss, back-propagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network, applying the weight vector to the input image to produce an attention map, passing the attention map through the neural network to produce an attention loss, and using the classification loss and the attention loss to guide the neural network to provide supervised learning to produce an improved attention map.
  • a minority percentage of training images may be considered between about 10 percent and 15 percent of the dataset of training images. In one embodiment, the minority percentage of training images is about 13 percent of the dataset of training images.
  • the low-level segmentation mask comprises all pixels belonging to a particular target classifier in an associated training image and the minority percentage of training images is adapted to a specific machine vision application.
  • generating an improved attention map in an image includes collecting a training dataset for a neural network, the collected training dataset containing a minority percentage of training images that include a low-level label, training a neural network with the collected dataset to produce a trained neural network, processing an input image in the trained neural network to produce a classification loss, back-propagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network, applying the weight vector to the input image to produce an attention map, passing the attention map through the neural network to produce an attention loss, and using the classification loss and the attention loss to guide the neural network to provide supervision to produce an improved attention map.
  • the minority percentage of training images may be between about 3 percent and 7 percent of the dataset of training images. In one embodiment the minority percentage of training images is about 5 percent of the dataset of training images.
  • the low-level label may be defined by a bounding box enclosing a surface portion of a particular target classifier in an associated training image, and the minority percentage of training images is adapted to a specific machine vision application. Generating a mask in the training image may be accomplished by setting an area defined by the bounding box as a 1 and other regions of the image as a 0, and training the neural network using the masked training image.
  • FIG. 1 is a high-level diagram of a system for creating an improved attention map using self-guidance according to an embodiment of this disclosure
  • FIG. 2 is an illustration of a masked and cropped input image according to certain embodiments of this disclosure
  • FIG. 3 includes an image of an attention map according to conventional techniques and an improved attention map according to aspects of embodiments of the present disclosure
  • FIG. 4 is an image including an attention map for a target object shown in two perspectives according to a conventional technique for generating an attention map;
  • FIG. 5 is a high-level diagram of a system for performing a semi-supervised creation of an improved attention map according to aspects of embodiments of the present disclosure
  • FIG. 6 is an example of an improved attention map according to aspects of embodiments of the present disclosure.
  • FIG. 7 is a high-level diagram of a system for performing a semi-supervised creation of an improved attention map according to aspects of embodiments of the present disclosure
  • FIG. 8 includes an image of an attention map according to conventional techniques and an improved attention map according to aspects of embodiments of the present disclosure
  • FIG. 9 is a block diagram of a computer system for performing the creation of an improved attention map according to aspects of embodiments of the present disclosure.
  • Embodiments described in the specification will present improved techniques for generating attention maps in images, including RGB and depth images.
  • a weakly-supervised method of segmentation is enhanced using self-guidance.
  • the guidance mainly relies on the difference of prediction scores from the DNN with respect to ground truth.
  • This technique produces improved attention maps that serve to provide improved priors for object segmentation tasks.
  • existing weakly-supervised methods to generate attention maps are improved to create more accurate attention maps by leveraging the classification ability of the network to design a self-guidance loss, which guides the DNN training process.
  • the self-guidance loss allows a DNN to learn to focus on the whole of an object as opposed to small salient aspects of the object.
  • the network can then make better decisions based on the more complete consideration of the object.
  • FIG. 1 is a high-level diagram of a system for performing weakly supervised segmentation using self-guidance according to an embodiment of this disclosure.
  • An arbitrary image 101 is passed through a convolutional neural network (CNN) 103.
  • filters are applied to the image and a feature map is produced for each applied filter.
  • the information from the input image 101 is transformed into a set of feature maps 105.
  • Processing then proceeds to one or more fully connected layers 107 to produce a score vector 109.
  • the score vector 109 contains values representative of a probability that a given classifier has been found in the source image 101.
  • a ground truth vector 111 contains values to indicate the actual presence of each classifier in the source image 101.
  • the score vector 109 is subtracted from the ground truth vector 111. The difference between the score vector 109 and the ground truth vector 111 is denoted as a classification loss 113.
  • the classification loss is used to generate an attention map 115 which identifies the portions of the input image 101 which are most likely to contain the class of object being detected.
  • a pre-determined threshold is set that uses the attention map 115 to crop the original input image 101.
  • the mask is applied to the source image 101 to block out the portions of the image corresponding to the attention map 115 to produce a masked image 117.
  • the masked image 117 is then passed through the CNN 119 one more time to produce a second score vector 121.
  • the difference between the second score vector 121 and ground truth 111 produces a self-supervision or attention loss 123. Because the most likely pixels relating to the classifier have been masked and removed from the masked image 117, the second score vector 121 values should be very small.
  • the attention loss 123 may be used to assist the training of the CNN 103 to ultimately result in a minimized attention loss 123.
  • the attention loss 123 When the attention loss 123 is minimized, it indicates that the attention map, which is masked in the second pass through the CNN 119, contains the least number of pixels that result in a positive classification for the target classification object. Accordingly, the identified attention map 115 may be expected to contain as much of the whole target object as possible. This is an improvement over prior weakly-supervised techniques which provide a less-complete attention map, resulting in lower classification probabilities using CNNs.
  • the improved method produces a more complete attention mask resulting in higher classification rates.
  • the improved method retains the efficiency of prior methods, requiring input training data that only requires high-level labels at the image level.
  • the classification ability of the network is used to design a self-guidance loss that in turn guides the training process of the deep neural network (DNN).
  • the self-guidance loss 123 allows the DNN to focus and learn the whole representation of the target object in the image.
  • Image-level labels associated with the input images 101, combined with the self-guidance loss 123, provide constraints on the attention map.
  • the weakly-supervised DNN with a self-guidance constraint allows more accurate attention maps.
  • conventional weakly supervised methods cannot produce an attention map that is sufficient to be used for segmentation
  • the improved attention map includes more of the target object and is useful as a segmentation prior.
  • the technique in the first embodiment may be used in any image modality, including but not limited to RGB images and is useful in any neural network where a classifier is trainable.
  • A test was performed to demonstrate the first embodiment, focusing on the task of object segmentation using the Pascal visual object classes (VOC) 2012 dataset.
  • the RGB images in the VOC 2012 dataset were used along with their corresponding image-level labels, such as “dog”, “horse”, and the like.
  • this technique does not rely on any low-level labels, such as segmentation masks, or bounding boxes.
  • the only criterion used to augment training of the network is the self-guidance or attention loss 123.
  • FIG. 2 is an illustration of a masked and cropped input image according to certain embodiments of this disclosure.
  • a predetermined threshold is set on the attention map of a target class, in the example of FIG. 2 the target class is a “sheep”.
  • the threshold is used to define a mask that is used to crop the original image.
  • the cropped image is then passed forward through the DNN to produce a prediction score for the image.
  • the score for the cropped image should be much smaller than the score vector produced by the original image.
  • An input image 201 depicting the target object (e.g., a sheep) is shown in FIG. 2.
  • a complete (i.e., unmasked) version of the input image 201 is passed through the deep neural network to produce a classification loss which is backpropagated through the neural network to produce an attention map.
  • Global average pooling may be performed on the gradient to produce a weight vector.
  • the weight vector is then applied to calculate a weighted sum of the feature maps of the last convolutional layer and produce an attention map 203.
  • In attention map 203, the focus is mainly constrained to the head and, to a lesser degree, the hindquarters of the target object. This results in a probability of identifying the target class of 0.405.
  • the result is masked image 205, which excludes the head and hindquarters of the target object.
  • the resulting attention map 207 based on masked image 205 is now focused on the torso of the target object and, to a lesser degree, the fore shoulder. It is noted that the new prediction score now shows a probability for identification as the target object of 0.014, much less than the original unmasked image 201.
  • Image 209 shows a masked image where the attention map regions in attention map 203 and attention map 207 are masked out of the input image.
  • the resulting attention map 211 is now focused on the tail, belly and legs of the target object and results in an identification probability of 0.008.
  • the accumulation of the attention maps 203, 207, and 211 results in a more complete area of focus for identifying the target class, as evidenced by the low prediction score when these areas are masked and excluded from consideration, as depicted in image 209.
  • FIG. 3 provides examples of improved attention maps generated by a method according to aspects of certain embodiments of this disclosure.
  • an attention map may be generated with one forward pass through the neural network.
  • the first image 301 shows an attention map generated by conventional means such as Grad-CAM.
  • the second image 303 shows an improved attention map produced using the method described above with respect to the first embodiment.
  • the attention map 303 focuses on more of the target object (e.g., cat) including the head, legs, body and tail. This is not achieved through conventional weakly supervised techniques.
  • image 305 shows a second image of a cat and its associated attention map using conventional techniques. The conventional techniques focus solely on the face and miss the legs and body.
  • image 307 shows an attention map produced using the technique of the first embodiment, which includes more of the image pertaining to the target object.
  • the improved attention maps of images 303, 307 represent an improvement in the generation of attention maps.
  • the improved attention maps provide better prediction scores and may be used for example, as segmentation priors.
  • a second embodiment for generating improved attention maps for visual recognition in images will now be described.
  • weakly-supervised methods for determining attention maps define portions of the image that support a classifier’s prediction. Because weakly-supervised methods can generate attention maps without the need for low-level annotations, these methods can conserve resources in many industrial applications.
  • the network may learn to select unimportant areas from the background of the image, or to reflect some bias of the dataset, in order to meet the requirement of image-level supervision successfully. This produces risk in cases where the data exhibits varying backgrounds or other variances.
  • the second embodiment will be described with reference to a model for detecting coarse rotation of a target object. In an instructive example, a substantially symmetrically shaped object is used to demonstrate detection of rotation of the target object.
  • FIG. 4 is an illustration of a symmetrical target object 410 and its associated attention map 403 using conventional weakly supervised methods.
  • Image 401 shows the symmetrical object 410 in a first position, designated as “normal”.
  • the trained DNN captures part of the background 403 as important regions for the attention map although these regions do not contain the target object. This condition persists in image 405, where the symmetric object 410 is rotated 90° to a second position denoted as “abnormal”.
  • the DNN identifies part of the background 407 and includes this area in the attention map even though the target object 410 does not appear in that area of the image 405.
  • the attention map 403 mainly focuses on the background rather than the texture and markers on the surface of the target object 410. Occurrences like this that capture the bias in the dataset create undesired risk for misidentification in industrial applications.
  • an end-to-end deep neural network is built having supervision using class-level labels combined with a very small number of images (e.g., about 5% of the images) that include mask labels generated by bounding boxes.
  • the low-level mask labels may be found in from about 3% to about 7% of all training images. The mask labels are associated with a minority percentage of all training images.
  • Neural networks trained in this manner display superior performance over conventional training techniques when tested using datasets having a different distribution than the data used to train the network, while requiring less effort and time than systems which include low-level annotations for all training images.
  • FIG. 5 is a block diagram illustrating the generation of an improved attention map according to aspects of a second embodiment of the present disclosure.
  • An input image 501 is presented to CNN 503.
  • one or more filters are applied with each filter producing a feature map.
  • each convolutional layer produces a set of convolutional feature maps 505.
  • the CNN 503 further includes one or more fully connected layers 507.
  • the input image 501 passes through the convolutional layers 505 and the fully connected layers 507 to produce a score vector 509 that contains a value for each classifier representing the probability that the associated target class is detected in the input image 501.
  • a ground truth vector 513 contains the actual classification data for the input image.
  • the classification loss 511 is used to produce weights 515 on each feature map 505.
  • a signal is set to a value of 1 for the target class and 0 for all others.
  • the gradient is backpropagated through the CNN to the last convolutional layer. Global average pooling is performed to the gradient to produce the weight vector 515.
  • Applying the weight vector 515 to calculate a weighted sum of the feature maps 505 of the last convolutional layer produces the attention map 517.
  • the improved attention mask 517 is more focused on areas of the image containing the target object.
  • the low-level labels are associated with a minority percentage of all training images.
  • a minority percentage contains fewer images with low-level labeling than images that do not contain low-level labels.
  • the improved attention mask 517 may be compared to the ground truth mask 519 to calculate an attention loss 521.
  • the system creates two losses, the classification loss 511 and the attention loss 521 to guide the neural network 503 in focusing on areas containing the target object and to make decisions accordingly.
  • the result from this architecture of FIG. 5 is an improved attention mask that allows the neural network to concentrate on the appropriate region of interest in the input image and to make more accurate decisions based on the improved attention map.
  • FIG. 6 shows two attention maps generated according to the second embodiment depicted in FIG. 5.
  • Attention map 601 shows the target object 410 in a “normal” position.
  • the attention map focuses on the surface of the target object 410 to identify the object and its pose.
  • improved attention map 601 does not pick up portions of the background or data bias in the image.
  • Attention map 603 shows the target object 410 in an “abnormal” position.
  • the attention map focuses on the surface of the target object 410 to identify the object and its pose.
  • attention map 603 does not pick up portions of the background or data bias in the image.
  • This embodiment has been proven to improve the training of a DNN to generate improved attention maps that focus on the appropriate relevant areas of the image. The improved attention maps are more reliable for critical processes such as industrial applications.
  • the second embodiment presents a semi-supervised neural network-based method for providing attention map guidance to perform visual recognition on images using a small amount of labeled data with low-level annotations.
  • the low-level annotations provide supervision on the attention map of the network.
  • the DNN can thereby learn to focus on the important areas of the image and make improved decisions accordingly.
  • the described method provides robustness to data relating to varying backgrounds, view perspectives and other dataset bias.
  • a third embodiment according to this disclosure will now be described.
  • weakly supervised techniques for generating attention maps for visual recognition save resources by not requiring large amounts of annotated data.
  • the resulting attention maps are less reliable and may not be suitable for more critical tasks.
  • small amounts of low-level annotated data are included during training of a DNN in the form of segmentation masks to guide the DNN in learning to focus on a target object as a whole in the input image.
  • a small amount, for example about 13%, of the training samples includes low-level segmentation labels along with the image-level class label.
  • in some embodiments, the low-level segmentation labels may be found in a proportion of images ranging from about 10 percent to about 15 percent of all the training images.
  • the DNN learns to focus on the entire target object and produce an improved attention map.
  • the improved attention map is suitable for use as a segmentation prior.
  • the task of object segmentation using the VOC 2012 dataset is presented as an example to describe this embodiment.
  • the data are divided into two parts.
  • a first portion of the data has only the image level label like ‘dog’, ‘horse’ and so on.
  • the other portion of the data includes both image level label and additionally the ground truth segmentation masks to indicate all pixels in the image belonging to the target class.
  • the method relies on image level labels and a small amount of data with additional ground truth segmentation masks.
  • the percentage of data samples including the additional ground truth segmentation masks may be from about 10 % to about 15 %.
  • annotated images that have both the image-level label and the mask label occupy only about 13% of the entire dataset.
  • FIG. 7 is a diagram of the framework for performing a method for generating improved attention maps according to aspects of the third embodiment in the present disclosure.
  • a PyTorch VGG19 model that is pre-trained on ImageNet and fine-tuned with additional training data may be used as the network.
  • An input image 701 is presented to the trained network that performs a forward pass through one or more convolutional layers 703 and one or more fully connected layers 707 to get a score vector 709.
  • the score vector contains a probability that the image contains a particular class, with each classification having an entry in the score vector 709.
  • the score vector 709 is then compared to a ground truth class label 711 of the image 701 to calculate classification loss 713.
  • a signal is provided in which the pixels containing the target class have a value of 1 and all others have a value of 0; this signal is back-propagated, and the resulting gradient is carried back until the last convolutional layer 703 is reached.
  • Global average pooling may be applied to the gradient to get a weight vector 710, and this vector is used to calculate a weighted sum of the last convolutional layer feature maps 705 to get the attention map 715.
  • the L1 loss between the attention map 715 and the ground truth mask 717 is calculated and denoted as the attention loss 719. Accordingly, two losses are defined: the classification loss 713 and the attention loss 719. These two losses serve to guide the deep neural network 703 to focus on the complete target objects and learn to make decisions correspondingly (a minimal code sketch of this two-loss setup is provided at the end of this section).
  • the method according to the third embodiment as described herein takes an arbitrary image as input and outputs an improved attention map 715 with one forward pass.
  • FIG. 8 shows some examples of the attention output according to the third embodiment.
  • the attention maps generated by the embodied method are more accurate and complete than those generated by the traditional method.
  • an image of a dog 801 is shown.
  • the attention mask generated by conventional techniques is shown in 801a.
  • the improved attention map according to the described embodiment is shown in 801b.
  • the improved attention map 801b includes more of the image containing the target object.
  • the improved attention map 801b includes the dog’s torso and legs, which are ignored by the conventional technique in 801a.
  • Attention maps 803a and 803b illustrate the improved attention map 803b for a target object of a cat.
  • Attention maps 805a and 805b include a group of horses.
  • the improved attention map 805b focuses on the entire body of the horse and learns the presence of a second horse that is not recognized in the attention map 805a generated by traditional methods.
  • Attention maps 807a and 807b likewise provide an improved attention map for two cats, detecting the presence of a second cat not directly facing the image capture device and covering more of the first cat’s face and head as compared to the conventionally generated attention map 807a.
  • the attention map generated by our method can serve as a better segmentation prior.
  • FIG. 9 illustrates an exemplary computing environment 900 within which embodiments of the invention may be implemented.
  • Computers and computing environments such as computer system 910 and computing environment 900, are known to those of skill in the art and thus are described briefly here.
  • the computer system 910 may include a communication mechanism such as a system bus 921 or other communication mechanism for communicating information within the computer system 910.
  • the computer system 910 further includes one or more processors 920 coupled with the system bus 921 for processing the information.
  • the processors 920 may include one or more central processing units (CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as used herein is a device for executing machine-readable instructions stored on a computer readable medium, for performing tasks and may comprise any one or combination of, hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device.
  • a processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general-purpose computer.
  • a processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between.
  • a user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof.
  • a user interface comprises one or more display images enabling user interaction with a processor or other device.
  • the computer system 910 also includes a system memory 930 coupled to the system bus 921 for storing information and instructions to be executed by processors 920.
  • the system memory 930 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 931 and/or random-access memory (RAM) 932.
  • the RAM 932 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM).
  • the ROM 931 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM).
  • system memory 930 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 920.
  • a basic input/output system 933 (BIOS) containing the basic routines that help to transfer information between elements within computer system 910, such as during start-up, may be stored in the ROM 931.
  • RAM 932 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 920.
  • System memory 930 may additionally include, for example, operating system 934, application programs 935, other program modules 936 and program data 937.
  • the computer system 910 also includes a disk controller 940 coupled to the system bus 921 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 941 and a removable media drive 942 (e.g., floppy disk drive, compact disc drive, tape drive, and/or solid state drive).
  • Storage devices may be added to the computer system 910 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
  • the computer system 910 may also include a display controller 965 coupled to the system bus 921 to control a display or monitor 966, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user.
  • the computer system includes an input interface 960 and one or more input devices, such as a keyboard 962 and a pointing device 961 , for interacting with a computer user and providing information to the processors 920.
  • the pointing device 961 for example, may be a mouse, a light pen, a trackball, or a pointing stick for communicating direction information and command selections to the processors 920 and for controlling cursor movement on the display 966.
  • the display 966 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the pointing device 961.
  • an augmented reality device 967 that is wearable by a user, may provide input/output functionality allowing a user to interact with both a physical and virtual world.
  • the augmented reality device 967 is in communication with the display controller 965 and the user input interface 960 allowing a user to interact with virtual items generated in the augmented reality device 967 by the display controller 965.
  • the user may also provide gestures that are detected by the augmented reality device 967 and transmitted to the user input interface 960 as input signals.
  • the computer system 910 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 920 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 930.
  • Such instructions may be read into the system memory 930 from another computer readable medium, such as a magnetic hard disk 941 or a removable media drive 942.
  • the magnetic hard disk 941 may contain one or more datastores and data files used by embodiments of the present invention. Datastore contents and data files may be encrypted to improve security.
  • the processors 920 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 930.
  • hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
  • the computer system 910 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein.
  • the term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 920 for execution.
  • a computer readable medium may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media.
  • Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 941 or removable media drive 942.
  • Non-limiting examples of volatile media include dynamic memory, such as system memory 930.
  • Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 921. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
  • the computing environment 900 may further include the computer system 910 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 980.
  • Remote computing device 980 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 910.
  • computer system 910 may include modem 972 for establishing communications over a network 971 , such as the Internet. Modem 972 may be connected to system bus 921 via user network interface 970, or via another appropriate mechanism.
  • Network 971 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 910 and other computers (e.g., remote computing device 980).
  • the network 971 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art.
  • Wireless connections may be implemented using Wi-Fi, WiMAX, and Bluetooth, infrared, cellular networks, satellite or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 971.
  • An executable application comprises code or machine-readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input.
  • An executable procedure is a segment of code or machine-readable instruction, subroutine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
  • a graphical user interface comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions.
  • the GUI also includes an executable procedure or executable application.
  • the executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user.
  • the processor under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.
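By way of non-limiting illustration of the two-loss frameworks of FIG. 5 and FIG. 7 described in the items above, the following PyTorch sketch fine-tunes an ImageNet-pretrained VGG19, computes the attention map from the globally average-pooled gradient of a one-hot class signal, and, where a ground-truth mask label is available, adds an L1 attention loss. The function names, the loss weighting, and the assumed mask shape are illustrative assumptions, not details fixed by this disclosure.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def build_vgg19(num_classes):
    """ImageNet-pretrained VGG19, fine-tuned for the target classes (a sketch;
    newer torchvision versions use the weights= argument instead of pretrained=True)."""
    model = models.vgg19(pretrained=True)
    model.classifier[-1] = torch.nn.Linear(model.classifier[-1].in_features, num_classes)
    return model

def two_loss_step(model, optimizer, image, label_vec, target_class, gt_mask=None, lambda_att=1.0):
    """One training step: classification loss on every image; L1 attention loss on the
    minority of images that carry a ground-truth segmentation mask label."""
    optimizer.zero_grad()

    # Forward pass, keeping the last convolutional feature maps.
    fmaps = model.features(image)                                     # (1, C, h, w)
    logits = model.classifier(torch.flatten(model.avgpool(fmaps), 1))
    cls_loss = F.binary_cross_entropy_with_logits(logits, label_vec)

    # Back-propagate a one-hot signal for the target class to the last convolutional
    # layer, globally average-pool the gradient into a weight vector, and take a
    # weighted sum of the feature maps to obtain the attention map.
    one_hot = torch.zeros_like(logits)
    one_hot[0, target_class] = 1.0
    grads = torch.autograd.grad(logits, fmaps, grad_outputs=one_hot, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * fmaps).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

    if gt_mask is not None:                       # mask assumed to be (1, H, W) with values in [0, 1]
        att_loss = F.l1_loss(cam.squeeze(1), gt_mask)
    else:
        att_loss = torch.zeros((), device=image.device)

    (cls_loss + lambda_att * att_loss).backward()
    optimizer.step()
    return cls_loss.detach(), att_loss.detach()
```

Because only a minority of the training images carry mask labels, gt_mask would simply be None for the remaining images, so the attention loss contributes only where low-level supervision is available.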

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A method for generating an improved attention map in an image includes processing an input image in a trained neural network to produce a classification loss, back-propagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network, applying the weight vector to the input image to produce an attention map, passing the attention map through the neural network to produce an attention loss, and using the classification loss and the attention loss to provide supervision to the neural network to produce an improved attention map. In some embodiments, a minority percentage of training images may be provided with low-level labels in addition to image-level labels. Low-level labels may include bounding boxes enclosing surface regions of a target object or segmentation mask labels for each pixel identifiable as part of the target object.

Description

METHOD AND SYSTEM FOR IMPROVED ATTENTION MAP GUIDANCE FOR
VISUAL RECOGNITION IN IMAGES
TECHNICAL FIELD
[0001] This application relates to image processing. More particularly, the application relates to machine recognition of objects in images.
BACKGROUND
[0002] Machine vision is a term that encompasses technology and processes for extracting useful information from captured images. The information may be used for any number of goals such as automated inspection or analysis of components in an industrial system, by way of non-limiting example. In this context, machine vision may be used for automatic inspection, process control, and/or robot guidance. To perform these functions, it may be necessary to identify certain objects of interest in a captured image. To identify the object of interest, decisions must be made as to which portions of the captured image contain information related to the object of interest. Separating the pixels in an image containing the object of interest is called segmentation. Segmentation is a non-trivial task as images containing an object may represent the object from different poses or perspectives. In addition, background objects in the image make identification of objects difficult for machine-based recognition techniques.
[0003] Attention maps specify regions of an image that are relevant to an object to be classified in the image. The machine learning network may learn features of the object and recognize these features in regions of the image. The classification of the image may then be performed efficiently by focusing processing efforts on the specific regions of the image that are most likely to produce a correct classification. [0004] In order to generate attention maps using deep neural networks (DNN), it is necessary to train the model with a large amount of annotated data. Complex issues of object identification such as segmentation and anomaly localization require accurate annotations, which may be accomplished, for example, with bounding boxes; however, these bounding boxes must be accurate down to the pixel level. Establishing bounding boxes at this level of detail is time-consuming and the results are not scalable. Past efforts have focused on weakly supervised methods using simple, high-level annotation, providing limited low-level information while requiring less effort and time. However, without meaningful low-level guidance the outcomes are unreliable and, depending on the application, may produce unacceptable risks due to inaccurate recognition of objects. Accordingly, improved methods and systems are desired.
SUMMARY
[0005] According to embodiments described in this application, a method for generating an improved attention map in an image includes processing an input image in a trained neural network to produce a classification loss, back-propagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network, applying the weight vector to the input image to produce an attention map, passing the attention map through the neural network to produce an attention loss, and using the classification loss and the attention loss to provide supervision to the neural network to produce an improved attention map. The method may further include generating, by the processing of the input image by the neural network, a score vector representing probabilities that the input image contains a classifier from a plurality of classifiers. The classification loss may be calculated by comparing the score vector to a ground truth vector containing the actual presence of each classifier of the plurality of classifiers, and setting a signal in which pixels corresponding to the target classifier have a value of one and all other pixels have a value of zero to produce a gradient. The gradient is backpropagated through the neural network to the last convolutional layer in the neural network, and global average pooling may be performed on the gradient to produce the weight vector.
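By way of non-limiting illustration, the following PyTorch-style sketch shows one way the score vector may be compared against a multi-hot ground truth vector to obtain a classification loss; the specific loss function (binary cross-entropy) and the tensor shapes assumed here are illustrative choices rather than details fixed by this disclosure.

```python
import torch
import torch.nn.functional as F

def classification_loss(model, image, ground_truth):
    """Compare the network's score vector against a multi-hot ground truth vector.

    image:        tensor of shape (1, 3, H, W)
    ground_truth: tensor of shape (1, num_classes), 1.0 where a class is present, else 0.0
    """
    logits = model(image)  # score vector, one entry per classifier
    # Binary cross-entropy is one reasonable choice for multi-label, image-level labels.
    loss = F.binary_cross_entropy_with_logits(logits, ground_truth)
    return loss, logits
```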
[0006] An improved attention map may be generated by calculating a weighted sum of a set of feature maps associated with the last convolutional layer to produce the improved attention map. According to some embodiments, a segmentation mask for the input image is created based on the attention map, wherein regions of the input image identified by the attention map as a target object are masked prior to passing the attention map through the neural network. The masked regions of the input image may be selected based on pixels in the region exceeding a pre-selected threshold placed on the attention map.
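The weight-vector and weighted-sum steps described above follow a Grad-CAM-style recipe. The sketch below is a minimal, non-authoritative PyTorch rendering of that recipe, assuming the caller supplies the module corresponding to the last convolutional layer; the hook usage, interpolation to the input resolution, and min-max normalization are illustrative choices.

```python
import torch
import torch.nn.functional as F

def grad_cam_attention_map(model, last_conv_layer, image, target_class, create_graph=False):
    """Attention map from the weight vector (globally average-pooled gradient) and a
    weighted sum of the last convolutional layer's feature maps."""
    feature_maps = []

    def keep_features(module, inputs, output):
        feature_maps.append(output)

    handle = last_conv_layer.register_forward_hook(keep_features)
    logits = model(image)                      # forward pass -> score vector
    handle.remove()
    fmaps = feature_maps[0]                    # (1, C, h, w)

    # Signal: one for the target class, zero elsewhere; back-propagate it to the last
    # convolutional layer to obtain the gradient with respect to the feature maps.
    one_hot = torch.zeros_like(logits)
    one_hot[0, target_class] = 1.0
    grads = torch.autograd.grad(logits, fmaps, grad_outputs=one_hot,
                                create_graph=create_graph)[0]

    weights = grads.mean(dim=(2, 3), keepdim=True)   # global average pooling -> weight vector
    cam = F.relu((weights * fmaps).sum(dim=1))       # weighted sum of feature maps
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False).squeeze(1)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam
```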
[0007] Retraining the network to produce an improved attention map may include selecting the threshold placed on the attention map to minimize a prediction score for the target object when the masked input image is passed through the neural network, the attention loss comprising the prediction score. The classification loss and the attention loss may identify a target object in the image by using the classification loss and attention loss to supervise machine learning performed by the neural network.
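A minimal sketch of the self-guidance step described above, assuming the attention map has been normalized to [0, 1] and that the prediction score is read from a sigmoid of the logits; the threshold value and the way the mask is applied (zeroing the masked pixels) are assumptions for illustration.

```python
import torch

def self_guidance_attention_loss(model, image, attention_map, target_class, threshold=0.5):
    """Mask out the regions selected by the attention map and re-score the image.

    If the attention map covers the whole target object, the prediction score for that
    class on the masked image should be close to zero, so the score itself can serve
    as the attention (self-guidance) loss.
    """
    mask = (attention_map > threshold).float()        # pixels above the threshold
    masked_image = image * (1.0 - mask.unsqueeze(1))  # zero out those pixels in every channel
    masked_logits = model(masked_image)               # second pass through the network
    attention_loss = torch.sigmoid(masked_logits)[0, target_class]
    return attention_loss, masked_image
```

In training, this score would be added to the classification loss so that minimizing the combined loss pushes the attention map to cover as much of the target object as possible.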
[0008] According to another embodiment, generating an improved attention map in an image includes collecting a training dataset for a neural network, the collected training dataset containing a minority percentage of training images that include a low-level segmentation mask label, training a neural network with the collected dataset to produce a trained neural network, processing an input image in the trained neural network to produce a classification loss, back-propagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network, applying the weight vector to the input image to produce an attention map, passing the attention map through the neural network to produce an attention loss, and using the classification loss and the attention loss to guide the neural network to provide supervised learning to produce an improved attention map. A minority percentage of training images may be considered to be between about 10 percent and 15 percent of the dataset of training images. In one embodiment, the minority percentage of training images is about 13 percent of the dataset of training images. The low-level segmentation mask comprises all pixels belonging to a particular target classifier in an associated training image, and the minority percentage of training images is adapted to a specific machine vision application.
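As an illustrative sketch of assembling such a dataset, the snippet below randomly designates a minority fraction (about 13 percent by default) of the training images to carry segmentation mask labels, while the remainder keep only image-level labels; the function name and the use of a fixed random seed are assumptions.

```python
import random

def split_for_mask_labels(image_ids, mask_fraction=0.13, seed=0):
    """Designate a minority of training images (about 13% by default) to carry
    low-level segmentation mask labels; the rest keep image-level labels only."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_masked = max(1, round(mask_fraction * len(ids)))
    return set(ids[:n_masked]), set(ids[n_masked:])  # (mask-labelled, image-level only)
```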
[0009] According to another embodiment, generating an improved attention map in an image includes collecting a training dataset for a neural network, the collected training dataset containing a minority percentage of training images that include a low-level label, training a neural network with the collected dataset to produce a trained neural network, processing an input image in the trained neural network to produce a classification loss, back-propagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network, applying the weight vector to the input image to produce an attention map, passing the attention map through the neural network to produce an attention loss, and using the classification loss and the attention loss to guide the neural network to provide supervision to produce an improved attention map. The minority percentage of training images may be between about 3 percent and 7 percent of the dataset of training images. In one embodiment the minority percentage of training images is about 5 percent of the dataset of training images. The low-level label may be defined by a bounding box enclosing a surface portion of a particular target classifier in an associated training image, and the minority percentage of training images is adapted to a specific machine vision application. Generating a mask in the training image may be accomplished by setting an area defined by the bounding box as a 1 and other regions of the image as a 0, and training the neural network using the masked training image.
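A minimal sketch of deriving the binary mask from a bounding box as described above; the coordinate convention assumed for the box is illustrative.

```python
import torch

def mask_from_bounding_box(height, width, box):
    """Coarse mask label from a bounding box: 1 inside the box, 0 elsewhere.

    box: (x_min, y_min, x_max, y_max) in pixel coordinates (an assumed convention).
    """
    x_min, y_min, x_max, y_max = box
    mask = torch.zeros(height, width)
    mask[y_min:y_max, x_min:x_max] = 1.0
    return mask
```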
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The foregoing and other aspects of the present invention are best understood from the following detailed description when read in connection with the accompanying drawings. For the purpose of illustrating the invention, there is shown in the drawings embodiments that are presently preferred, it being understood, however, that the invention is not limited to the specific instrumentalities disclosed. Included in the drawings are the following Figures:
[0011] FIG. 1 is a high-level diagram of a system for creating an improved attention map using self-guidance according to an embodiment of this disclosure;
[0012] FIG. 2 is an illustration of a masked and cropped input image according to certain embodiments of this disclosure; [0013] FIG. 3 includes an image of an attention map according to conventional techniques and an improved attention map according to aspects of embodiments of the present disclosure;
[0014] FIG. 4 is an image including an attention map for a target object shown in two perspectives according to a conventional technique for generating an attention map;
[0015] FIG. 5 is a high-level diagram of a system for performing a semi-supervised creation of an improved attention map according to aspects of embodiments of the present disclosure;
[0016] FIG. 6 is an example of an improved attention map according to aspects of embodiments of the present disclosure.
[0017] FIG. 7 is a high-level diagram of a system for performing a semi-supervised creation of an improved attention map according to aspects of embodiments of the present disclosure;
[0018] FIG. 8 includes an image of an attention map according to conventional techniques and an improved attention map according to aspects of embodiments of the present disclosure;
[0019] FIG. 9 is a block diagram of a computer system for performing the creation of an improved attention map according to aspects of embodiments of the present disclosure.
DETAILED DESCRIPTION
[0020] Embodiments described in the specification will present improved techniques for generating attention maps in images, including RGB and depth images. In a first embodiment, a weakly-supervised method of segmentation is enhanced using self-guidance. The guidance mainly relies on the difference of prediction scores from the DNN with respect to ground truth. This technique produces improved attention maps that serve to provide improved priors for object segmentation tasks. According to this embodiment, existing weakly-supervised methods to generate attention maps are improved to create more accurate attention maps by leveraging the classification ability of the network to design a self-guidance loss, which guides the DNN training process. The self-guidance loss allows a DNN to learn to focus on the whole of an object as opposed to small salient aspects of the object. The network can then make better decisions based on the more complete consideration of the object.
[0021] FIG. 1 is a high-level diagram of a system for performing weakly supervised segmentation using self-guidance according to an embodiment of this disclosure. An arbitrary image 101 is passed through a convolutional neural network (CNN) 103. In the convolutional layers, filters are applied to the image and a feature map is produced for each applied filter. In this way, the information from the input image 101 is transformed into a set of feature maps 105. Processing then proceeds to one or more fully connected layers 107 to produce a score vector 109. The score vector 109 contains values representative of a probability that a given classifier has been found in the source image 101. A ground truth vector 111 contains values to indicate the actual presence of each classifier in the source image 101. The score vector 109 is subtracted from the ground truth vector 111. The difference between the score vector 109 and the ground truth vector
[0022] According to the embodiment of FIG. 1, the classification loss is used to generate an attention map 115, which identifies the portions of the input image 101 that are most likely to contain the class of object being detected. A pre-determined threshold is applied to the attention map 115 to define a mask over the original input image 101. The mask is applied to the source image 101 to block out the portions of the image corresponding to the attention map 115 and produce a masked image 117. The masked image 117 is then passed through the CNN 119 one more time to produce a second score vector 121. The difference between the second score vector 121 and the ground truth 111 produces a self-supervision or attention loss 123. Because the pixels most strongly associated with the classifier have been masked and removed from the masked image 117, the values of the second score vector 121 should be very small.
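The following is a minimal sketch, in PyTorch, of the two-pass self-guidance step described above. It is illustrative only: the classifier cnn, the threshold value, and the helper attention_fn (assumed to return an attention map for the target class at the input resolution, normalized to [0, 1]; one way to compute such a map is sketched later in connection with FIG. 5) are assumptions rather than elements of the disclosed system, and the actual loss formulation may differ.

```python
# Minimal sketch of the self-guidance step of FIG. 1 (assumptions noted above).
import torch
import torch.nn.functional as F

def self_guidance_losses(cnn, attention_fn, image, ground_truth, target_class,
                         threshold=0.5):
    # First pass: score vector 109 vs. ground truth 111 gives classification loss 113.
    # `ground_truth` is a float multi-label vector of shape (1, num_classes).
    scores = torch.sigmoid(cnn(image))
    classification_loss = F.binary_cross_entropy(scores, ground_truth)

    # Attention map 115 for the target class, thresholded to define a mask.
    attention = attention_fn(cnn, image, target_class)   # (1, 1, H, W) in [0, 1]
    mask = (attention < threshold).float()               # 1 keeps a pixel, 0 masks it out

    # Second pass on the masked image 117: its target-class score should be small,
    # and is used directly as the self-supervision / attention loss 123.
    masked_scores = torch.sigmoid(cnn(image * mask))
    attention_loss = masked_scores[0, target_class]

    return classification_loss, attention_loss
```

Using the masked image's own prediction score as the attention loss means the loss can only become small when the mask, and therefore the attention map, covers the whole object rather than a single salient part.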
[0023] The attention loss 123 may be used to assist the training of the CNN 103 so as to ultimately minimize the attention loss 123. When the attention loss 123 is minimized, it indicates that masking the attention-map region in the second pass through the CNN 119 leaves the fewest pixels that support a positive classification of the target class. Accordingly, the identified attention map 115 may be expected to contain as much of the whole target object as possible. This is an improvement over prior weakly-supervised techniques, which provide a less complete attention map, resulting in lower classification probabilities using CNNs. The improved method produces a more complete attention mask, resulting in higher classification rates, while retaining the efficiency of prior methods: the training data requires only high-level labels at the image level.
[0024] To provide improved and more accurate attention maps, which may be used as segmentation priors, the classification ability of the network is used to design a self-guidance loss that in turn guides the training process of the deep neural network (DNN). The self-guidance loss 123 allows the DNN to focus on and learn the whole representation of the target object in the image. The self-guidance loss 123, computed using only the image-level labels associated with the input images 101, places constraints on the attention map. Accordingly, the weakly-supervised DNN with a self-guidance constraint produces more accurate attention maps. While conventional weakly-supervised methods cannot produce an attention map that is sufficient to be used for segmentation, the improved attention map includes more of the target object and is useful as a segmentation prior. The technique of the first embodiment may be used with any image modality, including but not limited to RGB images, and is useful in any neural network where a classifier is trainable.
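One way to combine the two losses into a single training objective, assuming a scalar weighting hyperparameter λ that is not specified in this disclosure, is:

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{cls}} + \lambda \, \mathcal{L}_{\mathrm{att}}
```

where the first term corresponds to the classification loss 113 and the second to the attention loss 123; minimizing the second term drives the masked image's prediction score toward zero.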
[0025] A test was performed to demonstrate the first embodiment, focusing on the task of object segmentation using the Pascal Visual Object Classes (VOC) 2012 dataset. For the training data setup, the RGB images in the VOC 2012 dataset were used along with their corresponding image-level labels, such as “dog”, “horse”, and the like. As stated earlier, this technique does not rely on any low-level labels, such as segmentation masks or bounding boxes. The only criterion used to augment training of the network is the self-guidance or attention loss 123.
[0026] FIG. 2 is an illustration of a masked and cropped input image according to certain embodiments of this disclosure. A predetermined threshold is set on the attention map of a target class; in the example of FIG. 2 the target class is “sheep”. The threshold is used to define a mask that is used to crop the original image. The cropped image is then passed forward through the DNN to produce a prediction score for the image. The score for the cropped image should be much smaller than the score produced by the original image.
[0027] Referring to FIG. 2, an input image depicting the target object (e.g., a sheep) is shown. A complete (i.e., unmasked) version of the input image 201 is passed through the deep neural network to produce a classification loss, which is backpropagated through the neural network to produce an attention map. Global average pooling may be performed on the gradient to produce a weight vector. The weight vector is then applied to form a weighted sum of the feature maps of the last convolutional layer and produce an attention map 203. As seen in attention map 203, the focus is mainly constrained to the head and, to a lesser degree, the hindquarters of the target object. This results in a probability of identifying the target class of 0.405. When the resulting classification loss is used to mask the portion of the image corresponding to the attention map 203, the result is masked image 205, which excludes the head and hindquarters of the target object. The resulting attention map 207 based on masked image 205 is now focused on the torso of the target object and, to a lesser degree, the fore shoulder. It is noted that the new prediction score now shows a probability of identification as the target object of 0.014, much less than that of the original unmasked image 201. Image 209 shows a masked image in which the regions of attention map 203 and attention map 207 are masked out of the input image. The resulting attention map 211 is now focused on the tail, belly, and legs of the target object and results in an identification probability of 0.008. The accumulation of the attention maps 203, 207, and 211 yields a more complete area of focus for identifying the target class, as evidenced by the low prediction scores obtained when these areas are progressively masked and excluded from consideration, as depicted in image 209.
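The progressive masking of FIG. 2 can be summarized by the short sketch below. It is an illustration only: attention_fn is the same kind of assumed helper as in the earlier sketch, and the reported scores (0.405, 0.014, 0.008) are those of the FIG. 2 example rather than values the code would necessarily reproduce.

```python
# Illustrative sketch of the progressive masking of FIG. 2: regions already covered
# by earlier attention maps are removed and the target-class prediction is re-measured.
import torch

@torch.no_grad()
def class_score(cnn, image, target_class):
    return torch.sigmoid(cnn(image))[0, target_class].item()

def progressive_masking(cnn, attention_fn, image, target_class,
                        threshold=0.5, steps=2):
    mask = torch.ones_like(image[:, :1])                 # 1 = keep pixel, 0 = masked out
    scores = [class_score(cnn, image, target_class)]     # unmasked image, e.g. 0.405 in FIG. 2
    for _ in range(steps):
        attention = attention_fn(cnn, image * mask, target_class)
        mask = mask * (attention < threshold).float()    # accumulate masked-out regions
        scores.append(class_score(cnn, image * mask, target_class))
    return scores                                        # FIG. 2 reports 0.405, 0.014, 0.008
```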
[0028] FIG. 3 provides examples of improved attention maps generated by a method according to aspects of certain embodiments of this disclosure. During testing using an arbitrary input image, an attention map may be generated with one forward pass through the neural network. In FIG. 3, the first image 301 shows an attention map generated by conventional means such as Grad-CAM. The second image 303 shows an improved attention map produced using the method described above with respect to the first embodiment. The attention map 303 focuses on more of the target object (e.g., cat), including the head, legs, body, and tail. This is not achieved through conventional weakly-supervised techniques. Similarly, image 305 shows a second image of a cat and its associated attention map using conventional techniques. The conventional techniques focus solely on the face and miss the legs and body. In contrast, image 307 shows an attention map produced using the technique of the first embodiment, which includes more of the image pertaining to the target object. The improved attention maps of images 303, 307 represent an improvement in the generation of attention maps. The improved attention maps provide better prediction scores and may be used, for example, as segmentation priors.
[0029] A second embodiment for generating improved attention maps for visual recognition in images will now be described. As discussed above, weakly-supervised methods for determining attention maps identify portions of the image that support a classifier’s prediction. Because weakly-supervised methods can generate attention maps without the need for low-level annotations, these methods conserve resources in many industrial applications. However, due to the powerful modeling ability of DNNs, the network may learn to select unimportant areas from the background of the image, or to exploit some bias of the dataset, in order to satisfy the image-level supervision. This creates risk in cases where the data exhibits varying backgrounds or other variances. The second embodiment will be described with reference to a model for detecting coarse rotation of a target object. In an instructive example, a substantially symmetrically shaped object is used to demonstrate detection of rotation of the target object.
[0030] FIG. 4 is an illustration of a symmetrical target object 410 and its associated attention map 403 using conventional weakly-supervised methods. Image 401 shows the symmetrical object 410 in a first position, designated as “normal”. The trained DNN captures part of the background 403 as important regions for the attention map although these regions do not contain the target object. This condition persists in image 405, where the symmetric object 410 is rotated 90° to a second position denoted as “abnormal”. Again, the DNN identifies part of the background 407 and includes this area in the attention map even though the target object 410 does not appear in that area of the image 405. The example shown in FIG. 4 was performed using a conventional weakly-supervised method for generating the attention map and successfully detected the rotation of the target object 410. However, the attention map 403 mainly focuses on the background rather than on the texture and markers on the surface of the target object 410. Occurrences like this, which capture the bias in the dataset, create undesired risk of misidentification in industrial applications. To improve this, an end-to-end deep neural network is built having supervision using class-level labels combined with a very small number of images (e.g., about 5% of the images) that include mask labels generated by bounding boxes. According to other embodiments, the percentage of images with low-level mask labels may range from about 3% to about 7% of all training images. The mask labels are associated with a minority percentage of all training images; a minority percentage contains fewer images with low-level labeling than images that do not contain low-level labels. Neural networks trained in this manner display superior performance to conventionally trained networks when tested on datasets having a different distribution than the data used to train the network, yet require less effort and time than systems that include low-level annotations for all training images.
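For the minority of images that carry bounding-box annotations, one plausible way to derive the mask labels mentioned above (box area set to 1, remaining pixels set to 0) is sketched below; the coordinate format and image size are assumptions for illustration.

```python
# Sketch: deriving a binary mask label from a bounding-box annotation for the small
# fraction of training images that carry low-level labels.
import torch

def box_to_mask(box, height, width):
    """box = (x_min, y_min, x_max, y_max), integer pixel coordinates (assumed format)."""
    mask = torch.zeros(height, width)
    x_min, y_min, x_max, y_max = box
    mask[y_min:y_max, x_min:x_max] = 1.0   # bounding-box area labeled 1, rest 0
    return mask

# Example: a 224x224 training image whose target object is annotated with a box.
mask_label = box_to_mask((40, 30, 180, 160), height=224, width=224)
```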
[0031] FIG. 5 is a block diagram illustrating the generation of an improved attention map according to aspects of a second embodiment of the present disclosure. An input image 501 is presented to CNN 503. Within the convolutional layers of CNN 503, one or more filters are applied, with each filter producing a feature map. In this way, each convolutional layer produces a set of convolutional feature maps 505. The CNN 503 further includes one or more fully connected layers 507. The input image 501 passes through the convolutional layers 505 and the fully connected layers 507 to produce a score vector 509 that contains a value for each classifier representing the probability that the associated target class is detected in the input image 501. A ground truth vector 513 contains the actual classification data for the input image. Comparing the score vector 509 with the ground truth 513 results in a classification loss 511. The classification loss 511 is used to produce weights 515 on each feature map 505. A signal is set to a value of 1 for the target class and 0 for all others. The gradient is backpropagated through the CNN to the last convolutional layer. Global average pooling is performed on the gradient to produce the weight vector 515. Applying the weight vector 515 to calculate a weighted sum of the feature maps 505 of the last convolutional layer produces the attention map 517. By using the classification loss 511 in combination with training data that includes a small percentage of low-level labels, the low-level labels providing bounding boxes used as mask labels, the improved attention map 517 is more focused on areas of the image containing the target object. The low-level labels are associated with a minority percentage of all training images; a minority percentage contains fewer images with low-level labeling than images that do not contain low-level labels. The improved attention map 517 may be compared to the ground truth mask 519 to calculate an attention loss 521. The system creates two losses, the classification loss 511 and the attention loss 521, to guide the neural network 503 in focusing on areas containing the target object and to make decisions accordingly. The result of the architecture of FIG. 5 is an improved attention map that allows the neural network to concentrate on the appropriate region of interest in the input image and to make more accurate decisions based on the improved attention map.
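A minimal sketch of the attention-map computation and attention loss described for FIG. 5 follows. The split of the network into a convolutional feature extractor and a classifier head, and the use of an L1-style penalty against the ground-truth mask, are assumptions made for illustration; the disclosure does not limit the loss to this form.

```python
import torch
import torch.nn.functional as F

def compute_attention_map(features, classifier_head, image, target_class):
    """Attention map 517: back-propagate a one-hot class signal, global-average-pool
    the gradient to obtain the weight vector 515, and take a weighted sum of the
    last convolutional feature maps 505."""
    feature_maps = features(image)                       # (1, C, h, w), last conv output
    scores = classifier_head(feature_maps)               # (1, num_classes)

    # Gradient of the target-class score with respect to the last convolutional layer.
    grads = torch.autograd.grad(scores[0, target_class], feature_maps,
                                create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)       # global average pooling
    attention = F.relu((weights * feature_maps).sum(dim=1, keepdim=True))
    return attention / (attention.max() + 1e-8)          # normalized to [0, 1]

def attention_loss(attention, ground_truth_mask):
    """Attention loss 521 for the minority of images that carry a mask label.
    ground_truth_mask: (1, 1, H, W) binary tensor; an L1 penalty is assumed."""
    mask = F.interpolate(ground_truth_mask, size=attention.shape[-2:], mode="nearest")
    return F.l1_loss(attention, mask)
```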
[0032] FIG. 6 shows two attention maps generated according to the second embodiment depicted in FIG. 5. Attention map 601 shows the target object 410 in a “normal” position. The attention map identifies the surface of the target object 410 to identify the object and pose. In contrast to the attention map shown in FIG. 4, improved attention map 601 does not pick up portions of the background or data bias in the image. Attention map 603 shows the target object 410 in an “abnormal” position. The attention map identifies the surface of the target object 410 to identify the object and pose. In contrast to the attention map shown in FIG. 4, attention map 603 does not pick up portions of the background or data bias in the image. The second embodiment, in which a small number of training samples include a bounding box for the target object, was tested and compared to a traditionally trained classification network on a dataset containing varied backgrounds and viewpoints. Testing resulted in an accuracy of only 45% for the traditionally trained classification network, while the network trained according to the described second embodiment achieved an accuracy rate of 100%. This embodiment has been shown to improve the training of a DNN to generate improved attention maps that focus on the appropriate relevant areas of the image. The improved attention maps are more reliable for critical processes such as industrial applications.
[0033] The second embodiment presents a semi-supervised neural network-based method for providing attention map guidance to perform visual recognition on images using a small amount of labeled data with low-level annotations. The low-level annotations provide supervision on the attention map of the network. The DNN can thereby learn to focus on the important areas of the image and make improved decisions accordingly. The described method provides robustness to varying backgrounds, view perspectives, and other dataset biases.
[0034] A third embodiment according to this disclosure will now be described. As has already been discussed above, weakly supervised techniques for generating attention maps for visual recognition save resources by not requiring large amounts of annotated data. However, the resulting attention maps are less reliable and may not be suitable for more critical tasks. According to a third embodiment of this disclosure, small amounts of low-level annotated data are included during training of a DNN in the form of segmentation masks to guide the DNN in learning to focus on a target object as a whole in the input image.
[0035] According to this embodiment, a small amount of the training data, for example about 13% of the training samples, includes low-level segmentation labels along with the image-level class label. In some embodiments, the proportion of images with low-level segmentation labels ranges from about 10 percent to about 15 percent of all the training images. Using this technique, the DNN learns to focus on the entire target object and produce an improved attention map. The improved attention map is suitable for use as a segmentation prior.
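A simple way to organize such a partially mask-labeled training set, with placeholder identifiers and fractions, might look like the following; for images whose flag is False, only the image-level label is used and the attention loss is simply skipped.

```python
# Sketch: marking which training images keep their ground-truth segmentation masks
# so that only a minority (about 13% here) carry low-level labels. The image-id list,
# the fraction, and the seed are placeholders.
import random

def split_mask_supervision(image_ids, mask_fraction=0.13, seed=0):
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_with_masks = int(round(mask_fraction * len(ids)))
    keep_mask = set(ids[:n_with_masks])
    # True: use image-level label and segmentation mask; False: image-level label only.
    return {image_id: (image_id in keep_mask) for image_id in ids}
```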
[0036] The task of object segmentation using the VOC 2012 dataset is presented as an example to describe this embodiment. For the training data setup, the data are divided into two parts. A first portion of the data has only the image-level label, such as ‘dog’, ‘horse’, and so on. The other portion of the data includes both the image-level label and, additionally, the ground truth segmentation masks indicating all pixels in the image belonging to the target class. The method relies on image-level labels and a small amount of data with additional ground truth segmentation masks. In one example, the percentage of data samples including the additional ground truth segmentation masks may be from about 10% to about 15%. In one exemplary test, annotated images that have both the image-level label and the mask label occupy only 13% of the entire dataset. These numerical examples used in the experiment are merely provided by way of example to help provide an understanding of the technique, and they may be adapted for other applications.
[0037] After generating such a dataset with image-level labels, a small amount of which also includes mask labels, a DNN classifier is trained with the partially low-level annotated training data.
[0038] FIG. 7 is a diagram of the framework for performing a method for generating improved attention maps according to aspects of the third embodiment of the present disclosure. In some embodiments where a limited amount of training data is available, a PyTorch VGG19 model that is pre-trained on ImageNet and fine-tuned with additional training data may be used. However, it should be understood that other network topologies may be used in combination with, or in replacement of, such a system. An input image 701 is presented to the trained network, which performs a forward pass through one or more convolutional layers 703 and one or more fully connected layers 707 to produce a score vector 709. The score vector contains a probability that the image contains a particular class, with each classification having an entry in the score vector 709. The score vector 709 is then compared to a ground truth class label 711 of the image 701 to calculate the classification loss 713. A signal, in which the pixels containing the target class have a value of 1 and all others have a value of 0, is back-propagated and applied to the gradient until the last convolutional layer 703 is reached. Global average pooling may be applied to the gradient to obtain a weight vector 710, which is used to calculate a weighted sum of the last convolutional layer feature maps 705 and obtain the attention map 715. Finally, the L1 loss between the attention map 715 and the ground truth mask 717 is calculated and denoted as the attention loss 719. Accordingly, two losses are defined: the classification loss 713 and the attention loss 719. These two losses serve to guide the deep neural network 703 to focus on the complete target objects and learn to make decisions correspondingly. During testing, embodiments according to the third embodiment described herein took an arbitrary image as input and output an improved attention map 715 with one forward pass.
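A hedged sketch of a training step for this framework is shown below, using torchvision's pre-trained VGG19 as a stand-in for the network of FIG. 7. The optimizer, the loss weighting lam, the 20-class output head, and the compute_attention_map helper (sketched above in connection with FIG. 5) are assumptions rather than requirements of the described method.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Stand-in for the network of FIG. 7: VGG19 convolutional layers (703) followed by a
# small fully connected head (707); the head and the class count are assumptions.
vgg = models.vgg19(pretrained=True)            # torchvision argument names may differ by version
features = vgg.features                        # convolutional layers 703
classifier_head = torch.nn.Sequential(         # fully connected layers 707
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(512, 20),                  # e.g., 20 VOC object classes
)
optimizer = torch.optim.SGD(
    list(features.parameters()) + list(classifier_head.parameters()), lr=1e-3)

def training_step(image, label_vector, target_class, gt_mask=None, lam=1.0):
    # Classification loss 713 from the score vector 709 and the class label 711.
    scores = classifier_head(features(image))
    cls_loss = F.binary_cross_entropy_with_logits(scores, label_vector)

    # Attention loss 719: L1 between attention map 715 and ground-truth mask 717,
    # evaluated only for the minority of samples that carry a mask label.
    # compute_attention_map is the helper sketched earlier for FIG. 5.
    att_loss = torch.zeros((), device=image.device)
    if gt_mask is not None:
        attention = compute_attention_map(features, classifier_head, image, target_class)
        mask = F.interpolate(gt_mask, size=attention.shape[-2:], mode="nearest")
        att_loss = F.l1_loss(attention, mask)

    loss = cls_loss + lam * att_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return cls_loss.item(), att_loss.item()
```

At test time, the same attention-map computation can be run once on an arbitrary input image to obtain the improved attention map 715.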
[0039] FIG. 8 shows some examples of the attention output according to the third embodiment. The attention maps generated by the embodied method are more accurate and complete than those generated by the traditional method. As seen in the examples of FIG. 8, an image of a dog 801 is shown. The attention map generated by conventional techniques is shown in 801a. The improved attention map according to the described embodiment is shown in 801b. The improved attention map 801b includes more of the image containing the target object. For example, the improved attention map 801b includes the dog’s torso and legs, which are ignored by the conventional technique in 801a. For other target objects, the improvements are similar. Attention maps 803a and 803b illustrate the improved attention map 803b for a target object of a cat. Attention maps 805a and 805b include a group of horses. The improved attention map 805b focuses on the entire body of the horse and learns the presence of a second horse that is not recognized in the attention map 805a generated by traditional methods. Attention maps 807a and 807b likewise provide an improved attention map for two cats, detecting the presence of a second cat not directly facing the image capture device and covering more of the first cat’s face and head as compared to the conventionally generated attention map 807a. The attention map generated by the described method can serve as a better segmentation prior.
[0040] FIG. 9 illustrates an exemplary computing environment 900 within which embodiments of the invention may be implemented. Computers and computing environments, such as computer system 910 and computing environment 900, are known to those of skill in the art and thus are described briefly here.
[0041] As shown in FIG. 9, the computer system 910 may include a communication mechanism such as a system bus 921 or other communication mechanism for communicating information within the computer system 910. The computer system 910 further includes one or more processors 920 coupled with the system bus 921 for processing the information.
[0042] The processors 920 may include one or more central processing units (CPUs), graphical processing units (GPUs), or any other processor known in the art. More generally, a processor as used herein is a device for executing machine-readable instructions stored on a computer readable medium, for performing tasks and may comprise any one or combination of, hardware and firmware. A processor may also comprise memory storing machine-readable instructions executable for performing tasks. A processor acts upon information by manipulating, analyzing, modifying, converting or transmitting information for use by an executable procedure or an information device, and/or by routing the information to an output device. A processor may use or comprise the capabilities of a computer, controller or microprocessor, for example, and be conditioned using executable instructions to perform special purpose functions not performed by a general-purpose computer. A processor may be coupled (electrically and/or as comprising executable components) with any other processor enabling interaction and/or communication there-between. A user interface processor or generator is a known element comprising electronic circuitry or software or a combination of both for generating display images or portions thereof. A user interface comprises one or more display images enabling user interaction with a processor or other device.
[0043] Continuing with reference to FIG. 9, the computer system 910 also includes a system memory 930 coupled to the system bus 921 for storing information and instructions to be executed by processors 920. The system memory 930 may include computer readable storage media in the form of volatile and/or nonvolatile memory, such as read only memory (ROM) 931 and/or random-access memory (RAM) 932. The RAM 932 may include other dynamic storage device(s) (e.g., dynamic RAM, static RAM, and synchronous DRAM). The ROM 931 may include other static storage device(s) (e.g., programmable ROM, erasable PROM, and electrically erasable PROM). In addition, the system memory 930 may be used for storing temporary variables or other intermediate information during the execution of instructions by the processors 920. A basic input/output system 933 (BIOS) containing the basic routines that help to transfer information between elements within computer system 910, such as during start-up, may be stored in the ROM 931. RAM 932 may contain data and/or program modules that are immediately accessible to and/or presently being operated on by the processors 920. System memory 930 may additionally include, for example, operating system 934, application programs 935, other program modules 936 and program data 937.
[0044] The computer system 910 also includes a disk controller 940 coupled to the system bus 921 to control one or more storage devices for storing information and instructions, such as a magnetic hard disk 941 and a removable media drive 942 (e.g., floppy disk drive, compact disc drive, tape drive, and/or solid state drive). Storage devices may be added to the computer system 910 using an appropriate device interface (e.g., a small computer system interface (SCSI), integrated device electronics (IDE), Universal Serial Bus (USB), or FireWire).
[0045] The computer system 910 may also include a display controller 965 coupled to the system bus 921 to control a display or monitor 966, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user. The computer system includes an input interface 960 and one or more input devices, such as a keyboard 962 and a pointing device 961, for interacting with a computer user and providing information to the processors 920. The pointing device 961, for example, may be a mouse, a light pen, a trackball, or a pointing stick for communicating direction information and command selections to the processors 920 and for controlling cursor movement on the display 966. The display 966 may provide a touch screen interface which allows input to supplement or replace the communication of direction information and command selections by the pointing device 961. In some embodiments, an augmented reality device 967 that is wearable by a user may provide input/output functionality allowing a user to interact with both a physical and virtual world. The augmented reality device 967 is in communication with the display controller 965 and the user input interface 960, allowing a user to interact with virtual items generated in the augmented reality device 967 by the display controller 965. The user may also provide gestures that are detected by the augmented reality device 967 and transmitted to the user input interface 960 as input signals.
[0046] The computer system 910 may perform a portion or all of the processing steps of embodiments of the invention in response to the processors 920 executing one or more sequences of one or more instructions contained in a memory, such as the system memory 930. Such instructions may be read into the system memory 930 from another computer readable medium, such as a magnetic hard disk 941 or a removable media drive 942. The magnetic hard disk 941 may contain one or more datastores and data files used by embodiments of the present invention. Datastore contents and data files may be encrypted to improve security. The processors 920 may also be employed in a multi-processing arrangement to execute the one or more sequences of instructions contained in system memory 930. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions. Thus, embodiments are not limited to any specific combination of hardware circuitry and software.
[0047] As stated above, the computer system 910 may include at least one computer readable medium or memory for holding instructions programmed according to embodiments of the invention and for containing data structures, tables, records, or other data described herein. The term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processors 920 for execution. A computer readable medium may take many forms including, but not limited to, non-transitory, non-volatile media, volatile media, and transmission media. Non-limiting examples of non-volatile media include optical disks, solid state drives, magnetic disks, and magneto-optical disks, such as magnetic hard disk 941 or removable media drive 942. Non-limiting examples of volatile media include dynamic memory, such as system memory 930. Non-limiting examples of transmission media include coaxial cables, copper wire, and fiber optics, including the wires that make up the system bus 921. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications.
[0048] The computing environment 900 may further include the computer system 910 operating in a networked environment using logical connections to one or more remote computers, such as remote computing device 980. Remote computing device 980 may be a personal computer (laptop or desktop), a mobile device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to computer system 910. When used in a networking environment, computer system 910 may include modem 972 for establishing communications over a network 971, such as the Internet. Modem 972 may be connected to system bus 921 via user network interface 970, or via another appropriate mechanism.
[0049] Network 971 may be any network or system generally known in the art, including the Internet, an intranet, a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), a direct connection or series of connections, a cellular telephone network, or any other network or medium capable of facilitating communication between computer system 910 and other computers (e.g., remote computing device 980). The network 971 may be wired, wireless or a combination thereof. Wired connections may be implemented using Ethernet, Universal Serial Bus (USB), RJ-6, or any other wired connection generally known in the art. Wireless connections may be implemented using Wi-Fi, WiMAX, Bluetooth, infrared, cellular networks, satellite, or any other wireless connection methodology generally known in the art. Additionally, several networks may work alone or in communication with each other to facilitate communication in the network 971.
[0050] An executable application, as used herein, comprises code or machine-readable instructions for conditioning the processor to implement predetermined functions, such as those of an operating system, a context data acquisition system or other information processing system, for example, in response to user command or input. An executable procedure is a segment of code or machine-readable instruction, subroutine, or other distinct section of code or portion of an executable application for performing one or more particular processes. These processes may include receiving input data and/or parameters, performing operations on received input data and/or performing functions in response to received input parameters, and providing resulting output data and/or parameters.
[0051] A graphical user interface (GUI), as used herein, comprises one or more display images, generated by a display processor and enabling user interaction with a processor or other device and associated data acquisition and processing functions. The GUI also includes an executable procedure or executable application. The executable procedure or executable application conditions the display processor to generate signals representing the GUI display images. These signals are supplied to a display device which displays the image for viewing by the user. The processor, under control of an executable procedure or executable application, manipulates the GUI display images in response to signals received from the input devices. In this way, the user may interact with the display image using the input devices, enabling user interaction with the processor or other device.
[0052] The functions and process steps herein may be performed automatically or wholly or partially in response to user command. An activity (including a step) performed automatically is performed in response to one or more executable instructions or device operation without direct user initiation of the activity.
[0053] The system and processes of the figures are not exclusive. Other systems, processes and menus may be derived in accordance with the principles of the invention to accomplish the same objectives. Although this invention has been described with reference to particular embodiments, it is to be understood that the embodiments and variations shown and described herein are for illustration purposes only. Modifications to the current design may be implemented by those skilled in the art, without departing from the scope of the invention. As described herein, the various systems, subsystems, agents, managers and processes can be implemented using hardware components, software components, and/or combinations thereof.

Claims

What is claimed is:
1. A method for generating an improved attention map for an image comprising: processing an input image in a trained neural network to produce a classification loss; backpropagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network; applying the weight vector to the input image to produce an attention map; passing the attention map through the neural network to produce an attention loss; and using the classification loss and the attention loss to provide supervision to the neural network to produce an improved attention map.
2. The method of Claim 1, further comprising: generating, by the processing of the input image by the neural network, a score vector representing probabilities that the input image contains a classifier from a plurality of classifiers.
3. The method of Claim 2, wherein the classification loss is calculated by comparing the score vector to a ground truth vector containing the actual presence of each classifier of the plurality of classifiers.
4. The method of Claim 1, wherein backpropagating the classification loss further comprises: setting a signal where pixels of a target object in the image have a value of one, and all other pixels have a value of zero to produce a gradient; backpropagating the gradient through the neural network to the last convolutional layer in the neural network; and performing global average pooling on the gradient to produce the weight vector.
5. The method of Claim 4, further comprising: calculating a weighted sum of a set of feature maps associated with the last convolutional layer to produce the attention map.
6. The method of Claim 1, further comprising: producing a segmentation mask for the input image based on the attention map, wherein regions of the input image identified in the attention map as a target object are masked prior to passing the attention map through the neural network.
7. The method of Claim 6, wherein the regions of the input image that are masked are selected based on exceeding a pre-selected threshold placed on the attention map.
8. The method of Claim 7, wherein retraining the network to produce an improved attention map further comprises: selecting the threshold placed on the attention map to minimize a prediction score for the target object when the masked input image is passed through the neural network, the attention loss comprising the prediction score.
9. The method of Claim 1, further comprising: processing an input image using the classification loss and the attention loss to identify a target object in the image, wherein the classification loss and attention loss are used to supervise machine learning performed by the neural network.
10. A method for generating an improved attention map in an image comprising: collecting a training dataset for a neural network, the collected training dataset containing a minority percentage of training images that include a low-level segmentation mask label; training a neural network with the collected dataset to produce a trained neural network; processing an input image in the trained neural network to produce a classification loss; backpropagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network; applying the weight vector to the input image to produce an attention map; passing the attention map through the neural network to produce an attention loss; and using the classification loss and the attention loss to guide the neural network to provide supervised learning to produce an improved attention map.
11. The method of Claim 10, wherein the minority percentage of training images is between about 10 percent and 15 percent of the dataset of training images.
12. The method of Claim 10, wherein the minority percentage of training images is about 13 percent of the dataset of training images.
13. The method of Claim 10, wherein the low-level segmentation mask comprises all pixels belonging to a particular target classifier in an associated training image.
14. The method of Claim 10, wherein the minority percentage of training images is adapted to a specific machine vision application.
15. A method for generating an improved attention map in an image comprising: collecting a training dataset for a neural network, the collected training dataset containing a minority percentage of training images that include a low-level image-level label; training a neural network with the collected dataset to produce a trained neural network; processing an input image in the trained neural network to produce a classification loss; backpropagating the classification loss through the neural network to produce a weight vector in a last convolutional layer of the neural network; applying the weight vector to the input image to produce an attention map; passing the attention map through the neural network to produce an attention loss; and using the classification loss and the attention loss to guide the neural network to provide supervision to produce an improved attention map.
16. The method of Claim 15, wherein the minority percentage of training images is between about 3 percent and 7 percent of the dataset of training images.
17. The method of Claim 15, wherein the minority percentage of training images is about 5 percent of the dataset of training images.
18. The method of Claim 15, wherein the minority percentage of training images is adapted to a specific machine vision application.
19. The method of Claim 15, wherein the low-level image-level label comprises a bounding box enclosing a surface portion of a particular target classifier in an associated training image.
20. The method of Claim 19, further comprising: generating a mask in the training image by setting an area defined by the bounding box as a 1 and other regions of the image as a 0; and training the neural network using the masked training image.
PCT/US2020/024807 2020-03-26 2020-03-26 Method and system for improved attention map guidance for visual recognition in images WO2021194490A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2020/024807 WO2021194490A1 (en) 2020-03-26 2020-03-26 Method and system for improved attention map guidance for visual recognition in images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2020/024807 WO2021194490A1 (en) 2020-03-26 2020-03-26 Method and system for improved attention map guidance for visual recognition in images

Publications (1)

Publication Number Publication Date
WO2021194490A1 true WO2021194490A1 (en) 2021-09-30

Family

ID=70465241

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/024807 WO2021194490A1 (en) 2020-03-26 2020-03-26 Method and system for improved attention map guidance for visual recognition in images

Country Status (1)

Country Link
WO (1) WO2021194490A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210406693A1 (en) * 2020-06-25 2021-12-30 Nxp B.V. Data sample analysis in a dataset for a machine learning model
CN114037699A (en) * 2021-12-07 2022-02-11 中国医学科学院北京协和医院 Pathological image classification method, equipment, system and storage medium
WO2023081095A1 (en) * 2021-11-05 2023-05-11 Subtle Medical, Inc. Systems and methods for multi-contrast multi-scale vision transformers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUNPENG LI ET AL: "Guided Attention Inference Network", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 7 June 2019 (2019-06-07), USA, pages 1 - 1, XP055743399, ISSN: 0162-8828, DOI: 10.1109/TPAMI.2019.2921543 *


Similar Documents

Publication Publication Date Title
Daftry et al. Introspective perception: Learning to predict failures in vision systems
Wang et al. Multi-label image recognition by recurrently discovering attentional regions
Li et al. Deepsaliency: Multi-task deep neural network model for salient object detection
Su et al. Deep multi-state object pose estimation for augmented reality assembly
CN114902294A (en) Fine-grained visual recognition in mobile augmented reality
WO2021194490A1 (en) Method and system for improved attention map guidance for visual recognition in images
Jayaraman et al. End-to-end policy learning for active visual categorization
JP7286013B2 (en) Video content recognition method, apparatus, program and computer device
US10614575B2 (en) Searching trees: live time-lapse cell-cycle progression modeling and analysis
CN112990222B (en) Image boundary knowledge migration-based guided semantic segmentation method
Wang et al. Rethinking the learning paradigm for dynamic facial expression recognition
US20220366244A1 (en) Modeling Human Behavior in Work Environment Using Neural Networks
Muthu et al. Motion segmentation of rgb-d sequences: Combining semantic and motion information using statistical inference
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
CN114330588A (en) Picture classification method, picture classification model training method and related device
Lee et al. Landing area recognition using deep learning for unammaned aerial vehicles
Tvoroshenko et al. Object identification method based on image keypoint descriptors
Teng et al. Clickbait-v2: Training an object detector in real-time
WO2021226296A1 (en) Semi-automated image annotation for machine learning
Ramasamy et al. Object detection and tracking in video using deep learning techniques: A review
Hajar et al. Autonomous UAV-based cattle detection and counting using YOLOv3 and deep sort
Wu et al. Weighted classification of machine learning to recognize human activities
Jakob et al. Extracting training data for machine learning road segmentation from pedestrian perspective
Dugăeșescu et al. Evaluation of Class Activation Methods for Understanding Image Classification Tasks
Le et al. A+ D-Net: Shadow detection with adversarial shadow attenuation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20721866

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20721866

Country of ref document: EP

Kind code of ref document: A1