CN115063655A - Class activation mapping graph generation method fusing supercolumns - Google Patents

Class activation mapping graph generation method fusing supercolumns

Info

Publication number
CN115063655A
Authority
CN
China
Prior art keywords
feature
region
map
feature maps
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111655904.4A
Other languages
Chinese (zh)
Inventor
刘晶晶
吕学强
游新冬
韩晶
刘国明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aerospace Automatic Control Research Institute
Original Assignee
Beijing Aerospace Automatic Control Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aerospace Automatic Control Research Institute filed Critical Beijing Aerospace Automatic Control Research Institute
Priority to CN202111655904.4A priority Critical patent/CN115063655A/en
Publication of CN115063655A publication Critical patent/CN115063655A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method for generating a class activation map fusing supercolumns, which comprises the following steps. The first step: divide the network convolution layers into low, middle and high regions according to the filter channels, and extract the last convolution block of each region as the information of the three different levels. The second step: up-sample the output features of the d1 and d2 levels to the feature dimension of the low level, then deeply splice the three levels to obtain a feature map, and normalize it so that the elements it contains lie in the [0,1] interval. The third step: process the feature maps obtained in the second step in grouped batches, using a confidence algorithm on each group to obtain the confidence of that group of feature maps. The fourth step: splice the confidence results of all groups into a multi-dimensional vector, apply softmax to the vector, and take the result as the contribution degree of each feature map. The fifth step: multiply each feature map by its contribution degree and sum the results to obtain the final class activation map.

Description

Class activation mapping graph generation method fusing supercolumns
Technical Field
The invention belongs to the field of artificial intelligence.
Background
As deep learning has matured, the interpretability of neural networks has become a hot research topic. Interpretability is often tied to model visualization, which helps us understand which features guide a model when it classifies images. Many visualization techniques are known, such as visualizing the intermediate outputs of a convolutional neural network (intermediate activations), visualizing its filters, and visualizing class activation heatmaps over images. The basic principle of Class Activation Mapping (CAM) is to find the weight corresponding to each channel of the last convolutional layer's feature map by back propagation; the greater the weight, the more important the corresponding feature map. The weights are then multiplied with the feature maps to obtain the final class activation map. Although the CAM method can localize the basis for the network's judgment of a certain class, and its theoretical derivation is sound, it has a major drawback: the network must be trained a second time to obtain the weight of each feature map. The Grad-CAM algorithm combines the discriminativeness of CAM with gradient-based pixel-space visualization to obtain a high-resolution class prediction interpretation map; the technique is not limited to fully convolutional networks and can be used with ordinary CNN structures. The Grad-CAM++ algorithm builds on this to optimize the Grad-CAM result, making localization more accurate and better suited to images containing more than one object of the target class. However, because algorithms such as Grad-CAM and Grad-CAM++ use gradients to obtain the feature weights, and gradients in deep networks are prone to noise and saturation, their effect suffers. The Score-CAM algorithm was the first to break the dependence on gradients, measuring the linear weight of each feature map by the model's global confidence score for it. The Ablation-CAM and SS-CAM algorithms are also gradient-free; their visualization results are more focused and background noise is greatly reduced. However, the CAMs generated by Score-CAM, Ablation-CAM and SS-CAM depend mainly on the features of the last convolutional layer of the network and pay little attention to the features of its middle and lower layers, which easily leads to generated feature maps whose important information is incomplete and whose edge information is lost.
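By way of illustration, the following is a minimal sketch of the weighted-sum principle shared by the CAM family; the tensors are illustrative stand-ins, and how the per-channel weights are obtained is exactly what distinguishes CAM, Grad-CAM and Score-CAM.

```python
import torch
import torch.nn.functional as F

def cam_from_weights(feature_maps: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    # feature_maps: (K, H, W) activations of the last convolutional layer.
    # weights:      (K,) per-channel importance scores (a second training pass
    #               in CAM, gradients in Grad-CAM, confidence scores in Score-CAM).
    cam = torch.einsum('k,khw->hw', weights, feature_maps)  # weighted sum over channels
    return F.relu(cam)                                      # keep positive evidence only
```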
Disclosure of Invention
The technical problem solved by the invention is as follows: to overcome the defects of the prior art and provide a method for generating a class activation map fusing supercolumns.
The technical scheme of the invention is as follows: a method for generating a class activation map fusing supercolumns comprises the following steps:
The first step: divide the network convolution layers into three regions according to the filter channels, denoted in order the low, middle and high regions, and extract the last convolution block of each region as the information of the three different levels, where the feature dimension of the low level is denoted d0 (a0 × a0 × b0), that of the middle level d1 (a1 × a1 × b1), and that of the high level d2 (a2 × a2 × b2);
The second step: up-sample the output features of the d1 and d2 levels to the spatial size a0 × a0, then deeply splice the three levels of output features with unified dimensions to obtain a feature map, and normalize it so that the elements it contains lie in the [0,1] interval;
The third step: process the feature maps obtained in the second step in grouped batches, using a confidence algorithm on each group to obtain the confidence of that group of feature maps;
The fourth step: splice the confidence results of all groups into a multi-dimensional vector, apply softmax to the vector, and take the result as the contribution degree of each feature map;
The fifth step: multiply each feature map by the contribution degree obtained in the fourth step and sum the results to obtain the final class activation map.
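The following is a condensed PyTorch sketch of the five steps on random stand-in tensors; all shapes, and the placeholder confidence scores in step 3, are illustrative assumptions (concrete versions of each step are sketched in the detailed description below).

```python
import torch
import torch.nn.functional as F

d0, d1, d2 = torch.rand(4, 8, 8), torch.rand(6, 4, 4), torch.rand(8, 2, 2)  # step 1 stand-ins
ups = [F.interpolate(d.unsqueeze(0), size=(8, 8), mode='bilinear',
                     align_corners=False).squeeze(0) for d in (d1, d2)]      # step 2: up-sample
fused = torch.cat([d0, *ups], dim=0)                                         # step 2: deep splicing
fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)           # normalize to [0, 1]
confs = torch.rand(fused.shape[0])      # step 3 placeholder: per-map confidences from the network
contrib = torch.softmax(confs, dim=0)                                        # step 4: contributions
cam = torch.relu(torch.einsum('k,khw->hw', contrib, fused))                  # step 5: weighted sum
```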
Preferably, a0 ≠ a1 ≠ a2.
Preferably, the network convolution is divided into three levels in the first step. The features learned by a neural network are discriminative features: the first few layers generally learn low-level features such as colors and edges; the middle of the network learns texture features; and the last few layers learn distinctive, complete features with discriminating key characteristics. The invention divides the network into levels according to the different features learned: the levels at which the network learns low-level features are regarded as the low level, those that learn texture features as the middle level, and those that learn key semantic features as the high level.
Preferably, the group size in batch processing is generally of the form 2^n; the specific setting is related to the experimental effect and the memory utilization at the time, and the most appropriate value needs to be determined by trial and error.
Preferably, the feature activation map calculation process is as follows:
w_K^C = softmax(N(C(D_K)))    (4)
L^C = ReLU(Σ_K w_K^C · D_K)    (5)
where w_K^C represents the contribution degree of the K-th feature map to class C; D_K represents the K-th feature map; ReLU() is the linear rectification function; N() is the normalization that maps matrix values to the [0,1] interval; and L^C represents the resulting class activation map.
Compared with the prior art, the invention has the beneficial effects that:
since the feature activation map correlation algorithm is usually used to explain the image classification task, it tends to activate certain important areas in the image, ignoring other important areas that may exist. To solve this problem, many algorithms deliberately hide or erase object regions, forcing the model to look for more different parts, but these algorithms either hide fixed-size patches randomly or require repeated model training and response aggregation steps. There are also algorithms that extend the attention of the algorithm to non-target areas through a competing erasure strategy in an end-to-end training manner, but such strategies may gradually extend their attention to non-target areas, creating the problem of inaccurate attention. In addition, the current feature activation maps have the problem of covering edges with insufficient accuracy. Aiming at the problems existing in the current generation of the feature activation graph, the invention provides an HCscore-CAM algorithm for combining network multilayer information and generating the feature activation graph by using a batch processing mode in combination with the idea of supercolumn, wherein the algorithm is integrated into the idea of supercolumn, convolution layers at the front end, the middle end and the tail end of a trained network model form a more representative feature graph by a deep connection mode, and then the feature activation graph with wider coverage and more accurate edge information is generated by using the batch processing mode. When the number of the same target in the image is more than one, the generated class activation mapping has better effect than other algorithms.
Drawings
FIG. 1 is a schematic diagram of a supercolumn feature fusion method;
FIG. 2 is a flow chart of the HCScore-CAM algorithm of the present invention;
FIG. 3 shows a comparison of the feature activation maps generated by the HCScore-CAM algorithm and by other algorithms.
Detailed Description
The invention is further illustrated by the following examples.
1. Supercolumn feature fusion of multiple feature maps
The supercolumn idea is, on the feature map of each intermediate convolutional layer between the CNN input layer and the output layer, to take out the activation values of all units at each pixel position of the input picture and form them into a vector, thereby effectively utilizing the information of the intermediate layers of the neural network. Following the supercolumn idea, the output features of the low-, middle- and high-end convolutional layers in the CNN are extracted and composed, by deep connection, into a composite feature containing all three, as shown in FIG. 1. In a CNN, deep features have a large receptive field and rich semantic information, but their reduced resolution loses edge information; shallow features, in contrast, contain rich edge detail. Extracting both deep and shallow features to generate a composite feature, and using that composite feature to generate the feature activation map, can effectively alleviate the problem of unclear edges and highlight more of the important information.
Because a CNN has many convolutional layers and adjacent layers are strongly correlated, the invention divides the network convolution into high, middle and low regions according to the number of filter channels and extracts only the last convolution block of each region as the information of the three levels. Because the dimensions of the low, middle and high levels of output features differ, the invention first up-samples the middle and high levels to match the dimension of the low level, and then normalizes the three levels of output features with unified dimensions so that the elements they contain lie in the [0,1] interval. The specific calculation is shown in formula (1), where d_k represents the feature layer to be converted, Up() represents the up-sampling calculation, and N() represents the normalization calculation.
D_k = N(Up(d_k))    (1)
D = concat(D_low, D_mid, D_high)    (2)
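A PyTorch sketch of formulas (1) and (2) follows, assuming (C, H, W) tensors and per-channel min-max normalization for N(); the patent does not spell out the normalization, so min-max is an assumption, and the helper name fuse_hypercolumn is illustrative.

```python
import torch
import torch.nn.functional as F

def fuse_hypercolumn(d0: torch.Tensor, d1: torch.Tensor, d2: torch.Tensor) -> torch.Tensor:
    h, w = d0.shape[-2:]                       # low-level spatial size a0 x a0
    maps = []
    for d in (d0, d1, d2):
        up = F.interpolate(d.unsqueeze(0), size=(h, w), mode='bilinear',
                           align_corners=False).squeeze(0)   # Up(): match low-level size
        lo = up.amin(dim=(-2, -1), keepdim=True)
        hi = up.amax(dim=(-2, -1), keepdim=True)
        maps.append((up - lo) / (hi - lo + 1e-8))            # N(): map each channel to [0, 1]
    return torch.cat(maps, dim=0)              # deep connection along the channel axis
```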
2. Generation of feature activation graphs
To generate the feature activation map, the contribution degree of each feature map to the current classification result is first obtained; the feature activation map is then obtained by multiplying each feature map by its corresponding contribution degree and summing linearly. The specific steps are as follows:
The first step: taking the feature maps contained in the supercolumn fusion feature map as masks, cover the original image with each in turn; put each covered image into the same network again to obtain the score corresponding to the class of the original image; the difference between this score and the score of the original image is taken as the confidence of the current feature map, as shown in formula (3).
C(A_l) = f(X_0 * H_l) - f(X_0)    (3)
where X_0 represents the input image, H_l represents the l-th feature map after convolutional-layer fusion, C(A_l) represents the confidence corresponding to that feature map, and the f() function corresponds to the network model.
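A sketch of the confidence computation of formula (3) follows, assuming the fused maps act as multiplicative masks on the input; the masking convention and the function name group_confidence are assumptions.

```python
import torch

def group_confidence(model, image, masks, target_class):
    # image: (1, 3, H, W) preprocessed input X_0; masks: (G, H, W), one
    # batch group of fused feature maps H_l used as masks.
    with torch.no_grad():
        base = model(image)[0, target_class]          # f(X_0)
        masked = image * masks.unsqueeze(1)           # X_0 * H_l, broadcast over RGB
        scores = model(masked)[:, target_class]       # f(X_0 * H_l) for the whole group
    return scores - base                              # (G,) confidences C(A_l)
```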
The second step: splice the confidence results of all the feature maps computed in the first step into one vector as the confidence of the whole supercolumn fusion feature map, then apply normalization and softmax to obtain the overall contribution degrees. Each entry of the vector corresponds to the contribution of one feature map in the supercolumn fusion feature map to the whole picture. The overall contribution vector is then point-multiplied with the supercolumn fusion feature map to obtain the feature activation map of HCScore-CAM. The calculation is shown in formulas (4) and (5).
w_K^C = softmax(N(C(D_K)))    (4)
L^C = ReLU(Σ_K w_K^C · D_K)    (5)
where w_K^C represents the contribution degree of the K-th feature map to class C; D_K represents the K-th feature map; ReLU() is the linear rectification function; N() is the normalization that maps matrix values to the [0,1] interval; and L^C represents the resulting class activation map.
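Formulas (4) and (5) as a sketch, reusing the same min-max assumption for N():

```python
import torch

def activation_map(fused: torch.Tensor, confidences: torch.Tensor) -> torch.Tensor:
    # fused: (K, H, W) supercolumn fusion feature map; confidences: (K,)
    c = (confidences - confidences.min()) / (confidences.max() - confidences.min() + 1e-8)  # N()
    w = torch.softmax(c, dim=0)                              # formula (4): contributions w_K^C
    return torch.relu(torch.einsum('k,khw->hw', w, fused))   # formula (5): ReLU(sum w * D_K)
```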
The flow chart of the HCScore-CAM algorithm of the invention is shown in FIG. 2.
Examples
A method for generating a class activation map fusing supercolumns comprises the following steps:
The first step: divide the network convolution into three regions, low, middle and high, according to the filter channels, and extract the last convolution block of each region as the information of the three different levels. The feature dimension of the low level is d0 (224 × 224 × 64), that of the middle level is d1 (56 × 56 × 256), and that of the high level is d2 (14 × 14 × 512).
The second step: up-sample the output features of the d1 and d2 levels to the spatial size 224 × 224, then deeply splice the three levels of output features with unified dimensions to obtain a feature map of dimension 224 × 224 × 832, and normalize it so that the elements it contains lie in the [0,1] interval.
The third step: and the feature map set obtained by the second step of deep stitching comprises 832 feature maps of 224 x 1, in order to increase the calculation speed, the 832 feature maps are processed in batch, each 32 feature maps are in a group, each group of feature maps adopts a confidence coefficient algorithm, the original image is covered by using the feature maps as masks, the covered images are put into the same network again to obtain scores corresponding to the classes of the original image, the difference value of the scores corresponding to the classes of the original image is recorded as the confidence coefficient of the group of feature maps, and the result obtained by each group is a vector with the dimension of (32 x 1).
The fourth step: and (3) splicing the confidence results of all the groups to obtain a vector with the dimension of (832 × 1), wherein each value in the vector represents the confidence of the corresponding feature map. This vector is then soft-maximized to improve the connection of each feature map to the whole, while the result is still a (832 × 1) vector representing the contribution of the feature map.
The fifth step: and multiplying the contribution degree obtained in the fourth step by the corresponding feature maps, and adding the multiplied feature maps to obtain the final class activation map.
An effect comparison experiment was performed on the feature activation maps generated under multi-target conditions by the HCScore-CAM algorithm and by other algorithms. As shown in FIG. 3, the HCScore-CAM algorithm locates multiple objects of the same class better than the Score-CAM and SS-CAM algorithms. When an image contains several similar objects that are relatively separated, the Score-CAM, SS-CAM and HCScore-CAM algorithms can each locate the objects; when the similar objects are distributed too densely, HCScore-CAM's localization is markedly better than that of the other two. This is because HCScore-CAM incorporates the low-level features of the network, which tend to contain edge features; thanks to these features, the HCScore-CAM algorithm works better for dense object localization.
Although the present invention has been described with reference to the preferred embodiments, it is not intended to limit the present invention, and those skilled in the art can make variations and modifications of the present invention without departing from the spirit and scope of the present invention by using the methods and technical contents disclosed above.

Claims (5)

1. A method for generating a class activation map fusing supercolumns, characterized by comprising the following steps:
the first step: dividing the network convolution layers into three regions according to the filter channels, denoted in order the low, middle and high regions, and extracting the last convolution block of each region as the information of the three different levels, wherein the feature dimension of the low level is denoted d0 (a0 × a0 × b0), that of the middle level d1 (a1 × a1 × b1), and that of the high level d2 (a2 × a2 × b2);
the second step: up-sampling the output features of the d1 and d2 levels to the spatial size a0 × a0, then deeply splicing the three levels of output features with unified dimensions to obtain a feature map, and normalizing it so that the elements it contains lie in the [0,1] interval;
the third step: processing the feature maps obtained in the second step in grouped batches, each group of feature maps using a confidence algorithm to obtain the confidence of that group;
the fourth step: splicing the confidence results of all groups into a multi-dimensional vector, applying softmax to the vector, and taking the result as the contribution degree of each feature map;
the fifth step: multiplying each feature map by the contribution degree obtained in the fourth step and summing the results to obtain the final class activation map.
2. The method of claim 1, wherein: a0 ≠ a1 ≠ a2.
3. The method of claim 1, wherein: in the first step, the network is divided into different levels according to the different features it learns; the levels at which the network learns low-level features are regarded as the low layer, the levels that learn texture features as the middle layer, and the levels that learn key semantic features as the high layer.
4. The method of claim 1, wherein: the preferred group size in batch processing is generally of the form 2^n, the specific setting being related to the experimental effect and the memory utilization used.
5. The method of claim 1, wherein: the feature activation graph calculation process is as follows:
w_K^C = softmax(N(C(D_K)))    (4)
L^C = ReLU(Σ_K w_K^C · D_K)    (5)
where w_K^C represents the contribution degree of the K-th feature map to class C; D_K represents the K-th feature map; ReLU() is the linear rectification function; N() is the normalization that maps matrix values to the [0,1] interval; and L^C represents the resulting class activation map.
CN202111655904.4A 2021-12-30 2021-12-30 Class activation mapping graph generation method fusing supercolumns Pending CN115063655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111655904.4A CN115063655A (en) 2021-12-30 2021-12-30 Class activation mapping graph generation method fusing supercolumns

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111655904.4A CN115063655A (en) 2021-12-30 2021-12-30 Class activation mapping graph generation method fusing supercolumns

Publications (1)

Publication Number Publication Date
CN115063655A true CN115063655A (en) 2022-09-16

Family

ID=83196650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111655904.4A Pending CN115063655A (en) 2021-12-30 2021-12-30 Class activation mapping graph generation method fusing supercolumns

Country Status (1)

Country Link
CN (1) CN115063655A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908296A (en) * 2022-11-10 2023-04-04 深圳大学 Medical image class activation mapping evaluation method and device, computer equipment and storage medium
CN115908296B (en) * 2022-11-10 2023-09-22 深圳大学 Medical image class activation mapping evaluation method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN106981080A (en) Night unmanned vehicle scene depth method of estimation based on infrared image and radar data
CN107016409A (en) A kind of image classification method and system based on salient region of image
CN106203430A (en) A kind of significance object detecting method based on foreground focused degree and background priori
CN110047139B (en) Three-dimensional reconstruction method and system for specified target
CN106228185A (en) A kind of general image classifying and identifying system based on neutral net and method
CN113240691A (en) Medical image segmentation method based on U-shaped network
CN109409240A (en) A kind of SegNet remote sensing images semantic segmentation method of combination random walk
CN114565860B (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN105787948A (en) Quick graph cutting method based on multiple deformation resolutions
CN112489054A (en) Remote sensing image semantic segmentation method based on deep learning
CN110222760A (en) A kind of fast image processing method based on winograd algorithm
CN110807485B (en) Method for fusing two-classification semantic segmentation maps into multi-classification semantic map based on high-resolution remote sensing image
CN105956570A (en) Lip characteristic and deep learning based smiling face recognition method
CN116402851A (en) Infrared dim target tracking method under complex background
CN105740917A (en) High-resolution remote sensing image semi-supervised multi-view feature selection method with tag learning
CN113361496B (en) City built-up area statistical method based on U-Net
CN115063655A (en) Class activation mapping graph generation method fusing supercolumns
CN114299101A (en) Method, apparatus, device, medium, and program product for acquiring target region of image
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN111353525A (en) Modeling and missing value filling method for unbalanced incomplete data set
CN115830537A (en) Crowd counting method
CN112990336B (en) Deep three-dimensional point cloud classification network construction method based on competitive attention fusion
CN115482463A (en) Method and system for identifying land cover of mine area of generated confrontation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination