Disclosure of Invention
The invention aims to provide a visualization algorithm oriented to the classification results of a convolutional neural network, solving the problem that prior-art methods explain the classifications of a convolutional neural network unsatisfactorily.
The purpose of the invention is realized by the following steps:
the invention discloses a visual algorithm for a convolutional neural network classification result, which is specifically realized by the following steps:
(1) Extracting a data set containing the input images, and training the convolutional neural network with the data set as a training set to obtain the trained model parameters;
(2) According to the calculation method of the Rel-CAM algorithm in the fully-connected layers, calculating, layer by layer, the contribution of each neural unit in the fully-connected layers to the output, using the output result and the model parameters, until the convolutional layers are reached;
(3) Calculating the corresponding weight of each channel and the output result of the layer according to the contribution of all the neural units in the last layer of convolutional layer obtained in the step (2) to the output, thereby obtaining a class activation mapping chart of the network model;
(4) Recording neural units with positive values in the class activation mapping, wherein the positions of the neural units are used as the positions of pixels contributing to output results in the layer, and the neural units are added into a neuron set contributing to output in the layer;
(5) Sequentially taking out each neuron in the set, calculating the Hadamard products of all neurons in its receptive field in the previous layer and the corresponding weights, summing the Hadamard products within each channel, taking the channel with the maximum sum as the contributing channel, adding the neurons with positive Hadamard products in that channel into the neuron set contributing to the output in that layer, and removing repeated neurons from the set;
(6) Repeating the propagation process of step (5) until the neuron set in the input layer is obtained, wherein each neural unit in the set indicates that the pixel at its position contributes to the output result.
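As a minimal illustration of step (4), the positive entries of a class activation map can be recorded as a position set; the 3x3 values below are illustrative and not taken from a real network:

```python
import numpy as np

# Illustrative class activation map; positive entries mark neurons that
# contribute to the output, and their positions seed the contributing set.
cam = np.array([[ 0.5, -0.2,  0.0],
                [ 0.1, -0.3,  0.7],
                [-0.1,  0.2, -0.4]])

contributing = {(int(r), int(c)) for r, c in zip(*np.where(cam > 0))}
```

Only strictly positive values are recorded, so the 0.0 entry is excluded from the set.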
For a convolutional neural network classification result-oriented visualization algorithm, the step (2) is implemented by the following steps:
(2.1) Assuming a trained CNN model and a given input picture that the model classifies into class c, assuming that the c-th node in the output layer is the output node of that class, and that the score at the c-th node is S_c; the algorithm selects the output before the Softmax layer as the class score, so that the output maps only to the positions of the features relevant to class c, giving

    R_c^(out) = S_c

wherein R_c^(out) represents the relevance of the neuron predicted as class c in the output layer, i.e. the distribution of the prediction result's relevance over the output layer;
(2.2) Assuming that the layer before the output layer is l, the contribution of each neural unit in that layer to the final output, i.e. the relevance of each neuron to the prediction result, is defined as

    R_i^(l) = (a_i^(l) · w_{i,c} / Σ_{i'} a_{i'}^(l) · w_{i',c}) · S_c

wherein a_i^(l) represents the activation value of the i-th neuron in layer l, and w_{i,c} represents the weighted connection between that neural unit and the class-c neuron of the next layer, the output layer;
(2.3) Only the relevance of each neuron to node c is considered, because only the class-c output node carries relevance in the last layer; if the propagation between intermediate layers is considered, the relevance of each neuron in the earlier layer to all neurons in the next layer must be considered, and then

    R_{i←j}^(l-1,l) = (a_i^(l-1) · w_{i,j} / Σ_{i'} a_{i'}^(l-1) · w_{i',j}) · R_j^(l)

wherein R_j^(l) represents the relevance between the j-th neuron in layer l and the class-c prediction output, and R_{i←j}^(l-1,l) represents the relevance between the i-th neuron in layer l-1 and the j-th neuron in the next layer l;
(2.4) According to the conservation law, the sum of the relevances of all neurons in layer l is equal to the relevance of the output layer, so that the relevance of the i-th neuron to the next layer, which is equal to its relevance to the prediction result, is

    R_i^(l-1) = Σ_j R_{i←j}^(l-1,l)

wherein R_i^(l-1) represents the relevance between the i-th neuron in layer l-1 and the prediction result; meanwhile, according to the conservation law of propagation, there exists

    Σ_i R_i^(l-1) = Σ_j R_j^(l) = S_c
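The redistribution rule of steps (2.3) and (2.4) can be sketched numerically; the function name, the toy values, and the small stabilizer eps (added to avoid division by zero) are illustrative assumptions, not part of the claimed method:

```python
import numpy as np

def propagate_relevance_fc(a_prev, W, R_next, eps=1e-9):
    """One backward relevance-propagation step through a fully-connected layer.

    a_prev : (n,) activations of layer l-1
    W      : (n, m) weights connecting layer l-1 to layer l
    R_next : (m,) relevance of the m neurons in layer l
    Returns the (n,) relevance of layer l-1; the total is conserved.
    """
    z = a_prev[:, None] * W                  # contributions a_i * w_ij
    denom = z.sum(axis=0) + eps              # normalization per neuron j
    return (z / denom * R_next).sum(axis=1)  # R_i = sum_j (z_ij / z_j) * R_j

a_prev = np.array([1.0, 2.0, 0.5])
W = np.array([[0.3, -0.1],
              [0.2,  0.4],
              [0.1,  0.2]])
R_next = np.array([0.6, 0.4])
R_prev = propagate_relevance_fc(a_prev, W, R_next)
```

The conservation law of (2.4) shows up directly: the entries of `R_prev` sum to the same total as `R_next`.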
for a visualization algorithm for a convolutional neural network classification result, in step (3), in order to obtain a category CAM map, it is necessary to first reversely transfer the correlation of the prediction result to the last convolutional layer, because the spatial information in the input image is stored in the convolutional layer, the correlation is first transferred layer by layer until the last convolutional layer, so as to prepare for calculating the CAM map in the next step, in a general CNN structure, the output of the last convolutional layer is converted from a three-dimensional tensor to a one-dimensional vector, so as to connect the following fully-connected layers, and the specific implementation steps include:
(3.1) Assuming that the output of the last convolutional layer is at the m-th layer of the network, according to the conservation law of relevance, the sum of the relevances of the neurons output by the last convolutional layer is equal to the final class score:

    Σ_i R_i^(m) = S_c
(3.2) During forward propagation for classification prediction, the feature maps output by the m-th layer of the convolutional part, i.e. the corresponding three-dimensional tensor, are converted into a one-dimensional vector; this conversion discards the spatial information in the extracted features, so when relevance is propagated backwards to calculate the CAM map, the one-dimensional vector representing the relevance of the m-th-layer neurons must be converted back into the three-dimensional tensor of the forward pass, that is, the spatial structure of that layer's feature maps;
the algorithm first converts the one-dimensional relevance vector of the m-th layer into a three-dimensional relevance tensor with the spatial structure of the feature maps; because the values correspond one to one, the sum remains unchanged:

    Σ_k Σ_{i,j} R_{k,i,j}^(m) = S_c

wherein R_{k,i,j}^(m) represents the relevance between the neuron at coordinate (i, j) in the k-th channel of the m-th-layer relevance tensor and the predicted classification result;
(3.3) If global average pooling is performed on the output features of each channel, the result is

    F_k = (1/Z) Σ_{i,j} f_k(i, j)

wherein f_k(i, j) represents the activation value of the neuron at coordinate (i, j) in the k-th channel of the last convolutional layer's feature map and Z is the number of spatial positions, so that

    S_c = Σ_k Σ_{i,j} R_{k,i,j}^(m)

Comparison with the CAM calculation formula S_c = Σ_k w_k^c F_k yields the weight of each feature map relative to the final output after global average pooling:

    w_k^c = Σ_{i,j} R_{k,i,j}^(m) / F_k

After weighted summation as shown above, the CAM map of a CNN model containing fully-connected layers is obtained:

    M_c(i, j) = Σ_k w_k^c f_k(i, j)
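The derivation of steps (3.1) to (3.3) can be sketched as follows; the function name, the toy shapes, and the stabilizer eps are assumptions for illustration:

```python
import numpy as np

def rel_cam(feature_maps, R_m, eps=1e-9):
    """Turn the last-conv relevance tensor into a class activation map.

    feature_maps : (K, H, W) activations f_k(i, j) of the last conv layer
    R_m          : (K, H, W) relevance tensor of that layer
    """
    gap = feature_maps.mean(axis=(1, 2))          # F_k: global average pooling
    w = R_m.sum(axis=(1, 2)) / (gap + eps)        # channel weights w_k^c
    return np.tensordot(w, feature_maps, axes=1)  # CAM = sum_k w_k^c * f_k

# Toy example: two 2x2 channels whose relevances sum to 0.6 and 0.4
feature_maps = np.stack([np.ones((2, 2)), 2 * np.ones((2, 2))])
R_m = np.stack([np.full((2, 2), 0.15), np.full((2, 2), 0.1)])
cam = rel_cam(feature_maps, R_m)
```

With these toy values the channel weights are 0.6 and 0.2, so every position of the resulting map carries the value 1.0.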
for a visualization algorithm for the classification result of the convolutional neural network, in the step (4), if the used CNN model has N convolutional layers, the index of each layer is 1,2, \ 8230; N, and in the l layer, the matrix A is used
l Represents the activation value of all neurons in this layer, W
l A weight matrix connecting this layer and the previous layer is represented,
denotes the kth neuron in layer l, X
l Representing neurons in layer I contributing to the last decision in the feature mapPosition, i.e. the bit of the neuron whose correlation with the final output result is positive, m represents the number of neurons therein; the position of the pixel in the input that supports this CNN decision will be obtained below based on the previously obtained CAM map in conjunction with the new propagation method proposed.
For a visualization algorithm for a convolutional neural network classification result, the step (5) specifically includes the following steps:
(5.1) for each neuron in X^l, extracting the activation values in the corresponding receptive field in layer l-1;
(5.2) calculating the Hadamard products of these activation values and the corresponding weights of the convolution kernel;
(5.3) summing the Hadamard products within each channel to obtain the channel contributing most to the next-layer neuron; the neurons with positive Hadamard products in that channel are recorded by the algorithm into the neuron set contributing to the classification;
(5.4) removing the repeated neurons.
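Steps (5.1) to (5.3) for a single neuron can be sketched as follows; the function name, shapes, and values are illustrative:

```python
import numpy as np

def contributing_channel(a_field, w_field):
    """Hadamard product of receptive-field activations and kernel weights,
    per-channel sums, and the positive positions of the winning channel.

    a_field, w_field : (C, kh, kw) activations and weights
    """
    h = a_field * w_field                # (5.2) Hadamard product
    sums = h.sum(axis=(1, 2))            # (5.3) contribution of each channel
    k = int(sums.argmax())               # channel with the maximum sum
    pos = {(k, int(i), int(j)) for i, j in zip(*np.where(h[k] > 0))}
    return k, pos

# Toy receptive field with two channels (illustrative values)
a_field = np.array([[[1., 2.], [0., 1.]],
                    [[1., 1.], [1., 1.]]])
w_field = np.array([[[0.5, -1.], [1., 0.]],
                    [[0.1, 0.1], [0.1, 0.1]]])
k, pos = contributing_channel(a_field, w_field)
```

Here channel 0 sums to -1.5 and channel 1 to 0.4, so channel 1 wins and all four of its positive entries are recorded.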
The invention has the beneficial effects that: the Rel-CAM algorithm has higher accuracy in explaining the classification of the convolutional neural network, and can distinguish the characteristics between classes when explaining a classification decision, thereby helping people better understand the classification basis of the convolutional neural network and solving the problem that prior-art methods explain the classification of the convolutional neural network unsatisfactorily.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
With reference to fig. 1, the invention discloses a visualization algorithm for a convolutional neural network classification result, which is implemented by the following steps:
the method comprises the following steps: training the convolutional neural network by using a data set containing an input image to be interpreted as a training set to obtain trained model parameters;
step two: according to the calculation method of the Rel-CAM algorithm in the fully-connected layers, calculating, layer by layer, the contribution of each neural unit in the fully-connected layers to the output, using the output result and the model parameters, until the convolutional layers are reached;
step three: calculating the corresponding weight of each channel and the output result of the layer according to the contribution of all the neural units in the last layer of convolutional layer to the output obtained in the step two, thereby obtaining a class activation mapping chart of the network model;
step four: recording neural units with positive values in the class activation mapping, wherein the positions of the neural units are used as the positions of pixels contributing to output results in the layer, and the neural units are added into a neuron set contributing to output in the layer;
step five: sequentially taking out each neuron in the set, calculating the Hadamard products of all neurons in its receptive field in the previous layer and the corresponding weights, summing the Hadamard products within each channel, taking the channel with the maximum sum as the contributing channel, adding the neurons with positive Hadamard products in that channel into the neuron set contributing to the output in that layer, and removing repeated neurons from the set;
step six: repeating the propagation process of step five until the neuron set in the input layer is obtained, wherein each neural unit in the set indicates that the pixel at its position contributes to the output result.
At present, visualization methods for explaining the classification results of convolutional neural networks are a popular direction of current machine-learning research, and scholars at home and abroad have proposed various distinctive model methods and corresponding algorithms for different network models and specific practical problems. On the basis of this prior research, and aiming at the shortcomings of existing sensitivity-analysis visualization algorithms in the accuracy of the explanation of classification results and in algorithmic efficiency, the invention combines the advantages and innovations of class-activation-mapping algorithms and proposes a relevance-based class activation mapping visualization algorithm. The main viewpoints and contents are as follows:
(1) The calculation method of the Rel-CAM algorithm in the fully-connected layers. Relevance propagation is one of the algorithms commonly used to explain the classification of a convolutional neural network; its general idea is to understand the contribution of each pixel to the final prediction result, and it uses the structure of the network to propagate relevance backwards. The algorithm starts from the output layer of the network and redistributes the score of the predicted classification at each layer along the backward direction of the network, up to the input layer. The redistribution process obeys the conservation law that the sum of the relevances of each layer remains unchanged. Here the relevance is denoted R(x), where x denotes a single pixel or a neuron in an intermediate layer. To obtain the CAM map of a certain class, the prediction result of the last layer must first be transferred to the last convolutional layer.
First, assume there is a trained CNN model and a given input picture that the model classifies into class c. Assume the c-th node in the output layer is the output node of that class, with score S_c. The algorithm selects the output before the Softmax layer as the class score, because the output then maps to the locations of features relevant only to class c; if the Softmax output were chosen, the normalized output would map to locations containing features of other classes, and the resulting visualization would be inaccurate because it would include features belonging to other classes, even though the picture is classified into them with only small probability. Taken together, the algorithm therefore uses the output value before Softmax as the start of relevance propagation. Thus, there is:

    R_c^(out) = S_c

wherein R_c^(out) represents the relevance of the neuron predicted as class c in the output layer, i.e. the distribution of the prediction result's relevance over the output layer. Because only one node in the output layer relates to class c, this distribution has only the single value S_c, and the sum of the relevances of the neurons in each of the preceding layers is therefore also S_c.
Assuming that the layer before the output layer is l, the contribution of each neural unit in that layer to the final output, that is, the relevance of each neuron to the prediction result, is defined as

    R_i^(l) = (a_i^(l) · w_{i,c} / Σ_{i'} a_{i'}^(l) · w_{i',c}) · S_c

In the formula, a_i^(l) represents the activation value of the i-th neuron in layer l, and w_{i,c} represents the weighted connection between that neural unit and the class-c neuron of the next layer (the output layer). Because only the class-c output node in the last layer carries relevance, only the relevance of each neuron to node c is considered there. For propagation between intermediate layers, however, the relevance of each neuron in the earlier layer to all neurons in the next layer must be considered, giving

    R_{i←j}^(l-1,l) = (a_i^(l-1) · w_{i,j} / Σ_{i'} a_{i'}^(l-1) · w_{i',j}) · R_j^(l)

wherein R_j^(l) represents the relevance of the j-th neuron in layer l to the class-c prediction output, and R_{i←j}^(l-1,l) represents the relevance between the i-th neuron in layer l-1 and the j-th neuron in the next layer l, i.e. the contribution of neuron i to neuron j. The sum of the contributions of neuron i to all neurons in the next layer is its contribution to that layer; according to the conservation law, the sum of the relevances of all neurons in layer l is equal to the relevance of the output layer, so the relevance of neuron i to the next layer, which equals its relevance to the prediction result, is

    R_i^(l-1) = Σ_j R_{i←j}^(l-1,l)

wherein R_i^(l-1) represents the relevance of the i-th neuron in layer l-1 to the prediction result. Meanwhile, according to the conservation law of propagation:

    Σ_i R_i^(l-1) = Σ_j R_j^(l) = S_c
in order to obtain the CAM map of the category, it is necessary to first reversely transfer the correlation of the prediction result to the last convolutional layer, because the spatial information in the input image is stored in the convolutional layer, so that the correlation is first transferred layer by layer until the last convolutional layer, and preparation is made for calculating the CAM map in the next step. In a general CNN structure, the output of the last convolutional layer is converted from a three-dimensional tensor into a one-dimensional vector so as to connect the following fully connected layers. Assuming that the output of the last convolutional layer is located at the mth layer of the network, according to the conservation law of correlation:
according to the conservation law of the correlation, the sum of the correlations of each neuron output by the last convolutional layer is equal to the final class score:
because the feature map output from the m-th layer in the convolutional layer, i.e. the corresponding three-dimensional tensor, is converted into a one-dimensional vector in the forward propagation for the classification prediction, so as to facilitate the forward propagation in the fully-connected layer. This conversion discards spatial information in the extracted features, so when performing the reverse correlation back propagation to calculate the CAM map, it is necessary to first convert the one-dimensional vector representing the correlation of the neurons in the mth layer into the three-dimensional tensor in the forward propagation, that is, the spatial structure of the feature map in the layer.
The algorithm therefore first converts the one-dimensional relevance vector of the m-th layer into a three-dimensional relevance tensor with the spatial structure of the feature maps; because the values correspond one to one, the sum remains unchanged. Thus, for the transformed relevance tensor, there is also:

    Σ_k Σ_{i,j} R_{k,i,j}^(m) = S_c

wherein R_{k,i,j}^(m) represents the relevance between the neuron at coordinate (i, j) in the k-th channel of the m-th-layer relevance tensor and the predicted classification result. If global average pooling is performed on the output features of each channel, the result is

    F_k = (1/Z) Σ_{i,j} f_k(i, j)

wherein f_k(i, j) represents the activation value of the neuron at coordinate (i, j) in the k-th channel of the last convolutional layer's feature map and Z is the number of spatial positions, so that

    S_c = Σ_k Σ_{i,j} R_{k,i,j}^(m)

Comparison with the CAM calculation formula S_c = Σ_k w_k^c F_k yields the weight of each feature map relative to the final output after global average pooling:

    w_k^c = Σ_{i,j} R_{k,i,j}^(m) / F_k

After weighted summation as shown above, the CAM map of a CNN model containing fully-connected layers is obtained:

    M_c(i, j) = Σ_k w_k^c f_k(i, j)
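The inverse-flatten of the relevance vector described above can be sketched in one line; the feature-map shape is illustrative:

```python
import numpy as np

# Invert the flatten that fed the fully-connected layers: each relevance
# value returns to its spatial position and the total is conserved.
K, H, W = 2, 2, 2                          # illustrative feature-map shape
R_flat = np.arange(8, dtype=float) / 8.0   # 1-D relevance of the m-th layer
R_tensor = R_flat.reshape(K, H, W)         # back to the (K, H, W) structure
```

Because `reshape` only reinterprets the layout, the one-to-one correspondence of values, and hence the conservation of the summed relevance, holds by construction.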
(2) The calculation method of the Rel-CAM algorithm in the convolutional layers. In the convolutional layers, the Rel-CAM algorithm uses an algorithm based on the propagation of position information. The core idea of the algorithm is as follows: at the current layer, if a neuron supports the final classification result, i.e. it is positively correlated with the final output, then the neurons in the previous layer positively correlated with it can be regarded both as evidence supporting that neuron and as evidence supporting the final classification result. Being positively correlated means that the product of a previous-layer neuron and the weight between the two is positive. This is the core idea by which the Rel-CAM algorithm propagates layer by layer in the convolutional layers.
First, if the CNN model used has N convolutional layers, the index of each layer is 1, 2, …, N. In layer l, the matrix A^l represents the activation values of all neurons in that layer, W^l represents the weight matrix connecting this layer and the previous layer, a_k^l denotes the k-th neuron in layer l, and X^l represents the positions, in the feature maps, of the neurons in layer l contributing to the final decision, that is, the positions of the neurons whose relevance to the final output result is positive; m represents the number of neurons therein. The positions of the pixels in the input that support this CNN decision are obtained below from the previously obtained CAM map combined with the proposed new propagation method.
The CAM map obtained in the previous section is located at the m-th layer of the network, and the neurons with positive values in the CAM map are those contributing to the final decision result in this layer, so X^m is the set of positions of the elements of the CAM map whose value is greater than 0. This position information is then passed layer by layer until the input layer is reached.
After reaching a convolutional layer, X^l is a set of three-dimensional indices, each index identifying the position of a neuron in that layer contributing to the final classification decision. How the method backwards localizes the discriminative neurons in the previous layer is explained below. It should be noted that the receptive field of a pooling layer performing a pooling operation is typically a two-dimensional plane, while the receptive field of a convolutional layer performing a convolution operation is a three-dimensional volume. Therefore, for each neuron in X^l, the activation values in the corresponding receptive field in layer l-1 are extracted, and the Hadamard products of these activation values and the corresponding weights of the convolution kernel are computed. By summing the Hadamard products within each channel, the channel contributing most to the next-layer neuron is obtained, and the neurons with positive Hadamard products in that channel are recorded by the algorithm into the set of neurons contributing to the classification.
Algorithm 1 below describes the process of obtaining the positions of the classification-supporting neurons in a convolutional layer. In Algorithm 1, the activation values in the receptive field of the previous layer form a three-dimensional tensor. When the Hadamard product of this tensor and the weights of the corresponding neuron is computed, the result is also a three-dimensional tensor of the same size. The algorithm first sums the outputs along the x-axis and y-axis directions to locate the most discriminative feature map. If the convolutional layer does not perform any down-sampling operation, the spatial position of the decision neuron does not change during this conversion; that is, the position (x, y) in the later layer is shifted to the channel with the largest contribution in the current layer, thus completing the propagation of the position information between layers. The algorithm could further select the neuron with the largest activation value within the most-contributing channel, but the results of the two variants are almost the same in the experiments, so the algorithm still selects the elements of the most-contributing channel as the decision neurons.
The algorithm steps of the position update are as follows:
Algorithm 1: neuron position propagation algorithm supporting classification decisions in convolutional layers
Input: X^l, the positions of the neurons in the higher layer that contribute to the classification, derived from the CAM map: X^l[1] … X^l[m];
W^l, the weights of layer l;
A^(l-1), the activation values of the neurons of layer l-1.
Output: X^(l-1), the positions of the classification-supporting neurons in layer l-1.
1. Let X^(l-1) = ∅
2. for i = 1 : m do
3.   take the convolution-kernel weights corresponding to neuron X^l[i]
4.   take the activation values in the receptive field corresponding to neuron X^l[i]
5.   compute the Hadamard product H of the activation values and the weights
6.   compute the contribution value of each channel from the Hadamard product, i.e. sum each channel of the product tensor and assign it to C: C_k = S(H_k), where S(·) is the summation over the plane elements
7.   store the positions of the neurons whose Hadamard product is positive in channel argmax(C) into the position set of the decision neurons of the previous layer, i.e. add them to X^(l-1)
8. end for
9. for positions in X^(l-1) with the same value, retain one of them.
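Algorithm 1 can be sketched in Python as follows; the receptive-field helper and the toy single-kernel layer are hypothetical stand-ins for the real stride/padding bookkeeping of a convolution:

```python
import numpy as np

def propagate_positions(X_l, W_l, field):
    """Sketch of Algorithm 1: push the set X_l of decision-neuron
    positions one convolutional layer backwards.

    X_l   : iterable of (k, x, y) neuron positions in layer l
    W_l   : dict mapping output channel k -> (C, kh, kw) kernel weights
    field : hypothetical helper (x, y) -> (C, kh, kw) receptive-field
            slice of the layer l-1 activations
    """
    X_prev = set()
    for (k, x, y) in X_l:
        a = field(x, y)               # line 4: receptive-field activations
        h = a * W_l[k]                # line 5: Hadamard product
        C = h.sum(axis=(1, 2))        # line 6: per-channel contributions
        c_star = int(C.argmax())
        # line 7: positive entries of the winning channel
        for i, j in zip(*np.where(h[c_star] > 0)):
            X_prev.add((c_star, int(i), int(j)))  # line 9: set drops duplicates
    return X_prev

# Toy layer l-1: 2 channels, 2x2 map fully covered by one receptive field
A_prev = np.array([[[1., 0.], [0., 1.]],
                   [[2., 2.], [2., 2.]]])
W_l = {0: np.array([[[1., 1.], [1., 1.]],
                    [[-1., 0.], [0., 0.]]])}
X_prev = propagate_positions([(0, 0, 0)], W_l, lambda x, y: A_prev)
```

With these toy values channel 0 sums to 2 and channel 1 to -2, so the two positive positions of channel 0 are recorded.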
For the pooling layers, the algorithm extracts the neurons in the two-dimensional receptive-field range of the previous layer and finds the position with the largest activation value among them, because most CNN structures down-sample the feature maps using max pooling: the activation in the next layer comes from the maximum activation occurring in the corresponding receptive field of the previous layer. Therefore, when the algorithm backtracks the activation of the previous layer at a down-sampling layer, it selects the neuron with the largest activation value within the receptive field of the corresponding neuron.
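Backtracking through a max-pooling layer then reduces to an arg-max over the two-dimensional receptive field (toy values):

```python
import numpy as np

# A 2x2 max-pooling receptive field: the contributing neuron in the
# previous layer is simply the arg-max of the window.
window = np.array([[0.1, 0.9],
                   [0.3, 0.2]])           # illustrative activations
i, j = np.unravel_index(window.argmax(), window.shape)
```

The pooled activation 0.9 came from position (0, 1), which is exactly the position returned to the contributing set.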
Thus, for a CNN trained for recognition, the Rel-CAM algorithm can start from the prediction result of the last layer and first use a relevance-propagation-based algorithm at the fully-connected layers to generate a class activation map capable of locating the classification features. The map is then converted into a set of positions, and another propagation algorithm based on position information traces the positions of the decision neurons back through the convolutional layers to the input layer. Finally, the localization of the features that determine the classification is obtained on the input image. Although the input picture usually contains three RGB channels, the algorithm only considers the planar x-y space, that is, only the positioning of the pixels in two-dimensional space is of interest.
When explaining the classification result of a convolutional neural network, the method draws on the concept of the class activation map, which identifies in the input picture the region responsible for its classification, and the method shares this advantage. The qualitative experimental results in fig. 2 show that when an image is classified as cat or as dog, the Rel-CAM algorithm identifies only the pixel region of the corresponding class in the image, and does not mark the pixel regions of other classes or of the background; the Backprop method identifies features of all classes in both cases, indicating that it cannot distinguish features between classes when interpreting a classification decision. The LRP method is similar to the Rel-CAM algorithm proposed herein and, compared with the Backprop method, is class-discriminative, but it labels more non-critical feature regions and background pixels and is computationally heavier than the Rel-CAM algorithm. The Rel-CAM algorithm of the present invention therefore explains the classification decisions of a CNN better, especially when more than one type of object appears in the image.
In addition, quantitative experimental comparison of (a) the average drop in classification confidence and (b) the increase in classification confidence for the three methods on the same data set shows that the Rel-CAM algorithm explains the classification of the convolutional neural network more accurately. The evaluation criteria use the concept of an explanation map: the explanation E of an image is defined as the element-wise product of the generated heat map H and the input picture I:

    E = H ⊙ I

In the formula, ⊙ represents the Hadamard product of element-wise multiplication, I is the input picture, and H is a heat map able to determine the classification; in the experiment on each picture, c represents the category into which the model classifies it. In short, the explanation map reflects the importance of each pixel to the model's decision as obtained by the algorithm, and covers a part of the input image.
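The explanation map E = H ⊙ I can be sketched with toy single-channel arrays:

```python
import numpy as np

# E = H * I: the heat map H (values in [0, 1]) masks the input picture I
# element-wise; 2x2 single-channel arrays for illustration.
I = np.array([[10., 20.],
              [30., 40.]])
H = np.array([[1., 0.],
              [0.5, 0.]])
E = H * I          # Hadamard product
```

Pixels the heat map deems irrelevant (H = 0) are zeroed out, while important pixels are kept in proportion to their heat-map weight.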
Average drop in classification confidence: a good explanation map should mark the parts most important for the classification. In prediction, a deep CNN model makes its final decision from all features of the input image, so occluding part of the image inevitably reduces the confidence of the model's decision; on the other hand, this drop should be small, since the parts of the input image most important to the classification decision are retained in the explanation map. This metric therefore compares the drop in the model's confidence for a particular class after the picture is masked. For example, if a model predicts an image as a tiger with confidence 0.8, and the confidence with which the model classifies the image as a tiger drops to 0.4 when predicting from the explanation map, then the drop in model confidence is 50%. In the experiments, the top 50 images of each predicted category in the selected data set are used, and the average drop of the several algorithms is compared.
Increase in classification confidence: sometimes all the features the CNN looks for lie within the identified part of the explanation map, while the remaining features are unnecessary and do not help the classification decision. In this case, the model's confidence for the particular class increases instead. This metric counts the number of times, over the whole data set, that the model's confidence increases when predicting from the explanation map, expressed as a percentage.
Least drop in classification confidence: the first two criteria evaluate the ability of the explanation map generated by one visualization method to correctly identify the regions of the image that influence the classification; this criterion explicitly compares how good the explanation maps generated by different methods are. On each image of a given data set, the drops in confidence caused by the explanation maps generated by each visualization method are compared, and the method with the smallest drop has its count incremented by 1. The smaller the drop in confidence, the more of the important classification features of the class are identified in the explanation map generated by the method, i.e. the better the explanation. The final output is a percentage, i.e. the proportion of the minimum-drop count of each method among all algorithms.
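The first two criteria can be sketched as follows; the function names and toy confidence scores are illustrative, and the definitions follow the textual descriptions above:

```python
import numpy as np

def avg_drop(conf_full, conf_expl):
    """Average percentage drop in class confidence when predicting from
    the explanation map instead of the full image."""
    conf_full = np.asarray(conf_full)
    conf_expl = np.asarray(conf_expl)
    return 100.0 * np.mean(np.maximum(conf_full - conf_expl, 0.0) / conf_full)

def increase_rate(conf_full, conf_expl):
    """Percentage of images whose confidence rises under the explanation map."""
    return 100.0 * np.mean(np.asarray(conf_expl) > np.asarray(conf_full))

# Toy scores: the tiger example from the text (0.8 -> 0.4 is a 50% drop),
# plus one image whose confidence increases under the explanation map
full = [0.8, 0.6]
expl = [0.4, 0.7]
```

With these two toy images, the average drop is (50% + 0%) / 2 = 25%, and the confidence increases on one image out of two, i.e. 50%.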
Experimental analysis, as shown in fig. 3, shows that the Rel-CAM algorithm is lower than the other two existing algorithms in the average drop of classification confidence, and is comparable to, indeed slightly ahead of, the other algorithms in the increase of confidence. The experimental results also show that localizing the features can help improve the performance of the classifier, which may offer deep-learning researchers a new angle for improving neural-network performance: add a feature-recognition component to the model, then guide training according to the recognized features, so as to improve the network performance.
In the aspect of the least drop in confidence, the proportion occupied by the Rel-CAM algorithm is the largest; that is, over the whole data set, the Rel-CAM algorithm most often identifies the features with the largest influence on the classification, which indicates that Rel-CAM is better than the other two methods.
In conclusion, through qualitative and quantitative analysis, the Rel-CAM algorithm has higher accuracy in explaining the classification of the convolutional neural network, and can distinguish the characteristics between the classifications when explaining the classification decision, thereby helping people to better understand the classification basis of the convolutional neural network.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.