CN112115872A - Three-dimensional action recognition residual network visualization method using class activation mapping - Google Patents

Three-dimensional action recognition residual network visualization method using class activation mapping

Info

Publication number
CN112115872A
CN112115872A
Authority
CN
China
Prior art keywords
input data
network
data
classification
activation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010994268.7A
Other languages
Chinese (zh)
Inventor
Mao Lin
Chen Siyu
Yang Dawei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University
Priority to CN202010994268.7A
Publication of CN112115872A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/64: Three-dimensional objects

Abstract

A three-dimensional action recognition residual network visualization method using class activation mapping belongs to the field of video understanding in computer vision. It addresses the problem of improving the interpretability of a three-dimensional action recognition residual network's process of understanding video data. An activation heatmap is obtained from the classification weights and feature information of the network model and serves as a map of the activation strength between the model's action recognition and the input data. The activation heatmap is superimposed on the input data to obtain an attention result image, which shows which information in the input data the network model's attention is mainly focused on under the corresponding classification result. The effect is that action recognition technology can be better applied and popularized in fields such as medical care, logistics and transportation, autonomous driving, security monitoring, online short video, and live-stream screening.

Description

Three-dimensional action recognition residual network visualization method using class activation mapping
Technical Field
The invention belongs to the field of video understanding in computer vision, and particularly relates to a three-dimensional action recognition residual network visualization method using class activation mapping.
Background
Neural network methods now achieve excellent results across computer vision tasks, but because a network model learns solely by continuously updating its internal parameters, it is difficult for people to intuitively understand what the model has learned. For the technology to advance in each related field, neural network methods need to be fair, transparent, clearly interpretable, and understandable to the public. Some researchers have therefore designed various neural network visualization methods that intuitively display how a network model learns and understands its data, increasing the interpretability of the model.
The Deconvolutional Network can read the feature information of selected neurons in a convolutional neural network without changing the input data, and maps that feature information back to the input image space through reversed convolutions, so that the information captured by the selected neurons in the input image is highlighted, effectively expressing how strongly each neuron responds to the input data. The Activation Maximization method takes random noise as the input image and, through gradient ascent, gradually transforms the noise into a visualization that maximizes the activation of a selected neuron; the generated result intuitively shows which features each neuron uses as its activation criterion, i.e., what the neuron has learned. Class Activation Mapping was originally proposed for weakly supervised learning, simultaneously classifying and localizing targets within a classification network; beyond target localization, further visualization of its mapping result lets people see which regions of the input image the network model used as the basis for its classification decision, giving a more intuitive understanding of how the model arrives at a reliable classification result.
Because the field of video understanding has developed more slowly than image understanding, current neural network visualization methods mainly target 2D neural networks that take planar images as input; applications of such visualization methods to three-dimensional neural networks that take video as input remain scarce. Video data carries temporal dynamic information that image data lacks, so visualization methods in video understanding need different technical means to produce satisfactory visual results. There is still much room for exploration and development of visualization techniques in the field of video understanding.
Disclosure of Invention
To improve the interpretability of a three-dimensional action recognition residual network's process of understanding video data, the invention provides the following technical scheme: a three-dimensional action recognition residual network visualization method using class activation mapping obtains an activation heatmap from the classification weights and feature information of the network model; the activation heatmap serves as a map of the activation strength between the network model's action recognition and the input data. The activation heatmap is superimposed on the input data to obtain an attention result image, which shows which information in the input data the network model's attention is mainly focused on under the corresponding classification result.
Further, the network model extracts features of the data through its intermediate hidden layers, and the output layer obtains the classification result by combining each class's classification weights with the features at each level.
The invention also relates to another three-dimensional action recognition residual network visualization method using class activation mapping, implemented in the following steps:
First step: read the classification weights w connecting the output layer and the global average pooling layer of the network model; their data size is [t, c′], where t is the total number of action categories the model can recognize and c′ is the number of channels connecting the output layer to the preceding layer;
Second step: process the action video to be recognized into a single-frame image dataset X according to the set channel number c, pixel height h, and width d, and take every n frames as one group of input data X_I^n. The single-frame image dataset X is formulated as:
X = [x_1, x_2, …, x_a]
where x is a single-frame image of size [c, h, d] and a is the total number of frames in the action video image dataset. The I-th group of input data is formulated as:
X_I^n = [x_{(I−1)n+1}, x_{(I−1)n+2}, …, x_{In}]
where X_I^n is converted, when used, into a tensor of data size [c, n, h, d];
Third step: input each group of input data into the network for action recognition, read the convolutional layer before the network's global average pooling layer, and obtain the feature information group f_I for each input data group X_I^n; f_I has data size [c′, n, h′, d′], with pixel height h′ and width d′;
Fourth step: select an action recognition result t_i; channel by channel over c′, multiply the weights w corresponding to the classification result t_i with the convolution features f_I, and accumulate the products into a tensor H_I whose initial values are zero;
the fifth step: h is to beIMapping to a three-channel jet color space and sizing the height and width of the pixel data by [ h ', d']Conversion to [ h, d]And obtaining the visual activation thermodynamic diagram h under the current action categoryI,hIHas a size of [3, h, d];
And a sixth step: h is divided according to corresponding different data groupsIWith corresponding input data
Figure BDA0002691968340000025
And overlapping to obtain an attention result graph of model classification.
Further, in the second step: when a is not divisible by n, the leftover image frames are discarded; the default channel number c is 3, the frame count n is 16, and the pixel height h and width d are both 128.
Further, in the fourth step: the classification weights w_i are expressed as w_i = w[t_i, c′]; H_I is computed as expressed in the following programming-language form: H_I += f_I[c′] × w[t_i, c′].
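As an illustration only, a minimal NumPy sketch of the accumulation in the fourth step (the function name compute_activation_map is hypothetical, and averaging over the temporal dimension n to obtain one map per input group is an assumption; the patent leaves the temporal handling of H_I implicit):

```python
import numpy as np

def compute_activation_map(w, f_I, t_i):
    """Accumulate weighted convolution features into H_I (fourth step).

    w   : classification weights read in the first step, shape [t, c']
    f_I : feature group read in the third step, shape [c', n, h', d']
    t_i : index of the selected action recognition result
    """
    H_I = np.zeros(f_I.shape[1:], dtype=np.float32)  # same tensor, zero initial value
    for c in range(f_I.shape[0]):                    # channel-by-channel merge
        H_I += f_I[c] * w[t_i, c]                    # H_I += f_I[c'] x w[t_i, c']
    return H_I.mean(axis=0)                          # assumed: average over the n frames
```

The jet-color mapping, resizing to [h, d], and superposition with the input frames (fifth and sixth steps) are sketched further below in the detailed description.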
Advantageous effects: the visualization results of the invention help people understand the network model more comprehensively. Because video action recognition requires the model to learn temporal and spatial information in the video data simultaneously, how to balance the degree to which the two kinds of information are learned, so as to guarantee the classification performance of the network model, is one of the important difficulties and breakthrough points of action recognition technology. Through visualization of the model's classification basis, people can better grasp whether the model's judgment during classification leans toward temporal information or spatial information, or whether its dependence on the two is relatively balanced. When the model over-relies on certain information as its judgment basis and overfits, the intuitive judgment provided by the visualization results allows the network model to be debugged and improved in a more targeted way.
The invention uses class activation mapping to visualize classification attention for a three-dimensional action recognition residual neural network, helping people better understand which regions of the input data the model attends to during action recognition and which information serves as the judgment basis for the recognition result. Improving the interpretability of the network model also increases users' trust, so that action recognition technology can be better applied and popularized in fields such as medical care, logistics and transportation, autonomous driving, and security monitoring.
In practical applications of action recognition, a recognition system may be built from several different models or multi-task models according to the situation. The visualization results obtained by this method show more intuitively how a network model's performance differs across application scenarios; for example, the concrete performance of a model can be evaluated by comparing visualization results across targeted application examples such as daytime, cloudy weather, dusk, night, pedestrians, buildings, and vehicles. The recognition system can then assign different recognition tasks for different application examples and group network models within the system according to their concrete performance on those examples, improving the overall performance of the system; different models can also be used to expand the system's application scenarios and add different recognition functions.
Drawings
FIG. 1 is a logic diagram of the present method.
Fig. 2 shows the 10th-frame original image of the action video input in Example 1.
Fig. 3 shows the 60th-frame original image of the action video input in Example 1.
Fig. 4 is the activation heatmap of the input group containing the 10th-frame original image in Example 1.
Fig. 5 is the activation heatmap of the input group containing the 60th-frame original image in Example 1.
Fig. 6 is the attention result image obtained by superimposing the 10th-frame original image and its corresponding heatmap in Example 1.
Fig. 7 is the attention result image obtained by superimposing the 60th-frame original image and its corresponding heatmap in Example 1.
Fig. 8 shows the 31st-frame original image of the action video input in Example 2.
Fig. 9 shows the 154th-frame original image of the action video input in Example 2.
Fig. 10 is the activation heatmap of the input group containing the 31st-frame original image in Example 2.
Fig. 11 is the activation heatmap of the input group containing the 154th-frame original image in Example 2.
Fig. 12 is the attention result image obtained by superimposing the 31st-frame original image and its corresponding heatmap in Example 2.
Fig. 13 is the attention result image obtained by superimposing the 154th-frame original image and its corresponding heatmap in Example 2.
Fig. 14 shows the 31st-frame original image of the action video input in Example 3.
Fig. 15 shows the 154th-frame original image of the action video input in Example 3.
Fig. 16 is the activation heatmap of the input group containing the 31st-frame original image in Example 3.
Fig. 17 is the activation heatmap of the input group containing the 154th-frame original image in Example 3.
Fig. 18 is the attention result image obtained by superimposing the 31st-frame original image and its corresponding heatmap in Example 3.
Fig. 19 is the attention result image obtained by superimposing the 154th-frame original image and its corresponding heatmap in Example 3.
Detailed Description
The invention is described in further detail below with reference to the following embodiments and the accompanying drawings:
In one embodiment, a three-dimensional action recognition residual network visualization method using class activation mapping is described. An activation heatmap is obtained from the classification weights and feature information of the model and serves as a map of the activation strength between the model's action recognition and the input data, showing region by region which part of the input data the network model uses as its classification basis. The activation heatmap is superimposed on the input data to obtain an attention result image, which visually shows which information in the input data the model mainly focuses on under the corresponding classification result.
The activation heatmap, composed from the classification weights and the feature information, can display the network model's classification basis region by region. When the network model classifies input data, the intermediate hidden layers extract features from the data, and the output layer obtains the final classification result by combining each class's classification weights with the features at each level. Therefore, although the method cannot reveal the specific content each neuron has learned, visualizing the combination of the classification weights and the hierarchical features can intuitively show which regions of the input data each classification result is sensitive to, and how that sensitivity differs from its sensitivity to other information;
the sensitivity of a specified classification result to each region of the input data is equivalent to the degree of correlation between that classification result and the region's information; regions with higher correlation have a more pronounced influence on the classification result, i.e., their information is the main judgment basis from which the output layer obtains the specified classification result;
the output layer obtains the final classification result by combining each class's classification weights with different features; the classification weight parameters are independent per class, and different action recognition results have different sets of classification weight parameters;
different visual effects can be achieved by using the classification weight parameters of different action recognition results. When the classification weights of a correct recognition result are used as the merging basis, the visualization result provides the judgment basis from which the network model obtained the correct classification, enhancing the interpretability of the model; when the classification weights of an incorrect recognition result are used as the merging basis, the visualization result indicates which information in the input data caused the network model to misjudge, making it easier both to understand in depth the commonalities between different data and the shortcomings of the network model's performance, and to optimize the model's performance in a more targeted way.
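For intuition, the combination the output layer performs can be written out explicitly; this is the standard class-activation-mapping identity restated in the notation of this description (the patent itself gives no equation):

$$ s_{t_i} = \sum_{c'} w[t_i, c'] \cdot \mathrm{GAP}(f_I)[c'] = \frac{1}{n\,h'\,d'} \sum_{k,u,v} \Big( \sum_{c'} w[t_i, c'] \, f_I[c', k, u, v] \Big) = \frac{1}{n\,h'\,d'} \sum_{k,u,v} H_I[k, u, v] $$

so the activation map H_I is exactly the location-wise, pre-pooling decomposition of the class score s_{t_i}: locations where H_I is large are those that contribute most to the classification result, which is what the superimposed heatmap displays.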
In this method, the input video data must be cut into blocks and fed in as separate data groups, and the activation heatmaps must be generated group by group from the features of the different video blocks;
the method requires the network model's structure to satisfy the following: only a global average pooling layer lies between the model's final output layer and its last convolutional layer, with no other fully connected layers;
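A minimal sketch of a network head satisfying this structural requirement (the backbone channel count, kernel size, and layer names are placeholders of this sketch, not the patent's specific 3D residual model):

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Last conv layer -> global average pooling -> single output layer.

    No extra fully connected layers, as the method requires.
    """
    def __init__(self, c_prime: int, num_classes: int):
        super().__init__()
        self.last_conv = nn.Conv3d(256, c_prime, kernel_size=3, padding=1)
        self.gap = nn.AdaptiveAvgPool3d(1)           # global average pooling
        self.fc = nn.Linear(c_prime, num_classes)    # weights w, shape [t, c']

    def forward(self, x):                            # x: [batch, 256, n, h', d']
        f = self.last_conv(x)                        # features f_I per sample
        pooled = self.gap(f).flatten(1)              # [batch, c']
        return self.fc(pooled)                       # class scores
```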
Further, the specific steps of the 3D residual action recognition network visualization method using class activation mapping are as follows:
First step: read the classification weights w connecting the output layer and the global average pooling layer of the 3D residual action recognition network model to be visualized;
wherein w has size [t, c′], t is the total number of action categories the model can recognize, c′ is the number of channels connecting the output layer to the preceding layer, and each channel has a different connection weight;
wherein the values of t and c′ are determined by the structure of the network model and have no default values;
Second step: process the action video to be recognized into a single-frame image dataset X, then group it by a fixed frame count into the input data groups X_I^n;
wherein the image dataset X is formulated as:
X = [x_1, x_2, …, x_a]
where x is a single-frame image of size [c, h, d] after processing according to the set channel number c, pixel height h, and width d, and a is the total number of frames in the image dataset, which depends on the duration of the input video; the default channel number c is 3, and the pixel height h and width d are both 128;
further, the I-th group of input data X_I^n can be formulated as:
X_I^n = [x_{(I−1)n+1}, x_{(I−1)n+2}, …, x_{In}]
where X_I^n has data size [c, n, h, d] and n is the number of single-frame images in one group of input data, set to 16 by default; when n does not divide a exactly, the leftover image frames in the image dataset X are discarded and not used as input data; hereinafter X_I^n denotes a given group of input data;
Third step: input the data X_I^n into the network for action recognition, and read the feature information f of the convolutional layer in front of the network's global average pooling layer;
wherein the feature information f is divided into the feature information groups f_I corresponding to the different input data groups X_I^n;
wherein, because the network model iterates on the data layer by layer, f_I has data size [c′, n, h′, d′]; the pixel height h′ and width d′ of f_I differ somewhat from the pixel height h and width d of the input data group X_I^n, with the specific numerical difference depending on the structure of the network model;
Fourth step: merge, channel by channel over c′, the classification weights w_i of the selected action category t_i with the convolution features f_I, accumulating each channel's merged result in the tensor H_I; then map H_I to the three-channel jet color space to obtain the visual activation heatmap h_I for the current action category t_i;
wherein the classification weights w_i are expressed as: w_i = w[t_i, c′];
wherein the tensor H_I has all values initialized to zero and size [n, h′, d′];
wherein, alongside mapping H_I to the three-channel jet color space, the data height and width must be resized from [h′, d′] to [h, d]; the visual activation heatmap h_I has size [3, h, d];
Fifth step: superimpose the heatmaps h_I, according to their grouping, onto the corresponding input data X_I^n to obtain the model's classification attention result images; sketches of these steps follow.
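As an illustration of the second through fifth steps above, a minimal sketch assuming PyTorch and OpenCV; every helper name here (load_input_groups, save_features, to_heatmap, overlay, model.last_conv) is a hypothetical choice of this sketch rather than terminology from the patent, and the blending ratio in the superposition is likewise assumed:

```python
import cv2
import numpy as np
import torch

def load_input_groups(video_path, c=3, n=16, h=128, d=128):
    """Second step: video -> single-frame dataset X -> groups X_I^n of shape [c, n, h, d]."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (d, h)).astype(np.float32) / 255.0)
    cap.release()
    a = len(frames) - len(frames) % n                  # leftover frames are discarded
    X = np.stack(frames[:a]).reshape(a // n, n, h, d, c)
    return X.transpose(0, 4, 1, 2, 3)                  # [a//n, c, n, h, d]

features = {}

def save_features(module, inputs, output):
    """Third step: capture f_I from the conv layer before global average pooling."""
    features["f_I"] = output.detach().squeeze(0).cpu().numpy()   # [c', n, h', d']

# Registered once on the model to be visualized, e.g.:
# hook = model.last_conv.register_forward_hook(save_features)

def to_heatmap(H_I, h=128, d=128):
    """Fourth step, second half: map H_I to the jet color space, resized to [h, d]."""
    m = (H_I - H_I.min()) / (H_I.max() - H_I.min() + 1e-8)       # normalize
    m = cv2.resize(np.uint8(255 * m), (d, h))                    # [h', d'] -> [h, d]
    return cv2.applyColorMap(m, cv2.COLORMAP_JET)                # h_I as [h, d, 3]

def overlay(frame_u8, h_I, alpha=0.5):
    """Fifth step: superimpose the heatmap on an input frame (alpha is assumed)."""
    return cv2.addWeighted(frame_u8, 1.0 - alpha, h_I, alpha, 0.0)
```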
The three-dimensional action recognition residual network visualization method using class activation mapping belongs to the field of video understanding in computer vision. To improve the interpretability of a three-dimensional action recognition residual network's process of understanding video data, an activation heatmap is obtained from the classification weights and feature information of the model and serves as a map of the activation strength between the model and the input data, showing region by region which part of the input data the network model uses as its classification basis. Superimposing the activation heatmap on the input data gives an attention result image that visually shows the regions of the input data on which the model's attention is mainly focused under the corresponding classification result, so that action recognition technology can be better applied and popularized in fields such as medical care, logistics and transportation, autonomous driving, security monitoring, online short video, and live-stream screening.
In another embodiment, a three-dimensional motion recognition residual error network visualization method using category activation mapping is described, a logic schematic of the method is shown in fig. 1, and the algorithm is implemented by the following steps:
First step: read the classification weights w connecting the output layer and the global average pooling layer of the network model; their data size is [t, c′], where t is the total number of action categories the model can recognize and c′ is the number of channels connecting the output layer to the preceding layer;
Second step: process the action video to be recognized into a single-frame image dataset X according to the set channel number c, pixel height h, and width d, and take every n frames as one group of input data X_I^n. The single-frame image dataset X is formulated as:
X = [x_1, x_2, …, x_a]
where x is a single-frame image of size [c, h, d] and a is the total number of frames in the action video image dataset; when a is not divisible by n, the leftover image frames are discarded. The I-th group of input data can be formulated as:
X_I^n = [x_{(I−1)n+1}, x_{(I−1)n+2}, …, x_{In}]
where X_I^n is converted, when used, into a tensor of data size [c, n, h, d]. The method's default channel number c is 3, frame count n is 16, and pixel height h and width d are both 128. Hereinafter X_I^n denotes a given group of input data;
Third step: input the data X_I^n into the network for action recognition, and read from the convolutional layer before the network's global average pooling layer the feature information groups f_I obtained for the different input data groups X_I^n. Because the network model iterates on the data layer by layer, the pixel height h′ and width d′ of f_I differ somewhat from the pixel height h and width d of X_I^n; f_I has data size [c′, n, h′, d′]. The specific values differ with the structure of the network model in practical applications;
Fourth step: select an action recognition result t_i; channel by channel over c′, multiply the weights w corresponding to the classification result t_i with the convolution features f_I, and accumulate the products into a tensor H_I whose initial values are zero. The computation of H_I can be represented in the following programming-language form:
H_I += f_I[c′] × w[t_i, c′]
Fifth step: map H_I to the three-channel jet color space and resize the pixel height and width from [h′, d′] to [h, d], obtaining the visual activation heatmap h_I for the current action category; h_I has size [3, h, d];
Sixth step: superimpose each h_I, grouped by its data group, onto the corresponding input data X_I^n to obtain the attention result images of the model's classification; an end-to-end sketch follows.
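Tying the sketches above together, a hypothetical end-to-end run over one video; all helper names are the illustrative ones introduced in the earlier sketches, and nothing here is prescribed by the patent:

```python
import cv2
import numpy as np
import torch

# model, features/save_features, load_input_groups, compute_activation_map,
# to_heatmap and overlay are assumed defined as in the sketches above
hook = model.last_conv.register_forward_hook(save_features)
w = model.fc.weight.detach().cpu().numpy()           # classification weights, [t, c']

groups = load_input_groups("action.mp4")             # second step
for I, g in enumerate(groups):
    with torch.no_grad():
        scores = model(torch.from_numpy(g).unsqueeze(0))   # third step (fills f_I)
    t_i = int(scores.argmax(dim=1))                  # selected recognition result
    H_I = compute_activation_map(w, features["f_I"], t_i)  # fourth step
    h_I = to_heatmap(H_I)                            # fifth step
    for k in range(g.shape[1]):                      # sixth step: each frame in group
        frame = np.uint8(255 * g[:, k].transpose(1, 2, 0))
        cv2.imwrite(f"attention_g{I}_f{k}.png", overlay(frame, h_I))
hook.remove()
```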
In view of the above, the invention provides a three-dimensional action recognition residual network visualization method using class activation mapping, which merges the classification weights and the feature information to obtain an activation heatmap as the visual activation result between the model's action recognition and the input data, displaying region by region which part of the input data the network model uses as its classification basis. The activation heatmap is superimposed on the input data to obtain an attention result image that visually shows which information of the input data the model's attention mainly focuses on under the corresponding classification result, improving the interpretability of the classification results obtained by the three-dimensional action recognition residual network.
Prior art solutions related to the present disclosure are as follows:
in 2017, the invention patent application of 'method for visualizing a convolutional neural network' (publication number: CNIO7392085A) discloses a method for visualizing a convolutional neural network, which uses customized network parameters and sets a convolutional neural network feature extraction function corresponding to the parameters, effectively executes a feature extraction program, stores response features of all neurons, and visualizes the larger response and the maximum response features of a single neuron in a specified layer by combining the calculated response domain parameters of the specified neurons. The difference is that the example does not need to calculate the response parameters additionally, and the visualization is completed only by reading and combining the feature information and the classification weight.
In 2017, the invention patent application "Intermediate information analysis device, optimization device, and feature visualization device for a neural network" (publication number: CN109460812A) disclosed devices that simplify the hidden states output by a long short-term memory layer in a neural network through a dimensionality reduction algorithm, analyze the network's intermediate information from the simplified hidden states together with the network's input data and output results, and realize feature visualization of the hidden states output by the long short-term memory layer from that intermediate information, aiding further research into neural networks with long short-term memory layers. The difference is that the network model applied in the present example has no long short-term memory layer and does not visualize information of intermediate layers within the network; instead, it uses the superposition of classification weights and different feature activation regions to map the network's classification basis into a visible image through parameter calculation.
In 2018, the invention patent application "Convolutional neural network visualization method based on Gram matrix regularization" (publication number: CN108470209A) disclosed a method that can visualize the features of every intermediate layer of a convolutional neural network, using Gram-matrix-based regularization to improve the feature visualization effect and to resist the visualization fooling effect. The difference is that the calculation structure of the present example is comparatively simple: no additional regularization algorithm is added to process the visual effect, and the intermediate tensor only needs to be mapped to a color space to obtain a colorized visual effect that permits comparison.
Example 1:
In this example, a group of "jumping jacks" action videos is input into the network model, and activation mapping visualization is performed using the correct classification label "jumping jacks". Figs. 2 and 3 are the 10th- and 60th-frame original images of the action video, Figs. 4 and 5 are the activation heatmaps of the input groups containing the 10th and 60th frames, and Figs. 6 and 7 are the attention result images obtained by superimposing the two heatmaps on the original images.
Example 2:
In this example, a group of "rope skipping" action videos is input into the network model, and activation mapping visualization is performed using the correct classification label "rope skipping". Figs. 8 and 9 are the 31st- and 154th-frame original images of the action video, Figs. 10 and 11 are the activation heatmaps of the input groups containing the 31st and 154th frames, and Figs. 12 and 13 are the attention result images obtained by superimposing the two heatmaps on the original images.
Example 3:
In this example, a group of "rope skipping" action videos is input into the network model, and activation mapping visualization is performed using the incorrect classification label "jumping jacks". Figs. 14 and 15 are the 31st- and 154th-frame original images of the action video, Figs. 16 and 17 are the activation heatmaps of the input groups containing the 31st and 154th frames, and Figs. 18 and 19 are the attention result images obtained by superimposing the two heatmaps on the original images.

Claims (5)

1. A three-dimensional action recognition residual network visualization method using class activation mapping, characterized in that: an activation heatmap is obtained from the classification weights and feature information of the network model and serves as a map of the activation strength between the network model's action recognition and the input data; the activation heatmap is superimposed on the input data to obtain an attention result image showing which information of the input data the network model's attention mainly focuses on under the corresponding classification result.
2. The three-dimensional action recognition residual network visualization method using class activation mapping of claim 1, characterized in that: the network model extracts features of the data through its intermediate hidden layers, and the output layer obtains the classification result by combining each class's classification weights with the features at each level.
3. A three-dimensional action recognition residual network visualization method using class activation mapping, characterized by the following implementation steps:
First step: read the classification weights w connecting the output layer and the global average pooling layer of the network model; their data size is [t, c′], where t is the total number of action categories the model can recognize and c′ is the number of channels connecting the output layer to the preceding layer;
Second step: process the action video to be recognized into a single-frame image dataset X according to the set channel number c, pixel height h, and width d, and take every n frames as one group of input data X_I^n; the single-frame image dataset X is formulated as:
X = [x_1, x_2, …, x_a]
where x is a single-frame image of size [c, h, d] and a is the total number of frames in the action video image dataset; the I-th group of input data is formulated as:
X_I^n = [x_{(I−1)n+1}, x_{(I−1)n+2}, …, x_{In}]
where X_I^n is converted, when used, into a tensor of data size [c, n, h, d];
Third step: input each group of input data into the network for action recognition, read the convolutional layer before the network's global average pooling layer, and obtain the feature information group f_I for each input data group X_I^n; f_I has data size [c′, n, h′, d′], with pixel height h′ and width d′;
Fourth step: select an action recognition result t_i; channel by channel over c′, multiply the weights w corresponding to the classification result t_i with the convolution features f_I, and accumulate the products into a tensor H_I whose initial values are zero;
Fifth step: map H_I to the three-channel jet color space and resize the pixel height and width from [h′, d′] to [h, d], obtaining the visual activation heatmap h_I for the current action category; h_I has size [3, h, d];
Sixth step: superimpose each h_I, grouped by its corresponding data group, onto the corresponding input data X_I^n to obtain the attention result images of the model's classification.
4. The three-dimensional action recognition residual network visualization method using class activation mapping of claim 3, characterized in that, in the second step: when a is not divisible by n, the leftover image frames are discarded; the default channel number c is 3, the frame count n is 16, and the pixel height h and width d are both 128.
5. The three-dimensional action recognition residual network visualization method using class activation mapping of claim 3, characterized in that, in the fourth step: the classification weights w_i are expressed as w_i = w[t_i, c′]; H_I is computed as expressed in the following programming-language form:
H_I += f_I[c′] × w[t_i, c′].
CN202010994268.7A 2020-09-21 2020-09-21 Three-dimensional action recognition residual network visualization method using class activation mapping Pending CN112115872A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010994268.7A CN112115872A (en) 2020-09-21 2020-09-21 Three-dimensional action recognition residual network visualization method using class activation mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010994268.7A CN112115872A (en) 2020-09-21 2020-09-21 Three-dimensional action recognition residual network visualization method using class activation mapping

Publications (1)

Publication Number Publication Date
CN112115872A 2020-12-22

Family

ID=73801544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010994268.7A Pending CN112115872A (en) 2020-09-21 2020-09-21 Three-dimensional action recognition residual error network visualization method using category activation mapping

Country Status (1)

Country Link
CN (1) CN112115872A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298084A (en) * 2021-04-01 2021-08-24 山东师范大学 Feature map extraction method and system for semantic segmentation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination