CN112115872A - Three-dimensional action recognition residual network visualization method using class activation mapping - Google Patents

Three-dimensional action recognition residual network visualization method using class activation mapping

Info

Publication number
CN112115872A
CN112115872A
Authority
CN
China
Prior art keywords
input data
network
data
classification
activation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010994268.7A
Other languages
Chinese (zh)
Inventor
Mao Lin
Chen Siyu
Yang Dawei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University
Priority to CN202010994268.7A
Publication of CN112115872A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/64: Three-dimensional objects

Abstract

A three-dimensional action recognition residual network visualization method using class activation mapping belongs to the field of video understanding in computer vision. It addresses the problem of improving the interpretability of a three-dimensional action recognition residual network's process of understanding video data. An activation heatmap is obtained from the classification weights and feature information of the network model and serves as a map of the activation strength between the model's action recognition and the input data. The activation heatmap is superimposed on the input data to obtain an attention result image, which shows which information in the input data the network model's attention is mainly focused on under the corresponding classification result. The effect is that action recognition technology can be better applied and popularized in fields such as medical care, logistics and transportation, autonomous driving, security monitoring, online short video, and live-stream screening.

Description

Three-dimensional action recognition residual network visualization method using class activation mapping
Technical Field
The invention belongs to the field of video understanding in computer vision, and particularly relates to a three-dimensional action recognition residual network visualization method using class activation mapping.
Background
Neural network methods now achieve excellent results across computer vision tasks, but because a network model learns solely by continuously updating its internal parameters, it is difficult for people to intuitively understand what the model has learned. For the technology to advance in each related field, neural network methods need to be fair, transparent, clearly interpretable, and understandable to the public. Some researchers have therefore designed various neural network visualization methods that intuitively display how a network model learns and understands its data, increasing the interpretability of the model.
The Deconvolutional Network can read the feature information of selected neurons in a convolutional neural network without changing the input data, and maps that feature information back to the input image space through reversed convolutions, so that the information captured by the selected neurons in the input image is highlighted, effectively expressing how strongly each neuron responds to the input data. The Activation Maximization method takes random noise as the input image and, through gradient ascent, gradually transforms the noise into a visualization that maximizes the activation of a selected neuron; the generated result intuitively shows which features each neuron uses as its activation criterion, i.e., what the neuron has learned. Class Activation Mapping was originally proposed for weakly supervised learning, simultaneously classifying and localizing targets within a classification network; beyond target localization, further visualization of its mapping result lets people see which regions of the input image the network model used as the basis for its classification decision, giving a more intuitive understanding of how the model arrives at a reliable classification result.
Because the field of video understanding has developed more slowly than image understanding, current neural network visualization methods mainly target 2D neural networks that take planar images as input; applications of such visualization methods to three-dimensional neural networks that take video as input remain scarce. Video data carries temporal dynamic information that image data lacks, so visualization methods in video understanding need different technical means to produce satisfactory visual results. There is still much room for exploration and development of visualization techniques in the field of video understanding.
Disclosure of Invention
To improve the interpretability of a three-dimensional action recognition residual network's process of understanding video data, the invention provides the following technical scheme: a three-dimensional action recognition residual network visualization method using class activation mapping obtains an activation heatmap from the classification weights and feature information of the network model; the activation heatmap serves as a map of the activation strength between the network model's action recognition and the input data. The activation heatmap is superimposed on the input data to obtain an attention result image, which shows which information in the input data the network model's attention is mainly focused on under the corresponding classification result.
Further, the network model extracts features of the data through its intermediate hidden layers, and the output layer obtains the classification result by combining each class's classification weights with the features at each level.
The invention also relates to another three-dimensional action recognition residual network visualization method using class activation mapping, implemented in the following steps:
First step: read the classification weights w connecting the output layer and the global average pooling layer of the network model; their data size is [t, c′], where t is the total number of action categories the model can recognize and c′ is the number of channels connecting the output layer to the preceding layer;
Second step: process the action video to be recognized into a single-frame image dataset X according to the set channel number c, pixel height h, and width d, and take every n frames as one group of input data X_I^n. The single-frame image dataset X is formulated as:
X = [x_1, x_2, …, x_a]
where x is a single-frame image of size [c, h, d] and a is the total number of frames in the action video image dataset. The I-th group of input data is formulated as:
X_I^n = [x_{(I−1)n+1}, x_{(I−1)n+2}, …, x_{In}]
where X_I^n is converted, when used, into a tensor of data size [c, n, h, d];
Third step: input each group of input data into the network for action recognition, read the convolutional layer before the network's global average pooling layer, and obtain the feature information group f_I for each input data group X_I^n; f_I has data size [c′, n, h′, d′], with pixel height h′ and width d′;
Fourth step: select an action recognition result t_i; channel by channel over c′, multiply the weights w corresponding to the classification result t_i with the convolution features f_I, and accumulate the products into a tensor H_I whose initial values are zero;
the fifth step: h is to beIMapping to a three-channel jet color space and sizing the height and width of the pixel data by [ h ', d']Conversion to [ h, d]And obtaining the visual activation thermodynamic diagram h under the current action categoryI,hIHas a size of [3, h, d];
And a sixth step: h is divided according to corresponding different data groupsIWith corresponding input data
Figure BDA0002691968340000025
And overlapping to obtain an attention result graph of model classification.
Further, in the second step: when a is not divisible by n, the leftover image frames are discarded; the default channel number c is 3, the frame count n is 16, and the pixel height h and width d are both 128.
Further, in the fourth step: the classification weights w_i are expressed as w_i = w[t_i, c′]; H_I is computed as expressed in the following programming-language form: H_I += f_I[c′] × w[t_i, c′].
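As an illustration only, a minimal NumPy sketch of the accumulation in the fourth step (the function name compute_activation_map is hypothetical, and averaging over the temporal dimension n to obtain one map per input group is an assumption; the patent leaves the temporal handling of H_I implicit):

```python
import numpy as np

def compute_activation_map(w, f_I, t_i):
    """Accumulate weighted convolution features into H_I (fourth step).

    w   : classification weights read in the first step, shape [t, c']
    f_I : feature group read in the third step, shape [c', n, h', d']
    t_i : index of the selected action recognition result
    """
    H_I = np.zeros(f_I.shape[1:], dtype=np.float32)  # same tensor, zero initial value
    for c in range(f_I.shape[0]):                    # channel-by-channel merge
        H_I += f_I[c] * w[t_i, c]                    # H_I += f_I[c'] x w[t_i, c']
    return H_I.mean(axis=0)                          # assumed: average over the n frames
```

The jet-color mapping, resizing to [h, d], and superposition with the input frames (fifth and sixth steps) are sketched further below in the detailed description.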
Advantageous effects: the visualization results of the invention help people understand the network model more comprehensively. Because video action recognition requires the model to learn temporal and spatial information in the video data simultaneously, how to balance the degree to which the two kinds of information are learned, so as to guarantee the classification performance of the network model, is one of the important difficulties and breakthrough points of action recognition technology. Through visualization of the model's classification basis, people can better grasp whether the model's judgment during classification leans toward temporal information or spatial information, or whether its dependence on the two is relatively balanced. When the model over-relies on certain information as its judgment basis and overfits, the intuitive judgment provided by the visualization results allows the network model to be debugged and improved in a more targeted way.
The invention uses class activation mapping to visualize classification attention for a three-dimensional action recognition residual neural network, helping people better understand which regions of the input data the model attends to during action recognition and which information serves as the judgment basis for the recognition result. Improving the interpretability of the network model also increases users' trust, so that action recognition technology can be better applied and popularized in fields such as medical care, logistics and transportation, autonomous driving, and security monitoring.
In practical applications of action recognition, a recognition system may be built from several different models or multi-task models according to the situation. The visualization results obtained by this method show more intuitively how a network model's performance differs across application scenarios; for example, the concrete performance of a model can be evaluated by comparing visualization results across targeted application examples such as daytime, cloudy weather, dusk, night, pedestrians, buildings, and vehicles. The recognition system can then assign different recognition tasks for different application examples and group network models within the system according to their concrete performance on those examples, improving the overall performance of the system; different models can also be used to expand the system's application scenarios and add different recognition functions.
Drawings
FIG. 1 is a logic diagram of the present method.
Fig. 2 shows the 10th-frame original image of the action video input in Example 1.
Fig. 3 shows the 60th-frame original image of the action video input in Example 1.
Fig. 4 is the activation heatmap of the input group containing the 10th-frame original image in Example 1.
Fig. 5 is the activation heatmap of the input group containing the 60th-frame original image in Example 1.
Fig. 6 is the attention result image obtained by superimposing the 10th-frame original image and its corresponding heatmap in Example 1.
Fig. 7 is the attention result image obtained by superimposing the 60th-frame original image and its corresponding heatmap in Example 1.
Fig. 8 shows the 31st-frame original image of the action video input in Example 2.
Fig. 9 shows the 154th-frame original image of the action video input in Example 2.
Fig. 10 is the activation heatmap of the input group containing the 31st-frame original image in Example 2.
Fig. 11 is the activation heatmap of the input group containing the 154th-frame original image in Example 2.
Fig. 12 is the attention result image obtained by superimposing the 31st-frame original image and its corresponding heatmap in Example 2.
Fig. 13 is the attention result image obtained by superimposing the 154th-frame original image and its corresponding heatmap in Example 2.
Fig. 14 shows the 31st-frame original image of the action video input in Example 3.
Fig. 15 shows the 154th-frame original image of the action video input in Example 3.
Fig. 16 is the activation heatmap of the input group containing the 31st-frame original image in Example 3.
Fig. 17 is the activation heatmap of the input group containing the 154th-frame original image in Example 3.
Fig. 18 is the attention result image obtained by superimposing the 31st-frame original image and its corresponding heatmap in Example 3.
Fig. 19 is the attention result image obtained by superimposing the 154th-frame original image and its corresponding heatmap in Example 3.
Detailed Description
The invention is described in further detail below with reference to the following embodiments and the accompanying drawings:
In one embodiment, a three-dimensional action recognition residual network visualization method using class activation mapping is described. An activation heatmap is obtained from the classification weights and feature information of the model and serves as a map of the activation strength between the model's action recognition and the input data, showing region by region which part of the input data the network model uses as its classification basis. The activation heatmap is superimposed on the input data to obtain an attention result image, which visually shows which information in the input data the model mainly focuses on under the corresponding classification result.
The activation heatmap, composed from the classification weights and the feature information, can display the network model's classification basis region by region. When the network model classifies input data, the intermediate hidden layers extract features from the data, and the output layer obtains the final classification result by combining each class's classification weights with the features at each level. Therefore, although the method cannot reveal the specific content each neuron has learned, visualizing the combination of the classification weights and the hierarchical features can intuitively show which regions of the input data each classification result is sensitive to, and how that sensitivity differs from its sensitivity to other information;
the sensitivity of a specified classification result to each region of the input data is equivalent to the degree of correlation between that classification result and the region's information; regions with higher correlation have a more pronounced influence on the classification result, i.e., their information is the main judgment basis from which the output layer obtains the specified classification result;
the output layer obtains the final classification result by combining each class's classification weights with different features; the classification weight parameters are independent per class, and different action recognition results have different sets of classification weight parameters;
different visual effects can be achieved by using the classification weight parameters of different action recognition results. When the classification weights of a correct recognition result are used as the merging basis, the visualization result provides the judgment basis from which the network model obtained the correct classification, enhancing the interpretability of the model; when the classification weights of an incorrect recognition result are used as the merging basis, the visualization result indicates which information in the input data caused the network model to misjudge, making it easier both to understand in depth the commonalities between different data and the shortcomings of the network model's performance, and to optimize the model's performance in a more targeted way.
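For intuition, the combination the output layer performs can be written out explicitly; this is the standard class-activation-mapping identity restated in the notation of this description (the patent itself gives no equation):

$$ s_{t_i} = \sum_{c'} w[t_i, c'] \cdot \mathrm{GAP}(f_I)[c'] = \frac{1}{n\,h'\,d'} \sum_{k,u,v} \Big( \sum_{c'} w[t_i, c'] \, f_I[c', k, u, v] \Big) = \frac{1}{n\,h'\,d'} \sum_{k,u,v} H_I[k, u, v] $$

so the activation map H_I is exactly the location-wise, pre-pooling decomposition of the class score s_{t_i}: locations where H_I is large are those that contribute most to the classification result, which is what the superimposed heatmap displays.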
In this method, the input video data must be cut into blocks and fed in as separate data groups, and the activation heatmaps must be generated group by group from the features of the different video blocks;
the method requires the network model's structure to satisfy the following: only a global average pooling layer lies between the model's final output layer and its last convolutional layer, with no other fully connected layers;
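A minimal sketch of a network head satisfying this structural requirement (the backbone channel count, kernel size, and layer names are placeholders of this sketch, not the patent's specific 3D residual model):

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Last conv layer -> global average pooling -> single output layer.

    No extra fully connected layers, as the method requires.
    """
    def __init__(self, c_prime: int, num_classes: int):
        super().__init__()
        self.last_conv = nn.Conv3d(256, c_prime, kernel_size=3, padding=1)
        self.gap = nn.AdaptiveAvgPool3d(1)           # global average pooling
        self.fc = nn.Linear(c_prime, num_classes)    # weights w, shape [t, c']

    def forward(self, x):                            # x: [batch, 256, n, h', d']
        f = self.last_conv(x)                        # features f_I per sample
        pooled = self.gap(f).flatten(1)              # [batch, c']
        return self.fc(pooled)                       # class scores
```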
Further, the specific steps of the 3D residual action recognition network visualization method using class activation mapping are as follows:
First step: read the classification weights w connecting the output layer and the global average pooling layer of the 3D residual action recognition network model to be visualized;
wherein w has size [t, c′], t is the total number of action categories the model can recognize, c′ is the number of channels connecting the output layer to the preceding layer, and each channel has a different connection weight;
wherein the values of t and c′ are determined by the structure of the network model and have no default values;
Second step: process the action video to be recognized into a single-frame image dataset X, then group it by a fixed frame count into the input data groups X_I^n;
wherein the image dataset X is formulated as:
X = [x_1, x_2, …, x_a]
where x is a single-frame image of size [c, h, d] after processing according to the set channel number c, pixel height h, and width d, and a is the total number of frames in the image dataset, which depends on the duration of the input video; the default channel number c is 3, and the pixel height h and width d are both 128;
further, the I-th group of input data X_I^n can be formulated as:
X_I^n = [x_{(I−1)n+1}, x_{(I−1)n+2}, …, x_{In}]
where X_I^n has data size [c, n, h, d] and n is the number of single-frame images in one group of input data, set to 16 by default; when n does not divide a exactly, the leftover image frames in the image dataset X are discarded and not used as input data; hereinafter X_I^n denotes a given group of input data;
Third step: input the data X_I^n into the network for action recognition, and read the feature information f of the convolutional layer in front of the network's global average pooling layer;
wherein the feature information f is divided into the feature information groups f_I corresponding to the different input data groups X_I^n;
wherein, because the network model iterates on the data layer by layer, f_I has data size [c′, n, h′, d′]; the pixel height h′ and width d′ of f_I differ somewhat from the pixel height h and width d of the input data group X_I^n, with the specific numerical difference depending on the structure of the network model;
Fourth step: merge, channel by channel over c′, the classification weights w_i of the selected action category t_i with the convolution features f_I, accumulating each channel's merged result in the tensor H_I; then map H_I to the three-channel jet color space to obtain the visual activation heatmap h_I for the current action category t_i;
wherein the classification weights w_i are expressed as: w_i = w[t_i, c′];
wherein the tensor H_I has all values initialized to zero and size [n, h′, d′];
wherein, alongside mapping H_I to the three-channel jet color space, the data height and width must be resized from [h′, d′] to [h, d]; the visual activation heatmap h_I has size [3, h, d];
Fifth step: superimpose the heatmaps h_I, according to their grouping, onto the corresponding input data X_I^n to obtain the model's classification attention result images; sketches of these steps follow.
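As an illustration of the second through fifth steps above, a minimal sketch assuming PyTorch and OpenCV; every helper name here (load_input_groups, save_features, to_heatmap, overlay, model.last_conv) is a hypothetical choice of this sketch rather than terminology from the patent, and the blending ratio in the superposition is likewise assumed:

```python
import cv2
import numpy as np
import torch

def load_input_groups(video_path, c=3, n=16, h=128, d=128):
    """Second step: video -> single-frame dataset X -> groups X_I^n of shape [c, n, h, d]."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (d, h)).astype(np.float32) / 255.0)
    cap.release()
    a = len(frames) - len(frames) % n                  # leftover frames are discarded
    X = np.stack(frames[:a]).reshape(a // n, n, h, d, c)
    return X.transpose(0, 4, 1, 2, 3)                  # [a//n, c, n, h, d]

features = {}

def save_features(module, inputs, output):
    """Third step: capture f_I from the conv layer before global average pooling."""
    features["f_I"] = output.detach().squeeze(0).cpu().numpy()   # [c', n, h', d']

# Registered once on the model to be visualized, e.g.:
# hook = model.last_conv.register_forward_hook(save_features)

def to_heatmap(H_I, h=128, d=128):
    """Fourth step, second half: map H_I to the jet color space, resized to [h, d]."""
    m = (H_I - H_I.min()) / (H_I.max() - H_I.min() + 1e-8)       # normalize
    m = cv2.resize(np.uint8(255 * m), (d, h))                    # [h', d'] -> [h, d]
    return cv2.applyColorMap(m, cv2.COLORMAP_JET)                # h_I as [h, d, 3]

def overlay(frame_u8, h_I, alpha=0.5):
    """Fifth step: superimpose the heatmap on an input frame (alpha is assumed)."""
    return cv2.addWeighted(frame_u8, 1.0 - alpha, h_I, alpha, 0.0)
```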
The three-dimensional action recognition residual network visualization method using class activation mapping belongs to the field of video understanding in computer vision. To improve the interpretability of a three-dimensional action recognition residual network's process of understanding video data, an activation heatmap is obtained from the classification weights and feature information of the model and serves as a map of the activation strength between the model and the input data, showing region by region which part of the input data the network model uses as its classification basis. Superimposing the activation heatmap on the input data gives an attention result image that visually shows the regions of the input data on which the model's attention is mainly focused under the corresponding classification result, so that action recognition technology can be better applied and popularized in fields such as medical care, logistics and transportation, autonomous driving, security monitoring, online short video, and live-stream screening.
In another embodiment, a three-dimensional motion recognition residual error network visualization method using category activation mapping is described, a logic schematic of the method is shown in fig. 1, and the algorithm is implemented by the following steps:
First step: read the classification weights w connecting the output layer and the global average pooling layer of the network model; their data size is [t, c′], where t is the total number of action categories the model can recognize and c′ is the number of channels connecting the output layer to the preceding layer;
Second step: process the action video to be recognized into a single-frame image dataset X according to the set channel number c, pixel height h, and width d, and take every n frames as one group of input data X_I^n. The single-frame image dataset X is formulated as:
X = [x_1, x_2, …, x_a]
where x is a single-frame image of size [c, h, d] and a is the total number of frames in the action video image dataset; when a is not divisible by n, the leftover image frames are discarded. The I-th group of input data can be formulated as:
X_I^n = [x_{(I−1)n+1}, x_{(I−1)n+2}, …, x_{In}]
where X_I^n is converted, when used, into a tensor of data size [c, n, h, d]. The method's default channel number c is 3, frame count n is 16, and pixel height h and width d are both 128. Hereinafter X_I^n denotes a given group of input data;
Third step: input the data X_I^n into the network for action recognition, and read from the convolutional layer before the network's global average pooling layer the feature information groups f_I obtained for the different input data groups X_I^n. Because the network model iterates on the data layer by layer, the pixel height h′ and width d′ of f_I differ somewhat from the pixel height h and width d of X_I^n; f_I has data size [c′, n, h′, d′]. The specific values differ with the structure of the network model in practical applications;
Fourth step: select an action recognition result t_i; channel by channel over c′, multiply the weights w corresponding to the classification result t_i with the convolution features f_I, and accumulate the products into a tensor H_I whose initial values are zero. The computation of H_I can be represented in the following programming-language form:
H_I += f_I[c′] × w[t_i, c′]
Fifth step: map H_I to the three-channel jet color space and resize the pixel height and width from [h′, d′] to [h, d], obtaining the visual activation heatmap h_I for the current action category; h_I has size [3, h, d];
Sixth step: superimpose each h_I, grouped by its data group, onto the corresponding input data X_I^n to obtain the attention result images of the model's classification; an end-to-end sketch follows.
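Tying the sketches above together, a hypothetical end-to-end run over one video; all helper names are the illustrative ones introduced in the earlier sketches, and nothing here is prescribed by the patent:

```python
import cv2
import numpy as np
import torch

# model, features/save_features, load_input_groups, compute_activation_map,
# to_heatmap and overlay are assumed defined as in the sketches above
hook = model.last_conv.register_forward_hook(save_features)
w = model.fc.weight.detach().cpu().numpy()           # classification weights, [t, c']

groups = load_input_groups("action.mp4")             # second step
for I, g in enumerate(groups):
    with torch.no_grad():
        scores = model(torch.from_numpy(g).unsqueeze(0))   # third step (fills f_I)
    t_i = int(scores.argmax(dim=1))                  # selected recognition result
    H_I = compute_activation_map(w, features["f_I"], t_i)  # fourth step
    h_I = to_heatmap(H_I)                            # fifth step
    for k in range(g.shape[1]):                      # sixth step: each frame in group
        frame = np.uint8(255 * g[:, k].transpose(1, 2, 0))
        cv2.imwrite(f"attention_g{I}_f{k}.png", overlay(frame, h_I))
hook.remove()
```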
In view of the above, the invention provides a three-dimensional action recognition residual network visualization method using class activation mapping, which merges the classification weights and the feature information to obtain an activation heatmap as the visual activation result between the model's action recognition and the input data, displaying region by region which part of the input data the network model uses as its classification basis. The activation heatmap is superimposed on the input data to obtain an attention result image that visually shows which information of the input data the model's attention mainly focuses on under the corresponding classification result, improving the interpretability of the classification results obtained by the three-dimensional action recognition residual network.
Prior art solutions related to the present disclosure are as follows:
in 2017, the invention patent application of 'method for visualizing a convolutional neural network' (publication number: CNIO7392085A) discloses a method for visualizing a convolutional neural network, which uses customized network parameters and sets a convolutional neural network feature extraction function corresponding to the parameters, effectively executes a feature extraction program, stores response features of all neurons, and visualizes the larger response and the maximum response features of a single neuron in a specified layer by combining the calculated response domain parameters of the specified neurons. The difference is that the example does not need to calculate the response parameters additionally, and the visualization is completed only by reading and combining the feature information and the classification weight.
In 2017, the invention patent application "Intermediate information analysis device, optimization device, and feature visualization device for a neural network" (publication number: CN109460812A) disclosed devices that simplify the hidden states output by a long short-term memory layer in a neural network through a dimensionality reduction algorithm, analyze the network's intermediate information from the simplified hidden states together with the network's input data and output results, and realize feature visualization of the hidden states output by the long short-term memory layer from that intermediate information, aiding further research into neural networks with long short-term memory layers. The difference is that the network model applied in the present example has no long short-term memory layer and does not visualize information of intermediate layers within the network; instead, it uses the superposition of classification weights and different feature activation regions to map the network's classification basis into a visible image through parameter calculation.
In 2018, the invention patent application "Convolutional neural network visualization method based on Gram matrix regularization" (publication number: CN108470209A) disclosed a method that can visualize the features of every intermediate layer of a convolutional neural network, using Gram-matrix-based regularization to improve the feature visualization effect and to resist the visualization fooling effect. The difference is that the calculation structure of the present example is comparatively simple: no additional regularization algorithm is added to process the visual effect, and the intermediate tensor only needs to be mapped to a color space to obtain a colorized visual effect that permits comparison.
Example 1:
In this example, a group of "jumping jacks" action videos is input into the network model, and activation mapping visualization is performed using the correct classification label "jumping jacks". Figs. 2 and 3 are the 10th- and 60th-frame original images of the action video, Figs. 4 and 5 are the activation heatmaps of the input groups containing the 10th and 60th frames, and Figs. 6 and 7 are the attention result images obtained by superimposing the two heatmaps on the original images.
Example 2:
In this example, a group of "rope skipping" action videos is input into the network model, and activation mapping visualization is performed using the correct classification label "rope skipping". Figs. 8 and 9 are the 31st- and 154th-frame original images of the action video, Figs. 10 and 11 are the activation heatmaps of the input groups containing the 31st and 154th frames, and Figs. 12 and 13 are the attention result images obtained by superimposing the two heatmaps on the original images.
Example 3:
In this example, a group of "rope skipping" action videos is input into the network model, and activation mapping visualization is performed using the incorrect classification label "jumping jacks". Figs. 14 and 15 are the 31st- and 154th-frame original images of the action video, Figs. 16 and 17 are the activation heatmaps of the input groups containing the 31st and 154th frames, and Figs. 18 and 19 are the attention result images obtained by superimposing the two heatmaps on the original images.

Claims (5)

1. A three-dimensional action recognition residual network visualization method using class activation mapping, characterized in that: an activation heatmap is obtained from the classification weights and feature information of the network model and serves as a map of the activation strength between the network model's action recognition and the input data; the activation heatmap is superimposed on the input data to obtain an attention result image showing which information of the input data the network model's attention mainly focuses on under the corresponding classification result.
2. The three-dimensional action recognition residual network visualization method using class activation mapping of claim 1, characterized in that: the network model extracts features of the data through its intermediate hidden layers, and the output layer obtains the classification result by combining each class's classification weights with the features at each level.
3. A three-dimensional action recognition residual network visualization method using class activation mapping, characterized by the following implementation steps:
First step: read the classification weights w connecting the output layer and the global average pooling layer of the network model; their data size is [t, c′], where t is the total number of action categories the model can recognize and c′ is the number of channels connecting the output layer to the preceding layer;
Second step: process the action video to be recognized into a single-frame image dataset X according to the set channel number c, pixel height h, and width d, and take every n frames as one group of input data X_I^n; the single-frame image dataset X is formulated as:
X = [x_1, x_2, …, x_a]
where x is a single-frame image of size [c, h, d] and a is the total number of frames in the action video image dataset; the I-th group of input data is formulated as:
X_I^n = [x_{(I−1)n+1}, x_{(I−1)n+2}, …, x_{In}]
where X_I^n is converted, when used, into a tensor of data size [c, n, h, d];
Third step: input each group of input data into the network for action recognition, read the convolutional layer before the network's global average pooling layer, and obtain the feature information group f_I for each input data group X_I^n; f_I has data size [c′, n, h′, d′], with pixel height h′ and width d′;
Fourth step: select an action recognition result t_i; channel by channel over c′, multiply the weights w corresponding to the classification result t_i with the convolution features f_I, and accumulate the products into a tensor H_I whose initial values are zero;
Fifth step: map H_I to the three-channel jet color space and resize the pixel height and width from [h′, d′] to [h, d], obtaining the visual activation heatmap h_I for the current action category; h_I has size [3, h, d];
Sixth step: superimpose each h_I, grouped by its corresponding data group, onto the corresponding input data X_I^n to obtain the attention result images of the model's classification.
4. The three-dimensional action recognition residual network visualization method using class activation mapping of claim 3, characterized in that, in the second step: when a is not divisible by n, the leftover image frames are discarded; the default channel number c is 3, the frame count n is 16, and the pixel height h and width d are both 128.
5. The three-dimensional action recognition residual network visualization method using class activation mapping of claim 3, characterized in that, in the fourth step: the classification weights w_i are expressed as w_i = w[t_i, c′]; H_I is computed as expressed in the following programming-language form:
H_I += f_I[c′] × w[t_i, c′].
CN202010994268.7A 2020-09-21 2020-09-21 Three-dimensional action recognition residual network visualization method using class activation mapping Pending CN112115872A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010994268.7A CN112115872A (en) 2020-09-21 2020-09-21 Three-dimensional action recognition residual network visualization method using class activation mapping

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010994268.7A CN112115872A (en) 2020-09-21 2020-09-21 Three-dimensional action recognition residual network visualization method using class activation mapping

Publications (1)

Publication Number Publication Date
CN112115872A 2020-12-22

Family

ID=73801544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010994268.7A Pending CN112115872A (en) 2020-09-21 2020-09-21 Three-dimensional action recognition residual error network visualization method using category activation mapping

Country Status (1)

Country Link
CN (1) CN112115872A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298084A (en) * 2021-04-01 2021-08-24 山东师范大学 Feature map extraction method and system for semantic segmentation



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination