CN113065450B - Human body action recognition method based on separable three-dimensional residual error attention network - Google Patents

Human body action recognition method based on separable three-dimensional residual error attention network Download PDF

Info

Publication number
CN113065450B
Authority
CN
China
Prior art keywords
attention
dimensional
channel
separable
sep
Prior art date
Legal status
Active
Application number
CN202110334547.5A
Other languages
Chinese (zh)
Other versions
CN113065450A (en)
Inventor
张祖凡
彭月
甘臣权
张家波
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110334547.5A priority Critical patent/CN113065450B/en
Publication of CN113065450A publication Critical patent/CN113065450A/en
Application granted granted Critical
Publication of CN113065450B publication Critical patent/CN113065450B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human body action recognition method based on a separable three-dimensional residual attention network, and belongs to the field of computer vision. The method comprises the following steps: S1: replacing the standard three-dimensional convolutions in 3D ResNet with separable three-dimensional convolutions to build Sep-3D ResNet; S2: designing a channel attention module and a spatial attention module, then stacking them in sequence to construct a dual attention mechanism; S3: applying dual attention weighting to the intermediate convolutional features at different moments, expanding the dual attention module in the time dimension, embedding it into the separable three-dimensional residual blocks of Sep-3D ResNet to form Sep-3D RABs, and thereby constructing a Sep-3D RAN; S4: performing joint end-to-end training on the Sep-3D RAN with a multi-stage training strategy. The invention improves the discriminative power of classification features, achieves efficient extraction of high-quality spatio-temporal visual features, and enhances the classification accuracy and recognition efficiency of the model.

Description

Human body action recognition method based on separable three-dimensional residual error attention network
Technical Field
The invention belongs to the field of computer vision, and relates to a human body action recognition method based on a separable three-dimensional residual attention network.
Background
Videos conceal enormous amounts of information, and the huge user base and rapidly growing scale of the online video market pose great challenges to the management, storage and identification of network videos, so network video services are receiving increasing attention from all parties. In human-centered computer vision research, human action recognition is an important research direction owing to its wide application in fields such as human-computer interaction, smart homes, autonomous driving and virtual reality. Its main task is to automatically recognize human actions in an image sequence or video: analyzing the image sequence, parsing human movement patterns, and establishing a mapping between video content and action categories, so as to mine the deep-level information contained in the video, learn and analyze the human actions and behaviors it records, and ultimately understand the video content. Accurately recognizing human actions in videos helps internet platforms classify and manage massive volumes of related video data in a unified way, and contributes to building a harmonious network environment. Moreover, progress in human action recognition technology has matured video anomaly monitoring services: it can help public security personnel quickly anticipate crises in public places, and can promptly detect abnormal behaviors in daily home life (such as fainting or falling) so that medical help is sought in time. Accurately recognizing human actions in videos therefore has substantial academic and practical value.
Traditional action recognition algorithms depend on hand-crafted features, which often must be designed specifically for each task; the performance of such algorithms is tied to the database, the complexity of the processing pipeline grows across different datasets, and their generalization ability and universality are poor. Moreover, in the current era of information explosion, image and video data are growing exponentially, and researchers increasingly prefer non-manual methods that extract more general feature representations, so action recognition methods based on hand-crafted features can no longer meet task requirements.
Deep learning benefits from a hierarchical training paradigm: through a layer-by-layer progressive feature extraction mechanism, high-dimensional features are automatically extracted from raw video data and the contextual semantic information of the video is fully captured, which improves the descriptive power of deep models, facilitates final recognition and judgment, and has therefore been widely applied in action recognition. In recent years, the main deep learning techniques applied to human action recognition include 2D CNNs, 3D CNNs and attention mechanisms. 2D CNNs effectively capture the spatial neighborhood correlations of RGB video frames; 3D CNNs capture visual features in the spatial and temporal dimensions simultaneously; and attention mechanisms enable flexible screening of key features, improving a model's recognition performance. Although 2D CNNs have lower complexity and fewer parameters, they lack temporal-stream information and thus extract dynamic features poorly; although 3D CNNs can fuse spatio-temporal features directly on the raw input, they greatly increase the number of model parameters, which hinders optimization. In addition, many redundant features arise during feature extraction and interfere with the model's recognition results.
Therefore, a method for improving video recognition performance is needed.
Disclosure of Invention
In view of the above, the present invention provides a human body action recognition method based on a separable three-dimensional residual attention network, which adopts a principled kernel decomposition to alleviate the optimization difficulties of deep three-dimensional convolutional models, and combines an attention mechanism to improve the flexibility of key-feature screening, thereby producing higher-quality spatio-temporal visual features and improving the recognition performance of the model.
In order to achieve the purpose, the invention provides the following technical scheme:
a human body action recognition method based on a separable three-dimensional residual attention network specifically comprises the following steps:
s1: constructing Separable three-dimensional convolution, and replacing standard three-dimensional convolution in a traditional three-dimensional residual network (3D residual network,3D ResNet) by utilizing the Separable three-dimensional convolution so as to build a Separable 3D residual network (Sep-3D ResNet) to relieve the phenomenon of difficult optimization of a deep three-dimensional convolution model;
s2: designing a channel attention module to capture channel level importance distributions, designing a spatial attention module to automatically weigh importance of each spatial location, and then stacking two attention modules in sequence to construct a dual attention mechanism;
s3: carrying out double attention weighting on middle-layer convolution characteristics at different moments, expanding a double attention module in a time dimension, embedding the double attention module into a Separable three-dimensional residual block of Sep-3D ResNet, and constructing to form a Separable 3D residual attention network (Sep-3D RAN) model;
s4: and performing combined end-to-end training on the Sep-3D RAN model by using a multi-stage training strategy so as to relieve the overfitting effect of the model caused by insufficient training sample volume and improve the generalization capability of the model.
Further, in step S1, the separable three-dimensional convolution is constructed as follows: the standard three-dimensional convolution over the spatio-temporal dimensions is approximated, via a three-dimensional convolution kernel decomposition, by a two-dimensional convolution in the spatial dimensions followed by a one-dimensional convolution in the temporal dimension.
The separable three-dimensional convolution operates as follows: assume convolutional layer $i$ receives $N_{i-1}$ input features. The $N_{i-1}$ features are first convolved with $M_i$ two-dimensional spatial filters of size $1 \times h \times w \times N_{i-1}$, where $h$, $w$ and $N_{i-1}$ denote the height, width and channel dimension of the two-dimensional spatial convolution kernel; the result is then convolved with $N_i$ one-dimensional temporal filters of size $t \times 1 \times 1 \times M_i$, where $t$ and $M_i$ respectively denote the temporal scale and channel dimension of the one-dimensional temporal convolution kernel. The design of $M_i$ follows the rule that the parameter count of the decomposed three-dimensional convolution approximately equals that of the standard three-dimensional convolution, and is calculated by:

$$M_i = \left\lfloor \frac{t\,h\,w\,N_{i-1}\,N_i}{h\,w\,N_{i-1} + t\,N_i} \right\rfloor$$
To build Sep-3D ResNet, 3D ResNet is selected as the reference architecture of the model, and its standard three-dimensional convolutions are replaced with the separable three-dimensional convolution described above. Compared with the original reference model, Sep-3D ResNet doubles the number of nonlinear activation functions while keeping the number of network layers unchanged, making complex functions easier to fit; on the basis of alleviating the optimization difficulties of deep three-dimensional convolutional models, this improves the model's descriptive power and enhances its recognition performance.
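As a non-limiting illustration, the following PyTorch sketch shows one way such a separable (2+1)D convolution can be implemented, with the intermediate width $M_i$ computed by the formula above. The class name, the BatchNorm placement and the default kernel sizes are assumptions of this sketch, not details specified by the patent:

```python
import torch.nn as nn

class Sep3DConv(nn.Module):
    """Factorized 3D convolution: a 1 x h x w spatial convolution
    followed by a t x 1 x 1 temporal convolution, with an extra
    nonlinearity in between (which doubles the activation count)."""
    def __init__(self, in_channels, out_channels, t=3, h=3, w=3):
        super().__init__()
        # Intermediate width M_i chosen so the factorized parameter
        # count roughly matches the standard t x h x w 3D convolution.
        m = (t * h * w * in_channels * out_channels) // (
            h * w * in_channels + t * out_channels)
        self.spatial = nn.Conv3d(in_channels, m, kernel_size=(1, h, w),
                                 padding=(0, h // 2, w // 2), bias=False)
        self.bn1 = nn.BatchNorm3d(m)
        self.temporal = nn.Conv3d(m, out_channels, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                             # x: (B, C, T, H, W)
        x = self.relu(self.bn1(self.spatial(x)))      # 2D spatial stage
        return self.relu(self.bn2(self.temporal(xch)) if False else
                         self.bn2(self.temporal(x))).relu() if False else \
               self.relu(self.bn2(self.temporal(x)))  # 1D temporal stage
```

For example, with $t = h = w = 3$ and $N_{i-1} = N_i = 64$, the formula gives $M_i = \lfloor 110592 / 768 \rfloor = 144$, so the factorized kernel uses $9 \cdot 64 \cdot 144 + 3 \cdot 144 \cdot 64 = 110592$ parameters, exactly matching the $3 \cdot 3 \cdot 3 \cdot 64 \cdot 64$ parameters of the standard kernel.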
Further, in step S2, the input of the dual attention mechanism is first defined. Assume the model input is $F \in \mathbb{R}^{T \times H \times W \times C}$, where $T$, $H$, $W$ denote the time dimension, height and width of the input cube, and $C$ denotes the number of input channels. After one group or a series of separable three-dimensional convolutions, an intermediate feature-mapping cube $F' \in \mathbb{R}^{T' \times H' \times W' \times C'}$ is obtained; the slice tensor at time $t$ is defined as $F_t \in \mathbb{R}^{H' \times W' \times C'}$, where $t = 0, 1, \dots, T'$. The slice tensors are the input features of the subsequent dual attention mechanism.
Introduction of a dual attention mechanism:
(1) designing a channel attention module, which specifically comprises: since capturing the channel-level importance distribution requires explicitly modeling the dependencies between channels, a global average pooling operation is adopted to aggregate the spatial dimensions of the input feature, generating a channel descriptor $F_C \in \mathbb{R}^{1 \times 1 \times C'}$ and thereby avoiding interference from local spatial information; the expression is:

$$F_C = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_t(i, j)$$
where $F_t \in \mathbb{R}^{H' \times W' \times C'}$ denotes the slice tensor at time $t$, with $t = 0, 1, \dots, T'$, and $T'$, $H'$, $W'$, $C'$ respectively denote the time dimension, height, width and number of channels of the intermediate feature-mapping cube obtained after the input cube passes through one group or a series of separable three-dimensional convolutions;
subsequently, a gating mechanism similar to a self-attention function is used to obtain the importance distribution over the channels: the channel descriptor $F_C$ is fed into a multilayer perceptron with one hidden layer to produce an unnormalized channel attention map; to limit the number of model parameters, the dimension of the hidden layer is set to $C'/r$, where $r$ is a reduction ratio, usually set to 16; a sigmoid activation function then performs the normalization, giving the final channel attention map; the channel attention is computed as:

$$M_C(F_t) = EP_C(\sigma(\mathrm{MLP}(F_C))) = EP_C(\sigma(W_1\,\delta(W_0 F_C)))$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, $\delta(\cdot)$ denotes the ReLU activation function, $W_0$ and $W_1$ denote the weights of the multilayer perceptron, and $EP_C(\cdot)$ expands the channel attention values along the spatial domain to the original dimensions, i.e., $M_C(F_t) \in \mathbb{R}^{C' \times H' \times W'}$.
To perform automatic feature calibration, the channel attention map is applied to the original input feature; the refined slice tensor is computed as:

$$F_t' = M_C(F_t) \otimes F_t$$

where the symbol $\otimes$ denotes element-wise multiplication.
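A minimal PyTorch sketch of this channel attention module, operating on one slice tensor, might look as follows; the class name and the use of nn.Linear layers for the perceptron are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention over one slice tensor F_t of shape (B, C', H', W')."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                        # MLP, one hidden layer
            nn.Linear(channels, channels // reduction),  # W0, hidden dim C'/r
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels))  # W1

    def forward(self, f_t):
        b, c, _, _ = f_t.shape
        f_c = f_t.mean(dim=(2, 3))                 # global average pooling -> F_C
        m_c = torch.sigmoid(self.mlp(f_c))         # sigma: normalized attention
        m_c = m_c.view(b, c, 1, 1).expand_as(f_t)  # EP_C: expand over space
        return m_c * f_t                           # refined slice tensor F_t'
```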
(2) designing a spatial attention module, which specifically comprises: similar to the channel attention module, to compute the spatial attention map efficiently, a global average pooling operation is used to aggregate $F_t'$, generating a two-dimensional spatial descriptor $F_S \in \mathbb{R}^{H' \times W' \times 1}$ that summarizes the global channel information of $F_t'$; the specific expression is:

$$F_S = \frac{1}{C'} \sum_{k=1}^{C'} F_t'(k)$$
subsequently, to obtain the correlation between the different spatial positions of the feature map $F_t'$ and the target action, the spatial attention distribution is computed with a two-dimensional convolution in place of a multilayer perceptron, namely:

$$M_S(F_t') = EP_S(\sigma(\mathrm{conv}(F_S)))$$

where $\mathrm{conv}(\cdot)$ denotes a two-dimensional convolution operation, whose kernel size is typically set to 7 × 7 for the best recognition performance, and $EP_S(\cdot)$ denotes a dimension-transformation operation along the channel dimension that extends the single channel at each spatial position to the original channel dimension, i.e., $M_S(F_t') \in \mathbb{R}^{C' \times H' \times W'}$.
After deriving the channel attention map and the spatial attention map of the original slice tensor $F_t$, feature calibration is first performed with the channel attention module to obtain the refined slice tensor $F_t'$; feature recalibration is then performed between the spatial attention map $M_S(F_t')$ and $F_t'$ using element-wise multiplication, yielding the attention-weighted slice tensor $F_t''$. This distinguishes the information-dense channels while identifying the spatially salient regions, and suppresses redundant background information. The resulting final refined tensor $F_t''$ is computed as:

$$F_t'' = M_S(F_t') \otimes F_t'$$
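Under the same assumptions, the spatial attention module and the sequential composition can be sketched as:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention over a refined slice tensor F_t' of shape (B, C', H', W')."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f_t_ref):
        f_s = f_t_ref.mean(dim=1, keepdim=True)  # GAP over channels -> F_S
        m_s = torch.sigmoid(self.conv(f_s))      # 2D conv + sigma
        return m_s.expand_as(f_t_ref) * f_t_ref  # EP_S + recalibration -> F_t''
```

Stacking the two modules in sequence, channel first and then spatial, realizes the dual attention weighting $F_t'' = M_S(F_t') \otimes (M_C(F_t) \otimes F_t)$ described above.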
further, in step S3, the building of the Sep-3D RAN model specifically includes: to achieve the aforementioned expansion of the dual attention mechanism in the time dimension, the inference process of channel attention mapping and spatial attention mapping needs to be applied to the middle layer convolution feature F' ∈ R T'×H'×W'×C' The double attention weighting process is repeated on all time dimensions, namely the slice tensors at all moments, and finally, the thinned slice tensors are arranged according to the original time dimension and stacked into a final thinned feature cube;
by embedding the channel attention module and the space attention module which are subjected to time domain expansion in sequence in the Separable three-dimensional residual block of the Sep-3D ResNet, a Separable 3D residual attention block (Sep-3D RAB) is obtained, so that richer attention resources are flexibly allocated to key features while abstract semantic information of input data is captured; and finally, building a Sep-3D RAN according to a model architecture of 3D ResNet, namely replacing a simple residual block in the 3D ResNet with a Sep-3D RAB.
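Building on the sketches above, one plausible form of a Sep-3D RAB, with the dual attention expanded along the time dimension, is the following; the number of Sep-3D convolutions per block and the placement of the residual connection are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class Sep3DRAB(nn.Module):
    """Separable 3D residual attention block: separable 3D convolutions
    plus dual attention applied independently to every temporal slice."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = Sep3DConv(channels, channels)   # from the earlier sketch
        self.conv2 = Sep3DConv(channels, channels)
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x):                    # x: (B, C', T', H', W')
        out = self.conv2(self.conv1(x))
        slices = []
        for t in range(out.size(2)):         # expand attention along time
            f_t = out[:, :, t]               # slice tensor F_t: (B, C', H', W')
            f_t = self.channel_att(f_t)      # F_t'  (channel calibration)
            f_t = self.spatial_att(f_t)      # F_t'' (spatial recalibration)
            slices.append(f_t)
        out = torch.stack(slices, dim=2)     # restack in original time order
        return out + x                       # residual connection
```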
Further, in step S4, the Sep-3D RAN model is trained jointly end-to-end with a multi-stage training strategy, specifically: the network parameters are first initialized with pre-training weights to accelerate convergence; since Sep-3D RAN contains four separable three-dimensional residual attention blocks, the training process is divided into four stages. In the first stage, the attention mechanism is embedded only into the first residual block; the network-layer parameters before that module are then frozen and the subsequent layers are trained. In the second stage, the attention mechanism is additionally embedded into the second residual block; the network layers before the current module are initialized with the weights learned in the first stage, and the subsequent layers are trained. This process is repeated until all residual blocks have the attention mechanism embedded. Owing to the pre-training weights, the model converges quickly, so the training process is not time-consuming and is easy to implement. Furthermore, the model remains end-to-end trainable in every training stage, so it can directly learn the mapping from raw input to target output.
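A schematic sketch of one stage of this strategy is given below; the `res_blocks` attribute, the optimizer choice and the hyperparameters are illustrative assumptions, not values prescribed by the patent:

```python
import torch.nn as nn
import torch.optim as optim

def train_stage(model, loader, stage, epochs=10, lr=1e-3):
    """One stage of the multi-stage strategy: attention has been embedded
    up to residual block `stage`; earlier blocks keep the weights learned
    in the previous stage and are frozen, later layers are trained."""
    for i, block in enumerate(model.res_blocks):  # hypothetical attribute
        for p in block.parameters():
            p.requires_grad = (i >= stage)        # freeze blocks before `stage`
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.SGD(trainable, lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()             # softmax + cross entropy
    for _ in range(epochs):
        for clips, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()

# Four stages, one per Sep-3D RAB; each stage starts from the previous weights.
# for stage in range(4):
#     train_stage(sep3d_ran, train_loader, stage)
```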
To realize this end-to-end training mode, a fully connected layer is used to generate the final one-dimensional prediction vector $I \in \mathbb{R}^{C}$, where $C$ is the total number of action categories in the target dataset; the softmax function is then selected to compute the probability distribution over the categories to which the input video belongs, i.e.:

$$p_{n,i} = \frac{\exp(I_i)}{\sum_{j=1}^{C} \exp(I_j)}$$
where $p_{n,i}$ denotes the predicted probability that the $n$-th video belongs to action category $i$;
in the optimization stage, the error between the ground-truth value and the predicted value is adjusted with a cross-entropy loss function, whose expression is:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} y_{n,i} \log(p_{n,i})$$

where $y_{n,i}$ denotes the ground-truth label of the given input video and $N$ is the number of samples per batch during training.
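For concreteness, a minimal sketch of the classification head and loss follows; the feature width of 2048 and the class count of 101 (e.g., UCF101) are assumptions for illustration:

```python
import torch
import torch.nn as nn

num_classes = 101                     # dataset-dependent assumption
head = nn.Linear(2048, num_classes)   # fully connected layer; width assumed

features = torch.randn(8, 2048)       # pooled spatio-temporal features (batch of 8)
logits = head(features)               # one-dimensional prediction vector I per video
probs = torch.softmax(logits, dim=1)  # p_{n,i} for reporting predictions

labels = torch.randint(0, num_classes, (8,))
loss = nn.CrossEntropyLoss()(logits, labels)  # cross entropy (applied to logits)
```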
The invention has the following beneficial effects: it improves the discriminative power of classification features, achieves efficient extraction of high-quality spatio-temporal visual features, and enhances the classification accuracy and recognition efficiency of the model. Specifically:
1) separable three-dimensional convolution is used to approximate standard three-dimensional convolution, simplifying convolution in the three-dimensional spatio-temporal domain into cascaded convolutions on a two-dimensional spatial plane and a one-dimensional temporal axis, alleviating the optimization difficulties of deep three-dimensional convolutional models;
2) the channel attention module captures the more meaningful channel information components, and the spatial attention module attends to the more salient spatial regions, helping the model flexibly screen key features;
3) the model is trained with a multi-stage training strategy, avoiding overfitting without adding extra regularization operations.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a process of human body action recognition based on a separable three-dimensional residual attention network according to the present invention;
FIG. 2 is a model diagram of a human body action recognition system based on a separable three-dimensional residual attention network according to the present invention;
FIG. 3 is a schematic diagram of a separable three-dimensional convolution;
FIG. 4 is a schematic view of a channel attention module;
fig. 5 is a schematic view of a spatial attention module.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to fig. 1 to 5, the present invention provides a human body action recognition method based on a separable three-dimensional residual error attention network, as shown in fig. 1 and 2, which specifically includes the following steps:
the method comprises the following steps: approximating the standard three-dimensional convolution on the space-time dimension to a cascaded two-dimensional space convolution and a one-dimensional time convolution through a three-dimensional convolution kernel decomposition operation to construct a separable three-dimensional convolution, and then replacing the standard three-dimensional convolution in the 3D ResNet by the separable three-dimensional convolution to construct a Sep-3D ResNet;
step two: designing a channel attention module that generates a modulation weight for each channel to capture the channel-level importance distribution, and a spatial attention module that automatically weighs the neighborhood correlation of each spatial position; stacking the channel attention module and the spatial attention module in sequence and sequentially inferring the channel attention map and spatial attention map of the input features, so as to construct a dual attention mechanism;
step three: sequentially computing the channel attention value and the spatial attention value of the slice tensor at each time step of the intermediate convolutional feature cube, stacking the refined slice tensors according to the original time dimension, embedding this into the separable three-dimensional residual blocks of Sep-3D ResNet, and constructing the final Sep-3D RAN;
step four: introducing the attention modules into Sep-3D ResNet stage by stage, training the sub-network of each stage in turn, and finally performing joint end-to-end training of the whole network, so that the attention layers are fully activated while the model overfitting caused by insufficient training samples is mitigated.
FIG. 3 is a schematic diagram of separable three-dimensional convolution, illustrating how a separable three-dimensional convolution operates on the input features of a given convolutional layer to obtain the corresponding output features.
The separable three-dimensional residual attention network module comprises:
as shown in fig. 3, the separable three-dimensional convolution operation process is: assume that there is N in convolutional layer i i-1 An input feature, N i-1 A feature first with M i Each size is 1 Xh Xw XN i-1 Is convolved, h, w, N i-1 Respectively the height, width and channel dimension of the convolution kernel in two-dimensional space, and then the convolution kernel is compared with N i Each size is t × 1 × 1 × M i Is convolved with a one-dimensional time filter of (a) and (b) i Respectively representing the time scale and channel dimension of a one-dimensional time convolution kernel, where M i The design principle of (2) follows the rule that the decomposed three-dimensional convolution parameter quantity is approximately equal to the standard three-dimensional convolution parameter quantity, and is calculated by the following formula:
Figure BDA0002996871080000071
To build Sep-3D ResNet, 3D ResNet is selected as the reference architecture of the model, and its standard three-dimensional convolutions are replaced with the separable three-dimensional convolution described above. Compared with the original reference model, Sep-3D ResNet doubles the number of nonlinear activation functions while keeping the number of network layers unchanged, making complex functions easier to fit; on the basis of alleviating the optimization difficulties of deep three-dimensional convolutional models, this improves the model's descriptive power and enhances its recognition performance.
Fig. 4 is a schematic diagram of a channel attention mapping inference process, where input features are subjected to global average pooling operation, a shallow multi-layer perceptron, and dimension transformation operation in spatial dimensions to obtain channel attention distribution. Fig. 5 is a schematic diagram of a spatial attention mapping inference process, where a spatial attention distribution is obtained after an input feature is subjected to a global average pooling operation, a two-dimensional convolution operation, and a dimension transformation operation in a channel dimension.
As shown in fig. 4, the input of the dual attention module is first defined. Assume the model input is $F \in \mathbb{R}^{T \times H \times W \times C}$, where $T$, $H$, $W$ denote the time dimension, height and width of the input cube, and $C$ denotes the number of input channels. After one group or a series of separable three-dimensional convolutions, an intermediate feature-mapping cube $F' \in \mathbb{R}^{T' \times H' \times W' \times C'}$ is obtained; the slice tensor at time $t$ is defined as $F_t \in \mathbb{R}^{H' \times W' \times C'}$, where $t = 0, 1, \dots, T'$. The slice tensors are the input features of the subsequent dual attention mechanism.
The dual attention module contains two sub-modules, namely:
(1) The channel attention module. As shown in FIG. 4, since capturing the channel-level importance distribution requires explicitly modeling the dependencies between channels, a global average pooling operation is taken to aggregate the spatial dimensions of the input feature, generating a channel descriptor $F_C \in \mathbb{R}^{1 \times 1 \times C'}$ and thereby avoiding interference from local spatial information; the specific formula is:

$$F_C = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_t(i, j)$$
Subsequently, a gating mechanism similar to a self-attention function is used to obtain the importance distribution over the channels: the channel descriptor $F_C$ is fed into a multilayer perceptron with one hidden layer to produce an unnormalized channel attention map. To limit the number of model parameters, the dimension of the hidden layer is set to $C'/r$, where $r$ is a reduction ratio, typically set to 16. A sigmoid activation function is then used for normalization to obtain the final channel attention map. The channel attention solution process can be summarized as:

$$M_C(F_t) = EP_C(\sigma(\mathrm{MLP}(F_C))) = EP_C(\sigma(W_1\,\delta(W_0 F_C)))$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, $\delta(\cdot)$ denotes the ReLU activation function, $W_0$ and $W_1$ denote the weights of the multilayer perceptron, and $EP_C(\cdot)$ expands the channel attention values along the spatial domain to the original dimensions, i.e., $M_C(F_t) \in \mathbb{R}^{C' \times H' \times W'}$.
To perform automatic feature calibration, the channel attention map is applied to the original input feature; the refined slice tensor is computed as:

$$F_t' = M_C(F_t) \otimes F_t$$

where the symbol $\otimes$ denotes element-wise multiplication.
After feature calibration is performed by using the channel attention module, the model can automatically balance the importance of information components of each channel, so that the sensitivity to information-intensive features is gradually improved.
(2) The spatial attention module. As shown in FIG. 5, similar to the channel attention module, to compute the spatial attention map efficiently, a global average pooling operation is used to aggregate $F_t'$, generating a two-dimensional spatial descriptor $F_S \in \mathbb{R}^{H' \times W' \times 1}$ that summarizes the global channel information of $F_t'$; the specific calculation is:

$$F_S = \frac{1}{C'} \sum_{k=1}^{C'} F_t'(k)$$
Subsequently, to obtain the correlation between the different spatial positions of the feature map $F_t'$ and the target action, the spatial attention distribution is computed with a two-dimensional convolution in place of a multilayer perceptron, namely:

$$M_S(F_t') = EP_S(\sigma(\mathrm{conv}(F_S)))$$

where $\mathrm{conv}(\cdot)$ denotes a two-dimensional convolution operation, whose kernel size is typically set to 7 × 7 for the best recognition performance, and $EP_S(\cdot)$ denotes a dimension-transformation operation along the channel dimension that extends the single channel at each spatial position to the original channel dimension, i.e., $M_S(F_t') \in \mathbb{R}^{C' \times H' \times W'}$.
After deriving the channel attention map and the spatial attention map of the original slice tensor $F_t$, feature calibration is first performed with the channel attention module to obtain the refined slice tensor $F_t'$; feature recalibration is then performed between the spatial attention map $M_S(F_t')$ and $F_t'$ using element-wise multiplication, yielding the attention-weighted slice tensor $F_t''$, thereby identifying the spatially salient regions while distinguishing the information-dense channels, and suppressing redundant background information. The resulting final refined tensor $F_t''$ is computed as:

$$F_t'' = M_S(F_t') \otimes F_t'$$
the three-dimensional residual attention network module can be separated. To achieve the aforementioned expansion of the dual attention mechanism in the time dimension, the inference process of channel attention mapping and spatial attention mapping needs to be applied to the middle layer convolution feature F' ∈ R T '×H'×W'×C' The above dual attention weighting process needs to be repeated for all time dimensions, that is, the slice tensors at all times, and finally, the thinned slice tensors are arranged according to the original time dimensions and stacked into a final thinned feature cube.
The temporally expanded channel attention module and spatial attention module are sequentially embedded into the separable three-dimensional residual blocks of Sep-3D ResNet to obtain separable three-dimensional residual attention blocks (Sep-3D RABs), which flexibly allocate richer attention resources to key features while capturing the abstract semantic information of the input data. Finally, the Sep-3D RAN is built following the model architecture of 3D ResNet, i.e., the plain residual blocks in 3D ResNet are replaced with Sep-3D RABs; a sketch of the resulting network is given below.
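Assembling the pieces sketched earlier, the overall network might take the following shape; the stage widths, the stem configuration and the 1×1×1 projections between stages are assumptions of this sketch (a real 3D ResNet also downsamples between stages, which is omitted here for brevity):

```python
import torch.nn as nn

class Sep3DRAN(nn.Module):
    """Sketch of the overall network: a stem, four Sep-3D RAB stages
    (mirroring the four stages of 3D ResNet), and a classifier head."""
    def __init__(self, num_classes, widths=(64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, widths[0], (3, 7, 7), stride=(1, 2, 2),
                      padding=(1, 3, 3), bias=False),
            nn.BatchNorm3d(widths[0]), nn.ReLU(inplace=True))
        blocks, in_c = [], widths[0]
        for w in widths:
            if w != in_c:                       # 1x1x1 projection between stages
                blocks.append(nn.Conv3d(in_c, w, 1))
            blocks.append(Sep3DRAB(w))          # from the earlier sketch
            in_c = w
        self.res_blocks = nn.Sequential(*blocks)
        self.head = nn.Linear(widths[-1], num_classes)

    def forward(self, x):                       # x: (B, 3, T, H, W)
        f = self.res_blocks(self.stem(x))
        f = f.mean(dim=(2, 3, 4))               # global spatio-temporal pooling
        return self.head(f)                     # prediction vector I
```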
Optionally, module four specifically includes:
a multi-stage training strategy module. The network parameters are first initialized with pre-training weights to speed up the convergence process of the model. Considering that the Sep-3D RAN has four separable three-dimensional residual attention blocks, the training process of the model is divided into four stages. In the first stage, the attention mechanism is embedded in the first residual block only, and then the network layer parameters before the module are fixed, and the subsequent network layer is trained. And in the second stage, continuously embedding an attention mechanism into the second residual block, then initializing the network layer parameters before the current module by using the network weights learned in the first stage, and training the subsequent network layer. The above process is repeated until all four attention modules are embedded in the network. Due to the introduction of the pre-training weight, the model can realize rapid convergence, so the training process is not time-consuming and is easy to realize. Furthermore, the model is end-to-end trainable in all training phases, so the model can directly learn the mapping relationship from the original input to the target output.
To realize this end-to-end training mode, a fully connected layer is used to generate the final one-dimensional prediction vector $I \in \mathbb{R}^{C}$, where $C$ is the total number of action categories in the target dataset; the softmax function is then selected to compute the probability distribution over the categories to which the input video belongs, i.e.:

$$p_{n,i} = \frac{\exp(I_i)}{\sum_{j=1}^{C} \exp(I_j)}$$
where $p_{n,i}$ denotes the predicted probability that the $n$-th video belongs to action category $i$.
In the optimization stage, the error between the ground-truth value and the predicted value is adjusted with a cross-entropy loss function, calculated as:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} y_{n,i} \log(p_{n,i})$$

where $y_{n,i}$ denotes the ground-truth label corresponding to the given input video and $N$ is the number of samples per batch during training.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A human body action recognition method based on a separable three-dimensional residual attention network is characterized by specifically comprising the following steps:
s1: constructing separable three-dimensional convolution, and replacing standard three-dimensional convolution in the 3D ResNet by utilizing the separable three-dimensional convolution so as to build Sep-3D ResNet; wherein Sep-3D ResNet is a separable three-dimensional residual error network;
s2: designing a channel attention module to capture channel level importance distributions, designing a spatial attention module to automatically weigh importance of each spatial location, and then stacking two attention modules in sequence to construct a dual attention mechanism;
designing a channel attention module, which specifically comprises: aggregating the spatial dimensions of the input features by a global average pooling operation to generate a channel descriptor $F_C \in \mathbb{R}^{1 \times 1 \times C'}$, expressed as:

$$F_C = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_t(i, j)$$
where $F_t \in \mathbb{R}^{H' \times W' \times C'}$ denotes the slice tensor at time $t$, with $t = 0, 1, \dots, T'$, and $T'$, $H'$, $W'$, $C'$ respectively denote the time dimension, height, width and number of channels of the intermediate feature-mapping cube obtained after the input cube passes through one group or a series of separable three-dimensional convolutions;
subsequently, a gating mechanism similar to a self-attention function is used to obtain the importance distribution over the channels, i.e., the channel descriptor $F_C$ is fed into a multilayer perceptron with one hidden layer to produce an unnormalized channel attention map; to limit the number of model parameters, the dimension of the hidden layer is set to $C'/r$, where $r$ is a reduction ratio; a sigmoid activation function then performs the normalization, giving the final channel attention map; the channel attention is computed as:

$$M_C(F_t) = EP_C(\sigma(\mathrm{MLP}(F_C))) = EP_C(\sigma(W_1\,\delta(W_0 F_C)))$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, $\delta(\cdot)$ denotes the ReLU activation function, $W_0$ and $W_1$ denote the weights of the multilayer perceptron, and $EP_C(\cdot)$ expands the channel attention values along the spatial domain to the original dimensions, i.e., $M_C(F_t) \in \mathbb{R}^{C' \times H' \times W'}$;
to perform automatic feature calibration, the channel attention map is applied to the original input features; the refined slice tensor is computed as:

$$F_t' = M_C(F_t) \otimes F_t$$

where the symbol $\otimes$ denotes element-wise multiplication;
designing a spatial attention module, which specifically comprises: aggregating $F_t'$ with a global average pooling operation to generate a two-dimensional spatial descriptor $F_S \in \mathbb{R}^{H' \times W' \times 1}$ that summarizes the global channel information of $F_t'$, expressed as:

$$F_S = \frac{1}{C'} \sum_{k=1}^{C'} F_t'(k)$$
then computing the spatial attention distribution with a two-dimensional convolution in place of a multilayer perceptron, namely:

$$M_S(F_t') = EP_S(\sigma(\mathrm{conv}(F_S)))$$

where $\mathrm{conv}(\cdot)$ denotes a two-dimensional convolution operation and $EP_S(\cdot)$ denotes a dimension-transformation operation along the channel dimension;
after deriving the channel attention map and the spatial attention map of the original slice tensor $F_t$, first performing feature calibration with the channel attention module to obtain the refined slice tensor $F_t'$, and then performing feature recalibration between the spatial attention map $M_S(F_t')$ and $F_t'$ using element-wise multiplication, yielding the attention-weighted slice tensor $F_t''$, which distinguishes the information-dense channels while identifying the spatially salient regions and suppresses redundant background information; the resulting final refined tensor $F_t''$ is computed as:

$$F_t'' = M_S(F_t') \otimes F_t'$$
s3: performing double attention weighting on middle-layer convolution characteristics at different moments, expanding a double attention module in a time dimension, embedding the double attention module into a separable three-dimensional residual block of Sep-3D ResNet, and constructing to form a Sep-3D RAN model; wherein, the Sep-3D RAN is a separable three-dimensional residual error attention network;
s4: performing joint end-to-end training on the Sep-3D RAN model by using a multi-stage training strategy, which specifically comprises the following steps: generating a final one-dimensional prediction vector I e R by using a full-connection layer C C refers to the total number of action categories of the target dataset, and then selects the softmax function to calculate the probability distribution of the category to which the input video belongs, i.e.:
Figure FDA0003775734310000022
where $p_{n,i}$ denotes the predicted probability that the $n$-th video belongs to action category $i$;
in the optimization stage, adjusting the error between the ground-truth value and the predicted value with a cross-entropy loss function, whose expression is:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} y_{n,i} \log(p_{n,i})$$

where $y_{n,i}$ denotes the ground-truth label of the given input video and $N$ is the number of samples per batch during training.
2. The human motion recognition method of claim 1, wherein in step S1, the constructing of the separable three-dimensional convolution is to approximate a standard three-dimensional convolution in a space-time dimension to a two-dimensional convolution in a space dimension and a one-dimensional convolution in a time dimension by a three-dimensional convolution kernel decomposition operation to construct the separable three-dimensional convolution.
3. The human motion recognition method according to claim 1 or 2, wherein in step S1, constructing the separable three-dimensional convolution specifically comprises: assuming convolutional layer $i$ receives $N_{i-1}$ input features, the $N_{i-1}$ features are first convolved with $M_i$ two-dimensional spatial filters of size $1 \times h \times w \times N_{i-1}$, where $h$, $w$ and $N_{i-1}$ are the height, width and channel dimension of the two-dimensional spatial convolution kernel; they are then convolved with $N_i$ one-dimensional temporal filters of size $t \times 1 \times 1 \times M_i$, where $t$ and $M_i$ respectively denote the temporal scale and channel dimension of the one-dimensional temporal convolution kernel.
4. The human motion recognition method of claim 3, wherein the design of $M_i$ follows the rule that the parameter count of the decomposed three-dimensional convolution approximately equals that of the standard three-dimensional convolution, calculated by:

$$M_i = \left\lfloor \frac{t\,h\,w\,N_{i-1}\,N_i}{h\,w\,N_{i-1} + t\,N_i} \right\rfloor$$
5. the human motion recognition method of claim 1, wherein in step S3, the building of the Sep-3D RAN model specifically comprises: repeating the double attention weighting process on the slice tensors at each moment, and finally arranging and stacking the thinned slice tensors according to the original time dimension to form a final thinned feature cube;
sequentially embedding the temporally expanded channel attention module and spatial attention module into the separable three-dimensional residual blocks of Sep-3D ResNet to obtain separable three-dimensional residual attention blocks; and finally building the Sep-3D RAN, i.e., the separable 3D residual attention network, following the model architecture of 3D ResNet, namely replacing the plain residual blocks in 3D ResNet with separable three-dimensional residual attention blocks.
6. The human motion recognition method of claim 1, wherein in step S4, performing joint end-to-end training on the Sep-3D RAN model by using a multi-stage training strategy specifically comprises: firstly, initializing network parameters by using pre-training weights to accelerate the convergence process of a model; considering that the Sep-3D RAN has four separable three-dimensional residual attention blocks, the training process of the model is divided into four stages; in the first stage, an attention mechanism is only embedded into a first residual block, and then the parameters of the network layer before the module are fixed, and the subsequent network layer is trained; in the second stage, continuously embedding an attention mechanism into the second residual block, then initializing the network layer parameters before the current module by using the network weights learned in the first stage, and training the subsequent network layer; the above process is repeated until all the residual blocks have embedded the attention mechanism.
CN202110334547.5A 2021-03-29 2021-03-29 Human body action recognition method based on separable three-dimensional residual error attention network Active CN113065450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110334547.5A CN113065450B (en) 2021-03-29 2021-03-29 Human body action recognition method based on separable three-dimensional residual error attention network

Publications (2)

Publication Number Publication Date
CN113065450A CN113065450A (en) 2021-07-02
CN113065450B true CN113065450B (en) 2022-09-20

Family

ID=76564513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110334547.5A Active CN113065450B (en) 2021-03-29 2021-03-29 Human body action recognition method based on separable three-dimensional residual error attention network

Country Status (1)

Country Link
CN (1) CN113065450B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255616B (en) * 2021-07-07 2021-09-21 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN113887419B (en) * 2021-09-30 2023-05-12 四川大学 Human behavior recognition method and system based on extracted video space-time information
CN114550162B (en) * 2022-02-16 2024-04-02 北京工业大学 Three-dimensional object recognition method combining view importance network and self-attention mechanism
CN117575915A (en) * 2024-01-16 2024-02-20 闽南师范大学 Image super-resolution reconstruction method, terminal equipment and storage medium
CN117831301B (en) * 2024-03-05 2024-05-07 西南林业大学 Traffic flow prediction method combining three-dimensional residual convolution neural network and space-time attention mechanism

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190113119A (en) * 2018-03-27 2019-10-08 삼성전자주식회사 Method of calculating attention for convolutional neural network
US11361225B2 (en) * 2018-12-18 2022-06-14 Microsoft Technology Licensing, Llc Neural network architecture for attention based efficient model adaptation
CN109871777B (en) * 2019-01-23 2021-10-01 广州智慧城市发展研究院 Behavior recognition system based on attention mechanism
CN111415342B (en) * 2020-03-18 2023-12-26 北京工业大学 Automatic detection method for pulmonary nodule images of three-dimensional convolutional neural network by fusing attention mechanisms
CN111428699B (en) * 2020-06-10 2020-09-22 南京理工大学 Driving fatigue detection method and system combining pseudo-3D convolutional neural network and attention mechanism
CN112288041B (en) * 2020-12-15 2021-03-30 之江实验室 Feature fusion method of multi-mode deep neural network

Also Published As

Publication number Publication date
CN113065450A (en) 2021-07-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant