CN113065450B - Human body action recognition method based on separable three-dimensional residual error attention network - Google Patents

Human body action recognition method based on separable three-dimensional residual error attention network Download PDF

Info

Publication number
CN113065450B
Authority
CN
China
Prior art keywords
attention
dimensional
channel
separable
sep
Prior art date
Legal status
Active
Application number
CN202110334547.5A
Other languages
Chinese (zh)
Other versions
CN113065450A (en)
Inventor
张祖凡
彭月
甘臣权
张家波
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202110334547.5A priority Critical patent/CN113065450B/en
Publication of CN113065450A publication Critical patent/CN113065450A/en
Application granted granted Critical
Publication of CN113065450B publication Critical patent/CN113065450B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human body action recognition method based on a separable three-dimensional residual attention network, and belongs to the field of computer vision. The method comprises the following steps: S1: replacing the standard three-dimensional convolutions in 3D ResNet with separable three-dimensional convolutions to build Sep-3D ResNet; S2: designing a channel attention module and a spatial attention module, then stacking them in sequence to construct a dual attention mechanism; S3: applying dual attention weighting to the intermediate convolutional features at different moments, expanding the dual attention module in the time dimension, embedding it into the separable three-dimensional residual blocks of Sep-3D ResNet to form Sep-3D RABs, and thereby constructing a Sep-3D RAN; S4: performing joint end-to-end training on the Sep-3D RAN with a multi-stage training strategy. The invention improves the discriminative power of classification features, achieves efficient extraction of high-quality spatio-temporal visual features, and enhances the classification accuracy and recognition efficiency of the model.

Description

Human body action recognition method based on separable three-dimensional residual error attention network
Technical Field
The invention belongs to the field of computer vision, and relates to a human body action recognition method based on a separable three-dimensional residual attention network.
Background
Videos conceal enormous amounts of information, and the huge user base and rapidly growing scale of the online video market pose great challenges to the management, storage and identification of network videos, so network video services are receiving increasing attention from all parties. In human-centered computer vision research, human action recognition is an important research direction owing to its wide application in fields such as human-computer interaction, smart homes, autonomous driving and virtual reality. Its main task is to automatically recognize human actions in an image sequence or video: analyzing the image sequence, parsing human movement patterns, and establishing a mapping between video content and action categories, so as to mine the deep-level information contained in the video, learn and analyze the human actions and behaviors it records, and ultimately understand the video content. Accurately recognizing human actions in videos helps internet platforms classify and manage massive volumes of related video data in a unified way, and contributes to building a harmonious network environment. Moreover, progress in human action recognition technology has matured video anomaly monitoring services: it can help public security personnel quickly anticipate crises in public places, and can promptly detect abnormal behaviors in daily home life (such as fainting or falling) so that medical help is sought in time. Accurately recognizing human actions in videos therefore has substantial academic and practical value.
Traditional action recognition algorithms depend on hand-crafted features, which often must be designed specifically for each task; the performance of such algorithms is tied to the database, the complexity of the processing pipeline grows across different datasets, and their generalization ability and universality are poor. Moreover, in the current era of information explosion, image and video data are growing exponentially, and researchers increasingly prefer non-manual methods that extract more general feature representations, so action recognition methods based on hand-crafted features can no longer meet task requirements.
Deep learning benefits from a hierarchical training paradigm: through a layer-by-layer progressive feature extraction mechanism, high-dimensional features are automatically extracted from raw video data and the contextual semantic information of the video is fully captured, which improves the descriptive power of deep models, facilitates final recognition and judgment, and has therefore been widely applied in action recognition. In recent years, the main deep learning techniques applied to human action recognition include 2D CNNs, 3D CNNs and attention mechanisms. 2D CNNs effectively capture the spatial neighborhood correlations of RGB video frames; 3D CNNs capture visual features in the spatial and temporal dimensions simultaneously; and attention mechanisms enable flexible screening of key features, improving a model's recognition performance. Although 2D CNNs have lower complexity and fewer parameters, they lack temporal-stream information and thus extract dynamic features poorly; although 3D CNNs can fuse spatio-temporal features directly on the raw input, they greatly increase the number of model parameters, which hinders optimization. In addition, many redundant features arise during feature extraction and interfere with the model's recognition results.
Therefore, a method for improving video recognition performance is needed.
Disclosure of Invention
In view of the above, the present invention provides a human body action recognition method based on a separable three-dimensional residual attention network, which adopts a principled kernel decomposition to alleviate the optimization difficulties of deep three-dimensional convolutional models, and combines an attention mechanism to improve the flexibility of key-feature screening, thereby producing higher-quality spatio-temporal visual features and improving the recognition performance of the model.
In order to achieve the purpose, the invention provides the following technical scheme:
a human body action recognition method based on a separable three-dimensional residual attention network specifically comprises the following steps:
s1: constructing Separable three-dimensional convolution, and replacing standard three-dimensional convolution in a traditional three-dimensional residual network (3D residual network,3D ResNet) by utilizing the Separable three-dimensional convolution so as to build a Separable 3D residual network (Sep-3D ResNet) to relieve the phenomenon of difficult optimization of a deep three-dimensional convolution model;
s2: designing a channel attention module to capture channel level importance distributions, designing a spatial attention module to automatically weigh importance of each spatial location, and then stacking two attention modules in sequence to construct a dual attention mechanism;
s3: carrying out double attention weighting on middle-layer convolution characteristics at different moments, expanding a double attention module in a time dimension, embedding the double attention module into a Separable three-dimensional residual block of Sep-3D ResNet, and constructing to form a Separable 3D residual attention network (Sep-3D RAN) model;
s4: and performing combined end-to-end training on the Sep-3D RAN model by using a multi-stage training strategy so as to relieve the overfitting effect of the model caused by insufficient training sample volume and improve the generalization capability of the model.
Further, in step S1, the separable three-dimensional convolution is constructed as follows: the standard three-dimensional convolution over the spatio-temporal dimensions is approximated, via a three-dimensional convolution kernel decomposition, by a two-dimensional convolution in the spatial dimensions followed by a one-dimensional convolution in the temporal dimension.
The separable three-dimensional convolution operates as follows: assume convolutional layer $i$ receives $N_{i-1}$ input features. The $N_{i-1}$ features are first convolved with $M_i$ two-dimensional spatial filters of size $1 \times h \times w \times N_{i-1}$, where $h$, $w$ and $N_{i-1}$ denote the height, width and channel dimension of the two-dimensional spatial convolution kernel; the result is then convolved with $N_i$ one-dimensional temporal filters of size $t \times 1 \times 1 \times M_i$, where $t$ and $M_i$ respectively denote the temporal scale and channel dimension of the one-dimensional temporal convolution kernel. The design of $M_i$ follows the rule that the parameter count of the decomposed three-dimensional convolution approximately equals that of the standard three-dimensional convolution, and is calculated by:

$$M_i = \left\lfloor \frac{t\,h\,w\,N_{i-1}\,N_i}{h\,w\,N_{i-1} + t\,N_i} \right\rfloor$$
To build Sep-3D ResNet, 3D ResNet is selected as the reference architecture of the model, and its standard three-dimensional convolutions are replaced with the separable three-dimensional convolution described above. Compared with the original reference model, Sep-3D ResNet doubles the number of nonlinear activation functions while keeping the number of network layers unchanged, making complex functions easier to fit; on the basis of alleviating the optimization difficulties of deep three-dimensional convolutional models, this improves the model's descriptive power and enhances its recognition performance.
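As a non-limiting illustration, the following PyTorch sketch shows one way such a separable (2+1)D convolution can be implemented, with the intermediate width $M_i$ computed by the formula above. The class name, the BatchNorm placement and the default kernel sizes are assumptions of this sketch, not details specified by the patent:

```python
import torch.nn as nn

class Sep3DConv(nn.Module):
    """Factorized 3D convolution: a 1 x h x w spatial convolution
    followed by a t x 1 x 1 temporal convolution, with an extra
    nonlinearity in between (which doubles the activation count)."""
    def __init__(self, in_channels, out_channels, t=3, h=3, w=3):
        super().__init__()
        # Intermediate width M_i chosen so the factorized parameter
        # count roughly matches the standard t x h x w 3D convolution.
        m = (t * h * w * in_channels * out_channels) // (
            h * w * in_channels + t * out_channels)
        self.spatial = nn.Conv3d(in_channels, m, kernel_size=(1, h, w),
                                 padding=(0, h // 2, w // 2), bias=False)
        self.bn1 = nn.BatchNorm3d(m)
        self.temporal = nn.Conv3d(m, out_channels, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                             # x: (B, C, T, H, W)
        x = self.relu(self.bn1(self.spatial(x)))      # 2D spatial stage
        return self.relu(self.bn2(self.temporal(xch)) if False else
                         self.bn2(self.temporal(x))).relu() if False else \
               self.relu(self.bn2(self.temporal(x)))  # 1D temporal stage
```

For example, with $t = h = w = 3$ and $N_{i-1} = N_i = 64$, the formula gives $M_i = \lfloor 110592 / 768 \rfloor = 144$, so the factorized kernel uses $9 \cdot 64 \cdot 144 + 3 \cdot 144 \cdot 64 = 110592$ parameters, exactly matching the $3 \cdot 3 \cdot 3 \cdot 64 \cdot 64$ parameters of the standard kernel.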
Further, in step S2, the input of the dual attention mechanism is first defined. Assume the model input is $F \in \mathbb{R}^{T \times H \times W \times C}$, where $T$, $H$, $W$ denote the time dimension, height and width of the input cube, and $C$ denotes the number of input channels. After one group or a series of separable three-dimensional convolutions, an intermediate feature-mapping cube $F' \in \mathbb{R}^{T' \times H' \times W' \times C'}$ is obtained; the slice tensor at time $t$ is defined as $F_t \in \mathbb{R}^{H' \times W' \times C'}$, where $t = 0, 1, \dots, T'$. The slice tensors are the input features of the subsequent dual attention mechanism.
Introduction of a dual attention mechanism:
(1) designing a channel attention module, which specifically comprises: since capturing the channel-level importance distribution requires explicitly modeling the dependencies between channels, a global average pooling operation is adopted to aggregate the spatial dimensions of the input feature, generating a channel descriptor $F_C \in \mathbb{R}^{1 \times 1 \times C'}$ and thereby avoiding interference from local spatial information; the expression is:

$$F_C = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_t(i, j)$$
where $F_t \in \mathbb{R}^{H' \times W' \times C'}$ denotes the slice tensor at time $t$, with $t = 0, 1, \dots, T'$, and $T'$, $H'$, $W'$, $C'$ respectively denote the time dimension, height, width and number of channels of the intermediate feature-mapping cube obtained after the input cube passes through one group or a series of separable three-dimensional convolutions;
subsequently, a gating mechanism similar to a self-attention function is used to obtain the importance distribution over the channels: the channel descriptor $F_C$ is fed into a multilayer perceptron with one hidden layer to produce an unnormalized channel attention map; to limit the number of model parameters, the dimension of the hidden layer is set to $C'/r$, where $r$ is a reduction ratio, usually set to 16; a sigmoid activation function then performs the normalization, giving the final channel attention map; the channel attention is computed as:

$$M_C(F_t) = EP_C(\sigma(\mathrm{MLP}(F_C))) = EP_C(\sigma(W_1\,\delta(W_0 F_C)))$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, $\delta(\cdot)$ denotes the ReLU activation function, $W_0$ and $W_1$ denote the weights of the multilayer perceptron, and $EP_C(\cdot)$ expands the channel attention values along the spatial domain to the original dimensions, i.e., $M_C(F_t) \in \mathbb{R}^{C' \times H' \times W'}$.
To perform automatic feature calibration, the channel attention map is applied to the original input feature; the refined slice tensor is computed as:

$$F_t' = M_C(F_t) \otimes F_t$$

where the symbol $\otimes$ denotes element-wise multiplication.
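A minimal PyTorch sketch of this channel attention module, operating on one slice tensor, might look as follows; the class name and the use of nn.Linear layers for the perceptron are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention over one slice tensor F_t of shape (B, C', H', W')."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                        # MLP, one hidden layer
            nn.Linear(channels, channels // reduction),  # W0, hidden dim C'/r
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels))  # W1

    def forward(self, f_t):
        b, c, _, _ = f_t.shape
        f_c = f_t.mean(dim=(2, 3))                 # global average pooling -> F_C
        m_c = torch.sigmoid(self.mlp(f_c))         # sigma: normalized attention
        m_c = m_c.view(b, c, 1, 1).expand_as(f_t)  # EP_C: expand over space
        return m_c * f_t                           # refined slice tensor F_t'
```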
(2) designing a spatial attention module, which specifically comprises: similar to the channel attention module, to compute the spatial attention map efficiently, a global average pooling operation is used to aggregate $F_t'$, generating a two-dimensional spatial descriptor $F_S \in \mathbb{R}^{H' \times W' \times 1}$ that summarizes the global channel information of $F_t'$; the specific expression is:

$$F_S = \frac{1}{C'} \sum_{k=1}^{C'} F_t'(k)$$
subsequently, to obtain the correlation between the different spatial positions of the feature map $F_t'$ and the target action, the spatial attention distribution is computed with a two-dimensional convolution in place of a multilayer perceptron, namely:

$$M_S(F_t') = EP_S(\sigma(\mathrm{conv}(F_S)))$$

where $\mathrm{conv}(\cdot)$ denotes a two-dimensional convolution operation, whose kernel size is typically set to 7 × 7 for the best recognition performance, and $EP_S(\cdot)$ denotes a dimension-transformation operation along the channel dimension that extends the single channel at each spatial position to the original channel dimension, i.e., $M_S(F_t') \in \mathbb{R}^{C' \times H' \times W'}$.
After deriving the channel attention map and the spatial attention map of the original slice tensor $F_t$, feature calibration is first performed with the channel attention module to obtain the refined slice tensor $F_t'$; feature recalibration is then performed between the spatial attention map $M_S(F_t')$ and $F_t'$ using element-wise multiplication, yielding the attention-weighted slice tensor $F_t''$. This distinguishes the information-dense channels while identifying the spatially salient regions, and suppresses redundant background information. The resulting final refined tensor $F_t''$ is computed as:

$$F_t'' = M_S(F_t') \otimes F_t'$$
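Under the same assumptions, the spatial attention module and the sequential composition can be sketched as:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention over a refined slice tensor F_t' of shape (B, C', H', W')."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f_t_ref):
        f_s = f_t_ref.mean(dim=1, keepdim=True)  # GAP over channels -> F_S
        m_s = torch.sigmoid(self.conv(f_s))      # 2D conv + sigma
        return m_s.expand_as(f_t_ref) * f_t_ref  # EP_S + recalibration -> F_t''
```

Stacking the two modules in sequence, channel first and then spatial, realizes the dual attention weighting $F_t'' = M_S(F_t') \otimes (M_C(F_t) \otimes F_t)$ described above.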
further, in step S3, the building of the Sep-3D RAN model specifically includes: to achieve the aforementioned expansion of the dual attention mechanism in the time dimension, the inference process of channel attention mapping and spatial attention mapping needs to be applied to the middle layer convolution feature F' ∈ R T'×H'×W'×C' The double attention weighting process is repeated on all time dimensions, namely the slice tensors at all moments, and finally, the thinned slice tensors are arranged according to the original time dimension and stacked into a final thinned feature cube;
by embedding the channel attention module and the space attention module which are subjected to time domain expansion in sequence in the Separable three-dimensional residual block of the Sep-3D ResNet, a Separable 3D residual attention block (Sep-3D RAB) is obtained, so that richer attention resources are flexibly allocated to key features while abstract semantic information of input data is captured; and finally, building a Sep-3D RAN according to a model architecture of 3D ResNet, namely replacing a simple residual block in the 3D ResNet with a Sep-3D RAB.
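Building on the sketches above, one plausible form of a Sep-3D RAB, with the dual attention expanded along the time dimension, is the following; the number of Sep-3D convolutions per block and the placement of the residual connection are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class Sep3DRAB(nn.Module):
    """Separable 3D residual attention block: separable 3D convolutions
    plus dual attention applied independently to every temporal slice."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = Sep3DConv(channels, channels)   # from the earlier sketch
        self.conv2 = Sep3DConv(channels, channels)
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x):                    # x: (B, C', T', H', W')
        out = self.conv2(self.conv1(x))
        slices = []
        for t in range(out.size(2)):         # expand attention along time
            f_t = out[:, :, t]               # slice tensor F_t: (B, C', H', W')
            f_t = self.channel_att(f_t)      # F_t'  (channel calibration)
            f_t = self.spatial_att(f_t)      # F_t'' (spatial recalibration)
            slices.append(f_t)
        out = torch.stack(slices, dim=2)     # restack in original time order
        return out + x                       # residual connection
```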
Further, in step S4, the Sep-3D RAN model is trained jointly end-to-end with a multi-stage training strategy, specifically: the network parameters are first initialized with pre-training weights to accelerate convergence; since Sep-3D RAN contains four separable three-dimensional residual attention blocks, the training process is divided into four stages. In the first stage, the attention mechanism is embedded only into the first residual block; the network-layer parameters before that module are then frozen and the subsequent layers are trained. In the second stage, the attention mechanism is additionally embedded into the second residual block; the network layers before the current module are initialized with the weights learned in the first stage, and the subsequent layers are trained. This process is repeated until all residual blocks have the attention mechanism embedded. Owing to the pre-training weights, the model converges quickly, so the training process is not time-consuming and is easy to implement. Furthermore, the model remains end-to-end trainable in every training stage, so it can directly learn the mapping from raw input to target output.
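A schematic sketch of one stage of this strategy is given below; the `res_blocks` attribute, the optimizer choice and the hyperparameters are illustrative assumptions, not values prescribed by the patent:

```python
import torch.nn as nn
import torch.optim as optim

def train_stage(model, loader, stage, epochs=10, lr=1e-3):
    """One stage of the multi-stage strategy: attention has been embedded
    up to residual block `stage`; earlier blocks keep the weights learned
    in the previous stage and are frozen, later layers are trained."""
    for i, block in enumerate(model.res_blocks):  # hypothetical attribute
        for p in block.parameters():
            p.requires_grad = (i >= stage)        # freeze blocks before `stage`
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = optim.SGD(trainable, lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()             # softmax + cross entropy
    for _ in range(epochs):
        for clips, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()

# Four stages, one per Sep-3D RAB; each stage starts from the previous weights.
# for stage in range(4):
#     train_stage(sep3d_ran, train_loader, stage)
```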
To realize this end-to-end training mode, a fully connected layer is used to generate the final one-dimensional prediction vector $I \in \mathbb{R}^{C}$, where $C$ is the total number of action categories in the target dataset; the softmax function is then selected to compute the probability distribution over the categories to which the input video belongs, i.e.:

$$p_{n,i} = \frac{\exp(I_i)}{\sum_{j=1}^{C} \exp(I_j)}$$
where $p_{n,i}$ denotes the predicted probability that the $n$-th video belongs to action category $i$;
in the optimization stage, the error between the ground-truth value and the predicted value is adjusted with a cross-entropy loss function, whose expression is:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} y_{n,i} \log(p_{n,i})$$

where $y_{n,i}$ denotes the ground-truth label of the given input video and $N$ is the number of samples per batch during training.
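For concreteness, a minimal sketch of the classification head and loss follows; the feature width of 2048 and the class count of 101 (e.g., UCF101) are assumptions for illustration:

```python
import torch
import torch.nn as nn

num_classes = 101                     # dataset-dependent assumption
head = nn.Linear(2048, num_classes)   # fully connected layer; width assumed

features = torch.randn(8, 2048)       # pooled spatio-temporal features (batch of 8)
logits = head(features)               # one-dimensional prediction vector I per video
probs = torch.softmax(logits, dim=1)  # p_{n,i} for reporting predictions

labels = torch.randint(0, num_classes, (8,))
loss = nn.CrossEntropyLoss()(logits, labels)  # cross entropy (applied to logits)
```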
The invention has the following beneficial effects: it improves the discriminative power of classification features, achieves efficient extraction of high-quality spatio-temporal visual features, and enhances the classification accuracy and recognition efficiency of the model. Specifically:
1) separable three-dimensional convolution is used to approximate standard three-dimensional convolution, simplifying convolution in the three-dimensional spatio-temporal domain into cascaded convolutions on a two-dimensional spatial plane and a one-dimensional temporal axis, alleviating the optimization difficulties of deep three-dimensional convolutional models;
2) the channel attention module captures the more meaningful channel information components, and the spatial attention module attends to the more salient spatial regions, helping the model flexibly screen key features;
3) the model is trained with a multi-stage training strategy, avoiding overfitting without adding extra regularization operations.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a process of human body action recognition based on a separable three-dimensional residual attention network according to the present invention;
FIG. 2 is a model diagram of a human body action recognition system based on a separable three-dimensional residual attention network according to the present invention;
FIG. 3 is a schematic diagram of a separable three-dimensional convolution;
FIG. 4 is a schematic view of a channel attention module;
fig. 5 is a schematic view of a spatial attention module.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to fig. 1 to 5, the present invention provides a human body action recognition method based on a separable three-dimensional residual error attention network, as shown in fig. 1 and 2, which specifically includes the following steps:
the method comprises the following steps: approximating the standard three-dimensional convolution on the space-time dimension to a cascaded two-dimensional space convolution and a one-dimensional time convolution through a three-dimensional convolution kernel decomposition operation to construct a separable three-dimensional convolution, and then replacing the standard three-dimensional convolution in the 3D ResNet by the separable three-dimensional convolution to construct a Sep-3D ResNet;
step two: designing a channel attention module that generates a modulation weight for each channel to capture the channel-level importance distribution, and a spatial attention module that automatically weighs the neighborhood correlation of each spatial position; stacking the channel attention module and the spatial attention module in sequence and sequentially inferring the channel attention map and spatial attention map of the input features, so as to construct a dual attention mechanism;
step three: sequentially computing the channel attention value and the spatial attention value of the slice tensor at each time step of the intermediate convolutional feature cube, stacking the refined slice tensors according to the original time dimension, embedding this into the separable three-dimensional residual blocks of Sep-3D ResNet, and constructing the final Sep-3D RAN;
step four: introducing the attention modules into Sep-3D ResNet stage by stage, training the sub-network of each stage in turn, and finally performing joint end-to-end training of the whole network, so that the attention layers are fully activated while the model overfitting caused by insufficient training samples is mitigated.
FIG. 3 is a schematic diagram of separable three-dimensional convolution, illustrating how a separable three-dimensional convolution operates on the input features of a given convolutional layer to obtain the corresponding output features.
The separable three-dimensional residual attention network module comprises:
as shown in fig. 3, the separable three-dimensional convolution operation process is: assume that there is N in convolutional layer i i-1 An input feature, N i-1 A feature first with M i Each size is 1 Xh Xw XN i-1 Is convolved, h, w, N i-1 Respectively the height, width and channel dimension of the convolution kernel in two-dimensional space, and then the convolution kernel is compared with N i Each size is t × 1 × 1 × M i Is convolved with a one-dimensional time filter of (a) and (b) i Respectively representing the time scale and channel dimension of a one-dimensional time convolution kernel, where M i The design principle of (2) follows the rule that the decomposed three-dimensional convolution parameter quantity is approximately equal to the standard three-dimensional convolution parameter quantity, and is calculated by the following formula:
Figure BDA0002996871080000071
To build Sep-3D ResNet, 3D ResNet is selected as the reference architecture of the model, and its standard three-dimensional convolutions are replaced with the separable three-dimensional convolution described above. Compared with the original reference model, Sep-3D ResNet doubles the number of nonlinear activation functions while keeping the number of network layers unchanged, making complex functions easier to fit; on the basis of alleviating the optimization difficulties of deep three-dimensional convolutional models, this improves the model's descriptive power and enhances its recognition performance.
Fig. 4 is a schematic diagram of a channel attention mapping inference process, where input features are subjected to global average pooling operation, a shallow multi-layer perceptron, and dimension transformation operation in spatial dimensions to obtain channel attention distribution. Fig. 5 is a schematic diagram of a spatial attention mapping inference process, where a spatial attention distribution is obtained after an input feature is subjected to a global average pooling operation, a two-dimensional convolution operation, and a dimension transformation operation in a channel dimension.
As shown in fig. 4, the input of the dual attention module is first defined. Assume the model input is $F \in \mathbb{R}^{T \times H \times W \times C}$, where $T$, $H$, $W$ denote the time dimension, height and width of the input cube, and $C$ denotes the number of input channels. After one group or a series of separable three-dimensional convolutions, an intermediate feature-mapping cube $F' \in \mathbb{R}^{T' \times H' \times W' \times C'}$ is obtained; the slice tensor at time $t$ is defined as $F_t \in \mathbb{R}^{H' \times W' \times C'}$, where $t = 0, 1, \dots, T'$. The slice tensors are the input features of the subsequent dual attention mechanism.
The dual attention module contains two sub-modules, namely:
(1) The channel attention module. As shown in FIG. 4, since capturing the channel-level importance distribution requires explicitly modeling the dependencies between channels, a global average pooling operation is taken to aggregate the spatial dimensions of the input feature, generating a channel descriptor $F_C \in \mathbb{R}^{1 \times 1 \times C'}$ and thereby avoiding interference from local spatial information; the specific formula is:

$$F_C = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_t(i, j)$$
Subsequently, a gating mechanism similar to a self-attention function is used to obtain the importance distribution over the channels: the channel descriptor $F_C$ is fed into a multilayer perceptron with one hidden layer to produce an unnormalized channel attention map. To limit the number of model parameters, the dimension of the hidden layer is set to $C'/r$, where $r$ is a reduction ratio, typically set to 16. A sigmoid activation function is then used for normalization to obtain the final channel attention map. The channel attention solution process can be summarized as:

$$M_C(F_t) = EP_C(\sigma(\mathrm{MLP}(F_C))) = EP_C(\sigma(W_1\,\delta(W_0 F_C)))$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, $\delta(\cdot)$ denotes the ReLU activation function, $W_0$ and $W_1$ denote the weights of the multilayer perceptron, and $EP_C(\cdot)$ expands the channel attention values along the spatial domain to the original dimensions, i.e., $M_C(F_t) \in \mathbb{R}^{C' \times H' \times W'}$.
To perform automatic feature calibration, the channel attention map is applied to the original input feature; the refined slice tensor is computed as:

$$F_t' = M_C(F_t) \otimes F_t$$

where the symbol $\otimes$ denotes element-wise multiplication.
After feature calibration is performed by using the channel attention module, the model can automatically balance the importance of information components of each channel, so that the sensitivity to information-intensive features is gradually improved.
(2) The spatial attention module. As shown in FIG. 5, similar to the channel attention module, to compute the spatial attention map efficiently, a global average pooling operation is used to aggregate $F_t'$, generating a two-dimensional spatial descriptor $F_S \in \mathbb{R}^{H' \times W' \times 1}$ that summarizes the global channel information of $F_t'$; the specific calculation is:

$$F_S = \frac{1}{C'} \sum_{k=1}^{C'} F_t'(k)$$
Subsequently, to obtain the correlation between the different spatial positions of the feature map $F_t'$ and the target action, the spatial attention distribution is computed with a two-dimensional convolution in place of a multilayer perceptron, namely:

$$M_S(F_t') = EP_S(\sigma(\mathrm{conv}(F_S)))$$

where $\mathrm{conv}(\cdot)$ denotes a two-dimensional convolution operation, whose kernel size is typically set to 7 × 7 for the best recognition performance, and $EP_S(\cdot)$ denotes a dimension-transformation operation along the channel dimension that extends the single channel at each spatial position to the original channel dimension, i.e., $M_S(F_t') \in \mathbb{R}^{C' \times H' \times W'}$.
After deriving the channel attention map and the spatial attention map of the original slice tensor $F_t$, feature calibration is first performed with the channel attention module to obtain the refined slice tensor $F_t'$; feature recalibration is then performed between the spatial attention map $M_S(F_t')$ and $F_t'$ using element-wise multiplication, yielding the attention-weighted slice tensor $F_t''$, thereby identifying the spatially salient regions while distinguishing the information-dense channels, and suppressing redundant background information. The resulting final refined tensor $F_t''$ is computed as:

$$F_t'' = M_S(F_t') \otimes F_t'$$
the three-dimensional residual attention network module can be separated. To achieve the aforementioned expansion of the dual attention mechanism in the time dimension, the inference process of channel attention mapping and spatial attention mapping needs to be applied to the middle layer convolution feature F' ∈ R T '×H'×W'×C' The above dual attention weighting process needs to be repeated for all time dimensions, that is, the slice tensors at all times, and finally, the thinned slice tensors are arranged according to the original time dimensions and stacked into a final thinned feature cube.
The temporally expanded channel attention module and spatial attention module are sequentially embedded into the separable three-dimensional residual blocks of Sep-3D ResNet to obtain separable three-dimensional residual attention blocks (Sep-3D RABs), which flexibly allocate richer attention resources to key features while capturing the abstract semantic information of the input data. Finally, the Sep-3D RAN is built following the model architecture of 3D ResNet, i.e., the plain residual blocks in 3D ResNet are replaced with Sep-3D RABs; a sketch of the resulting network is given below.
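Assembling the pieces sketched earlier, the overall network might take the following shape; the stage widths, the stem configuration and the 1×1×1 projections between stages are assumptions of this sketch (a real 3D ResNet also downsamples between stages, which is omitted here for brevity):

```python
import torch.nn as nn

class Sep3DRAN(nn.Module):
    """Sketch of the overall network: a stem, four Sep-3D RAB stages
    (mirroring the four stages of 3D ResNet), and a classifier head."""
    def __init__(self, num_classes, widths=(64, 128, 256, 512)):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, widths[0], (3, 7, 7), stride=(1, 2, 2),
                      padding=(1, 3, 3), bias=False),
            nn.BatchNorm3d(widths[0]), nn.ReLU(inplace=True))
        blocks, in_c = [], widths[0]
        for w in widths:
            if w != in_c:                       # 1x1x1 projection between stages
                blocks.append(nn.Conv3d(in_c, w, 1))
            blocks.append(Sep3DRAB(w))          # from the earlier sketch
            in_c = w
        self.res_blocks = nn.Sequential(*blocks)
        self.head = nn.Linear(widths[-1], num_classes)

    def forward(self, x):                       # x: (B, 3, T, H, W)
        f = self.res_blocks(self.stem(x))
        f = f.mean(dim=(2, 3, 4))               # global spatio-temporal pooling
        return self.head(f)                     # prediction vector I
```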
Optionally, module four specifically includes:
a multi-stage training strategy module. The network parameters are first initialized with pre-training weights to speed up the convergence process of the model. Considering that the Sep-3D RAN has four separable three-dimensional residual attention blocks, the training process of the model is divided into four stages. In the first stage, the attention mechanism is embedded in the first residual block only, and then the network layer parameters before the module are fixed, and the subsequent network layer is trained. And in the second stage, continuously embedding an attention mechanism into the second residual block, then initializing the network layer parameters before the current module by using the network weights learned in the first stage, and training the subsequent network layer. The above process is repeated until all four attention modules are embedded in the network. Due to the introduction of the pre-training weight, the model can realize rapid convergence, so the training process is not time-consuming and is easy to realize. Furthermore, the model is end-to-end trainable in all training phases, so the model can directly learn the mapping relationship from the original input to the target output.
To realize this end-to-end training mode, a fully connected layer is used to generate the final one-dimensional prediction vector $I \in \mathbb{R}^{C}$, where $C$ is the total number of action categories in the target dataset; the softmax function is then selected to compute the probability distribution over the categories to which the input video belongs, i.e.:

$$p_{n,i} = \frac{\exp(I_i)}{\sum_{j=1}^{C} \exp(I_j)}$$
where $p_{n,i}$ denotes the predicted probability that the $n$-th video belongs to action category $i$.
In the optimization stage, the error between the ground-truth value and the predicted value is adjusted with a cross-entropy loss function, calculated as:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} y_{n,i} \log(p_{n,i})$$

where $y_{n,i}$ denotes the ground-truth label corresponding to the given input video and $N$ is the number of samples per batch during training.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A human body action recognition method based on a separable three-dimensional residual attention network is characterized by specifically comprising the following steps:
s1: constructing separable three-dimensional convolution, and replacing standard three-dimensional convolution in the 3D ResNet by utilizing the separable three-dimensional convolution so as to build Sep-3D ResNet; wherein Sep-3D ResNet is a separable three-dimensional residual error network;
s2: designing a channel attention module to capture channel level importance distributions, designing a spatial attention module to automatically weigh importance of each spatial location, and then stacking two attention modules in sequence to construct a dual attention mechanism;
designing a channel attention module, which specifically comprises: aggregating the spatial dimensions of the input features by a global average pooling operation to generate a channel descriptor $F_C \in \mathbb{R}^{1 \times 1 \times C'}$, expressed as:

$$F_C = \frac{1}{H' \times W'} \sum_{i=1}^{H'} \sum_{j=1}^{W'} F_t(i, j)$$
where $F_t \in \mathbb{R}^{H' \times W' \times C'}$ denotes the slice tensor at time $t$, with $t = 0, 1, \dots, T'$, and $T'$, $H'$, $W'$, $C'$ respectively denote the time dimension, height, width and number of channels of the intermediate feature-mapping cube obtained after the input cube passes through one group or a series of separable three-dimensional convolutions;
subsequently, a gating mechanism similar to a self-attention function is used to obtain the importance distribution over the channels, i.e., the channel descriptor $F_C$ is fed into a multilayer perceptron with one hidden layer to produce an unnormalized channel attention map; to limit the number of model parameters, the dimension of the hidden layer is set to $C'/r$, where $r$ is a reduction ratio; a sigmoid activation function then performs the normalization, giving the final channel attention map; the channel attention is computed as:

$$M_C(F_t) = EP_C(\sigma(\mathrm{MLP}(F_C))) = EP_C(\sigma(W_1\,\delta(W_0 F_C)))$$

where $\sigma(\cdot)$ denotes the sigmoid activation function, $\delta(\cdot)$ denotes the ReLU activation function, $W_0$ and $W_1$ denote the weights of the multilayer perceptron, and $EP_C(\cdot)$ expands the channel attention values along the spatial domain to the original dimensions, i.e., $M_C(F_t) \in \mathbb{R}^{C' \times H' \times W'}$;
to perform automatic feature calibration, the channel attention map is applied to the original input features; the refined slice tensor is computed as:

$$F_t' = M_C(F_t) \otimes F_t$$

where the symbol $\otimes$ denotes element-wise multiplication;
designing a spatial attention module, which specifically comprises: aggregating $F_t'$ with a global average pooling operation to generate a two-dimensional spatial descriptor $F_S \in \mathbb{R}^{H' \times W' \times 1}$ that summarizes the global channel information of $F_t'$, expressed as:

$$F_S = \frac{1}{C'} \sum_{k=1}^{C'} F_t'(k)$$
then computing the spatial attention distribution with a two-dimensional convolution in place of a multilayer perceptron, namely:

$$M_S(F_t') = EP_S(\sigma(\mathrm{conv}(F_S)))$$

where $\mathrm{conv}(\cdot)$ denotes a two-dimensional convolution operation and $EP_S(\cdot)$ denotes a dimension-transformation operation along the channel dimension;
after deriving the channel attention map and the spatial attention map of the original slice tensor $F_t$, first performing feature calibration with the channel attention module to obtain the refined slice tensor $F_t'$, and then performing feature recalibration between the spatial attention map $M_S(F_t')$ and $F_t'$ using element-wise multiplication, yielding the attention-weighted slice tensor $F_t''$, which distinguishes the information-dense channels while identifying the spatially salient regions and suppresses redundant background information; the resulting final refined tensor $F_t''$ is computed as:

$$F_t'' = M_S(F_t') \otimes F_t'$$
s3: performing double attention weighting on middle-layer convolution characteristics at different moments, expanding a double attention module in a time dimension, embedding the double attention module into a separable three-dimensional residual block of Sep-3D ResNet, and constructing to form a Sep-3D RAN model; wherein, the Sep-3D RAN is a separable three-dimensional residual error attention network;
s4: performing joint end-to-end training on the Sep-3D RAN model by using a multi-stage training strategy, which specifically comprises the following steps: generating a final one-dimensional prediction vector I e R by using a full-connection layer C C refers to the total number of action categories of the target dataset, and then selects the softmax function to calculate the probability distribution of the category to which the input video belongs, i.e.:
Figure FDA0003775734310000022
where $p_{n,i}$ denotes the predicted probability that the $n$-th video belongs to action category $i$;
in the optimization stage, adjusting the error between the ground-truth value and the predicted value with a cross-entropy loss function, whose expression is:

$$L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} y_{n,i} \log(p_{n,i})$$

where $y_{n,i}$ denotes the ground-truth label of the given input video and $N$ is the number of samples per batch during training.
2. The human motion recognition method of claim 1, wherein in step S1, the constructing of the separable three-dimensional convolution is to approximate a standard three-dimensional convolution in a space-time dimension to a two-dimensional convolution in a space dimension and a one-dimensional convolution in a time dimension by a three-dimensional convolution kernel decomposition operation to construct the separable three-dimensional convolution.
3. The human motion recognition method according to claim 1 or 2, wherein in step S1, constructing the separable three-dimensional convolution specifically comprises: assuming convolutional layer $i$ receives $N_{i-1}$ input features, the $N_{i-1}$ features are first convolved with $M_i$ two-dimensional spatial filters of size $1 \times h \times w \times N_{i-1}$, where $h$, $w$ and $N_{i-1}$ are the height, width and channel dimension of the two-dimensional spatial convolution kernel; they are then convolved with $N_i$ one-dimensional temporal filters of size $t \times 1 \times 1 \times M_i$, where $t$ and $M_i$ respectively denote the temporal scale and channel dimension of the one-dimensional temporal convolution kernel.
4. The human motion recognition method of claim 3, wherein the design of $M_i$ follows the rule that the parameter count of the decomposed three-dimensional convolution approximately equals that of the standard three-dimensional convolution, calculated by:

$$M_i = \left\lfloor \frac{t\,h\,w\,N_{i-1}\,N_i}{h\,w\,N_{i-1} + t\,N_i} \right\rfloor$$
5. the human motion recognition method of claim 1, wherein in step S3, the building of the Sep-3D RAN model specifically comprises: repeating the double attention weighting process on the slice tensors at each moment, and finally arranging and stacking the thinned slice tensors according to the original time dimension to form a final thinned feature cube;
sequentially embedding the temporally expanded channel attention module and spatial attention module into the separable three-dimensional residual blocks of Sep-3D ResNet to obtain separable three-dimensional residual attention blocks; and finally building the Sep-3D RAN, i.e., the separable 3D residual attention network, following the model architecture of 3D ResNet, namely replacing the plain residual blocks in 3D ResNet with separable three-dimensional residual attention blocks.
6. The human motion recognition method of claim 1, wherein in step S4, performing joint end-to-end training on the Sep-3D RAN model by using a multi-stage training strategy specifically comprises: firstly, initializing network parameters by using pre-training weights to accelerate the convergence process of a model; considering that the Sep-3D RAN has four separable three-dimensional residual attention blocks, the training process of the model is divided into four stages; in the first stage, an attention mechanism is only embedded into a first residual block, and then the parameters of the network layer before the module are fixed, and the subsequent network layer is trained; in the second stage, continuously embedding an attention mechanism into the second residual block, then initializing the network layer parameters before the current module by using the network weights learned in the first stage, and training the subsequent network layer; the above process is repeated until all the residual blocks have embedded the attention mechanism.
CN202110334547.5A 2021-03-29 2021-03-29 Human body action recognition method based on separable three-dimensional residual error attention network Active CN113065450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110334547.5A CN113065450B (en) 2021-03-29 2021-03-29 Human body action recognition method based on separable three-dimensional residual error attention network

Publications (2)

Publication Number Publication Date
CN113065450A CN113065450A (en) 2021-07-02
CN113065450B true CN113065450B (en) 2022-09-20

Family

ID=76564513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110334547.5A Active CN113065450B (en) 2021-03-29 2021-03-29 Human body action recognition method based on separable three-dimensional residual error attention network

Country Status (1)

Country Link
CN (1) CN113065450B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255616B (en) * 2021-07-07 2021-09-21 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN113887419B (en) * 2021-09-30 2023-05-12 四川大学 Human behavior recognition method and system based on extracted video space-time information
CN114550162B (en) * 2022-02-16 2024-04-02 北京工业大学 Three-dimensional object recognition method combining view importance network and self-attention mechanism
CN117575915A (en) * 2024-01-16 2024-02-20 闽南师范大学 Image super-resolution reconstruction method, terminal equipment and storage medium
CN117831301B (en) * 2024-03-05 2024-05-07 西南林业大学 Traffic flow prediction method combining three-dimensional residual convolution neural network and space-time attention mechanism

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190113119A (en) * 2018-03-27 2019-10-08 삼성전자주식회사 Method of calculating attention for convolutional neural network
US11361225B2 (en) * 2018-12-18 2022-06-14 Microsoft Technology Licensing, Llc Neural network architecture for attention based efficient model adaptation
CN109871777B (en) * 2019-01-23 2021-10-01 广州智慧城市发展研究院 Behavior recognition system based on attention mechanism
CN111415342B (en) * 2020-03-18 2023-12-26 北京工业大学 Automatic detection method for pulmonary nodule images of three-dimensional convolutional neural network by fusing attention mechanisms
CN111428699B (en) * 2020-06-10 2020-09-22 南京理工大学 Driving fatigue detection method and system combining pseudo-3D convolutional neural network and attention mechanism
CN112288041B (en) * 2020-12-15 2021-03-30 之江实验室 Feature fusion method of multi-mode deep neural network

Also Published As

Publication number Publication date
CN113065450A (en) 2021-07-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant