CN112990116B - Behavior recognition device and method based on multi-attention mechanism fusion and storage medium - Google Patents

Behavior recognition device and method based on multi-attention mechanism fusion and storage medium

Info

Publication number
CN112990116B
CN112990116B
Authority
CN
China
Prior art keywords: layer, module, attention mechanism, branch, characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110428650.6A
Other languages
Chinese (zh)
Other versions
CN112990116A (en)
Inventor
桑高丽 (Sang Gaoli)
卢丽 (Lu Li)
闫超 (Yan Chao)
黄俊洁 (Huang Junjie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yifei Technology Co., Ltd.
Original Assignee
Sichuan Yifei Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yifei Technology Co., Ltd.
Priority to CN202110428650.6A
Publication of CN112990116A
Application granted
Publication of CN112990116B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a behavior recognition device, a behavior recognition method, and a storage medium based on multi-attention mechanism fusion. The backbone network of the adopted network model is formed by connecting in series, from front to back, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module, wherein the two-dimensional convolution layer extracts the feature information of the sequence frame images and the hybrid residual module extracts feature information of different characteristics. The hybrid residual module comprises a multi-attention fusion module, which is divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module whose outputs are fused by addition. By constructing the time-domain and spatial-domain attention mechanism modules, feature points of different importance are given different weights along the two directions, so that the model learns the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance.

Description

Behavior recognition device and method based on multi-attention mechanism fusion and storage medium
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a behavior recognition device and method based on multi-attention mechanism fusion and a storage medium.
Background
With the rapid development of artificial intelligence technology, attention has turned to how computers can further understand the world, giving rise to the discipline of video understanding, which is widely applied in fields such as augmented reality, virtual reality, and intelligent monitoring. In the big-data era, millions of videos are uploaded or downloaded every day, and video understanding techniques can play a key role in handling them. However, with the explosive growth of video streams, video understanding techniques face significant challenges in accuracy and computational cost.
Behavior recognition is a basic direction within video understanding. Its core is to have a computer classify the behavior of targets in a video by learning the feature information of sequence frame images, thereby achieving recognition; it is commonly used in human-computer interaction, intelligent monitoring, and robotics. Recognizing the behavior of targets in video data usually involves time dependency: it requires not only the spatial information within each frame image but also the temporal information between frames, as with behaviors such as an elderly person falling down or someone carrying articles.
In recent years, the mainstream behavior recognition methods have been network models based on three-dimensional convolution and network models based on two-dimensional convolution. The former builds a deep spatio-temporal network model from a large number of three-dimensional convolution layers and can process video sequence data effectively, but a spatio-temporal model built purely on three-dimensional convolution cannot fully capture the information contained in a video and overfits easily, causing many false detections by the model. The latter considers spatial information and temporal information separately; compared with a network model that considers only one side, its performance improves greatly, but extracting and processing the temporal information incurs a large computational cost.
Currently, most behavior recognition techniques are difficult to apply in practical scenes because their inherently heavy computation makes model inference slow. There is therefore an urgent need for a behavior recognition scheme that improves accuracy while reducing computational cost, strengthens the model's feature expression for sequence frame images, and enhances behavior recognition performance.
Disclosure of Invention
The present invention aims to provide a behavior recognition device, method, and storage medium based on multi-attention mechanism fusion that solve the above problems.
The invention is mainly realized by the following technical scheme:
A behavior recognition device based on multi-attention mechanism fusion comprises a data processing module, a training module, and a recognition module, wherein the data processing module is used for collecting and cutting videos to obtain training data; the training module is used for inputting the training data into the network model for training and obtaining an optimized network model; and the recognition module is used for inputting the data to be detected into the optimized network model and outputting a behavior recognition result;
the backbone network of the network model is formed by connecting in series, from front to back, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module, wherein the two-dimensional convolution layer extracts the feature information of the sequence frame images and the hybrid residual module extracts feature information of different characteristics; the hybrid residual module comprises a multi-attention fusion module, which is divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module whose outputs are fused by addition.
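As an illustration only, a minimal PyTorch sketch of this backbone follows; the class names, layer widths, and the 7x7 stem kernel are assumptions for the example, not values taken from the patent:

    import torch
    import torch.nn as nn

    class Backbone(nn.Module):
        # Stem (Conv2D -> BatchNorm -> PReLU) followed by the hybrid residual
        # stage and a fully connected classification head.
        def __init__(self, blocks: nn.Module, in_channels=3, width=64, num_classes=10):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(in_channels, width, kernel_size=7, stride=2, padding=3),
                nn.BatchNorm2d(width),
                nn.PReLU(),
            )
            self.blocks = blocks              # hybrid residual module(s), sketched later
            self.head = nn.Linear(width, num_classes)

        def forward(self, x):                 # x: [B * N_frame, C, H, W] sequence frames
            f = self.blocks(self.stem(x))
            f = f.mean(dim=(2, 3))            # global average pooling over H and W
            return self.head(f)               # logits for behavior classification

    model = Backbone(blocks=nn.Identity())    # identity placeholder for the hybrid stage
    logits = model(torch.randn(8, 3, 112, 112))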
The hybrid residual module mainly comprises the multi-attention fusion module and convolution layers; it extracts effective feature information and enhances the expression capability of the model. The multi-attention fusion module builds attention mechanisms in the time domain and the spatial domain: by analyzing the importance of the information between frame images and the information within frame images along these two directions, it makes the model's learning process focus more on the target behavior and improves model performance and accuracy. According to the technical characteristics of behavior recognition, weights are assigned to the feature information from the temporal and spatial aspects respectively; by constructing a time-domain attention mechanism and a spatial-domain attention mechanism, the model learns the correlation between frame images and the feature-level information within frame images. The modules are plug-and-play and effectively improve the recognition performance of the network model.
In use, the input sequence frame images are processed by a convolution layer at the start of the network and extracted into convolutional feature information. To better fit the characteristics of behavior recognition data, a feature moving layer is purposely designed: it applies enhancement operations such as translation and rotation to the feature maps, which increases the generalization of the feature information to behavioral motion. The feature information is then passed to the multi-attention fusion module for parallel processing. The time-domain attention mechanism module can establish more effective long-range dependencies for video data; the spatial-domain attention mechanism module aggregates the information of each feature point through adaptive selection during training, with no additional computation.
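The patent does not specify how the feature moving layer is implemented; the following PyTorch sketch shows one plausible reading, using torch.roll for translation and torch.rot90 for rotation (the shift range and the rotation probability are assumptions):

    import torch
    import torch.nn as nn

    class FeatureShiftLayer(nn.Module):
        # Applies small random translations (torch.roll) and occasional 90-degree
        # rotations (torch.rot90) to feature maps, during training only.
        def __init__(self, max_shift=1):
            super().__init__()
            self.max_shift = max_shift

        def forward(self, x):                  # x: [B, C, H, W]
            if not self.training:
                return x
            dy = int(torch.randint(-self.max_shift, self.max_shift + 1, (1,)))
            dx = int(torch.randint(-self.max_shift, self.max_shift + 1, (1,)))
            x = torch.roll(x, shifts=(dy, dx), dims=(2, 3))   # translation on the feature map
            if float(torch.rand(1)) < 0.5 and x.shape[2] == x.shape[3]:
                x = torch.rot90(x, k=1, dims=(2, 3))          # rotation (square maps only)
            return x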
The time-domain attention mechanism module aims to increase the responsiveness of model features to different frame images by modeling the relations between the frames of the sequence; the specific process constructs a multi-branch structure. In the time-domain attention mechanism module, a two-dimensional convolution layer first performs a dimension-raising operation on the feature information, the feature information of the second branch is transposed, and the feature information of the first and second branches is multiplied to obtain an association matrix between feature-map channels. A softmax layer (rendered elsewhere in this text as the "flexible maximum" layer) then yields the time-domain attention map, which is point-multiplied with the input feature information to obtain the weighted feature information. In addition, a three-dimensional convolution layer is added to enrich the semantic information of the weighted feature map, and finally a gamma parameter layer adjusts the fusion of the weighted feature information with the original feature information so that the optimal fusion mode is selected adaptively. Constructing these network layers greatly improves the network model's ability to express sequence data.
The attention map in the time-domain attention mechanism module has dimensions [B, N_frame, H, W, C]. It is obtained by processing the association-strength matrix along the channel direction with a softmax layer, and multiplying the original feature information by the attention map yields the weighted feature map, which strengthens the contribution of key frames during model learning. At this point the weighted feature map still preserves the sequence order of the frame images in the channel direction, and every feature point on it is associated across channels. This expression of dimensions is common knowledge in the art and is not described in detail.
Existing methods use the weighted feature map directly for fusion, but the long-range dependencies it contains are limited: the product operation can represent only part of the association information and cannot effectively capture motion information when the target behavior moves too fast. The present method therefore processes the weighted feature map with a three-dimensional convolution layer, using the characteristics of the 3D convolution kernel to strengthen the association of several adjacent frames in the channel direction, which improves the temporal expressiveness of the feature information and captures more motion information. Furthermore, because the weighted feature map and the original feature map carry different semantic and dimensional information, they must be fused again into stronger feature information; directly adding them, as general fusion does, lets heavily weighted feature points cover the original information and causes feature degradation. A gamma parameter layer is therefore added for proportional fusion: the initial gamma is 0 and is optimized as the model learns, so the optimal fusion mode is obtained adaptively and the expression capability of the feature information is enhanced.
The spatial-domain attention mechanism module models the interior of each frame image, enhancing the responsiveness of model features to position information within the frame; the specific process constructs a multi-branch structure. In the spatial-domain attention mechanism module, a channel max pooling layer and a channel average pooling layer are used in the first and second branches respectively to extract locally important information from the global feature information; a convolution layer with a 1x1 kernel and a feature deformation layer adjust the dimensionality; a one-dimensional convolution layer then increases the degree of dependence between feature points, and a softmax layer yields the spatial-domain attention map. The final processing resembles that of the time domain: after the weighted feature information is obtained, learnable parameters adjust the weights so that optimal feature information is obtained adaptively.
The invention proposes the channel max pooling layer and channel average pooling layer, whose main processing is to pool directly along the channel direction, changing the data dimension of a feature block to H×W×1 and fusing the position information in different frame images, so that the cross-channel global information loses no detailed information and better suits target classification in behavior recognition scenes. Moreover, most existing attention methods are general-purpose: they do not use one-dimensional convolution to extract spatial information and they ignore the dependence between feature points. In the proposed method, the feature information fed into the one-dimensional convolution is the extracted cross-channel global information, whose values are the maximally weighted features of each channel and are only weakly related; the one-dimensional convolution is applied immediately to strengthen the connections between these feature values and avoid feature degradation in subsequent computation. The beta parameter layer plays the same role as the gamma parameter layer.
In summary, according to the characteristics of behavior recognition targets, the invention handles weight assignment between frame images and within frame images along the two directions of the time domain and the spatial domain, so that the various kinds of information contained in a sample, such as spatial information and temporal information, are processed more effectively. This greatly improves the utilization of the feature information, strengthens its expression capability, and improves the performance of the network model.
In order to better realize the invention, the hybrid residual module is formed by packaging, in order from front to back, a feature moving layer, a multi-attention fusion module, a two-dimensional convolution layer, a three-dimensional convolution layer, a batch normalization layer, and an activation function; the feature moving layer is a network layer that integrates translation and rotation enhancement operations on the feature maps at the feature level.
In order to better implement the present invention, the time-domain attention mechanism module further includes a first branch and a second branch. The first branch consists, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a feature deformation layer; the second branch consists of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature transposition layer, and a feature deformation layer. The input features enter both branches; the outputs of the two branches are multiplied and fed to a softmax layer; the softmax output is multiplied by the input features; the product is processed, from front to back, by a three-dimensional convolution layer and an activation function layer, multiplied by the gamma parameter layer, and finally combined with the input features in a feature splicing layer.
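A hedged PyTorch sketch of this branch structure follows; the hidden width, the 1x1 branch convolutions, and the reading of the association matrix as frame-to-frame attention are assumptions made to keep the example concrete and runnable:

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        # Two Conv2D+BN+PReLU branches; their flattened outputs are multiplied
        # (second branch transposed) into an association matrix that a softmax
        # turns into the time-domain attention map. The weighted features are
        # refined by a 3D convolution and fused back through a zero-initialized
        # gamma parameter.
        def __init__(self, channels, hidden=None):
            super().__init__()
            hidden = hidden or channels        # the patent mentions a dimension-raising conv
            def branch():
                return nn.Sequential(
                    nn.Conv2d(channels, hidden, kernel_size=1),
                    nn.BatchNorm2d(hidden),
                    nn.PReLU(),
                )
            self.b1, self.b2 = branch(), branch()
            self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.act = nn.PReLU()
            self.gamma = nn.Parameter(torch.zeros(1))   # gamma parameter layer, initially 0

        def forward(self, x):                  # x: [B, N, C, H, W]
            B, N, C, H, W = x.shape
            flat = x.reshape(B * N, C, H, W)
            f1 = self.b1(flat).reshape(B, N, -1)        # feature deformation: [B, N, hidden*H*W]
            f2 = self.b2(flat).reshape(B, N, -1)
            att = torch.softmax(f1 @ f2.transpose(1, 2), dim=-1)   # [B, N, N] association
            weighted = (att @ x.reshape(B, N, -1)).reshape(B, N, C, H, W)
            w = weighted.permute(0, 2, 1, 3, 4)         # [B, C, N, H, W] for the 3D conv
            w = self.act(self.conv3d(w)).permute(0, 2, 1, 3, 4)
            return x + self.gamma * w                   # adaptive fusion with the input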
In order to better implement the present invention, the spatial-domain attention mechanism module further includes a first branch and a second branch. The first branch consists, from front to back, of a channel feature max pooling layer and a feature deformation layer; the second branch consists of a channel feature average pooling layer and a feature deformation layer. The input features enter both branches, whose outputs are connected, from front to back, to a feature splicing layer, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a one-dimensional convolution layer, and a softmax layer; the softmax output is multiplied by the input features and by the beta parameter layer, and is finally combined with the input features in a feature splicing layer.
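The following sketch mirrors this description in PyTorch; the 1D kernel size and the exact fusion order are assumptions, and MultiAttentionFusion shows the additive combination with the TemporalAttention sketch given above:

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        # Channel max and channel average pooling give two H x W maps per frame;
        # they are spliced, fused by a 1x1 conv, passed through a 1D conv that
        # links feature points, softmax-normalized, and fused back via beta.
        def __init__(self):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Conv2d(2, 1, kernel_size=1),
                nn.BatchNorm2d(1),
                nn.PReLU(),
            )
            self.conv1d = nn.Conv1d(1, 1, kernel_size=3, padding=1)
            self.beta = nn.Parameter(torch.zeros(1))   # beta parameter layer, initially 0

        def forward(self, x):                  # x: [B, N, C, H, W]
            B, N, C, H, W = x.shape
            flat = x.reshape(B * N, C, H, W)
            mx = flat.max(dim=1, keepdim=True).values   # channel max pooling  -> [B*N, 1, H, W]
            av = flat.mean(dim=1, keepdim=True)         # channel average pooling
            a = self.fuse(torch.cat([mx, av], dim=1))   # splice, then 1x1 conv -> [B*N, 1, H, W]
            a = self.conv1d(a.reshape(B * N, 1, H * W)) # dependence between feature points
            att = torch.softmax(a, dim=-1).reshape(B, N, 1, H, W)
            return x + self.beta * (att * x)            # weighted features, beta-scaled

    class MultiAttentionFusion(nn.Module):
        # Additive fusion of the two parallel modules (TemporalAttention is the
        # sketch given after the time-domain paragraph above).
        def __init__(self, channels):
            super().__init__()
            self.temporal = TemporalAttention(channels)
            self.spatial = SpatialAttention()

        def forward(self, x):
            return self.temporal(x) + self.spatial(x)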
The invention is also realized by the following technical scheme:
a behavior recognition method based on multi-attention mechanism fusion is carried out by adopting the behavior recognition device, and comprises the following steps:
step S100: acquiring and cutting a segment video containing behaviors, and manually marking to obtain training data;
step S200: inputting sequence frame images in training data into a network model for training; distributing different weights to the characteristic information of the sequence frame images from the directions of a time domain and a space domain according to the importance of a backbone network of the network model, and then conveying the characteristic information to a full connection layer for classification and identification; then, calculating a difference value between the predicted behavior category and the real behavior category by using a loss function, and performing model training end to end;
step S300: selecting an optimizer, presetting network related hyper-parameters, initializing network model weight parameters randomly, then optimizing a loss value by using the optimizer, iteratively updating the weight parameters, stopping training until the loss value is converged, and finally testing to obtain an optimal network model;
step S400: and inputting the data to be detected into the optimal network model and outputting a behavior recognition result.
In order to better implement the present invention, further, the loss function in step S200 is a cross-entropy loss function for calculating the loss value between the predicted category and the real category of the sequence frame images, as the sketch below illustrates.
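A hedged sketch of steps S200 and S300 follows; the patent names neither the optimizer nor any hyper-parameter values, so Adam, the learning rate, and the epoch count below are placeholders:

    import torch
    import torch.nn as nn

    def train(model, loader, epochs=50, lr=1e-3, device="cpu"):
        model.to(device)
        criterion = nn.CrossEntropyLoss()     # loss between predicted and true behavior class
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(epochs):
            for frames, labels in loader:     # frames: [B, N, C, H, W], labels: [B]
                logits = model(frames.to(device))
                loss = criterion(logits, labels.to(device))
                optimizer.zero_grad()
                loss.backward()               # end-to-end training
                optimizer.step()              # iterative weight updates until convergence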
In order to better implement the present invention, further, the activation function layers in the invention adopt parametric rectified linear unit (PReLU) layers.
A computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the behavior recognition method described above.
The invention has the beneficial effects that:
(1) by constructing the time-domain attention mechanism module and the spatial-domain attention mechanism module, feature points of different importance are given different weights along the two directions, so that the model learns the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance;
(2) by constructing the hybrid residual module, the time-domain and spatial-domain attention mechanisms let the model learn the correlation between frame images and the feature-level information within frame images; the module is plug-and-play and effectively improves the recognition performance of the network model.
Drawings
FIG. 1 is a schematic structural diagram of the backbone network of the network model;
FIG. 2 is a schematic structural diagram of the hybrid residual module;
FIG. 3 is a schematic structural diagram of the multi-attention fusion module;
FIG. 4 is a schematic structural diagram of the time-domain attention mechanism module;
FIG. 5 is a schematic structural diagram of the spatial-domain attention mechanism module.
Detailed Description
Example 1:
A behavior recognition device based on multi-attention mechanism fusion comprises a data processing module, a training module, and a recognition module, wherein the data processing module is used for collecting and cutting videos to obtain training data; the training module is used for inputting the training data into the network model for training and obtaining an optimized network model; and the recognition module is used for inputting the data to be detected into the optimized network model and outputting a behavior recognition result.
As shown in FIG. 1, the backbone network of the network model is formed by connecting in series, from front to back, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module, wherein the two-dimensional convolution layer extracts the feature information of the sequence frame images and the hybrid residual module extracts feature information of different characteristics. The hybrid residual module includes a multi-attention fusion module which, as shown in FIG. 3, is divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module whose outputs are fused by addition.
In use, the input sequence frame images are processed by a convolution layer at the start of the network and extracted into convolutional feature information. To better fit the characteristics of behavior recognition data, a feature moving layer is purposely designed: it applies enhancement operations such as translation and rotation to the feature maps, increasing the generalization of the feature information to behavioral motion, and the feature information is then passed to the multi-attention fusion module for parallel processing. By constructing the time-domain and spatial-domain attention mechanism modules, feature points of different importance are given different weights along the two directions, so that the model learns the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance.
Example 2:
This embodiment is optimized on the basis of Embodiment 1. As shown in FIG. 2, the hybrid residual module is formed by packaging, in order from front to back, a feature moving layer, a multi-attention fusion module, a two-dimensional convolution layer, a three-dimensional convolution layer, a batch normalization layer, and an activation function; the feature moving layer is a network layer that integrates translation and rotation enhancement operations on the feature maps at the feature level.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
This embodiment is optimized on the basis of Embodiment 1 or 2. As shown in FIG. 4, the time-domain attention mechanism module includes a first branch and a second branch. The first branch consists, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a feature deformation layer; the second branch consists of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature transposition layer, and a feature deformation layer. The input features enter both branches; the outputs of the two branches are multiplied and fed to a softmax layer; the softmax output is multiplied by the input features; the product is processed, from front to back, by a three-dimensional convolution layer and an activation function layer, multiplied by the gamma parameter layer, and finally combined with the input features in a feature splicing layer.
In the time-domain attention mechanism module, a two-dimensional convolution layer first performs a dimension-raising operation on the feature information, the feature information of the second branch is transposed, and the feature information of the first and second branches is multiplied to obtain the association matrix between feature-map channels. A softmax layer then yields the time-domain attention map, which is point-multiplied with the input feature information to obtain the weighted feature information. A three-dimensional convolution layer is added to enrich the semantic information of the weighted feature map, and a gamma parameter layer adjusts the fusion of the weighted feature information with the original feature information so that the optimal fusion mode is selected adaptively. Constructing these network layers greatly improves the network model's ability to express sequence data.
Further, as shown in FIG. 5, the spatial-domain attention mechanism module includes a first branch and a second branch. The first branch consists, from front to back, of a channel feature max pooling layer and a feature deformation layer; the second branch consists of a channel feature average pooling layer and a feature deformation layer. The input features enter both branches, whose outputs are connected, from front to back, to a feature splicing layer, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a one-dimensional convolution layer, and a softmax layer; the softmax output is multiplied by the input features and by the beta parameter layer, and is finally combined with the input features in a feature splicing layer.
In the spatial-domain attention mechanism module, the channel max pooling layer and channel average pooling layer process the first and second branches respectively to extract locally important information from the global feature information; a convolution layer with a 1x1 kernel and a feature deformation layer adjust the dimensionality; a one-dimensional convolution layer then increases the degree of dependence between feature points, and a softmax layer yields the spatial-domain attention map. The final processing resembles that of the time domain: after the weighted feature information is obtained, learnable parameters adjust the weights so that optimal feature information is obtained adaptively.
The rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
A behavior recognition method based on multi-attention mechanism fusion, performed with the above behavior recognition device, comprises the following steps:
Step S100: acquiring and cutting video clips containing behaviors, and manually labeling them to obtain training data;
Step S200: inputting the sequence frame images of the training data into the network model for training: the backbone network assigns different weights to the feature information of the sequence frame images, according to importance, along the time-domain and spatial-domain directions, and then passes the feature information to a fully connected layer for classification; a loss function then computes the difference between the predicted behavior category and the real behavior category, and the model is trained end to end;
Step S300: selecting an optimizer, presetting the network hyper-parameters, randomly initializing the network model weights, then optimizing the loss value with the optimizer and iteratively updating the weights until the loss value converges, stopping training, and finally testing to obtain the optimal network model;
Step S400: inputting the data to be detected into the optimal network model and outputting the behavior recognition result.
Further, the loss function in step S200 is a cross-entropy loss function for calculating the loss value between the predicted category and the real category of the sequence frame images.
Example 5:
A behavior recognition method based on multi-attention mechanism fusion comprises the following steps:
acquiring and cutting video clips containing behaviors as training data, and manually labeling them;
constructing the behavior recognition network according to the designed network structure diagram, inputting the sequence frame images, using the backbone network to assign different weights to their feature information, according to importance, along the time-domain and spatial-domain directions, and then passing them to a fully connected layer for classification;
calculating the difference between the predicted behavior category and the real behavior category with a loss function, and training the model end to end;
selecting the most suitable optimizer, presetting the network hyper-parameters, randomly initializing the model weights, then optimizing the loss value with the optimizer and iteratively updating the weights until the loss value converges, stopping training, and finally testing the obtained model.
Further, as shown in FIG. 1, the backbone network of the network model is a serial structure composed, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module. This part mainly extracts the feature information of the sequence frame images: the convolution layer raises the dimension of and downsamples the sequence frame images, and the hybrid residual module then extracts feature information of different characteristics.
By constructing the time-domain attention mechanism module and the spatial-domain attention mechanism module, feature points of different importance are given different weights along the two directions, so that the model learns the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance.
As shown in FIG. 2, the hybrid residual module is formed by packaging, from front to back, a feature moving layer, a multi-attention fusion module, a two-dimensional convolution layer, a three-dimensional convolution layer, a batch normalization layer, and an activation function. The feature moving layer is a network layer that integrates enhancement operations such as translation and rotation on feature maps at the feature level. This hybrid module, which combines one multi-attention fusion module, two-dimensional convolution layers, and one three-dimensional convolution layer, improves recognition performance without adding extra parameter computation, as the sketch below illustrates.
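Composing the earlier sketches gives one plausible reading of this module; the residual connection and the per-frame application of the 2D convolution are assumptions rather than details stated by the patent:

    import torch
    import torch.nn as nn

    class HybridResidualModule(nn.Module):
        # Feature moving layer -> multi-attention fusion -> Conv2D (per frame)
        # -> Conv3D -> BatchNorm -> PReLU, wrapped in a residual connection.
        # FeatureShiftLayer and MultiAttentionFusion are the sketches given earlier.
        def __init__(self, channels):
            super().__init__()
            self.shift = FeatureShiftLayer()
            self.fusion = MultiAttentionFusion(channels)
            self.conv2d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm3d(channels)
            self.act = nn.PReLU()

        def forward(self, x):                  # x: [B, N, C, H, W]
            B, N, C, H, W = x.shape
            f = self.shift(x.reshape(B * N, C, H, W)).reshape(B, N, C, H, W)
            f = self.fusion(f)
            f = self.conv2d(f.reshape(B * N, C, H, W)).reshape(B, N, C, H, W)
            f = f.permute(0, 2, 1, 3, 4)       # [B, C, N, H, W] for the 3D layers
            f = self.act(self.bn(self.conv3d(f))).permute(0, 2, 1, 3, 4)
            return x + f                       # residual connection (assumed)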
As shown in FIG. 3, the multi-attention fusion module is mainly divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module, whose outputs are fused by addition.
As shown in FIG. 4, the time-domain attention mechanism module is a multi-branch structure composed, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a feature transposition layer, a feature multiplication layer, a feature addition layer, a softmax layer, a three-dimensional convolution layer, and a gamma parameter layer. The expression assigning weights to feature points of different importance along the time-domain direction is as follows:
Output = Input + γ · 3DConv(Att_time ⊙ Input)
where Input is the feature information input to the module, 3DConv denotes the three-dimensional convolution processing, ⊙ denotes point-wise multiplication, Att_time is the attention weight map calculated along the time-domain direction, representing the weight value to be given to each feature point, and γ is the learnable parameter added by the gamma parameter layer to adjust the importance of the feature points. The feature deformation layer in the time-domain attention mechanism module mainly reshapes input feature information of dimension [B, N_frame, H, W, C] into [B, N_frame, H×W×C]. This expression of dimensions is conventional in the art and is not described in detail.
As shown in FIG. 5, the spatial-domain attention mechanism module is a multi-branch structure composed, from front to back, of a channel feature max pooling layer, a channel feature average pooling layer, a feature splicing layer, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a one-dimensional convolution layer, a softmax layer, a feature multiplication layer, and a feature addition layer. The expression assigning weights to feature points of different importance along the spatial-domain direction is as follows:
Output = Input + β · (Att_spatio ⊙ Input)
where Att_spatio is the attention weight map calculated along the spatial-domain direction and β is the learnable parameter added by the beta parameter layer to control the proportion of each feature point. The feature deformation layer in the spatial-domain attention mechanism module mainly reshapes input feature information of dimension [B, N_frame, H, W, 1] into [B, 1, N_frame, H, W]. This expression of dimensions is conventional in the art and is not described in detail.
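A short sketch of the two feature-deformation operations just described, assuming PyTorch tensors laid out as in the text:

    import torch

    B, N_frame, H, W, C = 2, 8, 14, 14, 64
    x_t = torch.randn(B, N_frame, H, W, C)
    t = x_t.reshape(B, N_frame, H * W * C)   # time domain: [B, N_frame, H, W, C] -> [B, N_frame, HxWxC]

    x_s = torch.randn(B, N_frame, H, W, 1)
    s = x_s.permute(0, 4, 1, 2, 3)           # spatial domain: [B, N_frame, H, W, 1] -> [B, 1, N_frame, H, W]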
The invention activates the feature information along the time-domain and spatial-domain directions and can effectively process the multiple types of information required during network learning, thereby improving model performance.
Finally, the model weights are randomly initialized, training is carried out with the preset network hyper-parameters, the loss value is optimized with the optimizer, and the weights are updated iteratively until the loss value converges; training then stops and the obtained model is tested.
In summary, according to the technical characteristics of behavior recognition, the invention constructs time-domain and spatial-domain attention mechanisms that let the model learn the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance while reducing the computation generated by the model.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (7)

1. A behavior recognition device based on multi-attention mechanism fusion, characterized by comprising a data processing module, a training module, and a recognition module, wherein the data processing module is used for collecting and cutting videos to obtain training data; the training module is used for inputting the training data into the network model for training and obtaining an optimized network model; and the recognition module is used for inputting the data to be detected into the optimized network model and outputting a behavior recognition result;
the backbone network of the network model is formed by connecting in series a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module, wherein the two-dimensional convolution layer is used for extracting the feature information of sequence frame images and the hybrid residual module is used for extracting feature information of different characteristics; the hybrid residual module comprises a multi-attention fusion module, which is divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module whose outputs are fused by addition;
in the time-domain attention mechanism module, a two-dimensional convolution layer first performs a dimension-raising operation on the feature information of a first branch, the feature information of a second branch is transposed, and the feature information of the first and second branches is multiplied to obtain an association matrix between feature-map channels; a softmax layer then yields a time-domain attention map, which is point-multiplied with the input feature information to obtain weighted feature information;
in the spatial-domain attention mechanism module, a channel max pooling layer and a channel average pooling layer are used in a first branch and a second branch respectively to extract locally important information from the global feature information; a convolution layer with a 1x1 kernel and a feature deformation layer adjust the dimensionality; a one-dimensional convolution layer then increases the degree of dependence between feature points, and a softmax layer yields a spatial-domain attention map; the subsequent processing resembles that of the time domain: after the weighted feature information is obtained, learnable parameters adjust the weights.
2. The behavior recognition device based on multi-attention mechanism fusion of claim 1, wherein the hybrid residual module is formed by packaging, in order from front to back, a feature moving layer, a multi-attention fusion module, a two-dimensional convolution layer, a three-dimensional convolution layer, a batch normalization layer, and an activation function; the feature moving layer is a network layer that integrates translation and rotation enhancement operations on the feature maps at the feature level.
3. The behavior recognition device based on multi-attention mechanism fusion of claim 1, wherein the time-domain attention mechanism module comprises a first branch and a second branch, the first branch consisting, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a feature deformation layer, and the second branch consisting of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature transposition layer, and a feature deformation layer; the input features enter both branches; the outputs of the two branches are multiplied and fed to a softmax layer; the softmax output is multiplied by the input features; the product is processed, from front to back, by a three-dimensional convolution layer and an activation function layer, multiplied by the gamma parameter layer, and finally combined with the input features in a feature splicing layer.
4. The behavior recognition device based on multi-attention mechanism fusion of claim 1, wherein the spatial-domain attention mechanism module comprises a first branch and a second branch, the first branch consisting, from front to back, of a channel feature max pooling layer and a feature deformation layer, and the second branch consisting of a channel feature average pooling layer and a feature deformation layer; the input features enter both branches, whose outputs are connected, from front to back, to a feature splicing layer, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a one-dimensional convolution layer, and a softmax layer; the softmax output is multiplied by the input features and by the beta parameter layer, and is finally combined with the input features in a feature splicing layer.
5. A behavior recognition method based on multi-attention mechanism fusion, performed with the behavior recognition device of any one of claims 1 to 4, comprising the following steps:
step S100: acquiring and cutting video clips containing behaviors, and manually labeling them to obtain training data;
step S200: inputting the sequence frame images of the training data into the network model for training: the backbone network assigns different weights to the feature information of the sequence frame images, according to importance, along the time-domain and spatial-domain directions, and then passes the feature information to a fully connected layer for classification; a loss function then computes the difference between the predicted behavior category and the real behavior category, and the model is trained end to end;
step S300: selecting an optimizer, presetting the network hyper-parameters, randomly initializing the network model weights, then optimizing the loss value with the optimizer and iteratively updating the weights until the loss value converges, stopping training, and finally testing to obtain the optimal network model;
step S400: inputting the data to be detected into the optimal network model and outputting the behavior recognition result.
6. The behavior recognition method based on multi-attention mechanism fusion of claim 5, wherein the loss function in step S200 is a cross-entropy loss function for calculating the loss value between the predicted category and the real category of the sequence frame images.
7. A computer-readable storage medium storing computer program instructions, which when executed by a processor implement the behavior recognition method of claim 5 or 6.
CN202110428650.6A 2021-04-21 2021-04-21 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium Active CN112990116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110428650.6A CN112990116B (en) 2021-04-21 2021-04-21 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110428650.6A CN112990116B (en) 2021-04-21 2021-04-21 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium

Publications (2)

Publication Number Publication Date
CN112990116A (en) 2021-06-18
CN112990116B (en) 2021-08-06

Family

ID=76341478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110428650.6A Active CN112990116B (en) 2021-04-21 2021-04-21 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium

Country Status (1)

Country Link
CN (1) CN112990116B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963241B (en) * 2021-12-22 2022-03-08 苏州浪潮智能科技有限公司 FPGA hardware architecture, data processing method thereof and storage medium
CN114399839A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium based on feature fusion
CN114332592B (en) * 2022-03-11 2022-06-21 中国海洋大学 Ocean environment data fusion method and system based on attention mechanism
CN114764788B (en) * 2022-03-29 2022-12-16 首都医科大学附属北京天坛医院 Intracranial arterial stenosis detection method and system
CN114724021B (en) * 2022-05-25 2022-09-09 北京闪马智建科技有限公司 Data identification method and device, storage medium and electronic device
CN116070104B (en) * 2022-11-16 2023-06-16 北京理工大学 Method for monitoring rehabilitation behaviors in real time and wearable device
CN116056074A (en) * 2023-04-03 2023-05-02 微网优联科技(成都)有限公司 Wireless communication control method based on multiple verification and wireless router applying same

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107507234A (en) * 2017-08-29 2017-12-22 北京大学 Cone beam computed tomography image and x-ray image method for registering
CN109858419A (en) * 2019-01-23 2019-06-07 广州智慧城市发展研究院 A bottom-up and top-down behavior recognition system
CN110059620A (en) * 2019-04-17 2019-07-26 安徽艾睿思智能科技有限公司 Bone Activity recognition method based on space-time attention
CN110334607A (en) * 2019-06-12 2019-10-15 武汉大学 A video human interaction behavior recognition method and system
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method
WO2021041176A1 (en) * 2019-08-27 2021-03-04 Nec Laboratories America, Inc. Shuffle, attend, and adapt: video domain adaptation by clip order prediction and clip attention alignment
CN112507920A (en) * 2020-12-16 2021-03-16 重庆交通大学 Examination abnormal behavior identification method based on time displacement and attention mechanism
CN112560656A (en) * 2020-12-11 2021-03-26 成都东方天呈智能科技有限公司 Pedestrian multi-target tracking method combining attention machine system and end-to-end training

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329525A (en) * 2020-09-27 2021-02-05 中国科学院软件研究所 Gesture recognition method and device based on space-time diagram convolutional neural network
CN112597824A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN112507995B (en) * 2021-02-05 2021-06-01 成都东方天呈智能科技有限公司 Cross-model face feature vector conversion system and method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107507234A (en) * 2017-08-29 2017-12-22 北京大学 Cone beam computed tomography image and x-ray image method for registering
CN109858419A (en) * 2019-01-23 2019-06-07 广州智慧城市发展研究院 A bottom-up and top-down behavior recognition system
CN110059620A (en) * 2019-04-17 2019-07-26 安徽艾睿思智能科技有限公司 Bone Activity recognition method based on space-time attention
CN110334607A (en) * 2019-06-12 2019-10-15 武汉大学 A video human interaction behavior recognition method and system
WO2021041176A1 (en) * 2019-08-27 2021-03-04 Nec Laboratories America, Inc. Shuffle, attend, and adapt: video domain adaptation by clip order prediction and clip attention alignment
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method
CN112560656A (en) * 2020-12-11 2021-03-26 成都东方天呈智能科技有限公司 Pedestrian multi-target tracking method combining attention machine system and end-to-end training
CN112507920A (en) * 2020-12-16 2021-03-16 重庆交通大学 Examination abnormal behavior identification method based on time displacement and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Residual Attention-based Fusion for Video Classification", Samira Pouyanfar et al., 2019 CVPR Workshops, 2019, pp. 1-3. *
卢丽 (Lu Li), 《基于安防视频的群体异常行为特征提取与识别技术研究》 (Research on Feature Extraction and Recognition of Group Abnormal Behavior Based on Security Surveillance Video), China Master's Theses Full-text Database, Information Science and Technology, No. 04, 2019-04-15, pp. I138-789. *
刘潇 (Liu Xiao), 《基于深度学习的人体行为识别技术的研究与应用》 (Research and Application of Human Behavior Recognition Technology Based on Deep Learning), China Master's Theses Full-text Database, Information Science and Technology, No. 08, 2019-08-15, pp. I138-871. *

Also Published As

Publication number Publication date
CN112990116A (en) 2021-06-18


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant