CN112990116B - Behavior recognition device and method based on multi-attention mechanism fusion and storage medium - Google Patents

Behavior recognition device and method based on multi-attention mechanism fusion and storage medium

Info

Publication number
CN112990116B
CN112990116B
Authority
CN
China
Prior art keywords: layer, module, attention mechanism, branch, characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110428650.6A
Other languages
Chinese (zh)
Other versions
CN112990116A (en)
Inventor
桑高丽 (Sang Gaoli)
卢丽 (Lu Li)
闫超 (Yan Chao)
黄俊洁 (Huang Junjie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yifei Technology Co., Ltd.
Original Assignee
Sichuan Yifei Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yifei Technology Co., Ltd.
Priority to CN202110428650.6A
Publication of CN112990116A
Application granted
Publication of CN112990116B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a behavior recognition device, a behavior recognition method, and a storage medium based on multi-attention mechanism fusion. The backbone network of the adopted network model is formed by connecting in series, from front to back, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module, wherein the two-dimensional convolution layer extracts the feature information of the sequence frame images and the hybrid residual module extracts feature information of different characteristics. The hybrid residual module comprises a multi-attention fusion module, which is divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module whose outputs are fused by addition. By constructing the time-domain and spatial-domain attention mechanism modules, feature points of different importance are given different weights along the two directions, so that the model learns the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance.

Description

Behavior recognition device and method based on multi-attention mechanism fusion and storage medium
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a behavior recognition device and method based on multi-attention mechanism fusion and a storage medium.
Background
With the rapid development of artificial intelligence technology, attention has turned to how computers can further understand the world, giving rise to the discipline of video understanding, which is widely applied in fields such as augmented reality, virtual reality, and intelligent monitoring. In the big-data era, millions of videos are uploaded or downloaded every day, and video understanding techniques can play a key role in handling them. However, with the explosive growth of video streams, video understanding techniques face significant challenges in accuracy and computational cost.
Behavior recognition is a basic direction within video understanding. Its core is to have a computer classify the behavior of targets in a video by learning the feature information of sequence frame images, thereby achieving recognition; it is commonly used in human-computer interaction, intelligent monitoring, and robotics. Recognizing the behavior of targets in video data usually involves time dependency: it requires not only the spatial information within each frame image but also the temporal information between frames, as with behaviors such as an elderly person falling down or someone carrying articles.
In recent years, the mainstream behavior recognition methods have been network models based on three-dimensional convolution and network models based on two-dimensional convolution. The former builds a deep spatio-temporal network model from a large number of three-dimensional convolution layers and can process video sequence data effectively, but a spatio-temporal model built purely on three-dimensional convolution cannot fully capture the information contained in a video and overfits easily, causing many false detections by the model. The latter considers spatial information and temporal information separately; compared with a network model that considers only one side, its performance improves greatly, but extracting and processing the temporal information incurs a large computational cost.
Currently, most behavior recognition techniques are difficult to apply in practical scenes because their inherently heavy computation makes model inference slow. There is therefore an urgent need for a behavior recognition scheme that improves accuracy while reducing computational cost, strengthens the model's feature expression for sequence frame images, and enhances behavior recognition performance.
Disclosure of Invention
The present invention aims to provide a behavior recognition device, method, and storage medium based on multi-attention mechanism fusion that solve the above problems.
The invention is mainly realized by the following technical scheme:
A behavior recognition device based on multi-attention mechanism fusion comprises a data processing module, a training module, and a recognition module, wherein the data processing module is used for collecting and cutting videos to obtain training data; the training module is used for inputting the training data into the network model for training and obtaining an optimized network model; and the recognition module is used for inputting the data to be detected into the optimized network model and outputting a behavior recognition result;
the backbone network of the network model is formed by connecting in series, from front to back, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module, wherein the two-dimensional convolution layer extracts the feature information of the sequence frame images and the hybrid residual module extracts feature information of different characteristics; the hybrid residual module comprises a multi-attention fusion module, which is divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module whose outputs are fused by addition.
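As an illustration only, a minimal PyTorch sketch of this backbone follows; the class names, layer widths, and the 7x7 stem kernel are assumptions for the example, not values taken from the patent:

    import torch
    import torch.nn as nn

    class Backbone(nn.Module):
        # Stem (Conv2D -> BatchNorm -> PReLU) followed by the hybrid residual
        # stage and a fully connected classification head.
        def __init__(self, blocks: nn.Module, in_channels=3, width=64, num_classes=10):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(in_channels, width, kernel_size=7, stride=2, padding=3),
                nn.BatchNorm2d(width),
                nn.PReLU(),
            )
            self.blocks = blocks              # hybrid residual module(s), sketched later
            self.head = nn.Linear(width, num_classes)

        def forward(self, x):                 # x: [B * N_frame, C, H, W] sequence frames
            f = self.blocks(self.stem(x))
            f = f.mean(dim=(2, 3))            # global average pooling over H and W
            return self.head(f)               # logits for behavior classification

    model = Backbone(blocks=nn.Identity())    # identity placeholder for the hybrid stage
    logits = model(torch.randn(8, 3, 112, 112))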
The hybrid residual module mainly comprises the multi-attention fusion module and convolution layers; it extracts effective feature information and enhances the expression capability of the model. The multi-attention fusion module builds attention mechanisms in the time domain and the spatial domain: by analyzing the importance of the information between frame images and the information within frame images along these two directions, it makes the model's learning process focus more on the target behavior and improves model performance and accuracy. According to the technical characteristics of behavior recognition, weights are assigned to the feature information from the temporal and spatial aspects respectively; by constructing a time-domain attention mechanism and a spatial-domain attention mechanism, the model learns the correlation between frame images and the feature-level information within frame images. The modules are plug-and-play and effectively improve the recognition performance of the network model.
In use, the input sequence frame images are processed by a convolution layer at the start of the network and extracted into convolutional feature information. To better fit the characteristics of behavior recognition data, a feature moving layer is purposely designed: it applies enhancement operations such as translation and rotation to the feature maps, which increases the generalization of the feature information to behavioral motion. The feature information is then passed to the multi-attention fusion module for parallel processing. The time-domain attention mechanism module can establish more effective long-range dependencies for video data; the spatial-domain attention mechanism module aggregates the information of each feature point through adaptive selection during training, with no additional computation.
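The patent does not specify how the feature moving layer is implemented; the following PyTorch sketch shows one plausible reading, using torch.roll for translation and torch.rot90 for rotation (the shift range and the rotation probability are assumptions):

    import torch
    import torch.nn as nn

    class FeatureShiftLayer(nn.Module):
        # Applies small random translations (torch.roll) and occasional 90-degree
        # rotations (torch.rot90) to feature maps, during training only.
        def __init__(self, max_shift=1):
            super().__init__()
            self.max_shift = max_shift

        def forward(self, x):                  # x: [B, C, H, W]
            if not self.training:
                return x
            dy = int(torch.randint(-self.max_shift, self.max_shift + 1, (1,)))
            dx = int(torch.randint(-self.max_shift, self.max_shift + 1, (1,)))
            x = torch.roll(x, shifts=(dy, dx), dims=(2, 3))   # translation on the feature map
            if float(torch.rand(1)) < 0.5 and x.shape[2] == x.shape[3]:
                x = torch.rot90(x, k=1, dims=(2, 3))          # rotation (square maps only)
            return x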
The time-domain attention mechanism module aims to increase the responsiveness of model features to different frame images by modeling the relations between the frames of the sequence; the specific process constructs a multi-branch structure. In the time-domain attention mechanism module, a two-dimensional convolution layer first performs a dimension-raising operation on the feature information, the feature information of the second branch is transposed, and the feature information of the first and second branches is multiplied to obtain an association matrix between feature-map channels. A softmax layer (rendered elsewhere in this text as the "flexible maximum" layer) then yields the time-domain attention map, which is point-multiplied with the input feature information to obtain the weighted feature information. In addition, a three-dimensional convolution layer is added to enrich the semantic information of the weighted feature map, and finally a gamma parameter layer adjusts the fusion of the weighted feature information with the original feature information so that the optimal fusion mode is selected adaptively. Constructing these network layers greatly improves the network model's ability to express sequence data.
The attention map in the time-domain attention mechanism module has dimensions [B, N_frame, H, W, C]. It is obtained by processing the association-strength matrix along the channel direction with a softmax layer, and multiplying the original feature information by the attention map yields the weighted feature map, which strengthens the contribution of key frames during model learning. At this point the weighted feature map still preserves the sequence order of the frame images in the channel direction, and every feature point on it is associated across channels. This expression of dimensions is common knowledge in the art and is not described in detail.
Existing methods use the weighted feature map directly for fusion, but the long-range dependencies it contains are limited: the product operation can represent only part of the association information and cannot effectively capture motion information when the target behavior moves too fast. The present method therefore processes the weighted feature map with a three-dimensional convolution layer, using the characteristics of the 3D convolution kernel to strengthen the association of several adjacent frames in the channel direction, which improves the temporal expressiveness of the feature information and captures more motion information. Furthermore, because the weighted feature map and the original feature map carry different semantic and dimensional information, they must be fused again into stronger feature information; directly adding them, as general fusion does, lets heavily weighted feature points cover the original information and causes feature degradation. A gamma parameter layer is therefore added for proportional fusion: the initial gamma is 0 and is optimized as the model learns, so the optimal fusion mode is obtained adaptively and the expression capability of the feature information is enhanced.
The spatial-domain attention mechanism module models the interior of each frame image, enhancing the responsiveness of model features to position information within the frame; the specific process constructs a multi-branch structure. In the spatial-domain attention mechanism module, a channel max pooling layer and a channel average pooling layer are used in the first and second branches respectively to extract locally important information from the global feature information; a convolution layer with a 1x1 kernel and a feature deformation layer adjust the dimensionality; a one-dimensional convolution layer then increases the degree of dependence between feature points, and a softmax layer yields the spatial-domain attention map. The final processing resembles that of the time domain: after the weighted feature information is obtained, learnable parameters adjust the weights so that optimal feature information is obtained adaptively.
The invention proposes the channel max pooling layer and channel average pooling layer, whose main processing is to pool directly along the channel direction, changing the data dimension of a feature block to H×W×1 and fusing the position information in different frame images, so that the cross-channel global information loses no detailed information and better suits target classification in behavior recognition scenes. Moreover, most existing attention methods are general-purpose: they do not use one-dimensional convolution to extract spatial information and they ignore the dependence between feature points. In the proposed method, the feature information fed into the one-dimensional convolution is the extracted cross-channel global information, whose values are the maximally weighted features of each channel and are only weakly related; the one-dimensional convolution is applied immediately to strengthen the connections between these feature values and avoid feature degradation in subsequent computation. The beta parameter layer plays the same role as the gamma parameter layer.
In summary, according to the characteristics of behavior recognition targets, the invention handles weight assignment between frame images and within frame images along the two directions of the time domain and the spatial domain, so that the various kinds of information contained in a sample, such as spatial information and temporal information, are processed more effectively. This greatly improves the utilization of the feature information, strengthens its expression capability, and improves the performance of the network model.
In order to better realize the invention, the hybrid residual module is formed by packaging, in order from front to back, a feature moving layer, a multi-attention fusion module, a two-dimensional convolution layer, a three-dimensional convolution layer, a batch normalization layer, and an activation function; the feature moving layer is a network layer that integrates translation and rotation enhancement operations on the feature maps at the feature level.
In order to better implement the present invention, the time-domain attention mechanism module further includes a first branch and a second branch. The first branch consists, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a feature deformation layer; the second branch consists of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature transposition layer, and a feature deformation layer. The input features enter both branches; the outputs of the two branches are multiplied and fed to a softmax layer; the softmax output is multiplied by the input features; the product is processed, from front to back, by a three-dimensional convolution layer and an activation function layer, multiplied by the gamma parameter layer, and finally combined with the input features in a feature splicing layer.
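A hedged PyTorch sketch of this branch structure follows; the hidden width, the 1x1 branch convolutions, and the reading of the association matrix as frame-to-frame attention are assumptions made to keep the example concrete and runnable:

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        # Two Conv2D+BN+PReLU branches; their flattened outputs are multiplied
        # (second branch transposed) into an association matrix that a softmax
        # turns into the time-domain attention map. The weighted features are
        # refined by a 3D convolution and fused back through a zero-initialized
        # gamma parameter.
        def __init__(self, channels, hidden=None):
            super().__init__()
            hidden = hidden or channels        # the patent mentions a dimension-raising conv
            def branch():
                return nn.Sequential(
                    nn.Conv2d(channels, hidden, kernel_size=1),
                    nn.BatchNorm2d(hidden),
                    nn.PReLU(),
                )
            self.b1, self.b2 = branch(), branch()
            self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.act = nn.PReLU()
            self.gamma = nn.Parameter(torch.zeros(1))   # gamma parameter layer, initially 0

        def forward(self, x):                  # x: [B, N, C, H, W]
            B, N, C, H, W = x.shape
            flat = x.reshape(B * N, C, H, W)
            f1 = self.b1(flat).reshape(B, N, -1)        # feature deformation: [B, N, hidden*H*W]
            f2 = self.b2(flat).reshape(B, N, -1)
            att = torch.softmax(f1 @ f2.transpose(1, 2), dim=-1)   # [B, N, N] association
            weighted = (att @ x.reshape(B, N, -1)).reshape(B, N, C, H, W)
            w = weighted.permute(0, 2, 1, 3, 4)         # [B, C, N, H, W] for the 3D conv
            w = self.act(self.conv3d(w)).permute(0, 2, 1, 3, 4)
            return x + self.gamma * w                   # adaptive fusion with the input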
In order to better implement the present invention, the spatial-domain attention mechanism module further includes a first branch and a second branch. The first branch consists, from front to back, of a channel feature max pooling layer and a feature deformation layer; the second branch consists of a channel feature average pooling layer and a feature deformation layer. The input features enter both branches, whose outputs are connected, from front to back, to a feature splicing layer, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a one-dimensional convolution layer, and a softmax layer; the softmax output is multiplied by the input features and by the beta parameter layer, and is finally combined with the input features in a feature splicing layer.
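The following sketch mirrors this description in PyTorch; the 1D kernel size and the exact fusion order are assumptions, and MultiAttentionFusion shows the additive combination with the TemporalAttention sketch given above:

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        # Channel max and channel average pooling give two H x W maps per frame;
        # they are spliced, fused by a 1x1 conv, passed through a 1D conv that
        # links feature points, softmax-normalized, and fused back via beta.
        def __init__(self):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Conv2d(2, 1, kernel_size=1),
                nn.BatchNorm2d(1),
                nn.PReLU(),
            )
            self.conv1d = nn.Conv1d(1, 1, kernel_size=3, padding=1)
            self.beta = nn.Parameter(torch.zeros(1))   # beta parameter layer, initially 0

        def forward(self, x):                  # x: [B, N, C, H, W]
            B, N, C, H, W = x.shape
            flat = x.reshape(B * N, C, H, W)
            mx = flat.max(dim=1, keepdim=True).values   # channel max pooling  -> [B*N, 1, H, W]
            av = flat.mean(dim=1, keepdim=True)         # channel average pooling
            a = self.fuse(torch.cat([mx, av], dim=1))   # splice, then 1x1 conv -> [B*N, 1, H, W]
            a = self.conv1d(a.reshape(B * N, 1, H * W)) # dependence between feature points
            att = torch.softmax(a, dim=-1).reshape(B, N, 1, H, W)
            return x + self.beta * (att * x)            # weighted features, beta-scaled

    class MultiAttentionFusion(nn.Module):
        # Additive fusion of the two parallel modules (TemporalAttention is the
        # sketch given after the time-domain paragraph above).
        def __init__(self, channels):
            super().__init__()
            self.temporal = TemporalAttention(channels)
            self.spatial = SpatialAttention()

        def forward(self, x):
            return self.temporal(x) + self.spatial(x)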
The invention is also realized by the following technical scheme:
a behavior recognition method based on multi-attention mechanism fusion is carried out by adopting the behavior recognition device, and comprises the following steps:
step S100: acquiring and cutting a segment video containing behaviors, and manually marking to obtain training data;
step S200: inputting sequence frame images in training data into a network model for training; distributing different weights to the characteristic information of the sequence frame images from the directions of a time domain and a space domain according to the importance of a backbone network of the network model, and then conveying the characteristic information to a full connection layer for classification and identification; then, calculating a difference value between the predicted behavior category and the real behavior category by using a loss function, and performing model training end to end;
step S300: selecting an optimizer, presetting network related hyper-parameters, initializing network model weight parameters randomly, then optimizing a loss value by using the optimizer, iteratively updating the weight parameters, stopping training until the loss value is converged, and finally testing to obtain an optimal network model;
step S400: and inputting the data to be detected into the optimal network model and outputting a behavior recognition result.
In order to better implement the present invention, further, the loss function in step S200 is a cross-entropy loss function for calculating the loss value between the predicted category and the real category of the sequence frame images, as the sketch below illustrates.
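A hedged sketch of steps S200 and S300 follows; the patent names neither the optimizer nor any hyper-parameter values, so Adam, the learning rate, and the epoch count below are placeholders:

    import torch
    import torch.nn as nn

    def train(model, loader, epochs=50, lr=1e-3, device="cpu"):
        model.to(device)
        criterion = nn.CrossEntropyLoss()     # loss between predicted and true behavior class
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for epoch in range(epochs):
            for frames, labels in loader:     # frames: [B, N, C, H, W], labels: [B]
                logits = model(frames.to(device))
                loss = criterion(logits, labels.to(device))
                optimizer.zero_grad()
                loss.backward()               # end-to-end training
                optimizer.step()              # iterative weight updates until convergence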
In order to better implement the present invention, further, the activation function layers in the invention adopt parametric rectified linear unit (PReLU) layers.
A computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the behavior recognition method described above.
The invention has the beneficial effects that:
(1) by constructing the time-domain attention mechanism module and the spatial-domain attention mechanism module, feature points of different importance are given different weights along the two directions, so that the model learns the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance;
(2) by constructing the hybrid residual module, the time-domain and spatial-domain attention mechanisms let the model learn the correlation between frame images and the feature-level information within frame images; the module is plug-and-play and effectively improves the recognition performance of the network model.
Drawings
FIG. 1 is a schematic structural diagram of the backbone network of the network model;
FIG. 2 is a schematic structural diagram of the hybrid residual module;
FIG. 3 is a schematic structural diagram of the multi-attention fusion module;
FIG. 4 is a schematic structural diagram of the time-domain attention mechanism module;
FIG. 5 is a schematic structural diagram of the spatial-domain attention mechanism module.
Detailed Description
Example 1:
A behavior recognition device based on multi-attention mechanism fusion comprises a data processing module, a training module, and a recognition module, wherein the data processing module is used for collecting and cutting videos to obtain training data; the training module is used for inputting the training data into the network model for training and obtaining an optimized network model; and the recognition module is used for inputting the data to be detected into the optimized network model and outputting a behavior recognition result.
As shown in FIG. 1, the backbone network of the network model is formed by connecting in series, from front to back, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module, wherein the two-dimensional convolution layer extracts the feature information of the sequence frame images and the hybrid residual module extracts feature information of different characteristics. The hybrid residual module includes a multi-attention fusion module which, as shown in FIG. 3, is divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module whose outputs are fused by addition.
In use, the input sequence frame images are processed by a convolution layer at the start of the network and extracted into convolutional feature information. To better fit the characteristics of behavior recognition data, a feature moving layer is purposely designed: it applies enhancement operations such as translation and rotation to the feature maps, increasing the generalization of the feature information to behavioral motion, and the feature information is then passed to the multi-attention fusion module for parallel processing. By constructing the time-domain and spatial-domain attention mechanism modules, feature points of different importance are given different weights along the two directions, so that the model learns the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance.
Example 2:
This embodiment is optimized on the basis of Embodiment 1. As shown in FIG. 2, the hybrid residual module is formed by packaging, in order from front to back, a feature moving layer, a multi-attention fusion module, a two-dimensional convolution layer, a three-dimensional convolution layer, a batch normalization layer, and an activation function; the feature moving layer is a network layer that integrates translation and rotation enhancement operations on the feature maps at the feature level.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
This embodiment is optimized on the basis of Embodiment 1 or 2. As shown in FIG. 4, the time-domain attention mechanism module includes a first branch and a second branch. The first branch consists, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a feature deformation layer; the second branch consists of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature transposition layer, and a feature deformation layer. The input features enter both branches; the outputs of the two branches are multiplied and fed to a softmax layer; the softmax output is multiplied by the input features; the product is processed, from front to back, by a three-dimensional convolution layer and an activation function layer, multiplied by the gamma parameter layer, and finally combined with the input features in a feature splicing layer.
In the time-domain attention mechanism module, a two-dimensional convolution layer first performs a dimension-raising operation on the feature information, the feature information of the second branch is transposed, and the feature information of the first and second branches is multiplied to obtain the association matrix between feature-map channels. A softmax layer then yields the time-domain attention map, which is point-multiplied with the input feature information to obtain the weighted feature information. A three-dimensional convolution layer is added to enrich the semantic information of the weighted feature map, and a gamma parameter layer adjusts the fusion of the weighted feature information with the original feature information so that the optimal fusion mode is selected adaptively. Constructing these network layers greatly improves the network model's ability to express sequence data.
Further, as shown in FIG. 5, the spatial-domain attention mechanism module includes a first branch and a second branch. The first branch consists, from front to back, of a channel feature max pooling layer and a feature deformation layer; the second branch consists of a channel feature average pooling layer and a feature deformation layer. The input features enter both branches, whose outputs are connected, from front to back, to a feature splicing layer, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a one-dimensional convolution layer, and a softmax layer; the softmax output is multiplied by the input features and by the beta parameter layer, and is finally combined with the input features in a feature splicing layer.
In the spatial-domain attention mechanism module, the channel max pooling layer and channel average pooling layer process the first and second branches respectively to extract locally important information from the global feature information; a convolution layer with a 1x1 kernel and a feature deformation layer adjust the dimensionality; a one-dimensional convolution layer then increases the degree of dependence between feature points, and a softmax layer yields the spatial-domain attention map. The final processing resembles that of the time domain: after the weighted feature information is obtained, learnable parameters adjust the weights so that optimal feature information is obtained adaptively.
The rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
A behavior recognition method based on multi-attention mechanism fusion, performed with the above behavior recognition device, comprises the following steps:
Step S100: acquiring and cutting video clips containing behaviors, and manually labeling them to obtain training data;
Step S200: inputting the sequence frame images of the training data into the network model for training: the backbone network assigns different weights to the feature information of the sequence frame images, according to importance, along the time-domain and spatial-domain directions, and then passes the feature information to a fully connected layer for classification; a loss function then computes the difference between the predicted behavior category and the real behavior category, and the model is trained end to end;
Step S300: selecting an optimizer, presetting the network hyper-parameters, randomly initializing the network model weights, then optimizing the loss value with the optimizer and iteratively updating the weights until the loss value converges, stopping training, and finally testing to obtain the optimal network model;
Step S400: inputting the data to be detected into the optimal network model and outputting the behavior recognition result.
Further, the loss function in step S200 is a cross-entropy loss function for calculating the loss value between the predicted category and the real category of the sequence frame images.
Example 5:
A behavior recognition method based on multi-attention mechanism fusion comprises the following steps:
acquiring and cutting video clips containing behaviors as training data, and manually labeling them;
constructing the behavior recognition network according to the designed network structure diagram, inputting the sequence frame images, using the backbone network to assign different weights to their feature information, according to importance, along the time-domain and spatial-domain directions, and then passing them to a fully connected layer for classification;
calculating the difference between the predicted behavior category and the real behavior category with a loss function, and training the model end to end;
selecting the most suitable optimizer, presetting the network hyper-parameters, randomly initializing the model weights, then optimizing the loss value with the optimizer and iteratively updating the weights until the loss value converges, stopping training, and finally testing the obtained model.
Further, as shown in FIG. 1, the backbone network of the network model is a serial structure composed, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module. This part mainly extracts the feature information of the sequence frame images: the convolution layer raises the dimension of and downsamples the sequence frame images, and the hybrid residual module then extracts feature information of different characteristics.
By constructing the time-domain attention mechanism module and the spatial-domain attention mechanism module, feature points of different importance are given different weights along the two directions, so that the model learns the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance.
As shown in FIG. 2, the hybrid residual module is formed by packaging, from front to back, a feature moving layer, a multi-attention fusion module, a two-dimensional convolution layer, a three-dimensional convolution layer, a batch normalization layer, and an activation function. The feature moving layer is a network layer that integrates enhancement operations such as translation and rotation on feature maps at the feature level. This hybrid module, which combines one multi-attention fusion module, two-dimensional convolution layers, and one three-dimensional convolution layer, improves recognition performance without adding extra parameter computation, as the sketch below illustrates.
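Composing the earlier sketches gives one plausible reading of this module; the residual connection and the per-frame application of the 2D convolution are assumptions rather than details stated by the patent:

    import torch
    import torch.nn as nn

    class HybridResidualModule(nn.Module):
        # Feature moving layer -> multi-attention fusion -> Conv2D (per frame)
        # -> Conv3D -> BatchNorm -> PReLU, wrapped in a residual connection.
        # FeatureShiftLayer and MultiAttentionFusion are the sketches given earlier.
        def __init__(self, channels):
            super().__init__()
            self.shift = FeatureShiftLayer()
            self.fusion = MultiAttentionFusion(channels)
            self.conv2d = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm3d(channels)
            self.act = nn.PReLU()

        def forward(self, x):                  # x: [B, N, C, H, W]
            B, N, C, H, W = x.shape
            f = self.shift(x.reshape(B * N, C, H, W)).reshape(B, N, C, H, W)
            f = self.fusion(f)
            f = self.conv2d(f.reshape(B * N, C, H, W)).reshape(B, N, C, H, W)
            f = f.permute(0, 2, 1, 3, 4)       # [B, C, N, H, W] for the 3D layers
            f = self.act(self.bn(self.conv3d(f))).permute(0, 2, 1, 3, 4)
            return x + f                       # residual connection (assumed)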
As shown in FIG. 3, the multi-attention fusion module is mainly divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module, whose outputs are fused by addition.
As shown in FIG. 4, the time-domain attention mechanism module is a multi-branch structure composed, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a feature transposition layer, a feature multiplication layer, a feature addition layer, a softmax layer, a three-dimensional convolution layer, and a gamma parameter layer. The expression assigning weights to feature points of different importance along the time-domain direction is as follows:
Output = Input + γ · 3DConv(Att_time ⊙ Input)
where Input is the feature information input to the module, 3DConv denotes the three-dimensional convolution processing, ⊙ denotes point-wise multiplication, Att_time is the attention weight map calculated along the time-domain direction, representing the weight value to be given to each feature point, and γ is the learnable parameter added by the gamma parameter layer to adjust the importance of the feature points. The feature deformation layer in the time-domain attention mechanism module mainly reshapes input feature information of dimension [B, N_frame, H, W, C] into [B, N_frame, H×W×C]. This expression of dimensions is conventional in the art and is not described in detail.
As shown in FIG. 5, the spatial-domain attention mechanism module is a multi-branch structure composed, from front to back, of a channel feature max pooling layer, a channel feature average pooling layer, a feature splicing layer, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a one-dimensional convolution layer, a softmax layer, a feature multiplication layer, and a feature addition layer. The expression assigning weights to feature points of different importance along the spatial-domain direction is as follows:
Output = Input + β · (Att_spatio ⊙ Input)
where Att_spatio is the attention weight map calculated along the spatial-domain direction and β is the learnable parameter added by the beta parameter layer to control the proportion of each feature point. The feature deformation layer in the spatial-domain attention mechanism module mainly reshapes input feature information of dimension [B, N_frame, H, W, 1] into [B, 1, N_frame, H, W]. This expression of dimensions is conventional in the art and is not described in detail.
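A short sketch of the two feature-deformation operations just described, assuming PyTorch tensors laid out as in the text:

    import torch

    B, N_frame, H, W, C = 2, 8, 14, 14, 64
    x_t = torch.randn(B, N_frame, H, W, C)
    t = x_t.reshape(B, N_frame, H * W * C)   # time domain: [B, N_frame, H, W, C] -> [B, N_frame, HxWxC]

    x_s = torch.randn(B, N_frame, H, W, 1)
    s = x_s.permute(0, 4, 1, 2, 3)           # spatial domain: [B, N_frame, H, W, 1] -> [B, 1, N_frame, H, W]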
The invention activates the feature information along the time-domain and spatial-domain directions and can effectively process the multiple types of information required during network learning, thereby improving model performance.
Finally, the model weights are randomly initialized, training is carried out with the preset network hyper-parameters, the loss value is optimized with the optimizer, and the weights are updated iteratively until the loss value converges; training then stops and the obtained model is tested.
In summary, according to the technical characteristics of behavior recognition, the invention constructs time-domain and spatial-domain attention mechanisms that let the model learn the correlation between frame images and the feature-level information within frame images, effectively improving recognition performance while reducing the computation generated by the model.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (7)

1. A behavior recognition device based on multi-attention mechanism fusion, characterized by comprising a data processing module, a training module, and a recognition module, wherein the data processing module is used for collecting and cutting videos to obtain training data; the training module is used for inputting the training data into the network model for training and obtaining an optimized network model; and the recognition module is used for inputting the data to be detected into the optimized network model and outputting a behavior recognition result;
the backbone network of the network model is formed by connecting in series a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a hybrid residual module, wherein the two-dimensional convolution layer is used for extracting the feature information of sequence frame images and the hybrid residual module is used for extracting feature information of different characteristics; the hybrid residual module comprises a multi-attention fusion module, which is divided into a time-domain attention mechanism module and a spatial-domain attention mechanism module whose outputs are fused by addition;
in the time-domain attention mechanism module, a two-dimensional convolution layer first performs a dimension-raising operation on the feature information of a first branch, the feature information of a second branch is transposed, and the feature information of the first and second branches is multiplied to obtain an association matrix between feature-map channels; a softmax layer then yields a time-domain attention map, which is point-multiplied with the input feature information to obtain weighted feature information;
in the spatial-domain attention mechanism module, a channel max pooling layer and a channel average pooling layer are used in a first branch and a second branch respectively to extract locally important information from the global feature information; a convolution layer with a 1x1 kernel and a feature deformation layer adjust the dimensionality; a one-dimensional convolution layer then increases the degree of dependence between feature points, and a softmax layer yields a spatial-domain attention map; the subsequent processing resembles that of the time domain: after the weighted feature information is obtained, learnable parameters adjust the weights.
2. The behavior recognition device based on multi-attention mechanism fusion of claim 1, wherein the hybrid residual module is formed by packaging, in order from front to back, a feature moving layer, a multi-attention fusion module, a two-dimensional convolution layer, a three-dimensional convolution layer, a batch normalization layer, and an activation function; the feature moving layer is a network layer that integrates translation and rotation enhancement operations on the feature maps at the feature level.
3. The behavior recognition device based on multi-attention mechanism fusion of claim 1, wherein the time-domain attention mechanism module comprises a first branch and a second branch, the first branch consisting, from front to back, of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, and a feature deformation layer, and the second branch consisting of a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature transposition layer, and a feature deformation layer; the input features enter both branches; the outputs of the two branches are multiplied and fed to a softmax layer; the softmax output is multiplied by the input features; the product is processed, from front to back, by a three-dimensional convolution layer and an activation function layer, multiplied by the gamma parameter layer, and finally combined with the input features in a feature splicing layer.
4. The behavior recognition device based on multi-attention mechanism fusion of claim 1, wherein the spatial-domain attention mechanism module comprises a first branch and a second branch, the first branch consisting, from front to back, of a channel feature max pooling layer and a feature deformation layer, and the second branch consisting of a channel feature average pooling layer and a feature deformation layer; the input features enter both branches, whose outputs are connected, from front to back, to a feature splicing layer, a two-dimensional convolution layer, a batch normalization layer, an activation function layer, a feature deformation layer, a one-dimensional convolution layer, and a softmax layer; the softmax output is multiplied by the input features and by the beta parameter layer, and is finally combined with the input features in a feature splicing layer.
5. A behavior recognition method based on multi-attention mechanism fusion, performed with the behavior recognition device of any one of claims 1 to 4, comprising the following steps:
step S100: acquiring and cutting video clips containing behaviors, and manually labeling them to obtain training data;
step S200: inputting the sequence frame images of the training data into the network model for training: the backbone network assigns different weights to the feature information of the sequence frame images, according to importance, along the time-domain and spatial-domain directions, and then passes the feature information to a fully connected layer for classification; a loss function then computes the difference between the predicted behavior category and the real behavior category, and the model is trained end to end;
step S300: selecting an optimizer, presetting the network hyper-parameters, randomly initializing the network model weights, then optimizing the loss value with the optimizer and iteratively updating the weights until the loss value converges, stopping training, and finally testing to obtain the optimal network model;
step S400: inputting the data to be detected into the optimal network model and outputting the behavior recognition result.
6. The behavior recognition method based on multi-attention mechanism fusion of claim 5, wherein the loss function in step S200 is a cross-entropy loss function for calculating the loss value between the predicted category and the real category of the sequence frame images.
7. A computer-readable storage medium storing computer program instructions, which when executed by a processor implement the behavior recognition method of claim 5 or 6.
CN202110428650.6A 2021-04-21 2021-04-21 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium Active CN112990116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110428650.6A CN112990116B (en) 2021-04-21 2021-04-21 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110428650.6A CN112990116B (en) 2021-04-21 2021-04-21 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium

Publications (2)

Publication Number Publication Date
CN112990116A (en) 2021-06-18
CN112990116B (en) 2021-08-06

Family

ID=76341478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110428650.6A Active CN112990116B (en) 2021-04-21 2021-04-21 Behavior recognition device and method based on multi-attention mechanism fusion and storage medium

Country Status (1)

Country Link
CN (1) CN112990116B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963241B (en) * 2021-12-22 2022-03-08 苏州浪潮智能科技有限公司 FPGA hardware architecture, data processing method thereof and storage medium
CN114399839A (en) * 2022-01-18 2022-04-26 平安科技(深圳)有限公司 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium based on feature fusion
CN114332592B (en) * 2022-03-11 2022-06-21 中国海洋大学 Ocean environment data fusion method and system based on attention mechanism
CN114764788B (en) * 2022-03-29 2022-12-16 首都医科大学附属北京天坛医院 Intracranial arterial stenosis detection method and system
CN114724021B (en) * 2022-05-25 2022-09-09 北京闪马智建科技有限公司 Data identification method and device, storage medium and electronic device
CN116070104B (en) * 2022-11-16 2023-06-16 北京理工大学 Method for monitoring rehabilitation behaviors in real time and wearable device
CN116056074A (en) * 2023-04-03 2023-05-02 微网优联科技(成都)有限公司 Wireless communication control method based on multiple verification and wireless router applying same

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107507234A (en) * 2017-08-29 2017-12-22 北京大学 Cone beam computed tomography image and x-ray image method for registering
CN109858419A (en) * 2019-01-23 2019-06-07 广州智慧城市发展研究院 A bottom-up and top-down behavior recognition system
CN110059620A (en) * 2019-04-17 2019-07-26 安徽艾睿思智能科技有限公司 Bone Activity recognition method based on space-time attention
CN110334607A (en) * 2019-06-12 2019-10-15 武汉大学 A video human interaction behavior recognition method and system
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method
WO2021041176A1 (en) * 2019-08-27 2021-03-04 Nec Laboratories America, Inc. Shuffle, attend, and adapt: video domain adaptation by clip order prediction and clip attention alignment
CN112507920A (en) * 2020-12-16 2021-03-16 重庆交通大学 Examination abnormal behavior identification method based on time displacement and attention mechanism
CN112560656A (en) * 2020-12-11 2021-03-26 成都东方天呈智能科技有限公司 Pedestrian multi-target tracking method combining attention machine system and end-to-end training

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329525A (en) * 2020-09-27 2021-02-05 中国科学院软件研究所 Gesture recognition method and device based on space-time diagram convolutional neural network
CN112597824A (en) * 2020-12-07 2021-04-02 深延科技(北京)有限公司 Behavior recognition method and device, electronic equipment and storage medium
CN112507995B (en) * 2021-02-05 2021-06-01 成都东方天呈智能科技有限公司 Cross-model face feature vector conversion system and method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107507234A (en) * 2017-08-29 2017-12-22 北京大学 Cone beam computed tomography image and x-ray image method for registering
CN109858419A (en) * 2019-01-23 2019-06-07 广州智慧城市发展研究院 A bottom-up and top-down behavior recognition system
CN110059620A (en) * 2019-04-17 2019-07-26 安徽艾睿思智能科技有限公司 Bone Activity recognition method based on space-time attention
CN110334607A (en) * 2019-06-12 2019-10-15 武汉大学 A video human interaction behavior recognition method and system
WO2021041176A1 (en) * 2019-08-27 2021-03-04 Nec Laboratories America, Inc. Shuffle, attend, and adapt: video domain adaptation by clip order prediction and clip attention alignment
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism
CN111985343A (en) * 2020-07-23 2020-11-24 深圳大学 Method for constructing behavior recognition deep network model and behavior recognition method
CN112560656A (en) * 2020-12-11 2021-03-26 成都东方天呈智能科技有限公司 Pedestrian multi-target tracking method combining attention machine system and end-to-end training
CN112507920A (en) * 2020-12-16 2021-03-16 重庆交通大学 Examination abnormal behavior identification method based on time displacement and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Residual Attention-based Fusion for Video Classification", Samira Pouyanfar et al., 2019 CVPR Workshops, 2019, pp. 1-3. *
卢丽 (Lu Li), 《基于安防视频的群体异常行为特征提取与识别技术研究》 (Research on Feature Extraction and Recognition of Group Abnormal Behavior Based on Security Surveillance Video), China Master's Theses Full-text Database, Information Science and Technology, No. 04, 2019-04-15, pp. I138-789. *
刘潇 (Liu Xiao), 《基于深度学习的人体行为识别技术的研究与应用》 (Research and Application of Human Behavior Recognition Technology Based on Deep Learning), China Master's Theses Full-text Database, Information Science and Technology, No. 08, 2019-08-15, pp. I138-871. *

Also Published As

Publication number Publication date
CN112990116A (en) 2021-06-18


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant