CN113516028A - Human body abnormal behavior identification method and system based on mixed attention mechanism - Google Patents

Human body abnormal behavior identification method and system based on mixed attention mechanism

Info

Publication number
CN113516028A
Authority
CN
China
Prior art keywords
features
feature
characteristic
low level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110468555.9A
Other languages
Chinese (zh)
Other versions
CN113516028B (en)
Inventor
李洪均
孙晓虎
余阿祥
申栩林
陈金怡
陈俊杰
谢正光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202110468555.9A priority Critical patent/CN113516028B/en
Publication of CN113516028A publication Critical patent/CN113516028A/en
Application granted granted Critical
Publication of CN113516028B publication Critical patent/CN113516028B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a human body abnormal behavior identification method and system based on a mixed attention mechanism, wherein the identification method comprises the following steps: extracting the features of the original image to obtain low-level detail features F; screening the low-level detail features F to obtain main significant features F'; inputting the main significant features F' into a convolution feature extraction module to obtain high-level semantic features; fusing the high-level semantic features and the low-level detail features to obtain fused features; calculating the loss between the predicted value and the actual value of the training sample to obtain a loss value; optimizing a training parameter based on the loss value; training the neural network model based on the optimized training parameters and the fused features to obtain a trained abnormal behavior recognition model; and identifying the abnormal behaviors of the human body based on the trained abnormal behavior identification model. The method can improve the identification precision of the abnormal behaviors of the human body.

Description

Human body abnormal behavior identification method and system based on mixed attention mechanism
Technical Field
The invention relates to the field of human body abnormal behavior recognition, in particular to a human body abnormal behavior recognition method and system based on a mixed attention mechanism.
Background
Human body abnormal behavior detection, as one of the research hotspots in the field of human behavior recognition, has recently drawn wide attention in both academia and industry. With the rapid development of the social economy, whether explosion-proof measures at important places such as gas stations are in place directly affects the safety of surrounding people and buildings. According to incomplete statistics, the proportion of smokers in China is as high as 26.92%, and the rate of explosion accidents caused by smoking is as high as 12.2%. It is well known that flammable and explosive oil vapor lingers in the air of a gas station, so smoking at or near a gas station greatly increases the possibility of an explosion accident; in addition, illegal behaviors such as smoking or making phone calls while driving also pose great potential safety hazards. Therefore, it is desirable to analyze human behaviors, strengthen prevention in a targeted and focused manner, and issue an early warning before a potential safety hazard turns into an accident.
At present, classification and recognition methods for human abnormal behaviors fall into two categories according to how features are extracted: traditional methods that rely on hand-crafted features, and deep learning-based methods. Hand-crafted feature methods mainly judge whether an abnormal behavior occurs through a series of means such as target detection and feature extraction designed around the characteristics of specific abnormal behaviors. Traditional abnormal behavior recognition algorithms have both advantages and disadvantages. On the one hand, they require neither heavy computation nor strong hardware support, so for sample data with a small computational load they are advantageous for detecting abnormal behaviors. On the other hand, manually extracted features are designed only for specific scenes, which makes them limited and narrow, with poor generalization ability. Unlike traditional methods, deep learning-based methods need no manual feature extraction; on the basis of human behavior recognition and classification, they train and learn a model either by artificially defining some abnormal behaviors according to the special requirements of a scene or directly from data, and the extracted deep features can effectively express human behaviors and enhance the adaptability of the model to the input data.
With the development of deep learning, attention mechanisms have gradually been widely applied in fields such as computer vision. Jaderberg et al. argue that direct pooling is too crude, because directly merging information makes key information unrecognizable, and therefore propose a spatial transformer module that applies a corresponding spatial transformation to the spatial-domain information in the picture so that the key information can be extracted; Hu et al. consider that the contribution weight of the feature map differs for each channel, and therefore propose a squeeze-and-excitation network that adaptively recalibrates channel-wise feature responses by explicitly modeling the interdependencies between channels; although the channel attention mechanism shows great potential for improving the performance of deep convolutional neural networks, existing methods inevitably increase model complexity while obtaining better performance, so to overcome this contradiction between performance and complexity Wang et al. [7] propose an efficient channel attention module that maintains performance while significantly reducing model complexity; Fu et al. propose a dual attention mechanism which, unlike previous multi-scale feature fusion, extracts salient features with relevance from the spatial dimension and the channel dimension and adaptively integrates local features with their global dependencies.
Inspired by the attention mechanism, a mixed attention mechanism-based abnormal behavior recognition method is proposed, which uses the ability of the convolution block attention module to effectively extract spatial and channel information to highlight the salient features of the recognized object; meanwhile, an improved convolution feature extraction module mines hidden high-level semantic information layer by layer and combines it with the low-level information, further improving the classification performance of the network.
Disclosure of Invention
The invention aims to provide a human body abnormal behavior identification method and system based on a mixed attention mechanism, which can realize accurate identification of human body abnormal behaviors.
In order to achieve the purpose, the invention provides the following scheme:
a human body abnormal behavior identification method based on a mixed attention mechanism comprises the following steps:
extracting the features of the original image to obtain low-level detail features F;
screening the low-level detail features F to obtain main significant features F';
inputting the main significant features F' into a convolution feature extraction module to obtain high-level semantic features;
fusing the high-level semantic features and the low-level detail features to obtain fused features;
calculating the loss between the predicted value and the actual value of the training sample to obtain a loss value;
optimizing a training parameter based on the loss value;
training the neural network model based on the optimized training parameters and the fused features to obtain a trained abnormal behavior recognition model;
and identifying the abnormal behaviors of the human body based on the trained abnormal behavior identification model.
Optionally, screening the low-level detail feature F to obtain a main significant feature F ″ specifically includes:
inputting the low-level detail features F into a global average pooling layer and a maximum pooling layer of a space dimension, and sending the low-level detail features F into a shared network MLP to obtain first average pooling features and first maximum pooling features;
splicing the first average pooling characteristic and the first maximum pooling characteristic, and obtaining a weight coefficient Mc through a Sigmoid activation function;
multiplying the weight coefficient Mc with the low-level detail feature F to obtain a new feature F' after zooming;
inputting the new feature F' into an average pooling layer and a maximum pooling layer of the channel dimension to obtain a second average pooling feature and a second maximum pooling feature;
splicing the second average pooling characteristic and the second maximum pooling characteristic, and obtaining a weight coefficient Ms through a Sigmoid activation function;
and multiplying the weight coefficient Ms with the scaled new feature F' to obtain the main significant feature F″.
Optionally, inputting the main significant feature F ″ to a convolution feature extraction module to obtain a high-level semantic feature and fusing the high-level semantic feature and the low-level detail feature, where the obtaining of the fused feature specifically includes:
carrying out point-by-point convolution and depth separable convolution operations on the main significant features F' to obtain features G;
compressing the feature G by adopting global average pooling to obtain a compressed vector L;
carrying out excitation operation on the compressed vector L to obtain an output S;
weighting the output S to a characteristic G to obtain a re-calibrated characteristic I;
performing maximum pooling operation and average pooling operation on the re-calibrated characteristic I to obtain a maximum pooling characteristic and an average pooling characteristic;
splicing the maximum pooling feature and the average pooling feature, and generating a feature map Qs by convolution;
weighting the feature map Qs onto the feature I to complete feature recalibration, ending with a 1 × 1 point-by-point convolution to restore the original channel dimension, performing connection inactivation and an input skip connection, and fusing the high-level semantic features extracted by the convolution feature extraction module with the low-level detail features in a multi-level manner to obtain fused features.
Optionally, the loss between the predicted value and the actual value of the training sample is calculated, and the obtained loss value specifically adopts the following formula:
and y′ = (1-ε) × y + ε/k, wherein k represents the number of classes in the specific task, y represents the k-dimensional matrix composed of the k classes, ε represents the smoothing factor, and y′ represents the k-dimensional matrix composed of the k classes after label smoothing.
The invention further provides a human body abnormal behavior recognition system based on a mixed attention mechanism, which comprises:
the low-level detail feature extraction module is used for extracting features of the original image to obtain low-level detail features F;
the main significant feature screening module is used for screening the low-level detail features F to obtain main significant features F';
the high-level semantic feature extraction module is used for inputting the main significant features F' into the convolution feature extraction module to obtain high-level semantic features;
the feature fusion module is used for fusing the high-level semantic features and the low-level detail features to obtain fused features;
the loss value calculation module is used for calculating the loss between the predicted value and the actual value of the training sample to obtain a loss value;
the optimization module is used for optimizing the training parameters based on the loss values;
the training module is used for training the neural network model based on the optimized training parameters and the fused features to obtain a trained abnormal behavior recognition model;
and the abnormal behavior recognition module is used for recognizing the abnormal behavior of the human body based on the trained abnormal behavior recognition model.
Optionally, the module for screening for main significant features specifically includes:
the first average pooling feature and first maximum pooling feature extracting unit is used for inputting the low-level detail features F into a global average pooling layer and a maximum pooling layer of a space dimension and sending the low-level detail features F into a shared network MLP to obtain first average pooling features and first maximum pooling features;
the weight coefficient Mc calculating unit is used for splicing the first average pooling characteristic and the first maximum pooling characteristic and obtaining a weight coefficient Mc through a Sigmoid activation function;
the characteristic F 'determining unit is used for multiplying the weight coefficient Mc and the low-level detail characteristic F to obtain a new scaled characteristic F';
a second average pooling feature and second maximum pooling feature extracting unit, configured to input the new feature F' to an average pooling layer and a maximum pooling layer of a channel dimension to obtain a second average pooling feature and a second maximum pooling feature;
the weight coefficient Ms calculation unit is used for splicing the second average pooling characteristic and the second maximum pooling characteristic and obtaining a weight coefficient Ms through a Sigmoid activation function;
and the main significant feature F″ determination unit is used for multiplying the weight coefficient Ms with the scaled new feature F' to obtain the main significant feature F″.
Optionally, the high-level semantic feature extraction module and the feature fusion module specifically include:
a point-by-point convolution and depth separable convolution operation unit for performing point-by-point convolution and depth separable convolution operations on the main significant feature F' to obtain a feature G;
the compression unit is used for performing compression operation on the characteristic G by adopting global average pooling to obtain a compressed vector L;
the excitation operation unit is used for carrying out excitation operation on the compressed vector L to obtain an output S;
the recalibration unit is used for weighting the output S to the characteristic G to obtain a recalibrated characteristic I;
a maximum pooling operation and average pooling operation unit for performing maximum pooling operation and average pooling operation on the re-calibrated feature I to obtain a maximum pooling feature and an average pooling feature;
a splicing unit for splicing the maximum pooling feature and the average pooling feature and generating a feature map Qs by convolution;
a feature fusion unit for weighting the feature map Qs onto the feature I to complete feature recalibration, ending with a 1 × 1 point-by-point convolution to restore the original channel dimension, performing connection inactivation and an input skip connection, and fusing the high-level semantic features extracted by the convolution feature extraction module with the low-level detail features in a multi-level manner to obtain fused features.
Optionally, the loss value calculation module specifically adopts the following formula:
and y′ = (1-ε) × y + ε/k, wherein k represents the number of classes in the specific task, y represents the k-dimensional matrix composed of the k classes, ε represents the smoothing factor, and y′ represents the k-dimensional matrix composed of the k classes after label smoothing.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the method comprises the steps of extracting features through an original image to obtain low-level detail features F; screening the low-level detail features F to obtain main significant features F'; inputting the main significant features F' into a convolution feature extraction module to obtain high-level semantic features; fusing the high-level semantic features and the low-level detail features to obtain fused features; calculating the loss between the predicted value and the actual value of the training sample to obtain a loss value; optimizing a training parameter based on the loss value; training the neural network model based on the optimized training parameters and the fused features to obtain a trained abnormal behavior recognition model; the abnormal behaviors of the human body are identified based on the trained abnormal behavior identification model, so that the identification precision and effect of the abnormal behaviors of the human body are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic diagram of an abnormal behavior recognition framework of a hybrid attentive mechanism according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for recognizing abnormal human behavior based on a hybrid attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a convolution block attention module according to an embodiment of the present invention;
FIG. 4 is a diagram of a convolution feature extraction module according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a human body abnormal behavior recognition system based on a hybrid attention mechanism according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a human body abnormal behavior identification method and system based on a mixed attention mechanism, which can realize accurate identification of human body abnormal behaviors.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic diagram of an abnormal behavior recognition framework of a hybrid attention mechanism according to an embodiment of the present invention, and fig. 2 is a flowchart of a human body abnormal behavior recognition method based on the hybrid attention mechanism according to an embodiment of the present invention, as shown in fig. 1 and fig. 2, the method includes:
step 101: and (4) performing feature extraction on the original image to obtain low-level detail features F.
Specifically, the original picture is processed by a Stem module to obtain a feature F.
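The patent does not detail the internal layout of the Stem module; the following is a minimal PyTorch sketch of one plausible stem, where the class name, channel counts, and layer choices are illustrative assumptions only:

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Hypothetical stem block: maps the raw image to the low-level detail feature F."""
    def __init__(self, in_channels: int = 3, out_channels: int = 32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)  # low-level detail feature F
```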
Step 102: and screening the low-level detail feature F to obtain a main significant feature F'.
In order to enhance the significant features and reduce the attention paid to other information, a convolution block attention module is introduced to scale the low-level detail features F extracted in step 101 into new main significant features F″. The structure of the convolution block attention module is shown in FIG. 3, and the specific processing flow is as follows:
to effectively focus on meaningful channel features, channel attention is calculated. Firstly, respectively carrying out global average pooling and maximum pooling on the features F through a space dimension, then respectively sending the features F into a shared network MLP, splicing the two obtained features, and then obtaining a weight coefficient M through a Sigmoid activation functioncAnd finally the weighting factor McMultiplying the original input feature F to obtain a new feature F' after scaling, which is defined as:
Figure BDA0003044413220000071
Figure BDA0003044413220000072
wherein the content of the first and second substances,
Figure BDA0003044413220000073
and
Figure BDA0003044413220000074
mean pooling characteristic and maximum pooling characteristic are represented, respectively, and σ represents Sigmoid activation function.
To effectively focus on meaningful spatial features, spatial attention is computed. First, average pooling and maximum pooling are applied to the feature F' along the channel dimension, and the two resulting descriptors are spliced together; a convolution layer with a Sigmoid activation function then yields the weight coefficient Ms. Finally, the weight coefficient Ms is multiplied with the input feature F' to obtain the scaled new feature F″, which is defined as:
Ms(F') = σ(f([F_avg^s; F_max^s]))    (3)
F″ = Ms(F') ⊗ F'    (4)
where σ represents the Sigmoid activation function, f represents the convolution operation, F_avg^s and F_max^s represent the average-pooling feature and the maximum-pooling feature, respectively, ⊗ denotes the point-by-point multiplication of matrices, and F″ is the final output.
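The following PyTorch sketch illustrates the channel-then-spatial attention computation of equations (1)-(4); the reduction ratio, kernel size, and module name are illustrative assumptions rather than values taken from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlockAttention(nn.Module):
    """Sketch of the convolution block attention: channel weight Mc, then spatial weight Ms."""
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Shared MLP applied to both pooled channel descriptors (Eq. 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        # Convolution over the concatenated spatial descriptors (Eq. 3)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=spatial_kernel,
                                      padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: Mc = sigmoid(MLP(avg-pooled F) + MLP(max-pooled F))
        mc = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        f_prime = mc * x                                   # Eq. (2): F' = Mc(F) ⊗ F
        # Spatial attention: pool along the channel dimension, concatenate, convolve
        avg_s = f_prime.mean(dim=1, keepdim=True)
        max_s = f_prime.amax(dim=1, keepdim=True)
        ms = torch.sigmoid(self.spatial_conv(torch.cat([avg_s, max_s], dim=1)))
        return ms * f_prime                                # Eq. (4): F'' = Ms(F') ⊗ F'
```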
Step 103: and inputting the main significant features F' into a convolution feature extraction module to obtain high-level semantic features.
In order to further mine high-level semantic information and improve the feature extraction capability of the network model, a convolution feature extraction module is provided. The main significant feature F″ obtained in step 102 is input to obtain high-level semantic features, which are fused with the low-level detail features to enhance the interactivity of the network model. The structure of the convolution feature extraction module is shown in FIG. 4, and the specific processing flow is as follows:
First, a 1 × 1 point-by-point convolution is applied to the input F″ to change the output channel dimensionality according to an expansion ratio, and a depthwise separable convolution is then adopted, which effectively reduces the number of parameters while keeping the channels independent; this is defined as:
G=f2(f1(F″)) (5)
where F″ represents the input, G denotes the output feature map, f1(·) represents the point-by-point convolution, and f2(·) represents the depthwise separable convolution.
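A minimal sketch of the expansion step in equation (5), assuming an inverted-residual style layout in PyTorch; the expansion ratio, kernel size, and activation are illustrative:

```python
import torch.nn as nn

def expand_block(in_channels: int, expand_ratio: int = 4) -> nn.Sequential:
    """Eq. (5): pointwise 1x1 convolution f1 followed by a depthwise convolution f2."""
    mid_channels = in_channels * expand_ratio
    return nn.Sequential(
        # f1: 1x1 pointwise convolution expands the channel dimension by the expansion ratio
        nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
        nn.BatchNorm2d(mid_channels),
        nn.ReLU6(inplace=True),
        # f2: depthwise convolution, one filter per channel, keeps channels independent
        nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1,
                  groups=mid_channels, bias=False),
        nn.BatchNorm2d(mid_channels),
        nn.ReLU6(inplace=True),
    )
```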
Then, in order to obtain the global distribution of responses over the feature channels, global average pooling is used as the compression operation, turning the feature G into a 1 × 1 × C descriptor whose entries have, to some extent, a global receptive field. The formula is as follows:
L = Usq(G) = (1/(H × W)) Σ(i=1..H) Σ(j=1..W) G(i, j)    (6)
where Usq denotes the compression operation, L denotes the compressed vector, and H × W denotes the spatial size of the feature G.
Then, fully connected layers are adopted to form a Bottleneck structure that learns the correlation among channels, and a learnable parameter W is introduced to generate a weight for each feature channel; W is implemented with 1 × 1 convolutions that first reduce the global feature dimension by an activation ratio and then restore it. The formula is as follows:
S=Uex(L,W) (7)
where Uex represents the excitation operation, S is the output of the operation and can characterize the importance of the different features, and W adjusts the excitation operation through the scaling parameter.
The output weights of the excitation operation are regarded as the importance of each feature channel and are applied to the previous feature channel by channel through multiplication, completing the recalibration of the original feature in the channel dimension. The formula is as follows:
I=Uscale(G,S)=G·S (8)
where · denotes the multiplication operation and Uscale denotes the weight-assignment operation.
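A minimal PyTorch sketch of the squeeze (Eq. 6), excitation (Eq. 7), and channel recalibration (Eq. 8) steps; the reduction ratio of the bottleneck is an assumption:

```python
import torch
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    """Squeeze-and-excitation style recalibration over the feature G (Eqs. 6-8)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.excite = nn.Sequential(                       # Bottleneck parameterized by W (Eq. 7)
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        l = g.mean(dim=(2, 3), keepdim=True)   # Eq. (6): global average pooling over H x W
        s = self.excite(l)                     # Eq. (7): per-channel importance weights S
        return g * s                           # Eq. (8): I = Uscale(G, S), channel-wise reweighting
```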
In order to obtain deeper high-level feature information, a maximum pooling operation and an average pooling operation are respectively performed on the recalibrated feature I to effectively extract the distinctive features of the object; the pooled features are spliced, and a new feature map Qs is generated by convolution, defined as follows:
Qs(I) = σ(h([Iavg; Imax]))    (9)
where Iavg and Imax represent the average-pooling feature and the maximum-pooling feature, respectively, σ represents the Sigmoid activation function, and h(·) represents the convolution operation.
Finally, the feature map Qs is weighted onto the previous feature I; after this recalibration, the module ends with a 1 × 1 point-by-point convolution that restores the original channel dimension, followed by connection inactivation and an input skip connection, fusing the high-level semantic features extracted by the convolution feature extraction module with the low-level detail features in a multi-level manner and enhancing the interactivity. This is defined as:
Z = D(G'(U'scale(I·Qs)))    (10)
where D(·) denotes the skip connection, · denotes the multiplication operation, G'(·) denotes the convolution operation, U'scale denotes the weight assignment, and Z denotes the output feature.
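The following sketch covers equations (9)-(10): the spatial map Qs, the closing 1 × 1 pointwise projection back to the input channel width, connection inactivation, and the input skip connection. Treating the connection inactivation as spatial dropout is an assumption, as are the kernel size and dropout rate:

```python
import torch
import torch.nn as nn

class SpatialRecalibrateAndFuse(nn.Module):
    """Sketch of Eqs. (9)-(10): Qs weighting, 1x1 projection, dropout, and skip connection."""
    def __init__(self, mid_channels: int, out_channels: int, drop_p: float = 0.1):
        super().__init__()
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)        # h(.) in Eq. (9)
        self.project = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)  # restore channel dim
        self.bn = nn.BatchNorm2d(out_channels)
        self.drop = nn.Dropout2d(drop_p)       # assumed form of the "connection inactivation"

    def forward(self, i: torch.Tensor, module_input: torch.Tensor) -> torch.Tensor:
        avg = i.mean(dim=1, keepdim=True)
        mx = i.amax(dim=1, keepdim=True)
        qs = torch.sigmoid(self.spatial_conv(torch.cat([avg, mx], dim=1)))   # Eq. (9)
        z = self.bn(self.project(i * qs))      # recalibrate, then 1x1 point-by-point convolution
        return self.drop(z) + module_input     # Eq. (10): skip connection fuses the module input
```

Here module_input is the feature F″ fed into the convolution feature extraction module, so out_channels must match its channel count for the residual addition.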
Step 104: and fusing the high-level semantic features and the low-level detail features to obtain fused features.
Step 105: and calculating the loss between the predicted value and the actual value of the training sample to obtain a loss value.
The cross-entropy loss function is corrected by label smoothing: the loss curve becomes smooth and easy to differentiate, the gradient is stable, and the network generalizes better, ultimately producing more accurate predictions on unseen data and improving the accuracy of image classification.
y′=(1-ε)×y+ε/k (11)
where k represents the number of classes in the specific task, y represents the k-dimensional matrix composed of the k classes, ε represents the smoothing factor, and y′ represents the k-dimensional matrix composed of the k classes after label smoothing.
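A minimal sketch of the label-smoothing correction of equation (11) combined with the cross-entropy loss; the smoothing factor value is illustrative (newer PyTorch versions also expose this directly via the label_smoothing argument of nn.CrossEntropyLoss):

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits: torch.Tensor, target: torch.Tensor,
                           epsilon: float = 0.1) -> torch.Tensor:
    """Cross entropy against labels smoothed as y' = (1 - eps) * y + eps / k (Eq. 11)."""
    k = logits.size(-1)
    one_hot = F.one_hot(target, num_classes=k).float()
    y_smooth = (1.0 - epsilon) * one_hot + epsilon / k      # Eq. (11)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(y_smooth * log_probs).sum(dim=-1).mean()
```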
Step 106: and optimizing the training parameters based on the loss values.
Step 107: and training the neural network model based on the optimized training parameters and the fused features to obtain a trained abnormal behavior recognition model.
Step 108: and identifying the abnormal behaviors of the human body based on the trained abnormal behavior identification model.
And training the model based on the characteristic information, the model parameters and all the training samples to obtain a trained abnormal behavior recognition model, and recognizing and classifying the abnormal behaviors in all the test samples based on the obtained abnormal behavior model.
Each sample is passed through the softmax function to obtain the corresponding class probabilities, which sum to 1; the abnormal behavior category with the maximum probability is taken as the prediction.
Pi = exp(xi) / Σj exp(xj)    (12)
where Pi represents the probability that the predicted object belongs to the i-th class of behavior, xi is the raw network output for class i, exp(·) maps the raw output into (0, +∞), and the summation in the denominator runs over all classes.
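A short sketch of the final classification step using equation (12):

```python
import torch

def predict_behavior(logits: torch.Tensor) -> torch.Tensor:
    """Softmax over the class logits (Eq. 12); the most probable class is the predicted behavior."""
    probs = torch.softmax(logits, dim=-1)   # probabilities sum to 1 per sample
    return probs.argmax(dim=-1)             # index of the abnormal behavior category with maximum probability
```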
Fig. 5 is a schematic structural diagram of a human body abnormal behavior recognition system based on a hybrid attention mechanism according to an embodiment of the present invention, and as shown in fig. 5, the system includes:
the low-level detail feature extraction module 201 is configured to perform feature extraction on an original image to obtain a low-level detail feature F;
a main significant feature screening module 202, configured to screen the low-level detail feature F to obtain a main significant feature F ″;
the high-level semantic feature extraction module 203 is used for inputting the main significant features F' into the convolution feature extraction module to obtain high-level semantic features;
a feature fusion module 204, configured to fuse the high-level semantic features and the low-level detail features to obtain fused features;
a loss value calculation module 205, configured to calculate a loss between a predicted value and an actual value of the training sample to obtain a loss value;
an optimization module 206, configured to optimize a training parameter based on the loss value;
the training module 207 is configured to train the neural network model based on the optimized training parameters and the fused features to obtain a trained abnormal behavior recognition model;
and the abnormal behavior recognition module 208 is configured to recognize the abnormal behavior of the human body based on the trained abnormal behavior recognition model.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A human body abnormal behavior identification method based on a mixed attention mechanism is characterized by comprising the following steps:
extracting the features of the original image to obtain low-level detail features F;
screening the low-level detail features F to obtain main significant features F';
inputting the main significant features F' into a convolution feature extraction module to obtain high-level semantic features;
fusing the high-level semantic features and the low-level detail features to obtain fused features;
calculating the loss between the predicted value and the actual value of the training sample to obtain a loss value;
optimizing a training parameter based on the loss value;
training the neural network model based on the optimized training parameters and the fused features to obtain a trained abnormal behavior recognition model;
and identifying the abnormal behaviors of the human body based on the trained abnormal behavior identification model.
2. The method for recognizing the abnormal human behavior based on the mixed attention mechanism according to claim 1, wherein the step of screening the low-level detail features F to obtain the main significant features F ″ specifically comprises the steps of:
inputting the low-level detail features F into a global average pooling layer and a maximum pooling layer of a space dimension, and sending the low-level detail features F into a shared network MLP to obtain first average pooling features and first maximum pooling features;
splicing the first average pooling characteristic and the first maximum pooling characteristic, and obtaining a weight coefficient Mc through a Sigmoid activation function;
multiplying the weight coefficient Mc with the low-level detail feature F to obtain a new feature F' after zooming;
inputting the new feature F' into an average pooling layer and a maximum pooling layer of the channel dimension to obtain a second average pooling feature and a second maximum pooling feature;
splicing the second average pooling characteristic and the second maximum pooling characteristic, and obtaining a weight coefficient Ms through a Sigmoid activation function;
and multiplying the weight coefficient Ms with the scaled new feature F' to obtain the main significant feature F″.
3. The method for recognizing the abnormal human behavior based on the mixed attention mechanism as claimed in claim 1, wherein the step of inputting the main significant features F "to a convolution feature extraction module to obtain high-level semantic features and the step of fusing the high-level semantic features and the low-level detail features to obtain fused features specifically comprises the steps of:
carrying out point-by-point convolution and depth separable convolution operations on the main significant features F' to obtain features G;
compressing the feature G by adopting global average pooling to obtain a compressed vector L;
carrying out excitation operation on the compressed vector L to obtain an output S;
weighting the output S to a characteristic G to obtain a re-calibrated characteristic I;
performing maximum pooling operation and average pooling operation on the re-calibrated characteristic I to obtain a maximum pooling characteristic and an average pooling characteristic;
splicing the maximum pooling feature and the average pooling feature, and generating a feature map Qs by convolution;
weighting the feature map Qs onto the feature I to complete feature recalibration, ending with a 1 × 1 point-by-point convolution to restore the original channel dimension, performing connection inactivation and an input skip connection, and fusing the high-level semantic features extracted by the convolution feature extraction module with the low-level detail features in a multi-level manner to obtain fused features.
4. The method for recognizing the abnormal human behavior based on the mixed attention mechanism as claimed in claim 1, wherein the loss between the predicted value and the actual value of the training sample is calculated, and the following formula is specifically adopted to obtain the loss value:
and y′ = (1-ε) × y + ε/k, wherein k represents the number of classes in the specific task, y represents the k-dimensional matrix composed of the k classes, ε represents the smoothing factor, and y′ represents the k-dimensional matrix composed of the k classes after label smoothing.
5. A human body abnormal behavior recognition system based on a mixed attention mechanism is characterized in that the recognition system comprises:
the low-level detail feature extraction module is used for extracting features of the original image to obtain low-level detail features F;
the main significant feature screening module is used for screening the low-level detail features F to obtain main significant features F';
the high-level semantic feature extraction module is used for inputting the main significant features F' into the convolution feature extraction module to obtain high-level semantic features;
the feature fusion module is used for fusing the high-level semantic features and the low-level detail features to obtain fused features;
the loss value calculation module is used for calculating the loss between the predicted value and the actual value of the training sample to obtain a loss value;
the optimization module is used for optimizing the training parameters based on the loss values;
the training module is used for training the neural network model based on the optimized training parameters and the fused features to obtain a trained abnormal behavior recognition model;
and the abnormal behavior recognition module is used for recognizing the abnormal behavior of the human body based on the trained abnormal behavior recognition model.
6. The system for recognizing the abnormal human behavior based on the mixed attention mechanism as claimed in claim 5, wherein the main significant feature screening module specifically comprises:
the first average pooling feature and first maximum pooling feature extracting unit is used for inputting the low-level detail features F into a global average pooling layer and a maximum pooling layer of a space dimension and sending the low-level detail features F into a shared network MLP to obtain first average pooling features and first maximum pooling features;
the weight coefficient Mc calculating unit is used for splicing the first average pooling characteristic and the first maximum pooling characteristic and obtaining a weight coefficient Mc through a Sigmoid activation function;
the characteristic F 'determining unit is used for multiplying the weight coefficient Mc and the low-level detail characteristic F to obtain a new scaled characteristic F';
a second average pooling feature and second maximum pooling feature extracting unit, configured to input the new feature F' to an average pooling layer and a maximum pooling layer of a channel dimension to obtain a second average pooling feature and a second maximum pooling feature;
the weight coefficient Ms calculation unit is used for splicing the second average pooling characteristic and the second maximum pooling characteristic and obtaining a weight coefficient Ms through a Sigmoid activation function;
and the main significant feature F″ determination unit is used for multiplying the weight coefficient Ms with the scaled new feature F' to obtain the main significant feature F″.
7. The system for recognizing the abnormal human behavior based on the mixed attention mechanism as claimed in claim 5, wherein the high-level semantic feature extraction module and the feature fusion module specifically comprise:
a point-by-point convolution and depth separable convolution operation unit for performing point-by-point convolution and depth separable convolution operations on the main significant feature F' to obtain a feature G;
the compression unit is used for performing compression operation on the characteristic G by adopting global average pooling to obtain a compressed vector L;
the excitation operation unit is used for carrying out excitation operation on the compressed vector L to obtain an output S;
the recalibration unit is used for weighting the output S to the characteristic G to obtain a recalibrated characteristic I;
a maximum pooling operation and average pooling operation unit for performing maximum pooling operation and average pooling operation on the re-calibrated feature I to obtain a maximum pooling feature and an average pooling feature;
a splicing unit for splicing the maximum pooling feature and the average pooling feature and generating a feature map Qs by convolution;
a feature fusion unit for weighting the feature map Qs onto the feature I to complete feature recalibration, ending with a 1 × 1 point-by-point convolution to restore the original channel dimension, performing connection inactivation and an input skip connection, and fusing the high-level semantic features extracted by the convolution feature extraction module with the low-level detail features in a multi-level manner to obtain fused features.
8. The system for recognizing abnormal human behavior based on the mixed attention mechanism as claimed in claim 5, wherein the loss value calculating module specifically adopts the following formula:
and y′ = (1-ε) × y + ε/k, wherein k represents the number of classes in the specific task, y represents the k-dimensional matrix composed of the k classes, ε represents the smoothing factor, and y′ represents the k-dimensional matrix composed of the k classes after label smoothing.
CN202110468555.9A 2021-04-28 2021-04-28 Human body abnormal behavior identification method and system based on mixed attention mechanism Active CN113516028B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110468555.9A CN113516028B (en) 2021-04-28 2021-04-28 Human body abnormal behavior identification method and system based on mixed attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110468555.9A CN113516028B (en) 2021-04-28 2021-04-28 Human body abnormal behavior identification method and system based on mixed attention mechanism

Publications (2)

Publication Number Publication Date
CN113516028A true CN113516028A (en) 2021-10-19
CN113516028B CN113516028B (en) 2024-01-19

Family

ID=78063994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110468555.9A Active CN113516028B (en) 2021-04-28 2021-04-28 Human body abnormal behavior identification method and system based on mixed attention mechanism

Country Status (1)

Country Link
CN (1) CN113516028B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114745175A (en) * 2022-04-11 2022-07-12 中国科学院信息工程研究所 Attention mechanism-based network malicious traffic identification method and system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN110222653A (en) * 2019-06-11 2019-09-10 中国矿业大学(北京) A kind of skeleton data Activity recognition method based on figure convolutional neural networks
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
WO2020113886A1 (en) * 2018-12-07 2020-06-11 中国科学院自动化研究所 Behavior feature extraction method, system and apparatus based on time-space/frequency domain hybrid learning
CN111626171A (en) * 2020-05-21 2020-09-04 青岛科技大学 Group behavior identification method based on video segment attention mechanism and interactive relation activity diagram modeling
CN112307958A (en) * 2020-10-30 2021-02-02 河北工业大学 Micro-expression identification method based on spatiotemporal appearance movement attention network
CN112307982A (en) * 2020-11-02 2021-02-02 西安电子科技大学 Human behavior recognition method based on staggered attention-enhancing network

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
WO2020113886A1 (en) * 2018-12-07 2020-06-11 中国科学院自动化研究所 Behavior feature extraction method, system and apparatus based on time-space/frequency domain hybrid learning
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN110222653A (en) * 2019-06-11 2019-09-10 中国矿业大学(北京) A kind of skeleton data Activity recognition method based on figure convolutional neural networks
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN111626171A (en) * 2020-05-21 2020-09-04 青岛科技大学 Group behavior identification method based on video segment attention mechanism and interactive relation activity diagram modeling
CN112307958A (en) * 2020-10-30 2021-02-02 河北工业大学 Micro-expression identification method based on spatiotemporal appearance movement attention network
CN112307982A (en) * 2020-11-02 2021-02-02 西安电子科技大学 Human behavior recognition method based on staggered attention-enhancing network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAI, B: "Research on Detection Method of Abnormal Behavior of People in Video Surveillance", 2018 5TH INTERNATIONAL CONFERENCE ON ELECTRICAL & ELECTRONICS ENGINEERING AND COMPUTER SCIENCE (ICEEECS 2018), pages 289 - 293 *
余阿祥, 李承润: "Mask detection network with multiple attention mechanisms", Journal of Nanjing Normal University (Engineering and Technology Edition), pages 23 - 29 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114745175A (en) * 2022-04-11 2022-07-12 中国科学院信息工程研究所 Attention mechanism-based network malicious traffic identification method and system
CN114745175B (en) * 2022-04-11 2022-12-23 中国科学院信息工程研究所 Network malicious traffic identification method and system based on attention mechanism

Also Published As

Publication number Publication date
CN113516028B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
CN111626350B (en) Target detection model training method, target detection method and device
CN109597997B (en) Comment entity and aspect-level emotion classification method and device and model training thereof
CN111126258B (en) Image recognition method and related device
CN111061843A (en) Knowledge graph guided false news detection method
CN112016500A (en) Group abnormal behavior identification method and system based on multi-scale time information fusion
CN109190472B (en) Pedestrian attribute identification method based on image and attribute combined guidance
CN111626116A (en) Video semantic analysis method based on fusion of multi-attention mechanism and Graph
CN116994069B (en) Image analysis method and system based on multi-mode information
CN115564766B (en) Preparation method and system of water turbine volute seat ring
CN115131638B (en) Training method, device, medium and equipment for visual text pre-training model
CN116699297B (en) Charging pile detection system and method thereof
CN115761900B (en) Internet of things cloud platform for practical training base management
CN114978613B (en) Network intrusion detection method based on data enhancement and self-supervision feature enhancement
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN111008570B (en) Video understanding method based on compression-excitation pseudo-three-dimensional network
CN113516028A (en) Human body abnormal behavior identification method and system based on mixed attention mechanism
CN110458215A (en) Pedestrian's attribute recognition approach based on multi-time Scales attention model
CN115909374A (en) Information identification method, device, equipment, storage medium and program product
CN117475236A (en) Data processing system and method for mineral resource exploration
CN112163494A (en) Video false face detection method and electronic device
CN115393927A (en) Multi-modal emotion emergency decision system based on multi-stage long and short term memory network
CN114550297A (en) Pedestrian intention analysis method and system
CN114241253A (en) Model training method, system, server and storage medium for illegal content identification
CN116030507A (en) Electronic equipment and method for identifying whether face in image wears mask
CN113205044A (en) Deep counterfeit video detection method based on characterization contrast prediction learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant