CN117649701A - Human behavior recognition method and system based on multi-scale attention mechanism

Info

Publication number: CN117649701A
Application number: CN202410116542.9A
Authority: CN
Language: Chinese (zh)
Inventor: 魏鹏鹏
Current assignee: Jiangxi University of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Jiangxi University of Technology (application filed by Jiangxi University of Technology)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Prior art keywords: motion, attention, soft, human, displacement

Classifications

    • G06V 40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06N 3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Neural-network learning methods
    • G06V 10/25: Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V 10/454: Local feature extraction with biologically inspired filters integrated into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using neural networks


Abstract

The application provides a human behavior recognition method and system based on a multi-scale attention mechanism. The method comprises the following steps: acquiring human joint motion information from human behavior data and writing it into a three-dimensional matrix to obtain a three-dimensional feature map; performing data enhancement on the three-dimensional feature map to generate an action sample data set; constructing a multi-scale convolutional neural network comprising a spatial pyramid pooling layer and a global average pooling layer, and inputting the action sample data set into the network to extract local and global action features; constructing a soft attention mechanism to obtain human behavior features containing attention weights; and inputting the attention-weighted features into a classifier to recognize the human behavior. The method and system effectively address three problems that reduce human behavior recognition accuracy: the weak representational power of raw skeleton coordinate data, the influence of the sampling camera's position during data acquisition, and the intra-class and inter-class differences among action samples.

Description

Human behavior recognition method and system based on multi-scale attention mechanism
Technical Field
The application relates to the technical field of human behavior recognition, in particular to a human behavior recognition method and system based on a multi-scale attention mechanism.
Background
With the rapid development of internet technology and the continued deepening of artificial intelligence research, computer vision technology has also advanced rapidly and now plays a vital role in image classification, target detection, human behavior recognition and related fields. As artificial intelligence and behavior recognition technology progress, applications of human behavior recognition in sports and fitness, intelligent healthcare, smart homes and other fields are gradually emerging, so research on behavior recognition technology has important academic value and social significance.
Understanding and describing human behavior is a research hotspot that attracts wide attention in pattern recognition, machine learning and computer vision. Human behavior recognition technology mainly processes and analyzes human actions in a sequence of video frames: through moving-target detection and classification, the individual and interactive behaviors of the people in the frame sequence are understood and recognized, and then expressed in natural-language form.
However, current human behavior recognition research is limited by the weak representational power of raw skeleton coordinate data and by the adverse effect of the sampling camera's position during data acquisition, making ideal experimental results difficult to obtain. Meanwhile, the duration of each action sample is not fixed, action samples exhibit both intra-class and inter-class differences, and existing algorithms suffer from drawbacks such as complex structure and low efficiency. All of these factors reduce human behavior recognition accuracy.
Disclosure of Invention
Based on the above, the application provides a human behavior recognition method and system based on a multi-scale attention mechanism, aiming to solve the reduction in human behavior recognition accuracy caused by the weak representational power of raw skeleton coordinate data, the influence of the sampling camera's position during data acquisition, and the intra-class and inter-class differences among action samples.
A first aspect of the embodiments of the present application provides a human behavior recognition method based on a multi-scale attention mechanism, including:
acquiring human behavior data, and marking human joint motion information in the human behavior data, wherein the human joint motion information comprises instantaneous motion displacement, instantaneous motion direction and relative motion displacement, and inputting the instantaneous motion displacement, the instantaneous motion direction and the relative motion displacement into a three-dimensional matrix to obtain a three-dimensional feature map;
performing data enhancement on the three-dimensional feature map to generate an action sample data set;
constructing a multi-scale convolutional neural network, wherein the multi-scale convolutional neural network comprises a spatial pyramid pooling layer and a global average pooling layer, and the action sample data set is input into the spatial pyramid pooling layer and the global average pooling layer so as to extract local action features and global action features respectively;
constructing a soft attention mechanism, wherein the soft attention mechanism comprises soft-pooling channel attention and soft-pooling spatial attention, and inputting the local action features and the global action features into the soft-pooling channel attention and soft-pooling spatial attention to obtain human behavior features containing attention weights;
and inputting the human behavior characteristics containing the attention weight into a classifier to obtain the category of the human behavior.
Compared with the prior art, the human behavior recognition method based on the multi-scale attention mechanism provided by the application proceeds as follows. First, following the classical-physics definition of motion and using the three-dimensional coordinates of each joint point of the human skeleton, loss of data precision is avoided and the original motion data are retained to the greatest extent, while new motion samples are generated by equal-proportion scaling and mirroring of the human skeleton. On this basis, the convolutional neural network model is modified from the angles of a multi-scale learning strategy and an attention mechanism. Because both the spatial pyramid pooling layer and the global average pooling layer can ignore the size of the input data and convert the feature map into feature vectors of the same size, the application first proposes a transformation strategy for the multi-scale convolutional neural network model: integrate the spatial pyramid pooling layer and the global average pooling layer and concatenate their output features to obtain a multi-scale convolutional neural network, so that global features are preserved and the model's ability to distinguish similar behaviors is improved. Second, the model is further upgraded by adding an improved mixed attention module, ensuring that it pays more attention to the features that are effective for behavior classification and ignores irrelevant ones. In this way, the reduction in human behavior recognition accuracy caused by the weak representational power of raw skeleton coordinate data, the influence of the sampling camera's position during data acquisition, and the intra-class and inter-class differences among action samples can be effectively addressed.
As an optional implementation manner of the first aspect, the step of labeling human body joint movement information in the human body behavior data includes:
the spine-center joint point among the human skeleton joint points is used in place of the depth camera as the new coordinate origin, and the relative coordinates of the remaining joint points are obtained by subtracting the coordinates of the spine-center joint point; the relative coordinate calculation formula is:

$c_r^n = c^n - c^{\mathrm{spine}}$

where $c^{\mathrm{spine}}$ denotes the three-dimensional coordinates of the spine-center joint point, $c^n$ denotes the coordinates of a joint point other than the spine center, $c_r^n$ denotes its coordinates relative to the spine center, and the subscript $r$ marks the relative-coordinate representation.
As an optional implementation of the first aspect, the instantaneous motion displacement, instantaneous motion direction and relative motion displacement in the human joint motion information are obtained as follows:

the change in a joint point's position between two adjacent frames represents the instantaneous motion displacement:

$d_t^n = f_t^n - f_{t-1}^n$

where $f_t^n - f_{t-1}^n$ is the motion displacement between two adjacent frames, $d_t^n$ is the instantaneous motion displacement at frame $t$, $n \in \{1, \dots, N\}$ indexes the joint points of the human skeleton, and the motion sample comprises $T$ frames;

the motion direction of the joint point between two adjacent frames represents the instantaneous motion direction:

$P_{XY} = \arctan\frac{\Delta y}{\Delta x}, \quad P_{XZ} = \arctan\frac{\Delta z}{\Delta x}, \quad P_{YZ} = \arctan\frac{\Delta z}{\Delta y}$

where $\{\Delta x, \Delta y, \Delta z\}$ are the displacements along the x-, y- and z-axes of the current frame compared with the previous frame, and $\{P_{XY}, P_{XZ}, P_{YZ}\}$ are the projections of the instantaneous motion direction onto the xy, xz and yz planes;

the total displacement of the joint point in unit time represents the relative motion displacement:

$D_t^n = f_t^n - f_1^n$

where $f_t^n - f_1^n$ is the total motion displacement between the current frame and the first frame, and $D_t^n$ is the relative motion displacement at frame $t$.
As an optional implementation manner of the first aspect, the step of performing data enhancement on the three-dimensional feature map to generate an action sample data set includes:
reading preset joint points of each frame in the three-dimensional feature map, scaling the joint points in equal proportion according to preset proportion, and storing the joint points in an action sample data set;
the spine of the human skeleton is used as the mirror plane for a symmetry operation, yielding new left-right-symmetric data that are stored in the action sample data set; the symmetry operation is expressed as:

$\tilde{c}^n = p(c^n), \quad p(x, y, z) = (-x, y, z)$

where $c^n$ denotes the coordinates of any joint point outside the spine (expressed relative to the spine plane), $\tilde{c}^n$ denotes its mirrored coordinates, and $p$ denotes the symmetry operation.
As an optional implementation manner of the first aspect, the constructing a multi-scale convolutional neural network, where the multi-scale convolutional neural network includes a spatial pyramid pooling layer and a global average pooling layer, and the step of inputting the motion sample dataset into the spatial pyramid pooling layer and the global average pooling layer to extract local motion features and global motion features respectively includes:
the outputs of the spatial pyramid pooling layer and the global average pooling layer are connected in series, so that the extracted local action features and global action features are fused.
As an optional implementation of the first aspect, the calculation formula of the soft-pooling channel attention is:

$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)) + \mathrm{MLP}(\mathrm{SoftPool}(F))\big)$

where $F$ denotes the three-dimensional feature map, $M_C(F)$ denotes the weight coefficients output for $F$ by the soft-pooling channel attention, $\sigma$ denotes the activation function, MLP denotes a multi-layer perceptron, AvgPool denotes average pooling, MaxPool denotes maximum pooling, and SoftPool denotes soft pooling.
As an optional implementation of the first aspect, the calculation formula of the soft-pooling spatial attention is:

$M_S(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F); \mathrm{SoftPool}(F)])\big)$

where $F$ denotes the three-dimensional feature map, $M_S(F)$ denotes the weight coefficients output for $F$ by the soft-pooling spatial attention, $\sigma$ denotes the activation function, $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel, AvgPool denotes average pooling, MaxPool denotes maximum pooling, and SoftPool denotes soft pooling.
A second aspect of embodiments of the present application provides a human behavior recognition system based on a multi-scale attention mechanism, including:
the acquisition data module is used for acquiring human behavior data and labeling human joint motion information in the human behavior data, wherein the human joint motion information comprises instantaneous motion displacement, instantaneous motion direction and relative motion displacement, and the instantaneous motion displacement, the instantaneous motion direction and the relative motion displacement are input into a three-dimensional matrix to obtain a three-dimensional feature map;
The data enhancement module is used for carrying out data enhancement on the three-dimensional feature map so as to generate an action sample data set;
the multi-scale feature extraction module is used for constructing a multi-scale convolutional neural network, the multi-scale convolutional neural network comprises a spatial pyramid pooling layer and a global average pooling layer, and the action sample data set is input into the spatial pyramid pooling layer and the global average pooling layer so as to extract local action features and global action features respectively;
the soft attention distribution weight module is used for constructing a soft attention mechanism, wherein the soft attention mechanism comprises soft pooled channel attention and soft pooled space attention, and the local action feature and the global action feature are input into the soft pooled channel attention and soft pooled space attention so as to obtain human behavior features containing attention weights;
and the behavior recognition and classification module is used for inputting the human behavior characteristics containing the attention weight into a classifier so as to obtain the category of the human behavior.
A third aspect of the embodiments of the present application provides a storage medium storing one or more programs that when executed by a processor implement the human behavior recognition method described above.
A fourth aspect of the embodiments of the present application provides a computer device comprising a memory and a processor, where the memory is configured to store a computer program and the processor is configured to implement the above human behavior recognition method when executing the computer program stored in the memory.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
FIG. 1 is a flowchart of a human behavior recognition method based on a multi-scale attention mechanism according to an embodiment of the present application;
FIG. 2 is a comparison of 3 frames selected from a phone-call action sample in one embodiment of the present application, where (a) is based on the original coordinate data and (b) is based on the relative coordinate data;
FIG. 3 is a schematic diagram of a GM layer according to an embodiment of the present application;
FIG. 4 is a flowchart of another human behavior recognition method based on a multi-scale attention mechanism according to an embodiment of the present application;
FIG. 5 is a process diagram showing a quantitative representation of behavior based on skeleton nodes in an embodiment of the present application;
FIG. 6 is a flowchart of another human behavior recognition method based on a multi-scale attention mechanism according to an embodiment of the present application;
FIG. 7 is a flowchart of another human behavior recognition method based on a multi-scale attention mechanism according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the channel attention after soft pooling is added in an embodiment of the present application;
FIG. 9 is a schematic diagram of the spatial attention after soft pooling is added in an embodiment of the present application;
fig. 10 is a schematic structural diagram of a human behavior recognition system based on a multi-scale attention mechanism according to a third embodiment of the present application;
fig. 11 is an experimental flowchart of a human behavior recognition method and system based on a multi-scale attention mechanism according to the present application.
The following detailed description will further illustrate the application in conjunction with the above-described figures.
Detailed Description
In order to facilitate an understanding of the present application, a more complete description of the present application will now be provided with reference to the relevant figures. Several embodiments of the present application are presented in the accompanying drawings. This application may, however, be embodied in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In order to illustrate the technical solutions described in the present application, the following description is made by specific examples.
Referring to fig. 1, a flowchart of a human behavior recognition method based on a multi-scale attention mechanism according to an embodiment of the present application is shown, and the detailed description is as follows:
step S1, acquiring human body behavior data, marking human body joint motion information in the human body behavior data, wherein the human body joint motion information comprises instantaneous motion displacement, instantaneous motion direction and relative motion displacement, and inputting the instantaneous motion displacement, the instantaneous motion direction and the relative motion displacement into a three-dimensional matrix to obtain a three-dimensional feature map.
Currently, all three-dimensional coordinate data sets in the behavior recognition field default to the position of a depth camera as the origin of the three-dimensional coordinates. However, there is a possibility that a slight shift in the position of the camera occurs during the process of collecting the sample data, thereby causing the sample data to be contaminated.
Therefore, to avoid the adverse effect of depth-camera shake on the whole experiment, as an example, the spine-center joint among the human skeleton joints is used as the new coordinate origin instead of the depth camera, and the relative coordinates of the remaining joints are obtained by subtracting the coordinates of the spine-center joint:

$c_r^n = c^n - c^{\mathrm{spine}}$

where $c^{\mathrm{spine}}$ denotes the three-dimensional coordinates of the spine-center joint point, $c^n$ denotes the coordinates of a joint point other than the spine center, $c_r^n$ denotes its relative coordinates, and the subscript $r$ marks the relative-coordinate representation.
The relative coordinate data obtained in this way generalize better than the original coordinate data.
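As a concrete illustration, the following is a minimal Python sketch of this transform, assuming the joints of one sample are stored as a NumPy array of shape (T, N, 3) and that the spine-center joint sits at an assumed index SPINE_CENTER (the patent does not fix a storage layout):

```python
import numpy as np

SPINE_CENTER = 0  # assumed index of the spine-center joint along the joint axis

def to_relative_coordinates(joints: np.ndarray) -> np.ndarray:
    """Re-express joint coordinates relative to the spine-center joint.

    joints: array of shape (T, N, 3) -- T frames, N joints, (x, y, z) each.
    Returns the same shape, with the spine-center coordinates subtracted
    from every joint in every frame.
    """
    spine = joints[:, SPINE_CENTER:SPINE_CENTER + 1, :]  # (T, 1, 3)
    return joints - spine  # broadcast subtraction per frame
```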
Illustratively, an action sample of making a phone call is selected from a public data set (such as the Florence-3D data set), and 3 of its frames are used to draw a comparison image, as shown in FIG. 2, where (a) is based on the original coordinate data and (b) is based on the relative coordinate data.
An action can be understood as the spatial variation, in unit time, of the main joint points of the human body and of the skeleton connected to them. Thus, the features of an action can be represented by tracking the frame-to-frame spatial variation of the human joint points. Preliminary verification shows that the various motion quantities of the joints (instantaneous motion displacement, instantaneous motion direction and relative motion displacement) can be accurately extracted by frame-by-frame tracking and compactly stored in a three-dimensional matrix.
In the human behavior recognition field, spine (Spine) joint points in skeleton nodes are used as new coordinate origins instead of cameras (default coordinate origins), and relative coordinates of the rest joint points are obtained by making differences with the Spine node coordinates, and the method has the advantages that:
Stability enhancement: depth camera shake is a common problem, especially in dynamic environments. The use of a Spine joint as a new origin of coordinates may reduce the impact of such jitter on the coordinate system, as the Spine joint is generally relatively stable and not as susceptible to external disturbances as a camera.
The identification accuracy is improved: by calculating the coordinate difference relative to the Spine node, the motion of each joint of the human body can be more accurately captured and described. The method is helpful for reducing errors caused by movement of the camera, so that accuracy of behavior recognition is improved.
Adaptability enhancement: the use of a coordinate system relative to the Spine node means that the recognition system can more easily adapt to different camera settings and environmental conditions. Regardless of how the camera moves or shakes, the system can continue to effectively perform behavior recognition as long as the Spine node can be accurately captured.
Simplifying data processing: the Spine node is used as a new origin of coordinates, so that the data processing process can be simplified. The use of a Spine node as a reference point reduces the amount of data that needs to be processed and reduces the complexity of the algorithm relative to a fixed but possibly dithered camera coordinate system.
Enhancing real-time performance: due to the fact that data processing amount and algorithm complexity are reduced, the speed of behavior recognition can be increased by using the Spine node as the origin of coordinates, and real-time performance of the system is improved. This is particularly important for application scenarios (such as security monitoring, human-computer interaction, etc.) where a fast response is required.
In summary, the Spine joint point in the skeleton node is used as a new origin of coordinates, which has many advantages in the human behavior recognition field, can effectively cope with challenges such as camera shake, and improves the stability, accuracy and real-time performance of the recognition system.
Step S2: and carrying out data enhancement on the three-dimensional characteristic map to generate an action sample data set.
It should be noted that, for deep learning, a small number of samples is not enough to train a neural network model and leads to overfitting; the phenomenon is all the more serious when the model is complex (too many layers, or layers that are too wide). To address this overfitting problem, the application proposes two data enhancement algorithms for human behavior, used to generate new sample data that meet the training requirements.
The advantages of this approach include mainly the following:
Angle diversity: data enhancement that relies only on zooming in and out can simulate human behaviors at different distances and scales, but it lacks angle variation. The new strategy increases the angular diversity of the samples by rotating the skeleton, so the model can learn the behavior features of the human body in different orientations.
More comprehensive feature learning: the rotating skeleton may help the model capture some behavioral features that are difficult to observe at a fixed viewing angle. For example, the recognition of certain actions may require viewing the sides or back of the human body, while these perspectives may be less likely to occur in an unrotated dataset.
The method is suitable for actual application scenes: in practical applications, human behavior recognition systems may face input data from various angles. By introducing the rotated samples in the training stage, the model can be better adapted to the actual application scenes, and the recognition accuracy is improved.
The shortcomings of the non-rotated approach, which the rotation strategy addresses, include the following:
fixed viewing angle bias: if the training data comes primarily from a single or limited view, the model may prejudice these views, resulting in reduced recognition performance at other views. By rotating the skeleton to generate new samples, this fixed viewing angle bias can be eliminated to some extent.
Lack of directional sensitivity: the non-rotated data enhancement may result in the model being direction insensitive, i.e., unable to distinguish between forward and reverse motion (e.g., waving and stroking). Rotating the sample can help the model learn the directional sensitivity, better distinguishing such actions.
Limited data utilization efficiency: under the original strategy of only using zoom-in and zoom-out, the data utilization efficiency may not be high, because the model may not fully utilize the skeleton structure information. After introducing the rotation strategy, useful information in these data can be mined and exploited through a wider variety of perspectives.
In general, the newly added rotation strategy can improve the angle diversity and feature learning capability of the model, and meanwhile, the problems of prejudice of a fixed visual angle and insufficient direction sensitivity are solved, so that the overall performance of the human behavior recognition system is improved.
Step S3: the method comprises the steps of constructing a multi-scale convolutional neural network, wherein the multi-scale convolutional neural network comprises a spatial pyramid pooling layer and a global average pooling layer, and inputting the action sample data set into the spatial pyramid pooling layer and the global average pooling layer to extract local action features and global action features respectively.
It should be noted that, for any convolutional neural network, the feature map output by the last convolutional layer must have a fixed dimension, otherwise it cannot serve as the input of the fully connected layer; this requires all input images of the network to have the same size. In practice, however, input images cannot be guaranteed to meet the required size, so they must be preprocessed to a common size. In the field of object detection, the most common preprocessing is to scale or crop the input image. However, stretching easily changes the relative positions of elements in the picture and alters the original aspect ratio and size, while cropping directly discards original information; both irreversibly affect the picture.
Therefore, in order to break the limitation that image size imposes on convolutional neural networks, the application proposes a new multi-scale transformation strategy, the GM layer (Global Multi-scale Layer), which combines a spatial pyramid pooling layer (Spatial Pyramid Pooling Layer, SPP) and a global average pooling layer (Global Average Pooling Layer, GAP).
Specifically, FIG. 3 is a schematic diagram of the GM layer's structure. The three pooling windows of the SPP layer are set to sizes [1, 2, 4], so that each feature map is mapped to 21 feature values (1 + 4 + 16), and the outputs of the spatial pyramid pooling layer and the global average pooling layer are connected in series so that the extracted local action features and global action features are fused.
Therefore, the GM layer not only contains the characteristics of the GAP layer and the SPP layer in feature mapping, but also has no limitation on the size of input data, so that the network model has the multi-scale learning capability, and higher recognition accuracy can be obtained while the model training efficiency is improved.
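A minimal PyTorch sketch of such a GM layer follows; the pyramid levels [1, 2, 4] come from the text above, while the use of adaptive max pooling inside the SPP branch is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMLayer(nn.Module):
    """GM layer sketch: SPP branch (levels 1, 2, 4 -> 1 + 4 + 16 = 21 values
    per channel) concatenated with a global-average-pooling branch, so any
    input spatial size maps to a fixed-length vector."""

    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                                      # x: (B, C, H, W)
        feats = [F.adaptive_max_pool2d(x, k).flatten(1) for k in self.levels]
        feats.append(F.adaptive_avg_pool2d(x, 1).flatten(1))   # GAP branch
        return torch.cat(feats, dim=1)                         # (B, C * 22)
```

For example, GMLayer()(torch.randn(2, 64, 13, 17)) and GMLayer()(torch.randn(2, 64, 9, 9)) both yield tensors of shape (2, 1408), which is what lets the fully connected layer accept inputs of any size.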
Step S4: and constructing a soft attention mechanism, wherein the soft attention mechanism comprises soft pooled channel attention and soft pooled space attention, and inputting the local action characteristic and the global action characteristic into the soft pooled channel attention and soft pooled space attention to obtain human behavior characteristics containing attention weights.
Note that, the attention mechanism may dynamically allocate resources according to the importance degree of the task. Under the attention mechanism, the computer can process important tasks preferentially and ignore unimportant tasks, so that the computing efficiency and accuracy are improved. In the field of computer vision, existing scientific researchers combine the attention mechanism with a network model, and the effect of the attention mechanism is achieved mostly by improving the weight occupied by key area information, so that the model can pay more attention to the key information like a person.
Given the good performance obtained by embedding the spatial attention module and the channel attention module in a model separately, it is natural to try combining the two in a parallel or sequential manner; CBAM is a typical example of such a combination.
CBAM is an attention module for feedforward convolutional neural networks in which a channel attention module is combined with a spatial attention module: the feature map is processed by both modules, and cross-channel and spatial attention information are mixed together to extract informative features.
Although the global average pooling and global maximum pooling adopted by the CBAM attention mechanism can retain global information and salient information to some extent, information is still lost: the differences between different regions are not considered when the feature-map information is passed to the next layer. Therefore, the application adds soft pooling (SoftPool) to the original CBAM, yielding what is called S-CBAM.
Soft pooling is a pooling technique used in deep learning. In conventional pooling operations such as maximum pooling or average pooling, the values within each pooled region are processed separately and only the maximum or the computed average is selected as the output; soft pooling introduces more flexibility by allowing the values within each pooled region to be aggregated with different weights.

There are several implementations of soft pooling. One common approach computes the weights within each pooled region using learnable parameters; these weights can be dynamically adjusted according to the characteristics of the input data, making the soft pooling operation adaptive.
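As a sketch of the idea, the widely used SoftPool formulation weights each value in a window by the softmax of the window's activations; this concrete form is one instance of the weighted aggregation described above, not necessarily the learnable variant:

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x: torch.Tensor, kernel_size: int = 2) -> torch.Tensor:
    """Softmax-weighted pooling: large activations dominate the output
    without the rest of the window being discarded."""
    e = torch.exp(x)
    # avg_pool of numerator and denominator divides out the window size,
    # leaving sum(e * x) / sum(e) over each pooling window.
    return F.avg_pool2d(e * x, kernel_size) / F.avg_pool2d(e, kernel_size)
```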
In summary, the advantages of the improved mixed-attention mechanism (S-CBAM) include the following:
improving the feature extraction capability: the mixed attention mechanism combines the attention mechanisms of space and channels to be able to focus more fully on important information in the input data. In human behavior recognition, the method can help the model to better extract and recognize key action features, so that the recognition accuracy is improved.
Robustness of the enhancement model: the CBAM can adaptively adjust the weights of different channels and spaces, so that the model has stronger adaptability to the change of input data. In human behavior recognition, this means that the model can better handle different scene, visual angle, illumination condition and other changing factors, thereby improving the robustness of the model.
Step S5: and inputting the human behavior characteristics containing the attention weight into a classifier to obtain the category of the human behavior.
The loss function used to evaluate the classifier's performance guides the adjustment of the classifier's weights; from the loss one can learn how to improve the weight coefficients. A softmax classifier determines the classification result from the distribution of output probability values.
Illustratively, assume the classifier outputs, for three categories, the probability distribution and one-hot true-value vector (example values)

$p = (0.05, 0.15, 0.80), \qquad y = (0, 0, 1),$

where category 1 denotes running, category 2 denotes playing badminton, and category 3 denotes making a call. The true-value vector indicates whether the sample belongs to each category, with the 1 marking the correct category; the output result is therefore category 3, i.e., making a call.
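The decision rule itself is a softmax followed by an argmax, as the short sketch below shows (the logits are hypothetical):

```python
import torch

logits = torch.tensor([[0.4, 1.1, 2.9]])   # hypothetical scores: running, badminton, phone call
probs = torch.softmax(logits, dim=1)       # ~[0.07, 0.13, 0.80]
predicted = torch.argmax(probs, dim=1)     # tensor([2]) -> category 3, "making a call"
```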
In summary, the human behavior recognition method based on the multi-scale attention mechanism proposes a new quantitative representation of motion information, the 3D-JMM, built from the classical-physics definition of motion and the three-dimensional coordinates of each joint point of the human skeleton; it avoids loss of data precision, retains the original motion data to the greatest extent, and generates new motion samples by equal-proportion scaling and mirroring of the human skeleton. On this basis, the convolutional neural network model is modified from the angles of a multi-scale learning strategy and an attention mechanism. Because the GAP layer and the SPP layer can both ignore the size of the input data and convert the feature map into feature vectors of the same size, the application first proposes a transformation strategy for the multi-scale convolutional neural network model: integrate the GAP layer and the SPP layer and concatenate their output features to obtain the GM layer, which preserves global features and improves the model's ability to distinguish similar behaviors. Second, the model is further upgraded by adding the improved mixed attention module, ensuring that it attends to the features effective for behavior classification and ignores irrelevant ones.
Referring to fig. 4, a flowchart of a human behavior recognition method based on a multi-scale attention mechanism according to an embodiment of the present application is shown, and step S1 may include steps S11 to S14, which specifically include the following steps:
step S11: the instantaneous motion displacement is calculated.
As the first dimension of the three-dimensional matrix, the instantaneous motion displacement represents the distance a joint moves, mapped onto the x-, y- and z-axes respectively, during two adjacent frame times. From a physical point of view, it is the change in the joint point's position between the two frames, and it carries timing information.
For example, assume an action sample A contains T frames $\{f_1, f_2, f_3, \dots, f_T\}$, and pick any frame from it; let $c = (x, y, z)$ denote the three-dimensional coordinates of a joint, with $t$ indicating the $t$-th frame of sample A and $n$ indicating the $n$-th joint of the human skeleton. The instantaneous motion displacement of action sample A is then $\{d_1, d_2, d_3, \dots, d_T\}$.

Specifically, the change in the joint point's position between two adjacent frames represents the instantaneous motion displacement, calculated as:

$d_t^n = f_t^n - f_{t-1}^n$

where $f_t^n - f_{t-1}^n$ is the motion displacement between two adjacent frames, $d_t^n$ is the instantaneous motion displacement at frame $t$, $n$ indexes the $N$ joint points of the human skeleton, and the motion sample comprises $T$ frames.
Step S12: the instantaneous direction of motion is calculated.
As the second dimension of the three-dimensional matrix, the instantaneous motion direction represents the direction in which a joint moves between two adjacent frames. It contains not only the timing information of the motion samples but also spatial information.

Specifically, the motion direction of the joint point between the two frames represents the instantaneous motion direction, calculated as:

$P_{XY} = \arctan\frac{\Delta y}{\Delta x}, \quad P_{XZ} = \arctan\frac{\Delta z}{\Delta x}, \quad P_{YZ} = \arctan\frac{\Delta z}{\Delta y}$

where $\{\Delta x, \Delta y, \Delta z\}$ are the displacements along the x-, y- and z-axes of the current frame compared with the previous frame, and $\{P_{XY}, P_{XZ}, P_{YZ}\}$ are the projections of the instantaneous motion direction onto the xy, xz and yz planes.
Step S13: the relative motion displacement is calculated.
As the third dimension of the three-dimensional matrix, the relative motion displacement represents the total displacement of a joint point in unit time; its physical meaning is to record the joint point's change at the spatial level. Each frame of the joint point in action sample A is compared with the first frame and projected onto the x-, y- and z-axes respectively.

Specifically, the total displacement of the joint point in unit time represents the relative motion displacement, calculated as:

$D_t^n = f_t^n - f_1^n$

where $f_t^n - f_1^n$ is the total motion displacement between the current frame and the first frame, and $D_t^n$ is the relative motion displacement at frame $t$.
Step S14: the instantaneous movement displacement, instantaneous movement direction and relative movement displacement are input into a three-dimensional matrix.
After the instantaneous motion displacement, instantaneous motion direction and relative motion displacement of the joints are extracted, the motion information is stored in a three-dimensional matrix (3D-JMM); FIG. 5 shows the behavior quantization process based on skeleton joint points.
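A compact sketch of this assembly step, assuming spine-relative joints of shape (T, N, 3) and an arctan-based form for the plane projections (the patent's exact storage layout may differ):

```python
import numpy as np

def build_3d_jmm(joints: np.ndarray) -> np.ndarray:
    """Stack instantaneous displacement, instantaneous direction and
    relative displacement into one matrix of shape (3, T-1, N, 3)."""
    d = joints[1:] - joints[:-1]                    # d_t = f_t - f_{t-1}
    dx, dy, dz = d[..., 0], d[..., 1], d[..., 2]
    direction = np.stack([np.arctan2(dy, dx),       # projection on the xy plane
                          np.arctan2(dz, dx),       # projection on the xz plane
                          np.arctan2(dz, dy)], -1)  # projection on the yz plane
    D = joints[1:] - joints[:1]                     # D_t = f_t - f_1
    return np.stack([d, direction, D])
```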
Referring to fig. 6, a flowchart of a human behavior recognition method based on a multi-scale attention mechanism according to an embodiment of the present application is shown, and step S2 may include steps S21 to S22, which are specifically as follows:
step S21: and reading preset joint points of each frame in the three-dimensional feature map, scaling the joint points in equal proportion according to a preset proportion, and storing the joint points in an action sample data set.
It should be appreciated that, in any widely used public data set, each action sample has multiple performers. Although these persons differ in height, weight and other attributes, the human skeleton structure is essentially the same, so the skeleton shapes of all persons are basically alike; the skeletons of different performers can be regarded as different instances of the same skeleton. The application therefore randomly enlarges or reduces the original skeleton in equal proportion within a certain range to generate new action samples of different body types.
Illustratively, Algorithm 1 is as follows (shown as Python-style pseudocode; save_as_3d_jmm stands for the 3D-JMM compact storage rule described above):

Algorithm 1: data enhancement by equal-proportion scaling of the human skeleton

    # sample: the original three-dimensional coordinates of the 25 human joint
    # points (j1, j2, ..., j25) in each of the F frames of one action sample
    read the total frame count F of the sample
    for S in arange(0.76, 1.25, step=0.01):
        scaled = S * sample          # scale all 25 joint points of every frame by S
        save_as_3d_jmm(scaled)       # store the result as a new action sample
In a detailed implementation of the above method, an action sample is first randomly selected from the public data set, and the human skeleton is scaled in proportion S, with S drawn from [0.76, 1.25], from the first frame to the last frame, generating a new three-dimensional matrix (3D-JMM).
Step S22: and taking the vertebra part in the human skeleton as a mirror image operation plane to perform symmetrical operation so as to obtain new data which are symmetrical left and right and store the new data in an action sample data set.
It should be appreciated that in three dimensions, the spine is typically the primary support structure for the human skeleton, so that the overall symmetry of the human skeleton is better maintained with the spine as a mirrored plane.
For example, for a person's arm lifting action to the left, a mirror image operation may be performed to generate a sample of the arm lifting action to the right.
The specific symmetry operation is expressed as:

$\tilde{c}^n = p(c^n), \quad p(x, y, z) = (-x, y, z)$

where $c^n$ denotes the coordinates of any joint point outside the spine (expressed relative to the spine plane), $\tilde{c}^n$ denotes its mirrored coordinates, and $p$ denotes the symmetry operation.
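A minimal sketch of the mirror augmentation, assuming spine-relative coordinates with the x-axis as the left-right direction so that p reduces to negating x:

```python
import numpy as np

def mirror_sample(joints: np.ndarray) -> np.ndarray:
    """Mirror a skeleton sequence about the spine plane:
    p(x, y, z) = (-x, y, z) applied to every joint of every frame."""
    mirrored = joints.copy()   # joints: (T, N, 3), spine-relative
    mirrored[..., 0] *= -1
    return mirrored
```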
In a specific embodiment, the step of performing data enhancement on the three-dimensional feature map to generate the action sample data set further comprises a rotation strategy: the three-dimensional coordinates of all joint points in each frame are tracked frame by frame and rotated about the y-axis in steps of 5°. The physical meaning of this strategy is to simulate the same behavior performed in different orientations in three-dimensional space; rotating about the body's vertical direction preserves the proportional structure of the human skeleton, making it the most effective rotation with the least loss of motion information. The coordinates of each joint point after rotation are calculated by:

$\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix}$

where $(x, y, z)$ are the three-dimensional coordinates of a joint point in a frame, $(x', y', z')$ are its coordinates after rotation about the y-axis, and $\theta$ is the rotation angle.
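A sketch of this rotation augmentation (the 5° step follows the text; the helper name is illustrative):

```python
import numpy as np

def rotate_about_y(joints: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate every joint of every frame about the y-axis by angle_deg."""
    a = np.deg2rad(angle_deg)
    rot = np.array([[ np.cos(a), 0.0, np.sin(a)],
                    [ 0.0,       1.0, 0.0      ],
                    [-np.sin(a), 0.0, np.cos(a)]])
    return joints @ rot.T      # joints: (T, N, 3)

# e.g. rotated copies of one sample every 5 degrees:
# rotated = [rotate_about_y(joints, k * 5) for k in range(1, 72)]
```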
Referring to fig. 7, a flowchart of a human behavior recognition method based on a multi-scale attention mechanism according to an embodiment of the present application is shown, and step S4 may include steps S41 to S42, which are specifically as follows:
step S41: soft pooled channel attention is calculated.
Specifically, FIG. 8 is a schematic diagram of the channel attention after soft pooling is added.

The calculation formula of the soft-pooling channel attention is:

$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)) + \mathrm{MLP}(\mathrm{SoftPool}(F))\big)$

where $F$ denotes the three-dimensional feature map, $M_C(F)$ denotes the weight coefficients output for $F$ by the soft-pooling channel attention, $\sigma$ denotes the activation function, MLP denotes a multi-layer perceptron, AvgPool denotes average pooling, MaxPool denotes maximum pooling, and SoftPool denotes soft pooling.
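A PyTorch sketch of this channel attention follows; the shared-MLP structure mirrors CBAM, and the reduction ratio r = 16 is an assumption:

```python
import torch
import torch.nn as nn

class SoftPoolChannelAttention(nn.Module):
    """M_C(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)) + MLP(SoftPool(F)))."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        b, c = x.shape[:2]
        avg = x.mean(dim=(2, 3))                # AvgPool  -> (B, C)
        mx = x.amax(dim=(2, 3))                 # MaxPool  -> (B, C)
        e = torch.exp(x)
        soft = (e * x).sum(dim=(2, 3)) / e.sum(dim=(2, 3))   # SoftPool -> (B, C)
        w = torch.sigmoid(self.mlp(avg) + self.mlp(mx) + self.mlp(soft))
        return x * w.view(b, c, 1, 1)           # reweight the channels
```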
Step S42: soft pooled spatial attention is calculated.
Specifically, FIG. 9 is a schematic diagram of the spatial attention after soft pooling is added.

The calculation formula of the soft-pooling spatial attention is:

$M_S(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F); \mathrm{SoftPool}(F)])\big)$

where $F$ denotes the three-dimensional feature map, $M_S(F)$ denotes the weight coefficients output for $F$ by the soft-pooling spatial attention, $\sigma$ denotes the activation function, and $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel applied to the three pooled maps concatenated along the channel axis.
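A matching sketch of the spatial branch; stacking the three pooled maps along the channel axis before the 7 × 7 convolution follows the CBAM pattern:

```python
import torch
import torch.nn as nn

class SoftPoolSpatialAttention(nn.Module):
    """M_S(F) = sigmoid(f7x7([AvgPool(F); MaxPool(F); SoftPool(F)]))."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=7, padding=3)

    def forward(self, x):                               # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)               # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)
        e = torch.exp(x)
        soft = (e * x).sum(1, keepdim=True) / e.sum(1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx, soft], dim=1)))
        return x * w                                    # reweight each position
```

In the S-CBAM arrangement the two modules are applied in sequence, channel attention first and spatial attention second, as in the original CBAM.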
Referring to fig. 10, a schematic structural diagram of a human behavior recognition system based on a multi-scale attention mechanism according to an embodiment of the present application is shown, where the system includes:
The data acquisition module 10 is configured to acquire human behavior data and annotate human joint motion information in the human behavior data, where the human joint motion information comprises instantaneous motion displacement, instantaneous motion direction and relative motion displacement, and to input the instantaneous motion displacement, instantaneous motion direction and relative motion displacement into a three-dimensional matrix to obtain a three-dimensional feature map;
a data enhancement module 20, configured to perform data enhancement on the three-dimensional feature map to generate an action sample data set;
a multi-scale feature extraction module 30, configured to construct a multi-scale convolutional neural network, where the multi-scale convolutional neural network includes a spatial pyramid pooling layer and a global average pooling layer, and input the motion sample dataset into the spatial pyramid pooling layer and the global average pooling layer to extract local motion features and global motion features respectively;
a soft attention allocation weight module 40, configured to construct a soft attention mechanism, where the soft attention mechanism includes soft pooled channel attention and soft pooled space attention, and input the local motion feature and global motion feature into the soft pooled channel attention and soft pooled space attention to obtain a human behavior feature containing an attention weight;
The behavior recognition and classification module 50 is configured to input the human behavior characteristics including the attention weight into a classifier to obtain the category to which the human behavior belongs.
Another aspect of the present application also provides a storage medium storing one or more programs that when executed by a processor implement the above-described human behavior recognition method.
In another aspect, the present application further proposes a computer device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to implement the above-mentioned human behavior recognition method when executing the computer program stored on the memory.
Referring to fig. 11, an experimental flowchart of a human behavior recognition method and system based on a multi-scale attention mechanism in the present application is shown, including:
step S100: a common dataset is acquired.
The present application will use the following data sets in the experimental section: florence 3D Actions Dataset and UTKinect-Action3D Dataset.
Florence 3D Actions Dataset (Florence-3D): this dataset was captured by the University of Florence in 2012 with a Kinect camera and contains 9 different actions: waving, picking up a bottle to drink, making a phone call, clapping, tying a shoelace, sitting down, standing up, looking at a watch, and bowing. During data acquisition, 10 volunteers performed each of these actions 2 to 3 times, so the dataset contains 215 action samples.
UTKinect-Action3D Dataset (UT-3D): this dataset was recorded with a stationary Kinect camera and contains ten different actions: walking, sitting down, standing up, picking up, carrying, throwing, pushing, pulling, waving, and clapping. Ten volunteers each performed every action twice as required, giving 200 action samples in total.
The software and hardware configuration used in the experiments is shown in Table 1:

Table 1 Software and hardware configuration

Experimental equipment    Detailed parameters
Operating system          Windows 10
GPU                       NVIDIA GTX-3060 11G
CPU                       Intel Core i7-11700KF
Programming language      Python 3.7
Deep learning framework   PyTorch-GPU 1.9.0
Step S200: and (5) verifying multi-scale feature learning.
To ensure that the network model can extract richer semantic features from samples of different durations, the time scales of each dataset are unified to several different sizes. Multi-scale training also increases the diversity of the data, alleviates the problem of insufficient data, avoids overfitting, and improves the generalization ability of the model. The experiment uses the Florence-3D and UT-3D datasets, stratified-sampled into training and test sets at a ratio of 8:2. Table 2 records the results for each dataset at different time scales; the experimental settings are as follows:
Each dataset is sampled at three different scales according to the duration of its samples: 18, 20 and 22 frames for the Florence-3D dataset, and 40, 50 and 60 frames for the UT-3D dataset.
Each group of experiments is trained for 150 epochs in total. With a single scale, the dataset is trained for all 150 epochs; with a combination of two scales, each scale is trained for 75 epochs; with three scales, each is trained for 50 epochs. Holding the total number of training epochs constant removes the confound of differing training budgets.
Table 2 Multi-scale feature learning accuracy comparison experiments
The experimental results show that, after the multi-scale transformation, the convolutional neural network trained on several scales of the same sample set achieves higher accuracy than single-scale training. This indicates that the modified network can learn multi-scale features from samples of different scales, thereby improving recognition performance.
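As an illustration of the multi-scale setup above, the following sketch uniformly resamples a skeleton sequence to the chosen frame counts and splits the fixed 150-epoch budget evenly across scales; the uniform-sampling strategy and all names are assumptions for illustration:

```python
import numpy as np

def resample_sequence(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Uniformly resample a (T, joints, 3) skeleton sequence to target_len frames."""
    idx = np.linspace(0, len(frames) - 1, num=target_len).round().astype(int)
    return frames[idx]

# One Florence-3D-like sample rescaled to the three training scales.
sample = np.random.rand(25, 15, 3)              # 25 frames, 15 joints, (x, y, z)
multi_scale = {n: resample_sequence(sample, n) for n in (18, 20, 22)}

# Fixed 150-epoch budget split evenly across the scales in use, as in Table 2.
for n_scales in (1, 2, 3):
    print(f"{n_scales} scale(s): {150 // n_scales} epochs per scale")
```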
Step S300: verifying the validity of the attention mechanism.
The S-CBAM module combines the spatial attention module's ability to focus the model's limited attention on key features with the channel attention module's ability to rank channel features by importance. This section verifies the effectiveness of the network model embedded with the S-CBAM module on top of the data enhancement and multi-scale learning strategies. The neural network with the nested S-CBAM module is trained and tested on the Florence-3D and UT-3D datasets.
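For concreteness, here is a hedged PyTorch sketch of an S-CBAM-style block: soft pooled channel attention followed by a spatial attention (such as the one sketched after the spatial attention formula). As before, SoftPool is approximated by a softmax-weighted sum, and the class names and reduction ratio are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SoftPoolChannelAttention(nn.Module):
    """Sketch of M_C(F): channel weights from avg-, max- and soft-pooled maps."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to each pooled descriptor, as in CBAM.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f.shape
        flat = f.flatten(2)                    # (b, c, h*w)
        avg_desc = flat.mean(dim=2)
        max_desc = flat.amax(dim=2)
        # SoftPool approximated as a softmax-weighted sum over spatial positions.
        soft_desc = (torch.softmax(flat, dim=2) * flat).sum(dim=2)
        m_c = torch.sigmoid(self.mlp(avg_desc) + self.mlp(max_desc) + self.mlp(soft_desc))
        return f * m_c.view(b, c, 1, 1)

class SCBAM(nn.Module):
    """S-CBAM sketch: channel attention followed by spatial attention."""

    def __init__(self, channels: int, spatial_attention: nn.Module):
        super().__init__()
        self.channel = SoftPoolChannelAttention(channels)
        # e.g. the SoftPoolSpatialAttention sketched after the spatial formula.
        self.spatial = spatial_attention

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.spatial(self.channel(f))
```

A block like this would typically be inserted after a convolutional stage, e.g. SCBAM(256, SoftPoolSpatialAttention()) applied to a (batch, 256, H, W) feature map.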
To further verify the effectiveness of the method presented herein, it is compared with methods proposed by other researchers; the comparison results are shown in Tables 3 and 4:
Table 3 Classification accuracy comparison on the Florence-3D dataset

Method                                                    Accuracy
Comparative method 1 - Joint Subset Selection (JSS)       90.10%
Comparative method 2 - Lie group + CNN                    93.00%
Comparative method 3 - Chen                               96.59%
Comparative method 4 - Spatio-Temporal Weighted           92.46%
Comparative method 5 - DJMI + ZFNet + LSTM                93.77%
Comparative method 6 - ConvRNN + PSF                      96.00%
The method of the present application                     97.80%
Table 4 Classification accuracy comparison on the UT-3D dataset

Method                                                    Accuracy
Comparative method 1 - Joint Subset Selection (JSS)       95.78%
Comparative method 2 - Chen                               97.48%
Comparative method 3 - DJMI + ZFNet + LSTM                94.23%
Comparative method 4 - ConvRNN + PSF                      94.73%
Comparative method 5 - Composite Latent Structures (CLS)  95.50%
The method of the present application                     98.69%
In the tests on the Florence-3D dataset in Table 3, the data enhancement + GM layer + S-CBAM module + ZFNet method used in this application achieved 97.80% accuracy, 4.03% higher than the 93.77% of the DJMI+ZFNet+LSTM method. In the comparative experiments on the UT-3D dataset in Table 4, the method of the present application still performed 1.21% higher than the best-performing Chen method. These two groups of comparative experiments show that, after the S-CBAM module is added, the network model pays more attention to the features that are useful for the behavior classification task, successfully suppresses unnecessary features, and thus improves the classification accuracy of the model.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above examples represent only a few embodiments of the present application; although they are described in considerable detail, they are not to be construed as limiting the scope of the application. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the present application, and these all fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is determined by the appended claims.

Claims (9)

1. A human behavior recognition method based on a multi-scale attention mechanism, the method comprising:
Acquiring human body behavior data, marking human body joint motion information in the human body behavior data, wherein the human body joint motion information comprises instantaneous motion displacement, instantaneous motion direction and relative motion displacement, and inputting the instantaneous motion displacement, the instantaneous motion direction and the relative motion displacement into a three-dimensional matrix to obtain a three-dimensional feature map, and the step of marking the human body joint motion information in the human body behavior data comprises the following steps:
the spine center joint point among the human skeleton joint points is used to replace the depth camera as the new coordinate origin, and the relative coordinates of the remaining joint points are obtained by subtracting the coordinates of the spine center joint point, the specific relative coordinate calculation formula being:

J_i^r = J_i - J_c

wherein J_c represents the three-dimensional coordinates of the spine center joint point, J_i represents the coordinates of a joint point other than the spine center joint point, J_i^r represents the relative coordinates of that joint point with respect to the spine center joint point, and the superscript r denotes the relative coordinate representation;
performing data enhancement on the three-dimensional feature map to generate an action sample data set;
constructing a multi-scale convolutional neural network, wherein the multi-scale convolutional neural network comprises a spatial pyramid pooling layer and a global average pooling layer, and the action sample data set is input into the spatial pyramid pooling layer and the global average pooling layer so as to extract local action features and global action features respectively;
Constructing a soft attention mechanism, wherein the soft attention mechanism comprises soft pooled channel attention and soft pooled space attention, and inputting the local action feature and the global action feature into the soft pooled channel attention and soft pooled space attention to obtain human behavior features containing attention weights;
and inputting the human behavior characteristics containing the attention weight into a classifier to obtain the category of the human behavior.
2. The human behavior recognition method according to claim 1, wherein the human joint motion information includes instantaneous motion displacement, instantaneous motion direction, and relative motion displacement, and the step of annotating the human joint motion information comprises:
the position change of the joint point between two adjacent frames is calculated to represent the instantaneous motion displacement, the specific instantaneous motion displacement calculation formula being:

d_t = f_t - f_{t-1}, t ∈ {2, …, T}, applied to each of the N skeleton joint points

wherein f_t - f_{t-1} represents the motion displacement between two adjacent frames, d_t represents the instantaneous motion displacement of the t-th frame, N indicates that the joint point belongs to the N joint points of the human skeleton, and T indicates that the motion sample contains T frames;
the motion direction of the joint point between the previous and current frames is calculated to represent the instantaneous motion direction, the specific instantaneous motion direction calculation formula being:

P_XY = (Δx_t, Δy_t), P_XZ = (Δx_t, Δz_t), P_YZ = (Δy_t, Δz_t)

wherein {Δx_t, Δy_t, Δz_t} represent the displacements along the x, y and z axes of the current frame relative to the previous frame, and {P_XY, P_XZ, P_YZ} represent the projections of the instantaneous motion direction onto the xy, xz and yz planes;
the total displacement of the joint point per unit time is calculated to represent the relative motion displacement, the specific relative motion displacement calculation formula being:

D_t = f_t - f_1, t ∈ {1, …, T}

wherein f_t - f_1 represents the total motion displacement between the current frame and the first frame, and D_t represents the relative motion displacement of the t-th frame.
3. The method of claim 1, wherein the step of data enhancing the three-dimensional feature map to generate an action sample dataset comprises:
reading the preset joint points of each frame in the three-dimensional feature map, scaling the joint points in equal proportion according to a preset ratio, and storing the result in the action sample data set;
a symmetric operation is performed with the spine part of the human skeleton as the mirror plane to obtain new left-right symmetric data, which are stored in the action sample data set, the specific symmetric operation formula being expressed as:

J_i^p = p(J_i)

wherein J_i represents the coordinates of any joint point outside the spine, J_i^p represents its mirror-image coordinates about the spine plane, and p denotes the symmetric (mirror) operation.
4. The method of claim 1, wherein the step of constructing a multi-scale convolutional neural network comprising a spatial pyramid pooling layer and a global averaging pooling layer, and inputting the motion sample dataset into the spatial pyramid pooling layer and the global averaging pooling layer to extract local motion features and global motion features, respectively, comprises:
And connecting the outputs of the spatial pyramid pooling layer and the global average pooling layer in series to enable the extracted local action features and the global action features to be subjected to feature fusion.
5. The human behavior recognition method according to claim 1, wherein the calculation formula of the soft pooled channel attention is:

M_C(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)) + MLP(SoftPool(F)))

wherein F represents the three-dimensional feature map, M_C(F) represents the weight coefficients output by the soft pooled channel attention, σ represents the activation function, MLP represents the multi-layer perceptron, AvgPool represents average pooling, MaxPool represents maximum pooling, and SoftPool represents soft pooling.
6. The human behavior recognition method according to claim 1, wherein the calculation formula of the soft pooled spatial attention is:

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F); SoftPool(F)]))

wherein F represents the three-dimensional feature map, M_s(F) represents the weight coefficients output by the soft pooled spatial attention, σ represents the activation function, f^{7×7} represents a convolution with a 7×7 convolution kernel, AvgPool represents average pooling, MaxPool represents maximum pooling, and SoftPool represents soft pooling.
7. A human behavior recognition system based on a multi-scale attention mechanism, the system comprising:
the data acquisition module, configured to acquire human behavior data and annotate human joint motion information in the human behavior data, wherein the human joint motion information includes instantaneous motion displacement, instantaneous motion direction and relative motion displacement, the instantaneous motion displacement, instantaneous motion direction and relative motion displacement being input into a three-dimensional matrix to obtain a three-dimensional feature map, and wherein the step of annotating the human joint motion information in the human behavior data includes:
the spine center joint point among the human skeleton joint points is used to replace the depth camera as the new coordinate origin, and the relative coordinates of the remaining joint points are obtained by subtracting the coordinates of the spine center joint point, the specific relative coordinate calculation formula being:

J_i^r = J_i - J_c

wherein J_c represents the three-dimensional coordinates of the spine center joint point, J_i represents the coordinates of a joint point other than the spine center joint point, J_i^r represents the relative coordinates of that joint point with respect to the spine center joint point, and the superscript r denotes the relative coordinate representation;
the data enhancement module is used for carrying out data enhancement on the three-dimensional feature map so as to generate an action sample data set;
the multi-scale feature extraction module is used for constructing a multi-scale convolutional neural network, the multi-scale convolutional neural network comprises a spatial pyramid pooling layer and a global average pooling layer, and the action sample data set is input into the spatial pyramid pooling layer and the global average pooling layer so as to extract local action features and global action features respectively;
the soft attention weight allocation module, configured to construct a soft attention mechanism, wherein the soft attention mechanism includes soft pooled channel attention and soft pooled spatial attention, and the local action features and global action features are input into the soft pooled channel attention and soft pooled spatial attention to obtain human behavior features containing attention weights;
and the behavior recognition and classification module is used for inputting the human behavior characteristics containing the attention weight into a classifier so as to obtain the category of the human behavior.
8. A storage medium, wherein the storage medium stores one or more programs which, when executed by a processor, implement the human behavior recognition method of any one of claims 1-6.
9. A computer device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to implement the human behavior recognition method of any one of claims 1 to 6 when executing the computer program stored on the memory.
CN202410116542.9A 2024-01-29 2024-01-29 Human behavior recognition method and system based on multi-scale attention mechanism Pending CN117649701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410116542.9A CN117649701A (en) 2024-01-29 2024-01-29 Human behavior recognition method and system based on multi-scale attention mechanism

Publications (1)

Publication Number Publication Date
CN117649701A (en) 2024-03-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination