CN115690906A - Human body action recognition method based on self-attention mechanism and Bi-GRU - Google Patents

Human body action recognition method based on self-attention mechanism and Bi-GRU

Info

Publication number
CN115690906A
Authority
CN
China
Prior art keywords
data
action
human body
gru
encoder
Prior art date
Legal status
Pending
Application number
CN202211304941.5A
Other languages
Chinese (zh)
Inventor
路永乐
修蔚然
韩亮
杨杰
孙旗
罗毅
彭慧
刘宇
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202211304941.5A
Publication of CN115690906A
Legal status: Pending


Abstract

The invention claims a human body action recognition method based on a self-attention mechanism and Bi-GRU, comprising the following steps. S1: record inertial sensor data of human body actions, and intercept the data and the corresponding action category labels through a sliding window. S2: input the data into an Encoder for encoding, extract the temporal correlation features among the input data through a multi-head self-attention layer, and splice these features with the original input data. S3: input the output data of the Encoder into a Bi-GRU for further time-series feature extraction. S4: input the output features of the Bi-GRU into the fully connected layer to obtain an output vector. S5: train the model on sample data, then input inertial sensor data with unknown classification labels into the trained model to obtain the human body action category. The invention solves the problems that effective time-series features are difficult to extract and recognition accuracy is low in existing human body action recognition.

Description

Human body action recognition method based on self-attention mechanism and Bi-GRU
Technical Field
The invention belongs to the field of human body action recognition, and particularly relates to a human body action recognition method based on a self-attention mechanism and Bi-GRU.
Background
Human action recognition refers to classifying motion into predefined human action classes based on data obtained from sensors. It plays a very important role in fields such as health monitoring systems, telemedicine and motion detection. Human action recognition based on inertial sensors has advantages including freedom from scene limitations and strong anti-interference capability, and is therefore well suited to daily sports and military applications.
The advent of deep learning has brought breakthrough progress to machine learning and opened a new direction for human action recognition. Deep learning automatically learns deep features from raw data, solving the problem that feature extraction in traditional machine learning depends on researchers' prior knowledge, which leads to poor generalization.
Techniques based on convolutional neural networks and recurrent neural networks are currently the most widely used deep-learning approaches to human action recognition. Convolutional neural networks extract spatial features, while recurrent neural networks extract temporal features. The following problems remain: 1. For tasks with strong temporal correlation, such as human action recognition, the spatial features extracted by a convolutional network are not effective enough, so the accuracy of complex action recognition is low. 2. A convolutional network has excessive computational complexity and too many parameters. 3. A recurrent neural network struggles to extract temporal features between data separated by long time intervals, so recognition accuracy is not high enough. A new feature extraction and recognition method is therefore needed to improve recognition accuracy and reduce algorithm complexity.
The invention is essentially different from patent CN114639169A: the data source of the present invention is inertial sensing, whereas CN114639169A uses WiFi, and the present invention does not use complex convolution algorithms.
The invention extracts global temporal correlation features through the self-attention mechanism and, to ensure that the Bi-GRU can still extract the local time-series features of the original data, splices the output of the self-attention mechanism with the original input data. The Bi-GRU then extracts the local time-series features, achieving complete extraction of the time-domain features. At the same time, the self-attention mechanism combined with the Bi-GRU has a simple structure and a low parameter count, avoiding the large parameter count and complex structure of a convolutional network.
Disclosure of Invention
The present invention is directed to solving the above problems of the prior art by providing a human body action recognition method based on a self-attention mechanism and Bi-GRU. The technical scheme of the invention is as follows:
A human body action recognition method based on a self-attention mechanism and Bi-GRU comprises the following steps:
S1: record inertial sensor data of human body actions, and intercept the data and the corresponding action category labels through a sliding window;
S2: construct an Encoder-Decoder model comprising an Encoder and a Decoder; input the data into the Encoder for encoding, extract the temporal correlation features among the input data through the multi-head self-attention layer in the Encoder, and then splice these features with the original input data;
S3: decode with the Decoder, which comprises a bidirectional gated recurrent unit (Bi-GRU), a fully connected layer and a Softmax layer; input the Encoder's output into the Bi-GRU for further time-series feature extraction; the fully connected layer integrates the features into a vector, and the Softmax layer converts the output of the fully connected layer into a probability distribution;
S4: input the Bi-GRU's output features into the fully connected layer to obtain an output vector whose dimensionality is the total number of classification labels; the N-th value of the vector is the likelihood that the action corresponding to the input inertial sensor data is the N-th action;
S5: train the model on sample data, then input inertial sensor data with unknown classification labels into the trained model to obtain the human body action category.
Further, S1 specifically comprises:
using inertial sensors positioned on the torso, record time-series inertial data of human body actions, set sliding windows of a certain length, and intercept the data of corresponding length together with the human body action category corresponding to each sliding window.
Further, the multi-head self-attention layer in step S2 comprises three fully connected layers: query, key and value. The input data pass through these three layers to obtain the Q, K and V matrices, from which the Attention-Score matrix is computed. To ensure that the Bi-GRU can learn the time-domain features of the original data, the Attention-Score matrix is spliced with the original data in the last dimension to obtain the output of the Encoder.
Further, the Attention-Score matrix is computed as follows:
$$\mathrm{Attention\text{-}Score} = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{\mathrm{Head\_size}}}\right)V$$
where Head_size represents the dimension of each head of the multi-head attention, and Softmax represents the Softmax function, computed for each row of the matrix. The Softmax formula is as follows:
$$\mathrm{Softmax}(y_a) = \frac{e^{y_a}}{\sum_{b=1}^{w} e^{y_b}}$$
where y_a is the value in the a-th column of a row of the Attention-Score matrix, y_b is the value in the b-th column of that row, and w is the number of columns of the matrix.
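A minimal PyTorch sketch of the Encoder described above, assuming illustrative layer sizes: it computes Q, K and V with three fully connected layers, applies the scaled row-wise Softmax per head, and splices the attention output with the original input in the last dimension:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionEncoder(nn.Module):
    """Multi-head self-attention Encoder whose output is spliced with
    the original input, per the description above."""

    def __init__(self, in_dim, num_heads=4, head_size=16):
        super().__init__()
        self.num_heads = num_heads
        self.head_size = head_size
        d = num_heads * head_size
        # The three fully connected layers: query, key and value.
        self.query = nn.Linear(in_dim, d)
        self.key = nn.Linear(in_dim, d)
        self.value = nn.Linear(in_dim, d)

    def forward(self, x):  # x: (batch, T, in_dim)
        B, T, _ = x.shape

        def split(t):  # (B, T, heads*head_size) -> (B, heads, T, head_size)
            return t.view(B, T, self.num_heads, self.head_size).transpose(1, 2)

        Q, K, V = split(self.query(x)), split(self.key(x)), split(self.value(x))
        # Row-wise Softmax of Q K^T / sqrt(Head_size), then weight V.
        scores = F.softmax(Q @ K.transpose(-2, -1) / self.head_size ** 0.5, dim=-1)
        attn = (scores @ V).transpose(1, 2).reshape(B, T, -1)  # heads concatenated by columns
        # Splice the attention output with the original input in the last dimension.
        return torch.cat([attn, x], dim=-1)  # (batch, T, heads*head_size + in_dim)
```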
Further, the fully connected layer is followed by a Softmax layer; using the Softmax formula, the Softmax layer computes, from the vector output by the fully connected layer, the probability Q(i|x) that the sensor time-series data x currently input into the Encoder-Decoder model is classified as label i. The Softmax formula is as follows:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{N} e^{z_c}}$$
where z_i is the output of the i-th neuron of the last fully connected layer for input sequence x, and z_c is the output of the c-th neuron of that layer; the N-th value of the resulting vector is the probability that the action corresponding to the inertial sensor data in the input sliding window is the N-th action, with Softmax(z_i) = Q(i|x).
The action i corresponding to the maximum Q(i|x) is selected as the human body action recognition result; that is, if Softmax(z_i) is the maximum of the Softmax results, the recognition result for input data x is the i-th label action.
Further, the loss function adopts a balanced cross entropy function:
$$\mathrm{Loss} = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{N} \alpha_i\, P(x_{ji}) \log Q(x_{ji}) + \frac{\lambda}{2m}\sum_{k=1}^{m} \theta_k^{2}$$
where the first half of the right-hand side is the balanced cross-entropy loss function; α_i is the loss weight of the i-th action; n is the number of samples in one training batch; N is the number of action types; P is the probability distribution of the true label converted to one-hot encoding; and Q treats the vector output by the model as an action probability distribution. P(x_ji) is the probability of the i-th action in the true label corresponding to the j-th input sequence x, and Q(x_ji) is the probability of the i-th action in the model output for the j-th input sequence x. Assigning different loss weights addresses the problem of imbalanced sample sizes in the dataset. The second half is an L2 regularization term, where λ is the regularization coefficient, θ is the set of learnable parameters in the algorithm, and m is the number of learnable parameters.
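A sketch of this loss in PyTorch, under the assumption that the model outputs class probabilities Q and the labels arrive one-hot encoded as P; the regularization coefficient is an illustrative value:

```python
import torch

def balanced_cross_entropy(probs, onehot, alpha, params, lam=1e-4):
    """Balanced cross-entropy with an L2 penalty.

    probs:  (n, N) model outputs Q for a batch of n windows
    onehot: (n, N) one-hot true labels P
    alpha:  (N,) per-action loss weights alpha_i
    params: learnable parameters theta of the model
    """
    params = list(params)
    n = probs.shape[0]
    # First half: weighted cross entropy, averaged over the batch.
    ce = -(alpha * onehot * torch.log(probs + 1e-12)).sum() / n
    # Second half: L2 regularization over all m learnable parameters.
    m = sum(p.numel() for p in params)
    l2 = lam / (2 * m) * sum((p ** 2).sum() for p in params)
    return ce + l2
```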
The invention has the following advantages and beneficial effects:
the Encoder-Decoder model is a neural network model with simple network structure and light weight. Different from the common human body action recognition method based on deep cyclic neural network learning, the method firstly extracts the global time correlation characteristics among data regardless of time intervals through the self-attention mechanism coding in the Encoder, and solves the defect that the cyclic neural network is difficult to extract the time correlation characteristics among data with longer time intervals. And secondly, splicing the Attention-Score matrix and the original data in the last dimension to obtain the output of the Encoder in order to ensure that the Bi-GRU can learn the time domain characteristics of the original data. And the Encoder outputs the time sequence characteristics of the data extracted by a gating circulation unit of the Decoder, so that the human body action identification precision is improved. The invention can efficiently process the inertial sensor data and can automatically learn complete and effective time sequence characteristics from the sensor data. Meanwhile, the invention only uses the recurrent neural network and does not use the convolutional neural network, and has simpler structure and lower parameter number, thereby having simpler calculation and less consumption of computer resources. The Attention-Score matrix and the original data are spliced in the last dimension, so that the integrity of time domain characteristics of the original data is guaranteed, and the identification precision is improved. The invention can provide a new visual field and new thinking for human body action recognition, and is beneficial to the development of human body action recognition.
Drawings
Fig. 1 is a schematic structural diagram of an Encoder-Decoder model according to a preferred embodiment of the present invention.
Fig. 2 is a flowchart of a method implementation according to an embodiment of the present invention.
Fig. 3 is a flow chart of the Encoder-Decoder model generation process.
Fig. 4 is a diagram illustrating the Attention-Score matrix computation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the invention firstly provides a human body action recognition method based on a self-attention mechanism and Bi-GRU (Bi-GRU), which comprises the steps of collecting and processing data, inputting the data into a model and obtaining a human body action recognition result as shown in figure 2. The specific steps of model training are shown in fig. 3, and include the following steps one, two and three:
the method comprises the following steps: recording time sequence data of inertial sensors about human body actions by using the inertial sensors positioned on the trunk, setting sliding windows with certain lengths, and intercepting the data with corresponding lengths and the human body action category corresponding to each sliding window;
step two: constructing an Encoder-Decoder model;
as shown in FIG. 1, the Encoder-Decoder model includes an Encoder and a Decoder. Wherein Encoder: comprises a Multi-Head-Self-orientation layer; a Decoder: the system comprises a bidirectional gating cycle unit network, a full connection layer and a Softmax layer;
in the present invention, the Multi-Head-Self-orientation layer comprises three fully-connected layers: query, key, value. The input data respectively obtain Q, K and V matrixes through the three full-connection layers. And then obtaining an Attention-Score matrix through further calculation, and splicing the Attention-Score matrix and the original data in the last dimension to obtain the output of the Encoder in order to ensure that the Bi-GRU can learn the time domain characteristics of the original data. The Decoder comprises a bidirectional gating circulation unit and is used for further extracting time sequence characteristics; a full connection layer for integrating the output of the bidirectional gating circulation unit and outputting a vector, wherein the vector dimension is the total number of the classification labels; and a Softmax layer, wherein the output of the full connection layer obtains a vector through a Softmax function, and the dimension of the vector is the total number of the classification tags. And the vector N dimension value is the probability that the action corresponding to the inertial sensor data in the input sliding window is the N action.
The first module of the Encoder-Decoder model is the Encoder, which comprises a Multi-Head Self-Attention layer containing three fully connected layers: query, key and value. The input data pass through these three layers to obtain the Q, K and V matrices, from which the Attention-Score matrix is computed as follows:
$$\mathrm{Attention\text{-}Score} = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{\mathrm{Head\_size}}}\right)V$$
where Head_size represents the dimension of each head of the multi-head attention, and Softmax represents the Softmax function, computed for each row of the matrix. The Softmax formula is as follows:
$$\mathrm{Softmax}(y_a) = \frac{e^{y_a}}{\sum_{b=1}^{w} e^{y_b}}$$
where y_a is the value in the a-th column of a row of the Attention-Score matrix, y_b is the value in the b-th column of that row, and w is the number of columns of the matrix.
Fig. 4 shows the Attention-Score matrix computation process for an input time series of length 2, with 1×3 data at each time step, 4 heads, and a head dimension of 3. The Attention-Score matrices computed for each head (S_0, etc., in Fig. 4) are concatenated by columns to obtain the output of the Multi-Head Self-Attention layer (S in Fig. 4).
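Using the AttentionEncoder sketch from earlier with the dimensions of this example (time length 2, 1×3 data per step, 4 heads of dimension 3), the shapes can be checked directly; the batch dimension is an implementation convenience:

```python
import torch

enc = AttentionEncoder(in_dim=3, num_heads=4, head_size=3)
x = torch.randn(1, 2, 3)   # batch of 1, T = 2 time steps, 1x3 data per step
out = enc(x)
print(out.shape)           # torch.Size([1, 2, 15]): 4 heads x 3 dims = 12, plus the original 3
```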
To ensure that the Bi-GRU can learn the time-domain features of the original data, the Attention-Score matrix is spliced with the original data in the last dimension to obtain the output of the Encoder. The second module is the Decoder, which comprises: a bidirectional gated recurrent unit, extracting high-dimensional time-series features from the data; a fully connected layer, integrating the high-dimensional time-series features obtained by the bidirectional gated recurrent unit into a vector, recorded as (z_1, z_2, ..., z_N), whose dimensionality is the number of action categories; and a Softmax function, computing from the output of the fully connected layer a vector whose dimensionality is the total number of classification labels, the N-th value of which is the probability that the action corresponding to the inertial sensor data in the input sliding window is the N-th action. The Softmax formula is as follows:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{N} e^{z_c}}$$
where z_i is the output of the i-th neuron of the last fully connected layer for input sequence x, and z_c is the output of the c-th neuron of that layer; the N-th value is the probability that the action corresponding to the inertial sensor data in the input sliding window is the N-th action, with Softmax(z_i) = Q(i|x).
If Softmax(z_i) is the maximum of the Softmax results, the action recognition result for input data x is the i-th label action, where N is the number of action types.
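A matching PyTorch sketch of the Decoder, assuming an illustrative hidden size and that the fully connected layer reads the Bi-GRU features of the last time step (the patent does not fix this detail):

```python
import torch
import torch.nn as nn

class BiGRUDecoder(nn.Module):
    """Bi-GRU followed by a fully connected layer and Softmax,
    per the Decoder structure described above."""

    def __init__(self, in_dim, num_classes, hidden=64):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # 2x for both directions

    def forward(self, x):                     # x: (batch, T, in_dim) from the Encoder
        out, _ = self.bigru(x)                # (batch, T, 2*hidden)
        logits = self.fc(out[:, -1, :])       # integrate features into one vector
        return torch.softmax(logits, dim=-1)  # action-probability vector (z_1..z_N)
```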
Step three: train the Encoder-Decoder model on the sensor time-series samples intercepted in step one and their corresponding human body action category labels, stopping training when the loss function value falls below a set threshold.
In order to enable the neural network to learn more discriminative features, the closeness of the Encoder-Decoder model's actual output to the expected output is judged through a loss function. The invention adopts the following balanced cross-entropy loss function:
$$\mathrm{Loss} = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{N} \alpha_i\, P(x_{ji}) \log Q(x_{ji}) + \frac{\lambda}{2m}\sum_{k=1}^{m} \theta_k^{2}$$
where the first half of the right-hand side is the balanced cross-entropy loss function; α_i is the loss weight of the i-th action; n is the number of samples in one training batch; N is the number of action types; P is the probability distribution of the true label converted to one-hot encoding; and Q treats the vector output by the model as an action probability distribution. P(x_ji) is the probability of the i-th action in the true label corresponding to the j-th input sequence x, and Q(x_ji) is the probability of the i-th action in the model output for the j-th input sequence x. Assigning different loss weights addresses the problem of imbalanced sample sizes in the dataset. The second half is an L2 regularization term, which helps mitigate overfitting; λ is the regularization coefficient, θ is the set of learnable parameters in the algorithm (weights and biases), and m is the number of learnable parameters.
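A minimal training loop for step three might look as follows; the optimizer, learning rate and stopping threshold are illustrative assumptions, not values fixed by the patent:

```python
import torch

def train(model, loader, loss_fn, threshold=0.05, lr=1e-3, max_epochs=200):
    """Train until the average loss falls below the set threshold."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for windows, onehot_labels in loader:      # samples from the sliding-window step
            probs = model(windows)
            loss = loss_fn(probs, onehot_labels)   # e.g. the balanced cross entropy above
            opt.zero_grad()
            loss.backward()
            opt.step()
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < threshold:   # stop once the loss is low enough
            break
    return model
```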
Step four: recognize and classify human body actions using the trained Encoder-Decoder model.
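Chaining the Encoder and Decoder sketches gives an end-to-end sketch of step four; the window length, channel count and number of action classes are illustrative:

```python
import torch

model = torch.nn.Sequential(
    AttentionEncoder(in_dim=6, num_heads=4, head_size=16),   # 6 inertial channels
    BiGRUDecoder(in_dim=4 * 16 + 6, num_classes=5),          # encoder output: heads*head_size + input
)
model.eval()
with torch.no_grad():
    window = torch.randn(1, 128, 6)          # one window of unlabeled sensor data
    probs = model(window)                    # (1, 5) action-probability vector
    action = probs.argmax(dim=-1).item()     # the action i with maximum Q(i|x)
print(f"predicted action class: {action}")
```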
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, a computer-readable medium does not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus comprising that element.
The above examples are to be construed as merely illustrative and not limiting of the present disclosure. After reading this description, a person skilled in the art can make various changes or modifications to the invention, and such equivalent changes and modifications likewise fall within the scope of the invention as defined by the claims.

Claims (6)

1. A human body action recognition method based on a self-attention mechanism and Bi-GRU, characterized by comprising the following steps:
S1: record inertial sensor data of human body actions, and intercept the data and the corresponding action category labels through a sliding window;
S2: construct an Encoder-Decoder model comprising an Encoder and a Decoder; input the data into the Encoder for encoding, extract the temporal correlation features among the input data through the multi-head self-attention layer in the Encoder, and then splice these features with the original input data;
S3: decode with the Decoder, which comprises a bidirectional gated recurrent unit (Bi-GRU), a fully connected layer and a Softmax layer; input the Encoder's output into the Bi-GRU for further time-series feature extraction; the fully connected layer integrates the features into a vector, and the Softmax layer converts the output of the fully connected layer into a probability distribution;
S4: input the Bi-GRU's output features into the fully connected layer to obtain an output vector whose dimensionality is the total number of classification labels; the N-th value of the vector is the likelihood that the action corresponding to the input inertial sensor data is the N-th action;
S5: train the model on sample data, then input inertial sensor data with unknown classification labels into the trained model to obtain the human body action category.
2. The human body action recognition method based on the self-attention mechanism and Bi-GRU as claimed in claim 1, characterized in that S1 specifically comprises:
using inertial sensors positioned on the torso, record time-series inertial data of human body actions, set sliding windows of a certain length, and intercept the data of corresponding length together with the human body action category corresponding to each sliding window.
3. The human body action recognition method based on the self-attention mechanism and Bi-GRU as claimed in claim 1, characterized in that the multi-head self-attention layer in step S2 comprises three fully connected layers: query, key and value; the input data pass through these three layers to obtain the Q, K and V matrices, from which the Attention-Score matrix is computed; to ensure that the Bi-GRU can learn the time-domain features of the original data, the Attention-Score matrix is spliced with the original data in the last dimension to obtain the output of the Encoder.
4. The human body action recognition method based on the self-attention mechanism and Bi-GRU as claimed in claim 3, characterized in that the Attention-Score matrix is calculated by the following formula:
$$\mathrm{Attention\text{-}Score} = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{\mathrm{Head\_size}}}\right)V$$
where Head_size represents the dimension of each head of the multi-head attention, and Softmax represents the Softmax function, computed for each row of the matrix; the Softmax formula is as follows:
$$\mathrm{Softmax}(y_a) = \frac{e^{y_a}}{\sum_{b=1}^{w} e^{y_b}}$$
where y_a is the value in the a-th column of a row of the Attention-Score matrix, y_b is the value in the b-th column of that row, and w is the number of columns of the matrix.
5. The human body action recognition method based on the self-attention mechanism and Bi-GRU, characterized in that the fully connected layer is followed by a Softmax layer; using the Softmax formula, the Softmax layer computes, from the vector output by the fully connected layer, the probability Q(i|x) that the sensor time-series data x currently input into the Encoder-Decoder model is classified as label i; the Softmax formula is as follows:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{N} e^{z_c}}$$
where z_i is the output of the i-th neuron of the last fully connected layer for input sequence x, z_c is the output of the c-th neuron of that layer, and N is the number of action types; the N-th value is the probability that the action corresponding to the inertial sensor data in the input sliding window is the N-th action, with Softmax(z_i) = Q(i|x);
the action i corresponding to the maximum Q(i|x) is selected as the human body action recognition result; that is, if Softmax(z_i) is the maximum of the Softmax results, the recognition result for input data x is the i-th label action.
6. The human body action recognition method based on the self-attention mechanism and Bi-GRU as claimed in claim 5, characterized in that the loss function is a balanced cross-entropy function:
$$\mathrm{Loss} = -\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{N} \alpha_i\, P(x_{ji}) \log Q(x_{ji}) + \frac{\lambda}{2m}\sum_{k=1}^{m} \theta_k^{2}$$
where the first half of the right-hand side is the balanced cross-entropy loss function; α_i is the loss weight of the i-th action; n is the number of samples in one training batch; N is the number of action types; P is the probability distribution of the true label converted to one-hot encoding; and Q treats the vector output by the model as an action probability distribution; P(x_ji) is the probability of the i-th action in the true label corresponding to the j-th input sequence x, and Q(x_ji) is the probability of the i-th action in the model output for the j-th input sequence x; assigning different loss weights addresses the problem of imbalanced sample sizes in the dataset; the second half is an L2 regularization term, where λ is the regularization coefficient, θ is the set of learnable parameters in the algorithm, and m is the number of learnable parameters.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211304941.5A 2022-10-24 2022-10-24 Human body action recognition method based on self-attention mechanism and Bi-GRU


Publications (1)

Publication Number Publication Date
CN115690906A 2023-02-03

Family ID: 85099719



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665110A (en) * 2023-07-25 2023-08-29 上海蜜度信息技术有限公司 Video action recognition method and device
CN116665110B (en) * 2023-07-25 2023-11-10 上海蜜度信息技术有限公司 Video action recognition method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination