CN111860188A

CN111860188A - Human body posture recognition method based on time and channel double attention

Info

Publication number: CN111860188A
Application number: CN202010588253.0A
Authority: CN
Inventors: 张雷; 高文彬; 刘悦
Original assignee: Nanjing Normal University
Current assignee: Nanjing Normal University
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2020-10-30

Abstract

The invention discloses a human body posture recognition method based on time and channel double attention, comprising the following steps: collecting raw data of various human actions with the built-in sensors of a mobile device and attaching an action label to each recording; segmenting the data with a sliding window, normalizing it, and dividing it into a training sample set and a test sample set; building a deep convolutional neural network model based on time and channel double attention; and training and tuning the model on the training and test samples to obtain the recognition result for the human actions. Because channel attention and time sequence attention are stacked, the method, once trained on a large amount of coarsely labeled training data, can accurately locate both the type and the time of occurrence of the target action. This greatly reduces the effort of manually labeling training data and is valuable in sports, interactive games, medical care, general monitoring systems, and similar applications.

Description

Human body posture recognition method based on time and channel double attention

Technical Field

The invention belongs to the field of intelligent monitoring of wearable equipment, and particularly relates to a human body posture identification method based on time and channel double attention.

Background

In recent years, the development of computer technology and the spread of intelligent technology have set off a new round of global technological change, and technologies such as large-scale cloud computing, the Internet of Things, big data, and artificial intelligence have developed rapidly. Human body posture recognition is an important research direction in fields related to computer vision. Its range of application is wide: it can be used in health monitoring, motion detection, human-computer interaction, film and television production, gaming and entertainment, and other fields. Sensors worn on the body can collect motion trajectory data of the body's joints for posture recognition, and the same data can drive 3D animation that reproduces human motion for film and television production.

With progress in research on intelligent wearable devices, human body posture recognition based on wearable sensors has become an important research field. The technology determines the state of a person's motion by analyzing information that reflects that motion. It is applied in health monitoring, indoor positioning and navigation, analysis of users' social behavior, motion-sensing games, and the like. However, most existing human body posture recognition systems suffer from low recognition accuracy and slow inference, so building a high-accuracy network model while maintaining inference speed has become an urgent problem.

At present, the most widespread application of human body posture recognition is intelligent monitoring. Intelligent monitoring differs from ordinary monitoring mainly in that posture recognition is embedded in the video server: an algorithm recognizes and judges the behavior of moving objects in the monitored scene, such as pedestrians and vehicles, extracts the key information, and alerts the user promptly when abnormal behavior occurs. Posture recognition in a fixed scene can likewise be applied to home monitoring. For example, to guard against falls by elderly people living alone, intelligent monitoring equipment that recognizes falling postures can be installed in the home so that a fall is detected and an emergency is answered in time. As society develops and the quality of life improves, video monitoring has come into wide use in every field. People's living space keeps expanding, public and private places develop with it, and the probability of encountering emergencies keeps rising, especially in public places, which are densely populated and hard to monitor. Simple monitoring can no longer meet the needs of social development: relying only on operators on duty makes it very difficult to predict human postures and potentially wastes social resources. An intelligent monitoring system that does not depend on individual operators is therefore a necessary way to address this problem. In addition, in social interaction, body movements other than speech also convey information; reading the meaning of these movements through sound and reasonable prediction can help people interact socially.

Deep learning has good prospects in pattern recognition, where model architectures represented by shallow convolutional neural networks occupy the mainstream. Convolutional neural networks have attracted great attention in computer vision; they can process multidimensional data and, given a large amount of data, outperform traditional methods by a clear margin. Compared with traditional machine learning methods such as logistic regression, decision trees, and Markov models, shallow deep learning methods improve accuracy significantly, but because they have few convolutional layers their feature maps carry limited information. Moreover, ordinary convolution cannot accurately locate the time of occurrence and the type of a target action within a long run of data. How to accurately locate the time and category of the target action while preserving the model's recognition accuracy has therefore become the first problem for current researchers to solve.

Disclosure of Invention

Object of the invention: in view of the above problems, the object of the present invention is to provide a human body posture recognition method based on time and channel double attention that not only improves model accuracy with a deeper convolutional neural network, but also accurately locates, on the channel axis, the type of sensor that responds and, on the time axis, the time at which the target action occurs.

The technical scheme is as follows: the invention provides a human body posture recognition method based on time and channel double attention, which comprises the following steps:

Step1, acquiring human posture and action signal data for each activity type (for example lying down, standing up, walking, running, and falling) through a motion sensor, and attaching the corresponding action label to the signal data;

Step2, preprocessing the collected action signal data and dividing the processed data into a training sample set and a test sample set; the preprocessing comprises: applying a sliding window with a fixed step to the data (a smaller step yields more windows), then denoising the resulting signals, removing null values, and normalizing them, that is, scaling them proportionally so that they fall into the interval (0, 1);

Step3, feeding the processed data as input samples into a multi-attention deep convolutional neural network for training; after the learning rate, the optimizer, and a fixed batch size are set, gradient descent repeatedly reduces the loss of the deep convolutional neural network model and updates each weight parameter until the loss falls below a preset value, which yields the trained model;

Step4, classifying and recognizing the human posture data to be recognized with the trained model.

Further, in Step1, the down-sampling frequency of the sensor is set to 20 Hz to 40 Hz when the sensor data signals are collected.

Further, Step2 includes removing outliers and null values from the data and rebalancing the number of samples in each activity category so that the categories are evenly distributed in the data set; 70% of the data is used as training samples and 30% as test samples.
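The preprocessing in Step2 can be sketched as follows. This is only an illustrative sketch using the Python standard library: the function names, the toy recording, the window length of 8, and the step of 4 are all hypothetical, and per-axis min-max scaling is assumed as the normalization because the text only says the data are scaled into the interval (0, 1).

```python
import random

def min_max_normalize(samples):
    """Scale each sensor axis (column) independently into [0, 1] (assumed scheme)."""
    columns = list(zip(*samples))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in samples]

def sliding_windows(samples, window, step):
    """Cut a long recording into fixed-size windows; a smaller step gives more windows."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]

def train_test_split(windows, labels, train_fraction=0.7, seed=0):
    """Shuffle the windows and split them into training and test samples."""
    order = list(range(len(windows)))
    random.Random(seed).shuffle(order)
    cut = round(len(order) * train_fraction)
    train = [(windows[i], labels[i]) for i in order[:cut]]
    test = [(windows[i], labels[i]) for i in order[cut:]]
    return train, test

# Toy recording: 44 timesteps of 3-axis acceleration (made-up values).
recording = [[float(t), float(2 * t), float(-t)] for t in range(44)]
normalized = min_max_normalize(recording)
windows = sliding_windows(normalized, window=8, step=4)
labels = ["walking"] * len(windows)
train_set, test_set = train_test_split(windows, labels)
```

With these toy settings the recording yields 10 windows, of which 7 go to the training set and 3 to the test set.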

Further, Step3 specifically includes the following contents:

3.1, establishing a 6-layer convolutional attention neural network model. The model as a whole consists of three convolution blocks, each containing two convolutional feature-extraction layers and one skip (jump) convolution layer;

A: constructing the convolutional neural network from 6 convolutional layers:

The convolutional neural network is built from 6 convolutional layers, with every two convolutional layers forming one convolution block. A skip convolution layer is added alongside the two convolutional layers of each block so that the input data, of dimension $\mathbb{R}^{C \times H \times W}$ (C is the number of input data channels, H the data height, and W the data width), and the output data, of dimension $\mathbb{R}^{C' \times H \times W}$ (C' is the number of output data channels), can be linearly weighted. The mapping C → C' is determined by the number of channels in the convolutional layers. Once the 6 convolutional layers are built, the output data are sent to the attention network, where they are linearly weighted by the attention weights, and finally to a fully connected layer that computes the human action classification.

B: establishing a channel attention network:

A channel attention network and a time sequence attention network are added, in that order, after the two convolutional layers of each of the model's 3 blocks. The channel attention network acts on the channel dimension of the feature map in order to determine how important each sensor feature of the convolutional features is along the channel dimension;

The time dimension of the feature is first compressed with average pooling and max pooling to produce an average time sequence feature, AvgPool(F), and a maximum time sequence feature, MaxPool(F). Both features are then passed through a multilayer perceptron, and the results are summed element by element to give the final output feature:

$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$

where σ denotes the sigmoid activation function, MLP denotes a standard multilayer perceptron, and AvgPool and MaxPool denote the average pooling and max pooling operations;

The features generated by the attention network are applied along the channel dimension of the sensor data, which yields a weight for each channel, so that the sensor axis corresponding to a given human posture or action can be accurately located.
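The channel attention formula above can be illustrated with a small standard-library sketch. It assumes a feature map of C channels by T timesteps (width 1), a shared two-layer perceptron with a ReLU hidden layer, and multiplication of each channel by its weight; the weight matrices `w1` and `w2` and the toy feature values are hypothetical hand-picked numbers, not trained parameters from the invention.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(vector, w1, w2):
    """Shared two-layer perceptron: hidden = ReLU(w1 . v), output = w2 . hidden."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vector))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def channel_attention(feature, w1, w2):
    """feature: C channels x T timesteps. Returns one weight in (0, 1) per channel."""
    avg = [sum(ch) / len(ch) for ch in feature]   # AvgPool over the time dimension
    mx = [max(ch) for ch in feature]              # MaxPool over the time dimension
    summed = [a + m for a, m in zip(mlp(avg, w1, w2), mlp(mx, w1, w2))]
    return [sigmoid(s) for s in summed]

# Toy feature map: 3 sensor channels, 4 timesteps (made-up values).
feature = [[0.1, 0.2, 0.3, 0.2],
           [2.0, 3.0, 4.0, 3.0],
           [0.0, 0.0, 0.0, 0.0]]
w1 = [[0.5, 0.5, 0.5]]           # hypothetical weights, 3 channels -> 1 hidden unit
w2 = [[1.0], [1.0], [-1.0]]      # hypothetical weights, 1 hidden unit -> 3 channels
weights = channel_attention(feature, w1, w2)

# Apply the weights: each channel is scaled by its own attention weight.
weighted = [[w * v for v in ch] for w, ch in zip(weights, feature)]
```

Channels with a weight near 1 are emphasized and channels with a weight near 0 are suppressed, which is how the attention map points to the sensor axes that matter for an action.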

C: establishing a time sequence attention network:

After the channel attention network has been added to each convolution block, a time sequence attention network is established next. Time sequence attention acts on the time dimension of the feature map in order to determine the exact position of the target action on the time axis of the sensor data. The channel dimension is first compressed with average pooling and max pooling to produce an average channel feature, AvgPool(F), and a maximum channel feature, MaxPool(F), each in $\mathbb{R}^{1 \times H \times W}$. Here AvgPool(F) is the feature obtained by average pooling over the channels along the time axis, MaxPool(F) is the feature obtained by max pooling over the channels along the time axis, H and W are the height and width of these features, and each feature has a single channel. The two features are then concatenated and passed through a convolutional layer to give the final feature:

$$M_T(F) = \sigma\big(\mathrm{conv}^{7 \times 1}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$

where σ denotes the sigmoid activation function, $\mathrm{conv}^{7 \times 1}$ denotes a convolutional layer with a 7 × 1 kernel, and AvgPool and MaxPool denote the average pooling and max pooling operations, respectively;
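A minimal sketch of the time sequence attention formula, again with the standard library only. It assumes a feature map of C channels by T timesteps (width 1) and zero padding of 3 on each side so that the 7 × 1 convolution keeps the sequence length; the padding, the kernel values, and the toy feature are assumptions made for illustration and are not stated in the original text.

```python
import math

def temporal_attention(feature, kernel_avg, kernel_max, bias=0.0):
    """feature: C channels x T timesteps. Returns one weight in (0, 1) per timestep."""
    steps = len(feature[0])
    # Compress the channel dimension at every timestep.
    avg = [sum(ch[t] for ch in feature) / len(feature) for t in range(steps)]
    mx = [max(ch[t] for ch in feature) for t in range(steps)]
    # Zero-pad both pooled sequences so the output has T entries (assumed padding).
    pad = len(kernel_avg) // 2
    avg_p = [0.0] * pad + avg + [0.0] * pad
    max_p = [0.0] * pad + mx + [0.0] * pad
    weights = []
    for t in range(steps):
        # 7 x 1 convolution over the two stacked channels [AvgPool(F); MaxPool(F)].
        s = bias
        for k in range(len(kernel_avg)):
            s += kernel_avg[k] * avg_p[t + k] + kernel_max[k] * max_p[t + k]
        weights.append(1.0 / (1.0 + math.exp(-s)))   # sigmoid
    return weights

# Toy feature map: 2 channels, 10 timesteps, with a burst of activity at steps 4-6.
feature = [[0, 0, 0, 0, 5, 5, 5, 0, 0, 0],
           [0, 0, 0, 0, 3, 3, 3, 0, 0, 0]]
kernel = [1.0 / 7] * 7   # hypothetical uniform kernel, used for both input channels
time_weights = temporal_attention(feature, kernel, kernel)
```

In this toy case the weights peak around the burst and stay at 0.5 far from it, which mirrors the idea of locating the target action on the time axis.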

3.2, importing the training samples to adjust the parameters of the convolutional neural network model and obtain a model with high accuracy.

In the convolutional neural network model, the first-layer convolution kernel size is (6, 1) with stride (2, 1); the second-layer convolution kernel size is (6, 1) with stride (2, 1); and the third-layer convolution kernel size is (6, 1) with stride (2, 1). Convolutional-layer padding is set to (1, 0). Every layer uses the ReLU activation function and is followed by a BatchNorm layer to reduce the likelihood of overfitting. Finally, to obtain a more decisive classification, the class is output through a Softmax layer.
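As a worked example of these settings, the length of the time axis after each strided convolution follows the standard output-size formula floor((L + 2p - k) / s) + 1 with kernel k = 6, stride s = 2, and padding p = 1. The input window length of 128 below is a hypothetical value chosen only for illustration; the text does not state the window length.

```python
def conv_output_length(length, kernel=6, stride=2, padding=1):
    """Output length of a 1-D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1

length = 128          # hypothetical window length, not given in the text
lengths = []
for _ in range(3):    # the three layers listed above share the same settings
    length = conv_output_length(length)
    lengths.append(length)
# lengths is now [63, 30, 14]
```

Because the kernel and stride along the second axis are both 1 and the padding there is 0, the sensor-axis dimension is unchanged by these layers.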

Advantageous effects: compared with the prior art, the invention makes the following notable advances:

The raw data are resampled in frequency so that the time dimension and the feature dimension can be fused and considered together, and high-accuracy discrimination is achieved after training the deep convolutional neural network. Channel attention and time sequence attention are applied in turn to the features extracted by convolution, and the resulting attention features are linearly weighted with the original features to obtain richer sensor data features; the convolutional neural network is trained on these, and the trained network model performs the human body posture recognition. Because an attention mechanism is added on both the channel axis and the time axis of the extracted convolutional features, the time and the category of the target action can be effectively located within the features, so the method works with coarsely labeled human action data and reduces the effort of manual labeling. The invention also uses a sliding window to preprocess the data quickly without losing the action characteristics, which avoids the shortcomings of traditional data processing.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of the present invention;

FIG. 3 is a waveform plot of a small batch of the raw triaxial acceleration data of the present invention;

FIG. 4 is a graph of the error against the number of training epochs in the present invention;

FIG. 5 is a visualization of the sensor training data channel dimensions in the present invention;

FIG. 6 is a visualization of the time dimension of sensor training data in the present invention;

FIG. 7 is a graph of a confusion matrix for a test data set of the present invention.

Detailed Description

The technical solution and effects of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention provides a human body posture recognition method based on time and channel double attention, which comprises the following steps:

Step1, recruiting volunteers to wear motion sensors, recording triaxial acceleration data at different body parts (such as the wrist, chest, and leg) while the volunteers perform different actions (such as standing, sitting, going upstairs, going downstairs, jumping, and walking), and attaching the corresponding action type label to the signal data;

Step2, cleaning and denoising the acquired triaxial acceleration data, resampling the cleaned data in frequency, normalizing it, and dividing it into a training set and a test set, wherein the frequency resampling and normalization are as follows: the time-series signal is down-sampled in frequency and arranged into a data signal map, and the resulting map is normalized, that is, scaled to fall into the interval (0, 1);

Step3, the processed data form a four-dimensional tensor containing data, feature, and channel information. The processed data are then fed as input samples into the convolutional neural network for training; the batch size and the learning rate are set, and the weight parameters are updated automatically by backpropagation to obtain the optimal convolutional neural network model;

The human body posture identification method based on the deep convolutional neural network can identify six action postures of jumping, walking, going upstairs, going downstairs, standing and sitting.

FIG. 1 is a flow chart of the present invention: data are collected from the raw sensors and preprocessed, the preprocessed data are input to the convolutional neural network for model training, and the model obtained after training is applied to the human posture data to be recognized, thereby achieving human posture recognition.

FIG. 2 is a block diagram of the convolutional neural network model based on time and channel double attention, which contains six convolutional layers and a final classification layer. The figure also shows the internal structure of the attention module: after data are input, the channel attention module generates the channel attention feature, which is linearly weighted with the original convolutional feature; the time sequence attention module then generates the time sequence attention feature, which is linearly weighted with the channel attention feature to produce the final sensor data feature.
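One possible way to write the sequential weighting described for FIG. 2 is given below as a sketch. Here F is the convolutional feature, M_C and M_T are the channel and time sequence attention maps, and ⊗ denotes element-wise multiplication with the attention map broadcast over the remaining dimensions. Reading the "linear weighting" as multiplication, and the symbols F' and F'', are assumptions for illustration and are not taken from the original text.

```latex
\begin{aligned}
F'  &= M_C(F) \otimes F, \\
F'' &= M_T(F') \otimes F'.
\end{aligned}
```

Under this reading, F' is the channel-weighted feature and F'' is the final sensor data feature passed on to the next block or to the classifier.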

Specifically, the human posture and action signal data of each type collected from the mobile sensor are first resampled in frequency as a time-series signal and normalized. The processed data are sent to the convolutional neural network, where convolution produces the corresponding convolutional features. These features are then passed through the channel attention module and the time sequence attention module in turn, and the resulting attention features are linearly weighted with the original features to obtain the final attention features. The attention mechanism of each convolutional layer is implemented as shown in the attention module of FIG. 2. In the experiments, the convolution kernel size was (6, 1), the stride was (2, 1), the padding was set to (1, 0), and there were 128 convolution kernels in total; the ReLU activation function was used and a BatchNorm layer was added.

B: Channel attention:

To determine how important each sensor feature of the convolutional features is along the channel dimension, channel attention is applied to the channel dimension of the feature map. The time dimension of the feature is first compressed with average pooling and max pooling to produce an average time sequence feature, AvgPool(F), and a maximum time sequence feature, MaxPool(F). Both features are then passed through a multilayer perceptron, and the results are summed element by element to give the final output feature:

$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$

where σ denotes the sigmoid activation function.

C: Time sequence attention:

To determine the exact position of the target action on the time axis of the sensor data, time sequence attention is applied to the time dimension of the feature map. The channel dimension is first compressed with average pooling and max pooling to produce an average channel feature, AvgPool(F), and a maximum channel feature, MaxPool(F). The two features are then concatenated and passed through a convolutional layer to give the final feature:

$$M_T(F) = \sigma\big(\mathrm{conv}^{7 \times 1}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$

where σ denotes the sigmoid activation function and $\mathrm{conv}^{7 \times 1}$ denotes a convolutional layer with a 7 × 1 kernel.

The training samples are then imported to adjust the parameters of the convolutional neural network model and obtain a model with high accuracy.

During network training, a dynamic learning rate is used to keep the oscillation of the training curve small: the initial learning rate is set to 0.001 and is reduced to 0.1 times its current value every 50 epochs.
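The schedule described above is a step decay, which can be written as lr(epoch) = 0.001 × 0.1^floor(epoch / 50). The short sketch below computes it; the function and parameter names are hypothetical.

```python
def learning_rate(epoch, initial=0.001, decay=0.1, every=50):
    """Step decay: multiply the rate by `decay` once every `every` epochs."""
    return initial * decay ** (epoch // every)

# Learning rate at the start of a few epochs of a 500-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 49, 50, 100, 499)}
```

The rate stays at 0.001 for epochs 0 to 49, drops to about 0.0001 at epoch 50, and keeps shrinking by a factor of ten every 50 epochs after that.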

Compared with a traditional convolutional neural network, the method uses a deep neural network to extract richer sensor data features and applies channel attention and time sequence attention to the sensor data in turn to generate feature maps with attention characteristics. Because channel attention and time sequence attention are stacked, the method, once trained on a large amount of coarsely labeled data, can accurately locate the type and time of occurrence of the target action, which greatly reduces the effort of manually labeling training data. Experimental comparison shows that the disclosed method is clearly more accurate than a traditional convolutional neural network and locates actions better.

FIG. 3 is a plot of a small batch of waveforms of raw sensor triaxial acceleration data. The down-sampling frequency of the motion sensor is preferably set to about 33 Hz.

FIG. 4 is a graph of the error of the neural network over 500 epochs of training.

As the error graph shows, model accuracy rises steadily as the deep convolutional network and the attention modules are applied. When the deep convolutional network and the double attention module are used together, the model reaches its best accuracy and its generalization ability is greatly improved.

FIG. 5 is a visualization of the sensor training data channel dimensions in the present invention.

By visualizing the channel dimension of the training data, the channels that respond when the target action occurs can be identified, and with them the type of sensor involved. This is of guiding significance for human action recognition, health detection, and similar applications.

FIG. 6 is a visualization of the time dimension of sensor training data in the present invention.

By visualizing the time dimension of the training data, the region in which the target action occurs, and its category, can be accurately located within a long run of sensor data that also contains background actions. With this semi-supervised attention mechanism, a more accurate human action classification result can be obtained by training on a large amount of coarsely labeled samples. This relaxes the requirement for strictly labeled training data, saves the effort of manual labeling, and makes human posture recognition more convenient and faster in practice.

A confusion matrix is a technique for summarizing the performance of a classification algorithm. If the classes contain unequal numbers of samples, or the data set has more than two classes, classification accuracy alone can be misleading as a criterion. Computing the confusion matrix gives a better picture of how the classification model behaves and what kinds of errors it makes. In FIG. 7, the horizontal axis is the predicted label, the vertical axis is the true label, and the entries on the main diagonal are the numbers of samples whose predicted label matches the true label.

Analyzing the confusion matrix shows how accurately the convolutional neural network model recognizes each action, which guides modification of the network parameters. The final classification accuracy of the model is 98.87%, which meets the requirements of practical application.
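A minimal sketch of how a confusion matrix of this kind is computed, with rows as true labels and columns as predicted labels. The class names and the six toy predictions are made up for illustration and are not the experimental data of the invention.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of samples on the main diagonal (prediction equals true label)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

classes = ["walking", "sitting", "jumping"]
y_true = ["walking", "walking", "sitting", "jumping", "jumping", "sitting"]
y_pred = ["walking", "sitting", "sitting", "jumping", "walking", "sitting"]
matrix = confusion_matrix(y_true, y_pred, classes)
overall = accuracy(matrix)
```

Off-diagonal entries show which actions are confused with each other; in the toy data, one "walking" sample is mistaken for "sitting" and one "jumping" sample for "walking".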

Claims

1. A human body posture recognition method based on time and channel double attention is characterized by comprising the following steps:

Step1, acquiring human body posture and action signal data for each activity type through a mobile sensor, and attaching the corresponding action label to the action signal data;

Step2, preprocessing the collected action signal data and dividing the processed data into a training sample set and a test sample set; the preprocessing comprises: applying a sliding window with a fixed step to the data so that the original long sensor recording is cut into segments of fixed size, then denoising the resulting signals, removing null values, and normalizing them, that is, scaling them proportionally so that they fall into the interval (0, 1);

Step3, feeding the processed data as input samples into a multi-attention deep convolutional neural network for training, setting the learning rate, the optimizer, and a fixed batch size, and using gradient descent to repeatedly reduce the loss of the deep convolutional neural network model and update each weight parameter until the loss falls below a preset value, thereby obtaining the trained model.

2. The human body posture recognition method based on time and channel double attention as claimed in claim 1, characterized in that in Step1, the down-sampling frequency of the sensor is set to 20 Hz to 40 Hz when the sensor data signals are collected.

3. The human body posture recognition method based on time and channel double attention as claimed in claim 1, characterized in that in Step2, the data processing includes removing outliers and null values from the data and rebalancing the number of samples in each activity category so that the categories are evenly distributed in the data set, with 70% of the data used as training samples and 30% as test samples.

4. The method for recognizing human body posture based on time and channel double attention as claimed in claim 1, wherein Step3 specifically comprises the following steps:

3.1, establishing a 6-layer convolutional attention neural network model, the model as a whole consisting of three convolution blocks, each containing two convolutional feature-extraction layers and one skip (jump) convolution layer;

constructing the convolutional neural network from 6 convolutional layers, wherein every two convolutional layers form one convolution block, and a skip convolution layer is added alongside the two convolutional layers of each block so that the input data, of dimension $\mathbb{R}^{C \times H \times W}$, and the output data, of dimension $\mathbb{R}^{C' \times H \times W}$, can be linearly weighted, wherein the mapping C → C' is determined by the number of channels in the convolutional layers, C is the number of input data channels, H is the data height, W is the data width, and C' is the number of output data channels; after the 6 convolutional layers are built, the output data are sent to the attention network for linear weighting by the attention weights and then to a fully connected layer for classification of the human actions;

B: establishing a channel attention network:

adding a channel attention network and a time sequence attention network, in that order, after the two convolutional layers of each of the 3 blocks, the channel attention network acting on the channel dimension of the feature map in order to determine how important each sensor feature of the convolutional features is along the channel dimension; compressing the time dimension of the feature with average pooling and max pooling to produce an average time sequence feature, AvgPool(F), and a maximum time sequence feature, MaxPool(F), then passing both features through a multilayer perceptron and summing the results element by element to give the final output feature:

$$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$

wherein σ denotes the sigmoid activation function, MLP denotes a standard multilayer perceptron, and AvgPool and MaxPool denote the average pooling and max pooling operations;

C: establishing a time sequence attention network:

after the channel attention network has been added to each convolution block, establishing a time sequence attention network, the time sequence attention acting on the time dimension of the feature map in order to determine the exact position of the target action on the time axis of the sensor data; first compressing the channel dimension with average pooling and max pooling to produce an average channel feature, AvgPool(F), and a maximum channel feature, MaxPool(F), wherein AvgPool(F) is the feature obtained by average pooling over the channels along the time axis, MaxPool(F) is the feature obtained by max pooling over the channels along the time axis, and each feature has a single channel; then concatenating the two features and passing them through a convolutional layer to give the final feature:

$$M_T(F) = \sigma\big(\mathrm{conv}^{7 \times 1}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$

3.2, importing the training samples to adjust the parameters of the convolutional neural network model and train the model;

in the convolutional neural network model, the first-layer convolution kernel size is (6, 1) with stride (2, 1); the second-layer convolution kernel size is (6, 1) with stride (2, 1); the third-layer convolution kernel size is (6, 1) with stride (2, 1); convolutional-layer padding is set to (1, 0); every layer uses the ReLU activation function and is followed by a BatchNorm layer to reduce overfitting; and the classification is finally output through a Softmax layer.