CN112668443A - Human body posture identification method based on two-channel convolutional neural network - Google Patents

Human body posture identification method based on two-channel convolutional neural network Download PDF

Info

Publication number
CN112668443A
Authority
CN
China
Prior art keywords
layer, long, time, window, multiplied
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011547116.9A
Other languages
Chinese (zh)
Inventor
白雪茹 (Bai Xueru)
刘潇丹 (Liu Xiaodan)
惠叶 (Hui Ye)
周峰 (Zhou Feng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202011547116.9A priority Critical patent/CN112668443A/en
Publication of CN112668443A publication Critical patent/CN112668443A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a human posture identification method based on a two-channel convolutional neural network, which addresses two problems of the prior art: a time-frequency graph generated with a single STFT window length cannot clearly reflect the characteristics of the limbs and the trunk of the human body at the same time, and the set of recognizable posture categories is difficult to extend. The method is implemented in the following steps: (1) construct a two-channel convolutional neural network; (2) generate a training set; (3) train the two-channel convolutional neural network; (4) recognize the human body posture category. The constructed two-channel convolutional neural network exploits the detailed characteristics of the limbs and the trunk of the human body simultaneously and extends the range of recognizable human posture categories.

Description

Human body posture identification method based on two-channel convolutional neural network
Technical Field
The invention belongs to the technical field of radar and further relates to a human body posture identification method based on a two-channel convolutional neural network in the field of radar target recognition. The method can be used to identify and classify the postures of pedestrians on a road surface from the time-frequency graphs of radar echo data.
Background
Radar echo signals of human motion contain Doppler frequencies produced by the micro-motion modulation of each part of the body. A time-frequency graph can be generated by applying a short-time Fourier transform (STFT) to the radar echo signal, and the main goal of human posture identification based on radar echo signals is to determine the posture category from this time-frequency graph. Human posture recognition based on radar time-frequency graphs can therefore be regarded as a classification problem whose input is a time-frequency graph and whose output is a posture category. Traditional radar-based human posture classification methods rely mainly on manually extracting micro-Doppler features from the time-frequency graph; they are computationally complex and extend poorly to new target categories. At present, convolutional neural networks have become the mainstream approach to human posture recognition because of their strong image representation capability.
The patent document of Tsinghua University, "Human body gait recognition method based on micro-Doppler features and a support vector machine" (patent application No. 201610626219.1, publication No. CN 106250854A), discloses a human gait recognition method. The specific steps of the method are: 1. collect gait data of a walking human body with a radar; 2. analyze the gait data with a time-frequency analysis tool to obtain the corresponding time-frequency graph; 3. extract bandwidth features and bias features from the time-frequency graph by calculating the span of the positive and negative micro-Doppler frequencies caused by the gait and their deviation from the center frequency; 4. input the extracted bandwidth and bias features into a support vector machine for gait recognition to determine the posture corresponding to the gait data. The drawback of this method is that whenever a new posture category is added to the data set, the bandwidth and bias features of that category must be recomputed before they can be used, so the method extends poorly to new human posture categories.
The patent document of Tianjin University, "Convolutional neural network human action classification method based on radar simulation images" (patent application No. 201710325528.X, publication No. CN 107169435A), discloses a human action classification method. The specific steps of the method are: 1. establish a time-frequency image data set with a single STFT window length containing various human actions; 2. augment the radar time-frequency image data; 3. establish a convolutional neural network model; 4. train the convolutional neural network model. The drawback of this method is that only time-frequency graphs with a single STFT window length are used as the sample set: when the STFT window length is short, the limb characteristics are prominent in the time-frequency graph, whereas when the STFT window length is long, the trunk characteristics are prominent, so a time-frequency graph with a single STFT window length cannot clearly reflect the limb and trunk characteristics at the same time.
Disclosure of Invention
In view of the shortcomings of the prior art, the aim of the invention is to provide a human body posture identification method based on a two-channel convolutional neural network that solves two problems: a time-frequency graph with a single STFT window length cannot clearly reflect the limb and trunk characteristics of the human body at the same time, and the prior art extends poorly to new human posture categories.
The idea of the invention is as follows: first, a short-window-length channel module and a long-window-length channel module are built, and a two-channel convolutional neural network is constructed on top of them; the inputs of the network are a short-window-length time-frequency graph and a long-window-length time-frequency graph obtained by applying an STFT with a short window length and an STFT with a long window length, respectively, to the radar echo signal. When a new category is added, the structural parameters of the two-channel convolutional neural network do not need to be changed, and multi-category human postures can be recognized directly.
To achieve this aim, the method comprises the following specific steps:
(1) constructing a two-channel convolutional neural network:
(1a) build an 11-layer short-window-length channel module whose structure is, in order: a first convolution layer, a ReLU activation layer, a first pooling layer, a second convolution layer, a ReLU activation layer, a second pooling layer, a third convolution layer, a ReLU activation layer, a third pooling layer, a fourth convolution layer and a fully connected layer;
the parameters of each layer are set as follows: the numbers of convolution kernels of the first to fourth convolution layers are 8, 16, 32 and 64, with kernel sizes 9×9, 5×5, 7×7 and 6×6, respectively; the first to third pooling layers all use max pooling, with pooling kernel sizes 3×3, 2×2 and 2×2 and pooling strides 3, 2 and 2; the output dimension of the fully connected layer is 30;
(1b) build an 11-layer long-window-length channel module whose structure is, in order: a first convolution layer, a ReLU activation layer, a first pooling layer, a second convolution layer, a ReLU activation layer, a second pooling layer, a third convolution layer, a ReLU activation layer, a third pooling layer, a fourth convolution layer and a fully connected layer;
the parameters of each layer are set identically: the numbers of convolution kernels of the first to fourth convolution layers are 8, 16, 32 and 64, with kernel sizes 9×9, 5×5, 7×7 and 6×6, respectively; the first to third pooling layers all use max pooling, with pooling kernel sizes 3×3, 2×2 and 2×2 and pooling strides 3, 2 and 2; the output dimension of the fully connected layer is 30;
(1c) build a classification module consisting of a concat layer with a concatenation dimension of 60, a fully connected layer with N output neurons and a softmax layer, where the softmax layer uses the softmax function to compute the probability of the input sample being identified as each category;
(1d) connect the short-window-length channel module and the long-window-length channel module in parallel and then connect them in series with the classification module to form the two-channel convolutional neural network;
(2) generating a training set:
(2a) select N human posture categories, each containing at least 150 radar echo signals, where N denotes the total number of human posture categories and N ≥ 3;
(2b) apply a short-time Fourier transform (STFT) with window length 101 and an STFT with window length 201 to each selected radar echo signal;
(2c) group the time-frequency graphs obtained from the STFTs into a short-window-length time-frequency graph set and a long-window-length time-frequency graph set, respectively;
(2d) reduce the size of each time-frequency image in the two sets so that every image is 128×128 pixels;
(2e) combine the processed short-window-length and long-window-length time-frequency graph sets into the training set;
(3) training a two-channel convolutional neural network:
input all time-frequency graphs of the short-window-length set and all time-frequency graphs of the long-window-length set in the training set into the short-window-length channel module and the long-window-length channel module of the two-channel convolutional neural network, respectively, and iteratively update the parameters of every layer of the network by back-propagation gradient descent until the loss of the network drives the trainable parameters toward the values that maximize the probability of correct classification, yielding the trained two-channel convolutional neural network;
(4) recognizing the human body posture category:
(4a) process the radar echo signal of the human posture to be recognized in the same way as steps (2b) and (2d) to obtain its short-window-length and long-window-length time-frequency graphs, each of size 128×128 pixels;
(4b) input the short-window-length and long-window-length time-frequency graphs of the posture to be recognized into the short-window-length channel module and the long-window-length channel module of the trained two-channel convolutional neural network, respectively, compute the probability of each category through the softmax layer, and select the category with the highest probability as the classification result.
Compared with the prior art, the invention has the following advantages:
First, the invention constructs a short-window-length channel module and a long-window-length channel module whose inputs are the short-window-length and long-window-length time-frequency graphs obtained by applying a short-window-length STFT and a long-window-length STFT, respectively, to the radar echo signal. This overcomes the limitation of the prior art, which uses only time-frequency graphs with a single STFT window length as the training set and therefore cannot reflect the detailed characteristics of the limbs and the trunk of the human body at the same time; the proposed technique extracts the detailed characteristics of both the limbs and the trunk simultaneously.
Second, because the invention constructs a classification module, the recognition probabilities of multi-category human postures can be output without changing the network parameters or structure when a new category is added. This overcomes the poor extensibility of the prior art when new posture categories are added to the data set, and thus extends the range of human posture categories that can be recognized.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The specific steps of the present invention will be further described with reference to fig. 1.
Step 1: construct the two-channel convolutional neural network.
Build an 11-layer short-window-length channel module whose structure is, in order: a first convolution layer, a ReLU activation layer, a first pooling layer, a second convolution layer, a ReLU activation layer, a second pooling layer, a third convolution layer, a ReLU activation layer, a third pooling layer, a fourth convolution layer and a fully connected layer.
The parameters of each layer are set as follows: the numbers of convolution kernels of the first to fourth convolution layers are 8, 16, 32 and 64, with kernel sizes 9×9, 5×5, 7×7 and 6×6, respectively; the first to third pooling layers all use max pooling, with pooling kernel sizes 3×3, 2×2 and 2×2 and pooling strides 3, 2 and 2; the output dimension of the fully connected layer is 30.
Build an 11-layer long-window-length channel module with the same structure, in order: a first convolution layer, a ReLU activation layer, a first pooling layer, a second convolution layer, a ReLU activation layer, a second pooling layer, a third convolution layer, a ReLU activation layer, a third pooling layer, a fourth convolution layer and a fully connected layer.
Its layer parameters are identical: the numbers of convolution kernels of the first to fourth convolution layers are 8, 16, 32 and 64, with kernel sizes 9×9, 5×5, 7×7 and 6×6, respectively; the first to third pooling layers all use max pooling, with pooling kernel sizes 3×3, 2×2 and 2×2 and pooling strides 3, 2 and 2; the output dimension of the fully connected layer is 30.
Build a classification module consisting of a concat layer with a concatenation dimension of 60, a fully connected layer with N output neurons and a softmax layer, where the softmax layer uses the softmax function to compute the probability of the input sample being identified as each category. The softmax function is:
p_j = e^{W_i^T x_n} / Σ_{k=1}^{N} e^{W_k^T x_n}
where p_j represents the probability that an input sample x_n of class n is identified as a class-j sample after the softmax layer, e denotes exponentiation with the natural base e, W_i denotes the component of the weight parameters of the fully connected layer in the classification module associated with the i-th output neuron (i and j take equal values), the superscript T denotes transposition, and k is the category index.
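As a small numerical illustration of this softmax computation (the logits W_i^T x_n below are made-up values, not taken from the patent), the following Python sketch shows how the category probabilities are obtained and that they sum to one:

```python
import numpy as np

# Illustrative logits z_i = W_i^T x_n for N = 3 posture categories (made-up values).
z = np.array([2.0, 0.5, -1.0])

# p_j = e^{z_j} / sum_k e^{z_k}
p = np.exp(z) / np.exp(z).sum()
print(p)        # approximately [0.786 0.175 0.039]: probabilities of the three categories
print(p.sum())  # 1.0: the softmax outputs form a probability distribution
```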
Connect the short-window-length channel module and the long-window-length channel module in parallel and then connect them in series with the classification module to form the two-channel convolutional neural network.
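A minimal PyTorch sketch of this two-channel architecture is given below. It assumes stride-1, unpadded convolutions (the patent does not state the convolution stride or padding); the class and variable names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

def channel_module() -> nn.Sequential:
    """11-layer channel module: four conv layers (8/16/32/64 kernels of sizes
    9x9, 5x5, 7x7, 6x6), ReLU + max pooling after the first three convs,
    and a fully connected layer with 30 outputs."""
    return nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=9), nn.ReLU(), nn.MaxPool2d(3, stride=3),
        nn.Conv2d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2, stride=2),
        nn.Conv2d(16, 32, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2, stride=2),
        nn.Conv2d(32, 64, kernel_size=6),   # 6x6 input -> 1x1 feature map
        nn.Flatten(),                       # 64-dimensional feature vector
        nn.Linear(64, 30),                  # fully connected layer, output dimension 30
    )

class TwoChannelCNN(nn.Module):
    """Short-window-length and long-window-length channels in parallel,
    followed by the classification module (concat -> fully connected -> softmax)."""
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.short_channel = channel_module()          # fed window-length-101 STFT images
        self.long_channel = channel_module()           # fed window-length-201 STFT images
        self.classifier = nn.Linear(60, num_classes)   # concat dimension 60 -> N classes

    def forward(self, x_short: torch.Tensor, x_long: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.short_channel(x_short),
                           self.long_channel(x_long)], dim=1)  # concat layer (30 + 30 = 60)
        return self.classifier(fused)   # logits; softmax is applied in the loss / at inference

if __name__ == "__main__":
    net = TwoChannelCNN(num_classes=3)
    xs = torch.randn(2, 1, 128, 128)    # short-window-length time-frequency images
    xl = torch.randn(2, 1, 128, 128)    # long-window-length time-frequency images
    print(net(xs, xl).shape)            # torch.Size([2, 3])
```

Under these assumptions the spatial size of a 128×128 input evolves as 128 → 120 → 40 → 36 → 18 → 12 → 6 → 1 through the convolution and pooling layers, which is why the flattened feature entering the fully connected layer has 64 elements.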
Step 2: generate the training set.
Select N human posture categories, each containing at least 150 radar echo signals, where N denotes the total number of human posture categories and N ≥ 3.
Apply a short-time Fourier transform (STFT) with window length 101 and an STFT with window length 201 to each selected radar echo signal.
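A possible way to generate the two time-frequency graphs with SciPy is sketched below; the sampling rate, window type and dB scaling are assumptions, since the patent specifies only the two window lengths (101 and 201).

```python
import numpy as np
from scipy.signal import stft

def time_frequency_map(echo: np.ndarray, fs: float, window_length: int) -> np.ndarray:
    """STFT magnitude (in dB) of a complex radar echo for a given window length."""
    # return_onesided=False keeps both positive and negative Doppler frequencies,
    # which both carry micro-Doppler information.
    _, _, z = stft(echo, fs=fs, window="hamming", nperseg=window_length,
                   return_onesided=False)
    return 20.0 * np.log10(np.abs(np.fft.fftshift(z, axes=0)) + 1e-12)

fs = 1000.0                                                 # assumed sampling rate in Hz
echo = np.random.randn(4096) + 1j * np.random.randn(4096)   # placeholder for a radar echo
tf_short = time_frequency_map(echo, fs, window_length=101)  # limb details more visible
tf_long = time_frequency_map(echo, fs, window_length=201)   # trunk component more visible
```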
Group the time-frequency graphs obtained from the STFTs into a short-window-length time-frequency graph set and a long-window-length time-frequency graph set, respectively.
For each time-frequency image in the short-window-length set, compute a pixel threshold Th and set all pixel values greater than or equal to Th to Th, where Th is calculated as:
Th = (I_min + I_max) / 2
where I is the matrix formed by all pixel values of the short-window-length time-frequency image, I_max is the maximum of all pixel values in the image, and I_min is the minimum of all pixel values in the image.
For each time-frequency image in the long-window-length set, raise every pixel value to the power 1.2.
Reduce the size of each time-frequency image in the two sets so that every image is 128×128 pixels.
Combine the two processed time-frequency graph sets into the training set.
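The per-image preprocessing described above can be sketched as follows; the resize routine and the shift before the power transform are implementation assumptions (the patent states only the threshold formula, the exponent 1.2 and the 128×128 target size).

```python
import numpy as np
from skimage.transform import resize

def preprocess_short(tf_image: np.ndarray) -> np.ndarray:
    """Clip the short-window-length image at Th = (Imin + Imax) / 2, then resize."""
    th = (tf_image.min() + tf_image.max()) / 2.0
    clipped = np.minimum(tf_image, th)      # values >= Th are set to Th
    return resize(clipped, (128, 128))      # reduce to 128 x 128 pixels

def preprocess_long(tf_image: np.ndarray) -> np.ndarray:
    """Apply a power transform with exponent 1.2 to the long-window-length image, then resize."""
    shifted = tf_image - tf_image.min()     # shift to non-negative values so the
    powered = shifted ** 1.2                # fractional exponent is well defined (assumption)
    return resize(powered, (128, 128))
```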
Step 3: train the two-channel convolutional neural network.
Input all time-frequency graphs of the short-window-length set and all time-frequency graphs of the long-window-length set in the training set into the short-window-length channel module and the long-window-length channel module of the two-channel convolutional neural network, respectively, and iteratively update the parameters of every layer of the network by back-propagation gradient descent until the loss of the network drives the trainable parameters toward the values that maximize the probability of correct classification, yielding the trained two-channel convolutional neural network. The loss of the two-channel convolutional neural network is the cross-entropy loss function, whose expression is:
L = -(1/S) Σ_{s=1}^{S} log(p_j)
where L is the loss value of the network, S is the number of input training samples, log denotes the base-10 logarithm, and p_j is the probability that the input sample is identified as a class-j sample.
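A hedged sketch of the training step is given below, reusing the TwoChannelCNN sketch from Step 1. The optimizer, learning rate, epoch count and the loader (assumed to yield paired short- and long-window-length images with their labels) are assumptions; the patent specifies only back-propagation gradient descent with a cross-entropy loss. Note that PyTorch's CrossEntropyLoss uses the natural logarithm, which differs from the base-10 logarithm above only by a constant factor.

```python
import torch
import torch.nn as nn

def train(net: nn.Module, loader, epochs: int = 50, lr: float = 1e-3, device: str = "cpu"):
    """Iteratively update all layer parameters by back-propagation gradient descent."""
    net.to(device).train()
    criterion = nn.CrossEntropyLoss()                     # cross-entropy loss over the N classes
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(epochs):
        for x_short, x_long, labels in loader:            # paired short/long-window-length images
            x_short, x_long, labels = (t.to(device) for t in (x_short, x_long, labels))
            loss = criterion(net(x_short, x_long), labels)
            optimizer.zero_grad()
            loss.backward()                               # back-propagation
            optimizer.step()                              # gradient descent update
    return net
```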
Step 4: recognize the human body posture category.
Process the radar echo signal of the human posture to be recognized in the same way as in Step 2 to obtain its short-window-length and long-window-length time-frequency graphs, each of size 128×128 pixels.
Input the short-window-length and long-window-length time-frequency graphs of the posture to be recognized into the short-window-length channel module and the long-window-length channel module of the trained two-channel convolutional neural network, respectively, compute the probability of each category through the softmax layer, and select the category with the highest probability as the classification result.
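A minimal inference sketch, assuming the TwoChannelCNN class from the architecture sketch above and 128×128 single-channel inputs:

```python
import torch

@torch.no_grad()
def recognize(net, tf_short, tf_long) -> int:
    """Return the index of the posture category with the highest softmax probability."""
    net.eval()
    x_short = torch.as_tensor(tf_short, dtype=torch.float32).reshape(1, 1, 128, 128)
    x_long = torch.as_tensor(tf_long, dtype=torch.float32).reshape(1, 1, 128, 128)
    probs = torch.softmax(net(x_short, x_long), dim=1)   # softmax layer output
    return int(probs.argmax(dim=1))                      # category with the highest probability
```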
The test recognition accuracy (Accuracy) is obtained from the true categories of the time-frequency graphs to be recognized and the categories predicted by the network, and is calculated as:
Accuracy = ( Σ_{t=1}^{T} h(pic_t, label_t) / T ) × 100%
where T is the number of samples to be recognized, pic_t is the category predicted by the network for the t-th test sample, and label_t is the true category of the t-th test sample; h(pic_t, label_t) equals 1 when pic_t equals label_t, and 0 otherwise.
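The accuracy formula amounts to the fraction of test samples whose predicted category matches the true one; a tiny illustrative check (with made-up labels) is:

```python
import numpy as np

def accuracy(predicted, labels) -> float:
    """Accuracy = (1/T) * sum_t h(pic_t, label_t), expressed as a percentage."""
    predicted, labels = np.asarray(predicted), np.asarray(labels)
    return float((predicted == labels).mean() * 100.0)

print(accuracy([0, 1, 2, 1], [0, 1, 2, 2]))   # 75.0: three of four test samples correct
```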
The effect of the present invention is further illustrated below with a simulation experiment.
1. Simulation experiment conditions:
The hardware platform of the simulation experiment is an Intel Xeon E5-2683 CPU with a main frequency of 2.00 GHz, 64 GB of memory, and an NVIDIA GeForce GTX 1080 Ti GPU.
The software platform of the simulation experiment of the invention is as follows: windows 7 operating system and python 3.6.
The simulation experiment uses radar echo signals of 3 human posture categories: normal walking, walking with high leg lifts, and walking while stooping. The radar echo signals are derived from the motion capture (MOCAP) database provided by the graphics laboratory of Carnegie Mellon University (CMU). An STFT with window length 101 and an STFT with window length 201 are applied to each radar echo signal; each resulting time-frequency graph is 128×128 pixels and stored in .mat format. The number of samples per channel for each gait is listed in Table 1.
Table 1 two-channel sample number settings summary
(Table 1 is presented as an image in the original publication and is not reproduced here.)
2. Simulation content and analysis of results:
In the simulation experiment, the time-frequency graphs of the radar echo signals of the 3 human posture categories are classified both with the prior-art classification method based on a conventional convolutional neural network and with the method of the invention, and the classification results are obtained.
In the simulation experiment, the classification method based on the conventional convolutional neural network is as follows: the single-STFT-window-length time-frequency graphs with window length 101 and those with window length 201 are each input into a conventional single-channel convolutional neural network.
The classification results of the two methods are evaluated with the recognition accuracy (Accuracy).
The recognition accuracy is calculated with the following formula, and all results are listed in Table 2:
Accuracy = ( Σ_{t=1}^{T} h(pic_t, label_t) / T ) × 100%
where T is the number of samples to be recognized, pic_t is the category predicted by the network for the t-th test sample, and label_t is the true category of the t-th test sample; h(pic_t, label_t) equals 1 when pic_t equals label_t, and 0 otherwise.
TABLE 2 comparison table of classification results of the present invention and the conventional convolutional neural network in simulation experiment
(Table 2 is presented as an image in the original publication and is not reproduced here.)
As can be seen from Table 2, the recognition accuracy of the invention is 96.24%, higher than that of the classification methods based on the conventional single-channel convolutional neural network, which shows that the invention achieves higher classification accuracy on human posture time-frequency graphs.
The simulation experiments above show that the two-channel convolutional neural network built by the proposed method reflects the detailed characteristics of the limbs and the trunk of the human body at the same time. It solves the problems of the prior art, in which a time-frequency graph with a single STFT window length cannot clearly reflect the limb and trunk characteristics simultaneously, leading to low recognition accuracy, and in which the set of recognizable human posture categories extends poorly; it is therefore a highly practical human posture identification method.

Claims (2)

1. A human body posture identification method based on a two-channel convolutional neural network, characterized in that a two-channel convolutional neural network consisting of a short-window-length channel module, a long-window-length channel module and a classification network is constructed; the method specifically comprises the following steps:
(1) constructing a two-channel convolutional neural network:
(1a) build an 11-layer short-window-length channel module whose structure is, in order: a first convolution layer, a ReLU activation layer, a first pooling layer, a second convolution layer, a ReLU activation layer, a second pooling layer, a third convolution layer, a ReLU activation layer, a third pooling layer, a fourth convolution layer and a fully connected layer;
the parameters of each layer are set as follows: the numbers of convolution kernels of the first to fourth convolution layers are 8, 16, 32 and 64, with kernel sizes 9×9, 5×5, 7×7 and 6×6, respectively; the first to third pooling layers all use max pooling, with pooling kernel sizes 3×3, 2×2 and 2×2 and pooling strides 3, 2 and 2; the output dimension of the fully connected layer is 30;
(1b) build an 11-layer long-window-length channel module whose structure is, in order: a first convolution layer, a ReLU activation layer, a first pooling layer, a second convolution layer, a ReLU activation layer, a second pooling layer, a third convolution layer, a ReLU activation layer, a third pooling layer, a fourth convolution layer and a fully connected layer;
the parameters of each layer are set identically: the numbers of convolution kernels of the first to fourth convolution layers are 8, 16, 32 and 64, with kernel sizes 9×9, 5×5, 7×7 and 6×6, respectively; the first to third pooling layers all use max pooling, with pooling kernel sizes 3×3, 2×2 and 2×2 and pooling strides 3, 2 and 2; the output dimension of the fully connected layer is 30;
(1c) build a classification module consisting of a concat layer with a concatenation dimension of 60, a fully connected layer with N output neurons and a softmax layer, where the softmax layer uses the softmax function to compute the probability of the input sample being identified as each category;
(1d) connect the short-window-length channel module and the long-window-length channel module in parallel and then connect them in series with the classification module to form the two-channel convolutional neural network;
(2) generating a training set:
(2a) select N human posture categories, each containing at least 150 radar echo signals, where N denotes the total number of human posture categories and N ≥ 3;
(2b) apply a short-time Fourier transform (STFT) with window length 101 and an STFT with window length 201 to each selected radar echo signal;
(2c) group the time-frequency graphs obtained from the STFTs into a short-window-length time-frequency graph set and a long-window-length time-frequency graph set, respectively;
(2d) reduce the size of each time-frequency image in the two sets so that every image is 128×128 pixels;
(2e) combine the processed short-window-length and long-window-length time-frequency graph sets into the training set;
(3) training a two-channel convolutional neural network:
input all time-frequency graphs of the short-window-length set and all time-frequency graphs of the long-window-length set in the training set into the short-window-length channel module and the long-window-length channel module of the two-channel convolutional neural network, respectively, and iteratively update the parameters of every layer of the network by back-propagation gradient descent until the loss of the network drives the trainable parameters toward the values that maximize the probability of correct classification, yielding the trained two-channel convolutional neural network;
(4) recognizing the human body posture category:
(4a) process the radar echo signal of the human posture to be recognized in the same way as steps (2b) and (2d) to obtain its short-window-length and long-window-length time-frequency graphs, each of size 128×128 pixels;
(4b) input the short-window-length and long-window-length time-frequency graphs of the posture to be recognized into the short-window-length channel module and the long-window-length channel module of the trained two-channel convolutional neural network, respectively, compute the probability of each category through the softmax layer, and select the category with the highest probability as the classification result.
2. The human body posture recognition method based on the dual-channel convolutional neural network of claim 1, wherein the softmax function in step (1c) is as follows:
p_j = e^{W_i^T x_n} / Σ_{k=1}^{N} e^{W_k^T x_n}
where p_j represents the probability that an input sample x_n of class n is identified as a class-j sample after the softmax layer, e denotes exponentiation with the natural base e, W_i denotes the component of the weight parameters of the fully connected layer in the classification module associated with the i-th output neuron (i and j take equal values), the superscript T denotes transposition, and k is the category index.
CN202011547116.9A 2020-12-24 2020-12-24 Human body posture identification method based on two-channel convolutional neural network Pending CN112668443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011547116.9A CN112668443A (en) 2020-12-24 2020-12-24 Human body posture identification method based on two-channel convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011547116.9A CN112668443A (en) 2020-12-24 2020-12-24 Human body posture identification method based on two-channel convolutional neural network

Publications (1)

Publication Number Publication Date
CN112668443A true CN112668443A (en) 2021-04-16

Family

ID=75408235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011547116.9A Pending CN112668443A (en) 2020-12-24 2020-12-24 Human body posture identification method based on two-channel convolutional neural network

Country Status (1)

Country Link
CN (1) CN112668443A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250854A (en) * 2016-08-02 2016-12-21 清华大学 Body gait recognition methods based on micro-Doppler feature and support vector machine
CN107169435A (en) * 2017-05-10 2017-09-15 天津大学 A kind of convolutional neural networks human action sorting technique based on radar simulation image
CN107358250A (en) * 2017-06-07 2017-11-17 清华大学 Body gait recognition methods and system based on the fusion of two waveband radar micro-doppler
CN108872984A (en) * 2018-03-15 2018-11-23 清华大学 Human body recognition method based on multistatic radar micro-doppler and convolutional neural networks
CN108564005A (en) * 2018-03-26 2018-09-21 电子科技大学 A kind of human body tumble discrimination method based on convolutional neural networks
CN109359597A (en) * 2018-10-18 2019-02-19 成都理工大学 Radar gait recognition method based on multi-frequency fusion deep learning
CN110045348A (en) * 2019-05-05 2019-07-23 应急管理部上海消防研究所 A kind of human motion state classification method based on improvement convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XUERU BAI ET AL: "Radar-Based Human Gait Recognition Using Dual-Channel Deep Convolutional Neural Network", 《IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING》 *
Yao Zepeng et al. (姚泽鹏 等): "Human action recognition method based on convolutional neural network" (基于卷积神经网络的人体动作识别方法), 《空军预警学院学报》 (Journal of Air Force Early Warning Academy) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117310646A (en) * 2023-11-27 2023-12-29 南昌大学 Lightweight human body posture recognition method and system based on indoor millimeter wave radar
CN117310646B (en) * 2023-11-27 2024-03-22 南昌大学 Lightweight human body posture recognition method and system based on indoor millimeter wave radar

Similar Documents

Publication Publication Date Title
CN112731309B (en) Active interference identification method based on bilinear efficient neural network
CN111160176B (en) Fusion feature-based ground radar target classification method for one-dimensional convolutional neural network
CN110490230A (en) The Acoustic Object recognition methods of confrontation network is generated based on depth convolution
CN110348357B (en) Rapid target detection method based on deep convolutional neural network
CN112395987A (en) SAR image target detection method based on unsupervised domain adaptive CNN
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
CN112580486B (en) Human behavior classification method based on radar micro-Doppler signal separation
CN112990082B (en) Detection and identification method of underwater sound pulse signal
CN109255339B (en) Classification method based on self-adaptive deep forest human gait energy map
Shen et al. Node identification in wireless network based on convolutional neural network
CN111368930B (en) Radar human body posture identification method and system based on multi-class spectrogram fusion and hierarchical learning
CN110852369A (en) Hyperspectral image classification method combining 3D/2D convolutional network and adaptive spectrum unmixing
CN114203184A (en) Multi-state voiceprint feature identification method and device
CN111967361A (en) Emotion detection method based on baby expression recognition and crying
CN110163130B (en) Feature pre-alignment random forest classification system and method for gesture recognition
CN112668443A (en) Human body posture identification method based on two-channel convolutional neural network
Sarikabuta et al. Impacts of layer sizes in deep residual-learning convolutional neural network on flower image classification with different class sizes
CN117221816A (en) Multi-building floor positioning method based on Wavelet-CNN
CN112052880A (en) Underwater sound target identification method based on weight updating support vector machine
Song et al. Apple disease recognition based on small-scale data sets
Angadi et al. Hybrid deep network scheme for emotion recognition in speech
CN115331678A (en) Generalized regression neural network acoustic signal identification method using Mel frequency cepstrum coefficient
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
Fan et al. Improving gravitational wave detection with 2d convolutional neural networks
Alex et al. Performance analysis of SOFM based reduced complexity feature extraction methods with back propagation neural network for multilingual digit recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20210416)