CN116524537A - Human body posture recognition method based on CNN and LSTM combination - Google Patents

Human body posture recognition method based on CNN and LSTM combination

Info

Publication number: CN116524537A
Application number: CN202310465077.5A
Authority: CN (China)
Prior art keywords: time, distance, channel, cnn, image
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 武其松, 孟德馨, 赵涤燹
Assignee: Southeast University; Network Communication and Security Zijinshan Laboratory (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Southeast University and Network Communication and Security Zijinshan Laboratory

Classifications

    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V10/40 Extraction of image or video features
    • G06V10/764 Recognition using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Recognition using neural networks
    • Y02T10/40 Engine management systems


Abstract

The invention discloses a human body posture recognition method based on the combination of CNN and LSTM. First, the intermediate frequency signal sample data required for training and testing are collected. Second, a distance-dimension Fourier transform is applied to the intermediate frequency signal data to obtain a time-distance image; the target's distance-unit data are summed along the distance dimension to obtain a one-dimensional distance spectrum peak, on which a short-time Fourier transform is performed to obtain a time-frequency image; each image is then labeled with its category. Next, a three-channel deep learning neural network model is established, each channel combining a CNN network and an LSTM network: the first and second channels take the time-frequency image as input and extract features with convolution kernels of different sizes in the convolution layer, while the third channel takes the time-distance image as input. The data are fed into the model for training. By combining millimeter wave radar with CNN and LSTM networks, fusing multiple types of feature images, and fully exploiting time-series feature information, the method improves the accuracy of human body posture recognition.

Description

Human body posture recognition method based on CNN and LSTM combination
Technical Field
The invention relates to a human body posture recognition method based on the combination of CNN and LSTM, and in particular to a target posture recognition method combining millimeter wave radar, a convolutional neural network (CNN) and a long short-term memory network (LSTM).
Background
In a modern society of rapid technological development, target detection and motion recognition and classification have become important research directions. Monitoring and protecting elderly people with limited mobility in daily life, analyzing and judging road conditions in autonomous driving, and detecting criminals in counter-terrorism operations all fall within this field of research. Existing target recognition and action classification technology takes three main forms: first, camera-based methods that use recorded video and the complete picture for recognition; second, recognition of human actions through wearable sensor devices; and third, detection using non-contact sensors such as radar, vision sensors and infrared sensors. Besides the invasive privacy problem created by camera monitoring itself, transmitting the data stream to terminal equipment also raises the risk of privacy disclosure, while wearable devices cause considerable inconvenience in daily life. Radar is non-wearable, unaffected by illumination and atmospheric conditions, and capable of through-wall detection without privacy concerns, so it has gradually become a popular choice for indoor monitoring and recognition.
In addition, while traditional machine learning with manually extracted features is simple and effective for simple, specific tasks, deep learning, which developed from the traditional neural network, offers strong learning ability, good portability and high coverage: given a large amount of data, it can achieve recognition accuracy that even exceeds human performance, and the same network structure can be trained on different data. Deep learning has been widely applied in fields such as speech recognition, machine translation, image recognition and autonomous driving. Camera-based target detection generally relies on deep learning methods built on convolutional neural networks. Deep learning uses multi-layer networks to better capture image detail, segments and extracts target figures in the image, and obtains high-accuracy parameters by training on a training set and evaluating on a test set, so that a neural network can be built and images passed into the network can be classified.
With the development of signal processing technology, the behavior of a monitored target can be determined by analyzing the electromagnetic waves received by the radar antenna. The detection system contains background noise produced by reflections from static objects, which can interfere with identifying target objects. After background noise is removed from the echo data received by the radar, distance and Doppler frequency analysis of the time-varying signal is carried out through the distance Fourier transform and the short-time Fourier transform, overcoming the difficulty of obtaining distance and frequency directly from the time-domain signal. Because a human body produces signals of different frequencies under different actions, the resulting time-frequency and time-distance images differ, and the different actions can therefore be detected by a deep learning method.
Disclosure of Invention
To address these problems, the present invention provides a human body posture recognition method based on the combination of CNN and LSTM, which aims to solve the problem of low recognition accuracy while protecting privacy, and offers elderly people and related groups a way to recognize human action postures using millimeter wave radar.
The aim of the invention can be achieved by adopting the following technical scheme, which comprises the following steps:
step 1, obtaining intermediate frequency signals required by training and testing through a millimeter wave radar, and obtaining the intermediate frequency signals after mixing the transmitting signals with echo signals received by an receiving antenna.
The millimeter wave radar in the method is placed within the movement range of the identified object. After the electromagnetic wave signal is emitted through the transmitting antenna, the signal reflected by the monitored target and the background environment is received by the receiving antenna, realizing non-contact monitoring. During detection, signals need to be acquired from a plurality of different targets, and each target performs a plurality of different postures.
Step 2, performing distance dimension Fourier transform on the intermediate frequency signal to obtain a time-distance image, and summing the distance unit data of the target along the distance dimension to obtain a one-dimensional distance spectrum peak; and carrying out short-time Fourier transform on the one-dimensional distance spectrum peak to obtain a time-frequency image, and labeling the two images with labels of corresponding types.
Step 3. Establish an improved three-channel deep learning neural network model, combining CNN and LSTM networks in each of the three channels: the first and second channels take the time-frequency image as input and extract features in the convolution layer with convolution kernels of different sizes, while the third channel takes the time-distance image as input. Input the data into the model as required, train it, obtain the optimal model parameters, and save them.
In step 1, the intermediate frequency signal required for training and testing is obtained by the millimeter wave radar by mixing the transmitted signal with the echo signal received by the receiving antenna, as follows:
step 1.1, a linear frequency modulation continuous wave signal, also called chirp signal, transmitted by a millimeter wave radar transmitting antenna at the time t is:
wherein A is tx To transmit signal amplitude, f c Is carrier frequency, B is bandwidth, T c Is a sweep frequency period.
Step 1.2. The echo signal received by the receiving antenna is the transmitted signal delayed by the time τ_d:

    s_{rx}(t) = A_{rx} \exp\left[ j 2\pi \left( f_c (t - \tau_d) + \frac{B}{2 T_c} (t - \tau_d)^2 \right) \right]

The delay time τ_d can be expressed as:

    \tau_d = \frac{2d}{c}

where d is the distance between the target and the radar and c is the speed of light.
The echo signal and the transmitted signal are then mixed to obtain the intermediate frequency signal, where mixing refers to conjugate multiplication of the echo signal and the transmitted signal. The intermediate frequency signal has the form:

    s_{IF}(t) = A_{IF} \exp\left[ j \left( 2\pi f_b t + \phi_b \right) \right]

where A_IF is the amplitude of the intermediate frequency signal, f_b = (B / T_c) τ_d is the frequency of the intermediate frequency signal, and φ_b = 2π f_c τ_d is the phase of the intermediate frequency signal.
In practice, signal processing is usually done in the digital domain, which requires sampling the signal. For multi-cycle chirp signals, f_b and φ_b are related to the time interval between chirps. Assume the sampling rate of the millimeter wave radar system is f_s; the discrete sampling form of the intermediate frequency signal is then:

    s_{IF}(m, n) = A_{IF} \exp\left[ j \left( 2\pi f_b \frac{m}{f_s} + \phi_b(n) \right) \right]

where m is the fast-time sampling point, characterizing the distance-dimension information of the signal, and n is the slow-time sampling point, characterizing the Doppler information of the signal.
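As a rough, self-contained sketch of this discrete signal model, the following snippet simulates the sampled IF signal for one frame. All concrete values are illustrative assumptions taken from the embodiment given later (B = 1.6 GHz, T_c = 40.96 μs, f_s = 6.25 MHz, 256 fast-time points, 100 chirps), together with a single static target at an assumed 1.5 m:

```python
import numpy as np

# Illustrative FMCW parameters (from the embodiment) and an assumed target
c = 3e8                          # speed of light (m/s)
B, Tc, fs = 1.6e9, 40.96e-6, 6.25e6
M, N, d = 256, 100, 1.5          # fast-time points, chirps, target distance (m)

tau = 2 * d / c                  # round-trip delay tau_d = 2d/c
fb = (B / Tc) * tau              # beat frequency f_b = (B/Tc) * tau_d

m = np.arange(M)
# Discrete IF samples for one chirp: s_IF[m] = exp(j 2*pi f_b m / f_s)
s_if = np.exp(1j * 2 * np.pi * fb * m / fs)
# A static target gives the same tone on every chirp (slow-time phase omitted)
frame = np.tile(s_if, (N, 1))    # N chirps (slow time) x M samples (fast time)
```

For a target at 1.5 m with these parameters the beat frequency works out to about 390.6 kHz, well inside the 6.25 MHz sampling rate.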
In step 2, a distance-dimension Fourier transform is performed on the intermediate frequency signal to obtain a time-distance image, the target's distance-unit data are summed along the distance dimension to obtain a one-dimensional distance spectrum peak, a short-time Fourier transform is performed on the one-dimensional distance spectrum peak to obtain a time-frequency image, and the two kinds of images are labeled with their corresponding categories, as follows:
step 2.1, performing distance dimension Fourier transform on the sampled intermediate frequency signal to obtain a time-distance image:
wherein, the liquid crystal display device comprises a liquid crystal display device,representing fourier transform of the fast time dimension, and k represents the distance dimension sampling point after fourier transform of the fast time dimension.
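A minimal numpy sketch of this fast-time FFT step. The single-target IF tone is regenerated here so the snippet stands on its own (parameters follow the embodiment; the target distance of 1.5 m is an illustrative assumption); each chirp becomes a range profile whose peak falls in the distance unit of size c/2B nearest the target:

```python
import numpy as np

# Same illustrative single-target IF frame as in the signal-model sketch
c, B, Tc, fs = 3e8, 1.6e9, 40.96e-6, 6.25e6
M, N, d = 256, 100, 1.5
fb = (B / Tc) * (2 * d / c)
frame = np.tile(np.exp(1j * 2 * np.pi * fb * np.arange(M) / fs), (N, 1))

# Distance-dimension FFT over fast time: rows = slow time n, cols = distance units k
time_distance = np.abs(np.fft.fft(frame, axis=1))

peak_bin = int(np.argmax(time_distance[0]))  # distance unit containing the target
d_est = peak_bin * c / (2 * B)               # distance-unit size is c / (2B)
```

With B = 1.6 GHz the distance-unit size is c/2B ≈ 9.4 cm, so the 1.5 m target lands in unit 16.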
Step 2.2. Sum the target's distance-unit data to obtain the one-dimensional distance spectrum peak, and perform a short-time Fourier transform to obtain the time-frequency image:

    D(p, l) = \mathrm{STFT}\left\{ \sum_{k=k_0}^{k_1} R(k, n) \right\}

where STFT denotes the short-time Fourier transform, p denotes the time-dimension sampling point after the short-time Fourier transform, l denotes the Doppler-dimension sampling point after the short-time Fourier transform, k_0 denotes the starting distance unit crossed by the motion trajectory of the target to be detected, and k_1 denotes the distance unit at which the motion trajectory of the target to be detected ends.
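The STFT step can be sketched with `scipy.signal.stft`. The chirped test tone below is only a stand-in for the summed slow-time signal, and the sampling rate and window length are illustrative assumptions, not values from the patent:

```python
import numpy as np
from scipy.signal import stft

fs = 1000.0                          # illustrative slow-time sampling rate (Hz)
t = np.arange(0, 1.0, 1 / fs)
# A tone whose frequency rises over time, standing in for micro-Doppler content
x = np.sin(2 * np.pi * (100 * t + 100 * t ** 2))

# f: Doppler-dimension bins l, p: time-dimension bins, Zxx: complex STFT
f, p, Zxx = stft(x, fs=fs, nperseg=128)
tf_image = np.abs(Zxx)               # the time-frequency image
```

The magnitude `tf_image` is what gets saved as the time-frequency image and fed to the network; the window length (`nperseg`) trades time resolution against Doppler resolution.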
Step 2.3. Label the time-distance images and time-frequency images corresponding to the various actions with their respective labels, and store them in different folders by class.
In step 3, an improved three-channel deep learning neural network model is established, and the postures are classified by combining CNN and LSTM, as follows:
and 3.1, constructing a three-channel network, and taking the time-distance image generated in the step 2 as the input of the first channel and the second channel and the time-distance image as the input of the third channel.
Step 3.2. In the first and third channels, feature extraction is first performed in the convolution layer with a convolution kernel of size a×a, using zero padding to ensure that image edge information is not lost; an average pooling layer is then used to reduce the amount of parameter computation, and the feature map is converted into sequence data and fed into the LSTM network. In the second channel, features are extracted in the convolution layer with a convolution kernel of size b×b, where b is larger than a, because the time-frequency images transformed from different posture signals have clearly different peaks, and different convolution kernel sizes extract features of different granularity.
Step 3.3. Fuse the feature maps of the three channels using the concatenate() method in the keras library, apply a nonlinear operation to the feature maps with the ReLU function, and classify through the fully connected layer with the softmax function. The softmax function converts a set of numbers into a probability distribution, mapping each original value to a probability between 0 and 1:

    \mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}

where x_i is the original value to be converted, n is the number of categories, and j indexes the categories.
Step 3.4. Compile the model, using the categorical cross-entropy function as the loss function to compute the error between the true labels and the predicted outputs and obtain the model accuracy. The categorical cross-entropy function is:

    L = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)

where y_i is the true label of category i and ŷ_i is the predicted probability of category i.
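The softmax and cross-entropy formulas above can be checked with a minimal numpy sketch (the logits and one-hot label are purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / e.sum()

def categorical_cross_entropy(y_true, y_pred):
    # y_true is a one-hot label, y_pred a probability distribution over classes
    return -np.sum(y_true * np.log(y_pred))

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)              # sums to 1; largest logit -> largest probability
y_true = np.array([1.0, 0.0, 0.0])
loss = categorical_cross_entropy(y_true, probs)
```

When the predicted probability of the true class approaches 1 the loss approaches 0, which is what drives training toward correct classifications.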
further, compared with the prior art, the human body posture recognition method based on the combination of CNN and LSTM has the following advantages that:
1) The data set of the human body gesture recognition method based on the combination of millimeter wave radar and deep learning provided by the invention comes from a plurality of targets, and has higher diversity and universality;
2) The human body gesture recognition method based on the combination of millimeter wave radar and deep learning can effectively remove low-frequency clutter generated by static objects in the environment;
3) The human body gesture recognition method based on the millimeter wave radar and the deep learning combination can effectively extract the characteristics in the time-frequency image and the time-distance image obtained after the signals are transformed;
4) The human body posture recognition method based on the combination of millimeter wave radar and deep learning provided by the invention can be used for recognizing the human body posture with higher accuracy by fusing multiple aspects of information.
Drawings
FIG. 1 is a flow chart of the method of the present disclosure;
FIG. 2 is a deep learning neural network diagram of the present invention;
FIGS. 3.1 (a) and 3.1 (b) are the time-frequency image and time-distance image, respectively, of the basketball-playing posture;
figs. 3.2 (a) and 3.2 (b) are the time-frequency image and time-distance image, respectively, of the boxing posture;
figs. 3.3 (a) and 3.3 (b) are the time-frequency image and time-distance image, respectively, of the dancing posture;
figs. 3.4 (a) and 3.4 (b) are the time-frequency image and time-distance image, respectively, of the jumping-in-place posture;
figs. 3.5 (a) and 3.5 (b) are the time-frequency image and time-distance image, respectively, of the running posture;
fig. 4 (a) and fig. 4 (b) are training set and validation set results, respectively.
Detailed Description
To make the objects, embodiments and advantages of the present invention clearer, the embodiments are described in more detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1-3.5 (b), the invention provides a human body posture recognition method based on combination of CNN and LSTM, comprising the steps of:
step 1, obtaining intermediate frequency signals required by training and testing through a millimeter wave radar, and obtaining the intermediate frequency signals after mixing the transmitting signals with echo signals received by an receiving antenna.
In this embodiment, five actions are designed: running, boxing, playing basketball, dancing and jumping in place. The millimeter wave radar is placed at a distance of 1.5 meters from the identified object; after the electromagnetic wave signal is emitted through the transmitting antenna, the signal reflected by the monitored target and the background environment is received by the receiving antenna, realizing non-contact monitoring.
The linear frequency-modulated continuous wave signal, also called a chirp signal, transmitted by the millimeter wave radar transmitting antenna at time t is:

    s_{tx}(t) = A_{tx} \exp\left[ j 2\pi \left( f_c t + \frac{B}{2 T_c} t^2 \right) \right]

where A_tx is the transmit signal amplitude, f_c is the carrier frequency, B is the bandwidth, and T_c is the sweep period. In this embodiment the carrier frequency f_c is 77 GHz, the bandwidth B = 1.6 GHz, and the sweep period T_c = 40.96 μs.
The echo signal received by the receiving antenna is the transmitted signal delayed by the time τ_d:

    s_{rx}(t) = A_{rx} \exp\left[ j 2\pi \left( f_c (t - \tau_d) + \frac{B}{2 T_c} (t - \tau_d)^2 \right) \right]

The delay time τ_d can be expressed as:

    \tau_d = \frac{2d}{c}

where d is the distance between the target and the radar and c is the speed of light.
The echo signal and the transmitted signal are then mixed to obtain the intermediate frequency signal, where mixing refers to conjugate multiplication of the echo signal and the transmitted signal. The intermediate frequency signal has the form:

    s_{IF}(t) = A_{IF} \exp\left[ j \left( 2\pi f_b t + \phi_b \right) \right]

where A_IF is the amplitude of the intermediate frequency signal, f_b = (B / T_c) τ_d is the frequency of the intermediate frequency signal, and φ_b = 2π f_c τ_d is the phase of the intermediate frequency signal.
In practice, signal processing is usually done in the digital domain, which requires sampling the signal. For multi-cycle chirp signals, f_b and φ_b are related to the time interval between chirps. In this embodiment, the sampling rate of the millimeter wave radar system is f_s = 6.25 MHz, and the discrete sampling form of the intermediate frequency signal is:

    s_{IF}(m, n) = A_{IF} \exp\left[ j \left( 2\pi f_b \frac{m}{f_s} + \phi_b(n) \right) \right]

where m is the fast-time sampling point, characterizing the distance-dimension information of the signal, and n is the slow-time sampling point, characterizing the Doppler information of the signal. In this embodiment, the number of fast-time sampling points is 256 and the number of slow-time sampling points is 100.
Step 2, performing distance dimension Fourier transform on the intermediate frequency signal to obtain a time-distance image, and summing the distance unit data of the target along the distance dimension to obtain a one-dimensional distance spectrum peak; and carrying out short-time Fourier transform on the one-dimensional distance spectrum peak to obtain a time-frequency image, and labeling the two images with labels of corresponding types.
Step 2.1. Perform a distance-dimension Fourier transform on the sampled intermediate frequency signal to obtain the time-distance image:

    R(k, n) = \sum_{m=0}^{M-1} s_{IF}(m, n) \, e^{-j 2\pi k m / M}

where the sum implements the Fourier transform over the fast-time dimension, M is the number of fast-time sampling points, and k is the distance-dimension sampling point after the fast-time Fourier transform.
Step 2.2. Sum the target's distance-unit data to obtain the one-dimensional distance spectrum peak, and perform a short-time Fourier transform to obtain the time-frequency image:

    D(p, l) = \mathrm{STFT}\left\{ \sum_{k=k_0}^{k_1} R(k, n) \right\}

where STFT denotes the short-time Fourier transform, p denotes the time-dimension sampling point after the short-time Fourier transform, l denotes the Doppler-dimension sampling point after the short-time Fourier transform, k_0 denotes the starting distance unit crossed by the motion trajectory of the target to be detected, and k_1 denotes the distance unit at which the motion trajectory of the target to be detected ends.
Step 2.3. Label the time-distance images and time-frequency images corresponding to the various actions with their respective labels, and store them in different folders by class.
Step 3. Establish an improved three-channel deep learning neural network model, combining CNN and LSTM networks in each of the three channels: the first and second channels take the time-frequency image as input and extract features in the convolution layer with convolution kernels of different sizes, while the third channel takes the time-distance image as input. Input the data into the model as required, train it, obtain the optimal model parameters, and save them.
In step 3, an improved three-channel deep learning neural network model is established, and the postures are classified by combining CNN and LSTM, as follows:
and 3.1, constructing a three-channel network, and taking the time-distance image generated in the step 2 as the input of the first channel and the second channel and the time-distance image as the input of the third channel.
Step 3.2. In the first and third channels, feature extraction is first performed in the convolution layer with a convolution kernel of size a×a, using zero padding to ensure that image edge information is not lost; an average pooling layer is then used to reduce the amount of parameter computation, and the feature map is converted into sequence data and fed into the LSTM network. In the second channel, features are extracted in the convolution layer with a convolution kernel of size b×b, where b is larger than a, because the time-frequency images transformed from different posture signals have clearly different peaks, and different convolution kernel sizes extract features of different granularity.
Step 3.3. Fuse the feature maps of the three channels using the concatenate() method in the keras library, apply a nonlinear operation to the feature maps with the ReLU function, and classify through the fully connected layer with the softmax function. The softmax function converts a set of numbers into a probability distribution, mapping each original value to a probability between 0 and 1:

    \mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}

where x_i is the original value to be converted, n is the number of categories, and j indexes the categories.
Step 3.4. Compile the model, using the categorical cross-entropy function as the loss function to compute the error between the true labels and the predicted outputs and obtain the model accuracy. The categorical cross-entropy function is:

    L = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)

where y_i is the true label of category i and ŷ_i is the predicted probability of category i.
in this embodiment, a convolution kernel with a size of 3×3 may be adopted to perform feature extraction in the first channel, and a convolution kernel with a size of 5×5 may be adopted to extract features in the second channel, where the specific structure of the neural network is shown in fig. 2. The data set is used for 600 pieces, wherein each type of image is 100 pieces, and is divided into 480 pieces of training set and 120 pieces of verification set according to the proportion. The training runs were 20 runs, the number of samples was 5, and examples of time-frequency images and time-distance images for five different poses are shown in fig. 3. The network training results and test results are shown in fig. 4.
Table 1 shows the network structure in a specific embodiment.

Table 2 shows the comparison results of different methods:

    Method name                       Classification accuracy
    Single-class feature input CNN    91.24%
    Single-class feature input LSTM   89.9%
    The method of the invention       93.94%
In conclusion, the method uses the reflected signals received by the radar; after time-frequency and time-distance images are obtained through Fourier transforms, the human body postures are recognized and classified with the combined deep learning neural network, which effectively fuses features, exploits time-series information, and improves the accuracy of human body posture recognition by millimeter wave radar.
The foregoing is illustrative of the methods and structures of the present invention, and modifications and substitutions of the specific embodiments described herein will be apparent to those of ordinary skill in the art without departing from the invention or the scope of the appended claims.

Claims (8)

1. The human body posture recognition method based on the combination of CNN and LSTM is characterized by comprising the following steps:
collecting radar reflection signals of different targets, wherein each target realizes a plurality of different postures;
obtaining a time-distance image and a time-frequency image according to the reflected signals, and respectively labeling the time-distance image and the time-frequency image corresponding to various gestures with corresponding labels;
establishing a deep learning neural network model comprising a feature extraction layer, a feature fusion layer and a full-connection layer, wherein the feature extraction layer comprises three channels, each channel comprises a CNN network and an LSTM network, the input of a first channel and a second channel is a time-frequency image, the input of a third channel is a time-distance image, the feature images output by the channels are input into the feature fusion layer for feature fusion, and the classification result is output at the full-connection layer;
training the deep learning neural network model;
and inputting the image to be identified into a trained deep learning neural network model, and outputting the corresponding human body posture type.
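The three-channel feature extraction and fusion of claim 1 can be illustrated with a minimal numpy sketch. The hand-rolled convolution and LSTM cell, the image sizes, kernel sizes (3×3 and 5×5, following claims 6-7), hidden width, and number of posture classes are all illustrative assumptions, not the patent's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, k):
    """Valid 2-D convolution with ReLU: the CNN stand-in for one channel."""
    kh, kw = k.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return np.maximum(out, 0.0)

def lstm_last_hidden(seq, Wx, Wh, b, hidden):
    """Minimal LSTM over a sequence of row vectors; returns final hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in seq:
        z = x @ Wx + h @ Wh + b
        i, f, o, g = np.split(z, 4)
        sig = lambda v: 1.0 / (1.0 + np.exp(-v))
        c = sig(f) * c + sig(i) * np.tanh(g)
        h = sig(o) * np.tanh(c)
    return h

def channel(img, kernel, Wx, Wh, b, hidden):
    """CNN feature map, then each row is fed to the LSTM as one time step."""
    fmap = conv2d(img, kernel)
    return lstm_last_hidden(fmap, Wx, Wh, b, hidden)

hidden = 8
tf_img = rng.standard_normal((16, 16))  # time-frequency image (channels 1 and 2)
td_img = rng.standard_normal((16, 16))  # time-distance image (channel 3)

feats = []
for img, ksize in ((tf_img, 3), (tf_img, 5), (td_img, 3)):  # claim 7: b > a
    k = rng.standard_normal((ksize, ksize))
    width = img.shape[1] - ksize + 1
    Wx = rng.standard_normal((width, 4 * hidden)) * 0.1
    Wh = rng.standard_normal((hidden, 4 * hidden)) * 0.1
    feats.append(channel(img, k, Wx, Wh, np.zeros(4 * hidden), hidden))

fused = np.concatenate(feats)                 # feature-fusion layer
W_fc = rng.standard_normal((fused.size, 4)) * 0.1  # 4 posture classes, assumed
logits = np.maximum(fused, 0.0) @ W_fc        # ReLU then fully connected layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax classification
print(probs.shape)
```

In a practical implementation these stand-ins would be replaced by trained convolution and LSTM layers; the sketch only shows how the three per-channel feature vectors are concatenated before the fully connected classifier.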
2. The human body posture recognition method based on the combination of CNN and LSTM according to claim 1, wherein the reflected signal is an intermediate frequency signal obtained by mixing a transmitting signal with an echo signal received by a receiving antenna.
3. The human body posture recognition method based on the combination of CNN and LSTM according to claim 2, wherein the discrete sampling form of the intermediate frequency signal is:
s_IF(m, n) = A_IF · exp(j(2π f_b m / f_s + φ(n)))
wherein A_IF is the amplitude of the intermediate frequency signal, f_b is the frequency of the intermediate frequency signal, φ(n) is the phase of the intermediate frequency signal, m is the fast-time sampling point, representing the distance-dimension information of the signal; n is the slow-time sampling point, representing the Doppler information of the signal, and f_s is the radar system sampling rate.
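This IF-signal model can be simulated with a few lines of numpy. The exponential form and all numerical parameters (sampling rate, beat frequency, Doppler shift, chirp interval, matrix sizes) are illustrative assumptions rather than values from the patent:

```python
import numpy as np

# Assumed discrete IF-signal model (a common FMCW form):
#   s[m, n] = A_IF * exp(j * (2*pi*f_b*m/f_s + phi(n)))
# m: fast-time sample (range information), n: slow-time sample (Doppler).
A_IF = 1.0
f_s = 2e6        # radar system sampling rate, hypothetical value
f_b = 100e3      # beat (IF) frequency, hypothetical value
f_d = 50.0       # Doppler shift across chirps, hypothetical value
T_c = 1e-3       # chirp repetition interval, hypothetical value
M, N = 256, 64   # fast-time samples per chirp, number of chirps

m = np.arange(M)[:, None]          # fast-time axis (column vector)
n = np.arange(N)[None, :]          # slow-time axis (row vector)
phi = 2 * np.pi * f_d * n * T_c    # slow-time phase carries the Doppler
s_if = A_IF * np.exp(1j * (2 * np.pi * f_b * m / f_s + phi))
print(s_if.shape)
```

The resulting M×N matrix (fast time × slow time) is the raw input from which the time-distance and time-frequency images of claims 4 and 5 are computed.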
4. The human body posture recognition method based on the combination of CNN and LSTM according to claim 1, wherein the reflected signal is subjected to distance dimension Fourier transform to obtain a time-distance image.
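The range-dimension Fourier transform of claim 4 can be sketched with numpy. The target position (range bin 20) and matrix sizes are illustrative:

```python
import numpy as np

# Fast-time x slow-time IF matrix with one stationary target whose beat
# frequency falls exactly in range bin 20 (illustrative parameters).
M, N = 128, 32
m = np.arange(M)[:, None]
s_if = np.exp(2j * np.pi * 20 * m / M) * np.ones((1, N))

# Claim 4: FFT along the distance (fast-time) dimension gives one range
# spectrum per chirp; stacking the chirps yields the time-distance image.
time_distance = np.abs(np.fft.fft(s_if, axis=0))  # shape: (range bins, time)
peak_bin = int(np.argmax(time_distance[:, 0]))
print(time_distance.shape, peak_bin)
```

A moving target would trace a curve through this image as its range bin drifts over slow time.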
5. The human body posture recognition method based on the combination of CNN and LSTM according to claim 1, wherein the distance unit data of the target is summed along the distance dimension to obtain a one-dimensional distance spectrum peak; and carrying out short-time Fourier transform on the one-dimensional distance spectrum peak to obtain a time-frequency image.
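Claim 5's pipeline (summing the target's range cells along the distance dimension, then applying a short-time Fourier transform) can be sketched as follows, with a synthetic tone standing in for the target's micro-Doppler; the sampling rate, tone frequency, and window length are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

# Slow-time signal: a few range cells containing the target, here modeled as
# a single 125 Hz micro-Doppler tone (illustrative parameters).
fs_slow = 1000.0                 # chirp rate = slow-time sampling rate
t = np.arange(2048) / fs_slow
range_cells = np.stack([np.exp(2j * np.pi * 125.0 * t) for _ in range(4)])
one_d = range_cells.sum(axis=0)  # sum along the distance dimension

# Claim 5: the STFT of the 1-D sequence gives the time-frequency image.
f, _, Z = stft(one_d, fs=fs_slow, nperseg=128, return_onesided=False)
tf_image = np.abs(Z)
peak_freq = float(f[np.argmax(tf_image[:, tf_image.shape[1] // 2])])
print(tf_image.shape, peak_freq)
```

For a real posture, the tone frequency varies over time, so the resulting time-frequency image shows the characteristic micro-Doppler signature that the network classifies.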
6. The human body posture recognition method based on the combination of CNN and LSTM according to claim 1, wherein the convolution layers of the first channel and the second channel use convolution kernels of different sizes.
7. The method of claim 1, wherein the convolution layer of the first channel and the third channel uses a convolution kernel of a×a, and the convolution layer of the second channel uses a convolution kernel of b×b, wherein b > a.
8. The human body posture recognition method based on the combination of CNN and LSTM according to claim 1, wherein the three-channel feature maps are fused by using the concatenate() method in the keras library, the fused features are passed through a ReLU nonlinear activation, and then classified by a softmax function through a fully connected layer.
CN202310465077.5A 2023-04-26 2023-04-26 Human body posture recognition method based on CNN and LSTM combination Pending CN116524537A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310465077.5A CN116524537A (en) 2023-04-26 2023-04-26 Human body posture recognition method based on CNN and LSTM combination

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310465077.5A CN116524537A (en) 2023-04-26 2023-04-26 Human body posture recognition method based on CNN and LSTM combination

Publications (1)

Publication Number Publication Date
CN116524537A true CN116524537A (en) 2023-08-01

Family

ID=87397031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310465077.5A Pending CN116524537A (en) 2023-04-26 2023-04-26 Human body posture recognition method based on CNN and LSTM combination

Country Status (1)

Country Link
CN (1) CN116524537A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117310646A (en) * 2023-11-27 2023-12-29 南昌大学 Lightweight human body posture recognition method and system based on indoor millimeter wave radar
CN117310646B (en) * 2023-11-27 2024-03-22 南昌大学 Lightweight human body posture recognition method and system based on indoor millimeter wave radar

Similar Documents

Publication Publication Date Title
CN107862705B (en) Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics
Angelov et al. Practical classification of different moving targets using automotive radar and deep neural networks
CN111399642B (en) Gesture recognition method and device, mobile terminal and storage medium
CN101223456B (en) Computer implemented method for identifying a moving object by using a statistical classifier
CN107358250B (en) Body gait recognition methods and system based on the fusion of two waveband radar micro-doppler
Liu et al. Deep learning and recognition of radar jamming based on CNN
Ahmed et al. Radar-based air-writing gesture recognition using a novel multistream CNN approach
CN111461037B (en) End-to-end gesture recognition method based on FMCW radar
CN110348288A (en) A kind of gesture identification method based on 77GHz MMW RADAR SIGNAL USING
CN111175718B (en) Automatic target recognition method and system for ground radar combining time-frequency domains
CN108680796A (en) Electromagnetic information leakage detecting system and method for computer display
WO2023029390A1 (en) Millimeter wave radar-based gesture detection and recognition method
CN113837131B (en) Multi-scale feature fusion gesture recognition method based on FMCW millimeter wave radar
Yang et al. Human motion serialization recognition with through-the-wall radar
US20230039196A1 (en) Small unmanned aerial systems detection and classification using multi-modal deep neural networks
CN113537417B (en) Target identification method and device based on radar, electronic equipment and storage medium
CN116524537A (en) Human body posture recognition method based on CNN and LSTM combination
Hendy et al. Deep learning approaches for air-writing using single UWB radar
CN115508821A (en) Multisource fuses unmanned aerial vehicle intelligent detection system
CN114781463A (en) Cross-scene robust indoor tumble wireless detection method and related equipment
CN111965620B (en) Gait feature extraction and identification method based on time-frequency analysis and deep neural network
CN114581958A (en) Static human body posture estimation method based on CSI signal arrival angle estimation
Martinez et al. Deep learning-based segmentation for the extraction of micro-doppler signatures
Tang et al. SAR deception jamming target recognition based on the shadow feature
CN114692679A (en) Meta-learning gesture recognition method based on frequency modulated continuous wave

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination