CN115877376A - Millimeter wave radar gesture recognition method and recognition system based on multi-head self-attention mechanism


Info

Publication number
CN115877376A
CN115877376A (application CN202211566615.1A)
Authority
CN
China
Prior art keywords
spectrum
distance
gesture recognition
gru
gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211566615.1A
Other languages
Chinese (zh)
Inventor
赵雅琴
宋雨晴
吴龙文
刘璞秋
何胜阳
左伊芮
周仕扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202211566615.1A
Publication of CN115877376A
Pending legal-status Critical Current

Landscapes

  • Radar Systems Or Details Thereof (AREA)

Abstract

The invention discloses a millimeter wave radar gesture recognition method and system based on a multi-head self-attention mechanism, and relates to a fast and lightweight gesture recognition method and system based on millimeter wave radar. The invention aims to solve the problems that most existing radar-based gesture recognition technologies use a characteristic spectrogram and a convolutional neural network for gesture classification and recognition, which require a long training time and a large storage space, and that the attention mechanism is not considered. The process is as follows: 1. adopting a millimeter wave radar to collect gesture data and form a gesture data training set; 2. obtaining a range-Doppler map; 3. simplifying the range-time spectrum, velocity-time spectrum, azimuth spectrum, and pitch spectrum to obtain a 28 × 4-dimensional mixed feature vector; 4. obtaining a trained gesture recognition network; 5. passing the gesture data to be detected, acquired by the millimeter wave radar, through steps 2 and 3 and the trained gesture recognition network to obtain the recognition result for the gesture data to be detected. The invention is used in the field of gesture recognition.

Description

Millimeter wave radar gesture recognition method and recognition system based on multi-head self-attention mechanism
Technical Field
The invention relates to a rapid and light gesture recognition method and system based on a millimeter wave radar.
Background
The non-contact gesture recognition is used as a novel man-machine interaction mode, accords with the body language habit of people, and has wide application prospect. In the aspect of medical treatment, a doctor can control medical equipment through gestures, so that non-contact medical operation is realized; in the field of automobiles, a driver and passengers can send instructions to an automobile center console through gestures; in the field of smart home, people can control common electric appliances such as air conditioners, televisions and the like by using gesture actions; in the AR/VR field, a player can control objects in a game by using gestures, and substitution feeling is enhanced. The gesture recognition method based on the millimeter wave radar becomes an important man-machine interaction mode due to the advantages of non-contact, strong sensing capability on the micro-motion target, capability of working all day long, all weather, no influence of light rays, no privacy disclosure and the like. At present, most millimeter wave radars for gesture recognition adopt a frequency modulation continuous wave technology and a multi-transmitting and multi-receiving antenna, which are the premise of the invention, and evaluation indexes of gesture recognition mainly comprise the type and recognition precision of gestures.
In general, millimeter wave radar gesture recognition can be divided into 3 major steps: first, detecting and collecting the user's dynamic gesture information with a millimeter wave radar sensor; then, preprocessing the echo signal to extract the dynamic gesture features to the greatest extent and to filter out interference clutter; and finally, selecting an appropriate algorithm to classify and recognize the gesture according to the preprocessed gesture features. Most existing radar-based gesture recognition technologies use a characteristic spectrogram and a convolutional neural network for gesture classification and recognition, which require a long training time and a large storage space, and they do not consider the attention mechanism.
Disclosure of Invention
The invention aims to solve the problems that most existing gesture recognition technologies based on radar use characteristic spectrograms and convolutional neural networks to carry out gesture classification recognition, training time is long, occupied storage space is large, and attention mechanism is not considered, and provides a millimeter wave radar gesture recognition method and system based on a multi-head self-attention mechanism.
The millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism comprises the following specific processes:
firstly, acquiring gesture data by adopting a millimeter wave radar with a signal form of frequency modulated continuous waves to form a gesture data training set;
secondly, preprocessing the acquired gesture data to obtain a range Doppler RD image;
thirdly, obtaining a distance-time spectrum RTM, a speed-time spectrum DTM, an azimuth spectrum ATM and a pitch spectrum ETM based on the distance Doppler RD diagram obtained in the second step, and simplifying the distance-time spectrum RTM, the speed-time spectrum DTM, the azimuth spectrum ATM and the pitch spectrum ETM to finally obtain a 28 x 4-dimensional mixed feature vector;
step four, constructing a gesture recognition network 8HBi-GRU, and inputting the mixed feature vector into the gesture recognition network Bi-GRU to obtain a trained gesture recognition network Bi-GRU;
step five, preprocessing the gesture data to be detected acquired by the millimeter wave radar to obtain a range-Doppler (RD) map; obtaining a range-time spectrum RTM, a velocity-time spectrum DTM, an azimuth spectrum ATM, and a pitch spectrum ETM based on the obtained range-Doppler RD map, and simplifying them to finally obtain a 28 × 4-dimensional mixed feature vector; and inputting the obtained 28 × 4-dimensional mixed feature vector into the trained gesture recognition network Bi-GRU to obtain the recognition result of the gesture data to be detected.
The millimeter wave radar gesture recognition system based on the multi-head self-attention mechanism is used for executing the millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism.
The invention has the beneficial effects that:
the method adopts the millimeter wave radar with the signal form of frequency modulation continuous waves to acquire gesture data, performs target detection and feature extraction on a data set, and finally performs gesture recognition of 12 gestures from the light quantization angle by means of a neural network.
In order to realize gesture recognition, the invention extracts not only the common range and velocity features but also the azimuth angle and the pitch angle. Existing radar-based gesture recognition methods either feed raw radar data directly into a neural network or extract range, Doppler, and angle-of-arrival information before feeding it in, and they mostly do not use pitch angle features.
The invention uses the weighted average method to compress RTM, DTM, ATM and ETM data, and extracts the characteristic value as accurately as possible, thereby obtaining the mixed characteristic vector of 28 x 4 dimensions, wherein the mixed characteristic vector has 28 frames, each frame contains 4 characteristic values of distance, speed, azimuth angle and pitch angle, and the data volume is greatly reduced. The mixed feature vectors and the proposed gesture recognition network 8HBi-GRU are adopted for classification, the network can fully fuse 4 features and extract the time correlation of gesture data, and experimental results show that the recognition accuracy of 12 micro gestures can reach 98.24%, the model training and recognition speed is high, and the fast and light gesture recognition is realized.
Drawings
FIG. 1 is a flow chart of the present invention
FIG. 2a is an exemplary graph of compression of RTM data into feature vectors;
FIG. 2b is an exemplary graph of DTM data compression into feature vectors;
FIG. 2c is an exemplary graph of ATM data compression into feature vectors;
FIG. 2d is an exemplary graph of compression of ETM data into feature vectors;
FIG. 3 is a block diagram of an 8HBi-GRU network proposed by the present invention;
FIG. 4 is a schematic diagram of a GRU model principle;
FIG. 5 is a schematic diagram of a Bi-GRU model;
FIG. 6 is a schematic view of a self-attention mechanism model;
FIG. 7a is a graph of Accuracy (Accuracy) during training of an 8HBi-GRU network;
FIG. 7b is a graph of Loss (Loss) during training of an 8HBi-GRU network;
FIG. 8 is a graph of the confusion matrix obtained from testing an 8HBi-GRU network.
Detailed Description
The first specific implementation way is as follows: the millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism in the embodiment specifically comprises the following processes:
firstly, a millimeter wave radar with a signal form of Frequency Modulated Continuous Wave (FMCW) is adopted for gesture data acquisition, and a gesture data training set is formed;
secondly, preprocessing the acquired gesture data to obtain a range Doppler RD image;
thirdly, obtaining a distance-time spectrum RTM, a speed-time spectrum DTM, an azimuth spectrum ATM and a pitch spectrum ETM based on the distance Doppler RD diagram obtained in the second step, and simplifying the distance-time spectrum RTM, the speed-time spectrum DTM, the azimuth spectrum ATM and the pitch spectrum ETM to finally obtain a 28 x 4-dimensional mixed feature vector;
step four, from the light-weight perspective, constructing a gesture recognition network 8HBi-GRU, inputting the mixed feature vectors into the gesture recognition network Bi-GRU to obtain a trained gesture recognition network Bi-GRU, wherein the obtained lightweight model can finally reach the recognition accuracy of 98.24%, and has the advantages of short training time, high training speed and small data volume;
step five, preprocessing the gesture data to be detected acquired by the millimeter wave radar to obtain a range-Doppler RD map; obtaining a range-time spectrum RTM, a velocity-time spectrum DTM, an azimuth spectrum ATM, and a pitch spectrum ETM based on the obtained range-Doppler RD map, and simplifying them to finally obtain a 28 × 4-dimensional mixed feature vector; and inputting the obtained 28 × 4-dimensional mixed feature vector into the trained gesture recognition network Bi-GRU to obtain the recognition result of the gesture data to be detected.
The second embodiment is as follows: the second step is to pre-process the acquired gesture data to obtain a range-doppler RD diagram;
the specific process is as follows:
filtering out static object components in gesture data (echo signals) collected by the millimeter wave radar by means of an MTI moving target display technology to obtain gesture data with the static object components filtered out;
performing 2D-FFT on the gesture data with the static object components filtered out in a distance dimension and a speed dimension to obtain a distance Doppler RD image;
and filtering the interference target (the constant false alarm detector CFAR finishes the detection and removal of the interference target) in the obtained range Doppler RD image by adopting the constant false alarm detector CFAR, and obtaining the range Doppler RD image only containing the human hand target after the interference target is filtered.
Other steps and parameters are the same as those in the first embodiment.
The third concrete implementation mode: the difference between the present embodiment and the first or second specific embodiment is that 2D-FFT is performed on the gesture data with the stationary object component filtered out in the distance dimension and the velocity dimension to obtain a distance doppler RD graph;
the specific process is as follows:
and performing 2D-FFT on the gesture data with the static object components filtered out in a distance dimension and a velocity dimension to obtain a distance Doppler RD diagram (the horizontal axis is a velocity index, the vertical axis is a distance index, and the value is reflected to the RD diagram and is a color), wherein the expression is as follows:
Figure BDA0003986302710000041
wherein s is IF (m, N) is gesture data collected by a Frequency Modulated Continuous Wave (FMCW) radar, N c For the number of chirp signals, N adc The number of gesture data originally collected for a Frequency Modulated Continuous Wave (FMCW) radar; j is an imaginary unit, j 2 = -1; m is an index of an original pulse signal chirp, and n is an index of an original sampling point;
the target will appear on the RD map as a cluster of higher energy pixels on the RD map.
Other steps and parameters are the same as those in the first or second embodiment.
The fourth concrete implementation mode: the difference between the present embodiment and one of the first to third embodiments is that, in the third step, based on the range doppler RD map obtained in the second step, a range-time spectrum RTM, a speed-time spectrum DTM, an azimuth spectrum ATM, and a pitch spectrum ETM are obtained, and the range-time spectrum RTM, the speed-time spectrum DTM, the azimuth spectrum ATM, and the pitch spectrum ETM are simplified to finally obtain a 28 × 4 dimensional mixed feature vector;
the specific process is as follows:
thirdly, projecting the range Doppler RD image only containing the hand target on a longitudinal axis, and splicing the range Doppler RD image frame by frame to obtain a range-time spectrum RTM image;
step two, projecting the range Doppler RD image only containing the hand target on a horizontal axis, and splicing the range Doppler RD image frame by frame to obtain a speed-time spectrum DTM image;
thirdly, performing DOA estimation on a human hand target point detected in a range Doppler RD image only containing a human hand target, namely performing angle FFT on a horizontal channel dimension, and splicing frame by frame to obtain an azimuth spectrum ATM; and performing angle FFT on the vertical channel dimension, and splicing frame by frame to obtain the pitch spectrum ETM.
And step three, simplifying the distance-time spectrum RTM, the speed-time spectrum DTM, the azimuth spectrum ATM and the pitch spectrum ETM to obtain a 28 x 4-dimensional mixed feature vector.
Other steps and parameters are the same as those in one of the first to third embodiments.
The fifth concrete implementation mode: the difference between this embodiment and one of the first to fourth embodiments is that, in the third and fourth steps, the distance-time spectrum RTM, the speed-time spectrum DTM, the azimuth spectrum ATM, and the pitch spectrum ETM are simplified to obtain a 28 × 4-dimensional mixed feature vector;
the specific process is as follows:
performing data compression on RTM, DTM, ATM and ETM to obtain a mixed feature vector with 28 x 4 dimensions, wherein 28 represents 28 frames of data, and 4 represents 4 features of distance, speed, azimuth angle and pitch angle, as shown in FIGS. 2a, 2b, 2c and 2 d;
step three, for the distance-time spectrum RTM diagram, the number of rows (the number of points from the FFT) and the number of columns of the distance-time spectrum RTM diagram are R =128 and l =28, respectively, that is, the distance-time spectrum RTM diagram is formed by splicing distance distributions of targets in 28 frame data, the distance distribution of the target in each frame data is divided into 128 distance units to represent, and assuming that the energy value of the pixel point in the ith row in the ith column is E _ R (l, R), the distance estimation f _ R (l) of the target in the ith frame data can be represented as:
Figure BDA0003986302710000051
wherein L =1,2, \8230;, L, R =1,2, \8230;, R;
the operation is carried out on 28 frames of data, the R multiplied by L-size distance-time spectrum RTM graph is simplified into a feature vector with the size of 1 multiplied by L, the dimension reduction method is efficient and direct, and one number can reflect the distance information of a target in certain frame of data;
step three, step two, for the velocity-time spectrum DTM map, the number of rows (the number of points from FFT) and the number of columns of the velocity-time spectrum DTM map are D =128, l =28, respectively, and assuming that the energy value of the ith row pixel point in the ith column is represented as E _ D (l, D), the distance estimation f _ D (l) of the target in the ith frame data can be represented as:
Figure BDA0003986302710000052
wherein L =1,2, \8230, L, D =1,2, \8230, D;
performing this operation on 28 frames of data, the velocity-time spectrum DTM map of size D × L is reduced to a feature vector of size 1 × L;
step three and three, for the azimuth spectrum ATM, the number of rows (the number of points of the distance FFT) and the number of columns of the azimuth spectrum ATM are a =160 and l =28, respectively, and assuming that the energy value of the pixel point at the a-th row in the l-th column is represented as E _ a (l, a), the distance estimation f _ a (l) of the target in the l-th frame data can be represented as:
Figure BDA0003986302710000061
wherein L =1,2, \8230, L, a =1,2, \8230, A;
this operation is performed on 28 frames of data, and the azimuth spectrum ATM of a × L size is reduced to a feature vector of size 1 × L;
step three and four, for the pitch spectrum ETM, the number of rows (the number of points from the FFT) and the number of columns of the pitch spectrum ETM are respectively E =50, l =28, and assuming that the energy value of the pixel point in the ith row in the ith column is denoted as E _ E (l, E), the target distance estimation f _ E (l) in the ith frame data can be denoted as:
Figure BDA0003986302710000062
wherein L =1,2, \8230, L, E =1,2, \8230, E;
performing this operation on 28 frames of data, the E × L-sized pitch spectrum ETM is reduced to a feature vector of size 1 × L;
and step three, four, time alignment and splicing are carried out on the four feature vectors to obtain a 28 x 4-dimensional mixed feature vector (4 28-dimensional vectors are spliced into a 4 x 28 vector).
Other steps and parameters are the same as in one of the first to fourth embodiments.
The sixth specific implementation mode: the implementation mode is different from one of the first to the fifth implementation modes in that in the fourth step, from the perspective of light weight, a gesture recognition network 8HBi-GRU is constructed, the mixed feature vector is input into the gesture recognition network Bi-GRU, the trained gesture recognition network Bi-GRU is obtained, and the obtained light weight model can finally reach the recognition accuracy of 98.24%, and has the advantages of short training time, high training speed and small data volume;
the specific process is as follows:
step 4.1: constructing the gesture recognition network 8HBi-GRU; the specific process is as follows:
the gesture recognition network 8HBi-GRU sequentially comprises a first bidirectional GRU layer, a second bidirectional GRU layer, a multi-head self-attention mechanism layer, a summation layer, and a fully connected layer;
step 4.2: inputting the 28 × 4-dimensional mixed feature vector obtained in step three into the gesture recognition network Bi-GRU until convergence, so as to obtain the trained gesture recognition network Bi-GRU.
Other steps and parameters are the same as those in one of the first to fifth embodiments.
The seventh embodiment: the difference between this embodiment and one of the first to sixth embodiments is that, in step 4.2, the 28 × 4-dimensional mixed feature vector obtained in step three is input into the gesture recognition network Bi-GRU until convergence, so as to obtain the trained gesture recognition network Bi-GRU;
the specific process is as follows:
inputting the 28 x 4-dimensional mixed feature vector obtained in the third step into a first bidirectional GRU layer, inputting the output feature vector of the first bidirectional GRU layer into a multi-head self-attention machine making layer, inputting the output feature vector of the multi-head self-attention machine making layer into a summing layer for summing operation (the output feature vector of the multi-head self-attention machine making layer is changed into 512 through the summing layer), inputting the output feature vector of the summing layer into a full connection layer, and outputting a 12-dimensional vector through the full connection layer, as shown in FIGS. 2a, 2b, 2c and 2d, wherein the class corresponding to the maximum value is the recognition result until convergence, and obtaining the trained gesture recognition network Bi-GRU.
The 28 represents time, and 4 represents vectors, namely a distance-time spectrum (RTM), a speed-time spectrum (DTM), an azimuth spectrum (ATM) and an elevation spectrum (ETM), respectively;
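A possible PyTorch rendering of this forward pass, under the sizes given in the text (two bidirectional GRU layers with hidden size 256, 8 attention heads over 512-dimensional features, summation over time, and a 12-way fully connected layer); the class name and the use of nn.GRU and nn.MultiheadAttention are choices made for this sketch, not the patent's code:

    import torch
    import torch.nn as nn

    class EightHBiGRU(nn.Module):
        """Sketch of the 8HBi-GRU classifier described in the text."""
        def __init__(self, feat_dim=4, hidden=256, heads=8, num_classes=12):
            super().__init__()
            # Two stacked bidirectional GRU layers: (batch, 28, 4) -> (batch, 28, 512)
            self.gru = nn.GRU(feat_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
            # 8-head self-attention over the 28-step sequence of 512-dim features
            self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=heads,
                                              batch_first=True)
            self.fc = nn.Linear(2 * hidden, num_classes)

        def forward(self, x):                  # x: (batch, 28, 4)
            h, _ = self.gru(x)                 # (batch, 28, 512)
            a, _ = self.attn(h, h, h)          # self-attention with q = k = v = h
            s = a.sum(dim=1)                   # summation layer -> (batch, 512)
            return self.fc(s)                  # (batch, 12) class scores

    logits = EightHBiGRU()(torch.randn(16, 28, 4))   # e.g. a batch of 16 gesture samples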
other steps and parameters are the same as those in one of the first to sixth embodiments.
The specific implementation mode eight: the embodiment is a millimeter wave radar gesture recognition system based on a multi-head self-attention mechanism, and the system is used for executing a millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism.
The principle of the GRU network is specifically as follows:
the circulation unit of GRU mainly comprises reset gate and update gate, each of which is r t And z t To show that the om operator is defined as the operation of subtracting the input data by 1, the structure of the GRU is shown in fig. 4.
The reset gate fuses the new input information x_t with the information h_{t−1} retained from the previous moment; the larger the value of the reset gate, the more useful information is retained. It is calculated as

r_t = σ(W_r · [h_{t−1}, x_t])    (6)

where W_r is a parameter to be trained, r_t is the value of the reset gate, and σ(·) is the sigmoid activation function.

The update gate determines the degree to which the information of the previous moment influences the current moment; the larger the value of the update gate, the greater that influence. It is calculated as

z_t = σ(W_z · [h_{t−1}, x_t])    (7)

where W_z is a parameter to be trained and z_t is the value of the update gate.
candidate hidden layer states
Figure BDA0003986302710000081
Is calculated by the formula
Figure BDA0003986302710000082
Wherein Tanh () is the Tanh activation function, W xg 、W hg As a weight matrix (parameter to be trained), b g For the offset vector (the parameter to be trained),
Figure BDA0003986302710000083
can be regarded as weighted fusion of hidden layer information at two moments, and a reset gate r is needed in the process t To calculate how much information was retained at the last time, when r t If the value is 0, the information of the previous time is not memorized in the calculation process.
Then updating the hidden layer state by the calculation method of
Figure BDA0003986302710000084
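As a sketch, one GRU time step per equations (6)-(9) can be written directly with tensors (variable names and weight shapes are assumptions; the gate equations omit biases, as in the text):

    import torch

    def gru_step(x_t, h_prev, W_r, W_z, W_xg, W_hg, b_g):
        """One GRU time step per equations (6)-(9).
        x_t: (input_dim,), h_prev: (hidden_dim,),
        W_r, W_z: (hidden_dim, hidden_dim + input_dim),
        W_xg: (hidden_dim, input_dim), W_hg: (hidden_dim, hidden_dim), b_g: (hidden_dim,)."""
        hx = torch.cat([h_prev, x_t])                 # concatenation [h_{t-1}, x_t]
        r_t = torch.sigmoid(W_r @ hx)                 # reset gate, eq. (6)
        z_t = torch.sigmoid(W_z @ hx)                 # update gate, eq. (7)
        g_t = torch.tanh(W_xg @ x_t + W_hg @ (r_t * h_prev) + b_g)  # candidate state, eq. (8)
        return z_t * h_prev + (1 - z_t) * g_t         # new hidden state, eq. (9)

    h = torch.zeros(256)                              # hidden size 256, as in the text
    x = torch.randn(4)                                # one 4-dimensional frame feature
    W_r, W_z = torch.randn(256, 260), torch.randn(256, 260)
    W_xg, W_hg, b_g = torch.randn(256, 4), torch.randn(256, 256), torch.zeros(256)
    h = gru_step(x, h, W_r, W_z, W_xg, W_hg, b_g)     # h has shape (256,)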
The GRU can acquire the inter-frame features of the time series, but can only capture the time series correlation from front to back, which is far from sufficient in the complex classification problem.
The Bi-GRU network model (bidirectional GRU network model) is adopted to extract the temporal correlation of the sequence in both directions and obtain the contextual features of the time series; that is, the hidden state at time t is jointly determined by the hidden states at times t+1 and t−1 (h_t is co-determined by h_{t−1} and h_{t+1}, which makes bidirectional extraction possible). Its structure is shown in fig. 5.

In the present invention, n is the number of frames, x_t is a 4-dimensional vector with t = 1, 2, …, 28, and the hidden-layer size is 256 (i.e., the dimension of h). Since a bidirectional GRU network is used, the output size is 28 × 512 (a normal unidirectional GRU would output 28 × 256; the bidirectional network doubles this to 28 × 512).
The principle of the self-attention mechanism is as follows:
the calculation of the self-attention mechanism depends on a query vector q, a key vector k, and a value vector v, which are obtained by multiplying the input sequence x (as shown in fig. 3, the size of the sequence x is 4 × 28) by the corresponding weight matrices:

q = x · W_Q    (10)
k = x · W_K    (11)
v = x · W_V    (12)

where W_Q, W_K, and W_V denote the corresponding weight matrices, which realize the conversion of the feature dimensionality and are the parameters to be learned in the self-attention mechanism; q, k, and v are computed with linear layers.
the self-Attention mechanism is that a weight value matched by q and k is calculated firstly, then a Softmax function is utilized for normalization, and finally the obtained weight value and a value vector v are weighted and summed to obtain an Attention value, wherein the process is shown in figure 6;
feature vector x for time i i Suppose feature vector x i The attention weight found with the feature vector at time j is a ij The final output is y i Then, the related calculation formula is as follows:
Figure BDA0003986302710000091
Figure BDA0003986302710000092
wherein d is k For the depth of the query vector or key-value vector, in the present invention, d k =512;q i Feature vector x for time i i Corresponding query vector, k j Is the key value vector corresponding to the characteristic vector at the moment j, and T is the turnV. position of i The value vector corresponding to the characteristic vector;
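A minimal sketch of equations (10)-(14), assuming an (n, d) input sequence and randomly initialized weight matrices (names chosen for this sketch):

    import torch

    def self_attention(x, W_Q, W_K, W_V):
        """Single-head self-attention per equations (10)-(14).
        x: (n, d_in) input sequence; W_Q, W_K, W_V: (d_in, d_k) weight matrices."""
        q, k, v = x @ W_Q, x @ W_K, x @ W_V             # projections, eqs. (10)-(12)
        d_k = q.shape[-1]
        scores = q @ k.T / d_k ** 0.5                   # q_i . k_j^T / sqrt(d_k)
        a = torch.softmax(scores, dim=-1)               # attention weights a_ij, eq. (13)
        return a @ v                                    # outputs y_i, eq. (14)

    n, d_in, d_k = 28, 512, 512                         # sizes used in the text
    x = torch.randn(n, d_in)
    W_Q, W_K, W_V = (torch.randn(d_in, d_k) for _ in range(3))
    y = self_attention(x, W_Q, W_K, W_V)                # (28, 512)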
the method adopts a multi-head self-attention mechanism to carry out attention allocation, and comprises the following specific steps:
using an m-head self-attention mechanism, for a certain frame feature vector x i The corresponding query vector q i Vector k of key values i Vector of values v i Respectively equally dividing the m subvectors to obtain m groups of combinations of q, k and v;
the 28 x 512 sized vectors q become m groups
Figure BDA0003986302710000093
A vector of sizes;
the 28 x 512 size vectors k become m groups
Figure BDA0003986302710000094
A vector of sizes;
the 28 x 512 sized vectors v become m groups
Figure BDA0003986302710000095
A vector of sizes;
performing the self-attention mechanism operation in groups to obtain a plurality of groups of output y, and then combining the plurality of groups of output y to obtain the final attention output, as shown in fig. 2a, 2b, 2c, and 2 d;
for a time series x of continuous inputs, the time series length is n, and the final attention output obtained after the m-head self-attention mechanism processing is u, which can be expressed by the formula:
u=C(h 1 (x 1 ,x 2 ,···,x n ),···,h k (x 1 ,x 2 ,···,x n ),···,h m (x 1 ,x 2 ,···,x n ))(15)
wherein h is k (x 1 ,x 2 ,···,x n ) Representing the results of the kth set of self-attention operations, C (-) represents the sequential merging of the outputs y, and the calculation process for each single-headed self-attention is identical, but the weight matrix is different.
In the present invention, m is chosen to be 8, and the output size of the multi-head self-attention mechanism is 28 × 512.
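For reference, PyTorch's built-in nn.MultiheadAttention performs the same split-attend-merge pattern: it divides embed_dim = 512 into 8 heads of 64, attends per head, and merges the heads with a learned output projection. Note that it scales scores by the per-head depth rather than by 512, so this is an approximation of the scheme above, not an exact transcription:

    import torch
    import torch.nn as nn

    # 8-head self-attention over the 28 x 512 Bi-GRU output (batch size 1 for illustration)
    attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
    h = torch.randn(1, 28, 512)          # sequence of 28 frames, 512-dim features
    u, weights = attn(h, h, h)           # q = k = v = h (self-attention)
    print(u.shape)                       # torch.Size([1, 28, 512]) -> the 28 x 512 output
    s = u.sum(dim=1)                     # summation layer: (1, 512)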
The following examples were used to demonstrate the beneficial effects of the present invention:
the first embodiment is as follows: and classifying and identifying the 12 micro gestures by using 8 HBi-GRU.
The 12 inching gestures include: the operator is towards the millimeter wave radar, 1) colludes, 2) the fork, 3) draw the circle clockwise, 4) draw the circle anticlockwise, wave the hand about 5), 6) left fan, 7) right fan, 8) the hand of beckon, 9) the hand of waving, 10) single-finger TAP, 11) the palm is held the fist, 12) the palm opens. 10 experimental personnel (6 men and 4 women) are invited to participate in gesture data acquisition, the gesture action distance is 20-60 cm away from a radar plane, the horizontal direction angle range is limited to +/-80 degrees, the vertical direction angle range is limited to +/-25 degrees, the number of gestures acquired by each person is approximately the same, and finally 600 groups of samples of each gesture are formed, and the total number of gesture data sets of 7200 groups of samples is formed, wherein 70% of the gesture data sets are randomly extracted and processed into characteristic data sets for training, and the rest 30% of the gesture data sets are used for testing.
The experiments were carried out in a Python 3.8 and PyTorch 1.12.0 environment, with an i7-12700H CPU, an RTX3060 GPU, and Windows 10 as the operating system. The training learning rate was set to 0.001 and 60 rounds were iterated using the Adam optimizer.
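The stated setup (Adam optimizer, learning rate 0.001, 60 rounds) corresponds to a training loop like the following sketch; the EightHBiGRU class is the hypothetical model sketched earlier, and the data here are random stand-ins, not the collected gesture set:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Dummy stand-in data: 7200 samples of 28x4 features, 12 classes (sizes from the text)
    data = TensorDataset(torch.randn(7200, 28, 4), torch.randint(0, 12, (7200,)))
    train_loader = DataLoader(data, batch_size=64, shuffle=True)

    model = EightHBiGRU()                                       # hypothetical class from the sketch above
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr = 0.001, as stated
    criterion = nn.CrossEntropyLoss()

    for epoch in range(60):                                     # 60 rounds, as stated
        for feats, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(feats), labels)
            loss.backward()
            optimizer.step()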
The Accuracy and Loss curves during model training are shown in fig. 7a and fig. 7b. It can be seen that as model training progresses, the accuracy and loss on the training set and the test set gradually converge after a sufficient number of iterations, and the final overall classification accuracy is 98.24%.
The model was tested, the confusion matrix is shown in FIG. 8, and the accuracy (Precision), recall (Recall), and F1 score (F1-score) for the 12 gestures are shown in Table 1.
Table 1: Performance on the 12 gestures

[Table 1 appears only as an image in the original publication; it lists Precision, Recall, and F1-score for each of the 12 gestures.]
It can be seen that for palm-motion gestures with large movement amplitude, such as fanning left, fanning right, and waving, the classification scheme achieves recognition rates close to 100%, and for finger-motion gestures such as hooking, crossing, and circle drawing, the recognition effect is also ideal. However, for the pair of easily confused small-motion gestures, fist clenching and palm opening, the recognition effect is weaker, with recognition rates of only about 90%.
The second embodiment: in order to verify the optimization effect of the multi-head self-attention mechanism on the Bi-GRU network and to find a suitable number of heads for the multi-head self-attention mechanism, experiments were carried out on Bi-GRU networks combined with multi-head self-attention with the number of attention heads set to 0, 1, 2, 4, 8, and 16 (0 corresponding to the plain Bi-GRU network). The training learning rate was set to 0.001 and 60 iterations were performed using the Adam optimizer, giving the test accuracies shown in Table 2.
Table 2: Bi-GRU and multi-head self-attention mechanism test results

[Table 2 appears only as an image in the original publication; it lists test accuracy for 0, 1, 2, 4, 8, and 16 attention heads.]
1HBi-GRU denotes a Bi-GRU network combined with a single-head self-attention mechanism, 2HBi-GRU denotes a Bi-GRU network combined with a 2-head self-attention mechanism, and so on. The experimental results show that the effect of the multi-head self-attention mechanism is strongly influenced by the number of heads: when the number of heads is chosen appropriately, the multi-head self-attention mechanism improves the performance of the Bi-GRU network, while an unsuitable number of heads may even cause negative optimization. Combining the experimental data, the optimization effect on the Bi-GRU network is largest when the number of heads of the multi-head self-attention mechanism is set to 8, which gives the highest recognition accuracy.
Example three: the 8HBi-GRU scheme of the present invention was compared to other schemes to verify the advancement and superiority of the present invention. The results of the experiment are shown in table 3.
VGG16, ResNet50, ResNet101, DenseNet121, and DenseNet161 take the mixed feature spectrograms (RTM, DTM, ATM, and ETM) as input; the input data occupy 1.15 GB of storage, and experiments show that this approach has a large data volume, many training parameters, and a long training time.
In contrast, the methods that take the mixed feature vector as input, such as CNN, CNN-LSTM, CNN-Bi-GRU, LSTM, Bi-GRU, and 8HBi-GRU, occupy only 0.82 MB of storage, have fewer training parameters and shorter training times, and can complete training within 3 minutes. The CNN consists of 4 one-dimensional convolutional layers and 2 fully connected layers, the LSTM consists of two LSTM layers and 1 classifier, the CNN-LSTM cascades a CNN and an LSTM, and the CNN-Bi-GRU cascades a CNN and a Bi-GRU.
The experiments show that compared with CNN, CNN-LSTM, CNN-Bi-GRU, LSTM, Bi-LSTM, and Bi-GRU, the 8HBi-GRU model achieves the highest recognition rate and has stronger superiority and application prospects.
Table 3: Comparison of the performance of our model with other models

[Table 3 appears only as an image in the original publication.]
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (8)

1. The millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism is characterized by comprising the following steps of: the method comprises the following specific processes:
firstly, acquiring gesture data by adopting a millimeter wave radar with a signal form of frequency modulated continuous waves to form a gesture data training set;
secondly, preprocessing the acquired gesture data to obtain a range Doppler RD image;
thirdly, obtaining a distance-time spectrum RTM, a speed-time spectrum DTM, an azimuth spectrum ATM and a pitch spectrum ETM based on the distance Doppler RD diagram obtained in the second step, and simplifying the distance-time spectrum RTM, the speed-time spectrum DTM, the azimuth spectrum ATM and the pitch spectrum ETM to finally obtain a 28 x 4-dimensional mixed feature vector;
step four, constructing a gesture recognition network 8HBi-GRU, and inputting the mixed feature vector into the gesture recognition network Bi-GRU to obtain a trained gesture recognition network Bi-GRU;
step five, preprocessing the gesture data to be detected acquired by the millimeter wave radar to obtain a range-Doppler (RD) map; obtaining a range-time spectrum RTM, a velocity-time spectrum DTM, an azimuth spectrum ATM, and a pitch spectrum ETM based on the obtained range-Doppler RD map, and simplifying them to finally obtain a 28 × 4-dimensional mixed feature vector; and inputting the obtained 28 × 4-dimensional mixed feature vector into the trained gesture recognition network Bi-GRU to obtain the recognition result of the gesture data to be detected.
2. The millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism according to claim 1, wherein: preprocessing the acquired gesture data in the second step to obtain a Range Doppler (RD) image;
the specific process is as follows:
filtering out static object components in the gesture data acquired by the millimeter wave radar by means of an MTI moving target display technology to obtain gesture data with the static object components filtered out;
2D-FFT is carried out on the gesture data with the static object components filtered out in the distance dimension and the speed dimension, and a distance Doppler RD image is obtained;
and filtering the interference target in the obtained range Doppler RD image by adopting a constant false alarm detector CFAR to obtain the range Doppler RD image only containing the human hand target after the interference target is filtered.
3. The millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism according to claim 2, wherein: performing 2D-FFT on the gesture data with the static object components filtered out in a distance dimension and a speed dimension to obtain a distance Doppler RD image;
the specific process is as follows:
and (3) performing 2D-FFT on the gesture data with the static object components filtered out in a distance dimension and a speed dimension to obtain a distance Doppler RD diagram, wherein the expression is as follows:
Figure FDA0003986302700000021
wherein s is IF (m, N) is gesture data collected by a Frequency Modulated Continuous Wave (FMCW) radar, N c For the number of chirp signals, N adc The number of gesture data originally collected for a Frequency Modulated Continuous Wave (FMCW) radar; j is an imaginary unit, j 2 = -1; m is the index of the original pulse signal chirp, and n is the index of the original sampling point.
4. The millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism according to claim 3, wherein: in the third step, based on the range-Doppler RD map obtained in the second step, a range-time spectrum RTM, a velocity-time spectrum DTM, an azimuth spectrum ATM, and a pitch spectrum ETM are obtained, and the range-time spectrum RTM, the velocity-time spectrum DTM, the azimuth spectrum ATM, and the pitch spectrum ETM are simplified to finally obtain a 28 × 4-dimensional mixed feature vector;
the specific process is as follows:
step three, projecting the range Doppler RD image only containing the human hand target on a longitudinal axis, and splicing frame by frame to obtain a range-time spectrum RTM image;
step two, projecting the range Doppler RD image only containing the hand target on a horizontal axis, and splicing the range Doppler RD image frame by frame to obtain a speed-time spectrum DTM image;
thirdly, performing DOA estimation on the hand target point detected in the range Doppler RD image only containing the hand target, namely performing angle FFT on the horizontal channel dimension, and splicing frame by frame to obtain an azimuth spectrum ATM; performing angle FFT on a vertical channel dimension, and splicing frame by frame to obtain a pitch spectrum ETM;
and step three, simplifying the distance-time spectrum RTM, the speed-time spectrum DTM, the azimuth spectrum ATM and the pitch spectrum ETM to obtain a 28 x 4-dimensional mixed feature vector.
5. The millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism according to claim 4, wherein: simplifying the distance-time spectrum RTM, the speed-time spectrum DTM, the azimuth spectrum ATM and the pitch spectrum ETM in the third step to obtain a 28 x 4-dimensional mixed feature vector;
the specific process is as follows:
step three, four, for the distance-time spectrum RTM graph, the number of rows and the number of columns of the distance-time spectrum RTM graph are R =128 and l =28, respectively, that is, the distance-time spectrum RTM graph is formed by splicing distance distributions of targets in 28 frame data, the distance distribution of the target in each frame data is divided into 128 distance units to be represented, and assuming that the energy value of the pixel point of the ith row in the ith column is represented as E _ R (l, R), the distance estimation f _ R (l) of the target in the ith frame data can be represented as:
Figure FDA0003986302700000031
wherein L =1,2, \8230, L, R =1,2, \8230, R;
this operation is performed on 28 frames of data, and the distance-time spectrum RTM diagram of the size R × L is simplified into a feature vector of the size 1 × L;
step three, step two, for the velocity-time spectrum DTM map, the number of rows and columns of the velocity-time spectrum DTM map are D =128, l =28, respectively, and assuming that the energy value of the ith row pixel point is represented as E _ D (l, D), the distance estimate f _ D (l) of the target in the ith frame data can be represented as:
Figure FDA0003986302700000032
wherein L =1,2, \8230;, L, D =1,2, \8230;, D;
this operation is performed on 28 frames of data, and the velocity-time spectrum DTM graph of size D × L is simplified into a feature vector of size 1 × L;
step three and three, for the azimuth spectrum ATM, the number of rows and columns of the azimuth spectrum ATM are a =160 and l =28, respectively, and assuming that the energy value of the pixel point in the ith row of the ith column is denoted as E _ a (l, a), the distance estimate f _ a (l) of the target in the ith frame data can be denoted as:
Figure FDA0003986302700000033
wherein L =1,2, \8230;, L, a =1,2, \8230;, A;
this operation is performed on 28 frames of data, and the azimuth spectrum ATM of a × L size is reduced to a feature vector of size 1 × L;
step three and four, for the pitch spectrum ETM, the number of rows and columns of the pitch spectrum ETM are respectively E =50, l =28, and assuming that the energy value of the pixel point on the ith row in the ith column is represented as E _ E (l, E), the target distance estimate f _ E (l) in the ith frame data can be represented as:
Figure FDA0003986302700000041
wherein L =1,2, \8230, L, E =1,2, \8230, E;
performing this operation on 28 frames of data, the E × L-sized pitch spectrum ETM is reduced to a feature vector of size 1 × L;
and step three, step four, time alignment and splicing are carried out on the four characteristic vectors to obtain a mixed characteristic vector with 28 multiplied by 4 dimensions.
6. The millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism according to claim 5, wherein: constructing a gesture recognition network 8HBi-GRU in the fourth step, and inputting the mixed feature vector into a gesture recognition network Bi-GRU to obtain a trained gesture recognition network Bi-GRU;
the specific process is as follows:
step four, constructing a gesture recognition network 8HBi-GRU; the specific process is as follows:
the gesture recognition network 8HBi-GRU sequentially comprises a first bidirectional GRU layer, a second bidirectional GRU layer, a multi-head self-attention mechanism layer, a summation layer and a full connection layer;
and step two, inputting the mixed feature vector of 28 multiplied by 4 dimensions obtained in the step three into the gesture recognition network Bi-GRU until convergence, and obtaining the trained gesture recognition network Bi-GRU.
7. The millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism according to claim 6, wherein: inputting the 28 x 4-dimensional mixed feature vector obtained in the third step into the gesture recognition network Bi-GRU in the fourth step until convergence, so as to obtain a trained gesture recognition network Bi-GRU;
the specific process is as follows:
inputting the 28 x 4 dimensional mixed feature vector obtained in the step three into a first bidirectional GRU layer, inputting the feature vector output by the first bidirectional GRU layer into a multi-head self-attention machine making layer, inputting the feature vector output by the multi-head self-attention machine making layer into a summing layer for summing operation, inputting the feature vector output by the summing layer into a full connection layer, and outputting a 12 dimensional vector by the full connection layer, wherein the class corresponding to the maximum value is the recognition result until convergence, and obtaining the trained gesture recognition network Bi-GRU.
8. A millimeter wave radar gesture recognition system based on a multi-head self-attention mechanism, characterized in that: the system is configured to execute the millimeter wave radar gesture recognition method based on the multi-head self-attention mechanism of any one of claims 1 to 7.
CN202211566615.1A 2022-12-07 2022-12-07 Millimeter wave radar gesture recognition method and recognition system based on multi-head self-attention mechanism Pending CN115877376A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211566615.1A CN115877376A (en) 2022-12-07 2022-12-07 Millimeter wave radar gesture recognition method and recognition system based on multi-head self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211566615.1A CN115877376A (en) 2022-12-07 2022-12-07 Millimeter wave radar gesture recognition method and recognition system based on multi-head self-attention mechanism

Publications (1)

Publication Number Publication Date
CN115877376A true CN115877376A (en) 2023-03-31

Family

ID=85766436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211566615.1A Pending CN115877376A (en) 2022-12-07 2022-12-07 Millimeter wave radar gesture recognition method and recognition system based on multi-head self-attention mechanism

Country Status (1)

Country Link
CN (1) CN115877376A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824629A (en) * 2023-06-02 2023-09-29 大连理工大学 High-robustness gesture recognition method based on millimeter wave radar
CN117671777A (en) * 2023-10-17 2024-03-08 广州易而达科技股份有限公司 Gesture recognition method, device, equipment and storage medium based on radar
CN117671777B (en) * 2023-10-17 2024-05-14 广州易而达科技股份有限公司 Gesture recognition method, device, equipment and storage medium based on radar
CN117452368A (en) * 2023-12-21 2024-01-26 西安电子科技大学 SAR load radiation signal detection method and device based on broadband imaging radar
CN117452368B (en) * 2023-12-21 2024-04-02 西安电子科技大学 SAR load radiation signal detection method and device based on broadband imaging radar

Similar Documents

Publication Publication Date Title
CN115877376A (en) Millimeter wave radar gesture recognition method and recognition system based on multi-head self-attention mechanism
CN108520199B (en) Human body action open set identification method based on radar image and generation countermeasure model
CN108509910B (en) Deep learning gesture recognition method based on FMCW radar signals
CN110309690B (en) Gesture recognition detection method based on time frequency spectrum and range-Doppler spectrum
CN107240122A (en) Video target tracking method based on space and time continuous correlation filtering
Kim et al. Moving target classification in automotive radar systems using convolutional recurrent neural networks
CN111476058B (en) Gesture recognition method based on millimeter wave radar
Kim et al. Human detection based on time-varying signature on range-Doppler diagram using deep neural networks
CN114814775A (en) Radar fall detection method and equipment based on ResNet network
CN112686094A (en) Non-contact identity recognition method and system based on millimeter wave radar
CN113837131A (en) Multi-scale feature fusion gesture recognition method based on FMCW millimeter wave radar
CN113064483A (en) Gesture recognition method and related device
Janakaraj et al. STAR: Simultaneous tracking and recognition through millimeter waves and deep learning
CN116184394A (en) Millimeter wave radar gesture recognition method and system based on multi-domain spectrogram and multi-resolution fusion
CN115343704A (en) Gesture recognition method of FMCW millimeter wave radar based on multi-task learning
CN108830172A (en) Aircraft remote sensing images detection method based on depth residual error network and SV coding
CN114708663A (en) Millimeter wave radar sensing gesture recognition method based on few-sample learning
Feng et al. DAMUN: A domain adaptive human activity recognition network based on multimodal feature fusion
CN110309689B (en) Gabor domain gesture recognition detection method based on ultra-wideband radar
CN115937977A (en) Few-sample human body action recognition method based on multi-dimensional feature fusion
Song et al. High-Accuracy Gesture Recognition using Mm-Wave Radar Based on Convolutional Block Attention Module
Ege Human activity classification with deep learning using FMCW radar
CN116311067A (en) Target comprehensive identification method, device and equipment based on high-dimensional characteristic map
CN113534065B (en) Radar target micro-motion feature extraction and intelligent classification method and system
CN115909086A (en) SAR target detection and identification method based on multistage enhanced network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination