CN110348288B - Gesture recognition method based on 77GHz millimeter wave radar signal - Google Patents


Publication number
CN110348288B
Authority
CN
China
Prior art keywords
spectrogram
time
gesture
distance
size
Prior art date
Legal status
Active
Application number
CN201910445702.3A
Other languages
Chinese (zh)
Other versions
CN110348288A (en
Inventor
赵占锋
刘多
周志权
赵宜楠
冯翔
陈雄兰
Current Assignee
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date
Application filed by Harbin Institute of Technology Weihai
Priority to CN201910445702.3A
Publication of CN110348288A
Application granted
Publication of CN110348288B
Legal status: Active

Classifications

    • G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24 — Pattern recognition: classification techniques
    • G06V40/28 — Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06F2218/02 — Pattern recognition adapted for signal processing: preprocessing
    • G06F2218/08 — Pattern recognition adapted for signal processing: feature extraction
    • G06F2218/12 — Pattern recognition adapted for signal processing: classification; matching

Abstract

The invention provides a gesture recognition method based on 77GHz millimeter wave radar signals. First, intermediate frequency signals of different gesture actions are obtained through the radar, and an improved wavelet threshold function is applied to the low-frequency coefficients of the intermediate frequency signals as preprocessing, which solves the problem that short-distance gestures cannot be recognized because of the antenna coupling phenomenon. Second, a time-distance spectrogram, a time-velocity spectrogram and a time-angle spectrogram are extracted from the preprocessed intermediate frequency signals, and the three feature spectrograms are spliced into a diversified feature map that is input into a convolutional neural network for training. This remedies the incomplete information expression of traditional recognition algorithms, helps simplify the network structure, and ultimately yields a better recognition effect.

Description

Gesture recognition method based on 77GHz millimeter wave radar signal
Technical Field
The invention relates to the technical field of radar signal processing and recognition, in particular to a gesture recognition method based on 77GHz millimeter wave radar signals.
Background
Since the beginning of the twenty-first century, with the rapid development of computer technology, human-computer interaction has become one of the major research areas. At present, common human-computer interaction methods use a mouse and a keyboard as mechanical input devices; however, these methods cannot realize simple, efficient and highly free information exchange between a person and a computer. As the fields of computing and signal processing have developed, gesture recognition technology, with its lively, vivid, intuitive and efficient expression, has found more and more application scenarios, such as smart home systems, real-time sign language teaching systems and gesture-controlled game systems. With the rapid development of human-computer interaction technology, gesture recognition has become a research hotspot for scholars at home and abroad.
Most traditional gesture recognition technologies are based on videos and images. For example, Microsoft's motion sensing device Kinect forms a depth scene graph using a 3D motion sensing camera and optical coding, performs depth recognition at the pixel level of the depth image, and captures each part of the human body or gesture actions in combination with human skeleton tracking. However, conventional image- or video-based gesture recognition methods have certain limitations. First, the recognition accuracy of image-based gesture recognition is easily affected by factors such as illumination, weather and working environment. Second, image- or video-based methods are easily defeated by occlusion, such as a wall or a bookcase: when the gesture performer is located behind a wall, or at a position in a room that is partially or completely occluded, the method fails completely. In addition, image- or video-based gesture recognition carries the risk of revealing user privacy; in an era of highly sensitive personal information, the resulting privacy disclosure can have serious consequences. Finally, image- or video-based gesture recognition has relatively high requirements on computing resources and energy consumption, and generally requires an independent external power supply, which greatly limits its application scenarios and scale.
Compared with traditional video- or image-based methods, gesture recognition based on radar signals is generally non-contact and unaffected by illumination, weather and working environment, and can therefore effectively avoid the loss of recognition accuracy caused by conditions such as insufficient illumination. Meanwhile, radar can to some extent propagate through obstructions, so the influence of occluders such as walls and bookcases can be effectively avoided, and the gesture performer can realize gesture control and interaction while completely or partially occluded. Gesture recognition based on radar signals also eliminates the user privacy disclosure caused by video or image capture, protecting the user's privacy and safeguarding the user's security. In addition, most radar sensors can be integrated on a low-power chip, which greatly reduces recognition cost and computational complexity and broadens the application scenarios and scale of gesture recognition technology. The 77GHz millimeter wave radar is increasingly widely applied thanks to its light weight and small volume, and it can easily achieve a high spatial resolution, so that the precision of distance, angle and speed measurement is high. Considering the existing applications of 77GHz millimeter wave radar in intelligent driving, smart homes and the like, gesture recognition based on 77GHz millimeter wave radar has very broad application prospects.
Existing radar-based gesture recognition methods have the following problems. First, to suppress the high-energy low-frequency signals generated by the coupling of the transmitting and receiving antennas, current methods truncate the signal in the distance domain and simply discard the close-range information. In fact, for dynamic gesture motions, it is unreasonable to remove useful gesture signals entirely whenever the gesture is close to the radar. Second, radar angle information is currently underused, so the useful information provided by the radar is not fully exploited; moreover, information fusion of feature spectrograms is lacking when the input data set of the convolutional neural network is constructed, which increases the difficulty of designing the convolutional neural network at a later stage.
Disclosure of Invention
The invention aims to provide a method for removing the high-energy low-frequency signals generated by the coupling of the transmitting and receiving antennas, and a method for constructing a diversified feature map of radar gesture signals. Compared with traditional gesture recognition technology, on one hand, short-distance gesture signals can be recognized, which further broadens the application scenarios of gesture recognition; on the other hand, the gesture signal features are expressed more completely, which overcomes the limited description of gesture information in earlier methods, helps simplify the later design of the convolutional neural network, and facilitates the accurate classification of various gestures.
The invention relates to a gesture recognition method based on 77GHz millimeter wave radar signals, which comprises the following steps:
step one, designing N gesture actions; taking N = 4 as an example, design four gesture actions (hooking, radially waving the hand, clockwise rotation and anticlockwise rotation), and have different volunteers collect data of the corresponding gestures in a microwave darkroom environment, obtaining 4×ClassNum groups of data in total;
step two, mixing the transmitted signal S_T(t) and the received signal S_R(t) by means of a mixer to obtain the mixed signal S_MIX(t), passing S_MIX(t) through a low-pass filter to obtain the intermediate frequency signal x(t), and extracting from it the signal SN_I\Q(t) corresponding to each receiving antenna;
step three, because of the coupling between the transmitting and receiving antennas, a high-energy low-frequency signal exists in the gesture intermediate frequency signal x(t); preprocess the intermediate frequency signal SN_I\Q(t) with a wavelet threshold method, selecting an improved threshold function and processing only the low-frequency coefficients, and reconstruct to obtain a new intermediate frequency signal x'(t);
step four, process the preprocessed intermediate frequency signal x'(t): estimate the time-distance spectrogram, time-velocity spectrogram and time-angle spectrogram of the gesture action, and numerically normalize each of the three spectrograms;
step five, splicing the time-distance spectrogram, the time-velocity spectrogram and the time-angle spectrogram after normalization in the step four to construct a diversified characteristic spectrogram A;
step six, performing the operations of step two to step five on each of the collected 4×ClassNum groups of gesture echo data to obtain the diversified feature spectrogram atlas of the 4×ClassNum original gesture signals;
step seven, performing graying processing on each sample in the diversified feature spectrogram atlas obtained in step six to obtain the original diversified feature map set B;
step eight, removing the mean from all samples in the original diversified feature map set B and carrying out scale normalization to obtain the normalized diversified feature map set, and labeling each feature map;
step nine, dividing the diversified feature map set into a training set S_train, a validation set S_val and a test set S_test according to a certain proportion, for example 70% training, 20% validation and 10% test;
step ten, using S_train and S_val, together with their corresponding labels, as the input data set C_input of the convolutional neural network and initializing the network weights, wherein S_train is used to train the network coefficients and S_val is used, after training for a period, to verify the network and adjust its weights through the errors;
step eleven, performing one convolution-pooling operation on the input data set C_input, setting the convolution kernel size kernel_size1, convolution stride kernel_stride1, pooling size pool_size1 and pooling stride pool_stride1, to obtain the feature map set feature1;
step twelve, further performing convolution and pooling on the feature map set feature1 to extract deep features, setting the convolution kernel size kernel_size2, convolution stride kernel_stride2, pooling size pool_size2 and pooling stride pool_stride2, to obtain the feature map set feature2;
step thirteen, performing the convolution-pooling operation on the feature map set feature2 once more to extract deeper features, setting the convolution kernel size kernel_size3, convolution stride kernel_stride3, pooling size pool_size3 and pooling stride pool_stride3, to obtain the feature map set feature3;
step fourteen, passing feature3 sequentially through the fully connected layers fc4, fc5 and fc6, with sizes size4, size5 and size6 respectively, converting the feature map set into a 1×size6 column vector v1;
step fifteen, outputting the column vector v1 into different gesture categories through a softmax classifier, and, through multiple iterations until the network accuracy and loss function stabilize, obtaining the trained convolutional neural network model NetModel.
Step sixteen, loading the test data set S_test into NetModel to obtain the gesture classification result y.
The third step comprises the following steps:
3.1 select a suitable wavelet basis function and apply an N-layer wavelet transform to the gesture intermediate frequency signal SN_I\Q(t) to obtain the approximation coefficients A(i) and the detail coefficients D(1,i), D(2,i), …, D(N,i);
3.2 threshold the low-frequency coefficients A(i) with the improved threshold function
A_i' = m·A_i + (1 − m)·sgn(A_i)·(|A_i| − n)
[equation images in the original patent define the weighting factor m and the threshold n]
3.3 reconstruct the signal from the thresholded approximation coefficients A'(i) and the detail coefficients D(1,i), D(2,i), …, D(N,i) to obtain the preprocessed gesture intermediate frequency signal x'(t).
The fourth step comprises the following steps:
4.1 extracting the time-distance spectrogram: first perform an FFT in the fast time domain, then weight and average in the slow time domain, and finally accumulate across frames to obtain a time-distance spectrogram of size FFTNum1 × FrameNum;
4.2 extracting the time-velocity spectrogram: first perform a two-dimensional FFT frame by frame to obtain a range-Doppler map, then find its maximum, extract the line where the maximum lies, take that line's velocity information as the velocity of the current frame signal, and finally accumulate across frames to obtain a time-velocity spectrogram of size FFTNum2 × FrameNum;
4.3 extracting the time-angle spectrogram: first, frame by frame, perform a two-dimensional FFT on the corresponding frame signals of the N receiving antennas to obtain N range-Doppler maps. Extract the maximum from each of the N range-Doppler maps, arrange the maxima into a 1×N one-dimensional vector, and perform an FFT on that vector to obtain the angle information corresponding to the current frame signal; finally accumulate across frames to obtain a time-angle map of size FFTNum3 × FrameNum;
4.4 numerical normalization: perform dispersion (min-max) standardization on each feature spectrogram; taking the time-distance spectrogram D as an example, it is scaled according to
D' = (D − min(D)) / (max(D) − min(D))
so that, after this scaling, every value in the time-distance spectrogram falls within the [0,1] range.
Description of the drawings:
FIG. 1 is a flow chart of the present invention.
FIG. 2 illustrates specific gesture actions designed and identified in the present invention.
FIG. 3 is a simulation diagram of wavelet low-frequency threshold processing for each gesture.
Fig. 4 is a time-distance spectrogram simulation diagram of each gesture.
Fig. 5 is a time-velocity spectrogram simulation diagram of each gesture.
Fig. 6 is a time-angle spectrogram simulation diagram of each gesture.
Fig. 7 is a diversified feature map (normalized result) of each gesture operation.
Fig. 8 is a network architecture of a convolutional neural network.
Fig. 9 shows Table 1 (the gesture classification results).
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
Take four gestures as an example:
step one, design four gesture actions (hooking, radial waving, clockwise rotation and anticlockwise rotation), as shown in fig. 2, and configure the relevant parameters of the 77GHz millimeter wave radar. In this patent, the sampling frequency is set to 2000 kHz, the frame period is 55 ms, 100 frames of data are collected each time, each frame contains 128 chirp signals, and each chirp signal has 64 sampling points. The antenna configuration is one transmitting antenna and four receiving antennas, and the corresponding gesture data are collected by three volunteers in a microwave darkroom environment;
step two, mixing the transmitted signal S_T(t) and the received signal S_R(t) by means of a mixer to obtain the mixed signal S_MIX(t), passing S_MIX(t) through a low-pass filter to obtain the intermediate frequency signal x(t), and extracting from it the signal SN_I\Q(t) corresponding to each receiving antenna. The specific steps are as follows:
step 2-1: the specific expression of the radar transmitted signal is
S_T(t) = A_T·cos(2π·f_c·t + π·(B/T)·t²)
f_T(t) = f_c + (B/T)·t, 0 ≤ t ≤ T
wherein A_T is the amplitude of the transmitted signal, f_c is the carrier frequency, T is the sawtooth period, B is the signal bandwidth, and f_T(t) is the frequency of the transmitted signal during the period T;
step 2-2: the radar received signal is the transmitted signal delayed by Δt:
S_R(t) = A_R·cos(2π·f_c·(t − Δt) + π·(B/T)·(t − Δt)²)
f_R(t) = f_c + (B/T)·(t − Δt) = f_T(t) − Δf
wherein A_R is the amplitude of the received signal, f_R(t) is the frequency of the received signal during the period T, and Δf = (B/T)·Δt is the frequency offset;
step 2-3: mixing the transmitted signal S_T(t) and the received signal S_R(t) by means of the mixer gives the mixed signal
S_MIX(t) = S_T(t)·S_R(t)
step 2-4: passing the mixed signal S_MIX(t) through a low-pass filter gives the intermediate frequency signal x(t):
x(t) = (A_T·A_R/2)·cos(2π·Δf·t + 2π·f_c·Δt − π·(B/T)·Δt²)
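The dechirp chain of steps 2-1 to 2-4 can be checked with a small numerical sketch: mixing a chirp with a delayed copy of itself leaves a single tone at Δf = (B/T)·Δt. The 64-samples-per-chirp and 2000 kHz figures come from this patent's radar configuration; the bandwidth, carrier frequency and target delay below are illustrative assumptions, and a complex-baseband model stands in for the real mixer plus low-pass filter.

```python
import numpy as np

# Scaled FMCW dechirp simulation (steps 2-1 to 2-4).
fs = 2.0e6          # sampling frequency, Hz (patent: 2000 kHz)
N = 64              # samples per chirp (patent: 64 points)
T = N / fs          # sawtooth period, s
B = 4.0e9           # sweep bandwidth, Hz (assumed)
fc = 77.0e9         # carrier frequency, Hz
slope = B / T       # chirp slope, Hz/s

t = np.arange(N) / fs
dt = 2.0e-9         # round-trip delay for a target ~0.3 m away (assumed)

def chirp_phase(tau):
    """Instantaneous phase of the sawtooth FMCW chirp."""
    return 2 * np.pi * fc * tau + np.pi * slope * tau ** 2

s_t = np.exp(1j * chirp_phase(t))        # transmitted signal S_T(t)
s_r = np.exp(1j * chirp_phase(t - dt))   # received signal S_R(t), delayed copy

# In this complex-baseband model, mixing followed by low-pass filtering
# collapses to multiplying by the conjugate; the product is a single tone
# whose frequency is the beat frequency slope * dt.
x = s_t * np.conj(s_r)

beat_bin = int(np.argmax(np.abs(np.fft.fft(x))))
expected_bin = round(slope * dt / (fs / N))   # beat frequency in FFT bins
```

With these assumed numbers the beat tone lands exactly on one FFT bin, which is why a fast-time FFT (step 4.1) turns delay, and hence distance, into a spectral peak.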
step 2-5: one-transmitting four-receiving is adopted, the four receiving antennas have 8 paths of I \ Q channels in total, the intermediate frequency signal x (t) is assigned to one path of data every 8 points, and the intermediate frequency signal corresponding to the antenna of the missing path can be extractedNumber SN I\Q (t);
step three, because of the coupling between the transmitting and receiving antennas, a high-energy low-frequency signal exists in the gesture intermediate frequency signal; preprocess the intermediate frequency signal SN_I\Q(t) with a wavelet threshold method, selecting an improved threshold function and processing only the low-frequency coefficients, and reconstruct to obtain a new intermediate frequency signal x'(t). The specific steps are as follows:
step 3-1: select sym6 as the wavelet basis function and apply a three-layer wavelet decomposition to the gesture intermediate frequency signal SN_I\Q(t) to obtain the approximation coefficients A(i) and the detail coefficients D(1,i), D(2,i) and D(3,i);
step 3-2: threshold the low-frequency coefficients A(i) with the improved threshold function
A_i' = m·A_i + (1 − m)·sgn(A_i)·(|A_i| − n)
[equation images in the original patent define the weighting factor m and the threshold n]
Step 3-3: by using
Figure GDA0002190221990000067
D (1, i), D (2, i), D (3, i) are reconstructedAnd obtaining a preprocessed signal x' (t), wherein the simulation result of wavelet low-frequency threshold processing of each gesture is shown in fig. 3.
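The decompose-threshold-reconstruct flow of step three can be sketched in a self-contained way. The patent uses a sym6 basis with three decomposition levels; to avoid external wavelet libraries this sketch substitutes a single-level Haar transform, and since the patent's definitions of the weighting factor m and threshold n are given only as equation images, fixed illustrative values are assumed. The threshold formula itself (a weighted mix of hard and soft thresholding) follows step 3-2.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar transform: approximation (low-freq) and detail coefficients."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_idwt(a, d):
    """Inverse of haar_dwt (perfect reconstruction)."""
    x = np.empty(2 * a.size)
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def improved_threshold(a, m=0.5, n=0.1):
    """A_i' = m*A_i + (1-m)*sgn(A_i)*(|A_i|-n) above the threshold, 0 below.

    m and n are assumed values; the patent defines them in equation images.
    The zeroing of sub-threshold coefficients is likewise an assumption.
    """
    return np.where(np.abs(a) > n,
                    m * a + (1 - m) * np.sign(a) * (np.abs(a) - n),
                    0.0)

# Stand-in IF signal: a sinusoid plus a little noise.
rng = np.random.default_rng(0)
sn = np.sin(2 * np.pi * 0.2 * np.arange(256)) + 0.05 * rng.standard_normal(256)

a, d = haar_dwt(sn)
# Only the low-frequency coefficients are thresholded (step three);
# the detail coefficients pass through unchanged.
x_prime = haar_idwt(improved_threshold(a), d)
```

Processing only the approximation coefficients is the key design choice: the antenna-coupling interference is low-frequency, so suppressing it there leaves the high-frequency gesture content in the detail coefficients untouched.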
step four, process the preprocessed intermediate frequency signal x'(t): estimate the time-distance spectrogram, time-velocity spectrogram and time-angle spectrogram of the gesture action, and numerically normalize each of the three spectrograms. The specific steps are as follows:
4.1 extracting the time-distance spectrogram: first perform an FFT in the fast time domain, then weight and average in the slow time domain, and finally accumulate across frames to obtain a time-distance spectrogram of size FFTNum1 × FrameNum; the simulation results of the time-distance spectrograms of each gesture action are shown in fig. 4;
4.2 extracting the time-velocity spectrogram: first perform a two-dimensional FFT frame by frame to obtain a range-Doppler map, find its maximum, extract the line where the maximum lies, take that line's velocity information as the velocity of the current frame signal, and finally accumulate across frames to obtain a time-velocity spectrogram of size FFTNum2 × FrameNum; the simulation results of the time-velocity spectrograms of each gesture action are shown in fig. 5;
4.3 extracting the time-angle spectrogram: first, frame by frame, perform a two-dimensional FFT on the corresponding frame signals of the N receiving antennas to obtain N range-Doppler maps. Extract the maximum from each of the N range-Doppler maps, arrange the maxima into a 1×N one-dimensional vector, and perform an FFT on that vector to obtain the angle information corresponding to the current frame signal; finally accumulate across frames to obtain a time-angle map of size FFTNum3 × FrameNum; the simulation results of the time-angle spectrograms of each gesture action are shown in fig. 6;
4.4 numerical normalization: perform dispersion (min-max) standardization on each feature spectrogram; taking the time-distance spectrogram D as an example, it is scaled according to
D' = (D − min(D)) / (max(D) − min(D))
so that, after this scaling, every value in the time-distance spectrogram falls within the [0,1] range.
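Steps 4.1 and 4.4 can be sketched together: a fast-time FFT per chirp, averaging over the chirps of each frame, stacking frames as columns, then min-max normalizing into [0,1]. The random cube below is stand-in data (a real run would use the preprocessed signal x'(t)), uniform weights are assumed for the slow-time averaging, and the frame/chirp/sample counts follow the patent's radar configuration.

```python
import numpy as np

# Time-distance spectrogram extraction (4.1) plus min-max normalization (4.4).
FrameNum, ChirpNum, SampleNum = 100, 128, 64   # patent's acquisition parameters
FFTNum1 = 64

rng = np.random.default_rng(1)
cube = rng.standard_normal((FrameNum, ChirpNum, SampleNum))  # stand-in IF data

range_fft = np.abs(np.fft.fft(cube, n=FFTNum1, axis=2))  # FFT in fast time
frames = range_fft.mean(axis=1)   # average over slow time (uniform weights assumed)
trd = frames.T                    # accumulate frames as columns: FFTNum1 x FrameNum

# 4.4 dispersion standardization: D' = (D - min(D)) / (max(D) - min(D))
trd_norm = (trd - trd.min()) / (trd.max() - trd.min())
```

The same normalization is applied to the time-velocity and time-angle spectrograms so that all three live on a common [0,1] scale before they are spliced in step five.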
Step five, splicing the time-distance spectrogram, the time-velocity spectrogram and the time-angle spectrogram after normalization in the step four according to columns to construct a diversified characteristic spectrogram A with the size of (FFTNum 1+ FFTNum2+ FFTNum 3) × FrameNum;
step six, performing the operations of step two to step five on each of the collected 4×ClassNum groups of gesture echo data to obtain the diversified feature spectrogram atlas of the 4×ClassNum original gesture signals;
Step seven, the diversified characteristic spectrum atlas obtained in the step six is subjected to
Figure GDA0002190221990000073
Carrying out graying processing on each sample to obtain an original diversified feature map set B;
step eight, removing the mean from all samples in the original diversified feature map set B and carrying out scale normalization to obtain the normalized diversified feature map set; the diversified feature maps of each gesture action are shown in fig. 7, and each feature map is labeled. The specific steps are as follows:
8.1 compute the mean of all samples in the diversified feature map set B:
B̄ = (1/M)·Σ_{k=1}^{M} B_k
where B_k is the k-th sample and M is the total number of samples. It is worth noting that the mean is taken over all samples in the diversified data set B, not merely within the class of a certain gesture type;
8.2 remove the mean from each sample, i.e. subtract the all-sample mean obtained in 8.1 from every sample in the set B;
8.3 scale-normalize the mean-removed diversified feature maps: normalize each feature map in the data set to size Height × Width; specifically, down-sample when the original feature map is larger than Height × Width, and up-sample otherwise, finally obtaining the normalized diversified feature map set.
Step nine, a diversified feature map set
Figure GDA0002190221990000079
Divided into training sets S according to a certain proportion train Verification set S val And test set S test For example, the training set accounts for 70%, the validation set accounts for 20%, and the test set accounts for 10%;
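The 70/20/10 split that step nine gives as its example can be sketched as a shuffled index partition. The sample count and map size below are assumed stand-ins (e.g. ClassNum = 100 per gesture), not values stated by the patent.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 400                                 # 4 gestures x ClassNum = 100 (assumed)
X = rng.standard_normal((n_samples, 224, 100))  # stand-in diversified feature maps
y = np.repeat(np.arange(4), n_samples // 4)     # one label per gesture class

# Shuffle, then cut at 70% and 90% of the data.
idx = rng.permutation(n_samples)
n_train, n_val = int(0.7 * n_samples), int(0.2 * n_samples)
S_train, S_val, S_test = np.split(idx, [n_train, n_train + n_val])

X_train = X[S_train]   # training maps; labels follow via y[S_train]
```

Shuffling before splitting keeps each subset mixed across gesture classes and volunteers, which matters for the validation-based weight adjustment of step ten.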
step ten, using S_train and S_val, together with their corresponding labels, as the input data set C_input of the convolutional neural network and initializing the network weights, wherein S_train is used to train the network coefficients and S_val is used, after training for a period, to verify the network and adjust its weights through error back propagation;
step eleven, performing one convolution-pooling operation on the input data set C_input, setting the convolution kernel size to 11×11, the convolution stride to 4, the pooling size to 3×3 and the pooling stride to 2, to obtain the feature map set feature1;
step twelve, further performing convolution and pooling on the feature map set feature1 to extract deep features, setting the convolution kernel size to 5×5, the convolution stride to 1, the pooling size to 3×3 and the pooling stride to 2, to obtain the feature map set feature2;
step thirteen, performing the convolution-pooling operation on the feature map set feature2 once more to extract deeper features, setting the convolution kernel size to 3×3, the convolution stride to 1, the pooling size to 3×3 and the pooling stride to 2, to obtain the feature map set feature3;
step fourteen, passing feature3 sequentially through the fully connected layers fc4, fc5 and fc6, with sizes 4096, 2048 and 1000 respectively, converting the feature map set into a 1×1000 column vector v1;
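With the layer parameters of steps eleven to thirteen, the spatial size of feature3 can be walked through using the standard valid-convolution size formula. The 227×227 input and the absence of padding are assumptions (the patent leaves the normalized Height × Width unspecified); these kernel/stride values happen to match that input size.

```python
# Shape walk-through of the three conv-pool stages (steps eleven to thirteen).
def out_size(n, k, s):
    """Output length of a valid (unpadded) convolution or pooling:
    floor((n - k) / s) + 1."""
    return (n - k) // s + 1

n = 227   # assumed input Height = Width after step-eight scale normalization
for k, s in [(11, 4), (3, 2),   # conv1 (11x11, stride 4) + pool1 (3x3, stride 2)
             (5, 1), (3, 2),    # conv2 (5x5,  stride 1) + pool2 (3x3, stride 2)
             (3, 1), (3, 2)]:   # conv3 (3x3,  stride 1) + pool3 (3x3, stride 2)
    n = out_size(n, k, s)
# n is now the spatial side of feature3, which the fc4/fc5/fc6 layers flatten.
```

Tracking these sizes in advance is how one checks that the flattened feature3 actually fits the 4096-wide fc4 layer before training starts.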
step fifteen, outputting the column vector v1 into different gesture categories through a softmax classifier, and, through multiple iterations until the network accuracy and loss function stabilize, obtaining the trained convolutional neural network model NetModel.
Step sixteen, loading the test data set S_test into NetModel to obtain the gesture classification result y; the classification results of the gesture data are shown in Table 1.

Claims (2)

1. A gesture recognition method based on 77GHz millimeter wave radar signals is characterized by comprising the following steps:
step one, designing N gesture actions, and having different volunteers collect data of the corresponding gestures in a microwave darkroom environment, obtaining N×ClassNum groups of data in total;
step two, analyzing the radar raw data to obtain the intermediate frequency signal x(t), and extracting from it the intermediate frequency signal SN_I\Q(t) corresponding to each receiving antenna;
step three, because of the coupling between the transmitting and receiving antennas, a high-energy low-frequency signal exists in the gesture intermediate frequency signal x(t); preprocess the intermediate frequency signal SN_I\Q(t) with a wavelet threshold method, selecting an improved threshold function and processing only the low-frequency coefficients, and reconstruct to obtain a new intermediate frequency signal x'(t);
step four, process the preprocessed intermediate frequency signal x'(t): estimate the time-distance spectrogram, time-velocity spectrogram and time-angle spectrogram of the gesture action, and numerically normalize each of the three spectrograms;
step five, splicing the time-distance spectrogram, the time-velocity spectrogram and the time-angle spectrogram after normalization in the step four to construct a diversified characteristic spectrogram A;
step six, performing the operations of step two to step five on each of the collected N×ClassNum groups of gesture echo data to obtain the diversified feature spectrogram atlas of the N×ClassNum original gesture signals;
step seven, performing graying processing on each sample in the diversified feature spectrogram atlas obtained in step six to obtain the original diversified feature map set B;
step eight, removing the mean from all samples in the original diversified feature map set B and carrying out scale normalization to obtain the normalized diversified feature map set, and labeling each feature map;
step nine, dividing the diversified feature map set into a training set S_train, a validation set S_val and a test set S_test according to a certain proportion;
Step ten, taking S_train and S_val together with their corresponding labels as the input data set C_input of the convolutional neural network and initializing the network weights, where S_train is used to train the network coefficients and S_val is used for network validation after a period of training, the network weights being adjusted through error back-propagation;
step eleven, performing a first convolution-pooling operation on the input data set C_input, setting the convolution kernel size kernel_size1, convolution stride kernel_stride1, pooling size pool_size1, and pooling stride pool_stride1, to obtain a feature map set feature1;
step twelve, performing further convolution pooling on the feature map set feature1 to extract deep features, setting the convolution kernel size kernel_size2, convolution stride kernel_stride2, pooling size pool_size2, and pooling stride pool_stride2, to obtain a feature map set feature2;
step thirteen, performing a convolution-pooling operation on the feature map set feature2 once more to extract deeper features, setting the convolution kernel size kernel_size3, convolution stride kernel_stride3, pooling size pool_size3, and pooling stride pool_stride3, to obtain a feature map set feature3;
step fourteen, passing feature3 sequentially through fully connected layers fc4, fc5, and fc6, with sizes set to size4, size5, and size6 respectively, converting the feature map set into a 1 × size6 column vector v1;
step fifteen, mapping the column vector v1 onto the different gesture categories through a softmax classifier; after multiple iterations, when the network accuracy and the loss function stabilize, a trained convolutional neural network model NetModel is obtained;
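A quick way to check steps eleven through fourteen is to track the spatial size of the feature maps through each convolution-pooling stage. The kernel/stride/pooling values below are illustrative assumptions (the claim names kernel_size1..3, pool_size1..3, etc. but does not disclose concrete numbers), with valid (no-padding) convolutions assumed:

```python
def conv_pool_out(size, kernel_size, kernel_stride, pool_size, pool_stride):
    """Spatial size after one valid (no-padding) convolution followed by
    pooling, mirroring the parameters named in steps eleven to thirteen."""
    conv = (size - kernel_size) // kernel_stride + 1
    return (conv - pool_size) // pool_stride + 1

# Input: a 112 x 40 diversified feature map (assumed dimensions).
h, w = 112, 40
# (kernel_size, kernel_stride, pool_size, pool_stride) per stage -- illustrative.
for k, ks, p, ps in [(5, 1, 2, 2), (3, 1, 2, 2), (3, 1, 2, 2)]:
    h = conv_pool_out(h, k, ks, p, ps)
    w = conv_pool_out(w, k, ks, p, ps)
print(h, w)   # 12 3 -> feature3 is flattened into the fc4/fc5/fc6 stack
```

Tracking these sizes determines the input dimension of the first fully connected layer fc4.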
step sixteen, loading the test data set S_test into NetModel to obtain the gesture classification result y;
the preprocessing in step three specifically comprises the following steps:
step 3-1, the high-energy low-frequency signal produced by coupling between the transmitting and receiving antennas is removed by applying an improved wavelet threshold function, given by formula 1:
ω'_{j,k} = W_{j,k}(m,n) when |ω_{j,k}| ≥ λ, and ω'_{j,k} = 0 when |ω_{j,k}| < λ   (formula 1)
wherein
W_{j,k}(m,n) = mω_{j,k} + (1-m)sgn(ω_{j,k})(|ω_{j,k}| - n)
Figure FDA0004036335390000022
Figure FDA0004036335390000023
Figure FDA0004036335390000024
The improved threshold function is continuous at λ and -λ, and the values of α and β can be adjusted to adapt the method to the requirements of various noise scenarios;
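The thresholding rule of step 3-1 can be sketched in NumPy as below. Here m and n are taken directly as free parameters standing in for the patent's α/β-dependent values, since the exact dependence is not reproduced in this text; note that m = 1 reduces the rule to hard thresholding and m = 0, n = λ to soft thresholding:

```python
import numpy as np

def improved_threshold(w, lam, m, n):
    """Piecewise wavelet threshold: coefficients with |w| < lam are zeroed;
    the rest are mapped by W(m, n) = m*w + (1 - m)*sgn(w)*(|w| - n).
    m and n are assumed inputs standing in for the alpha/beta-tuned values."""
    W = m * w + (1 - m) * np.sign(w) * (np.abs(w) - n)
    return np.where(np.abs(w) >= lam, W, 0.0)

w = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
hard = improved_threshold(w, lam=1.0, m=1.0, n=1.0)   # m=1: hard thresholding
soft = improved_threshold(w, lam=1.0, m=0.0, n=1.0)   # m=0, n=lam: soft thresholding
print(hard)   # [-3.  0.  0.  1.  4.]
print(soft)   # [-2.  0.  0.  0.  3.]
```

Interpolating between these two extremes via m is what lets the function stay continuous at ±λ while preserving large coefficients.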
step 3-2, in view of the characteristics of the interference signal, the gesture intermediate-frequency signal SN_{I/Q}(t) is decomposed into N wavelet layers to obtain approximation coefficients A(i) and detail coefficients D(1,i), D(2,i), ..., D(N,i), and only the low-frequency coefficients A(i) are thresholded, namely
A'_i = mA_i + (1-m)sgn(A_i)(|A_i| - n) when |A_i| ≥ λ, and A'_i = 0 when |A_i| < λ
Figure FDA0004036335390000031
Figure FDA0004036335390000032
Figure FDA0004036335390000033
the thresholded approximation coefficients A'(i) and the detail coefficients D(1,i), D(2,i), ..., D(N,i) are then used for wavelet reconstruction to obtain a preprocessed signal x' (t).
2. The gesture recognition method based on 77GHz millimeter wave radar signals according to claim 1, characterized in that the construction of the diversified feature spectrogram A in step five specifically comprises the following steps:
step 5-1, extracting the time-distance spectrogram: first perform an FFT in the fast-time domain, then weighted averaging in the slow-time domain, and finally inter-frame accumulation, obtaining a time-distance spectrogram of size FFTNum1 × FrameNum;
step 5-2, extracting the time-velocity spectrogram: first perform a two-dimensional FFT frame by frame to obtain a range-Doppler map, find its maximum, extract the row containing the maximum, and take that row's velocity information as the velocity of the current frame signal; finally accumulate across frames, obtaining a time-velocity spectrogram of size FFTNum2 × FrameNum;
step 5-3, extracting the time-angle spectrogram: first, frame by frame, perform a two-dimensional FFT on the corresponding frame signals of the N receiving antennas to obtain N range-Doppler maps, extract the maximum from each of the N maps, arrange the maxima into a 1 × N one-dimensional vector, and perform an FFT on this vector to obtain the angle information of the current frame signal; finally accumulate the angle information across frames, obtaining a time-angle spectrogram of size FFTNum3 × FrameNum;
step 5-4, constructing the diversified feature spectrogram A: the three feature spectrograms extracted above have the same number of columns; after each of the three spectrograms is normalized, the normalized spectrograms are spliced by rows to obtain the diversified feature spectrogram, whose dimensions are (FFTNum1 + FFTNum2 + FFTNum3) × FrameNum.
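A minimal simulation of the time-distance extraction in step 5-1, using a single stationary point target, uniform slow-time averaging (the claim's weighting is unspecified), and illustrative sampling values:

```python
import numpy as np

FFTNum1, FrameNum, chirps, samples = 64, 40, 8, 64
fs, f_beat = 1.0e6, 93750.0   # assumed ADC rate and beat frequency (exactly range bin 6)

t = np.arange(samples) / fs
# One frame of IF data: chirps x samples; a real pipeline would use successive frames.
frame = np.tile(np.cos(2 * np.pi * f_beat * t), (chirps, 1))

td_map = np.zeros((FFTNum1, FrameNum))
for f in range(FrameNum):
    range_fft = np.abs(np.fft.fft(frame, n=FFTNum1, axis=1))  # fast-time (range) FFT per chirp
    td_map[:, f] = range_fft.mean(axis=0)                     # uniform slow-time average

print(td_map.shape)                      # (64, 40) == FFTNum1 x FrameNum
peak_bin = int(np.argmax(td_map[:, 0]))
print(peak_bin)                          # 6: the simulated target's range bin
```

Repeating the per-frame reduction and stacking columns is exactly the inter-frame accumulation that yields the FFTNum1 × FrameNum map.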
CN201910445702.3A 2019-05-27 2019-05-27 Gesture recognition method based on 77GHz millimeter wave radar signal Active CN110348288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910445702.3A CN110348288B (en) 2019-05-27 2019-05-27 Gesture recognition method based on 77GHz millimeter wave radar signal

Publications (2)

Publication Number Publication Date
CN110348288A CN110348288A (en) 2019-10-18
CN110348288B true CN110348288B (en) 2023-04-07

Family

ID=68174720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910445702.3A Active CN110348288B (en) 2019-05-27 2019-05-27 Gesture recognition method based on 77GHz millimeter wave radar signal

Country Status (1)

Country Link
CN (1) CN110348288B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110687816A (en) * 2019-10-31 2020-01-14 复旦大学 Intelligent household control system and method based on millimeter wave radar
CN110988863A (en) * 2019-12-20 2020-04-10 北京工业大学 Novel millimeter wave radar gesture signal processing method
CN113408328B (en) * 2020-03-16 2023-06-23 哈尔滨工业大学(威海) Gesture segmentation and recognition algorithm based on millimeter wave radar
CN111461037B (en) * 2020-04-07 2023-04-07 电子科技大学 End-to-end gesture recognition method based on FMCW radar
CN111929674B (en) * 2020-07-10 2022-10-04 西安电子科技大学 Intelligent amplitude comparison angle measurement method based on neural network, storage medium and equipment
CN112183279B (en) * 2020-09-21 2022-06-10 中国人民解放军国防科技大学 Communication radiation source individual identification method based on IQ graph characteristics
CN112148128B (en) * 2020-10-16 2022-11-25 哈尔滨工业大学 Real-time gesture recognition method and device and man-machine interaction system
CN112446348B (en) * 2020-12-08 2022-05-31 电子科技大学 Behavior identification method based on characteristic spectrum flow
CN112198966B (en) * 2020-12-08 2021-03-16 中南大学 Stroke identification method and system based on FMCW radar system
CN113406610B (en) * 2021-06-16 2023-06-23 深圳大学 Target detection method, device, equipment and storage medium
CN113567950B (en) * 2021-09-24 2021-12-17 巍泰技术(武汉)有限公司 Millimeter wave radar distance and velocity spectrum estimation method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106055089A (en) * 2016-04-27 2016-10-26 深圳市前海万象智慧科技有限公司 Control system for gesture recognition based on man-machine interaction equipment and control method for same
CN106295684A * 2016-08-02 2017-01-04 清华大学 Continuous/discontinuous gesture recognition method based on micro-Doppler features
CN108509910A * 2018-04-02 2018-09-07 重庆邮电大学 Deep learning gesture recognition method based on FMCW radar signal
CN109271838A * 2018-07-19 2019-01-25 重庆邮电大学 Three-parameter feature fusion gesture recognition method based on FMCW radar

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhanfeng Zhao. Feature Extraction and Identification of Pipeline Intrusion Based on Phase-Sensitive Optical Time Domain Reflectometer. Web of Science, 2019, full text. *
Zhang Jiajun. Research on a gesture recognition system based on 5 GHz-band radar. CNKI, 2018, full text. *


Similar Documents

Publication Publication Date Title
CN110348288B (en) Gesture recognition method based on 77GHz millimeter wave radar signal
Wang et al. TS-I3D based hand gesture recognition method with radar sensor
Liu et al. Deep learning and recognition of radar jamming based on CNN
CN107024685A Gesture recognition method based on range-velocity features
CN112001306A (en) Electroencephalogram signal decoding method for generating neural network based on deep convolution countermeasure
Zhao et al. Cubelearn: End-to-end learning for human motion recognition from raw mmwave radar signals
CN111784560A (en) SAR and optical image bidirectional translation method for generating countermeasure network based on cascade residual errors
Ni et al. LPI radar waveform recognition based on multi-resolution deep feature fusion
CN103473559A (en) SAR image change detection method based on NSCT domain synthetic kernels
CN113376600A (en) Pedestrian radar echo denoising method based on RSDNet
CN113408328A (en) Gesture segmentation and recognition algorithm based on millimeter wave radar
CN115343704A (en) Gesture recognition method of FMCW millimeter wave radar based on multi-task learning
Li et al. Image reflection removal using end‐to‐end convolutional neural network
CN113449587B (en) Human behavior recognition and identity authentication method and device and electronic equipment
CN113111706B (en) SAR target feature unwrapping and identifying method for azimuth continuous deletion
CN113901931A (en) Knowledge distillation model-based behavior recognition method for infrared and visible light videos
CN114037891A (en) High-resolution remote sensing image building extraction method and device based on U-shaped attention control network
CN112241001A (en) Radar human body action recognition method and device, electronic equipment and storage medium
Yang et al. Extraction and denoising of human signature on radio frequency spectrums
Yang et al. Design of airborne target tracking accelerator based on KCF
CN115345216A (en) FMCW radar interference elimination method fusing prior information
Zheng et al. Unsupervised human contour extraction from through-wall radar images using dual UNet
CN114580468A (en) Interference signal identification method based on time-frequency waterfall graph and convolutional neural network
CN113421281A (en) Pedestrian micromotion part separation method based on segmentation theory
Wang et al. Hand gesture recognition scheme based on millimeter-wave radar with convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant