CN115294656A - FMCW radar-based hand key point tracking method - Google Patents

FMCW radar-based hand key point tracking method

Info

Publication number
CN115294656A
CN115294656A (application CN202211013101.3A)
Authority
CN
China
Prior art keywords
key point
network
radio frequency
hand
radar
Prior art date
Legal status
Pending
Application number
CN202211013101.3A
Other languages
Chinese (zh)
Inventor
韩崇
李帮杰
孙力娟
郭剑
薛景
王娟
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority: CN202211013101.3A
Publication: CN115294656A
Legal status: Pending

Classifications

    • G06V 40/28 — Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G01S 13/72 — Radar-tracking systems for two-dimensional tracking, e.g. combination of angle and range tracking, track-while-scan radar
    • G01S 13/867 — Combination of radar systems with cameras
    • G06N 3/08 — Learning methods (neural networks)
    • G06V 10/22 — Image preprocessing by selection of a specific region containing or referencing a pattern
    • G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Image or video recognition or understanding using neural networks


Abstract

A method for tracking hand key points based on an FMCW radar addresses the limitations of existing optical-camera approaches, such as sensitivity to lighting conditions and the risk of privacy leakage. Using cross-modal supervision, images from a camera and data from the radar are both fed to neural networks during training: the radar data and the video stream are synchronized, hand key point information is extracted from the video stream and preprocessed, and it then serves as the supervision signal for the radio-frequency signals processed by the neural network. Once trained, the system outputs hand key point tracks from radio-frequency signals alone. The method improves gesture-recognition accuracy while protecting personal privacy and remaining independent of lighting conditions, and it is robust, stable, real-time, and efficient.

Description

FMCW radar-based hand key point tracking method
Technical Field
The invention belongs to the intersection of wireless sensing and computer vision, relates to the technical fields of millimeter-wave radar and neural networks, and in particular relates to a method for tracking hand key points based on an FMCW radar.
Background
In recent years, with advances in science and technology, human-computer interaction has become closely intertwined with daily life. The pursuit of more efficient and easier information exchange is at the core of human-computer interaction research. New technologies such as face recognition, gesture recognition, lip reading, speech recognition, and posture recognition are gradually replacing the computer-centric interaction mode of the past, making the user the true center of human-computer interaction.
Acquiring gesture information with an optical camera is a mature gesture-recognition approach. Although high-resolution cameras push the recognition rate of visual gesture recognition above 90%, the lighting conditions of the environment strongly affect that rate, and the ability to describe gesture information degrades sharply when light is too strong or too dim. Current technology can mitigate this by adding a night-vision camera and the like, but this raises the cost and greatly narrows the range of application. Second, the approach suffers from privacy leakage: the nature of optical images and video makes leakage of image and video information possible, and in an era extremely sensitive to personal privacy, such leakage can severely affect a product and its technical development. Finally, the approach has high energy consumption and demands substantial computing resources, so it cannot be deployed at scale on systems with relatively simple hardware.
Disclosure of Invention
To address these technical problems, the invention provides a hand key point tracking method that performs gesture recognition with an FMCW radar. Its advantage is that the data stream consists of radar signals rather than optical image signals, so even if the signals leak, an attacker can hardly extract any directly useful information, which provides a measure of security for the system. In addition, FMCW-radar-based dynamic gesture recognition can be integrated on a low-power, compact, high-speed processing chip, which makes embedding in portable devices possible.
A method for tracking hand key points based on an FMCW radar comprises the following steps:
step 1, initializing the FMCW radar system and configuring the parameters for hand-information sampling, including the transmit/receive antenna pairs, the number of sampling points, and the sampling time, while simultaneously filming the complete hand motion trajectory with a camera;
step 2, preprocessing the acquired image information and radio-frequency signals accordingly; for the image information, converting the BGR images stored by the camera into RGB images; for the radio-frequency signals, first performing clutter suppression, then applying a Fourier transform (FFT) along the range and velocity dimensions to form a range-Doppler heat map RDI, and processing the signals in the directions horizontal and vertical to the ground to form a horizontal heat map H_l and a vertical heat map H_v;
step 3, for the stored image information, processing the pictures, capturing the hand information in each picture, automatically marking each hand key point from the obtained information, and obtaining a hand key point confidence map from the video;
step 4, encoding the obtained H_l and H_v with a radio-frequency encoding network and feeding the encodings into a CNN; passing the feature maps through different convolutional layers of the CNN for feature extraction, then decoding the encoded heat map information with a radio-frequency decoding network to obtain a key point confidence map from the radio-frequency data;
step 5, using cross-modal learning and supervised learning to let image information and radio-frequency information of different modalities interact: the network producing the video key point confidence map is called the teacher network, the network producing the radio-frequency key point confidence map is called the student network, and together they form a cross-supervised teacher-student network that checks the reliability of the key point confidence map obtained from the radio-frequency signals; the positions of the hand key points derived from the radio-frequency signals are recognized and tracked through cross-supervised learning;
and step 6, after training, the system tracks the hand key points using radio-frequency signals alone, without video assistance.
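Steps 1-2 presuppose that radar frames and camera frames are time-aligned. A minimal sketch of such synchronization, assuming simple nearest-timestamp matching (the patent does not specify the mechanism, so the rates, offsets, and tolerance below are illustrative):

```python
import numpy as np

# Match each radar frame to the camera frame with the nearest timestamp.
# Frame rates, the clock offset, and the skew tolerance are assumptions
# made for this sketch, not values taken from the patent.

def synchronize(radar_ts, video_ts, max_skew=0.02):
    """Return (radar_idx, video_idx) pairs whose timestamps differ by < max_skew s."""
    video_ts = np.asarray(video_ts)
    pairs = []
    for i, tr in enumerate(radar_ts):
        j = int(np.argmin(np.abs(video_ts - tr)))  # nearest camera frame
        if abs(video_ts[j] - tr) < max_skew:
            pairs.append((i, j))
    return pairs

radar_ts = np.arange(0.0, 1.0, 1 / 30)    # radar frames at ~30 Hz
video_ts = np.arange(0.005, 1.0, 1 / 25)  # camera at 25 fps, slight clock offset
pairs = synchronize(radar_ts, video_ts)
print(len(pairs))  # number of usable synchronized frame pairs
```

Only the paired frames would then enter training, so each radio-frequency sample has a matching camera label.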
Further, in step 1, the raw signal of the dynamic gesture is acquired by the FMCW radar; let t be the period of each frequency-modulated continuous-wave pulse (chirp), S the slope of the frequency ramp, τ the round-trip delay of the signal from the radar to the gesture and back, and f the carrier frequency of the radar. The transmitted signal S₁ of the radar is expressed as:

S₁ = sin(2πft + πSt²)

the received signal S₂ is expressed as:

S₂ = sin[2πf(t−τ) + πS(t−τ)²]

and after the mixer and the low-pass filter, the output intermediate-frequency signal S is:

S = sin(2πSτt + 2πfτ − πSτ²)

A one-dimensional Fourier transform of this signal yields the intermediate frequency f_IF; with d the distance from the gesture target to the radar and c the speed of light:

d = (c · f_IF) / (2S)

The same processing is repeated for each chirp, and the processed signals are concatenated into one frame of data, yielding the radio-frequency signal from the radar.
Further, in step 2, the hand images acquired by the camera are converted into corresponding RGB images of size 200 × 200 and stored. For the radio-frequency signals, a frequency-domain feature extraction method is used: the signals, whose time domain is complex, are transformed to the frequency domain by Fourier transform in the horizontal and vertical directions, the frequency components of the signals are inspected, and features are extracted in the frequency domain. The Fourier transform of a continuous signal is

F(ω) = ∫_{−∞}^{+∞} f(t) e^{−jωt} dt

FFT processing produces a spectrum with distinct separated peaks, each peak indicating an object at a particular range; taking the phase of each valid sample at the same range and applying an FFT again distinguishes multiple targets with different velocities at the same range. After the phase FFT, the phase differences ω₁, ω₂ of the targets are obtained, yielding targets at different velocities; the hand feature map, i.e. the range-Doppler image RDI, is obtained at this point. The horizontal heat map is the projection of the signal reflections onto a plane parallel to the ground, and the vertical heat map is the projection onto a plane perpendicular to the ground.
Further, in step 4, the radio-frequency encoding network uses 10 layers of 9 × 5 × 5 spatio-temporal convolutions, each with stride 1 × 2 × 2, and batch normalization is applied to the input; after each layer the ReLU activation function f(x) = max(0, x) is used, and the encoded data is fed into the CNN.
Further, in step 4, the radio-frequency decoding network decodes the encoded heat map information; the decoding network has 5 layers, with the last layer using one stride and the other layers another [the stride values appear only as formula images in the original]; a ReLU function follows each layer, and the last layer uses a sigmoid function as the output layer.
Further, in step 5, the image information and the radio-frequency signals are fed into the teacher network and the student network respectively; the student network receives the key point confidence maps labeled by the teacher network and compares them with its own predicted confidence maps, and the teacher's confidence maps provide cross-modal supervision for the student network, so that the student network learns from the teacher and comes to predict the key point confidence maps successfully.
Further, in step 5, the goal of student-network training is to minimize the difference between its prediction and the teacher network's prediction, with the loss defined as the sum of the binary cross-entropy losses over the pixels of the confidence maps:

L(T, S) = −Σ_c Σ_{(i,j)} [ T_c^{(i,j)} log S_c^{(i,j)} + (1 − T_c^{(i,j)}) log(1 − S_c^{(i,j)}) ]

where T_c^{(i,j)} and S_c^{(i,j)} are the teacher's and the student's confidence values at pixel (i, j) of confidence map c. The student network receives the key point confidence maps labeled by the teacher network and compares them with its own predictions; the teacher's maps provide cross-modal supervision, so that the student network learns to predict the key point confidence maps successfully.
Further, in step 6, after training is complete, tracking the hand key points requires only placing the hand in front of the radar to obtain the key points' position coordinates; tracking is achieved from the radio-frequency signals alone, without video images as auxiliary labels.
The invention has the beneficial effects that:
(1) The FMCW radar is used to recognize and track hand key points; the radar's electromagnetic waves are unaffected by lighting, smoke, visibility, and similar factors, so the environmental requirements are low, and motion sensing remains reliable and accurate even when environmental conditions change;
(2) The method recognizes and tracks hand key points with an FMCW radar whose data stream consists of radar signals rather than optical images; even if the signals leak, an attacker can hardly extract any directly useful information, which provides a measure of security for the system;
(3) The invention recognizes and tracks hand key points with an FMCW radar, and FMCW-radar-based dynamic key point tracking can be integrated on a low-power, compact, high-speed processing chip, giving the method high portability and availability.
Drawings
FIG. 1 is a flowchart illustrating a method for tracking a key point of a hand according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a cross-supervisor teacher-student network for key point tracking according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating completion of image information calibration according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the drawings in the specification.
The invention provides a hand key point tracking method that performs gesture recognition with an FMCW radar. Its advantage is that the data stream consists of radar signals rather than optical image signals, so even if the signals leak, an attacker can hardly extract any directly useful information, which provides a measure of security for the system. In addition, FMCW-radar-based dynamic gesture recognition can be integrated on a low-power, compact, high-speed processing chip, which makes embedding in portable devices possible.
As shown in fig. 1, the main steps of the method are as follows:
step 1: initializing an FMCW radar system, and synchronously acquiring a hand image acquired by a camera and a radio frequency signal acquired by a radar:
The raw signal of the dynamic gesture is acquired by the FMCW radar; let t be the period of each frequency-modulated continuous-wave pulse (chirp), S the slope of the frequency ramp, τ the round-trip delay of the signal from the radar to the gesture and back, and f the carrier frequency of the radar. The transmitted signal S₁ of the radar is expressed as:

S₁ = sin(2πft + πSt²)

the received signal S₂ is expressed as:

S₂ = sin[2πf(t−τ) + πS(t−τ)²]

and after the mixer and the low-pass filter, the output intermediate-frequency signal S is:

S = sin(2πSτt + 2πfτ − πSτ²)

A one-dimensional Fourier transform of this signal yields the intermediate frequency f_IF; if the distance from the gesture target to the radar is d and the speed of light is c, then:

d = (c · f_IF) / (2S)

The same processing is repeated for each chirp, and the processed signals are concatenated into one frame of data, yielding the radio-frequency signal from the radar.
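As a runnable illustration of this range estimation (all radar parameters below are assumptions for the sketch, not the patent's configuration), one can simulate the intermediate-frequency tone at f_IF = 2Sd/c and recover d from the peak of its FFT:

```python
import numpy as np

# Illustrative FMCW range estimation: an IF tone at f_IF = 2*S*d/c,
# recovered by a 1D "range FFT". Parameter values are assumptions.
c = 3e8            # speed of light, m/s
S = 30e12          # frequency ramp slope, Hz/s (30 MHz/us)
fs = 10e6          # ADC sampling rate, Hz
n_samples = 1024   # samples per chirp
d_true = 0.40      # hand held 0.40 m from the radar

t = np.arange(n_samples) / fs
f_if = 2 * S * d_true / c                 # intermediate frequency of the target
s_if = np.sin(2 * np.pi * f_if * t)       # mixer output (noise-free sketch)

# Range FFT: the peak bin gives f_IF, hence the distance d = c*f_IF/(2S).
spectrum = np.abs(np.fft.rfft(s_if))
peak_bin = int(np.argmax(spectrum))
f_est = peak_bin * fs / n_samples
d_est = c * f_est / (2 * S)
print(f"estimated range: {d_est:.3f} m")  # close to 0.40 m, within one range bin
```

The range resolution here is set by the FFT bin width, c·(fs/n_samples)/(2S) ≈ 5 cm, which is why the estimate lands near but not exactly on 0.40 m.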
And 2, step: preprocessing the collected image information and radio frequency signals
For the hand images collected by the camera: the camera stores BGR images, so they must be converted to RGB images by an algorithm, which makes the later labeling of hand key points convenient. The radio-frequency signals are handled with a frequency-domain feature extraction method; because the radio-frequency signal is complex-valued, each pixel in the mapping has a real and an imaginary component, so the complex time-domain signal can be transformed to the frequency domain by Fourier transform in the horizontal and vertical directions, the frequency components of the signal can be observed clearly, and features can be extracted in the frequency domain. The Fourier transform of a continuous signal is

F(ω) = ∫_{−∞}^{+∞} f(t) e^{−jωt} dt

FFT processing produces a spectrum with distinct separated peaks, each peak indicating an object at a particular range; taking the phase of each valid sample at the same range and applying an FFT again distinguishes multiple targets with different velocities at the same range. After the phase FFT, the phase differences ω₁, ω₂ of the targets are obtained, yielding targets at different velocities, i.e. the hand feature map, the range-Doppler image RDI. One antenna array each is kept for the horizontal and the vertical direction: the horizontal heat map is the projection of the signal reflections onto a plane parallel to the ground, and the vertical heat map is the projection onto a plane perpendicular to the ground.
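The two-stage FFT described above can be sketched numerically. In this toy example (all values illustrative, not the patent's radar configuration), two targets sharing a range bin but moving at different speeds are separated by the phase FFT across chirps:

```python
import numpy as np

# Toy range-Doppler image (RDI): two targets at the same range but different
# velocities are separated by a second FFT across chirps (the "phase FFT").
n_chirps, n_samples = 64, 256
fs = 2e6                           # ADC sample rate within a chirp, Hz
f_range = 40e3                     # IF tone: both targets share one range bin
t = np.arange(n_samples) / fs
k = np.arange(n_chirps)[:, None]
phi1 = 2 * np.pi * 2 / n_chirps    # per-chirp phase step of target 1
phi2 = 2 * np.pi * 9 / n_chirps    # per-chirp phase step of target 2 (faster)

frame = (np.exp(1j * (2 * np.pi * f_range * t + phi1 * k)) +
         np.exp(1j * (2 * np.pi * f_range * t + phi2 * k)))  # (chirps, samples)

range_fft = np.fft.fft(frame, axis=1)        # range FFT along each chirp
rdi = np.abs(np.fft.fft(range_fft, axis=0))  # Doppler FFT across chirps

rng_bin = int(np.argmax(np.abs(range_fft).sum(axis=0)))      # shared range bin
doppler_peaks = sorted(int(b) for b in np.argsort(rdi[:, rng_bin])[-2:])
print(doppler_peaks)  # two Doppler bins, one per target velocity
```

Because the per-chirp phase steps were chosen on exact Doppler bins, the two peaks fall at bins 2 and 9 of the selected range column.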
And step 3: construction of teacher network using image information
As shown in fig. 2, once image acquisition is complete, the images are fed into the teacher network to obtain the position information of the hand key points. The teacher network is built mainly on Google's MediaPipe Hands model. After the MediaPipe Hands model is imported, its parameters are set: the input type is set to sequences of static pictures, the model's confidence threshold is set to 0.5, and its tracking threshold is set to 0.5. Picture data containing continuous hand motion is then fed into the model, which processes the hand information in the pictures, captures it, automatically marks each hand key point from the obtained picture information, and simultaneously obtains each key point's pixel position relative to the picture. The processed image, shown in fig. 3, contains the position information of 21 hand key points; once processing is complete, a hand key point confidence map from the video is obtained. The teacher network built on the MediaPipe Hands model can locate the horizontal and vertical coordinates of the 21 hand key points, supervise the learning of the radio-frequency network (the student network) using these coordinates as labels, and check the radio-frequency network's prediction quality.
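The rendering of the 21 labeled key points into per-point confidence maps can be sketched as follows; the map size and Gaussian width are illustrative choices, since the patent does not specify how the teacher's confidence maps are rasterized:

```python
import numpy as np

# Turn 21 (x, y) key point coordinates into per-key-point confidence maps:
# one 2D Gaussian "heat" per key point, peaking at 1.0 at the key point.
# Map size and sigma are assumptions made for this sketch.

def keypoints_to_confidence_maps(keypoints_xy, size=64, sigma=2.0):
    """keypoints_xy: iterable of 21 (x, y) pixel positions -> (21, size, size)."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    return np.stack([
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        for x, y in keypoints_xy
    ])

# Example: 21 dummy key points scattered over the map.
kps = [(3 * i % 64, (2 * i + 5) % 64) for i in range(21)]
maps = keypoints_to_confidence_maps(kps)
print(maps.shape)  # (21, 64, 64): one confidence map per key point
```

The student network is then trained to reproduce exactly this kind of per-key-point map from the radio-frequency heat maps.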
And 4, step 4: building student network using radio frequency information
As shown in fig. 2, processing the radio-frequency signals yields a horizontal heat map H_l and a vertical heat map H_v, which are encoded by radio-frequency encoding networks; each encoding network takes 100 frames (3.3 seconds) of radio-frequency heat maps as input. The radio-frequency encoding network uses 10 layers of 9 × 5 × 5 spatio-temporal convolutions with stride 1 × 2 × 2 per layer, and batch normalization is applied to the input. Because each layer's gradient is multiplied by the first derivative of its activation function during backpropagation, the gradient decays layer by layer, and in a deep network the gradient G would keep shrinking until it vanishes; to counter this, the ReLU activation function f(x) = max(0, x) is used after each layer, and the completed encoding is fed into the CNN. The feature maps are passed through different convolutional layers for feature extraction, and the encoded heat map information is then decoded by a radio-frequency decoding network. The decoding network has 5 layers, with the last layer using one stride and the other layers another [the stride values appear only as formula images in the original]. A ReLU function also follows each layer, and the last layer uses a sigmoid function as the output layer. Throughout student-network training, two real-valued channels, storing the real part and the imaginary part, represent the complex-valued radio-frequency heat map; the whole network is implemented in PyTorch, and a key point confidence map is obtained from the radio-frequency data through the encode-decode process.
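A minimal PyTorch sketch of the encoder-decoder pair follows. It keeps the patent's 9 × 5 × 5 kernels, 1 × 2 × 2 encoder strides, batch normalization, ReLU activations, two real/imaginary input channels, and sigmoid output, but uses only 2 encoder layers and small channel widths for brevity, and assumes stride-2 transposed convolutions in the decoder because the original stride values survive only as images:

```python
import torch
import torch.nn as nn

# Reduced sketch of the encode-decode path (the patent's encoder has 10 such
# layers; the decoder strides here are an assumption, not the patent's values).

class RFEncoder(nn.Module):
    def __init__(self, in_ch=2, ch=16, n_layers=2):
        super().__init__()
        layers, c = [], in_ch
        for _ in range(n_layers):
            layers += [nn.Conv3d(c, ch, kernel_size=(9, 5, 5),
                                 stride=(1, 2, 2), padding=(4, 2, 2)),
                       nn.BatchNorm3d(ch),
                       nn.ReLU()]            # f(x) = max(0, x) after each layer
            c = ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):                    # x: (batch, 2, frames, H, W)
        return self.net(x)                   # 2 channels hold real/imag parts

class RFDecoder(nn.Module):
    def __init__(self, in_ch=16, n_keypoints=21):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(in_ch, 16, (3, 4, 4),
                               stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.ReLU(),
            nn.ConvTranspose3d(16, n_keypoints, (3, 4, 4),
                               stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.Sigmoid(),                    # per-pixel confidences in [0, 1]
        )

    def forward(self, z):
        return self.net(z)

x = torch.randn(1, 2, 12, 32, 32)            # 12 RF frames of 32x32 heat maps
y = RFDecoder()(RFEncoder()(x))
print(tuple(y.shape))                        # one confidence map per key point
```

Each encoder layer halves the two spatial dimensions while preserving the time axis, and the two transposed-convolution layers restore the original 32 × 32 resolution, so the output is one confidence map per key point per frame.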
And 5: a teacher-student network is constructed by using a cross-supervised learning mode.
Hand key points are tracked using the synchronized images and radio-frequency signals as a bridge. The image information and the radio-frequency signals are fed into the teacher network and the student network respectively; the goal of student-network training is to minimize the difference between its prediction and the teacher network's prediction, with the loss defined as the sum of the binary cross-entropy losses over the pixels of the confidence maps:

L(T, S) = −Σ_c Σ_{(i,j)} [ T_c^{(i,j)} log S_c^{(i,j)} + (1 − T_c^{(i,j)}) log(1 − S_c^{(i,j)}) ]

where T_c^{(i,j)} and S_c^{(i,j)} are the teacher's and the student's confidence values at pixel (i, j) of confidence map c. The student network receives the key point confidence maps labeled by the teacher network and compares them with its own predictions; the teacher's maps provide cross-modal supervision, so that the student network learns to predict the key point confidence maps successfully.
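A minimal sketch of this loss (shapes and values are illustrative; the names T and S for the teacher's and student's confidences are this sketch's own):

```python
import numpy as np

# Cross-modal loss: sum over confidence maps and pixels of the binary
# cross-entropy between teacher confidences T and student confidences S.

def cross_modal_bce(teacher, student, eps=1e-7):
    s = np.clip(student, eps, 1.0 - eps)     # guard against log(0)
    return float(-np.sum(teacher * np.log(s) +
                         (1.0 - teacher) * np.log(1.0 - s)))

rng = np.random.default_rng(0)
teacher = rng.random((21, 64, 64))           # 21 key point confidence maps
perfect = cross_modal_bce(teacher, teacher)      # student matches teacher
wrong = cross_modal_bce(teacher, 1.0 - teacher)  # student anti-correlated
print(perfect < wrong)  # the loss rewards agreement with the teacher
```

Per pixel, the cross-entropy is minimized exactly when the student's confidence equals the teacher's, which is what drives the student to imitate the teacher's maps.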
And 6: the system after the training realizes the tracking of the key points of the hands through radio frequency signals
After system training is complete, tracking the hand key points requires only placing the hand in front of the radar to obtain the key points' position coordinates; the key points are tracked from the radio-frequency signals alone, without video images as auxiliary labels.
The above description is only a preferred embodiment of the present invention, and the scope of the invention is not limited to this embodiment; equivalent modifications or changes made by those skilled in the art in light of this disclosure shall fall within the scope of protection set forth in the appended claims.

Claims (8)

1. A method for tracking hand key points based on an FMCW radar is characterized in that: the method comprises the following steps:
step 1, initializing the FMCW radar system and configuring the parameters for hand-information sampling, including the transmit/receive antenna pairs, the number of sampling points, and the sampling time, while simultaneously filming the complete hand motion trajectory with a camera;
step 2, preprocessing the acquired image information and radio-frequency signals accordingly; for the image information, converting the BGR images stored by the camera into RGB images; for the radio-frequency signals, first performing clutter suppression, then applying a Fourier transform (FFT) along the range and velocity dimensions to form a range-Doppler heat map RDI, and processing the signals in the directions horizontal and vertical to the ground to form a horizontal heat map H_l and a vertical heat map H_v;
step 3, for the stored image information, processing the pictures, capturing the hand information in each picture, automatically marking each hand key point from the obtained information, and obtaining a hand key point confidence map from the video;
step 4, encoding the obtained H_l and H_v with a radio-frequency encoding network and feeding the encodings into the CNN; passing the feature maps through different convolutional layers of the CNN for feature extraction, then decoding the encoded heat map information with a radio-frequency decoding network to obtain a key point confidence map from the radio-frequency data;
step 5, using cross-modal learning and supervised learning to let image information and radio-frequency information of different modalities interact: the network producing the video key point confidence map is called the teacher network, the network producing the radio-frequency key point confidence map is called the student network, and together they form a cross-supervised teacher-student network that checks the reliability of the key point confidence map obtained from the radio-frequency signals; the positions of the hand key points derived from the radio-frequency signals are recognized and tracked through cross-supervised learning;
and step 6, after training, the system tracks the hand key points using radio-frequency signals alone, without video assistance.
2. The FMCW radar-based hand key point tracking method according to claim 1, characterized in that: in step 1, the raw signal of the dynamic gesture is acquired by the FMCW radar; let t be the period of each frequency-modulated continuous-wave pulse (chirp), S the slope of the frequency ramp, τ the round-trip delay of the signal from the radar to the gesture and back, and f the carrier frequency of the radar; the transmitted signal S₁ of the radar is expressed as:

S₁ = sin(2πft + πSt²)

the received signal S₂ is expressed as:

S₂ = sin[2πf(t−τ) + πS(t−τ)²]

after the mixer and the low-pass filter, the output intermediate-frequency signal S is:

S = sin(2πSτt + 2πfτ − πSτ²)

a one-dimensional Fourier transform of this signal yields the intermediate frequency f_IF; with d the distance from the gesture target to the radar and c the speed of light:

d = (c · f_IF) / (2S)

the same processing is repeated for each chirp, and the processed signals are concatenated into one frame of data to obtain the radio-frequency signal from the radar.
3. The FMCW radar-based hand key point tracking method according to claim 1, wherein: in step 2, the hand images collected by the camera are resized and stored as 200 × 200 RGB images; for the radio frequency signals, a frequency-domain feature extraction method is used: the signals, whose time-domain form is complex, are transformed to the frequency domain by Fourier transform in the horizontal and vertical directions, the frequency components of the signals are observed, and features are extracted in the frequency domain; the Fourier transform of a continuous signal is
F(ω) = ∫_{−∞}^{+∞} f(t) e^{−jωt} dt
a frequency spectrum with distinct separate peaks is generated by FFT processing, each peak indicating the presence of an object at a particular distance; the phases of the valid data at the same distance are then taken and a second FFT is performed to distinguish multiple targets with different speeds at the same distance; after this phase FFT, the phase differences ω₁ and ω₂ of the respective targets are obtained, thereby separating targets with different speeds, at which point the hand feature map, namely the range-Doppler image RDI, is obtained; the horizontal heat map is the projection of the signal reflections onto a plane parallel to the ground, and the vertical heat map is the projection of the signal reflections onto a plane perpendicular to the ground.
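The two-stage FFT of this claim can be sketched as below; the radar parameters and the single simulated target are hypothetical. A range FFT over each chirp, followed by a second FFT across the per-chirp phases at each range bin, separates same-range targets by velocity and yields the range-Doppler image (RDI):

```python
import numpy as np

n_chirps, n_samples = 64, 128
fs, S, f_carrier, c = 5e6, 20e12, 77e9, 3e8   # hypothetical radar parameters
T_chirp = n_samples / fs

t = np.arange(n_samples) / fs
d, v = 0.4, 1.0                               # one target: 0.4 m away, receding at 1 m/s
frame = np.zeros((n_chirps, n_samples), dtype=complex)
for k in range(n_chirps):
    tau = 2 * (d + v * k * T_chirp) / c       # delay grows chirp by chirp
    # IF phase = beat term S*tau*t plus carrier phase f_carrier*tau (Doppler)
    frame[k] = np.exp(2j * np.pi * (S * tau * t + f_carrier * tau))

# FFT along samples -> range axis; FFT along chirps -> Doppler axis
rdi = np.abs(np.fft.fftshift(np.fft.fft2(frame), axes=0))
dop_bin, rng_bin = np.unravel_index(np.argmax(rdi), rdi.shape)
```

The peak of `rdi` sits at the range bin of the target and, on the Doppler axis, offset from the zero-velocity row in proportion to its speed.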
4. The FMCW radar-based hand key point tracking method according to claim 1, wherein: in step 4, the radio frequency encoding network uses 10 layers of 9 × 5 × 5 spatio-temporal convolutions, each layer with a step length of 1 × 2 × 2; batch normalization is applied after the input, and the ReLU activation function f(x) = max(0, x) is used after each layer before the encoded features are passed into the CNN.
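The layer counts in this claim imply a fixed shape trace, sketched below; the padding values are assumptions (the claim does not state them). With 9 × 5 × 5 kernels and stride 1 × 2 × 2, the temporal axis is preserved while each spatial axis halves per layer:

```python
def conv_out(size, kernel, stride, pad):
    # standard convolution output-size formula
    return (size + 2 * pad - kernel) // stride + 1

def encoder_shapes(t, h, w, layers=10):
    """Trace the (T, H, W) feature-map shape through the 10-layer encoder."""
    shapes = [(t, h, w)]
    for _ in range(layers):
        t = conv_out(t, 9, 1, 4)  # temporal: kernel 9, stride 1, pad 4 (assumed)
        h = conv_out(h, 5, 2, 2)  # spatial: kernel 5, stride 2, pad 2 (assumed)
        w = conv_out(w, 5, 2, 2)
        shapes.append((t, h, w))
    return shapes
```

For example, a 16 × 1024 × 1024 input collapses to 16 × 1 × 1 after the 10 layers; in a real implementation, batch normalization and the ReLU f(x) = max(0, x) would follow each convolution.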
5. The FMCW radar-based hand key point tracking method according to claim 1, wherein: in step 4, the radio frequency decoding network decodes the encoded heat map information; the decoding network has 5 layers, the last layer having a fractional step length of 1 × 1/4 × 1/4 and all other layers a fractional step length of 1 × 1/2 × 1/2; a ReLU function is used after each layer, and a sigmoid function is used as the output layer for the last layer.
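The decoder of this claim can be sketched as below, reading the fractional step lengths as transposed (upsampling) convolutions; the specific factors (×2 per layer, ×4 for the last) follow the 1 × 1/2 × 1/2 and 1 × 1/4 × 1/4 strides, which are reconstructed values rather than confirmed by the original formula images:

```python
import math

def sigmoid(x):
    # output nonlinearity of the last decoder layer
    return 1.0 / (1.0 + math.exp(-x))

def decoder_spatial_sizes(h, w, layers=5):
    """Trace the spatial size through the 5-layer decoder."""
    sizes = [(h, w)]
    for i in range(layers):
        factor = 4 if i == layers - 1 else 2   # last layer upsamples by 4
        h, w = h * factor, w * factor
        sizes.append((h, w))
    return sizes
```

A 4 × 4 encoded map grows to 256 × 256 over the five layers; the sigmoid squashes the final feature values into (0, 1) so they can be read as per-pixel keypoint confidences.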
6. The FMCW radar-based hand key point tracking method according to claim 1, wherein: in step 5, the image information and the radio frequency signals are input into the teacher network and the student network respectively; the student network receives the key point confidence maps labelled by the teacher network and compares them with its own predicted key point confidence maps; the key point confidence maps from the teacher network provide cross-modal supervision for the student network, so that the student network learns from them and successfully predicts the key point confidence maps.
7. The FMCW radar-based hand key point tracking method according to claim 6, wherein: in step 5, the goal of student network training is to minimize the difference between its predictions and the teacher network's predictions, and the loss is defined as the sum of the binary cross-entropy losses of each pixel in the confidence maps:
L(T̂, Ŝ) = −Σ_c Σ_{i,j} [ T̂_c^{(i,j)} log Ŝ_c^{(i,j)} + (1 − T̂_c^{(i,j)}) log(1 − Ŝ_c^{(i,j)}) ]
wherein T̂_c^{(i,j)} and Ŝ_c^{(i,j)} are the confidences of pixel (i, j) on the teacher and student confidence maps for key point c, respectively; the student network receives the key point confidence maps labelled by the teacher network and compares them with its predicted key point confidence maps, and the key point confidence maps from the teacher network provide cross-modal supervision for the student network, so that the student network learns from them to successfully predict the key point confidence maps.
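The loss of this claim can be sketched directly in NumPy; the array names are illustrative. It sums the binary cross-entropy of every pixel of every key point confidence map against the teacher network's map, which serves as the pseudo-label:

```python
import numpy as np

def cross_modal_loss(teacher, student, eps=1e-7):
    """teacher, student: arrays of shape (n_keypoints, H, W), values in [0, 1]."""
    s = np.clip(student, eps, 1 - eps)  # clip to avoid log(0)
    # per-pixel binary cross-entropy, summed over keypoints and pixels
    return -np.sum(teacher * np.log(s) + (1 - teacher) * np.log(1 - s))
```

A student map that agrees with the teacher incurs a much smaller loss than one that contradicts it, which is what drives the cross-modal supervision of step 5.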
8. The FMCW radar-based hand key point tracking method according to claim 1, wherein: in step 6, after training is completed, the position coordinates of the hand key points are obtained simply by placing a hand in front of the radar, realizing tracking of the hand key points through radio frequency signals alone, without video image auxiliary labels.
CN202211013101.3A 2022-08-23 2022-08-23 FMCW radar-based hand key point tracking method Pending CN115294656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211013101.3A CN115294656A (en) 2022-08-23 2022-08-23 FMCW radar-based hand key point tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211013101.3A CN115294656A (en) 2022-08-23 2022-08-23 FMCW radar-based hand key point tracking method

Publications (1)

Publication Number Publication Date
CN115294656A true CN115294656A (en) 2022-11-04

Family

ID=83831785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211013101.3A Pending CN115294656A (en) 2022-08-23 2022-08-23 FMCW radar-based hand key point tracking method

Country Status (1)

Country Link
CN (1) CN115294656A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115856881A (en) * 2023-01-12 2023-03-28 南京邮电大学 Millimeter wave radar behavior sensing method based on dynamic lightweight network


Similar Documents

Publication Publication Date Title
CN113506317B (en) Multi-target tracking method based on Mask R-CNN and apparent feature fusion
Ren et al. Overview of object detection algorithms using convolutional neural networks
CN110348288A (en) A kind of gesture identification method based on 77GHz MMW RADAR SIGNAL USING
KR20210080291A (en) Method, electronic device, and storage medium for recognizing license plate
CN112669350A (en) Adaptive feature fusion intelligent substation human body target tracking method
Wang et al. SSS-YOLO: Towards more accurate detection for small ships in SAR image
Li et al. Remote sensing image scene classification based on object relationship reasoning CNN
Wang et al. Multiple-environment Self-adaptive Network for Aerial-view Geo-localization
CN115294656A (en) FMCW radar-based hand key point tracking method
Sun et al. A target recognition algorithm of multi-source remote sensing image based on visual Internet of Things
CN116844056A (en) SAR target detection method combining self-supervision learning and knowledge distillation
Decourt et al. A recurrent CNN for online object detection on raw radar frames
Zhang et al. A review of recent advance of ship detection in single-channel SAR images
Ke et al. Dense small face detection based on regional cascade multi‐scale method
CN113901931A (en) Knowledge distillation model-based behavior recognition method for infrared and visible light videos
CN117218345A (en) Semantic segmentation method for electric power inspection image
Huang et al. Multi‐scale feature combination for person re‐identification
US20230168361A1 (en) Real time object motion state recognition method using millimeter wave radar
Zheng et al. Unsupervised human contour extraction from through-wall radar images using dual UNet
Ding et al. Novel Pipeline Integrating Cross-Modality and Motion Model for Nearshore Multi-Object Tracking in Optical Video Surveillance
Wu et al. Multimodal Collaboration Networks for Geospatial Vehicle Detection in Dense, Occluded, and Large-Scale Events
Qiu et al. Effective object proposals: size prediction for pedestrian detection in surveillance videos
Yue et al. Improving multi‐object tracking by full occlusion handle and adaptive feature fusion
Wang et al. Attention-based vision transformer for human activity classification using mmwave radar
Zhang et al. An end-to-end framework for real-time violent behavior detection based on 2D CNNs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination