CN114219853A - Multi-person three-dimensional pose estimation method based on wireless signals

Info

Publication number: CN114219853A
Application number: CN202111336826.1A
Authority: CN
Inventors: 于水瀛; 应建新
Original assignee: Hangzhou Changze Information Technology Co ltd
Current assignee: Hangzhou Changze Information Technology Co ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-03-22

Abstract

A multi-person three-dimensional pose estimation method based on wireless signals comprises the following steps: step one, a WiFi device collects Channel State Information (CSI) signals while a camera records the corresponding video for supervision; step two, the video is processed with AlphaPose, and the output human bounding boxes and human keypoints are processed to generate heatmaps and bounding-box maps; step three, the CSI signals are preprocessed; step four, the preprocessed CSI signals are aligned with the video frames, and every five CSI samples are input into a CSI-2D network as one data segment for training; step five, the heatmaps and bounding-box maps that the CSI-2D model outputs from the CSI signals are processed to regress the two-dimensional pose of each person; and step six, the two-dimensional poses of each person are assembled into a frame sequence and input into a 2D-3D model, and the multi-person three-dimensional pose is generated in combination with the two-dimensional coordinates. The invention achieves lower cost and higher precision.

Description

Multi-person three-dimensional pose estimation method based on wireless signals

Technical Field

The invention belongs to the technical field of wireless communication, and relates to a multi-person three-dimensional pose estimation method based on wireless channel state information.

Background

For human pose estimation, camera-based methods, including two-dimensional and even three-dimensional pose estimation, are quite mature. However, the camera as a traditional sensor is clearly limited by illumination, occlusion and background, and it raises privacy problems.

Wireless sensing has great potential for sensing human bodies with wireless signals. In recent years, research on wireless human sensing has produced results in behavior recognition, respiration detection, target positioning, crowd counting and other tasks. There are two basic approaches to human sensing based on wireless signals: the device-based approach requires a person to wear or carry a device or sensor; the device-free approach uses sensing elements placed in the environment to monitor human activity without requiring the person to carry anything. Device-based approaches, while generally accurate, are not practical or convenient in many important real-life scenarios, for example when elderly people or dementia patients would have to carry the device at all times. Device-free human sensing offers significant advantages in these scenarios.

Many research results show that wireless signals can accomplish two-dimensional and three-dimensional pose estimation tasks well under the supervision of a video model. Zhao et al. proposed RF-Pose, an encoder-decoder deep learning architecture for human pose estimation. Its input comes from an antenna array that emits frequency-modulated continuous-wave (FMCW) signals; although such a radio can measure the distance from an object to the device, it is expensive and not widely available compared with commercial WiFi. Wang et al. were the first to perform human pose estimation on CSI signals collected with commercial WiFi, obtaining human masks, two-dimensional joint points and Part Affinity Fields (PAFs). Guo et al. also used deep learning to obtain the human skeleton from CSI signals received in multiple directions. These methods obtain only two-dimensional poses. Jiang et al. proposed a method for obtaining three-dimensional poses from CSI signals, in which a VICON system (a motion-capture camera system) provides the labels for model training on CSI; however, learning the three-dimensional pose model directly end to end is difficult, the resulting accuracy is not high, and the method is still limited to movements performed in place.

In summary, the problems of the prior art are as follows: CSI-based human pose estimation mostly remains two-dimensional, and existing three-dimensional pose estimation handles only a single person moving in place, with limited accuracy. In addition, CSI signals are strongly affected by environmental interference, and handling this is a difficulty. The difficulties in solving the above problems are: how to train a fine-grained three-dimensional human pose model from CSI data; how to realize pose estimation with multiple persons; and how to eliminate the effect of environmental noise.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a multi-person three-dimensional pose estimation method based on wireless signals that has lower cost and higher precision.

The technical scheme adopted by the invention for solving the technical problems is as follows:

a multi-person three-dimensional pose estimation method based on wireless signals comprises the following steps:

step one, using WiFi equipment to collect Channel State Information (CSI) signals while using a camera to shoot the corresponding video for supervision;

step two, processing the video with AlphaPose, and processing the output human bounding boxes and human keypoints to generate heatmaps and bounding-box maps, which serve as labels for training on the wireless signals;

step three, preprocessing the collected signals, including eliminating the phase offset between the two antennas, removing outliers and environmental noise from the CSI signals, and removing the static direct-current component;

step four, aligning the preprocessed CSI signals with the video frames, and inputting every five CSI samples into the CSI-2D network as one data segment for training;

step five, processing the heatmaps and bounding-box maps that the CSI-2D model outputs from the CSI signals, and regressing the two-dimensional pose of each person;

and step six, assembling the two-dimensional poses of each person into a frame sequence, inputting it into the 2D-3D model, and generating the multi-person three-dimensional pose in combination with the two-dimensional coordinates.

Further, the process of step one is as follows:

1.1 Transceiver equipment and a camera are arranged in a conference-room environment. Two laptops equipped with Intel 5300 network cards are used for the data transmission and reception experiments, with 6 directional antennas divided into two groups of 3, the antennas within a group being spaced 20 cm apart, to form WiFi-router-like devices; one group serves as the transmitter (T) and the other as the receiver (R). The WiFi device transmits at a rate of 100 Hz on 30 subcarriers of different frequencies; the attenuation and phase change of the signals at the different frequencies reveal information about the propagation path at different scales, and the receiving end receives a 3 × 3 × 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target;

1.2 The corresponding video is shot with the camera as follows: a monocular RGB camera records video at 20 FPS; four volunteers take part in data acquisition, and multi-person and single-person data are collected separately, with actions including waving, clapping, walking, kicking, squatting, jumping, punching and shaking hands; the timestamps are saved so that the video can be matched to the CSI signals.

Still further, in step two, the heatmaps and bounding-box maps are generated from the AlphaPose output as follows: the video frames are input into the AlphaPose model to generate 17 human keypoints and the position coordinates of the human bounding boxes; the keypoint coordinates are converted into a heatmap tensor by Gaussian blurring; the bounding-box coordinates undergo a multi-scale transformation using bounding boxes at 4 scales, and the transformed bounding-box coordinates are drawn on separate maps to generate a second tensor. The two tensors serve as labels that supervise the learning of the CSI-2D model.
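As an illustration only, the Gaussian blurring of keypoint coordinates into a label tensor can be sketched as follows. The grid size, the value of `sigma` and the function names are assumptions made for this sketch; the text above does not specify them, and the multi-scale bounding-box maps are not covered here.

```python
import math

def keypoint_heatmap(x, y, width, height, sigma=2.0):
    """Return a height x width grid with a Gaussian peak of 1.0 at (x, y)."""
    return [
        [math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2.0 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]

def heatmap_tensor(keypoints, width, height, sigma=2.0):
    """Stack one heatmap per keypoint: [num_keypoints][height][width]."""
    return [keypoint_heatmap(x, y, width, height, sigma) for x, y in keypoints]

# Example: two of the 17 keypoints on a small 16 x 12 grid (hypothetical size).
maps = heatmap_tensor([(4, 3), (10, 8)], width=16, height=12)
peak_row = max(range(12), key=lambda r: max(maps[0][r]))
peak_col = max(range(16), key=lambda c: maps[0][peak_row][c])
print(peak_col, peak_row)  # 4 3
```

Each keypoint gets its own channel, so the label tensor for one frame would have 17 channels under this layout.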

Furthermore, in step three, the collected data are preprocessed as follows:

3.1 Eliminate the phase offset between the two antennas. Conjugate multiplication is used to cancel the phase offset, i.e. the product $H_1(f,t)\,\overline{H_2(f,t)}$ is formed,

wherein $H_1(f,t)$ is the channel state information of antenna 1, $\overline{H_2(f,t)}$ is the conjugate of the channel state information of antenna 2, $H_{1,S}(f)$ and $H_{2,S}(f)$ are their respective static-path parts, $K$ and $L$ are the numbers of multipath components, $\alpha_l(f,t)$ is the amplitude attenuation function on the $l$-th path, and the phase term on each path is a function of the Doppler shift;
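A minimal numerical sketch of why conjugate multiplication removes the offset, assuming the offset is a per-sample phase rotation common to both antennas of the same receiver. The synthetic CSI values and the offsets are hypothetical and serve only to demonstrate the cancellation.

```python
import cmath

def conjugate_multiply(h1, h2):
    """Element-wise H1 * conj(H2) for the CSI series of two antennas."""
    return [a * b.conjugate() for a, b in zip(h1, h2)]

# Synthetic CSI of one subcarrier over four samples (hypothetical values).
true_h1 = [cmath.rect(1.0, 0.1 * t) for t in range(4)]
true_h2 = [cmath.rect(0.8, 0.3) for _ in range(4)]

# Both antennas are rotated by the same unknown phase offset in each sample.
offsets = [0.7, -1.9, 2.4, 0.2]
meas_h1 = [h * cmath.rect(1.0, o) for h, o in zip(true_h1, offsets)]
meas_h2 = [h * cmath.rect(1.0, o) for h, o in zip(true_h2, offsets)]

clean = conjugate_multiply(true_h1, true_h2)
measured = conjugate_multiply(meas_h1, meas_h2)

# The shared rotation cancels because e^{jo} * conj(e^{jo}) = 1.
print(all(abs(a - b) < 1e-9 for a, b in zip(clean, measured)))  # True
```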

3.2 Eliminate environmental noise.

First, a Hampel outlier filter removes the most obvious outliers from the raw signal. After the outliers are removed, high-frequency noise is filtered with a high-performance filter, namely a soft-threshold method based on the nonlinear wavelet transform:

$$w_{thr} = \begin{cases} \operatorname{sgn}(w)\,(|w| - thr), & |w| \ge thr \\ 0, & |w| < thr \end{cases}$$

wherein, when the absolute value of the wavelet coefficient $w$ is greater than or equal to the given threshold $thr$, the threshold is subtracted from the absolute value of $w$ and the result is multiplied by $\operatorname{sgn}(w)$; when the absolute value of $w$ is less than $thr$, the coefficient is set to 0.

The threshold is selected with the heuristic threshold rule: when the signal-to-noise ratio is small, denoising uses the unbiased likelihood estimation principle, and when the signal-to-noise ratio is large, denoising uses the fixed-threshold method;
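The two operations of step 3.2 can be sketched in plain Python as follows. The window size, the `n_sigmas` factor and the sample values are assumptions for illustration; the wavelet decomposition and the heuristic threshold selection are omitted, so `soft_threshold` would be applied to each wavelet detail coefficient produced by a separate transform.

```python
import statistics

def hampel_filter(signal, half_window=3, n_sigmas=3.0):
    """Replace samples that deviate from the local median by more than
    n_sigmas scaled median absolute deviations (MAD) with that median."""
    cleaned = list(signal)
    for i in range(len(signal)):
        lo = max(0, i - half_window)
        hi = min(len(signal), i + half_window + 1)
        window = signal[lo:hi]
        med = statistics.median(window)
        # 1.4826 scales the MAD to a Gaussian standard deviation estimate.
        mad = 1.4826 * statistics.median(abs(v - med) for v in window)
        if abs(signal[i] - med) > n_sigmas * mad:
            cleaned[i] = med
    return cleaned

def soft_threshold(w, thr):
    """sgn(w) * (|w| - thr) when |w| >= thr, otherwise 0."""
    if abs(w) < thr:
        return 0.0
    return (1.0 if w > 0 else -1.0) * (abs(w) - thr)

# Hypothetical CSI amplitude series with one obvious spike at index 4.
amplitudes = [1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 1.1, 0.9, 1.0]
cleaned = hampel_filter(amplitudes)
print(cleaned[4])                 # 1.0
print(soft_threshold(3.0, 1.0))   # 2.0
print(soft_threshold(-0.5, 1.0))  # 0.0
```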

3.3 Remove the static direct-current component.

The static direct-current component is large but carries none of the required human motion information; it is eliminated by subtracting the mean of the signal.
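A short sketch of the mean subtraction, assuming it is applied independently to the time series of each subcarrier. The sample values are hypothetical.

```python
def remove_dc(series):
    """Subtract the mean so that only the time-varying part remains."""
    mean = sum(series) / len(series)
    return [v - mean for v in series]

# Hypothetical amplitude series of one subcarrier with a large static level.
centered = remove_dc([10.2, 10.0, 9.8, 10.4, 9.6])
print(abs(sum(centered)) < 1e-9)  # True
```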

In step four, the preprocessed CSI signals are aligned with the video frames as follows: the video is recorded at 20 FPS and the CSI signals are transmitted at 100 Hz, so five CSI samples correspond to each video frame; timestamps are recorded while the video and the CSI signals are collected, and the two are matched by these timestamps.
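The timestamp matching can be sketched as below. The grouping rule, which takes the five CSI samples starting at or after each frame timestamp, is an assumption; the text only states that timestamps are recorded and matched. Timestamps are given in integer milliseconds to avoid floating-point comparisons.

```python
from bisect import bisect_left

def align_csi_to_frames(frame_times_ms, csi_times_ms, samples_per_frame=5):
    """Return, for each video frame, the indices of the CSI samples that
    form its segment (the first samples at or after the frame timestamp)."""
    groups = []
    for t in frame_times_ms:
        start = bisect_left(csi_times_ms, t)
        end = start + samples_per_frame
        if end <= len(csi_times_ms):  # skip frames without a full segment
            groups.append(list(range(start, end)))
    return groups

frame_times_ms = [0, 50, 100]            # 20 FPS: one frame every 50 ms
csi_times_ms = list(range(0, 150, 10))   # 100 Hz: one sample every 10 ms
groups = align_csi_to_frames(frame_times_ms, csi_times_ms)
print(groups[1])  # [5, 6, 7, 8, 9]
```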

In step five, the CSI-2D model is trained as follows:

the learning rate is 0.06 and training runs for 40 epochs; the data set is divided into a training set and a validation set at a ratio of 2:8; the optimizer is Adam, with β₁ set to 0.9;

the goal of training the CSI-2D model is to reduce the difference from the keypoint heatmaps predicted by the vision-based network model, and the total loss $L$ is composed of several losses:

$L = \lambda_1 L_{box} + \lambda_2 L_{JHMs}$;

wherein the bounding-box loss $L_{box}$ is computed with a binary classification loss, and the joint position error $L_{JHMs}$ is computed with the L2 loss (squared loss) multiplied by element-wise weights; in addition, the bounding-box loss weight coefficient $\lambda_1$ is set to 0.1 and the joint position error loss weight coefficient $\lambda_2$ is set to 1;

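$L_{JHMs}$ is calculated as follows:

$$L_{JHMs} = MW \odot \lVert y - \hat{y} \rVert_2^2$$

in the formula, $\lVert y - \hat{y} \rVert_2^2$ is the MSE loss, $y$ denotes the label data matrix obtained from the video, $\hat{y}$ denotes the matrix generated by the model, and $MW$ is the element-wise weight, with which an attention mechanism is implemented using Matthew Weights;

$\mathbb{I}(\cdot)$ is an indicator function used in computing $MW$: its value is +1 when its condition is satisfied, and -1 otherwise;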
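The weighted loss can be sketched as follows. This sketch rests on three assumptions that the text does not confirm: the Matthew Weight has the form MW = k·y + b·I(y), the indicator I(y) is +1 for y ≥ 0 and -1 otherwise, and the weighted squared errors are averaged over all elements. The names `matthew_weight`, `jhm_loss` and `total_loss` are hypothetical.

```python
def matthew_weight(y, k=1.0, b=1.0):
    """Assumed form MW = k*y + b*I(y), with I(y) = +1 if y >= 0 else -1."""
    return k * y + b * (1.0 if y >= 0 else -1.0)

def jhm_loss(label, pred, k=1.0, b=1.0):
    """Mean of the MW-weighted squared errors over all heatmap elements."""
    total, count = 0.0, 0
    for row_y, row_p in zip(label, pred):
        for y, p in zip(row_y, row_p):
            total += matthew_weight(y, k, b) * (y - p) ** 2
            count += 1
    return total / count

def total_loss(l_box, l_jhms, lam1=0.1, lam2=1.0):
    """L = lam1 * L_box + lam2 * L_JHMs with the weights stated in the text."""
    return lam1 * l_box + lam2 * l_jhms

label = [[0.0, 1.0]]   # hypothetical 1 x 2 label heatmap
pred = [[0.5, 0.5]]    # hypothetical model output
l_jhms = jhm_loss(label, pred)
print(l_jhms)                              # 0.375
print(round(total_loss(2.0, l_jhms), 3))   # 0.575
```

Under this assumed weighting, errors at pixels near a keypoint peak (label close to 1) count roughly twice as much as errors on the background, which is how the weights would act as an attention mechanism.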

the input CSI tensor is first up-sampled, then passed through residual blocks and U-Nets, and then through a pooling layer; because the box maps and the JHMs are output by the same network model, the pooled output passes through two convolutional layers and a BN layer to produce the output tensors; the output multi-scale box maps and heatmaps are then processed to extract the two-dimensional keypoints of each person, as follows:

the human bounding boxes segmented at each scale in the multi-scale box maps are evaluated to obtain the best detection box for each person; the heatmap is then multiplied by a single person's bounding box to extract that person's heatmap; and finally the keypoint coordinates are regressed with the argmax function:

$$(x_{j,p},\, y_{j,p}) = \operatorname{argmax}\left(JHM_j \odot box_p\right)$$

wherein $j$ is the keypoint index, $p$ is the person index, $(x_{j,p}, y_{j,p})$ is the regressed coordinate position of the $j$-th keypoint of the $p$-th person, $JHM_j$ is the heatmap of the $j$-th keypoint, and $box_p$ is the detection-box map of the $p$-th person.
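The masking and argmax regression can be sketched as follows, assuming the detection box of each person is rasterized into a binary mask of the same size as the heatmap. The heatmap values, box coordinates and function names are hypothetical.

```python
def box_to_mask(box, width, height):
    """Rasterize an inclusive (x0, y0, x1, y1) box into a 0/1 mask."""
    x0, y0, x1, y1 = box
    return [[1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0
             for x in range(width)] for y in range(height)]

def regress_keypoint(heatmap, box_mask):
    """Return (x, y) of the maximum of the element-wise product."""
    best, best_xy = float("-inf"), None
    for y, (h_row, m_row) in enumerate(zip(heatmap, box_mask)):
        for x, (h, m) in enumerate(zip(h_row, m_row)):
            if h * m > best:
                best, best_xy = h * m, (x, y)
    return best_xy

# One keypoint channel containing the responses of two people.
heatmap = [
    [0.0, 0.1, 0.0, 0.0, 0.2, 0.0],
    [0.1, 0.9, 0.1, 0.2, 0.8, 0.2],
    [0.0, 0.1, 0.0, 0.0, 0.2, 0.0],
]
person_a = box_to_mask((0, 0, 2, 2), width=6, height=3)
person_b = box_to_mask((3, 0, 5, 2), width=6, height=3)
print(regress_keypoint(heatmap, person_a))  # (1, 1)
print(regress_keypoint(heatmap, person_b))  # (4, 1)
```

Without the mask, a plain argmax over this channel would return only the stronger peak at (1, 1), so the second person's keypoint would be lost.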

In step six, the multi-person 3D coordinates are generated from the multi-person two-dimensional coordinates as follows: each person's two-dimensional coordinate sequence is normalized and input into the 2D-3D model to generate that person's three-dimensional coordinates; the single-person two-dimensional and three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in one common three-dimensional coordinate system.
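The text does not state which normalization is applied before the 2D-3D model. One convention often used in 2D-to-3D lifting code maps pixel coordinates so that x lies in [-1, 1] while the aspect ratio is preserved; the sketch below shows that convention purely as an assumption, with a hypothetical image size.

```python
def normalize_screen_coordinates(points, width, height):
    """Map pixel (x, y) so that x lies in [-1, 1], keeping the aspect ratio."""
    return [(x / width * 2.0 - 1.0, y / width * 2.0 - height / width)
            for x, y in points]

# Image center and top-left corner of a hypothetical 640 x 480 frame.
normalized = normalize_screen_coordinates([(320, 240), (0, 0)], 640, 480)
print(normalized)  # [(0.0, 0.0), (-1.0, -0.75)]
```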

The invention provides a novel method for estimating multi-person three-dimensional poses from CSI signals: the signals are preprocessed; the two-dimensional poses and the human bounding boxes are trained jointly on the wireless data under the supervision of a video-based two-dimensional pose estimation deep learning model; and the output two-dimensional poses, assembled into frame sequences, are passed through a pre-trained three-dimensional pose model to output a movable three-dimensional pose skeleton.

Using WiFi signals for human pose estimation has good market applications, including activity detection and daily-life assistance for patients or elderly people with limited mobility, and using WiFi signals avoids some privacy problems. The method can be applied to games, animation and AR in cluttered environments; in addition, human bodies can be detected in dim environments. The requirements of low cost and high precision can be met.

The invention has the following beneficial effects:

1. two-dimensional and three-dimensional multi-person pose estimation is realized with commercial WiFi; the method is low in cost, requires only WiFi equipment, and is high in precision;

2. multi-scale bounding boxes are used, which solves the technical problem that one person's keypoints occupy another person's space when multiple persons interact, and improves the accuracy of multi-person segmentation during interaction;

3. for generating the three-dimensional pose, the three-dimensional stage combines the two-dimensional joint points, their positions and the frame sequence to smooth the motion, which improves the precision of the model and reduces the difficulty of learning the model directly end to end;

4. for two-dimensional pose recognition, the method abandons Part Affinity Fields (PAFs) and instead uses human bounding boxes, which improves the real-time performance of the model; the invention collects a data set of 8 actions, namely waving, clapping, walking, kicking, shaking hands, squatting, jumping and punching, performed by four volunteers, with the video and CSI data divided into single-person and multi-person parts.

Drawings

FIG. 1 is a flow chart of the three-dimensional pose estimation method based on channel state information according to an embodiment of the present invention;

FIG. 2 is a flow chart of the preprocessing of the raw CSI signal according to an embodiment of the present invention;

FIG. 3 is a network structure diagram of the CSI-2D module.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 to FIG. 3, the multi-person three-dimensional pose estimation method based on wireless channel state information provided in the embodiment of the present invention includes the following steps:

Step one, using WiFi equipment to collect Channel State Information (CSI) signals, the process being as follows:

the WiFi device transmits at a rate of 100 Hz on 30 subcarriers of different frequencies; the attenuation and phase change of the signals at the different frequencies reveal information about the propagation path at different scales, and the receiving end receives a 3 × 3 × 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target.

A camera shoots the corresponding video, the process being as follows:

a monocular RGB camera records video at 20 FPS; four volunteers take part in data acquisition, and multi-person and single-person data are collected separately, with actions including waving, clapping, walking, kicking, squatting, jumping, punching, shaking hands and the like; the timestamps are saved so that the video can be matched to the CSI signals.

Step two, generating the heatmaps and bounding-box maps from the AlphaPose output, the process being as follows:

the video frames are input into the AlphaPose model to generate 17 human keypoints and the position coordinates of the human bounding boxes; the keypoint coordinates are converted into a heatmap tensor by Gaussian blurring; the bounding-box coordinates undergo a multi-scale transformation, with bounding boxes at 4 scales used in the method, and the transformed bounding-box coordinates are drawn on separate maps to generate a second tensor; the two tensors serve as labels that supervise the learning of the CSI-2D model.

Step three, preprocessing the collected data, the process being as follows:

3.1 eliminating the phase offset between the two antennas by conjugate multiplication, i.e. by forming the product $H_1(f,t)\,\overline{H_2(f,t)}$,

wherein $H_1(f,t)$ is the channel state information of antenna 1, $\overline{H_2(f,t)}$ is the conjugate of the channel state information of antenna 2, and the phase term on each path is a function of the Doppler shift;

3.2 eliminating environmental noise;

first, a Hampel outlier filter removes the most obvious outliers from the raw signal; after the outliers are removed, high-frequency noise is filtered with a high-performance filter, namely a soft-threshold method based on the nonlinear wavelet transform:

$$w_{thr} = \begin{cases} \operatorname{sgn}(w)\,(|w| - thr), & |w| \ge thr \\ 0, & |w| < thr \end{cases}$$

when the absolute value of the wavelet coefficient $w$ is greater than or equal to the given threshold $thr$, the threshold is subtracted from the absolute value of $w$ and the result is multiplied by $\operatorname{sgn}(w)$; when the absolute value of $w$ is less than $thr$, the coefficient is set to 0;

3.3 removing the static direct-current component;

the static direct-current component is large but carries none of the required human motion information, and it is eliminated by subtracting the mean of the signal;

step four, aligning the preprocessed CSI signals with the video frames, the process being as follows:

the video is recorded at 20 FPS and the CSI signals are transmitted at 100 Hz, so five CSI samples correspond to each video frame; timestamps are recorded while the video and the CSI signals are collected, and the two are matched by these timestamps;

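step five, the CSI-2D model training process is as follows:

$L = \lambda_1 L_{box} + \lambda_2 L_{JHMs}$;

$L_{JHMs}$ is calculated as follows:

$$L_{JHMs} = MW \odot \lVert y - \hat{y} \rVert_2^2$$

in the formula, $\lVert y - \hat{y} \rVert_2^2$ is the MSE loss, $y$ denotes the label data matrix obtained from the video, $\hat{y}$ denotes the matrix generated by the model, and $MW$ is the element-wise Matthew Weight;

$\mathbb{I}(\cdot)$ is an indicator function used in computing $MW$: its value is +1 when its condition is satisfied, and -1 otherwise.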

The input CSI tensor is first up-sampled, then passed through residual blocks and U-Nets, and then through a pooling layer. Because the box maps and the JHMs are output by the same network model, the pooled output passes through two convolutional layers and a BN layer to produce the output tensors. The output multi-scale box maps and heatmaps are then processed to extract the two-dimensional keypoints of each person, as follows:

$$(x_{j,p},\, y_{j,p}) = \operatorname{argmax}\left(JHM_j \odot box_p\right)$$

wherein $JHM_j$ is the heatmap of the $j$-th keypoint and $box_p$ is the detection-box map of the $p$-th person;

step six, generating the multi-person 3D coordinates from the multi-person two-dimensional coordinates, the process being as follows:

each person's two-dimensional coordinate sequence is normalized and input into the 2D-3D model to generate that person's three-dimensional coordinates; the single-person two-dimensional and three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in one common three-dimensional coordinate system.

The experimental results of the present invention are described in detail below:

Firstly, experimental environment and data acquisition: transceiver equipment and a camera are arranged in a 10 m-15 m conference room. Two laptops equipped with Intel 5300 network cards are used for the data transmission and reception experiments, with 6 directional antennas divided into two groups of 3 to form WiFi-router-like devices; one group serves as the transmitter (T) and the other as the receiver (R), and the distance between the transmitting and receiving antennas is 6 m. The transmitted signal is centered at 5 GHz; Orthogonal Frequency Division Multiplexing (OFDM) WiFi signals conforming to the 802.11n WiFi communication standard are recorded, with 30 subcarriers and a transmission rate of 100 Hz. The attenuation and phase change of the 30 subcarrier signals at different frequencies reveal information about the propagation path at different scales, and the receiving end receives a 3 × 3 × 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target. While the CSI signals are collected, a monocular RGB camera at the receiving end records video at 20 FPS.

Four volunteers took part in data acquisition. In multi-person and single-person sessions, the volunteers performed actions including waving, clapping, walking, kicking, squatting, jumping, punching, shaking hands and the like between the transmitting and receiving antennas, and the timestamps were saved so that the video frame times correspond to the CSI collection times.

Secondly, data processing: the .dat files obtained by the CSI acquisition tool are preprocessed in MATLAB, specifically: eliminating the phase offset between the two antennas; removing the most obvious outliers of the raw signal with a Hampel outlier filter; removing noise with the soft-threshold method based on the nonlinear wavelet transform; and removing the static direct-current component.

Thirdly, model training: the collected CSI data are made into data sets, which are partitioned into a training set and a validation set at a ratio of 2:8; the Adam optimizer is used during training with β₁ = 0.9 and β₂ = 0.999. k = 1 and b = 1 are used in calculating the MW weights. The networks are trained for 40 epochs.

For model evaluation, the invention uses the two-dimensional and three-dimensional joint points generated by the video model as ground truth to evaluate the whole model.

Table 1 shows, for each keypoint on the multi-person data set, the error of the model after alignment with the ground truth in translation, rotation and scale (P-MPJPE); the lower the P-MPJPE, the better;

TABLE 1

The joint indices denote: 1: mid hip; 2: left hip; 3: left knee; 4: left ankle; 5: right hip; 6: right knee; 7: right ankle; 8: spine; 9: chest; 10: neck; 11: head; 12: left shoulder; 13: left elbow; 14: left wrist; 15: right shoulder; 16: right elbow; 17: right wrist.

As can be seen from Table 1, the overall P-MPJPE of the method reaches 30.4 mm, an improvement of 7.5 mm over the 37.9 mm of WiPose.

The embodiments described in this specification are merely illustrative of implementations of the inventive concepts, which are intended for purposes of illustration only. The scope of the present invention should not be construed as being limited to the particular forms set forth in the examples, but rather as being defined by the claims and the equivalents thereof which can occur to those skilled in the art upon consideration of the present inventive concept.

Claims

1. A multi-person three-dimensional attitude estimation method based on wireless signals is characterized by comprising the following steps:

2. The multi-person three-dimensional pose estimation method based on wireless signals as claimed in claim 1, wherein the process of step one is as follows:

1.1 transceiver equipment and a camera are arranged in a conference-room environment; two laptops equipped with Intel 5300 network cards are used for the data transmission and reception experiments, with 6 directional antennas divided into two groups of 3, the antennas within a group being spaced 20 cm apart, to form WiFi-router-like devices, one group serving as the transmitter (T) and the other as the receiver (R); the WiFi device transmits at a rate of 100 Hz on 30 subcarriers of different frequencies, the attenuation and phase change of the signals at the different frequencies reveal information about the propagation path at different scales, and the receiving end receives a 3 × 3 × 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target;

3. The multi-person three-dimensional pose estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step two, the heatmaps and bounding-box maps are generated from the AlphaPose output as follows: the video frames are input into the AlphaPose model to generate 17 human keypoints and the position coordinates of the human bounding boxes; the keypoint coordinates are converted into a heatmap tensor by Gaussian blurring; the bounding-box coordinates undergo a multi-scale transformation using bounding boxes at 4 scales, and the transformed bounding-box coordinates are drawn on separate maps to generate a second tensor; the two tensors serve as labels that supervise the learning of the CSI-2D model.

4. The multi-person three-dimensional pose estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step three, the collected data are preprocessed as follows:

3.1 eliminating the phase offset between the two antennas by conjugate multiplication, i.e. by forming the product $H_1(f,t)\,\overline{H_2(f,t)}$,

wherein $H_1(f,t)$ is the channel state information of antenna 1, $\overline{H_2(f,t)}$ is the conjugate of the channel state information of antenna 2, and the phase term on each path is a function of the Doppler shift;

3.2 eliminating environmental noise;

first, a Hampel outlier filter removes the most obvious outliers from the raw signal; after the outliers are removed, high-frequency noise is filtered with a high-performance soft-threshold filter based on the nonlinear wavelet transform;

the threshold is selected with the heuristic threshold rule: when the signal-to-noise ratio is small, denoising uses the unbiased likelihood estimation principle, and when the signal-to-noise ratio is large, denoising uses the fixed-threshold method;

3.3 removing the static direct-current component.

5. The method as claimed in claim 1 or 2, wherein in step four, the preprocessed CSI signals are aligned with the video frames as follows: the video is recorded at 20 FPS and the CSI signals are transmitted at 100 Hz; timestamps are recorded while the video and the CSI signals are collected, and the two are matched by these timestamps.

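6. The multi-person three-dimensional pose estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step five, the CSI-2D model training process is as follows:

$L = \lambda_1 L_{box} + \lambda_2 L_{JHMs}$;

$L_{JHMs}$ is calculated as follows:

$$L_{JHMs} = MW \odot \lVert y - \hat{y} \rVert_2^2$$

in the formula, $\lVert y - \hat{y} \rVert_2^2$ is the MSE loss, $y$ denotes the label data matrix obtained from the video, $\hat{y}$ denotes the matrix generated by the model, and $MW$ is the element-wise Matthew Weight;

$\mathbb{I}(\cdot)$ is an indicator function used in computing $MW$: its value is +1 when its condition is satisfied, and -1 otherwise;

the keypoint coordinates are regressed with the argmax function:

$$(x_{j,p},\, y_{j,p}) = \operatorname{argmax}\left(JHM_j \odot box_p\right)$$

wherein $JHM_j$ is the heatmap of the $j$-th keypoint and $box_p$ is the detection-box map of the $p$-th person.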

7. The multi-person three-dimensional pose estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step six, the multi-person 3D coordinates are generated from the multi-person two-dimensional coordinates as follows: each person's two-dimensional coordinate sequence is normalized and input into the 2D-3D model to generate that person's three-dimensional coordinates; the single-person two-dimensional and three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in one common three-dimensional coordinate system.