CN114219853A - Multi-person three-dimensional attitude estimation method based on wireless signals - Google Patents

Multi-person three-dimensional attitude estimation method based on wireless signals

Info

Publication number
CN114219853A
Authority
CN
China
Prior art keywords
csi
dimensional
human body
signal
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111336826.1A
Other languages
Chinese (zh)
Inventor
于水瀛
应建新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Changze Information Technology Co ltd
Original Assignee
Hangzhou Changze Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Changze Information Technology Co ltd filed Critical Hangzhou Changze Information Technology Co ltd
Priority to CN202111336826.1A priority Critical patent/CN114219853A/en
Publication of CN114219853A publication Critical patent/CN114219853A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/08 Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/02 Preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A multi-person three-dimensional pose estimation method based on wireless signals comprises the following steps: step one, a WiFi device collects Channel State Information (CSI) signals while a camera records the corresponding video for supervision; step two, the video is processed by AlphaPose, and the output human bounding boxes and human keypoints are processed to generate heatmaps and bounding-box maps; step three, the CSI signals are preprocessed; step four, the preprocessed CSI signals are aligned with the video frames, and every five CSI samples are fed into a CSI-2D network as one segment for training; step five, the heatmaps and bounding-box maps that the CSI-2D model outputs from the CSI signals are processed to regress the two-dimensional pose of each person; and step six, the two-dimensional poses of each person are arranged into a frame sequence and input into the 2D-3D model, and the multi-person three-dimensional pose is generated in combination with the two-dimensional coordinates. The invention has lower cost and higher precision.

Description

Multi-person three-dimensional attitude estimation method based on wireless signals
Technical Field
The invention belongs to the technical field of wireless communication, and relates to a method for estimating a three-dimensional posture of multiple persons based on wireless channel state information.
Background
For human pose estimation, camera-based methods, including two-dimensional and even three-dimensional pose estimation, are quite mature, but cameras as traditional sensors are clearly limited by illumination, occlusion and background, and they raise privacy concerns.
Wireless signals have great potential for sensing human bodies. In recent years, wireless sensing research has produced results in human sensing, including activity recognition, respiration detection, target localization and crowd counting. There are two basic approaches to human sensing based on wireless signals: device-based approaches, which require a person to wear or carry a device or sensor, and device-free approaches, which use sensing elements placed in the environment to monitor human activity without requiring the person to carry any device or sensor. Device-based approaches, while generally accurate, are impractical or inconvenient in many important real-life scenarios, for example when elderly people or dementia patients would have to carry the device at all times. Device-free human sensing offers significant advantages in these scenarios.
Many studies have shown that wireless signals can accomplish two-dimensional and three-dimensional pose estimation well under the supervision of a video model. Zhao et al. proposed RF-Pose, an encoder-decoder deep learning architecture for human pose estimation. Its input comes from an antenna array that transmits frequency-modulated continuous waves (FMCW); although such a signal can measure the distance from the target to the device, the hardware is expensive and not widely available compared with commercial WiFi. Wang et al. were the first to perform human pose estimation on CSI signals collected with commercial WiFi, obtaining human masks, two-dimensional joint points and part affinity fields (PAFs). Guo et al. also used deep learning to obtain the human skeleton from CSI signals received from multiple directions. These methods obtain only two-dimensional poses. Jiang et al. proposed a method for obtaining three-dimensional poses from CSI signals, training the model on CSI with labels provided by a VICON system (a motion capture camera system); however, learning the three-dimensional pose model directly end to end is difficult, the resulting accuracy is not high, and the method is still limited to movements performed in place.
In summary, the problems of the prior art are as follows: CSI-based human pose estimation mostly remains two-dimensional, while existing three-dimensional pose estimation handles only a single person moving in place, and its accuracy is limited. In addition, CSI signals are strongly affected by environmental interference, and handling this interference is a difficulty. The difficulties in solving the above problems are: how to train a fine-grained three-dimensional human pose model from CSI data; how to achieve pose estimation with multiple persons; and how to eliminate the effect of environmental noise.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides the multi-user three-dimensional attitude estimation method based on the wireless signals, which has lower cost and higher precision.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a multi-person three-dimensional attitude estimation method based on wireless signals comprises the following steps:
step one, using WiFi equipment to collect Channel State Information (CSI) signals, and simultaneously using a camera to shoot corresponding videos for supervision;
step two, processing the video with AlphaPose, and processing the output human bounding boxes and human keypoints to generate heatmaps and bounding-box maps, which serve as labels for training on the wireless signals;
step three, preprocessing the collected signals, including eliminating the phase offset between the two antennas, removing outliers and environmental noise from the CSI signals, and removing the static direct-current component;
step four, aligning the preprocessed CSI signals with the video frames, and feeding every five CSI samples into a CSI-2D network as one segment for training;
step five, processing the heatmaps and bounding-box maps that the CSI-2D model outputs from the CSI signals, and regressing the two-dimensional pose of each person;
and step six, arranging the two-dimensional poses of each person into a frame sequence, inputting it into the 2D-3D model, and generating the multi-person three-dimensional pose in combination with the two-dimensional coordinates.
Further, the process of step one is as follows:
1.1 The transceiver equipment and a camera are arranged in a conference room, and data are transmitted and received using two notebook computers equipped with Intel 5300 network cards. Six directional antennas are used, divided into two groups of three with an antenna spacing of 20 cm within each group, so that each group forms a WiFi-router-like device; one group serves as the transmitter (T) and the other as the receiver (R). The WiFi equipment transmits at a rate of 100 Hz on 30 subcarriers of different frequencies; the attenuation and phase change of the signals at the different frequencies reveal information about the propagation paths at different scales, and the receiving end receives a 3 x 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target;
1.2 The corresponding video is recorded with the camera as follows: a monocular RGB camera records video at 20 FPS; four volunteers participate in data acquisition, and multi-person and single-person data are acquired separately, with actions including waving, clapping, walking, kicking, squatting, jumping, punching and shaking hands; the timestamps are saved so that the video can be aligned with the CSI signals.
Still further, in step two, the heatmaps and bounding-box maps are generated from the AlphaPose output as follows: the video frames are input into the AlphaPose model to generate 17 human keypoints and the position coordinates of the human bounding boxes; the keypoint coordinates are converted into a heatmap tensor by Gaussian blurring; the bounding-box coordinates undergo a multi-scale transformation with 4 box scales, and the transformed box coordinates are drawn on separate maps to form a tensor. The two tensors serve as annotations that supervise the learning of the CSI-2D model.
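The heatmap labels described in step two can be sketched as follows. This is a minimal illustration, assuming each keypoint is rendered as an isotropic Gaussian peak; the function name `keypoint_heatmaps`, the value of `sigma` and the 46 x 82 grid size are hypothetical and are not taken from the patent.

```python
import numpy as np

def keypoint_heatmaps(keypoints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per keypoint.

    keypoints: array of shape (J, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (J, height, width) whose peak value is 1.0
    at each keypoint location.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.empty((len(keypoints), height, width))
    for j, (x, y) in enumerate(keypoints):
        # Squared distance of every pixel to the keypoint, turned into a Gaussian
        maps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps

# Example: 17 keypoints on an illustrative 46 x 82 grid (sizes are assumptions)
rng = np.random.default_rng(0)
pts = np.column_stack([rng.integers(0, 82, 17), rng.integers(0, 46, 17)])
heatmaps = keypoint_heatmaps(pts, height=46, width=82)
```

Each of the 17 channels then carries a single soft peak, which is a common form of supervision target for heatmap regression.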
Furthermore, in step three, the acquired data are preprocessed as follows:
3.1 Eliminating the phase offset between the two antennas. Conjugate multiplication is used to eliminate the phase offset:
[Equation image BDA0003350828260000031 of the original publication: expansion of the conjugate product $H_1(f,t)\,\overline{H_2(f,t)}$]
where $H_1(f,t)$ is the channel state information of antenna 1, $\overline{H_2(f,t)}$ is the conjugate of the channel state information of antenna 2, $H_{1,S}(f)$ and $H_{2,S}(f)$ are their respective static-path components, $K$ and $L$ are the numbers of multipath components, $\alpha_l(f,t)$ is the amplitude attenuation function on the $l$-th path, and the term shown in image BDA0003350828260000033 is a function of the Doppler shift;
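Why conjugate multiplication removes the phase offset can be checked numerically. The sketch below assumes that both antennas of the receiver share the same random per-packet phase offset (a common modeling assumption for antennas driven by one oscillator); the synthetic CSI arrays and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_packets, n_sub = 200, 30

# Hypothetical offset-free CSI for two antennas of the same receiver
clean_1 = rng.normal(size=(n_packets, n_sub)) + 1j * rng.normal(size=(n_packets, n_sub))
clean_2 = rng.normal(size=(n_packets, n_sub)) + 1j * rng.normal(size=(n_packets, n_sub))

# Assumption: each packet carries one random phase offset common to both antennas
offset = np.exp(1j * rng.uniform(-np.pi, np.pi, size=(n_packets, 1)))
h1 = clean_1 * offset
h2 = clean_2 * offset

# Conjugate multiplication: the common factor becomes offset * conj(offset) = 1
product = h1 * np.conj(h2)
reference = clean_1 * np.conj(clean_2)
```

Here `product` equals `reference` up to rounding, so the random offset no longer appears in the signal passed to the later stages.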
3.2 Eliminating environmental noise.
First, a Hampel outlier filter removes the most obvious outliers from the original signal; after the outliers are removed, high-frequency noise is filtered out with a filter that uses a soft-threshold method based on the nonlinear wavelet transform:
$$\hat{w} = \begin{cases} \operatorname{sgn}(w)\,(|w| - thr), & |w| \ge thr \\ 0, & |w| < thr \end{cases}$$
where, when the absolute value of the wavelet coefficient $w$ is greater than or equal to the given threshold $thr$, the threshold is subtracted from the absolute value and the result is multiplied by $\operatorname{sgn}(w)$, and when the absolute value of $w$ is less than $thr$, the coefficient is set to 0.
The threshold is selected by the heuristic threshold rule: when the signal-to-noise ratio is small, denoising uses the unbiased likelihood estimation rule, and when the signal-to-noise ratio is large, denoising uses the fixed threshold method;
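The two operations in 3.2 can be sketched as below. This is a minimal illustration: the Hampel filter is written in its usual sliding-median form, the window half-width and the 3-sigma rule are assumptions, and the soft threshold is applied to an array of coefficients without performing the wavelet decomposition itself. The function names are hypothetical.

```python
import numpy as np

def hampel_filter(x, half_window=5, n_sigmas=3.0):
    """Replace samples that deviate strongly from the local median by that median."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    scale = 1.4826  # converts the median absolute deviation to a Gaussian sigma
    for i in range(len(x)):
        lo, hi = max(0, i - half_window), min(len(x), i + half_window + 1)
        window = x[lo:hi]
        med = np.median(window)
        mad = scale * np.median(np.abs(window - med))
        if np.abs(x[i] - med) > n_sigmas * mad:
            y[i] = med
    return y

def soft_threshold(w, thr):
    """Soft thresholding: sgn(w) * (|w| - thr) where |w| >= thr, otherwise 0."""
    w = np.asarray(w, dtype=float)
    return np.sign(w) * np.maximum(np.abs(w) - thr, 0.0)

# Example: a sine wave with one injected outlier
t = np.arange(100) / 99.0
signal = np.sin(2 * np.pi * t)
signal[50] += 10.0
cleaned = hampel_filter(signal)
shrunk = soft_threshold([-3.0, -0.5, 0.2, 2.0], thr=1.0)
```

In a full pipeline the soft threshold would be applied to the detail coefficients of a wavelet decomposition before reconstruction; that step is omitted here.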
3.3 Removing the static direct-current component.
The static direct-current component is large but carries none of the required human motion information; it is eliminated by subtracting the signal mean.
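Mean subtraction can be sketched in a few lines. The example assumes the mean is taken over time separately for every antenna and subcarrier; the array layout `(n_packets, 3, 30)` and the function name are hypothetical.

```python
import numpy as np

def remove_static_component(csi):
    """Subtract the time average of every antenna/subcarrier stream.

    csi: array whose first axis indexes packets (time).
    """
    csi = np.asarray(csi)
    return csi - csi.mean(axis=0, keepdims=True)

# Example: a large static offset plus a small motion-induced variation
rng = np.random.default_rng(0)
motion = 0.1 * rng.normal(size=(500, 3, 30))
static = 5.0 + rng.normal(size=(1, 3, 30))
dynamic = remove_static_component(static + motion)
```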
In step four, the preprocessed CSI signals are aligned with the video frames as follows: the video frame rate is 20 FPS and the CSI transmission rate is 100 Hz, so each video frame corresponds to five CSI samples; timestamps are recorded while collecting both the video and the CSI signals and are used to align the two.
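The timestamp alignment can be sketched as follows, assuming integer millisecond timestamps and the rule that each frame takes the five CSI packets starting at its own timestamp. The pairing rule, the function name `group_csi_by_frame` and the array shapes are assumptions for illustration.

```python
import numpy as np

def group_csi_by_frame(csi, csi_ts, frame_ts, packets_per_frame=5):
    """Collect packets_per_frame consecutive CSI packets for every video frame.

    csi:      (n_packets, 3, 30) array of preprocessed CSI
    csi_ts:   sorted packet timestamps
    frame_ts: sorted video frame timestamps
    Returns the stacked segments and the indices of the frames that were kept.
    """
    starts = np.searchsorted(csi_ts, frame_ts, side="left")
    keep = starts + packets_per_frame <= len(csi)  # drop frames without enough packets
    segments = np.stack([csi[s:s + packets_per_frame] for s in starts[keep]])
    return segments, np.flatnonzero(keep)

# Example: one second of data, 100 Hz CSI and 20 FPS video (timestamps in ms)
rng = np.random.default_rng(0)
csi = rng.normal(size=(100, 3, 30))
csi_ts = np.arange(0, 1000, 10)
frame_ts = np.arange(0, 1000, 50)
segments, frame_idx = group_csi_by_frame(csi, csi_ts, frame_ts)
```

Each entry of `segments` is then one five-packet training sample paired with the label of the frame at `frame_idx`.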
In step five, the CSI-2D model is trained as follows:
a learning rate of 0.06 is used and training runs for 40 epochs; the data set is divided into a training set and a validation set at a ratio of 2:8; the Adam optimizer is used with the coefficient β1 set to 0.9;
the training objective of the CSI-2D model is to reduce the difference from the keypoint heatmaps predicted by the vision-based network model, and the total loss L is composed of several losses:
$$L = \lambda_1 L_{box} + \lambda_2 L_{JHMs}$$
where the box loss $L_{box}$ is computed with a binary classification loss and the joint position error $L_{JHMs}$ is computed with the L2 loss (squared loss) multiplied by element weights; the box loss weight coefficient $\lambda_1$ is set to 0.1 and the joint position error weight coefficient $\lambda_2$ is set to 1;
$L_{JHMs}$ is calculated as the L2 loss multiplied element-wise by the weights:
$$L_{JHMs} = w \odot \lVert y - \hat{y} \rVert_2^2$$
where $\lVert y - \hat{y} \rVert_2^2$ is the MSE loss term, $y$ denotes the label data matrix obtained from the video, $\hat{y}$ denotes the matrix generated by the model, and $w$ denotes the element weights; an attention mechanism is implemented by using Matthew weights for $w$ (the weight formula appears only as equation image BDA0003350828260000056 in the original publication), in which II(·) is an indicator function that takes the value +1 under the condition shown in image BDA0003350828260000057 and the value -1 under the condition shown in image BDA0003350828260000058;
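The combined loss can be sketched numerically as below. This is an illustration under stated assumptions: the binary classification loss is taken to be binary cross-entropy, both terms use mean reduction, and the element weights are passed in as a ready-made array because the Matthew weight formula is not recoverable from the text. All function and parameter names are hypothetical.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted and label box maps."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))))

def weighted_l2(pred, target, weights):
    """Mean squared error with one weight per element."""
    return float(np.mean(weights * (target - pred) ** 2))

def total_loss(box_pred, box_true, jhm_pred, jhm_true, jhm_weights,
               lam_box=0.1, lam_jhm=1.0):
    """L = lam_box * L_box + lam_jhm * L_JHMs with the coefficients 0.1 and 1."""
    return (lam_box * binary_cross_entropy(box_pred, box_true)
            + lam_jhm * weighted_l2(jhm_pred, jhm_true, jhm_weights))

# Example: perfect box prediction, heatmap prediction off by 1 everywhere, weight 2
box = np.array([[0.0, 1.0], [1.0, 0.0]])
loss = total_loss(box, box, np.zeros((2, 2)), np.ones((2, 2)), 2.0 * np.ones((2, 2)))
```

With these inputs the box term is negligible and the heatmap term dominates, which mirrors the small coefficient chosen for the box loss.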
The network processes its input as follows: the input CSI tensor is first up-sampled and then passes through residual blocks and U-Nets, followed by a pooling layer. Because the box maps and the JHMs are output by the same network model, the pooled output passes through two convolutional layers and a BN layer to produce the output tensors. The output multi-scale box maps and heatmaps are then processed to extract the two-dimensional keypoints of each person, as follows:
the human bounding boxes segmented at each scale of the multi-scale box maps are evaluated to select the best detection box for each person; the heatmaps are then multiplied by each single-person bounding box to extract that person's heatmaps; finally the keypoint coordinates are regressed with the argmax function:
$$(x_j^p, y_j^p) = \arg\max_{(x,y)} \left( H_j \odot B^p \right)(x, y)$$
where $j$ is the keypoint index, $p$ is the person index, $(x_j^p, y_j^p)$ is the regressed coordinate position of the $j$-th keypoint of the $p$-th person, $H_j$ is the heatmap of the $j$-th keypoint, and $B^p$ is the detection box map of the $p$-th person.
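The masking-and-argmax step can be sketched as follows, assuming each person's detection box is available as a binary mask of the same size as the heatmaps. The function name `extract_keypoints` and the toy sizes are hypothetical.

```python
import numpy as np

def extract_keypoints(heatmaps, box_masks):
    """Regress per-person keypoints by masking the heatmaps and taking the argmax.

    heatmaps:  (J, H, W) array, one heatmap per keypoint type
    box_masks: (P, H, W) binary array, one detection box mask per person
    Returns an integer array of shape (P, J, 2) holding (x, y) coordinates.
    """
    n_joints, height, width = heatmaps.shape
    n_people = box_masks.shape[0]
    coords = np.zeros((n_people, n_joints, 2), dtype=int)
    for p in range(n_people):
        masked = heatmaps * box_masks[p]  # keep only responses inside person p's box
        flat = masked.reshape(n_joints, -1).argmax(axis=1)
        ys, xs = np.unravel_index(flat, (height, width))
        coords[p, :, 0] = xs
        coords[p, :, 1] = ys
    return coords

# Example: one keypoint type, two people standing side by side
heat = np.zeros((1, 6, 10))
heat[0, 3, 2] = 1.0  # response of the left person at (x=2, y=3)
heat[0, 4, 7] = 0.8  # response of the right person at (x=7, y=4)
masks = np.zeros((2, 6, 10))
masks[0, :, :5] = 1.0
masks[1, :, 5:] = 1.0
coords = extract_keypoints(heat, masks)
```

Without the mask, a plain argmax would return only the stronger of the two responses; the mask is what lets one shared heatmap serve several people.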
In step six, the multi-person 3D coordinates are generated from the multi-person two-dimensional coordinates as follows: each single-person two-dimensional coordinate sequence is normalized and input into the 2D-3D model to generate single-person three-dimensional coordinates; the single-person two-dimensional and three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in one common three-dimensional coordinate system.
The invention provides a novel method for multi-person three-dimensional pose estimation from CSI signals: the signals are preprocessed; the wireless network is trained jointly on two-dimensional poses and human bounding boxes under the supervision of a video-based two-dimensional pose estimation deep learning model; and the output two-dimensional poses, arranged into frame sequences, pass through a pre-trained three-dimensional pose model to output movable three-dimensional pose skeletons.
Using WiFi signals for human pose estimation has good market applicability: it supports activity detection and living assistance for patients or elderly people with limited mobility, and using WiFi signals avoids many privacy problems; it can be applied to games, animation and AR in cluttered environments; and it can detect human bodies in dim environments. It meets the requirements of low cost and high precision.
The invention has the following beneficial effects:
1. two-dimensional and three-dimensional multi-person pose estimation is realized with commercial WiFi; the method is low in cost, requires only WiFi equipment, and achieves high precision;
2. multi-scale bounding boxes are used, which solves the technical problem of one person's keypoints falling inside another person's space when multiple persons interact, and improves the accuracy of multi-person segmentation during interaction;
3. for three-dimensional pose generation, the three-dimensional stage combines the two-dimensional joint points, their positions and the frame sequence to smooth the motion, which improves the precision of the model and reduces the difficulty of learning the model directly end to end;
4. for two-dimensional pose recognition, the method abandons part affinity fields (PAFs) and instead uses human bounding boxes, which improves the real-time performance of the model; the invention collects a data set of 8 actions (waving, clapping, walking, kicking, shaking hands, squatting, jumping and punching) with four volunteers, divided into single-person and multi-person video and CSI data.
Drawings
FIG. 1 is a flow chart of a method for three-dimensional attitude estimation based on channel state information according to an embodiment of the present invention;
fig. 2 is a flow chart of a pre-processing of a raw CSI signal according to an embodiment of the present invention;
fig. 3 is a network structure diagram of a CSI-2D module.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 3, a method for estimating a three-dimensional posture of a plurality of persons based on wireless signals includes the following steps:
step one, using WiFi equipment to collect Channel State Information (CSI) signals, and simultaneously using a camera to shoot corresponding videos for supervision;
step two, processing the video with AlphaPose, and processing the output human bounding boxes and human keypoints to generate heatmaps and bounding-box maps, which serve as labels for training on the wireless signals;
step three, preprocessing the collected signals, including eliminating the phase offset between the two antennas, removing outliers and environmental noise from the CSI signals, and removing the static direct-current component;
step four, aligning the preprocessed CSI signals with the video frames, and feeding every five CSI samples into a CSI-2D network as one segment for training;
step five, processing the heatmaps and bounding-box maps that the CSI-2D model outputs from the CSI signals, and regressing the two-dimensional pose of each person;
and step six, arranging the two-dimensional poses of each person into a frame sequence, inputting it into the 2D-3D model, and generating the multi-person three-dimensional pose in combination with the two-dimensional coordinates.
As shown in fig. 1, the multi-person three-dimensional pose estimation method based on wireless channel state information provided in the embodiment of the present invention includes the following steps:
Step one, using WiFi equipment to collect Channel State Information (CSI) signals, and the process is as follows:
the WiFi equipment transmits at a rate of 100 Hz on 30 subcarriers of different frequencies; the attenuation and phase change of the signals at the different frequencies reveal information about the propagation paths at different scales, and the receiving end receives a 3 x 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target.
Shooting a corresponding video by using a camera, wherein the process is as follows:
a monocular RGB camera records video at 20 FPS; four volunteers participate in data acquisition, and multi-person and single-person data are acquired separately, with actions including waving, clapping, walking, kicking, squatting, jumping, punching, shaking hands and the like; the timestamps are saved so that the video can be aligned with the CSI signals.
Step two, generating the heatmaps and bounding-box maps from the AlphaPose output, and the process is as follows:
the video frames are input into the AlphaPose model to generate 17 human keypoints and the position coordinates of the human bounding boxes; the keypoint coordinates are converted into a heatmap tensor by Gaussian blurring; the bounding-box coordinates undergo a multi-scale transformation, with 4 box scales used in the method, and the transformed box coordinates are drawn on separate maps to form a tensor; the two tensors serve as labels that supervise the learning of the CSI-2D model.
Step three, preprocessing the acquired data, and the process is as follows:
3.1 Eliminating the phase offset between the two antennas. Conjugate multiplication is used to eliminate the phase offset:
[Equation image BDA0003350828260000081 of the original publication: expansion of the conjugate product $H_1(f,t)\,\overline{H_2(f,t)}$]
where $H_1(f,t)$ is the channel state information of antenna 1, $\overline{H_2(f,t)}$ is the conjugate of the channel state information of antenna 2, $H_{1,S}(f)$ and $H_{2,S}(f)$ are their respective static-path components, $K$ and $L$ are the numbers of multipath components, $\alpha_l(f,t)$ is the amplitude attenuation function on the $l$-th path, and the term shown in image BDA0003350828260000083 is a function of the Doppler shift;
3.2 Eliminating environmental noise.
First, a Hampel outlier filter removes the most obvious outliers from the original signal; after the outliers are removed, high-frequency noise is filtered out with a filter that uses a soft-threshold method based on the nonlinear wavelet transform:
$$\hat{w} = \begin{cases} \operatorname{sgn}(w)\,(|w| - thr), & |w| \ge thr \\ 0, & |w| < thr \end{cases}$$
when the absolute value of the wavelet coefficient $w$ is greater than or equal to the given threshold $thr$, the threshold is subtracted from the absolute value and the result is multiplied by $\operatorname{sgn}(w)$, and when the absolute value of $w$ is less than $thr$, the coefficient is set to 0;
the threshold is selected by the heuristic threshold rule: when the signal-to-noise ratio is small, denoising uses the unbiased likelihood estimation rule, and when the signal-to-noise ratio is large, denoising uses the fixed threshold method;
3.3 Removing the static direct-current component.
The static direct-current component is large but carries none of the required human motion information; it is eliminated by subtracting the signal mean;
step four, aligning the preprocessed CSI signals with the video frames, and the process is as follows:
the video frame rate is 20 FPS and the CSI transmission rate is 100 Hz, so each video frame corresponds to five CSI samples; timestamps are recorded while collecting both the video and the CSI signals and are used to align the two;
step five, the CSI-2D model is trained as follows:
a learning rate of 0.06 is used and training runs for 40 epochs; the data set is divided into a training set and a validation set at a ratio of 2:8; the Adam optimizer is used with the coefficient β1 set to 0.9;
the training objective of the CSI-2D model is to reduce the difference from the keypoint heatmaps predicted by the vision-based network model, and the total loss L is composed of several losses:
$$L = \lambda_1 L_{box} + \lambda_2 L_{JHMs}$$
where the box loss $L_{box}$ is computed with a binary classification loss and the joint position error $L_{JHMs}$ is computed with the L2 loss (squared loss) multiplied by element weights; the box loss weight coefficient $\lambda_1$ is set to 0.1 and the joint position error weight coefficient $\lambda_2$ is set to 1;
$L_{JHMs}$ is calculated as the L2 loss multiplied element-wise by the weights:
$$L_{JHMs} = w \odot \lVert y - \hat{y} \rVert_2^2$$
where $\lVert y - \hat{y} \rVert_2^2$ is the MSE loss term, $y$ denotes the label data matrix obtained from the video, $\hat{y}$ denotes the matrix generated by the model, and $w$ denotes the element weights; an attention mechanism is implemented by using Matthew weights for $w$ (the weight formula appears only as equation image BDA0003350828260000097 in the original publication), in which II(·) is an indicator function that takes the value +1 under the condition shown in image BDA0003350828260000098 and the value -1 under the condition shown in image BDA0003350828260000099.
The input CSI tensor is first up-sampled and then passes through residual blocks and U-Nets, followed by a pooling layer. Because the box maps and the JHMs are output by the same network model, the pooled output passes through two convolutional layers and a BN layer to produce the output tensors. The output multi-scale box maps and heatmaps are then processed to extract the two-dimensional keypoints of each person, as follows:
the human bounding boxes segmented at each scale of the multi-scale box maps are evaluated to select the best detection box for each person; the heatmaps are then multiplied by each single-person bounding box to extract that person's heatmaps; finally the keypoint coordinates are regressed with the argmax function:
$$(x_j^p, y_j^p) = \arg\max_{(x,y)} \left( H_j \odot B^p \right)(x, y)$$
where $j$ is the keypoint index, $p$ is the person index, $(x_j^p, y_j^p)$ is the regressed coordinate position of the $j$-th keypoint of the $p$-th person, $H_j$ is the heatmap of the $j$-th keypoint, and $B^p$ is the detection box map of the $p$-th person;
step six, generating the multi-person 3D coordinates from the multi-person two-dimensional coordinates, and the process is as follows:
each single-person two-dimensional coordinate sequence is normalized and input into the 2D-3D model to generate single-person three-dimensional coordinates; the single-person two-dimensional and three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in one common three-dimensional coordinate system.
The experimental results of the present invention are described in detail below:
First, experimental environment and data acquisition: the transceiver equipment and a camera are arranged in a 10 m × 15 m conference room; two notebook computers equipped with Intel 5300 network cards are used for data transmission and reception; six directional antennas are used, divided into two groups of three to form WiFi-router-like devices, with one group serving as the transmitter (T) and the other as the receiver (R); the distance between the transmitting and receiving antennas is 6 m. The transmitted signal is centered at 5 GHz and is an Orthogonal Frequency Division Multiplexing (OFDM) WiFi signal conforming to the 802.11n WiFi communication standard; 30 subcarriers are recorded and the transmission rate is 100 Hz. The attenuation and phase change of the 30 subcarrier signals at different frequencies reveal information about the propagation paths at different scales, and the receiving end receives a 3 x 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target. While the CSI signals are collected, a monocular RGB camera at the receiving end records video at 20 FPS.
Four volunteers participated in data acquisition. Between the transmitting and receiving antennas, in multi-person and single-person sessions, they performed actions including waving, clapping, walking, kicking, squatting, jumping, punching, shaking hands and the like, and the timestamps were saved so that the video frame times correspond to the CSI collection times.
Second, data processing: the .dat files obtained by the CSI acquisition tool are preprocessed in MATLAB, specifically: eliminating the phase offset between the two antennas; removing the most obvious outliers of the original signal with a Hampel outlier filter; removing noise with the soft-threshold method based on the nonlinear wavelet transform; and removing the static DC component.
Third, model training: the collected CSI data were made into a data set, which was partitioned into a training set and a validation set at a ratio of 2:8; the Adam optimizer was used during training with β1 = 0.9 and β2 = 0.999. k = 1 and b = 1 were used when calculating the Matthew weights (MW). The networks were trained for 40 epochs.
For model evaluation, the invention uses the two-dimensional and three-dimensional joint points generated by the video model as ground truth to evaluate the whole model.
Table 1 shows the per-key-point error of the model on the multi-person data set after alignment with the ground truth in translation, rotation and scale (P-MPJPE); the lower the P-MPJPE, the better.
Figure BDA0003350828260000111
TABLE 1
The joint indices denote: 1: middle hip; 2: left hip; 3: left knee; 4: left ankle; 5: right hip; 6: right knee; 7: right ankle; 8: spine; 9: chest; 10: neck; 11: head; 12: left shoulder; 13: left elbow; 14: left wrist; 15: right shoulder; 16: right elbow; 17: right wrist.
As can be seen from Table 1, the overall P-MPJPE of the method reaches 30.4 mm, an improvement of 7.5 mm over the 37.9 mm of WiPose.
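For reference, the P-MPJPE metric reported in Table 1 can be sketched in NumPy as below. This is a generic Procrustes-aligned error for a single pose, not the evaluation code of the source; the function name, the random test pose and the noise level are assumptions for illustration.

```python
import numpy as np

def p_mpjpe(pred, gt):
    """Mean per-joint position error after Procrustes alignment.

    pred, gt: arrays of shape (J, 3). The prediction is aligned to the
    reference by the best translation, rotation and uniform scale.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    x, y = pred - mu_p, gt - mu_g
    norm_x, norm_y = np.linalg.norm(x), np.linalg.norm(y)
    x, y = x / norm_x, y / norm_y

    # Optimal rotation from the SVD of the cross-covariance matrix
    u, s, vt = np.linalg.svd(x.T @ y)
    if np.linalg.det(u @ vt) < 0:  # exclude reflections
        u[:, -1] *= -1.0
        s[-1] *= -1.0
    rot = u @ vt
    scale = s.sum() * norm_y / norm_x
    aligned = scale * (pred - mu_p) @ rot + mu_g
    return float(np.mean(np.linalg.norm(aligned - gt, axis=1)))

# Assumed example: a 17-joint pose in millimetres
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3)) * 300.0
c, s_ = np.cos(0.5), np.sin(0.5)
rz = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
pred_similar = 2.0 * gt @ rz + np.array([100.0, -50.0, 30.0])  # scaled, rotated, shifted copy
pred_noisy = gt + rng.normal(scale=10.0, size=gt.shape)        # copy with joint noise
err_similar = p_mpjpe(pred_similar, gt)
err_noisy = p_mpjpe(pred_noisy, gt)
```

A prediction that differs from the reference only by translation, rotation and scale scores essentially zero, so the metric measures errors in the shape of the skeleton rather than in its global placement.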
The embodiments described in this specification are merely illustrative of implementations of the inventive concepts, which are intended for purposes of illustration only. The scope of the present invention should not be construed as being limited to the particular forms set forth in the examples, but rather as being defined by the claims and the equivalents thereof which can occur to those skilled in the art upon consideration of the present inventive concept.

Claims (7)

1. A multi-person three-dimensional attitude estimation method based on wireless signals, characterized by comprising the following steps:
step one, using WiFi equipment to collect Channel State Information (CSI) signals while using a camera to shoot corresponding video for supervision;
step two, processing the video with AlphaPose, and processing the output human body target boxes and human body key points to generate heat maps and target box maps, which serve as labels for wireless signal training;
step three, preprocessing the collected signals, including eliminating the phase offset between two antennas, removing outliers and environmental noise from the CSI signals, and removing the static DC component;
step four, matching the preprocessed CSI signals to the video frames, and inputting every five CSI samples as one segment of data into the CSI-2D network for training;
step five, processing the heat maps and target box maps output by the CSI-2D model from the CSI signals, and regressing the two-dimensional posture of each human body;
and step six, assembling the two-dimensional postures of each human body into a video frame sequence, inputting it into the 2D-3D model, and generating the multi-person three-dimensional posture in combination with the two-dimensional coordinates.
2. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1, wherein the process of step one is as follows:
1.1 arranging transceiver equipment and a camera in a conference room environment, and carrying out data transmission and reception experiments using two laptops fitted with Intel 5300 network cards; six directional antennas are used, divided into two groups of three with a spacing of 20 cm between antennas in the same group, to form WiFi-router-like devices, one group serving as the transmitter (T) and the other as the receiver (R); the WiFi device transmits, at a rate of 100 Hz, signals on 30 subcarriers of different frequencies so as to capture the attenuation and phase change of signals at different frequencies and thus information about the propagation path at different scales, and the receiving end receives a 3 × 3 × 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target;
1.2 shooting the corresponding video with the camera, as follows: a monocular RGB camera records video at 20 FPS; four volunteers participate in data acquisition; multi-person and single-person data are collected separately, with actions including waving, clapping, walking, kicking, squatting, jumping, punching and shaking hands; and timestamps are saved so that the video can be matched to the CSI signals.
3. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step two, the process of generating the heat maps and target box maps from the AlphaPose output is as follows: the video frames are input into the AlphaPose model to generate 17 human body key points and the position coordinates of the human body target boxes; the generated key point coordinates are converted into a heat map tensor by Gaussian blur processing; the generated target box coordinates undergo multi-scale transformation using target boxes at 4 scales, and the transformed target box coordinates are placed on separate maps to generate a tensor; the two tensors are used as labels to supervise the learning of the CSI-2D model.
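The key point heat map labels can be illustrated with a short NumPy sketch that places a Gaussian at each key point, which matches blurring a one-pixel impulse up to a constant scale. The map size, the value of `sigma` and the function name are assumptions; the source does not state them, and the multi-scale target box maps are not covered here.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Turn (K, 2) key point coordinates (x, y) into a (K, H, W) heat map tensor."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for i, (kx, ky) in enumerate(keypoints):
        # Gaussian bump with peak value 1 at the key point location
        heatmaps[i] = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2.0 * sigma ** 2))
    return heatmaps

# Assumed example: two key points on a 48 x 64 map
kps = np.array([[10, 20], [30, 5]])
hm = keypoints_to_heatmaps(kps, height=48, width=64)
peak_row, peak_col = np.unravel_index(hm[0].argmax(), hm[0].shape)
```

For a full label, the same function would be called with the 17 key points that AlphaPose returns for each frame.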
4. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step three, the collected data are preprocessed as follows:
3.1 eliminating the phase offset between two antennas, using conjugate multiplication:
Figure FDA0003350828250000011
wherein H_1(f, t) is the channel state information of antenna 1,
Figure FDA0003350828250000012
is the conjugate of the channel state information of antenna 2, H_{1,S}(f) and H_{2,S}(f) are their respective static path components, K and L are the numbers of multipath components, α_l(f, t) is the amplitude attenuation function on the l-th path, and
Figure FDA0003350828250000021
is a function of the Doppler shift;
3.2 eliminating environmental noise;
first, the most obvious outliers of the raw signal are removed with a Hampel outlier filter; after the outliers are removed, high-frequency noise is filtered out with a soft-threshold filter based on the nonlinear wavelet transform,
Figure FDA0003350828250000022
when the absolute value of the wavelet coefficient w is greater than or equal to the given threshold thr, the threshold is subtracted from the absolute value of w and the result is multiplied by sgn(w); when the absolute value of w is less than the given threshold thr, the wavelet coefficient is set to 0;
the threshold is selected by the heuristic threshold principle: when the signal-to-noise ratio is small, denoising uses the unbiased likelihood estimation principle, and when the signal-to-noise ratio is large, denoising uses the fixed threshold method;
3.3 removing the static DC component;
the static DC component is large but does not contain the required human motion information, so the signal mean is subtracted to eliminate it.
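The three operations of this claim can be sketched in NumPy as below. This is an illustrative sketch only: the wavelet decomposition and the heuristic threshold selection are omitted, so `soft_threshold` is shown acting on an already computed coefficient array, and all sample values are assumptions.

```python
import numpy as np

def conjugate_multiply(h1, h2):
    # CSI of antenna 1 times the complex conjugate of the CSI of antenna 2;
    # any phase offset shared by both antennas cancels in the product
    return h1 * np.conj(h2)

def soft_threshold(w, thr):
    # sgn(w) * (|w| - thr) where |w| >= thr, and 0 elsewhere
    return np.sign(w) * np.maximum(np.abs(w) - thr, 0.0)

def remove_dc(x, axis=-1):
    # Subtract the mean along the time axis to remove the static component
    return x - np.mean(x, axis=axis, keepdims=True)

# Assumed example: 30 subcarriers with a random common phase offset
rng = np.random.default_rng(0)
h1 = rng.normal(size=30) + 1j * rng.normal(size=30)
h2 = rng.normal(size=30) + 1j * rng.normal(size=30)
offset = np.exp(1j * 1.234)
offset_cancels = np.allclose(
    conjugate_multiply(h1 * offset, h2 * offset), conjugate_multiply(h1, h2)
)

coeffs = np.array([-3.0, -0.5, 0.2, 2.0])
shrunk = soft_threshold(coeffs, thr=1.0)

series = np.array([5.0, 6.0, 7.0, 6.0])
centered = remove_dc(series)
```

A complete denoiser would decompose each series with a wavelet transform, apply `soft_threshold` to the detail coefficients, and reconstruct the signal.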
5. The method as claimed in claim 1 or 2, wherein in step four, the preprocessed CSI signals are matched to the video frames as follows: the video frame rate is 20 FPS and the CSI transmission rate is 100 Hz; timestamps are recorded when the video and the CSI signals are collected, and are used to match the video to the CSI signals.
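Because 100 Hz CSI and 20 FPS video give five CSI samples per frame, the timestamp matching can be sketched as follows. Taking the five packets nearest in time to each frame is an assumption; the source only states that the two streams are matched by their recorded timestamps. The function name and the millisecond timestamps are likewise illustrative.

```python
import numpy as np

def group_csi_by_frame(csi_ts, frame_ts, per_frame=5):
    """For each frame timestamp, return the indices of the nearest CSI packets."""
    csi_ts = np.asarray(csi_ts, dtype=float)
    groups = []
    for t in frame_ts:
        # Indices of the per_frame packets closest in time, restored to time order
        nearest = np.argsort(np.abs(csi_ts - t), kind="stable")[:per_frame]
        groups.append(np.sort(nearest))
    return np.stack(groups)

# Assumed example: 1 s of data with timestamps in milliseconds
csi_ts = np.arange(0, 1000, 10)      # 100 packets, one every 10 ms
frame_ts = np.arange(20, 1000, 50)   # 20 frames, one every 50 ms
groups = group_csi_by_frame(csi_ts, frame_ts)
```

Each row of `groups` indexes the five CSI samples that would form one input segment for the CSI-2D network.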
6. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step five, the CSI-2D model training process is as follows:
a learning rate of 0.06 is used for 40 epochs of training; the data set is divided into training set : validation set = 2:8; the Adam optimizer is used, with the offset set to 0.9;
the training goal of the CSI-2D model is to reduce the difference from the key point heat maps predicted by the vision-based network model, and the loss L is composed of multiple losses:
$L = \lambda_1 L_{box} + \lambda_2 L_{JHMs}$
wherein the box loss L_box is calculated using a two-class (binary) loss, and the joint position error L_JHMs is calculated using the L2 loss (squared loss) multiplied by element weights; the box loss weight coefficient λ1 is set to 0.1 and the joint position error loss weight coefficient λ2 is set to 1;
L_JHMs is calculated as follows:
Figure FDA0003350828250000023
in the formula,
Figure FDA0003350828250000024
is the MSE loss formula,
Figure FDA0003350828250000025
denotes the video label data matrix,
Figure FDA0003350828250000026
denotes the matrix generated by the model, and
Figure FDA0003350828250000027
denotes the element weights, for which an attention mechanism is implemented by using Matthew weights:
Figure FDA0003350828250000028
where 𝕀(·) is defined as follows: when
Figure FDA0003350828250000029
the value of the function is +1, and when
Figure FDA00033508282500000210
the value of the function is −1;
the CSI-2D network first up-samples the input CSI tensor, then passes it through residual blocks and U-Nets followed by a pooling layer; the box maps and JHMs are output by the same network model, the pooled output being passed through two convolutional layers and a BN layer to produce the output tensors; the output multi-scale box maps and heat maps are then processed to extract the two-dimensional key points of each person, as follows:
the human body target boxes segmented at each scale of the multi-scale box maps are evaluated to obtain the best target detection box for each human body; the heat maps are then multiplied by the target box of a single human body to extract the heat maps of that human body; finally, the key point coordinates are regressed using the argmax function:
Figure FDA0003350828250000031
wherein j denotes the key point index and p denotes the person index,
Figure FDA0003350828250000032
denotes the regressed coordinate position of the j-th key point of the p-th person,
Figure FDA0003350828250000033
denotes the heat map of the j-th key point, and
Figure FDA0003350828250000034
denotes the target detection box map of the p-th person.
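The final extraction step, masking each key point heat map with one person's box map and taking the argmax, can be sketched in NumPy as below. Selecting the best box across scales is omitted, and the array shapes, binary masks and function name are assumptions for illustration.

```python
import numpy as np

def extract_keypoints(heatmaps, box_masks):
    """Regress per-person key point coordinates.

    heatmaps:  (J, H, W) array, one heat map per key point.
    box_masks: (P, H, W) array, one binary box map per person.
    Returns a (P, J, 2) integer array of (x, y) coordinates.
    """
    num_joints, height, width = heatmaps.shape
    coords = np.zeros((len(box_masks), num_joints, 2), dtype=int)
    for p, mask in enumerate(box_masks):
        masked = heatmaps * mask  # keep only responses inside this person's box
        flat = masked.reshape(num_joints, -1).argmax(axis=1)
        ys, xs = np.unravel_index(flat, (height, width))
        coords[p, :, 0] = xs
        coords[p, :, 1] = ys
    return coords

# Assumed example: one key point type, two people side by side
heatmaps = np.zeros((1, 16, 32))
heatmaps[0, 5, 5] = 1.0      # left person's key point at (x=5, y=5)
heatmaps[0, 10, 25] = 0.9    # right person's key point at (x=25, y=10)
box_masks = np.zeros((2, 16, 32))
box_masks[0, :, :16] = 1.0   # left half belongs to person 0
box_masks[1, :, 16:] = 1.0   # right half belongs to person 1
coords = extract_keypoints(heatmaps, box_masks)
```

The masking is what lets a single shared heat map per key point type serve several people: each person's box suppresses the peaks that belong to everyone else.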
7. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step six, the multi-person 3D coordinates are generated from the multi-person two-dimensional coordinates as follows: each single-person two-dimensional coordinate sequence is normalized and input into the 2D-3D model to generate single-person three-dimensional coordinates; the single-person two-dimensional coordinates and single-person three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in the same three-dimensional coordinate system.
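The source does not say which normalization is applied to the two-dimensional coordinates. One common choice for 2D-to-3D lifting models, shown here purely as an assumption, maps the image width to the range [-1, 1] while preserving the aspect ratio; the function name and image size are hypothetical.

```python
import numpy as np

def normalize_screen_coordinates(xy, width, height):
    """Map pixel coordinates so that x spans [-1, 1] and the aspect ratio is kept."""
    xy = np.asarray(xy, dtype=float)
    return xy / width * 2.0 - np.array([1.0, height / width])

# Assumed example: image corners of a 640 x 480 frame
corners = np.array([[0.0, 0.0], [640.0, 480.0]])
normalized = normalize_screen_coordinates(corners, width=640, height=480)
```

Normalizing in this way makes the input to the 2D-3D model independent of the camera resolution.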
CN202111336826.1A 2021-11-12 2021-11-12 Multi-person three-dimensional attitude estimation method based on wireless signals Pending CN114219853A (en)

Publications (1)

Publication Number Publication Date
CN114219853A true CN114219853A (en) 2022-03-22

Family

ID=80696971

Country Status (1)

Country Link
CN (1) CN114219853A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581958A (en) * 2022-05-06 2022-06-03 南京邮电大学 Static human body posture estimation method based on CSI signal arrival angle estimation
WO2023213051A1 (en) * 2022-05-06 2023-11-09 南京邮电大学 Static human body posture estimation method based on csi signal angle-of-arrival estimation
CN114999002A (en) * 2022-08-04 2022-09-02 松立控股集团股份有限公司 Behavior recognition method fusing human body posture information
CN115412188A (en) * 2022-08-26 2022-11-29 福州大学 Power distribution station room operation behavior identification method based on wireless sensing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination