CN114219853A - Multi-person three-dimensional attitude estimation method based on wireless signals - Google Patents
- Publication number
- CN114219853A (application number CN202111336826.1A)
- Authority
- CN
- China
- Prior art keywords
- csi
- dimensional
- human body
- signal
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/74—Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/08—Projecting images onto non-planar surfaces, e.g. geodetic screens
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/02—Preprocessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
A multi-person three-dimensional pose estimation method based on wireless signals comprises the following steps: step one, a WiFi device collects Channel State Information (CSI) signals while a camera records the corresponding video for supervision; step two, the video is processed by AlphaPose, and the output human bounding boxes and human keypoints are processed to generate heatmaps and bounding-box maps; step three, the CSI signals are preprocessed; step four, the preprocessed CSI signals are matched to the video frames, and every five CSI samples are input into a CSI-2D network as one segment of data for training; step five, the heatmaps and bounding-box maps that the CSI-2D model outputs for the CSI signals are processed, and the two-dimensional pose of each person is regressed; and step six, each person's two-dimensional poses are arranged by video frame and input into a 2D-3D model, and the multi-person three-dimensional pose is generated in combination with the two-dimensional coordinates. The invention has lower cost and higher precision.
Description
Technical Field
The invention belongs to the technical field of wireless communication and relates to a method for estimating the three-dimensional poses of multiple persons based on wireless channel state information.
Background
Camera-based human pose estimation, both two-dimensional and three-dimensional, is quite mature, but the camera as a conventional sensor is clearly limited by illumination, occlusion and background, and it raises privacy problems.
Wireless sensing has great potential for sensing the human body. In recent years, research on wireless human sensing has produced results in behavior recognition, respiration detection, target localization, crowd counting and other areas. There are two basic approaches to human sensing based on wireless signals: one is device-based and requires the person to wear or carry a device or sensor; the other is device-free and uses sensing elements placed in the environment to monitor human activity without requiring the person to carry any device or sensor. Device-based approaches, while generally accurate, are impractical or inconvenient in many important real-life scenarios, for example when elderly people or dementia patients would have to carry the device at all times. Device-free human sensing offers significant advantages in these scenarios.
Many studies show that wireless signals can accomplish two-dimensional and three-dimensional pose estimation well under the supervision of a video model. Zhao et al. proposed RF-Pose, an encoder-decoder deep learning architecture for human pose estimation. Its input comes from an antenna array that emits frequency-modulated continuous waves (FMCW); although this allows the distance from the object to the device to be measured, the hardware is expensive and not widely available compared with commercial WiFi. Wang et al. were the first to perform human pose estimation on CSI signals collected by commercial WiFi, obtaining human masks, two-dimensional joint points and Part Affinity Fields (PAFs). Guo et al. also used deep learning to obtain the human skeleton from CSI signals received from multiple directions. These methods obtain only two-dimensional poses. Jiang et al. proposed a method for obtaining three-dimensional poses from CSI signals, training the model on CSI with labels from a VICON system (a motion-capture camera system); however, learning the three-dimensional pose model directly end to end places a heavy burden on learning, the accuracy is not high, and the method is still limited to motions performed in place.
In summary, the problems of the prior art are as follows: CSI-based human pose estimation mostly remains two-dimensional, and three-dimensional pose estimation has been performed only for a single person who stays in place, with limited accuracy. In addition, the CSI signal is strongly affected by environmental interference, and handling this interference is a difficult point. The difficulties in solving the above problems are: how to train a fine-grained three-dimensional human pose model from CSI data; how to realize pose estimation with multiple persons; and how to eliminate the effect of environmental noise.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a multi-person three-dimensional pose estimation method based on wireless signals that has lower cost and higher precision.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A multi-person three-dimensional pose estimation method based on wireless signals comprises the following steps:
step one, using WiFi equipment to collect Channel State Information (CSI) signals while using a camera to record the corresponding video for supervision;
step two, processing the video with AlphaPose, and processing the output human bounding boxes and human keypoints to generate heatmaps and bounding-box maps, which serve as labels for training on the wireless signals;
step three, preprocessing the collected signals, including eliminating the phase offset between two antennas, removing outliers and environmental noise from the CSI signals, and removing the static direct-current component;
step four, matching the preprocessed CSI signals to the video frames, and inputting every five CSI samples into a CSI-2D network as one segment of data for training;
step five, processing the heatmaps and bounding-box maps that the CSI-2D model outputs for the CSI signals, and regressing the two-dimensional pose of each person;
and step six, arranging each person's two-dimensional poses by video frame, inputting them into the 2D-3D model, and generating the multi-person three-dimensional pose in combination with the two-dimensional coordinates.
Further, the process of the step one is as follows:
1.1 Transceiver equipment and a camera are arranged in a conference room. Two laptops fitted with Intel 5300 network cards are used for the data transmission and reception experiments, with 6 directional antennas divided into two groups of 3; the spacing between antennas in the same group is 20 cm, so that each group forms a WiFi-router-like device. One group serves as the transmitter (T) and the other as the receiver (R). The WiFi equipment transmits at a rate of 100 Hz on 30 subcarriers of different frequencies; the attenuation and phase change of the signals at the different frequencies reveal information about the propagation path at different scales. The receiving end receives a 3 × 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target;
1.2 The corresponding video is recorded with the camera as follows: a monocular RGB camera records video at 20 FPS; four volunteers participate in data acquisition, and multi-person and single-person data are collected separately; the actions include waving, clapping, walking, kicking, squatting, jumping, punching and shaking hands; timestamps are saved so that the video can be matched to the CSI signal.
Still further, in step two, the process of generating the heatmaps and bounding-box maps from the AlphaPose output is as follows: the video frames are input into the AlphaPose model to generate the position coordinates of 17 human keypoints and of the human bounding boxes; the keypoint coordinates are converted into a heatmap tensor by Gaussian blurring; the bounding-box coordinates undergo a multi-scale transformation using bounding boxes at 4 scales, and the transformed boxes are placed on separate maps to generate a tensor. The two tensors serve as labels that supervise the learning of the CSI-2D model.
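The label generation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the Gaussian width `sigma`, the map size and the four scale factors are hypothetical, since the text gives only the number of keypoints (17) and the number of scales (4).

```python
import numpy as np

def keypoint_heatmap(x, y, height, width, sigma=2.0):
    """Return a (height, width) map with a Gaussian peak at pixel (x, y)."""
    cols = np.arange(width)[None, :]
    rows = np.arange(height)[:, None]
    return np.exp(-((cols - x) ** 2 + (rows - y) ** 2) / (2.0 * sigma ** 2))

def box_map(box, height, width, scale=1.0):
    """Rasterize a box (x0, y0, x1, y1) resized about its centre by `scale`."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w, half_h = scale * (x1 - x0) / 2.0, scale * (y1 - y0) / 2.0
    cols = np.arange(width)[None, :]
    rows = np.arange(height)[:, None]
    inside = (np.abs(cols - cx) <= half_w) & (np.abs(rows - cy) <= half_h)
    return inside.astype(np.float32)

# Toy inputs: two keypoints and one person box on a 48 x 64 map.
keypoints = [(10, 20), (30, 12)]            # (x, y) pixel coordinates
heatmaps = np.stack([keypoint_heatmap(x, y, 48, 64) for x, y in keypoints])

scales = (0.5, 0.75, 1.0, 1.25)             # assumed values for the 4 scales
boxes = np.stack([box_map((8, 6, 36, 40), 48, 64, s) for s in scales])
```

Each keypoint gets its own heatmap channel and each scale its own box channel, so the two stacked arrays play the role of the two label tensors.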
Furthermore, in step three, the acquired data are preprocessed as follows:
3.1 Eliminating the phase offset between the two antennas: conjugate multiplication of the two antennas' CSI, $H_1(f,t)\,\overline{H_2(f,t)}$, is used to eliminate the phase offset,
wherein $H_1(f,t)$ is the channel state information of antenna 1, $\overline{H_2(f,t)}$ is the conjugate of the channel state information of antenna 2, $H_{1,S}(f)$ and $H_{2,S}(f)$ are their respective static-path parts, $K$ and $L$ are the numbers of multipath components, $\alpha_l(f,t)$ is the amplitude attenuation function on the $l$-th path, and the remaining phase term is a function of the Doppler shift;
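Why conjugate multiplication removes the offset can be checked numerically. The sketch below assumes that both antennas sit on the same receiver and therefore share the same random per-packet phase offset; that signal model and all variable names are assumptions, not details taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_packets, n_subcarriers = 200, 30

def random_channel():
    """Toy complex channel response, one row per packet."""
    shape = (n_packets, n_subcarriers)
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

h1_true, h2_true = random_channel(), random_channel()

# Assumed model: one unknown phase offset per packet, common to both antennas.
offset = np.exp(-1j * rng.uniform(0.0, 2.0 * np.pi, size=(n_packets, 1)))
h1_meas = h1_true * offset
h2_meas = h2_true * offset

# offset * conj(offset) = 1, so the shared offset cancels in the product.
cm = h1_meas * np.conj(h2_meas)
```

The product `cm` equals `h1_true * np.conj(h2_true)` up to rounding, so the random offset no longer appears in the data passed to later steps.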
3.2 eliminating the noise of the environment;
First, the most obvious outliers of the original signal are removed with a Hampel outlier filter. After the outliers are removed, high-frequency noise is filtered out with a high-performance filter, using a soft-threshold method based on the nonlinear wavelet transform:
$$\eta(w) = \begin{cases} \operatorname{sgn}(w)\,(|w| - thr), & |w| \ge thr \\ 0, & |w| < thr \end{cases}$$
wherein, when the absolute value of the wavelet coefficient $w$ is greater than or equal to the given threshold $thr$, the threshold is subtracted from the absolute value of $w$ and the result is multiplied by the sgn function; when the absolute value of $w$ is less than $thr$, the wavelet coefficient is set to 0.
The threshold is selected by a heuristic threshold rule: when the signal-to-noise ratio is small, denoising uses the unbiased likelihood estimation principle, and when the signal-to-noise ratio is large, denoising uses the fixed threshold method;
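The soft-threshold rule itself is small enough to write out. This sketch covers only the shrinkage of coefficients that are already available; the wavelet decomposition and the heuristic threshold selection are not shown, and the function name is hypothetical.

```python
import numpy as np

def soft_threshold(w, thr):
    """Shrink coefficients toward zero: sgn(w) * (|w| - thr) if |w| >= thr, else 0."""
    w = np.asarray(w, dtype=float)
    return np.sign(w) * np.maximum(np.abs(w) - thr, 0.0)

coeffs = np.array([-3.0, -0.5, 0.0, 0.8, 2.0])   # toy wavelet coefficients
shrunk = soft_threshold(coeffs, thr=1.0)
```

Coefficients with magnitude below the threshold vanish, and the remaining ones move toward zero by exactly `thr`, which is what distinguishes soft from hard thresholding.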
3.3 removing static direct current components;
the static direct current component is large but does not contain the required human motion information, and the signal mean is subtracted to eliminate the static variable.
In step four, the preprocessed CSI signals are matched to the video frames as follows: the video frame rate is 20 FPS and the CSI transmission rate is 100 Hz, so each video frame corresponds to 100 / 20 = 5 CSI samples; timestamps are recorded while the video and the CSI signals are collected and are used to match the two.
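A minimal sketch of the timestamp matching follows. It assumes that each frame takes the five CSI packets closest to it in time; the text states only that timestamps are used, so this nearest-neighbour rule and the function name are assumptions.

```python
import numpy as np

def group_csi_by_frame(csi_times, frame_times, per_frame=5):
    """For each frame time, return the indices of the nearest CSI packets."""
    csi_times = np.asarray(csi_times, dtype=float)
    groups = []
    for t in frame_times:
        nearest = np.argsort(np.abs(csi_times - t), kind="stable")[:per_frame]
        groups.append(np.sort(nearest))
    return np.stack(groups)

csi_times = np.arange(100) / 100.0           # 100 Hz: one packet every 10 ms
frame_times = np.arange(20) / 20.0 + 0.02    # 20 FPS, centred in each 50 ms slot
idx = group_csi_by_frame(csi_times, frame_times)
```

`idx` has one row per video frame and five packet indices per row, which matches the five-sample segments fed to the CSI-2D network.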
In the fifth step, the CSI-2D model training process is as follows:
a learning rate of 0.06 is used and training runs for 40 epochs; the data set is divided into a training set and a validation set in a 2:8 ratio; the Adam optimizer is used, with the offset (β1) set to 0.9;
the goal of training the CSI-2D model is to reduce the difference from the keypoint heatmaps predicted by the vision-based network model, and the total loss $L$ is made up of multiple losses:
$$L = \lambda_1 L_{box} + \lambda_2 L_{JHMs}$$
wherein the box loss $L_{box}$ is calculated with a binary classification loss, and the joint position error $L_{JHMs}$ is calculated with the L2 loss (squared loss) multiplied by element weights; the box loss weight coefficient $\lambda_1$ is set to 0.1 and the joint position error loss weight coefficient $\lambda_2$ is set to 1;
$L_{JHMs}$ is calculated as follows:
$$L_{JHMs} = \frac{1}{N}\sum_{i=1}^{N} W_i\,\big(\hat{Y}_i - Y_i\big)^2$$
in the formula, the mean of the squared differences is the MSE loss, $Y$ is the video label data matrix, $\hat{Y}$ is the matrix generated by the model, and $W$ holds the element weights; an attention mechanism is implemented by using Matthew weights;
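The combined loss can be sketched in NumPy as below. Two points are assumptions rather than statements from the text: the binary classification loss is taken to be binary cross-entropy, and the Matthew weight is taken to have the form `k * |y| + b` (the experiments mention only that k = 1 and b = 1). All function names are hypothetical.

```python
import numpy as np

def matthew_weight(label, k=1.0, b=1.0):
    """Assumed weight form: emphasise elements where the label heatmap is large."""
    return k * np.abs(label) + b

def jhm_loss(pred, label, k=1.0, b=1.0):
    """Element-weighted mean squared error between predicted and label heatmaps."""
    return float(np.mean(matthew_weight(label, k, b) * (pred - label) ** 2))

def box_loss(pred_prob, label, eps=1e-7):
    """Binary cross-entropy (assumed choice of binary classification loss)."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    return float(-np.mean(label * np.log(p) + (1.0 - label) * np.log(1.0 - p)))

def total_loss(box_pred, box_label, jhm_pred, jhm_label, lam1=0.1, lam2=1.0):
    """L = lam1 * L_box + lam2 * L_JHMs with the weights given in the text."""
    return lam1 * box_loss(box_pred, box_label) + lam2 * jhm_loss(jhm_pred, jhm_label)

# Toy tensors: 4 box scales and 17 joint heatmaps on an 8 x 8 grid.
box_label = np.zeros((4, 8, 8)); box_label[:, 2:6, 2:6] = 1.0
jhm_label = np.zeros((17, 8, 8)); jhm_label[:, 4, 4] = 1.0
loss_perfect = total_loss(box_label, box_label, jhm_label, jhm_label)
loss_blank = total_loss(np.full_like(box_label, 0.5), box_label,
                        np.zeros_like(jhm_label), jhm_label)
```

With this weight form, an error at a keypoint peak costs twice as much as the same error on the background, which is one way to focus training on the sparse peaks.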
the input CSI tensor is first up-sampled, then passed through residual blocks and U-Nets, and then through a pooling layer; because the box maps and the JHMs are output by the same network model, the pooled output passes through two convolutional layers and a BN layer to output the prediction tensor; the output multi-scale box maps and heatmaps are then processed to extract the two-dimensional keypoints of each person, as follows:
the human bounding boxes segmented at each scale in the multi-scale box maps are evaluated to obtain the best detection box for each person; each heatmap is then multiplied by the single-person bounding-box map to extract the heatmap of that person; finally, the keypoint coordinates are regressed with the argmax function:
$$(x, y)_j^p = \arg\max\big(H_j \odot B_p\big)$$
wherein $j$ is the keypoint index, $p$ is the person index, $(x, y)_j^p$ is the regressed coordinate position of the $j$-th keypoint of the $p$-th person, $H_j$ is the heatmap of the $j$-th keypoint, and $B_p$ is the detection-box map of the $p$-th person.
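The masking and argmax step can be illustrated as follows. The sketch assumes that the best box per person has already been selected and rasterized as a binary mask; the function name and array layout are hypothetical.

```python
import numpy as np

def extract_keypoints(heatmaps, person_boxes):
    """heatmaps: (J, H, W); person_boxes: (P, H, W) binary masks.

    Returns a (P, J, 2) integer array of (x, y) pixel coordinates.
    """
    n_joints, height, width = heatmaps.shape
    coords = np.zeros((person_boxes.shape[0], n_joints, 2), dtype=int)
    for p, box in enumerate(person_boxes):
        masked = heatmaps * box[None, :, :]            # keep only this person's region
        flat = masked.reshape(n_joints, -1).argmax(axis=1)
        rows, cols = np.unravel_index(flat, (height, width))
        coords[p, :, 0], coords[p, :, 1] = cols, rows
    return coords

# One joint heatmap containing peaks from two people.
heat = np.zeros((1, 16, 32))
heat[0, 5, 5] = 1.0      # person 0 at (x=5, y=5)
heat[0, 8, 20] = 0.9     # person 1 at (x=20, y=8)
boxes = np.zeros((2, 16, 32))
boxes[0, :, :16] = 1.0   # person 0 occupies the left half
boxes[1, :, 16:] = 1.0   # person 1 occupies the right half
coords = extract_keypoints(heat, boxes)
```

Without the mask, a single argmax would return only the stronger peak; multiplying by each person's box is what separates the two people.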
In step six, the multi-person 3D coordinates are generated from the multi-person two-dimensional coordinates as follows: each single-person two-dimensional coordinate sequence is normalized and input into the 2D-3D model to generate single-person three-dimensional coordinates; the single-person two-dimensional and three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in one common three-dimensional coordinate system.
The invention provides a novel method for estimating multi-person three-dimensional poses from CSI signals: the signals are preprocessed; the two-dimensional poses and human bounding boxes of the wireless network are trained jointly under the supervision of a video-based two-dimensional pose estimation deep learning model; and the output two-dimensional poses, arranged by video frame, are passed through a pre-trained three-dimensional pose model to output a movable three-dimensional pose skeleton.
Using WiFi signals for human pose estimation has good market applications, including activity detection and daily-life assistance for patients or elderly people with limited mobility, and WiFi signals avoid some privacy problems; the method can be applied to games, animation and AR in cluttered environments; in addition, it can detect people in dim environments. It meets the requirements of low cost and high precision.
The invention has the following beneficial effects:
1. two-dimensional and three-dimensional multi-person pose estimation is realized with commercial WiFi; the method is low in cost, needs only WiFi equipment, and is high in precision;
2. the multi-scale bounding boxes solve the technical problem that one person's keypoints fall into another person's space when multiple persons interact, and improve accuracy in the multi-person segmentation process;
3. in generating the three-dimensional pose, the three-dimensional part combines the two-dimensional joint points, the positions and the video frames to smooth the motion, which improves the precision of the model and reduces the burden of learning the model directly end to end;
4. for two-dimensional pose recognition, the method abandons Part Affinity Fields (PAFs) and instead uses human bounding boxes, which improves the real-time performance of the model; the invention collects a data set of 8 actions, including waving, clapping, walking, kicking, shaking hands, squatting, jumping and punching, performed by four volunteers, with the video and CSI data divided into single-person and multi-person sets.
Drawings
FIG. 1 is a flow chart of a method for three-dimensional attitude estimation based on channel state information according to an embodiment of the present invention;
fig. 2 is a flow chart of a pre-processing of a raw CSI signal according to an embodiment of the present invention;
fig. 3 is a network structure diagram of a CSI-2D module.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1 to 3, a multi-person three-dimensional pose estimation method based on wireless signals includes the following steps:
step one, using WiFi equipment to collect Channel State Information (CSI) signals while using a camera to record the corresponding video for supervision;
step two, processing the video with AlphaPose, and processing the output human bounding boxes and human keypoints to generate heatmaps and bounding-box maps, which serve as labels for training on the wireless signals;
step three, preprocessing the collected signals, including eliminating the phase offset between two antennas, removing outliers and environmental noise from the CSI signals, and removing the static direct-current component;
step four, matching the preprocessed CSI signals to the video frames, and inputting every five CSI samples into a CSI-2D network as one segment of data for training;
step five, processing the heatmaps and bounding-box maps that the CSI-2D model outputs for the CSI signals, and regressing the two-dimensional pose of each person;
and step six, arranging each person's two-dimensional poses by video frame, inputting them into the 2D-3D model, and generating the multi-person three-dimensional pose in combination with the two-dimensional coordinates.
As shown in fig. 1, the multi-person three-dimensional pose estimation method based on wireless channel state information provided in the embodiment of the present invention includes the following steps:
Step one, using WiFi equipment to collect Channel State Information (CSI) signals, as follows:
the WiFi equipment transmits at a rate of 100 Hz on 30 subcarriers of different frequencies; the attenuation and phase change of the signals at the different frequencies reveal information about the propagation path at different scales, and the receiving end receives a 3 × 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target.
The corresponding video is recorded with a camera as follows:
a monocular RGB camera records video at 20 FPS; four volunteers participate in data acquisition, and multi-person and single-person data are collected separately; the actions include waving, clapping, walking, kicking, squatting, jumping, punching, shaking hands and the like; timestamps are saved so that the video can be matched to the CSI signal.
Step two, generating the heatmaps and bounding-box maps from the AlphaPose output, as follows:
the video frames are input into the AlphaPose model to generate the position coordinates of 17 human keypoints and of the human bounding boxes; the keypoint coordinates are converted into a heatmap tensor by Gaussian blurring; the bounding-box coordinates undergo a multi-scale transformation, with bounding boxes at 4 scales used in this method, and the transformed boxes are placed on separate maps to generate a tensor; the two tensors serve as labels that supervise the learning of the CSI-2D model.
Step three, preprocessing the acquired data, as follows:
3.1 Eliminating the phase offset between the two antennas: conjugate multiplication of the two antennas' CSI, $H_1(f,t)\,\overline{H_2(f,t)}$, is used to eliminate the phase offset,
wherein $H_1(f,t)$ is the channel state information of antenna 1, $\overline{H_2(f,t)}$ is the conjugate of the channel state information of antenna 2, $H_{1,S}(f)$ and $H_{2,S}(f)$ are their respective static-path parts, $K$ and $L$ are the numbers of multipath components, $\alpha_l(f,t)$ is the amplitude attenuation function on the $l$-th path, and the remaining phase term is a function of the Doppler shift;
3.2 Eliminating environmental noise;
first, the most obvious outliers of the original signal are removed with a Hampel outlier filter; after the outliers are removed, high-frequency noise is filtered out with a high-performance filter, using a soft-threshold method based on the nonlinear wavelet transform:
$$\eta(w) = \begin{cases} \operatorname{sgn}(w)\,(|w| - thr), & |w| \ge thr \\ 0, & |w| < thr \end{cases}$$
when the absolute value of the wavelet coefficient $w$ is greater than or equal to the given threshold $thr$, the threshold is subtracted from the absolute value of $w$ and the result is multiplied by the sgn function; when the absolute value of $w$ is less than $thr$, the wavelet coefficient is set to 0;
the threshold is selected by a heuristic threshold rule: when the signal-to-noise ratio is small, denoising uses the unbiased likelihood estimation principle, and when the signal-to-noise ratio is large, denoising uses the fixed threshold method;
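For the outlier-removal stage, a Hampel filter can be written directly in NumPy. This is a generic textbook form, not necessarily the configuration used in the experiments: the window half-width and the 3-sigma rule are assumed defaults.

```python
import numpy as np

def hampel(x, half_window=3, n_sigmas=3.0):
    """Replace samples that deviate too far from the local median by that median."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    k = 1.4826  # scales the median absolute deviation to a std for Gaussian data
    for i in range(len(x)):
        lo, hi = max(0, i - half_window), min(len(x), i + half_window + 1)
        window = x[lo:hi]
        med = np.median(window)
        mad = k * np.median(np.abs(window - med))
        if np.abs(x[i] - med) > n_sigmas * mad:
            y[i] = med
    return y

# Toy amplitude stream: one sine period with a single spike at index 20.
amplitude = np.sin(np.linspace(0.0, 2.0 * np.pi, 50))
amplitude[20] += 10.0
cleaned = hampel(amplitude)
```

Because the decision uses the median and the median absolute deviation, one spike inside the window barely shifts the statistics, so the spike is replaced while ordinary samples are left alone.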
3.3 Removing the static direct-current component;
the static direct-current component is large but does not contain the required human motion information; it is eliminated by subtracting the signal mean;
Step four, matching the preprocessed CSI signals to the video frames, as follows:
the video frame rate is 20 FPS and the CSI transmission rate is 100 Hz, so each video frame corresponds to 100 / 20 = 5 CSI samples; timestamps are recorded while the video and the CSI signals are collected and are used to match the two;
Step five, training the CSI-2D model, as follows:
a learning rate of 0.06 is used and training runs for 40 epochs; the data set is divided into a training set and a validation set in a 2:8 ratio; the Adam optimizer is used, with the offset (β1) set to 0.9;
the goal of training the CSI-2D model is to reduce the difference from the keypoint heatmaps predicted by the vision-based network model, and the total loss $L$ is made up of multiple losses:
$$L = \lambda_1 L_{box} + \lambda_2 L_{JHMs}$$
wherein the box loss $L_{box}$ is calculated with a binary classification loss, and the joint position error $L_{JHMs}$ is calculated with the L2 loss (squared loss) multiplied by element weights; the box loss weight coefficient $\lambda_1$ is set to 0.1 and the joint position error loss weight coefficient $\lambda_2$ is set to 1;
$L_{JHMs}$ is calculated as follows:
$$L_{JHMs} = \frac{1}{N}\sum_{i=1}^{N} W_i\,\big(\hat{Y}_i - Y_i\big)^2$$
in the formula, the mean of the squared differences is the MSE loss, $Y$ is the video label data matrix, $\hat{Y}$ is the matrix generated by the model, and $W$ holds the element weights; an attention mechanism is implemented by using Matthew weights;
The input CSI tensor is first up-sampled, then passed through residual blocks and U-Nets, and then through a pooling layer; because the box maps and the JHMs are output by the same network model, the pooled output passes through two convolutional layers and a BN layer to output the prediction tensor. The output multi-scale box maps and heatmaps are then processed to extract the two-dimensional keypoints of each person, as follows:
the human bounding boxes segmented at each scale in the multi-scale box maps are evaluated to obtain the best detection box for each person; each heatmap is then multiplied by the single-person bounding-box map to extract the heatmap of that person; finally, the keypoint coordinates are regressed with the argmax function:
$$(x, y)_j^p = \arg\max\big(H_j \odot B_p\big)$$
wherein $j$ is the keypoint index, $p$ is the person index, $(x, y)_j^p$ is the regressed coordinate position of the $j$-th keypoint of the $p$-th person, $H_j$ is the heatmap of the $j$-th keypoint, and $B_p$ is the detection-box map of the $p$-th person;
Step six, generating the multi-person 3D coordinates from the multi-person two-dimensional coordinates, as follows:
each single-person two-dimensional coordinate sequence is normalized and input into the 2D-3D model to generate single-person three-dimensional coordinates; the single-person two-dimensional and three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in one common three-dimensional coordinate system.
The experimental results of the present invention are described in detail below:
Firstly, experimental environment and data acquisition: the transceiving equipment and a camera are arranged in a 10 m-15 m conference room. Two notebook computers equipped with Intel 5300 network cards are used for the data transceiving experiments. Six directional antennas are used, divided into two groups of three to form WiFi-router-like devices; one group serves as the transmitter (T) and the other as the receiver (R), and the distance between the transmitting and receiving antennas is 6 m. The transmitted signal is centered at 5 GHz and is an Orthogonal Frequency Division Multiplexing (OFDM) WiFi signal conforming to the 802.11n WiFi communication standard; 30 subcarriers are recorded and the transmission rate is 100 Hz. Because the 30 subcarriers have different frequencies, the attenuation and phase change of each can be measured to obtain information about the propagation path at different scales. The receiving end receives a 3 × 3 × 30 Channel State Information (CSI) signal that has been reflected by and has penetrated the target. While the CSI signal is collected, video frames are recorded at 20 FPS by a monocular RGB camera at the receiving end.
Four volunteers participated in data acquisition. In both multi-person and single-person sessions, the volunteers performed actions between the transmitting and receiving antennas, including waving, clapping, walking, kicking, squatting, jumping, boxing and shaking hands, and timestamps were saved so that the video frame times could be matched to the CSI collection times.
Secondly, data processing: the .dat file obtained by the CSI acquisition tool is preprocessed in MATLAB, specifically: eliminating the phase offset between the two antennas; removing the most obvious outliers of the original signal with a Hampel outlier filter; removing noise with a soft-threshold method based on the nonlinear wavelet transform; and removing the static DC component.
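The patent performs this preprocessing in MATLAB. Purely as an illustration, two of the steps (Hampel outlier removal and static DC removal) might look as follows in Python. The window length, the 3-sigma rule, and the function names `hampel` and `remove_dc` are assumptions for this sketch, not values taken from the patent.

```python
from statistics import mean, median
from typing import List


def hampel(x: List[float], half_window: int = 3, n_sigmas: float = 3.0) -> List[float]:
    """Replace samples that lie far from the local median with that median."""
    k = 1.4826  # scales the MAD to a standard-deviation estimate for Gaussian data
    out = list(x)
    for i in range(len(x)):
        window = x[max(0, i - half_window): i + half_window + 1]
        med = median(window)
        mad = median([abs(v - med) for v in window])
        if abs(x[i] - med) > n_sigmas * k * mad:
            out[i] = med
    return out


def remove_dc(x: List[float]) -> List[float]:
    """Subtract the mean so that the static (DC) component is removed."""
    m = mean(x)
    return [v - m for v in x]


# A CSI amplitude trace on one subcarrier with a single obvious spike.
amplitude = [1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 1.1, 0.9, 1.0]
cleaned = hampel(amplitude)
print(cleaned)  # [1.0, 1.1, 0.9, 1.0, 1.0, 1.0, 1.1, 0.9, 1.0]
centered = remove_dc(cleaned)
```

The spike is replaced by the local median while ordinary fluctuations are left untouched; subtracting the mean afterwards leaves only the variation around the static level.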
Thirdly, model training: the collected CSI data are made into a data set, which is split into a training set and a validation set at a ratio of 2:8. Training uses the Adam optimizer with β1 = 0.9 and β2 = 0.999. K = 1 and b = 1 are used when calculating the Matthew weights (MW). The networks are trained for 40 epochs.
For model evaluation, the two-dimensional and three-dimensional joint points generated by the video model are used as ground truth.
Table 1 shows the per-key-point error of the model on the multi-person data set after alignment with the ground truth in translation, rotation and scale (P-MPJPE); the lower the P-MPJPE, the better.
TABLE 1
The joint indices denote: 1: mid hip; 2: left hip; 3: left knee; 4: left ankle; 5: right hip; 6: right knee; 7: right ankle; 8: spine; 9: chest; 10: neck; 11: head; 12: left shoulder; 13: left elbow; 14: left wrist; 15: right shoulder; 16: right elbow; 17: right wrist.
As can be seen from Table 1, the overall P-MPJPE of the method reaches 30.4 mm, an improvement of 7.5 mm over the 37.9 mm of WiPose.
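For readers unfamiliar with the metric, the following is a minimal sketch of how P-MPJPE is commonly computed: the predicted pose is aligned to the ground truth by the best similarity transform (translation, rotation, scale) before the per-joint distances are averaged. It assumes NumPy is available and uses synthetic joints; it is not the evaluation code used for Table 1.

```python
import numpy as np


def p_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint error after aligning pred to gt by a similarity transform.

    pred, gt: arrays of shape (J, 3) in the same unit (e.g. millimetres).
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p0, g0 = pred - mu_p, gt - mu_g              # remove translation
    u, s, vt = np.linalg.svd(p0.T @ g0)          # SVD of the cross-covariance
    d = np.sign(np.linalg.det(u @ vt))           # guard against a reflection
    flip = np.array([1.0, 1.0, d])
    rot = u @ np.diag(flip) @ vt                 # optimal rotation (row-vector convention)
    scale = (s * flip).sum() / (p0 ** 2).sum()   # optimal isotropic scale
    aligned = scale * p0 @ rot + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())


rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3)) * 100.0            # 17 synthetic joints, millimetres
angle = 0.5
rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle), np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
pred = 1.3 * gt @ rz + np.array([50.0, -20.0, 10.0])
print(p_mpjpe(pred, gt) < 1e-6)  # True: a pure similarity transform is fully removed
```

Because the alignment removes global position, orientation and size, the metric measures only the error in the articulated pose itself.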
The embodiments described in this specification are merely illustrative of implementations of the inventive concepts, which are intended for purposes of illustration only. The scope of the present invention should not be construed as being limited to the particular forms set forth in the examples, but rather as being defined by the claims and the equivalents thereof which can occur to those skilled in the art upon consideration of the present inventive concept.
Claims (7)
1. A multi-person three-dimensional attitude estimation method based on wireless signals, characterized by comprising the following steps:
step one, using WiFi equipment to collect Channel State Information (CSI) signals while using a camera to record the corresponding video for supervision;
step two, processing the video with AlphaPose, and processing the output human body target boxes and human body key points to generate heat maps and target box maps, which are used as labels for training on the wireless signals;
step three, preprocessing the collected signals, including eliminating the phase offset between the two antennas, removing outliers and environmental noise from the CSI signals, and removing the static DC component;
step four, matching the preprocessed CSI signals to the video frames, and inputting every five CSI samples as one segment of data into the CSI-2D network for training;
step five, processing the heat maps and target box maps output by the CSI-2D model from the CSI signals, and regressing the two-dimensional posture of each human body;
step six, arranging the two-dimensional postures of each human body into a frame sequence, inputting the sequence into the 2D-3D model, and generating the multi-person three-dimensional posture by combining the two-dimensional coordinates.
2. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1, wherein the process of step one is as follows:
1.1 transceiving equipment and a camera are arranged in a conference room environment; two notebook computers equipped with Intel 5300 network cards are used for data transceiving experiments; six directional antennas are used, divided into two groups of three with an antenna spacing of 20 cm within each group, to form WiFi-router-like devices, one group serving as the transmitter (T) and the other as the receiver (R); the WiFi device transmits at a rate of 100 Hz on 30 subcarriers of different frequencies, so that the attenuation and phase change of the different-frequency signals can be measured to obtain information about the propagation path at different scales, and the receiving end receives a 3 × 3 × 30 Channel State Information (CSI) signal reflected by and penetrating the target;
1.2 the corresponding video is shot with the camera as follows: a monocular RGB camera records video frames at 20 FPS; four volunteers participate in data acquisition; multi-person and single-person data are acquired separately, with actions including waving, clapping, walking, kicking, squatting, jumping, punching and shaking hands; and timestamps are saved for matching with the CSI signals.
3. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step two, the process of generating the heat maps and target box maps from the AlphaPose output is as follows: the video frames are input into the AlphaPose model to generate the position coordinates of 17 human body key points and of the human body target boxes; the generated key point coordinates are processed by Gaussian blurring to generate a heat map tensor; the generated target box coordinates are transformed at multiple scales, using target boxes at 4 scales, and the transformed boxes are drawn on separate maps to generate a box tensor; the two tensors are used as labels to supervise the learning of the CSI-2D model.
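As an illustration only, the label generation described in this claim could be sketched in Python as below. The Gaussian width `sigma`, the map size, and the function names `keypoint_heatmap` and `box_map` are assumptions; the four box scales used by the method are not specified in the text, so only a single scale is shown.

```python
import math
from typing import List

Grid = List[List[float]]


def keypoint_heatmap(cx: float, cy: float, width: int, height: int, sigma: float = 2.0) -> Grid:
    """Render one key point as a 2-D Gaussian bump whose peak value is 1."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]


def box_map(x1: int, y1: int, x2: int, y2: int, width: int, height: int) -> Grid:
    """Rasterise a box as a binary map: 1.0 inside (bounds inclusive), 0.0 outside."""
    return [
        [1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0 for x in range(width)]
        for y in range(height)
    ]


hm = keypoint_heatmap(cx=3, cy=2, width=8, height=6)
print(hm[2][3])                           # 1.0 at the key-point location
bm = box_map(1, 1, 4, 3, width=8, height=6)
print(sum(sum(row) for row in bm))        # 12.0 pixels inside the box
```

Stacking one such heat map per key point, and one box map per scale, would give the two label tensors that supervise the CSI-2D model.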
4. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step three, the collected data are preprocessed as follows:
3.1 eliminating the phase offset between the two antennas, using conjugate multiplication of the two antennas' CSI:
where H1(f, t) is the channel state information of antenna 1; the second factor is the conjugate of the channel state information of antenna 2; H1,S(f) and H2,S(f) are their respective static path components; K and L are the numbers of multipath components; α_l(f, t) is the amplitude attenuation function on the l-th path; and the remaining term is a function of the Doppler shift;
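Why conjugate multiplication removes the offset can be seen in a small Python sketch. It assumes, as is commonly done for antennas on the same network card, that each packet carries the same unknown phase offset on both antennas; the CSI values and offsets below are synthetic, and `conjugate_multiply` is a name chosen for this example.

```python
import cmath
from typing import List


def conjugate_multiply(h1: List[complex], h2: List[complex]) -> List[complex]:
    """Multiply antenna 1's CSI by the conjugate of antenna 2's CSI, sample by sample."""
    return [a * b.conjugate() for a, b in zip(h1, h2)]


# Offset-free CSI of two antennas for two packets (synthetic values).
true_h1 = [cmath.rect(1.0, 0.3), cmath.rect(1.2, 0.5)]
true_h2 = [cmath.rect(0.8, -0.2), cmath.rect(0.9, 0.1)]

# Each packet adds the same unknown phase offset to both antennas.
offsets = [1.7, -2.4]
meas_h1 = [h * cmath.exp(1j * o) for h, o in zip(true_h1, offsets)]
meas_h2 = [h * cmath.exp(1j * o) for h, o in zip(true_h2, offsets)]

clean = conjugate_multiply(true_h1, true_h2)
measured = conjugate_multiply(meas_h1, meas_h2)
print(all(abs(c - m) < 1e-12 for c, m in zip(clean, measured)))  # True
```

The common factor exp(jθ) on antenna 1 meets exp(-jθ) from the conjugate of antenna 2, so the product is independent of the per-packet offset.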
3.2 eliminating environmental noise:
first, the most obvious outliers of the original signal are removed with a Hampel outlier filter; after the outliers are removed, high-frequency noise is filtered out with a soft-threshold filter based on the nonlinear wavelet transform;
when the absolute value of the wavelet coefficient w is greater than or equal to a given threshold thr, the threshold is subtracted from the absolute value and the result is multiplied by sgn(w); when the absolute value of w is less than thr, the coefficient is set to 0;
the threshold is selected by the heuristic threshold principle: when the signal-to-noise ratio is small, the unbiased likelihood estimation principle is used for denoising, and when the signal-to-noise ratio is large, the fixed threshold method is used;
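The soft-threshold rule stated above can be written directly as a short Python function. This sketch covers only the shrinkage of coefficients; the wavelet decomposition and reconstruction, and the heuristic selection of thr, are not shown and would normally come from a wavelet library. The coefficient values are made up for the example.

```python
from typing import List


def soft_threshold(coeffs: List[float], thr: float) -> List[float]:
    """Shrink each coefficient toward zero by thr; zero those with |w| < thr."""
    out = []
    for w in coeffs:
        if abs(w) >= thr:
            sign = 1.0 if w > 0 else -1.0
            out.append(sign * (abs(w) - thr))
        else:
            out.append(0.0)
    return out


print(soft_threshold([3.0, -2.5, 0.4, -0.1, 1.0], thr=1.0))
# [2.0, -1.5, 0.0, 0.0, 0.0]
```

Small coefficients, which mostly carry noise, are discarded, while large ones are kept but reduced by thr, which avoids the discontinuity of a hard threshold.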
3.3 removing the static DC component:
the static DC component is large but contains none of the required human motion information, so the signal mean is subtracted to eliminate this static component.
5. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step four, the preprocessed CSI signals are matched to the video frames as follows: the video frame rate is 20 FPS and the transmission rate of the CSI signal is 100 Hz; timestamps are recorded when the video and the CSI signal are collected, and the two are matched by these timestamps.
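At 100 Hz CSI and 20 FPS video, each frame interval of 50 ms contains five CSI samples, which agrees with the five-sample segments of claim 1. A minimal Python sketch of such timestamp matching follows; it assumes both streams are stamped in integer milliseconds from a shared clock, and the function name `group_csi_by_frame` and the "samples at or after the frame time" rule are choices made for this example.

```python
from bisect import bisect_left
from typing import Dict, List


def group_csi_by_frame(frame_ts: List[int], csi_ts: List[int], per_frame: int = 5) -> Dict[int, List[int]]:
    """For each frame, pick the per_frame CSI samples at or after its timestamp.

    frame_ts, csi_ts: sorted timestamps in milliseconds from a shared clock (assumed).
    Returns {frame index: [CSI sample indices]}; frames lacking enough samples are skipped.
    """
    groups = {}
    for i, t in enumerate(frame_ts):
        start = bisect_left(csi_ts, t)           # first CSI sample not earlier than the frame
        idx = list(range(start, start + per_frame))
        if idx[-1] < len(csi_ts):
            groups[i] = idx
    return groups


frame_ts = [0, 50, 100]                # 20 FPS: one frame every 50 ms
csi_ts = list(range(0, 150, 10))       # 100 Hz: one sample every 10 ms
print(group_csi_by_frame(frame_ts, csi_ts))
# {0: [0, 1, 2, 3, 4], 1: [5, 6, 7, 8, 9], 2: [10, 11, 12, 13, 14]}
```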
6. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step five, the CSI-2D model is trained as follows:
a learning rate of 0.06 is used and training runs for 40 epochs; the data set is split into a training set and a validation set at a ratio of 2:8; the optimizer is Adam, with the momentum term (β1) set to 0.9;
the goal of training the CSI-2D model is to reduce the difference from the key point heat maps predicted by the vision-based network model, and the calculated loss L is composed of multiple losses:
L = λ1·L_box + λ2·L_JHMs;
where the box (window) loss L_box is calculated using the binary classification loss, and the joint position error L_JHMs is calculated using the L2 loss (squared loss) multiplied by element weights; in addition, the box loss weight coefficient λ1 is set to 0.1 and the joint position error loss weight coefficient λ2 is set to 1;
L_JHMs is calculated as follows:
in this formula, the squared-error term is the MSE loss, computed between the video label data matrix and the matrix generated by the model, and the element weights implement an attention mechanism by using Matthew weights;
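The combined loss can be sketched in Python as follows. Two points are assumptions rather than statements from the text: the "binary classification loss" is taken to be binary cross-entropy, and the Matthew weight is taken to have the linear form k·y + b in the label value y. The maps are flattened to plain lists to keep the example self-contained.

```python
import math
from typing import List


def bce_loss(pred: List[float], target: List[float], eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over all box-map pixels (assumed form of L_box)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def weighted_mse_loss(pred: List[float], target: List[float], k: float = 1.0, b: float = 1.0) -> float:
    """Squared error with a per-element weight; weight = k * target + b is an assumed form."""
    total = 0.0
    for p, t in zip(pred, target):
        total += (k * t + b) * (p - t) ** 2
    return total / len(pred)


def total_loss(box_pred, box_gt, jhm_pred, jhm_gt, lam1: float = 0.1, lam2: float = 1.0) -> float:
    """L = lam1 * L_box + lam2 * L_JHMs with the coefficients stated in the text."""
    return lam1 * bce_loss(box_pred, box_gt) + lam2 * weighted_mse_loss(jhm_pred, jhm_gt)


# Flattened toy maps: the heat-map error near the joint (label 1.0) is weighted more.
loss = total_loss([0.9, 0.1], [1.0, 0.0], [0.7, 0.1], [1.0, 0.0])
print(round(loss, 4))
```

Weighting by the label value makes errors at joint locations, where the heat map is bright, count more than errors in the mostly empty background.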
the input CSI tensor is first up-sampled, then passed through residual blocks and U-Nets and output through a pooling layer; the box maps and the joint heat maps (JHMs) are output by the same network model, the pooled output being passed through two convolutional layers and a BN layer to produce the output tensors; the output multi-scale box maps and heat maps are then processed to extract the two-dimensional key points of each person, as follows:
the human body target boxes segmented at each scale of the multi-scale box maps are evaluated to obtain the best target detection box for each human body; the heat maps are then multiplied by the single-person target box to extract that person's heat maps; and finally the key point coordinates are regressed using the argmax function.
7. The multi-person three-dimensional attitude estimation method based on wireless signals as claimed in claim 1 or 2, wherein in step six, the multi-person three-dimensional coordinates are generated from the multi-person two-dimensional coordinates as follows: the two-dimensional coordinate sequence of each single person is normalized and input into the 2D-3D model to generate that person's three-dimensional coordinates; the single-person two-dimensional and three-dimensional coordinates are then combined to generate the multi-person three-dimensional human skeletons in the same three-dimensional coordinate system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111336826.1A CN114219853A (en) | 2021-11-12 | 2021-11-12 | Multi-person three-dimensional attitude estimation method based on wireless signals |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111336826.1A CN114219853A (en) | 2021-11-12 | 2021-11-12 | Multi-person three-dimensional attitude estimation method based on wireless signals |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114219853A true CN114219853A (en) | 2022-03-22 |
Family
ID=80696971
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111336826.1A Pending CN114219853A (en) | 2021-11-12 | 2021-11-12 | Multi-person three-dimensional attitude estimation method based on wireless signals |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114219853A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114581958A (en) * | 2022-05-06 | 2022-06-03 | 南京邮电大学 | Static human body posture estimation method based on CSI signal arrival angle estimation |
WO2023213051A1 (en) * | 2022-05-06 | 2023-11-09 | 南京邮电大学 | Static human body posture estimation method based on csi signal angle-of-arrival estimation |
CN114999002A (en) * | 2022-08-04 | 2022-09-02 | 松立控股集团股份有限公司 | Behavior recognition method fusing human body posture information |
CN115412188A (en) * | 2022-08-26 | 2022-11-29 | 福州大学 | Power distribution station room operation behavior identification method based on wireless sensing |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111144217B (en) | Motion evaluation method based on human body three-dimensional joint point detection | |
CN109934111B (en) | Fitness posture estimation method and system based on key points | |
CN114219853A (en) | Multi-person three-dimensional attitude estimation method based on wireless signals | |
Gideon et al. | The way to my heart is through contrastive learning: Remote photoplethysmography from unlabelled video | |
CN110472604B (en) | Pedestrian and crowd behavior identification method based on video | |
CN113408508B (en) | Transformer-based non-contact heart rate measurement method | |
Hu et al. | Robust heart rate estimation with spatial–temporal attention network from facial videos | |
Liu et al. | Motion-robust multimodal heart rate estimation using BCG fused remote-PPG with deep facial ROI tracker and pose constrained Kalman filter | |
CN110991559B (en) | Indoor personnel behavior non-contact cooperative sensing method | |
CN110728213A (en) | Fine-grained human body posture estimation method based on wireless radio frequency signals | |
Meng et al. | A video information driven football recommendation system | |
CN110490109A (en) | A kind of online human body recovery action identification method based on monocular vision | |
An et al. | AdaptNet: Human activity recognition via bilateral domain adaptation using semi-supervised deep translation networks | |
Kabir et al. | CSI-IANet: An inception attention network for human-human interaction recognition based on CSI signal | |
Sheu et al. | Improvement of human pose estimation and processing with the intensive feature consistency network | |
CN113743374B (en) | Personnel identity recognition method based on channel state information respiratory perception | |
CN114581958A (en) | Static human body posture estimation method based on CSI signal arrival angle estimation | |
Yadav et al. | YogaTube: a video benchmark for Yoga action recognition | |
CN117058228A (en) | Three-dimensional human body posture estimation method based on low-channel radar | |
CN116626596A (en) | Social intention recognition method and system based on millimeter wave radar | |
Xaviar et al. | Robust Multimodal Fusion for Human Activity Recognition | |
Pang et al. | Device-free activity recognition: A survey | |
Cao et al. | Task-Specific Feature Purifying in Radar-Based Human Pose Estimation | |
CN116524612B (en) | rPPG-based human face living body detection system and method | |
Zheng et al. | Through-Wall Human Pose Reconstruction Based on Cross-Modal Learning and Self-Supervised Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||