CN115171154A - WiFi human body posture estimation algorithm based on Performer-Unet - Google Patents

WiFi human body posture estimation algorithm based on Performer-Unet

Info

Publication number
CN115171154A
Authority
CN
China
Prior art keywords
human body
performer
wifi
attitude
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210749949.6A
Other languages
Chinese (zh)
Inventor
朱艾春
周跃
徐曹洁
张帆
李义丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN202210749949.6A priority Critical patent/CN115171154A/en
Publication of CN115171154A publication Critical patent/CN115171154A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B17/00Monitoring; Testing
    • H04B17/30Monitoring; Testing of propagation channels
    • H04B17/309Measuring or estimating channel quality parameters
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/18Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30Services specially adapted for particular environments, situations or purposes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Electromagnetism (AREA)
  • Image Analysis (AREA)

Abstract

The WiFi human body posture estimation algorithm based on the Performer-Unet collects a human activity video, extracts real posture annotation information containing the coordinates of human skeleton points together with the corresponding CSI data, and inputs them into an artificial neural network for training; the loss between the estimated posture and the annotation is computed, and the network is optimized by gradient descent to obtain a model. The CSI data stream of the monitored scene is then processed by this model to accurately recognize the human posture. The invention introduces a cross-modal technique into human posture recognition and trains a WiFi-based posture recognition algorithm; WiFi devices are low in cost, wide in application range and good at protecting privacy, so the invention greatly expands the application range of posture estimation in many fields and remedies the shortcomings of traditional algorithms.

Description

WiFi human body posture estimation algorithm based on Performer-Unet
Technical Field
The invention relates to a WiFi human body posture estimation algorithm based on a Performer-Unet, and belongs to the field of human body posture estimation based on WiFi signals.
Background
With the development of deep learning, human body posture estimation has been widely applied in fields such as human-computer interaction, motion analysis, virtual reality and security, and has gradually become a research hotspot in computer vision. Traditional human posture estimation methods rely mainly on RGB images or on multimedia sensors. However, out of concern for personal privacy and security, image-capturing devices such as cameras cannot be installed in private areas such as bedrooms and bathrooms, so RGB-image-based posture estimation leaves a large number of blind spots. In addition, a camera is easily affected by lighting conditions such as glare and occlusion, which makes it rather unstable. Most sensor-based posture estimation methods depend on wearable sensors, infrared sensors or radar equipment, all of which require the installation of specialized hardware and suffer from poor flexibility and high cost.
Disclosure of Invention
The invention aims to provide a WiFi human body posture estimation algorithm based on a Performer-Unet, addressing the limited application scenarios of traditional RGB-image-based posture recognition and the limited performance of traditional convolutional neural networks in posture estimation. The invention introduces a cross-modal technique into human posture recognition: a WiFi-based posture recognition algorithm is trained by a high-performance image-based posture recognition algorithm. In addition, the invention designs a distinctive U-shaped multi-head attention network structure, which ensures the accuracy and robustness of the posture estimation.
The technical scheme of the invention is as follows:
a WiFi human body posture estimation algorithm based on Performer-Unet comprises the following steps:
s1, collecting a human body activity video, disassembling the human body activity video into sample image frames of various human body postures, and extracting real posture marking information containing coordinates of human body skeleton points; acquiring a CSI data packet, namely a channel state information sequence, of the sample image frame according to the time stamp;
s2, inputting a sample image frame containing real attitude marking information and a channel state information sequence of the corresponding sample image frame obtained according to a time stamp into an artificial neural network for training;
obtaining human body posture estimation output according to the channel state information sequence, carrying out loss marking on the human body posture estimation output and real posture marking information of a corresponding sample image frame, optimizing an artificial neural network by adopting a gradient descent method until the neural network is converged, finishing training and obtaining a Performer-Unet human body posture estimation model;
and S3, acquiring a CSI (channel state information) data stream of the human body posture to be detected in real time by adopting a WiFi (wireless fidelity) receiving antenna, inputting a Performer-Unet human body posture estimation model, and acquiring the human body posture according to the CSI data stream.
Further, the human postures include: standing, walking, squatting, running and jumping; the real posture annotation information consists of the coordinates of 18 human skeleton points.
Further, step S1 specifically comprises: arranging a WiFi transmitting antenna and a WiFi receiving antenna on either side of the human body, and a monitoring camera on the side of the transmitting antenna, aligned with it, to shoot the video of the moving human body; decomposing the video into sample image frames containing the human postures; and collecting the CSI data stream of the human posture in real time with the WiFi receiving antenna.
Further, in step S1, the real posture annotation information is obtained as follows:
the sample image frame of the human posture is processed by a human posture recognition algorithm Alpha to obtain real posture annotation information P_A containing the human skeleton point coordinates:
P_A = Alpha(I_k)
wherein: k represents the frame number of the sample image frame, I_k represents the k-th sample image frame captured by the camera, and Alpha(·) represents the human posture recognition algorithm;
according to the real posture annotation information P_A, the posture annotation coordinates P_A^j and the confidence C are generated:
{(P_A^j, C_j)} = P_A, j = 1, …, 18
wherein: j represents the number of the skeleton point, P_A^j = (x_j, y_j) represents the coordinates of the j-th skeleton point, and C_j is the confidence of the coordinates of that skeleton point.
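For illustration, the annotation step above can be sketched in Python. The flat (x, y, confidence) output format and the 18-point layout (that of OpenPose's COCO model) are assumptions consistent with the body parts named in the description, not something the patent specifies:

```python
# Hypothetical sketch: unpack the output of the image-based estimator
# Alpha(I_k) into coordinates P_A^j and confidences C_j.
# The 18-point layout below is an ASSUMPTION (OpenPose COCO order),
# chosen to match the body parts listed in the description.
KEYPOINTS = [
    "nose", "neck",
    "r_shoulder", "r_elbow", "r_wrist",
    "l_shoulder", "l_elbow", "l_wrist",
    "r_hip", "r_knee", "r_ankle",
    "l_hip", "l_knee", "l_ankle",
    "r_eye", "l_eye", "r_ear", "l_ear",
]

def split_annotation(p_a):
    """p_a: 18 (x, y, confidence) triples -> ({name: (x, y)}, {name: C_j})."""
    coords = {n: (x, y) for n, (x, y, _) in zip(KEYPOINTS, p_a)}
    confs = {n: c for n, (_, _, c) in zip(KEYPOINTS, p_a)}
    return coords, confs

# toy annotation: point j sits at (j, 2j) with confidence 0.9
p_a = [(float(j), float(2 * j), 0.9) for j in range(18)]
coords, confs = split_annotation(p_a)
assert len(coords) == 18 and coords["neck"] == (1.0, 2.0)
```

The confidences C_j are kept alongside the coordinates because the training loss later weights each skeleton point by them.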
Further, in step S1, the channel state information, i.e. the CSI data stream, is obtained as follows: Matlab is used to divide the CSI data stream into the CSI data packets corresponding to each image frame according to the timestamps.
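The timestamp-based split (performed in Matlab in the patent) can be sketched in Python; the record formats and the assumption that the camera and CSI clocks are synchronized are illustrative:

```python
from bisect import bisect_left

def split_csi_by_frame(frame_times, csi_times, csi_packets):
    """Group CSI packets into per-frame sequences X_k.

    A packet is assigned to frame k if its timestamp falls between the
    capture time of frame k and that of frame k+1 (assumption: both
    clocks are synchronized and both timestamp lists are sorted).
    """
    sequences = []
    for k, t0 in enumerate(frame_times):
        t1 = frame_times[k + 1] if k + 1 < len(frame_times) else float("inf")
        lo = bisect_left(csi_times, t0)   # first packet at or after frame k
        hi = bisect_left(csi_times, t1)   # first packet of frame k+1
        sequences.append(csi_packets[lo:hi])
    return sequences

# toy data: 3 video frames at 30 fps, CSI packets arriving at 100 Hz
frames = [0.0, 1 / 30, 2 / 30]
csi_t = [i / 100 for i in range(10)]
pkts = list(range(10))
seqs = split_csi_by_frame(frames, csi_t, pkts)
assert [len(s) for s in seqs] == [4, 3, 3]
```

Each element of `seqs` corresponds to one channel state information sequence X_k fed to the network during training.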
Further, the training process in step S2 specifically includes:
step 21, constructing a teacher network T(·) and a student network S(·), which store the real posture annotation information of the sample image frames and the CSI data packets of the sample image frames, respectively;
step 22, the student network S(·) inputs the channel state information sequence X_k corresponding to the k-th sample image frame into the artificial neural network Performer-Unet and generates a posture estimation matrix P_S of the same size as the posture annotation coordinates P_A^j:
P_S = PU(X_k), P_S = {P_S^j}, j = 1, …, 18
wherein: j represents the number of the skeleton point; PU(·) represents processing the channel state information sequence X_k with the model Performer-Unet; P_S^j represents the coordinates of the j-th skeleton point in the posture estimation matrix;
step 23, computing the error L_S(·) of the student network between the posture annotation coordinates P_A^j and the posture estimation matrix P_S^j using the L2 loss:
L_S(·) = Σ_{j=1}^{18} C_j · ||P_A^j − P_S^j||_2^2
wherein: j represents the number of the skeleton point; C_j represents the confidence of the j-th skeleton point; P_A and P_S respectively represent the real posture annotation matrix and the posture estimation matrix; P_A^j and P_S^j respectively represent the coordinates (x_j, y_j) of the j-th skeleton point;
step 24, the student network S(·) back-propagates the computed error as gradients, and the network Performer-Unet is optimized by gradient descent until it converges, completing the training.
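Under the reconstruction above, the confidence-weighted L2 objective of step 23 can be checked numerically with a small NumPy sketch (the exact weighting and reduction used by the patent are assumptions):

```python
import numpy as np

def student_loss(p_a, p_s, conf):
    """L_S = sum_j C_j * ||P_A^j - P_S^j||_2^2 over the 18 skeleton points.

    p_a, p_s: (18, 2) arrays of annotated / estimated (x_j, y_j);
    conf:     (18,) per-point confidences C_j from the image-based teacher,
              down-weighting points the teacher is itself unsure about.
    """
    sq_err = np.sum((p_a - p_s) ** 2, axis=1)  # ||.||_2^2 per skeleton point
    return float(np.sum(conf * sq_err))

rng = np.random.default_rng(0)
p_a = rng.uniform(0, 100, size=(18, 2))
loss_zero = student_loss(p_a, p_a, np.ones(18))       # perfect estimate
loss_off = student_loss(p_a, p_a + 1.0, np.ones(18))  # every point off by (1, 1)
assert loss_zero == 0.0
assert abs(loss_off - 36.0) < 1e-9   # 18 points * (1^2 + 1^2)
```

In training, the gradient of this scalar with respect to the network parameters is what step 24 back-propagates through the Performer-Unet.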
Further, PU(·) represents the Performer-Unet human posture estimation model, which adopts a U-shaped structure with N Performer layers added in the bottommost fusion layer; the processing steps of the model are as follows:
step 22.1, the U-shaped structure model performs three downsamplings on the input channel state information sequence X_k for background semantic extraction, each downsampling comprising one convolution Conv() operation and one pooling Pool() operation, giving the downsampling outputs D_k^(1), D_k^(2), D_k^(3):
D_k^(i) = Pool(Conv(D_k^(i−1))), i = 1, 2, 3, with D_k^(0) = X_k
step 22.2, the sequence information D_k^(3) after the three downsamplings is input into the N Performer layers, and the posture feature sequence F_k is extracted through the multi-head attention mechanism MulAttn():
F_k = MulAttn(D_k^(3))
step 22.3, the extracted posture feature sequence F_k is upsampled to amplify the feature information, giving the feature amplification sequences U_k^(1), U_k^(2), U_k^(3); each upsampling operation Up() comprises a convolution operation Conv() and an interpolation operation Int():
Up(·) = Int(Conv(·))
step 22.4, the upsampled feature amplification sequences are fused in turn, through cross-layer connections, with the corresponding downsampling outputs; the three fusions give the posture prediction sequence Y_k that accounts for both context semantics and feature information:
U_k^(i) = Up(Y_k^(i−1)), Y_k^(i) = U_k^(i) ⊕ D_k^(3−i), i = 1, 2, 3, with Y_k^(0) = F_k and D_k^(0) = X_k
wherein ⊕ denotes the cross-layer fusion;
step 22.5, the posture prediction sequence Y_k^(3) finally output by the Performer-Unet is matched in scale to the posture annotation coordinates P_A^j through a double convolution, giving the posture estimation matrix P_S:
P_S = Conv(Conv(Y_k^(3)))
Further, the number of Performer layers N is 12.
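At the level of tensor shapes, steps 22.1 to 22.5 can be traced with a shape-only NumPy sketch. The channel counts, the stride-2 pooling, the concatenation-based fusion, the 64x64 input layout and the 36-channel output (18 points x 2 coordinates) are all assumptions, and the Performer attention is replaced by a shape-preserving placeholder:

```python
import numpy as np

# Shape-only walkthrough of the U-shaped pipeline; all sizes are assumed.
def conv(x, c_out):            # stand-in for Conv(): changes channels only
    return np.zeros((c_out,) + x.shape[1:])

def pool(x):                   # stand-in for Pool(): stride-2 downsampling
    return x[:, ::2, ::2]

def interp(x):                 # stand-in for Int(): nearest-neighbour 2x upsample
    return x.repeat(2, axis=1).repeat(2, axis=2)

def performer_layers(x, n=12): # N = 12 Performer layers; attention keeps shape
    for _ in range(n):
        x = x                  # identity placeholder for MulAttn()
    return x

x = np.zeros((3, 64, 64))      # CSI sequence X_k laid out as an image-like map
d, skips = x, []
for c in (32, 64, 128):        # step 22.1: three downsamplings
    d = conv(d, c)
    skips.append(d)            # kept for the cross-layer connections
    d = pool(d)

f = performer_layers(d)        # step 22.2: bottleneck, shape (128, 8, 8)

u = f
for skip in reversed(skips):   # steps 22.3-22.4: upsample, then three fusions
    u = interp(conv(u, skip.shape[0]))
    u = np.concatenate([u, skip], axis=0)   # assumed fusion: concatenation

p = conv(conv(u, 36), 36)      # step 22.5: double convolution -> P_S-sized map
assert f.shape == (128, 8, 8) and p.shape == (36, 64, 64)
```

Each upsampling exactly undoes one pooling, which is why the fused tensors line up with the stored skip outputs.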
The invention has the beneficial effects that:
compared with a human body posture estimation algorithm based on RGB images and traditional sensors, the human body posture estimation method based on the RGB images and the traditional sensors adopts WiFi equipment which is low in cost, wide in application range and good in privacy protection to carry out human body posture estimation, greatly expands the application range of posture estimation in multiple fields, and makes up for the defects of application of the traditional algorithm.
The multi-head attention mechanism is introduced into the attitude estimation algorithm, so that the performance defect of the traditional attitude estimation algorithm is effectively overcome, the noise reduction at the algorithm level is achieved, and the network robustness is enhanced.
The Performer-Unet algorithm network structure is a high-performance structure, and achieves excellent performance in human body posture estimation based on WiFi.
Additional features and advantages of the invention will be set forth in the detailed description which follows.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings, wherein like reference numerals generally represent like parts in the exemplary embodiments of the present invention.
FIG. 1 shows a schematic diagram of the Performer-Unet model structure.
FIG. 2 shows a flow chart of a model training method of an embodiment of the invention.
Fig. 3 shows a structure diagram of an image and CSI data acquisition apparatus according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.
As shown in fig. 2, the present invention uses an existing high-performance image-based posture estimation network to obtain the real posture annotation information. The real posture annotation information consists of the coordinates of 18 human skeleton points (ears, eyes, nose, neck, shoulders, elbows, hands, hips, knees and ankles), and is obtained as follows:
the sample image frame of the human posture is processed by a human posture recognition algorithm Alpha to obtain real posture annotation information P_A containing the human skeleton point coordinates:
P_A = Alpha(I_k)
wherein: k represents the frame number of the sample image frame, I_k represents the k-th sample image frame captured by the camera, and Alpha(·) represents the human posture recognition algorithm;
according to the real posture annotation information P_A, the posture annotation coordinates P_A^j and the confidence C are generated:
{(P_A^j, C_j)} = P_A, j = 1, …, 18
wherein: j represents the number of the skeleton point, P_A^j = (x_j, y_j) represents the coordinates of the j-th skeleton point, and C_j is the confidence of the coordinates of that skeleton point.
The CSI data packet, i.e. the channel state information sequence, of each sample image frame is acquired according to the timestamps; as shown in fig. 2, the CSI data stream is split as follows: Matlab is used to divide the CSI data stream into the CSI data packets corresponding to each image frame according to the timestamps.
The training process in fig. 2 specifically includes:
step 21, constructing a teacher network T(·) and a student network S(·), which store the real posture annotation information of the sample image frames and the CSI data packets of the sample image frames, respectively;
step 22, the student network S(·) inputs the channel state information sequence X_k corresponding to the k-th sample image frame into the artificial neural network Performer-Unet and generates a posture estimation matrix P_S of the same size as the posture annotation coordinates P_A^j:
P_S = PU(X_k), P_S = {P_S^j}, j = 1, …, 18
wherein: j represents the number of the skeleton point; PU(·) represents processing the channel state information sequence X_k with the model Performer-Unet; P_S^j represents the coordinates of the j-th skeleton point in the posture estimation matrix;
step 23, computing the error L_S(·) of the student network between the posture annotation coordinates P_A^j and the posture estimation matrix P_S^j using the L2 loss:
L_S(·) = Σ_{j=1}^{18} C_j · ||P_A^j − P_S^j||_2^2
wherein: j represents the number of the skeleton point; C_j represents the confidence of the j-th skeleton point; P_A and P_S respectively represent the real posture annotation matrix and the posture estimation matrix; P_A^j and P_S^j respectively represent the coordinates (x_j, y_j) of the j-th skeleton point;
step 24, the student network S(·) back-propagates the computed error as gradients and optimizes the network Performer-Unet by gradient descent until the network converges and the error no longer decreases appreciably, completing the training.
As shown in fig. 3, in the invention a WiFi transmitting antenna and a WiFi receiving antenna are arranged on either side of the human body, and a monitoring camera is arranged on the side of the transmitting antenna, aligned with it, to shoot the video of the moving human body; the video is decomposed into sample image frames containing the human postures, and the WiFi receiving antenna collects the CSI data stream of the human posture in real time.
As described with reference to fig. 1, the WiFi human posture estimation algorithm based on the Performer-Unet adopts a U-shaped structure with N Performer layers added in the bottommost fusion layer; the processing steps of the model are as follows:
step 22.1, the U-shaped structure model performs three downsamplings on the input channel state information sequence X_k for background semantic extraction, each downsampling comprising one convolution Conv() operation and one pooling Pool() operation, giving the downsampling outputs D_k^(1), D_k^(2), D_k^(3):
D_k^(i) = Pool(Conv(D_k^(i−1))), i = 1, 2, 3, with D_k^(0) = X_k
step 22.2, the sequence information D_k^(3) after the three downsamplings is input into the N Performer layers, and the posture feature sequence F_k is extracted through the multi-head attention mechanism MulAttn():
F_k = MulAttn(D_k^(3))
step 22.3, the extracted posture feature sequence F_k is upsampled to amplify the feature information, giving the feature amplification sequences U_k^(1), U_k^(2), U_k^(3); each upsampling operation Up() comprises a convolution operation Conv() and an interpolation operation Int():
Up(·) = Int(Conv(·))
step 22.4, the upsampled feature amplification sequences are fused in turn, through cross-layer connections, with the corresponding downsampling outputs; the three fusions give the posture prediction sequence Y_k that accounts for both context semantics and feature information:
U_k^(i) = Up(Y_k^(i−1)), Y_k^(i) = U_k^(i) ⊕ D_k^(3−i), i = 1, 2, 3, with Y_k^(0) = F_k and D_k^(0) = X_k
wherein ⊕ denotes the cross-layer fusion;
step 22.5, the posture prediction sequence Y_k^(3) finally output by the Performer-Unet is matched in scale to the posture annotation coordinates P_A^j through a double convolution, giving the posture estimation matrix P_S:
P_S = Conv(Conv(Y_k^(3)))
In the final Performer-Unet WiFi human posture estimation algorithm, the number of Performer layers N is 12.
In recent years, the rapid development of network technologies such as 5G has popularized wireless WiFi devices. Open-source drivers are commercially available for a range of network cards, such as the Atheros AR9580 and Intel WiFi Link 5300, and WiFi devices are now deployed in both public venues and private homes. Compared with other sensors, WiFi devices therefore offer low cost and high flexibility. Research on WiFi has gradually uncovered the multipath channel characteristics of WiFi signals: the electromagnetic carrier of each subcarrier undergoes multipath propagation (reflection, scattering and penetration) on various obstacles and on the human body, forming a distinctive transmission pattern.
The invention captures the characteristics of the surrounding environment by collecting and analyzing the channel state information (CSI) of each subcarrier signal during WiFi transmission. Using WiFi for human posture estimation offers good universality and privacy protection.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.

Claims (8)

1. A WiFi human body posture estimation algorithm based on Performer-Unet is characterized by comprising the following steps:
s1, collecting a human body activity video, disassembling the human body activity video into sample image frames of various human body postures, and extracting real posture marking information containing coordinates of human body skeleton points; acquiring a CSI data packet, namely a channel state information sequence, of the sample image frame according to the time stamp;
s2, inputting a sample image frame containing real attitude marking information and a channel state information sequence of the corresponding sample image frame obtained according to a time stamp into an artificial neural network for training;
obtaining human body posture estimation output according to the channel state information sequence, carrying out loss marking on the human body posture estimation output and real posture marking information of a corresponding sample image frame, optimizing an artificial neural network by adopting a gradient descent method until the neural network is converged, finishing training and obtaining a Performer-Unet human body posture estimation model;
and S3, acquiring a CSI (channel state information) data stream of the human body posture to be detected in real time by adopting a WiFi (wireless fidelity) receiving antenna, inputting a Performer-Unet human body posture estimation model, and acquiring the human body posture according to the CSI data stream.
2. The Performer-Unet based WiFi human body posture estimation algorithm of claim 1, wherein the human postures comprise: standing, walking, squatting, running and jumping; and the real posture annotation information consists of the coordinates of 18 human skeleton points.
3. The WiFi human posture estimation algorithm based on Performer-Unet of claim 1, characterized in that step S1 specifically comprises: arranging a WiFi transmitting antenna and a WiFi receiving antenna on either side of the human body, and a monitoring camera on the side of the transmitting antenna, aligned with it, to shoot the video of the moving human body; decomposing the video into sample image frames containing the human postures; and collecting the CSI data stream of the human posture in real time with the WiFi receiving antenna.
4. The WiFi human posture estimation algorithm based on Performer-Unet of claim 1, characterized in that in step S1 the real posture annotation information is obtained as follows:
the sample image frame of the human posture is processed by a human posture recognition algorithm Alpha to obtain real posture annotation information P_A containing the human skeleton point coordinates:
P_A = Alpha(I_k)
wherein: k represents the frame number of the sample image frame, I_k represents the k-th sample image frame captured by the camera, and Alpha(·) represents the human posture recognition algorithm;
according to the real posture annotation information P_A, the posture annotation coordinates P_A^j and the confidence C are generated:
{(P_A^j, C_j)} = P_A, j = 1, …, 18
wherein: j represents the number of the skeleton point, P_A^j = (x_j, y_j) represents the coordinates of the j-th skeleton point, and C_j is the confidence of the coordinates of that skeleton point.
5. The WiFi human posture estimation algorithm based on Performer-Unet of claim 4, characterized in that in step S1 the channel state information (CSI) data stream is obtained as follows: Matlab is used to divide the CSI data stream into the CSI data packets corresponding to each image frame according to the timestamps.
6. The WiFi human posture estimation algorithm based on Performer-Unet of claim 5, characterized in that the training process in step S2 specifically comprises:
step 21, constructing a teacher network T(·) and a student network S(·), which store the real posture annotation information of the sample image frames and the CSI data packets of the sample image frames, respectively;
step 22, the student network S(·) inputs the channel state information sequence X_k corresponding to the k-th sample image frame into the artificial neural network Performer-Unet and generates a posture estimation matrix P_S of the same size as the posture annotation coordinates P_A^j:
P_S = PU(X_k), P_S = {P_S^j}, j = 1, …, 18
wherein: j represents the number of the skeleton point; PU(·) represents processing the channel state information sequence X_k with the model Performer-Unet; P_S^j represents the coordinates of the j-th skeleton point in the posture estimation matrix;
step 23, computing the error L_S(·) of the student network between the posture annotation coordinates P_A^j and the posture estimation matrix P_S^j using the L2 loss:
L_S(·) = Σ_{j=1}^{18} C_j · ||P_A^j − P_S^j||_2^2
wherein: j represents the number of the skeleton point; C_j represents the confidence of the j-th skeleton point; P_A and P_S respectively represent the real posture annotation matrix and the posture estimation matrix; P_A^j and P_S^j respectively represent the coordinates (x_j, y_j) of the j-th skeleton point;
step 24, the student network S(·) back-propagates the computed error as gradients, and the network Performer-Unet is optimized by gradient descent until it converges, completing the training.
7. The WiFi human posture estimation algorithm based on Performer-Unet of claim 6, characterized in that PU(·) represents the Performer-Unet human posture estimation model, which adopts a U-shaped structure with N Performer layers added in the bottommost fusion layer, the processing steps of the model being as follows:
step 22.1, the U-shaped structure model performs three downsamplings on the input channel state information sequence X_k for background semantic extraction, each downsampling comprising one convolution Conv() operation and one pooling Pool() operation, giving the downsampling outputs D_k^(1), D_k^(2), D_k^(3):
D_k^(i) = Pool(Conv(D_k^(i−1))), i = 1, 2, 3, with D_k^(0) = X_k
step 22.2, the sequence information D_k^(3) after the three downsamplings is input into the N Performer layers, and the posture feature sequence F_k is extracted through the multi-head attention mechanism MulAttn():
F_k = MulAttn(D_k^(3))
step 22.3, the extracted posture feature sequence F_k is upsampled to amplify the feature information, giving the feature amplification sequences U_k^(1), U_k^(2), U_k^(3); each upsampling operation Up() comprises a convolution operation Conv() and an interpolation operation Int():
Up(·) = Int(Conv(·))
step 22.4, the upsampled feature amplification sequences are fused in turn, through cross-layer connections, with the corresponding downsampling outputs; the three fusions give the posture prediction sequence Y_k that accounts for both context semantics and feature information:
U_k^(i) = Up(Y_k^(i−1)), Y_k^(i) = U_k^(i) ⊕ D_k^(3−i), i = 1, 2, 3, with Y_k^(0) = F_k and D_k^(0) = X_k
wherein ⊕ denotes the cross-layer fusion;
step 22.5, the posture prediction sequence Y_k^(3) finally output by the Performer-Unet is matched in scale to the posture annotation coordinates P_A^j through a double convolution, giving the posture estimation matrix P_S:
P_S = Conv(Conv(Y_k^(3)))
8. The Performer-Unet-based WiFi human body posture estimation algorithm of claim 6, wherein, in the N Performer layers, the number of layers N is 12.
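Claims 7 and 8 describe a U-shaped pipeline: three downsamplings, N = 12 Performer layers at the bottleneck, then three upsamplings fused with cross-layer skip connections. The data flow can be sketched minimally as follows; the stand-ins `down`, `up`, the additive fusion, and the identity "attention" are assumptions for illustration only, not the patented Conv/Pool/Int/MulAttn operators:

```python
import numpy as np

def down(x):
    """One downsampling stage: a stand-in 'convolution' (identity here)
    followed by average pooling that halves the sequence length."""
    return x.reshape(-1, 2).mean(axis=1)

def up(x):
    """One upsampling stage: a stand-in 'convolution' (identity here)
    followed by nearest-neighbour interpolation that doubles the length."""
    return np.repeat(x, 2)

def performer_unet_sketch(x, n_layers=12):
    # three downsamplings for background-semantic extraction (step 22.1)
    d1 = down(x)
    d2 = down(d1)
    d3 = down(d2)
    # N Performer layers at the bottleneck; the multi-head attention
    # MulAttn() is replaced by an identity placeholder (step 22.2)
    p = d3
    for _ in range(n_layers):
        p = p  # MulAttn() placeholder
    # three upsamplings, each fused (here: added) with the matching
    # cross-layer skip connection (steps 22.3 and 22.4)
    u1 = up(p)  + d2
    u2 = up(u1) + d1
    u3 = up(u2) + x
    return u3

x = np.arange(8, dtype=float)   # toy CSI sequence, length divisible by 8
y = performer_unet_sketch(x)
assert y.shape == x.shape       # output matches the input resolution
```

The point of the sketch is the shape bookkeeping: each downsampling halves the sequence, each upsampling doubles it back, and the skip fusion only works because the i-th upsampled length matches the (3−i)-th downsampled length.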
CN202210749949.6A 2022-06-29 2022-06-29 WiFi human body posture estimation algorithm based on Performer-Unet Pending CN115171154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210749949.6A CN115171154A (en) 2022-06-29 2022-06-29 WiFi human body posture estimation algorithm based on Performer-Unet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210749949.6A CN115171154A (en) 2022-06-29 2022-06-29 WiFi human body posture estimation algorithm based on Performer-Unet

Publications (1)

Publication Number Publication Date
CN115171154A true CN115171154A (en) 2022-10-11

Family

ID=83490176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210749949.6A Pending CN115171154A (en) 2022-06-29 2022-06-29 WiFi human body posture estimation algorithm based on Performer-Unet

Country Status (1)

Country Link
CN (1) CN115171154A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189382A (en) * 2023-02-01 2023-05-30 观云(山东)智能科技有限公司 Fall detection method and system based on inertial sensor network
US11892563B1 (en) 2023-08-21 2024-02-06 Project Canary, Pbc Detecting human presence in an outdoor monitored site


Similar Documents

Publication Publication Date Title
US20180186452A1 (en) Unmanned Aerial Vehicle Interactive Apparatus and Method Based on Deep Learning Posture Estimation
CN110363140B (en) Human body action real-time identification method based on infrared image
CN115171154A (en) WiFi human body posture estimation algorithm based on Performer-Unet
CN110135249B Human behavior identification method based on a temporal attention mechanism and LSTM
CN108898063B (en) Human body posture recognition device and method based on full convolution neural network
CN110956094A RGB-D multi-modal fusion person detection method based on an asymmetric dual-stream network
CN107220604A Fall detection method based on video
CN107909061A Head pose tracking device and method based on incomplete features
CN110969124A (en) Two-dimensional human body posture estimation method and system based on lightweight multi-branch network
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN104794737B Depth-information-assisted particle filter tracking method
CN111814661A Human behavior identification method based on a residual-recurrent neural network
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN104821010A (en) Binocular-vision-based real-time extraction method and system for three-dimensional hand information
CN112597814A Improved OpenPose classroom multi-person abnormal behavior and mask-wearing detection method
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
CN113158833B (en) Unmanned vehicle control command method based on human body posture
CN112001347A (en) Motion recognition method based on human skeleton shape and detection target
CN113705445B (en) Method and equipment for recognizing human body posture based on event camera
Fang et al. Dynamic gesture recognition using inertial sensors-based data gloves
CN112926475A (en) Human body three-dimensional key point extraction method
CN113762009A Crowd counting method based on multi-scale feature fusion and a dual-attention mechanism
CN116229507A (en) Human body posture detection method and system
CN116895098A (en) Video human body action recognition system and method based on deep learning and privacy protection
CN113516232B (en) Self-attention mechanism-based wall-penetrating radar human body posture reconstruction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination