CN113963229A - Video-based wireless signal enhancement and cross-target gesture recognition method - Google Patents

Video-based wireless signal enhancement and cross-target gesture recognition method

Info

Publication number
CN113963229A
Authority
CN
China
Prior art keywords
video
human body
gesture
point cloud
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111110503.0A
Other languages
Chinese (zh)
Other versions
CN113963229B (en)
Inventor
陈晓江
宋凤仪
王楠
张扬帆
李欣怡
房鼎益
李珂
王夫蔚
任宇辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest University
Original Assignee
Northwest University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest University filed Critical Northwest University
Priority to CN202111110503.0A
Publication of CN113963229A
Application granted
Publication of CN113963229B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04B TRANSMISSION
    • H04B17/00 Monitoring; Testing
    • H04B17/30 Monitoring; Testing of propagation channels
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Electromagnetism (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video-based wireless signal enhancement and human body gesture recognition method. A MoCoGAN video generation model generates virtual data to expand the video data set. Contour detection and target extraction algorithms remove background noise from the frame set, an HMR algorithm converts the 2D images into 3D point cloud data, and adjusting the height and shape parameters of the human body expands the data set again. HPR removes the points in the point cloud invisible to the transmitting end, and simulation yields the Wi-Fi signals under the corresponding deployment conditions. Time-frequency domain features are extracted from the collected Wi-Fi channel state information data and from the simulated Wi-Fi signals, and a gesture feature analysis model linking the two is established to realize high-precision cross-target gesture recognition.

Description

Video-based wireless signal enhancement and cross-target gesture recognition method
Technical Field
The invention relates to the field of passive sensing, in particular to a low-cost, video-based wireless signal enhancement and high-precision passive cross-target gesture recognition method.
Background Art
With the rapid development of the Internet of Things and wireless communication technology, many common devices in daily life, such as sensors, cameras and routers, now have computing and sensing capabilities, and human-computer interaction research has therefore become increasingly important. Gesture recognition is a typical application in this field, and many researchers detect users' gesture activities in different ways to support specific application functions. Over the past decades, researchers have proposed various gesture recognition methods, which can be classified as active or passive according to the recognition mode:
the first type: provided is an active identification method. The active human body gesture recognition method mainly utilizes various sensing devices such as an accelerometer, a gyroscope, a pressure gauge and the like to collect data on different characteristic dimensions to complete recognition of gesture activities of a user, however, the method requires the user to carry additional sensing devices for a long time, is poor in friendliness, and is not widely applied.
The second type: passive recognition methods. Compared with active recognition, passive recognition is more convenient, since no extra equipment needs to be carried; it mainly takes two forms, visual images and wireless signals. Gesture recognition based on visual image data captures user activity with video acquisition equipment and then applies image processing techniques, but it is limited by the lighting conditions in the environment, requires an unobstructed view between user and device, and poses a potential threat to user privacy. Sensing and recognition with wireless signals offers low cost and good privacy, and since Wi-Fi devices are now widely deployed, gesture recognition built on Wi-Fi is especially general. However, Wi-Fi-based gesture recognition still faces many limiting factors: first, Wi-Fi-based methods need a large amount of data to train a robust recognition model with strong generalization capability; second, CSI data are sensitive, and because user characteristics are diverse, their influence on the signal varies from person to person, so such methods lack robustness and their accuracy degrades sharply across users.
In summary, existing passive gesture recognition techniques fall short in cost, robustness and generalization capability. A cross-target, robust and high-precision passive gesture recognition technique is therefore desirable.
Disclosure of Invention
In order to solve the problems in the prior art, an object of the present invention is to provide a video-based wireless signal enhancement and cross-target gesture recognition method that achieves high-precision cross-target recognition while greatly reducing the cost the method requires.
In order to accomplish this task, the invention adopts the following technical solution:
a video-based wireless signal enhancement and cross-target gesture recognition method comprises the following steps:
Step one, collecting Wi-Fi channel state information of gesture actions in a monitoring area;
Step two, preprocessing the Wi-Fi channel state information data to eliminate environmental noise and the influence of outliers;
Step three, collecting original human body gesture action video data in the monitoring area;
Step four, randomly generating new videos from the original human body gesture action video data with a video generation model to obtain an expanded video data set;
Step five, preprocessing the expanded video data set to obtain a frame set;
Step six, removing background noise and extracting the human body contour with a contour detection and target extraction algorithm;
Step seven, converting the frame set into corresponding standard human body surface 3D point cloud data;
Step eight, restoring multiple human body surface 3D point cloud data sets with different heights and body types from the standard 3D point cloud data through parameter adjustment;
Step nine, performing gesture signal simulation on the human body surface 3D point cloud set to obtain Wi-Fi signals under the corresponding deployment conditions;
Step ten, extracting time-frequency domain features from the collected Wi-Fi channel state information data and from the Wi-Fi signals under the corresponding deployment conditions, respectively;
Step eleven, establishing a gesture feature analysis model of the simulated wireless signals and the collected Wi-Fi channel state information data to complete cross-target gesture recognition.
Further, in step two, the channel state information data are preprocessed: a Hampel filter removes outliers, and a Butterworth low-pass filter retains the data of the low-frequency band to eliminate high-frequency noise.
Further, the video generation model in step four is a MoCoGAN video generation model, which generates new videos from the original human body gesture action video data and uses two discriminators to discriminate images and video frame sequences respectively.
Further, in the seventh step, the frame set is converted into corresponding standard human body surface 3D point cloud data through an HMR algorithm.
Further, in step eight, each point in a human body surface 3D point cloud data set describes the human body surface information of a target user with the corresponding characteristics, including the target user's body type, posture or orientation.
Further, in step eight, a hidden point removal algorithm eliminates the disturbance signals in the human body surface 3D point cloud data set, yielding the set of points in the data set visible to the transmitting end.
Further, in step nine, gesture signal simulation is performed on the human body surface 3D point cloud set to obtain the Wi-Fi signals under the corresponding deployment conditions:

$$H(t) = g(X_T, X_R) + \sum_{X_m \in M'(t)} A_m\, G_m\, g(X_T, X_m)\, g(X_m, X_R)$$

where M'(t) is the set of points in the human body surface 3D point cloud data set visible to the transmitting end; X_T, X_R and X_m respectively denote the transmitting end, the receiving end and a reflection point on the human body surface; A_m and G_m respectively denote the reflectivity and the angular factor; and g(X_T, X_R) denotes the line-of-sight signal strength propagating from the transmitting end X_T to the receiving end X_R.
Further, the time-frequency domain features extracted in step ten at least include the minimum, variance, mean, skewness, standard deviation, kurtosis, energy and FFT peak.
Further, the gesture feature analysis model in step eleven at least comprises a classification layer, a feature mapping layer and a reconstruction layer. The classification layer extracts gesture features; the feature mapping layer learns the gesture feature mapping between the Wi-Fi channel state information data and the simulated Wi-Fi signals under the corresponding deployment conditions; the reconstruction layer assists the feature mapping layer during training, emphasizing the extraction of gesture-related feature mappings.
The video-based wireless signal enhancement and cross-target gesture recognition method has the following beneficial effects:
Gesture actions of cross-target users are recognized by combining video and Wi-Fi, which reduces the cost of collecting and labeling training data; simulating wireless signals from 3D point clouds improves data diversity and strengthens the robustness and generalization capability of the system. Meanwhile, the constructed model learns the mapping between simulated and real signals, achieving robust cross-target recognition accuracy.
Drawings
FIG. 1 is a flow chart of a video-based wireless signal enhancement and cross-target gesture recognition method of the present invention.
FIG. 2 is a diagram of a general framework of a gesture feature system analysis model.
FIG. 3 is a schematic diagram of a feature mapping layer structure.
FIG. 4 is a deployment diagram of the video-based wireless signal enhancement and cross-target gesture recognition method of the present invention.
FIG. 5 is a graph of the impact of different numbers of training users on recognition accuracy.
FIG. 6 is a graph of the impact of different classification gesture numbers on recognition accuracy.
Fig. 7 is a multi-model comparative evaluation diagram.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Detailed Description
In order to illustrate the technical solution of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Referring to fig. 1, the present embodiment provides a video-based wireless signal enhancement and cross-target gesture recognition method, including:
Step one, collecting Wi-Fi channel state information of gesture actions in a monitoring area;
Step two, preprocessing the Wi-Fi channel state information data to eliminate environmental noise and the influence of outliers;
Step three, collecting original human body gesture action video data in the monitoring area;
Step four, randomly generating new videos from the original human body gesture action video data with a video generation model to obtain an expanded video data set;
Step five, preprocessing the expanded video data set to obtain a frame set;
Step six, removing background noise and extracting the human body contour with a contour detection and target extraction algorithm;
Step seven, converting the frame set into corresponding standard human body surface 3D point cloud data;
Step eight, restoring multiple human body surface 3D point cloud data sets with different heights and body types from the standard 3D point cloud data through parameter adjustment;
Step nine, performing gesture signal simulation on the human body surface 3D point cloud set to obtain Wi-Fi signals under the corresponding deployment conditions;
Step ten, extracting time-frequency domain features from the collected Wi-Fi channel state information data and from the Wi-Fi signals under the corresponding deployment conditions, respectively;
Step eleven, establishing a gesture feature analysis model of the simulated wireless signals and the collected Wi-Fi channel state information data to complete cross-target gesture recognition.
In the implementation of the invention, the channel state information of gesture actions on the Wi-Fi link and the corresponding gesture video information are collected first, and both are preprocessed. A MoCoGAN video generation model generates virtual data to expand the video data set. Contour detection and target extraction algorithms remove background noise from the frame set, an HMR algorithm converts the 2D images into 3D point cloud data, and adjusting the height and shape parameters of the human body expands the data set again. HPR removes the points in the point cloud invisible to the transmitting end, and simulation yields the Wi-Fi signals under the corresponding deployment conditions. Time-frequency domain features are then extracted from the collected Wi-Fi channel state information data and from the simulated Wi-Fi signals, and a gesture feature analysis model linking the two is established to realize high-precision cross-target gesture recognition.
Optionally, step one, collecting Wi-Fi channel state information of gesture actions in the monitoring area, specifically includes:
An Intel 5300 Wi-Fi NIC serves as the receiving end and a TP-Link AC1750 gigabit wireless router as the transmitting end; the CSI Tool extracts the data of 30 subcarriers per antenna, and the Wi-Fi packet sending rate is set to 1000 pkts/s.
optionally, in the second step, by preprocessing the Wi-Fi channel state information data, environmental noise and outlier influence are eliminated, which specifically includes:
and aiming at special outliers existing in the Wi-Fi channel state information sequence of the collected gesture actions in the monitoring area, a Hampel filter is adopted to remove the outliers. The influence of the gesture action on the signal is mainly concentrated on a low-frequency part, the environmental noise is mainly existed in a high frequency, and the influence caused by the high-frequency noise can be eliminated by reserving data information of a low frequency band through a Butterworth low-pass filter. In order to remove the influence of different gesture making time lengths of different people and ensure the uniformity of data in time dimension, interpolation processing needs to be carried out on the data.
Optionally, step three, collecting original human body gesture action video data in the monitoring area, specifically includes:
Original video data of the target gestures are acquired with a mobile phone, with the capture frame rate set to 30 fps.
optionally, a video generation model is used to randomly generate new video from the original human body gesture video data to obtain an extended video data set, which specifically includes:
the extended video data set adopts a MoCoGAN video generation model, utilizes the countermeasure thought, generates a new virtual video according to the original human body gesture action video data distribution, and uses two discriminators to discriminate the image and the video frame sequence respectively.
MoCoGAN divides the human posture video into two parts, content and motion: Z_I = Z_C + Z_M, where Z_I represents the space of video frames, each point z ∈ Z_I represents an image, and a video of K frames is represented by a path of length K, [z^(1), ..., z^(K)]; Z_C represents the content vector space and Z_M the motion vector space.
Random vectors of content and motion are then mapped to a video frame sequence for video generation. The model mainly adopts four components: a GRU, a recurrent neural network used to predict the next motion; G_I, which generates the successive frame images; D_I, which judges whether a generated image is real, i.e., whether the content is real; and D_V, which judges whether the video composed of the generated images is real, i.e., whether the motion is real. During training, the generator G_I maps the content and motion representations {Z_C, Z_M} to a video frame sequence [x̃^(1), ..., x̃^(K)]. From this frame sequence, one frame (S_1) is randomly sampled and fed to the image discriminator D_I, and T consecutive frames (S_T) are randomly sampled and fed to the video discriminator D_V, so that virtual videos with plausible content and motion are generated and the diversity of the video data is improved.
In addition, to reduce training complexity, the three-channel color videos can be converted into single-channel grayscale videos for training.
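A minimal PyTorch sketch may make the {Z_C, Z_M} split concrete: a content code is sampled once per video, a GRU rolls out one motion code per frame, and each frame latent z^(k) = [z_C, z_M^(k)] would be fed to the image generator G_I. All dimensions are illustrative, not taken from the patent or the original MoCoGAN model.

```python
import torch
import torch.nn as nn

class MotionSampler(nn.Module):
    """Sketch of MoCoGAN's latent decomposition: a fixed content code z_C shared
    by all K frames, plus a GRU that rolls out a per-frame motion code z_M^(k)."""
    def __init__(self, dim_c=50, dim_m=10, dim_eps=10):
        super().__init__()
        self.gru = nn.GRUCell(dim_eps, dim_m)
        self.dim_c, self.dim_m, self.dim_eps = dim_c, dim_m, dim_eps

    def forward(self, batch, k_frames):
        z_c = torch.randn(batch, self.dim_c)       # content: sampled once per video
        h = torch.zeros(batch, self.dim_m)         # motion state
        frames = []
        for _ in range(k_frames):
            h = self.gru(torch.randn(batch, self.dim_eps), h)  # next motion code
            frames.append(torch.cat([z_c, h], dim=1))  # z^(k) = [z_C, z_M^(k)]
        return torch.stack(frames, dim=1)          # (batch, K, dim_c + dim_m) -> G_I
```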
Step five, preprocessing the expanded video data set to obtain a frame set, specifically includes:
The original human body gesture action video data are split frame by frame with a MATLAB video processing tool and saved as frame images, yielding the frame image stream of each action video, i.e., the frame set.
Step six, removing background noise and extracting the human body contour with a contour detection and target extraction algorithm, specifically includes:
A DeepLabv3plus model performs contour detection and target extraction, finding the boundary between the human body contour and the other parts of the image, and the human body target region is then extracted along that boundary.
Step seven: converting the frame set into corresponding standard human body surface 3D point cloud data, specifically includes:
The HMR algorithm reconstructs a complete human 3D point cloud from a single person-centered image in a feed-forward manner. Training assumes that all images are annotated with 2D joints and that part of the data also carry 3D labels. The original 2D image is convolutionally encoded into convolutional features, which are sent to an iterative 3D regression module based on SMPL; this module infers the 3D human and camera parameters under which the projected 3D joints match the annotated 2D joints. The inferred parameters are also sent to an adversarial discriminator network that judges, against unpaired data, whether the 3D parameters describe real body shapes and motions, making the resulting 3D point cloud information more realistic.
When 3D annotation information is available it is used as an intermediate loss, and the overall loss function is

$$L = \lambda\,(L_{reproj} + t \times L_{3D}) + L_{adv}$$

where L_reproj is the loss function of the 3D regression module, whose optimization goal is to minimize L_reproj; the λ parameter describes the relative importance of each objective; t is an indicator that takes the value 1 when the training data set contains real 3D labels and 0 otherwise; L_3D is the corresponding training loss when real 3D annotations exist; and L_adv is the loss function of the discriminator D.
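The overall objective translates directly into code; a one-line sketch (the weight value is illustrative, not from the patent or the HMR paper):

```python
def hmr_loss(l_reproj, l_3d, l_adv, lam=60.0, has_3d=True):
    """Overall HMR objective L = lambda * (L_reproj + t * L_3D) + L_adv from the
    formula above; t = 1 only when real 3D labels exist. lam is illustrative."""
    t = 1.0 if has_3d else 0.0
    return lam * (l_reproj + t * l_3d) + l_adv
```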
Step eight: restoring multiple human body surface 3D point cloud data sets with different heights and body types from the standard 3D point cloud data through parameter adjustment, specifically includes:
Each point in a human body surface 3D point cloud data set describes the human body surface information of a target user with the corresponding characteristics, at least including the target user's body type, posture or orientation. The HMR algorithm yields a standard (normalized) human body surface 3D point cloud data set; by adjusting parameters, multiple human body surface 3D point cloud data sets with different heights and body types can be recovered from a single user's image, further improving data diversity.
Further, in step eight, a hidden point removal algorithm eliminates the disturbance signals in the human body surface 3D point cloud data set, yielding the set of points visible to the transmitting end. Specifically, the hidden point removal algorithm (HPR) extracts the set of points in the human body surface 3D point cloud data set visible to the transmitting end as follows:
Because of the angles between the transceiver deployment positions and the target position, the signal does not propagate to the receiving end via reflections from all 3D surface points (including the back and sides of the target); i.e., not all points in the 3D point cloud reflect the signal. To estimate the signal correctly, the hidden points in the 3D point cloud are eliminated with the HPR algorithm, which uses spherical flipping and a convex hull to determine which points of the 3D point cloud are visible from the preset simulated transceiver positions, thereby removing the hidden points and obtaining the set of points in the human body surface 3D point cloud data set visible to the transmitting end.
Step nine: performing gesture signal simulation on the human body surface 3D point cloud set to obtain Wi-Fi signals under the corresponding deployment conditions, specifically includes:
Without considering environmental multipath, the signal received at the receiving end Rx consists of two parts: the signal propagating from the transmitting end X_T to the receiving end X_R along the line-of-sight path, and the signal reflected from the transmitting end X_T to the receiving end X_R via the visible reflection points M'(t) on the human body surface along non-line-of-sight paths. Gesture signal simulation is performed on the human body surface 3D point cloud set, and the Wi-Fi signal under the corresponding deployment conditions is obtained as:
$$H(t) = g(X_T, X_R) + \sum_{X_m \in M'(t)} A_m\, G_m\, g(X_T, X_m)\, g(X_m, X_R)$$

where M'(t) is the set of points in the human body surface 3D point cloud data set visible to the transmitting end; X_T, X_R and X_m respectively denote the transmitting end, the receiving end and a reflection point on the human body surface; A_m and G_m respectively denote the reflectivity and the angular factor; and g(X_T, X_R) denotes the line-of-sight signal strength propagating from the transmitting end X_T to the receiving end X_R;
the Green's function g(x, y) describes the strength of a signal propagating from x to a receiving end at y:

$$g(x, y) = \frac{e^{-j 2\pi \lVert x - y \rVert / \lambda}}{4\pi \lVert x - y \rVert}$$

where ‖·‖ denotes the Euclidean norm and λ the wavelength of the signal.
The strength of the signal reflected to the receiving end via a reflection point X_m on the user's surface is determined by two factors: the surface area and the reflection direction at X_m. The larger the reflecting area, the higher the reflectivity A_m. When the signal reaches the human body surface, the direction r_m of the strongest reflected signal follows from the reflection model and can be determined from the surface normal n_m, with i_m the unit incident direction from the transmitter:

$$r_m = i_m - 2\,(i_m \cdot n_m)\, n_m, \qquad i_m = \frac{X_m - X_T}{\lVert X_m - X_T \rVert}$$

The signal strength propagating from the reflection point X_m toward the receiver Rx along X_R − X_m is inversely related to the included angle θ_m between that direction and r_m, and is modeled by the Gaussian function G_m:

$$G_m = \exp\!\left(-\frac{\theta_m^2}{2\sigma^2}\right)$$

where σ controls how quickly the reflected strength falls off away from the strongest reflection direction.
by the method, gesture signal simulation is carried out on the human body surface 3D point cloud image set, and Wi-Fi signals under corresponding deployment conditions are obtained.
Step ten, extracting time-frequency domain features from the collected Wi-Fi channel state information data and from the Wi-Fi signals under the corresponding deployment conditions, respectively, specifically includes:
The hand-crafted features used in the invention comprise time-domain and frequency-domain features. The time-domain features include the maximum, the minimum/maximum ratio, the variance, mean, skewness, standard deviation, kurtosis, and the 0.25/0.5/0.75 quantiles; the frequency-domain features include the energy and the FFT peak.
Step eleven, establishing a gesture feature analysis model of the simulated wireless signals and the collected Wi-Fi channel state information data to complete cross-target gesture recognition, specifically includes:
Because of multipath effects in the real environment, the video-simulated wireless signals cannot be applied directly to recognizing the collected Wi-Fi channel state information of gesture actions. A gesture feature analysis model of the simulated wireless signals and the collected Wi-Fi channel state information data is therefore established to achieve a robust, high-precision cross-target gesture recognition effect, and to ensure that the extracted features are as free of environmental multipath influence as possible.
The gesture feature analysis model at least comprises a classification layer, a feature mapping layer and a reconstruction layer. The classification layer extracts gesture features; the feature mapping layer learns the gesture feature mapping between the Wi-Fi channel state information data and the simulated Wi-Fi signals under the corresponding deployment conditions; the reconstruction layer assists the feature mapping layer during training, emphasizing the extraction of gesture-related feature mappings. FIG. 2 shows the general framework of the gesture feature analysis model.
The classification layer is trained on simulated, high-quality, interference-free data, giving it strong generalization capability. To address the multipath interference in the collected Wi-Fi channel state information data, a reconstruction layer is added; this branch assists the feature mapping layer in mapping the collected Wi-Fi channel state information data to the simulated wireless signals of the corresponding gestures while ignoring the influence of the external environment.
The classification layer uses a CNN to extract features and learn the broad gesture characteristics present in the highly diverse simulated signals, yielding a classifier with strong generalization capability. The layer mainly performs convolution, pooling and normalization, introduces a nonlinear mapping function in the final fully connected operation, and outputs the model's predicted label distribution for the data. This layer guides the feature mapping layer to learn, as far as possible, the gesture feature mapping between the real data and the simulated data.
The feature mapping layer learns the gesture feature mapping from Wi-Fi signals in the real environment to the simulated signals, so that robust gesture recognition is finally realized on top of the classification layer model.
The reconstruction layer assists the feature mapping layer during training, emphasizing the extraction of gesture-related feature mappings. It is used only during the training stage; once training is finished and the network is put into practical use, the reconstruction layer does not participate in the gesture classification task. To measure the reconstruction layer's learning effect, its organization introduces the Euclidean distance as the loss function:

$$L_{rec} = d(\hat{x}, x_{sim})$$

where d(·,·) is the Euclidean distance, here taken between the reconstruction layer's output x̂ and the corresponding simulated-signal features x_sim. Through continuous iteration and the auxiliary optimization of this layer, the feature mapping layer's ability to map gesture-related features is strengthened, finally achieving a better effect. FIG. 3 shows a schematic diagram of the feature mapping layer structure.
And (3) comparing experimental results:
the inventor tries to evaluate the video-based wireless signal enhancement and cross-target gesture recognition method provided by the embodiment from the following three aspects:
the recognition performance of different training user numbers; different classification gesture number recognition performance; and (5) multi-model comparative evaluation.
Recognition performance with different numbers of training users:
FIG. 5 shows the verification of the system's effectiveness with 5 gestures. Training sets A, B, C and D contain the 5 gesture classes from 4, 8, 12 and 16 users respectively, and the test set contains data from 4 users. With 5 gestures, the cross-target recognition accuracy gradually improves as the amount of training data grows.
Recognition performance with different numbers of classified gestures:
FIG. 6 shows training sets from 4 different numbers of users with 5 and 10 gestures, evaluated on the same test set. As the number of classified gestures increases, the recognition accuracy decreases but remains high.
Multi-model comparative evaluation:
FIG. 7 compares conventional classification models (SVM, KNN, RF, CNN) with our classification model on cross-target classification accuracy with 5 and 10 gestures. The figure shows that our method is a significant improvement over the baseline methods.
Cross-target gesture recognition performance:
Overall, the invention greatly reduces cost while achieving satisfactory, high-precision cross-target gesture recognition accuracy.

Claims (9)

1. A video-based wireless signal enhancement and cross-target gesture recognition method, characterized by comprising the following steps:
Step one, collecting Wi-Fi channel state information of gesture actions in a monitoring area;
Step two, preprocessing the Wi-Fi channel state information data to eliminate environmental noise and the influence of outliers;
Step three, collecting original human body gesture action video data in the monitoring area;
Step four, randomly generating new videos from the original human body gesture action video data with a video generation model to obtain an expanded video data set;
Step five, preprocessing the expanded video data set to obtain a frame set;
Step six, removing background noise and extracting the human body contour with a contour detection and target extraction algorithm;
Step seven, converting the frame set into corresponding standard human body surface 3D point cloud data;
Step eight, restoring multiple human body surface 3D point cloud data sets with different heights and body types from the standard 3D point cloud data through parameter adjustment;
Step nine, performing gesture signal simulation on the human body surface 3D point cloud set to obtain Wi-Fi signals under the corresponding deployment conditions;
Step ten, extracting time-frequency domain features from the collected Wi-Fi channel state information data and from the Wi-Fi signals under the corresponding deployment conditions, respectively;
Step eleven, establishing a gesture feature analysis model of the simulated wireless signals and the collected Wi-Fi channel state information data to complete cross-target gesture recognition.
2. The video-based wireless signal enhancement and cross-target gesture recognition method of claim 1, wherein: in step two, the channel state information data are preprocessed, with a Hampel filter removing outliers and a Butterworth low-pass filter retaining the data of the low-frequency band to eliminate high-frequency noise.
3. The video-based wireless signal enhancement and cross-target gesture recognition method of claim 1, wherein: the video generation model in step four is a MoCoGAN video generation model, which generates new videos from the original human body gesture action video data and uses two discriminators to discriminate images and video frame sequences respectively.
4. The video-based wireless signal enhancement and cross-target gesture recognition method of claim 1, wherein: in step seven, the frame set is converted into corresponding standard human body surface 3D point cloud data by an HMR algorithm.
5. The video-based wireless signal enhancement and cross-target gesture recognition method of claim 1, wherein: in step eight, each point in a human body surface 3D point cloud data set describes the human body surface information of a target user with the corresponding characteristics, at least including the target user's body type, posture or orientation.
6. The video-based wireless signal enhancement and cross-target gesture recognition method of claim 5, wherein: in step eight, a hidden point removal algorithm eliminates the disturbance signals in the human body surface 3D point cloud data set, yielding the set of points in the data set visible to the transmitting end.
7. The video-based wireless signal enhancement and cross-target gesture recognition method of claim 6, wherein: in step nine, gesture signal simulation is performed on the human body surface 3D point cloud set to obtain the Wi-Fi signals under the corresponding deployment conditions:

$$H(t) = g(X_T, X_R) + \sum_{X_m \in M'(t)} A_m\, G_m\, g(X_T, X_m)\, g(X_m, X_R)$$

where M'(t) is the set of points in the human body surface 3D point cloud data set visible to the transmitting end; X_T, X_R and X_m respectively denote the transmitting end, the receiving end and a reflection point on the human body surface; A_m and G_m respectively denote the reflectivity and the angular factor; and g(X_T, X_R) denotes the line-of-sight signal strength propagating from the transmitting end X_T to the receiving end X_R.
8. The video-based wireless signal enhancement and cross-target gesture recognition method of claim 1, wherein: the time-frequency domain features extracted in step ten at least include the minimum, variance, mean, skewness, standard deviation, kurtosis, energy and FFT peak.
9. The video-based wireless signal enhancement and cross-target gesture recognition method of claim 1, wherein: the gesture feature analysis model in step eleven at least comprises a classification layer, a feature mapping layer and a reconstruction layer; the classification layer extracts gesture features; the feature mapping layer learns the gesture feature mapping between the Wi-Fi channel state information data and the simulated Wi-Fi signals under the corresponding deployment conditions; and the reconstruction layer assists the feature mapping layer during training, emphasizing the extraction of gesture-related feature mappings.
CN202111110503.0A 2021-09-23 2021-09-23 Video-based wireless signal enhancement and cross-target gesture recognition method Active CN113963229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111110503.0A CN113963229B (en) 2021-09-23 2021-09-23 Video-based wireless signal enhancement and cross-target gesture recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111110503.0A CN113963229B (en) 2021-09-23 2021-09-23 Video-based wireless signal enhancement and cross-target gesture recognition method

Publications (2)

Publication Number Publication Date
CN113963229A true CN113963229A (en) 2022-01-21
CN113963229B CN113963229B (en) 2023-08-18

Family

ID=79461919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111110503.0A Active CN113963229B (en) 2021-09-23 2021-09-23 Video-based wireless signal enhancement and cross-target gesture recognition method

Country Status (1)

Country Link
CN (1) CN113963229B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170123487A1 (en) * 2015-10-30 2017-05-04 Ostendo Technologies, Inc. System and methods for on-body gestural interfaces and projection displays
CN108073277A (en) * 2016-11-08 2018-05-25 罗克韦尔自动化技术公司 For the virtual reality and augmented reality of industrial automation
US20190087654A1 (en) * 2017-09-15 2019-03-21 Huazhong University Of Science And Technology Method and system for csi-based fine-grained gesture recognition
CN111709295A (en) * 2020-05-18 2020-09-25 武汉工程大学 SSD-MobileNet-based real-time gesture detection and recognition method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wen Junqin; Wang Xiuhui: "Gesture recognition based on linear discriminant analysis and adaptive K-nearest neighbors", Journal of Data Acquisition and Processing, no. 03

Also Published As

Publication number Publication date
CN113963229B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
Li et al. Capturing human pose using mmWave radar
CN110555412B (en) End-to-end human body gesture recognition method based on combination of RGB and point cloud
CN107992792A (en) A kind of aerial handwritten Chinese character recognition system and method based on acceleration transducer
CN111505632B (en) Ultra-wideband radar action attitude identification method based on power spectrum and Doppler characteristics
US9117138B2 (en) Method and apparatus for object positioning by using depth images
US9081999B2 (en) Head recognition from depth image
Kogler et al. Event-based stereo matching approaches for frameless address event stereo data
KR101612605B1 (en) Method for extracting face feature and apparatus for perforimg the method
CN110287918B (en) Living body identification method and related product
CN111368635B (en) Millimeter wave-based multi-person gait recognition method and device
CN112232109A (en) Living body face detection method and system
CN110728213A (en) Fine-grained human body posture estimation method based on wireless radio frequency signals
CN103377366A (en) Gait recognition method and system
CN111476089B (en) Pedestrian detection method, system and terminal for multi-mode information fusion in image
CN113609976A (en) Direction-sensitive multi-gesture recognition system and method based on WiFi (Wireless Fidelity) equipment
CN111444488A (en) Identity authentication method based on dynamic gesture
Kekre et al. Gabor filter based feature vector for dynamic signature recognition
CN110866468A (en) Gesture recognition system and method based on passive RFID
CN103324921B (en) A kind of mobile identification method based on interior finger band and mobile identification equipment thereof
CN115343704A (en) Gesture recognition method of FMCW millimeter wave radar based on multi-task learning
CN113051972A (en) Gesture recognition system based on WiFi
KR102309453B1 (en) Social network service system based on unidentified video and method therefor
Zhang et al. 3D human pose estimation from range images with depth difference and geodesic distance
CN111142668A (en) Interaction method for positioning and activity gesture joint identification based on Wi-Fi fingerprint
CN116466827A (en) Intelligent man-machine interaction system and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant