CN115393955A - Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi


Info

Publication number: CN115393955A
Application number: CN202211000411.1A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: bvp, sequence, encoder, csi, witransformer
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion)
Inventors: Yang Mingze (杨明泽), Wu Fei (吴飞), Zhu Hai (朱海), Zhu Runzhe (朱润哲), Yang Yuncheng (杨运成)
Applicant and current assignee: Shanghai University of Engineering Science

Classifications

    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06V10/765: Image or video recognition using pattern recognition or machine learning, using classification with rules for partitioning the feature space
    • G06V10/82: Image or video recognition using pattern recognition or machine learning, using neural networks


Abstract

The invention relates to a gesture recognition method and system based on BVP and WiTransformer using WiFi. The method comprises the following steps: receiving and recording the CSI pattern caused by the motion of a recognized subject and deriving a BVP sequence from the CSI, where the BVP sequence is three-dimensional data, i.e., a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the number of BVP frames, and the number of possible values of the velocity components along the x-axis and y-axis of a BVP is denoted N; performing sequence padding and normalization on BVP sequences of different lengths to obtain processed BVP sequences; and inputting the processed BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, where the WiTransformer model is obtained by modifying the Transformer model framework. Compared with the prior art, the invention modifies the Transformer model: a temporal-information stacking fusion and frame position coding module is added before the encoder, the encoder structure is adjusted, and a classifier is added after the encoder, so that recognition remains stable and accurate even for high-complexity recognition tasks.

Description

Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi
Technical Field
The invention relates to the technical field of human sensing based on wireless signals, and in particular to a gesture recognition method and system based on BVP and WiTransformer using WiFi.
Background
Wireless human sensing refers to recognizing human behavior and activities in a sensing area using wireless electromagnetic signals such as Wi-Fi. Human activity recognition and monitoring based on wireless sensing is an important component of modern intelligent technologies such as smart healthcare, human-computer interaction, and smart cities. Initially, such recognition was implemented intuitively using computer vision and wearable devices, and both approaches can recognize human activities effectively. However, vision-based methods suffer from limitations such as privacy leakage and dependence on lighting conditions, while wearable devices lack ease of use and convenience. To protect privacy, relax usage conditions, and increase convenience, sensing and recognition systems based on Wi-Fi signals were developed. Building on the fact that Wi-Fi signals are reflected when they encounter obstacles (as shown in FIG. 1), and beginning with document [1], human activity recognition using commercial Wi-Fi has become a hotspot technology owing to characteristics such as no extra cost, no privacy disclosure, ubiquitous deployment environments, and passive sensing.
There are generally two kinds of underlying features that can be used directly for wireless sensing: Received Signal Strength Indication (RSSI) and Channel State Information (CSI). RSSI is a wireless signal feature widely used for indoor positioning and human activity recognition. However, owing to environmental noise and the superposition of signal information, RSSI measurements are coarse-grained and unstable and are suitable only for tasks with low accuracy requirements. For high-precision positioning and behavior recognition, CSI is the more reliable feature, reflecting richer fine-grained information. Raw CSI, however, contains considerable noise and usually requires certain signal-processing steps before it can be used for recognition.
In addition, the rich information in the underlying CSI can be exploited to extract more stable and reliable upper-layer recognition features. Such methods generally fall into two categories: statistical-feature-based and physical-feature-based. The former typically treats the wireless signal as time-series data and extracts the waveform and power-distribution patterns of the signal in the time or frequency domain along the time series as fingerprint information for positioning or recognition. This approach, while effective, lacks interpretability and extensibility. The latter, in contrast, rests on complete physical principles and offers sufficient interpretability and stable scalability; examples include signal time of flight (ToF), signal angle of arrival (AoA), and signal attenuation. Among these physical characteristics, Doppler frequency shift (DFS) has been adopted by a large body of research because it carries more information directly related to the state and process of motion.
However, DFS is closely tied to domain factors (i.e., factors that affect the recognition features but are unrelated to the recognized action itself, such as the orientation and location of the recognized subject), so recognition performance degrades greatly when the recognition task crosses domains. Accordingly, document [2] derives the body-coordinate velocity profile (BVP), which is independent of domain factors, from the Doppler frequency shift spectrum (DFSP) combined with a spatial coordinate transformation. Based on BVP features, document [2] implements a cross-domain gesture recognition system (hereinafter CGNN) by modeling the BVP with a combined model of a convolutional neural network (CNN) and a gated recurrent unit (GRU). As a result, Wi-Fi-based gesture recognition no longer needs to spend great effort on extracting features from underlying data and can instead move to the next stage: building stable, sensitive, and accurate recognition models around reliable, cross-domain upper-layer features.
When establishing a recognition model, related work in the prior art divides by modeling method into two categories: data-driven (learning-based) and model-based. Model-based activity recognition establishes a theoretical model with clear meaning from physical principles; for example, signal propagation principles are quantified by the changes in distance, multipath, and dynamic level caused by human activity in the Fresnel zones. Clearly, model-based recognition methods rely on fixed, strict physical principles and have strong interpretability. However, this way of extracting "hard features" tends to miss potential correlations between data, resulting in insufficient generalization. Data-driven (learning-based) approaches instead train models on large amounts of data to recognize activities, extracting "elastic features" from the data to map features to discrete actions. These elastic features include both physical and statistical characteristics. It follows that a data-driven modeling method grounded in physical principles not only retains partial interpretability but also does not lose the potential relationships within the data.
In summary, behavior recognition with a data-driven modeling method can be divided into two parts: generating, from the CSI, features usable for recognition based on physical or statistical methods; and building a recognition model on these features. The first part has made great progress as research investment has grown, while the second part still uses "outdated" modeling methods (e.g., CGNN). This makes the performance of the recognition model a research bottleneck for the recognition system.
When using BVP for Wi-Fi cross-domain recognition, the applicant found that recognition systems based on the CGNN structure lack recognition stability. First, the applicant recognized 6 common interactive gestures using CGNN with an accuracy of about 90%. However, when the applicant increased the gesture categories to 22, that is, when the number of categories multiplied and similar categories appeared, the recognition accuracy of the system dropped by about 20%. Through experiments, the applicant found that model structures which, like CGNN, rely on convolution kernels to extract spatial features and use recursion to model the time dimension often lack recognition stability when facing high-complexity recognition tasks.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art by providing a gesture recognition method and system based on BVP and WiTransformer using WiFi.
The purpose of the invention can be realized by the following technical scheme:
a gesture recognition method based on BVP and WiTransformer by utilizing WiFi comprises the following steps:
acquiring data: receiving and recording a CSI mode caused by the motion of an identification object, deducing a BVP sequence based on CSI, wherein the BVP sequence is three-dimensional data and is a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the frame number of the BVPs, and the number of possible speed component values in the BVPs along an x axis and a y axis is recorded as N;
data processing: carrying out sequence filling and normalization processing on BVP sequences with different lengths to obtain processed BVP sequences;
identification: and inputting the BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, wherein the WiTransformer model is obtained by modifying a Transformer model framework.
Further, the step of acquiring data comprises the steps of:
T1, arranging a Wi-Fi signal transmitter and receivers, and receiving and recording the CSI pattern caused by the motion of the recognized subject; the CSI at measured carrier frequency f and time t, corresponding to the complex value of the channel frequency response CFR, is expressed as:

$$H(f,t) = e^{j\varepsilon(f,t)} \sum_{k=1}^{K} a_{k}(f,t)\, e^{-j 2\pi f \tau_{k}(t)}$$

where $a_{k}(f,t)$ is the amplitude representation of the initial phase offset and attenuation of the kth path; $e^{-j2\pi f \tau_{k}(t)}$ is the phase offset of the kth path, with propagation delay $\tau_{k}(t)$; and $e^{j\varepsilon(f,t)}$ is the phase error caused by time alignment offset, sampling frequency offset, and carrier frequency offset;

T2, representing the multipath signal phase by the corresponding DFS and separating the CFR caused by human activity from the CSI, the expression of the CSI being converted into:

$$H(f,t) = e^{j\varepsilon(f,t)} \left( H_{s} + \sum_{k \in P_{d}} a_{k}(f,t)\, e^{\,j 2\pi \int_{-\infty}^{t} f_{D_{k}}(u)\, du} \right)$$

where the constant $H_{s}$ is the sum of all steady-state signals with zero DFS, the summation term is the sum of all dynamic signals with non-zero DFS, and $f_{D_{k}}(t)$ corresponds to the signal frequency shift caused by body motion;

T3, computing the conjugate multiplication of the CSI of different antennas on a receiver, eliminating high-frequency noise and random offsets, and retaining only the measurements with obvious DFS multipath components, namely the signal measurements reflected by the sensed target;

T4, filtering the principal components out of the relevant CSI subcarriers by principal component analysis, and generating the power distribution in the time and frequency domains from the principal components by short-time Fourier transform; each snapshot of the time-frequency power distribution spectrum is a Doppler frequency shift spectrum DFSP;

T5, constructing a matrix $V_{bvp}$ of dimension N × N, where N is the number of possible values of the velocity component along each axis of the body coordinate system;

T6, projecting any velocity component $\vec{v} = (v_{x}, v_{y})$ of the gesture motion onto a frequency component of the DFSP, introducing the idea of compressed sensing, and casting the BVP estimation as an $\ell_{0}$ optimization problem; solving it yields the BVP matrix $V_{bvp}$ directly mapped to the gesture motion, thereby obtaining the BVP sequence.
Further, the sequence padding and normalization of the BVP sequences are performed as follows:
all BVP sequences are padded to equal length along the time dimension T with zero matrices $O \in 0^{N \times N}$; and since the overall magnitude of a BVP sequence is affected by the transmitter power, the two-dimensional value matrix of each BVP frame is normalized.
Further, the step of modifying the Transformer model framework comprises: adding a temporal-information stacking fusion and frame position coding module before the encoder of the Transformer model framework, where the BVP sequence is processed in this module by the following steps:

S1, stacking the BVP sequence so that it is treated as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are in essence the numbers of possible values of the velocity components along the x-axis and y-axis of a BVP, i.e., H = W = N, and C is in essence the number of BVP frames in the sequence, i.e., C = T;

S2, establishing a three-dimensional one-hot matrix $I^{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embedding the position information into $B_{bvp}$ as follows:

$$\tilde{B}_{bvp} = B_{bvp} + I^{H \times W \times C}$$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;

S3, dividing $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$$\tilde{B}_{bvp} \rightarrow \left[\, tb^{1}, tb^{2}, \ldots, tb^{N_{tb}} \right], \qquad tb^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C}, \quad N_{tb} = HW / T_{tb}^{2}$$

where $N_{tb}$ is the number of pipelines obtained by the division and $T_{tb}$ is the size of a three-dimensional matrix pipeline;

S4, flattening each three-dimensional matrix pipeline and mapping the flattened vectors to a Token sequence in the high-order feature space through a linear projection E, as follows:

$$x_{i} = \mathrm{Flatten}(tb^{i})\,E, \qquad E \in \mathbb{R}^{(T_{tb}^{2} \cdot C) \times D}$$

where $\mathrm{Flatten}(tb^{i})$ is the vector obtained by flattening the pipeline $tb^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $[x_{1}; x_{2}; \ldots; x_{N_{tb}}] \in \mathbb{R}^{N_{tb} \times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;

S5, placing a learnable classification token $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence $[x_{class}; x_{1}; \ldots; x_{N_{tb}}]$; the learnable classification token $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;

S6, embedding the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder, as follows:

$$z = [x_{class};\ x_{1};\ x_{2};\ \ldots;\ x_{N_{tb}}] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N_{tb}+1) \times D}$$

where $E_{pos}$ is a learnable position coding used to store the position information between the three-dimensional matrix pipelines.
Further, the step of modifying the Transformer model framework comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and using residual connections between the network modules.
Further, a classifier is added after the encoder.
A gesture recognition system based on BVP and WiTransformer using WiFi comprises a data acquisition module, a data processing module, and a recognition module;
the data acquisition module is used for: receiving and recording the CSI pattern caused by the motion of a recognized subject, and deriving a BVP sequence from the CSI, where the BVP sequence is three-dimensional data, i.e., a group of BVPs arranged along a time dimension T; the length of the BVP sequence is determined by the number of BVP frames, and the number of possible values of the velocity components along the x-axis and y-axis of a BVP is denoted N;
the data processing module is used for: performing sequence padding and normalization on BVP sequences of different lengths to obtain processed BVP sequences;
the recognition module is used for: inputting the processed BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, where the WiTransformer model is obtained by modifying the Transformer model framework.
Further, in the recognition module, the step of modifying the Transformer model framework comprises: adding a temporal-information stacking fusion and frame position coding module before the encoder of the Transformer model framework, where the BVP sequence is processed in this module by the following steps:

S1, stacking the BVP sequence so that it is treated as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are in essence the numbers of possible values of the velocity components along the x-axis and y-axis of a BVP, i.e., H = W = N, and C is in essence the number of BVP frames in the sequence, i.e., C = T;

S2, establishing a three-dimensional one-hot matrix $I^{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embedding the position information into $B_{bvp}$ as follows:

$$\tilde{B}_{bvp} = B_{bvp} + I^{H \times W \times C}$$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;

S3, dividing $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$$\tilde{B}_{bvp} \rightarrow \left[\, tb^{1}, tb^{2}, \ldots, tb^{N_{tb}} \right], \qquad tb^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C}, \quad N_{tb} = HW / T_{tb}^{2}$$

where $N_{tb}$ is the number of pipelines obtained by the division and $T_{tb}$ is the size of a three-dimensional matrix pipeline;

S4, flattening each three-dimensional matrix pipeline and mapping the flattened vectors to a Token sequence in the high-order feature space through a linear projection E, as follows:

$$x_{i} = \mathrm{Flatten}(tb^{i})\,E, \qquad E \in \mathbb{R}^{(T_{tb}^{2} \cdot C) \times D}$$

where $\mathrm{Flatten}(tb^{i})$ is the vector obtained by flattening the pipeline $tb^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $[x_{1}; x_{2}; \ldots; x_{N_{tb}}] \in \mathbb{R}^{N_{tb} \times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;

S5, placing a learnable classification token $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence $[x_{class}; x_{1}; \ldots; x_{N_{tb}}]$; the learnable classification token $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;

S6, embedding the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder, as follows:

$$z = [x_{class};\ x_{1};\ x_{2};\ \ldots;\ x_{N_{tb}}] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N_{tb}+1) \times D}$$

where $E_{pos}$ is a learnable position coding used to store the position information between the three-dimensional matrix pipelines.
Further, in the recognition module, the step of modifying the Transformer model framework comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and using residual connections between the network modules.
Further, a classifier is added after the encoder.
Compared with the prior art, the invention has the following beneficial effects:
(1) A temporal-information stacking fusion and frame position coding module is added before the encoder, in which BVP stacking, pipeline embedding, and two rounds of position-information embedding are performed. Compared with a two-stream structure, this module uses the stacked BVPs to fuse the three-dimensional spatio-temporal information in advance, and embeds the three-dimensional information pipelines into a one-dimensional information token sequence (Tokens) by matrix reshaping. On the premise that the recognition accuracy is not greatly reduced, the computational complexity of the encoder model is greatly reduced: only one encoder is needed to encode the three-dimensional spatio-temporal features.
(2) The encoder structure is modified. Compared with the original Transformer encoder, the position of the normalization layer is adjusted (moved from after to before each network module) and the layer normalization layer is replaced with a batch normalization layer, which accelerates model convergence while preventing overfitting and vanishing gradients.
Drawings
FIG. 1 is a schematic diagram of a reflection of a Wi-Fi signal after encountering an obstacle;
FIG. 2 is an architecture diagram of a WiTransformer;
FIG. 3 is a schematic diagram of the pipeline (Tube) division;
FIG. 4 shows the modified encoder structure;
FIG. 5 is an overall flow diagram of gesture recognition;
fig. 6 is a flowchart for generating BVP sequences;
FIG. 7 is a schematic diagram of the arrangement of transmitters and receivers;
fig. 8 is a time-frequency power distribution spectrum snapshot generated in a Wi-Fi antenna link.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The embodiments are implemented on the basis of the technical solution of the present invention, and detailed implementations and specific operation processes are given. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention, and the protection scope of the present invention is not limited to the following embodiments. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic may be included in at least one implementation of the invention. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The present specification provides method steps as in the examples or flow diagrams, but may include more or fewer steps based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In actual system or server product execution, the steps in the method according to the embodiment or the figures may be executed sequentially or in parallel (for example, in the context of parallel processors or multi-thread processing), or the execution order of the steps without timing limitation may be adjusted.
Example 1:
Through research, the inventors found that, influenced by other leading-edge fields (CV, NLP), existing Wi-Fi gesture recognition methods model features along network structures with CNNs and RNNs as backbones; for example, document [2] models the BVP by combining a gated recurrent unit (GRU) with a convolutional neural network (CNN). However, CNN-based models rely on the local feature extraction of convolution kernels and thus lack global feature correlation, so such systems cannot withstand problems such as easily confused features, an increase in classification categories, and partial feature loss. The structure of a recurrent neural network (RNN) can only model serially along a sequence; when facing long sequence features, RNN-based systems are limited by memory constraints and struggle to capture long-range feature dependencies and correlations.
After analysis, the inventors propose that modeling can be performed with a Transformer-based model framework. However, existing Transformer-based model frameworks focus mainly on single-dimensional modeling of time or space, while BVPs contain spatial and temporal information at the same time, and there is little relevant, transferable experience in the field of Wi-Fi-based gesture recognition (i.e., an embedding method for mapping joint spatio-temporal information into a high-dimensional feature space). The existing Vision Transformer (ViT) classification model is a framework that can be intuitively migrated to the BVP-based gesture recognition task. However, since there is no correlation between the feature maps of an image signal, the ViT architecture has no corresponding feature-extraction structure to supplement the positional relationships and correlations between feature maps.
To solve the above problems, the inventors propose a gesture recognition method based on BVP and WiTransformer using WiFi, which modifies the Transformer encoder, adds a temporal-information stacking fusion and frame position coding module to give it spatio-temporal feature extraction capability, and uses a channel position embedding structure to embed position codes into the otherwise unordered channels, so that a one-dimensional model can simultaneously extract the temporal information of the third dimension.
Before Wi-Fi gesture recognition is carried out, the WiTransformer must first be constructed and trained. In the application stage, as shown in FIG. 5, the CSI of the recognized subject is collected first, the CSI is then converted into a BVP sequence, and the sequence is finally fed to the trained WiTransformer for gesture recognition. The method comprises the following steps:
(1) Constructing a data set;
in this embodiment, a public data set is used as a training set to train the WiTransformer, and the public data set document [5] may be referred to as specific experimental device position parameters in the data set. The collected data includes 22 gesture instances from 8 positions and 5 orientations of 17 recognition objects in 3 rooms. Theoretically 44880 instances from 2040 (17 × 3 × 8 × 5) mixed multi-domains can be used for model training. For practical applications, consider repeated measurements and objective factor effects, please refer to public data set literature [5]. In other embodiments, the system can also build itself up
(2) Constructing and training a WiTransformer:
the architecture of the WiTransformer is shown in FIG. 2. Based on the characteristics of BVP data, the present application proposes a "Tube Embedding" (Tube Embedding) method to enable BVPs to be embedded in a higher dimensional feature space than the original transform encoder. Meanwhile, the internal operation unit of the original Transformer encoder structure is also adjusted, so that the Witransformer can adapt to the data characteristics of the BVP. It should be emphasized that, in fig. 2, the BVP sequence is represented as an image only for intuitive visualization, so as to facilitate understanding how BVP is inferred in the witransform, but there is an essential difference between the BVP sequence and BVP and image data.
The method for modifying the Transformer model framework comprises the following steps: adding a temporal-information stacking fusion and frame position coding module before the encoder of the Transformer model framework, where the BVP sequence is processed in this module by the following steps:
S1, stacking the BVP sequence so that it is treated as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are in essence the numbers of possible values of the velocity components along the x-axis and y-axis of a BVP, i.e., H = W = N, and C is in essence the number of BVP frames in the sequence, i.e., C = T;

The present application treats a BVP sequence as a piece of multi-channel image data, where H and W are concepts borrowed from images but in essence are the numbers of possible values of the velocity components along the x-axis and y-axis, i.e., H = W = N; accordingly, what corresponds to the number of image feature maps is the number of BVP frames, i.e., C = T. This process is similar to 3D convolution.

S2, establishing a three-dimensional one-hot matrix $I^{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embedding the position information into $B_{bvp}$ as follows:

$$\tilde{B}_{bvp} = B_{bvp} + I^{H \times W \times C}$$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;

S3, dividing $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$$\tilde{B}_{bvp} \rightarrow \left[\, tb^{1}, tb^{2}, \ldots, tb^{N_{tb}} \right], \qquad tb^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C}, \quad N_{tb} = HW / T_{tb}^{2}$$

where $N_{tb}$ is the number of pipelines obtained by the division and $T_{tb}$ is the size of a three-dimensional matrix pipeline;

To embed the three-dimensional BVP sequence data input to the model into a one-dimensional token (Token) sequence, the present application reshapes the BVP sequence $\tilde{B}_{bvp}$ into a series of three-dimensional matrix pipelines; $T_{tb}$, the size of the matrix pipelines obtained by the division, determines the effective sequence length input to the encoder, and its dimensions correspond to H and W.
All hidden-layer constant vector sizes used in the encoder are D, i.e., the core arithmetic unit MSA of the WiTransformer encoder does not change the vector dimension.
S4, flattening (Flatten) each three-dimensional matrix pipeline and mapping the flattened vectors to a Token sequence in the high-order feature space through a linear projection E, as follows:

$$x_{i} = \mathrm{Flatten}(tb^{i})\,E, \qquad E \in \mathbb{R}^{(T_{tb}^{2} \cdot C) \times D}$$

where $\mathrm{Flatten}(tb^{i})$ is the vector obtained by flattening the pipeline $tb^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $[x_{1}; x_{2}; \ldots; x_{N_{tb}}] \in \mathbb{R}^{N_{tb} \times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;

After reshaping, a "one-piece" complete BVP sequence (consisting of T channels of BVPs) is split into $N_{tb}$ Tubes, and the time dimension T of each Tube is retained. Each three-dimensional matrix pipeline is flattened, and the vectors produced by flattening are mapped into a Token sequence in the high-order feature space through the trainable linear projection E. As shown in FIG. 3, this process is referred to herein as Tube Embedding:

$$[x_{1}; x_{2}; \ldots; x_{N_{tb}}] = \left[\mathrm{Flatten}(tb^{1})E;\ \ldots;\ \mathrm{Flatten}(tb^{N_{tb}})E\right] \in \mathbb{R}^{N_{tb} \times D}$$
S5, placing a learnable classification token $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence $[x_{class}; x_{1}; \ldots; x_{N_{tb}}]$; the learnable classification token $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;

S6, embedding the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder, as follows:

$$z = [x_{class};\ x_{1};\ x_{2};\ \ldots;\ x_{N_{tb}}] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N_{tb}+1) \times D}$$

where $E_{pos}$ is a learnable position coding used to store the position information between the three-dimensional matrix pipelines; position 0 corresponds to $x_{class}$, and positions 1 to $N_{tb}$ correspond to the pipelines. In this embodiment, $N_{tb}$ is 9.
The core computational unit of the Transformer is MSA, which performs pairwise calculations over all Tokens. In this process, the relative position of each Tube within the BVP is not considered; more importantly, features cannot be processed sequentially along the sequence as RNNs do. Therefore, a learnable position coding is needed to store the relative position information between the Tubes. Spatial features are extracted with a Transformer encoder, and in this respect the BVP is the same as image data. Unlike image data, however, digital images do not require 3D segmentation along the "channel direction", because there is no dependency between image channels (between feature maps). In contrast, the BVPs within a BVP sequence depend on each other and have a temporal ordering, so the relative position information must be embedded before the segmentation. In response to these two requirements (the dependency relationship and the position information between BVPs): for the former, MSA is also computed along the pipeline depth direction in Tube Embedding, which is similar to the processing manner of Conv3D (3-dimensional convolution); for the latter, a Channel Position Embedding (CPE) method is proposed to supplement the position information, i.e., step S2. Specifically, the present application uses a three-dimensional one-hot matrix $I^{H \times W \times C}$ as a position coding to preserve the relative position information between VDMs. In practice, the position information is embedded into the BVP sequence by addition.
Through experiments, the applicant found that stacking the BVP sequence as a multi-channel picture and fusing the spatio-temporal information before inputting it into the network effectively alleviates this problem. In addition, experimental comparison shows that, compared with methods that separate spatio-temporal data, the stacking method that combines spatio-temporal data effectively shortens model training time and reduces training cost. Since the BVPs in a BVP sequence are correlated, the model architecture of the Vision Transformer cannot be applied directly; the present application therefore proposes the CPE structure and the Tube Embedding manner to implement MSA in three-dimensional space and finally achieve recognition.
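To make the data flow of this module concrete, the following is a minimal PyTorch sketch of the stacking fusion and frame position coding described above. The tensor sizes (N = 20, T = 30, T_tb = 5, D = 128), the class name, and the exact construction of the one-hot matrix I (here, a distinct one-hot code per frame) are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackFusionFramePositionCoding(nn.Module):
    """Stacks a BVP sequence into a multi-channel tensor, adds channel
    position coding (CPE), splits the result into Tubes, and projects each
    Tube to a D-dimensional token (Tube Embedding)."""

    def __init__(self, n=20, t=30, t_tb=5, dim=128):   # assumed sizes
        super().__init__()
        assert n % t_tb == 0 and t <= n * n
        self.n, self.t, self.t_tb = n, t, t_tb
        self.n_tb = (n // t_tb) ** 2                   # N_tb = HW / T_tb^2
        # CPE: one plausible construction of the one-hot matrix I -- a
        # distinct one-hot code per frame, reshaped to N x N and added.
        cpe = F.one_hot(torch.arange(t), num_classes=n * n).float()
        self.register_buffer("cpe", cpe.view(t, n, n))
        self.proj = nn.Linear(t_tb * t_tb * t, dim)    # linear projection E
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))          # x_class
        self.pos = nn.Parameter(torch.zeros(1, self.n_tb + 1, dim))  # E_pos

    def forward(self, bvp):             # bvp: (B, T, N, N), padded sequence
        b, s = bvp.size(0), self.n // self.t_tb
        x = bvp + self.cpe              # embed frame order by addition
        # reshape the N x N plane into s*s Tubes of size T_tb x T_tb x T
        x = x.view(b, self.t, s, self.t_tb, s, self.t_tb)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, self.n_tb, -1)
        x = self.proj(x)                # Flatten + E: tokens (B, N_tb, D)
        x = torch.cat([self.cls.expand(b, -1, -1), x], dim=1)
        return x + self.pos             # add learnable Tube position coding
```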
The method for modifying the Transformer model framework further comprises the following steps: adjusting the structure of the encoder of the Transformer model framework, removing the layer normalization (LN) layer after each network module of the encoder, adding a batch normalization (BN) layer before each network module, and using residual connections between the network modules.
The method for modifying the Transformer model framework further comprises the following steps: adding a classifier after the encoder; in this embodiment an MLP is used, but other multi-class classifiers may be used in other embodiments.
The encoder is the core computing unit in the Transformer model architecture. Specifically, as shown in FIG. 4, the modules of the modified encoder of the present application are connected in series with residual connections to prevent degradation of the network model; a batch normalization layer is added before each module to accelerate model convergence and improve the stability of network training, preventing overfitting and vanishing gradients. In the present application, a multi-layer perceptron (MLP) serves as the classifier for the final classification and mainly comprises nonlinear mappings activated by the ReLU function, which strengthens the nonlinear fitting capability of the model. The scoring function of the encoder's core computation module MSA follows the Scaled Dot-Product Attention of the Transformer [3]. The internal operation of the WiTransformer encoder is described below with a simple example, assuming the encoder is formed by stacking L spatio-temporal encoder layers:
$$z'_{l} = \mathrm{MSA}(\mathrm{BN}(z_{l-1})) + z_{l-1}, \qquad l = 1 \ldots L$$
$$z_{l} = \mathrm{MLP}(\mathrm{BN}(z'_{l})) + z'_{l}, \qquad l = 1 \ldots L$$
the global space-time characteristics of the BVP sequence are extracted in parallel by adopting a modified Transformer encoder. Depending on the capture of the MSA to the global feature, under the condition that the recognition task is difficult to be simplified and is gradually harsh, the recognition accuracy rate of the method is only reduced by about 3 percent, and the method is 1/7-1/5 of that of other existing space-time feature extraction models.
(3) The WiTransformer is trained using the dataset.
(4) Acquiring data: receiving and recording the CSI pattern caused by the motion of the recognized subject, and deriving a BVP sequence from the CSI, where the BVP sequence is three-dimensional data, i.e., a group of BVPs arranged along a time dimension T; the length of the BVP sequence is determined by the number of BVP frames, and the number of possible values of the velocity components along the x-axis and y-axis of a BVP is denoted N;
the data acquisition comprises the following steps:
T1, arranging a Wi-Fi signal transmitter and receivers, and receiving and recording the CSI pattern caused by the motion of the recognized subject;
in the sensing region, object motion causes changes in the dynamic reflection path by changing the length of the Wi-Fi signal propagation path. Then a particular pattern of signal features (e.g., CSI, DFS, or BVP) corresponding to the perceived object behavior may be used to characterize this motion process. In this embodiment, 1 Wi-Fi signal transmitter and at least 3 receivers are required. The transmitter can adopt a general commercial Wi-Fi wireless router (specific parameters can be purchased according to experiment needs), each receiver needs to be provided with an Intel 5300 wireless network card, and a notebook computer or a microcomputer can be adopted generally. In addition, a driver corresponding to the network card is also required, and a driver designed in document [4] is generally used. After the device is turned on, the device position theoretically does not affect the recognition result, and generally the device can be placed according to the mode of fig. 7.
As shown in FIG. 6, the BVP is obtained by inference from the CSI: after appropriate noise reduction and data preprocessing, the BVP sequence is derived from the CSI information using techniques such as Doppler frequency shift, coordinate transformation, and compressed sensing.
According to the multipath effect, in the frequency domain of an indoor multipath environment, when a Wi-Fi signal reaches the receiver through K different paths, the CSI at measured carrier frequency f and time t, corresponding to the complex value of the channel frequency response (CFR), is expressed as:

$$H(f,t) = e^{j\varepsilon(f,t)} \sum_{k=1}^{K} a_{k}(f,t)\, e^{-j 2\pi f \tau_{k}(t)}$$

where $a_{k}(f,t)$ is the amplitude representation of the initial phase offset and attenuation of the kth path; $e^{-j2\pi f \tau_{k}(t)}$ is the phase offset of the kth path, with propagation delay $\tau_{k}(t)$; and $e^{j\varepsilon(f,t)}$ is the phase error caused by time alignment offset, sampling frequency offset, and carrier frequency offset.

T2, representing the multipath signal phase by the corresponding DFS and separating the CFR caused by human activity from the CSI, the expression of the CSI is converted into:

$$H(f,t) = e^{j\varepsilon(f,t)} \left( H_{s} + \sum_{k \in P_{d}} a_{k}(f,t)\, e^{\,j 2\pi \int_{-\infty}^{t} f_{D_{k}}(u)\, du} \right)$$

where the constant $H_{s}$ is the sum of all steady-state signals with zero DFS, the summation term is the sum of all dynamic signals with non-zero DFS, and $f_{D_{k}}(t)$ corresponds to the signal frequency shift caused by body motion. At this point, the CFR caused by human motion has been isolated from the CSI.
T3, computing the conjugate multiplication of the CSI of different antennas on a receiver, eliminating high-frequency noise and random offsets, and retaining only the measurements with obvious DFS multipath components, namely the signal measurements reflected by the sensed target (a sketch of this operation follows);
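As an illustration of step T3, the sketch below assumes a packets x subcarriers x antennas CSI layout (an editorial assumption). Multiplying each antenna's CSI by the conjugate of a reference antenna cancels the common phase error $e^{j\varepsilon(f,t)}$, since that term is shared by all antennas of one receiver; subsequent filtering then removes high-frequency noise and residual offsets.

```python
import numpy as np


def conjugate_multiply(csi, ref_antenna=0):
    """csi: complex array (packets, subcarriers, antennas), assumed layout.
    Returns CSI with the common random phase error cancelled; the phase of
    the result reflects only path differences between antennas."""
    ref = np.conj(csi[:, :, ref_antenna:ref_antenna + 1])
    return csi * ref
```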
T4, filtering the principal components out of the relevant CSI subcarriers by principal component analysis (PCA), and generating the power distribution in the time and frequency domains from the principal components by short-time Fourier transform (STFT). FIG. 8 shows a snapshot of the time-frequency power distribution spectrum generated on a Wi-Fi antenna link. Along time, each snapshot is a Doppler frequency shift spectrum (DFSP), a matrix $D_{dfs}$ of dimension F × M, where F is the number of sampling points and M is the number of transceiver link pairs;
when a person performs a gesture, his body parts (e.g., arms and hands) move at different speeds. The signals of the multiple links caused by limb movement are superimposed at the receiver and form a corresponding DFSP. Heretofore, these DFS-based signal power distributions at different frequencies have been used to quantify the relationship between the speed of motion of different body parts of a human and specific human activities. It should be noted, however, that while DFS can already be used for activity recognition, DFS-based sensing is not cross-domain capable since its reasoning process is closely related to the location and orientation of the sensing object.
T5, constructing a matrix $V_{bvp}$ of dimension N × N, where N is the number of possible values of the velocity component along each axis of the body coordinate system;
A BVP sequence is composed of a series of VDMs, and each BVP can be quantized as the matrix $V_{bvp}$.
T6, converting the position information on which $V_{bvp}$ depends by coordinate transformation, and obtaining the BVP sequence using the compressed sensing technique.
The position information on which $V_{bvp}$ depends is converted from the global (environment) coordinate system into the local body coordinate system by coordinate transformation; the coordinate origin is the position of the sensed subject, and the positive x-axis coincides with the person's orientation. Specifically, assuming the transmitter and receiver positions of the ith link are known (as mentioned above, they can be regarded as settable), then in the body coordinate system an arbitrary velocity component $\vec{v} = (v_{x}, v_{y})$ has its signal power allocated to the following frequency component of the ith link in the DFSP:

$$f^{(i)}(\vec{v}) = a_{x}^{(i)} v_{x} + a_{y}^{(i)} v_{y}$$

where the coefficients $a_{x}^{(i)}$ and $a_{y}^{(i)}$ are determined by the transmitter and receiver positions, specifically by measuring the positions of the transmitter and receiver [2]; $\vec{v}$ is the velocity in the global coordinate system. FIG. 8 shows 3 velocity components $\vec{v}_{1}, \vec{v}_{2}, \vec{v}_{3}$ generated by a human body projected (allocated) separately onto the DFSPs of the three links.
Since the coefficients $a_{x}^{(i)}$ and $a_{y}^{(i)}$ are related only to the position of the ith link, the projection relationship of the BVP on the ith link is fixed. Therefore, the relationship between the DFSP and the BVP of the ith link can be modeled as:

$$D^{(i)} = c^{(i)} A^{(i)} V_{bvp}, \qquad A^{(i)} \in \{0,1\}^{F \times N^{2}}$$

where $c^{(i)}$ is a scaling factor determined by the propagation loss of the reflected signal; $A^{(i)}$ is the allocation matrix, whose element $A^{(i)}_{j,k} = 1$ if and only if $f_{j} = f^{(i)}(\vec{v}_{k})$, with $f_{j}$ the jth frequency sample in the DFSP and $\vec{v}_{k}$ the corresponding kth element of the vectorized BVP matrix. The BVP solution, combined with the Earth Mover's Distance (EMD), is then expressed as an $\ell_{0}$ optimization. Finally, the BVP corresponding to the motion is obtained from the above model using the compressed sensing technique.
(5) Processing data: a BVP can be understood as a two-dimensional Velocity Distribution Matrix (VDM), and a BVP sequence as a set of VDMs arranged along the time dimension (its data structure is detailed in the BVP generation above). Since the sampling duration of each sample cannot be accurately controlled during data acquisition, each BVP sequence has a different length, i.e., a different number of BVP frames. The present application pads all BVP sequences to equal length along the time dimension T with zero matrices $O \in 0^{N \times N}$, where N is the number of possible values of the velocity components along the x-axis and y-axis of a BVP. This process is referred to herein as Total Padding.
Since general commercial Wi-Fi devices have a power adjustment mechanism serving the communication function, the overall magnitude of the BVP may change with the transmitter power. The values therefore need to be adjusted to a fixed range, so the two-dimensional value matrix of each BVP frame is normalized (Normalization).
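A minimal sketch of Total Padding and per-frame normalization follows; per-frame max scaling is one plausible normalization scheme (the text above only requires a fixed value range), and the function name is an editorial choice.

```python
import numpy as np


def pad_and_normalize(frames, t_max):
    """frames: list of (N, N) BVP matrices of varying count. Normalizes
    each frame to a fixed value range (per-frame max scaling, one
    plausible scheme) and pads with zero matrices O in 0^(N x N) to
    length t_max (Total Padding)."""
    n = frames[0].shape[0]
    out = []
    for f in frames:
        m = np.abs(f).max()
        out.append(f / m if m > 0 else f)     # scale to [0, 1]
    out += [np.zeros((n, n))] * (t_max - len(out))
    return np.stack(out)                       # (t_max, N, N)
```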
(6) Inputting the processed BVP sequence into the trained WiTransformer model to obtain the gesture classification result.
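Composing the two module sketches given earlier, a hypothetical end-to-end WiTransformer might look as follows. The encoder depth (L = 6) and the tensor sizes are assumed; the 22 output classes follow the gesture set of this embodiment.

```python
import torch
import torch.nn as nn


class WiTransformer(nn.Module):
    """Hypothetical composition of the embedding and encoder sketches."""

    def __init__(self, n=20, t=30, t_tb=5, dim=128, depth=6, classes=22):
        super().__init__()
        self.embed = StackFusionFramePositionCoding(n, t, t_tb, dim)
        self.encoder = nn.Sequential(
            *[WiTransformerEncoderLayer(dim) for _ in range(depth)])
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, classes))  # MLP classifier

    def forward(self, bvp):                 # bvp: (B, T, N, N)
        z = self.encoder(self.embed(bvp))
        return self.head(z[:, 0])           # classify from the class token


model = WiTransformer()
logits = model(torch.randn(2, 30, 20, 20))  # two padded BVP sequences
pred = logits.argmax(dim=-1)                # predicted gesture classes
```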
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the above steps or functions. As such, the software programs (including associated data structures) of the present application can be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
Example 2:
a gesture recognition system based on a BVP and a WiTransformer by utilizing WiFi comprises a data acquisition module, a data processing module and a recognition module;
the data acquisition module is used for: receiving and recording a CSI mode caused by the motion of an identification object, deducing a BVP sequence based on CSI, wherein the BVP sequence is three-dimensional data and is a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the frame number of the BVPs, and the number of possible values of speed components along an x axis and a y axis in the BVPs is recorded as N;
the data processing module is used for: carrying out sequence filling and normalization processing on BVP sequences with different lengths to obtain processed BVP sequences;
the identification module is to: and inputting the BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, wherein the WiTransformer model is obtained by modifying a Transformer model framework.
In the recognition module, the step of modifying the Transformer model framework comprises: adding a temporal-information stacking fusion and frame position coding module before the encoder of the Transformer model framework, where the BVP sequence is processed in this module by the following steps:

S1, stacking the BVP sequence so that it is treated as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are in essence the numbers of possible values of the velocity components along the x-axis and y-axis of a BVP, i.e., H = W = N, and C is in essence the number of BVP frames in the sequence, i.e., C = T;

S2, establishing a three-dimensional one-hot matrix $I^{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embedding the position information into $B_{bvp}$ as follows:

$$\tilde{B}_{bvp} = B_{bvp} + I^{H \times W \times C}$$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;

S3, dividing $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$$\tilde{B}_{bvp} \rightarrow \left[\, tb^{1}, tb^{2}, \ldots, tb^{N_{tb}} \right], \qquad tb^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C}, \quad N_{tb} = HW / T_{tb}^{2}$$

where $N_{tb}$ is the number of pipelines obtained by the division and $T_{tb}$ is the size of a three-dimensional matrix pipeline;

S4, flattening each three-dimensional matrix pipeline and mapping the flattened vectors to a Token sequence in the high-order feature space through a linear projection E, as follows:

$$x_{i} = \mathrm{Flatten}(tb^{i})\,E, \qquad E \in \mathbb{R}^{(T_{tb}^{2} \cdot C) \times D}$$

where $\mathrm{Flatten}(tb^{i})$ is the vector obtained by flattening the pipeline $tb^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $[x_{1}; x_{2}; \ldots; x_{N_{tb}}] \in \mathbb{R}^{N_{tb} \times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;

S5, placing a learnable classification token $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence $[x_{class}; x_{1}; \ldots; x_{N_{tb}}]$; the learnable classification token $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;

S6, embedding the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder, as follows:

$$z = [x_{class};\ x_{1};\ x_{2};\ \ldots;\ x_{N_{tb}}] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N_{tb}+1) \times D}$$

where $E_{pos}$ is a learnable position coding used to store the position information between the three-dimensional matrix pipelines.
In the recognition module, the step of modifying the Transformer model framework further comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and using residual connections between the network modules.
The method for modifying the Transformer model framework further comprises: adding a classifier after the encoder; in this embodiment an MLP is used, but other multi-class classifiers may be used in other embodiments.
The relevant references to which this application relates are as follows:
[1] Wang, W., Liu, A. X., Shahzad, M., Ling, K., & Lu, S. (2015). Understanding and modeling of WiFi signal based human activity recognition. Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. https://doi.org/10.1145/2789168.2790093
[2] Zhang, Y., Zheng, Y., Qian, K., Zhang, G., Liu, Y., Wu, C., & Yang, Z. (2021). Widar3.0: Zero-effort cross-domain gesture recognition with Wi-Fi. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/tpami.2021.3105387
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Neural Information Processing Systems.
[4] Halperin, D., Hu, W., Sheth, A., & Wetherall, D. (2011). Tool release: Gathering 802.11n traces with channel state information. ACM SIGCOMM Computer Communication Review, 41(1), 53. https://doi.org/10.1145/1925861.1925870
[5] Yang, Z., Zhang, Y., Zhang, G., & Zheng, Y. Widar 3.0: WiFi-based activity recognition dataset. https://doi.org/10.21227/7znf-qp86
the foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A gesture recognition method based on BVP and WiTransformer by utilizing WiFi is characterized by comprising the following steps:
acquiring data: receiving and recording a CSI mode caused by the motion of an identification object, deducing a BVP sequence based on CSI, wherein the BVP sequence is three-dimensional data and is a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the frame number of the BVPs, and the number of possible velocity component values in the BVPs along an x axis and a y axis is recorded as N;
data processing: carrying out sequence filling and normalization processing on BVP sequences with different lengths to obtain processed BVP sequences;
identification: and inputting the BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, wherein the WiTransformer model is obtained by modifying a Transformer model framework.
2. The method of claim 1, wherein the step of obtaining data comprises the steps of:
T1: deploy a Wi-Fi signal transmitter and receiver, and receive and record the CSI patterns caused by the motion of the recognized object; at carrier frequency f and time t, the CSI is expressed as the complex value of the channel frequency response (CFR):

$H(f,t) = e^{-j\epsilon(f,t)} \sum_{k=1}^{K} a_k(f,t)\, e^{-j 2\pi f \tau_k(t)}$

where $a_k(f,t)$ is the amplitude representation of the initial phase offset and attenuation of the k-th path; $e^{-j 2\pi f \tau_k(t)}$ is the phase offset of the k-th path with propagation delay $\tau_k(t)$; and $e^{-j\epsilon(f,t)}$ is the phase error caused by time alignment offset, sampling frequency offset, and carrier frequency offset;
T2: represent the multipath signal phase with the corresponding DFS and separate the CFR caused by human activity from the CSI, converting the CSI expression into:

$H(f,t) = e^{-j\epsilon(f,t)} \Big( H_s(f) + \sum_{k \in P_d} a_k(f,t)\, e^{\,j 2\pi \int_{-\infty}^{t} f_{D_k}(u)\, du} \Big)$

where the constant $H_s$ is the sum of all steady-state signals with zero DFS; the term $\sum_{k \in P_d} a_k(f,t)\, e^{\,j 2\pi \int_{-\infty}^{t} f_{D_k}(u)\, du}$ is the sum of all dynamic signals with non-zero DFS; and $f_{D_k}$ corresponds to the signal frequency shift caused by body motion;
T3: compute the conjugate multiplication of the CSI from different antennas on the receiver to eliminate high-frequency noise and random offsets, keeping only the measurements with salient DFS multipath components, i.e. the signal measurements reflected off the sensing target;
T4: filter the principal components out of the correlated CSI subcarriers through principal component analysis, and generate the power distribution over the time and frequency domains from the principal components through a short-time Fourier transform; a snapshot of this time-frequency power distribution spectrum is the Doppler frequency shift spectrum (DFSP);
T5: construct a matrix $V_{bvp}$ of dimension N × N, where N is the number of possible values of the velocity component along each axis of the body coordinate system;
T6: project any velocity component $\vec{v} = (v_x, v_y)$ of the gesture motion onto a frequency component of the DFSP; introducing the idea of compressed sensing, treat the BVP estimation as an $l_0$ optimization problem, and solve it to obtain the BVP matrix $V_{bvp}$ directly mapped to the gesture motion, thereby obtaining the BVP sequence.
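For illustration, a minimal NumPy/SciPy sketch of the signal-processing chain in steps T3 and T4 (conjugate multiplication between antennas, principal component analysis, short-time Fourier transform) follows; the array shapes, the 1 kHz sampling rate, and the STFT window length are assumptions of the sketch, and the $l_0$ optimization of T5 and T6 is not shown.

```python
import numpy as np
from scipy.signal import stft

def csi_to_dfsp(csi, fs=1000):
    """T3-T4 sketch. csi: complex array of shape (time, antennas, subcarriers)."""
    # T3: conjugate-multiply antenna 0 with antenna 1; the common random
    # phase error term e^{-j eps(f,t)} cancels in the product
    conj = csi[:, 0, :] * np.conj(csi[:, 1, :])     # (time, subcarriers)
    # T4a: PCA across the correlated subcarriers, keep the principal component
    x = conj - conj.mean(axis=0)
    cov = x.conj().T @ x                            # Hermitian covariance
    _, vecs = np.linalg.eigh(cov)                   # eigenvalues in ascending order
    pc = x @ vecs[:, -1]                            # dominant component over time
    # T4b: STFT of the principal component -> time-frequency power distribution
    f, t, zxx = stft(pc, fs=fs, nperseg=256, return_onesided=False)
    dfsp = np.abs(zxx) ** 2                         # Doppler frequency shift spectrum
    return f, t, dfsp
```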
3. The method of claim 2, wherein the sequence filling and normalization processing of the BVP sequences is performed as follows:
all BVP sequences are padded to equal length along the time dimension T with zero matrices $O = \mathbf{0}_{N \times N}$; and, since the overall magnitude of a BVP sequence is affected by the transmitter power, the two-dimensional value matrix of each BVP frame is normalized.
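A short illustrative sketch of this padding and normalization follows; the per-frame max normalization is one reasonable assumed choice, since the claim prescribes normalization but not a specific scheme, and target_len is a hypothetical common sequence length.

```python
import numpy as np

def pad_and_normalize(bvp_seq, target_len):
    """bvp_seq: array (T, N, N) -> zero-padded, per-frame-normalized (target_len, N, N)."""
    t, n, _ = bvp_seq.shape
    out = np.zeros((target_len, n, n))              # zero matrices O = 0_{N x N}
    for i in range(min(t, target_len)):
        frame = bvp_seq[i]
        peak = np.abs(frame).max()
        # normalize each frame so transmitter power does not bias the magnitude
        out[i] = frame / peak if peak > 0 else frame
    return out
```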
4. The method of claim 1, wherein the adapting of the Transformer model framework comprises: adding a timing information stacking fusion and frame position coding module before the encoder of the Transformer model framework, wherein the BVP sequence is processed in the timing information stacking fusion and frame position coding module by the following steps:
S1: stack the BVP sequence so that it can be regarded as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are essentially the numbers of possible velocity component values in a BVP along the x-axis and y-axis, i.e. H = W = N, and C is essentially the number of BVP frames in the BVP sequence, i.e. C = T;
S2: establish a three-dimensional one-hot matrix $I_{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embed this position information into $B_{bvp}$:

$\tilde{B}_{bvp} = B_{bvp} + I_{H \times W \times C}$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;
S3: divide $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$\tilde{B}_{bvp} \rightarrow \{\, x_{tb}^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C} \,\}_{i=1}^{N_{tb}},\quad N_{tb} = HW / T_{tb}^{2}$

where $N_{tb}$ is the number of divided pipelines and $T_{tb}$ denotes the dimensions of a three-dimensional matrix pipeline;
S4: flatten each three-dimensional matrix pipeline, and map the flattened vectors into a Token sequence in the high-order feature space through a linear projection E:

$z_{tk} = [\mathrm{vec}(x_{tb}^{1})E;\ \mathrm{vec}(x_{tb}^{2})E;\ \ldots;\ \mathrm{vec}(x_{tb}^{N_{tb}})E],\quad E \in \mathbb{R}^{(T_{tb}^{2}C)\times D}$

where $\mathrm{vec}(x_{tb}^{i})$ denotes the vector obtained by flattening the three-dimensional matrix pipeline $x_{tb}^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $z_{tk} \in \mathbb{R}^{N_{tb}\times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;
S5: set a learnable classification label $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence

$z' = [x_{class};\ z_{tk}]$

where the learnable classification label $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;
S6: embed the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder:

$z = z' + E_{pos},\quad E_{pos} \in \mathbb{R}^{(N_{tb}+1)\times D}$

where $E_{pos}$ is a learnable position encoding used to store the position information between the three-dimensional matrix pipelines.
5. The method of claim 1, wherein the adapting of the Transformer model framework comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and connecting the network modules through residual connections.
6. The method of claim 5, wherein a classifier is added after the encoder.
7. A gesture recognition system based on BVP and WiTransformer by utilizing WiFi is characterized by comprising a data acquisition module, a data processing module and a recognition module;
the data acquisition module is used for: receiving and recording a CSI mode caused by the motion of an identification object, deducing a BVP sequence based on CSI, wherein the BVP sequence is three-dimensional data and is a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the frame number of the BVPs, and the number of possible speed component values in the BVPs along an x axis and a y axis is recorded as N;
the data processing module is used for: carrying out sequence filling and normalization processing on BVP sequences with different lengths to obtain processed BVP sequences;
the identification module is configured to: and inputting the BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, wherein the WiTransformer model is obtained by modifying a Transformer model framework.
8. The system of claim 7, wherein modifying the Transformer model framework in the recognition module comprises: adding a timing information stacking fusion and frame position coding module before the encoder of the Transformer model framework, wherein the BVP sequence is processed in the timing information stacking fusion and frame position coding module by the following steps:
S1: stack the BVP sequence so that it can be regarded as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are essentially the numbers of possible velocity component values in a BVP along the x-axis and y-axis, i.e. H = W = N, and C is essentially the number of BVP frames in the BVP sequence, i.e. C = T;
S2: establish a three-dimensional one-hot matrix $I_{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embed this position information into $B_{bvp}$:

$\tilde{B}_{bvp} = B_{bvp} + I_{H \times W \times C}$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;
S3: divide $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$\tilde{B}_{bvp} \rightarrow \{\, x_{tb}^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C} \,\}_{i=1}^{N_{tb}},\quad N_{tb} = HW / T_{tb}^{2}$

where $N_{tb}$ is the number of divided pipelines and $T_{tb}$ denotes the dimensions of a three-dimensional matrix pipeline;
S4: flatten each three-dimensional matrix pipeline, and map the flattened vectors into a Token sequence in the high-order feature space through a linear projection E:

$z_{tk} = [\mathrm{vec}(x_{tb}^{1})E;\ \mathrm{vec}(x_{tb}^{2})E;\ \ldots;\ \mathrm{vec}(x_{tb}^{N_{tb}})E],\quad E \in \mathbb{R}^{(T_{tb}^{2}C)\times D}$

where $\mathrm{vec}(x_{tb}^{i})$ denotes the vector obtained by flattening the three-dimensional matrix pipeline $x_{tb}^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $z_{tk} \in \mathbb{R}^{N_{tb}\times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;
S5: set a learnable classification label $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence

$z' = [x_{class};\ z_{tk}]$

where the learnable classification label $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;
S6: embed the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder:

$z = z' + E_{pos},\quad E_{pos} \in \mathbb{R}^{(N_{tb}+1)\times D}$

where $E_{pos}$ is a learnable position encoding used to store the position information between the three-dimensional matrix pipelines.
9. The system of claim 7, wherein modifying the Transformer model framework in the recognition module comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and connecting the network modules through residual connections.
10. The system of claim 9, wherein a classifier is added after the encoder.
CN202211000411.1A 2022-08-19 2022-08-19 Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi Withdrawn CN115393955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211000411.1A CN115393955A (en) 2022-08-19 2022-08-19 Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211000411.1A CN115393955A (en) 2022-08-19 2022-08-19 Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi

Publications (1)

Publication Number Publication Date
CN115393955A true CN115393955A (en) 2022-11-25

Family

ID=84119867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211000411.1A Withdrawn CN115393955A (en) 2022-08-19 2022-08-19 Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi

Country Status (1)

Country Link
CN (1) CN115393955A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434334A (en) * 2023-03-28 2023-07-14 湖南工商大学 WiFi human body gesture recognition method based on transducer, electronic equipment and storage medium
CN116434334B (en) * 2023-03-28 2024-02-06 湖南工商大学 WiFi human body gesture recognition method based on transducer, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Hsieh et al. Deep learning-based indoor localization using received signal strength and channel state information
Zhao et al. Convolutional neural network and dual-factor enhanced variational Bayes adaptive Kalman filter based indoor localization with Wi-Fi
KR20230028249A (en) Apparatuses and methods for 3D posture estimation
Wang et al. A unified framework for guiding generative ai with wireless perception in resource constrained mobile edge networks
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
US12000945B2 (en) Visual and RF sensor fusion for multi-agent tracking
Zou et al. Joint adversarial domain adaptation for resilient WiFi-enabled device-free gesture recognition
He et al. A robust CSI-based Wi-Fi passive sensing method using attention mechanism deep learning
Yan et al. Device-free activity detection and wireless localization based on CNN using channel state information measurement
Kabir et al. CSI-IANet: An inception attention network for human-human interaction recognition based on CSI signal
Wei et al. RSSI-based location fingerprint method for RFID indoor positioning: a review
CN115393955A (en) Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi
Yan et al. Joint activity recognition and indoor localization with WiFi sensing based on multi-view fusion strategy
Bulugu Gesture recognition system based on cross-domain CSI extracted from Wi-Fi devices combined with the 3D CNN
Ayinla et al. SALLoc: An Accurate Target Localization In Wifi-Enabled Indoor Environments Via Sae-Alstm
Liu et al. UniFi: A Unified Framework for Generalizable Gesture Recognition with Wi-Fi Signals Using Consistency-guided Multi-View Networks
Shen et al. WiAgent: Link selection for CSI-based activity recognition in densely deployed wi-Fi environments
CN115469303A (en) Cognitive biological radar method and device for detecting human body posture and vital signs
Zhong et al. Point‐convolution‐based human skeletal pose estimation on millimetre wave frequency modulated continuous wave multiple‐input multiple‐output radar
Zhang et al. WiFi-Based Indoor Human Activity Sensing: A Selective Sensing Strategy and a Multi-Level Feature Fusion Approach
Bian et al. SimpleViTFi: A lightweight vision transformer model for Wi-Fi-based person identification
Tiku et al. A Scalable Framework for Indoor Localization Using Convolutional Neural Networks
Xu et al. Real-time robust and precise kernel learning for indoor localization under the internet of things
Huang et al. Sparse representation for device-free human detection and localization with COTS RFID
Chai et al. Tourist Street View Navigation and Tourist Positioning Based on Multimodal Wireless Virtual Reality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20221125)