CN115393955A - Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi


Info

Publication number: CN115393955A
Application number: CN202211000411.1A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: bvp, sequence, encoder, csi, witransformer
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion)
Inventors: Yang Mingze (杨明泽), Wu Fei (吴飞), Zhu Hai (朱海), Zhu Runzhe (朱润哲), Yang Yuncheng (杨运成)
Applicant and current assignee: Shanghai University of Engineering Science

Classifications

    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06V10/765: Image or video recognition using pattern recognition or machine learning, using classification with rules for partitioning the feature space
    • G06V10/82: Image or video recognition using pattern recognition or machine learning, using neural networks


Abstract

The invention relates to a gesture recognition method and system based on BVP and WiTransformer using WiFi. The method comprises the following steps: receiving and recording the CSI pattern caused by the motion of a recognized subject and deriving a BVP sequence from the CSI, where the BVP sequence is three-dimensional data, i.e., a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the number of BVP frames, and the number of possible values of the velocity components along the x-axis and y-axis of a BVP is denoted N; performing sequence padding and normalization on BVP sequences of different lengths to obtain processed BVP sequences; and inputting the processed BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, where the WiTransformer model is obtained by modifying the Transformer model framework. Compared with the prior art, the invention modifies the Transformer model: a temporal-information stacking fusion and frame position coding module is added before the encoder, the encoder structure is adjusted, and a classifier is added after the encoder, so that recognition remains stable and accurate even for high-complexity recognition tasks.

Description

Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi
Technical Field
The invention relates to the technical field of human sensing based on wireless signals, and in particular to a gesture recognition method and system based on BVP and WiTransformer using WiFi.
Background
Wireless human sensing refers to recognizing human behavior and activities in a sensing area using wireless electromagnetic signals such as Wi-Fi. Human activity recognition and monitoring based on wireless sensing is an important component of modern intelligent technologies such as smart healthcare, human-computer interaction, and smart cities. Initially, such recognition was implemented intuitively using computer vision and wearable devices, and both approaches can recognize human activities effectively. However, vision-based methods suffer from limitations such as privacy leakage and dependence on lighting conditions, while wearable devices lack ease of use and convenience. To protect privacy, relax usage conditions, and increase convenience, sensing and recognition systems based on Wi-Fi signals were developed. Building on the fact that Wi-Fi signals are reflected when they encounter obstacles (as shown in FIG. 1), and beginning with document [1], human activity recognition using commercial Wi-Fi has become a hotspot technology owing to characteristics such as no extra cost, no privacy disclosure, ubiquitous deployment environments, and passive sensing.
There are generally two kinds of underlying features that can be used directly for wireless sensing: Received Signal Strength Indication (RSSI) and Channel State Information (CSI). RSSI is a wireless signal feature widely used for indoor positioning and human activity recognition. However, owing to environmental noise and the superposition of signal information, RSSI measurements are coarse-grained and unstable and are suitable only for tasks with low accuracy requirements. For high-precision positioning and behavior recognition, CSI is the more reliable feature, reflecting richer fine-grained information. Raw CSI, however, contains considerable noise and usually requires certain signal-processing steps before it can be used for recognition.
In addition, the rich information in the underlying CSI can be exploited to extract more stable and reliable upper-layer recognition features. Such methods generally fall into two categories: statistical-feature-based and physical-feature-based. The former typically treats the wireless signal as time-series data and extracts the waveform and power-distribution patterns of the signal in the time or frequency domain along the time series as fingerprint information for positioning or recognition. This approach, while effective, lacks interpretability and extensibility. The latter, in contrast, rests on complete physical principles and offers sufficient interpretability and stable scalability; examples include signal time of flight (ToF), signal angle of arrival (AoA), and signal attenuation. Among these physical characteristics, Doppler frequency shift (DFS) has been adopted by a large body of research because it carries more information directly related to the state and process of motion.
However, DFS is closely tied to domain factors (i.e., factors that affect the recognition features but are unrelated to the recognized action itself, such as the orientation and location of the recognized subject), so recognition performance degrades greatly when the recognition task crosses domains. Accordingly, document [2] derives the body-coordinate velocity profile (BVP), which is independent of domain factors, from the Doppler frequency shift spectrum (DFSP) combined with a spatial coordinate transformation. Based on BVP features, document [2] implements a cross-domain gesture recognition system (hereinafter CGNN) by modeling the BVP with a combined model of a convolutional neural network (CNN) and a gated recurrent unit (GRU). As a result, Wi-Fi-based gesture recognition no longer needs to spend great effort on extracting features from underlying data and can instead move to the next stage: building stable, sensitive, and accurate recognition models around reliable, cross-domain upper-layer features.
When establishing a recognition model, related work in the prior art divides by modeling method into two categories: data-driven (learning-based) and model-based. Model-based activity recognition establishes a theoretical model with clear meaning from physical principles; for example, signal propagation principles are quantified by the changes in distance, multipath, and dynamic level caused by human activity in the Fresnel zones. Clearly, model-based recognition methods rely on fixed, strict physical principles and have strong interpretability. However, this way of extracting "hard features" tends to miss potential correlations between data, resulting in insufficient generalization. Data-driven (learning-based) approaches instead train models on large amounts of data to recognize activities, extracting "elastic features" from the data to map features to discrete actions. These elastic features include both physical and statistical characteristics. It follows that a data-driven modeling method grounded in physical principles not only retains partial interpretability but also does not lose the potential relationships within the data.
In summary, behavior recognition with a data-driven modeling method can be divided into two parts: generating, from the CSI, features usable for recognition based on physical or statistical methods; and building a recognition model on these features. The first part has made great progress as research investment has grown, while the second part still uses "outdated" modeling methods (e.g., CGNN). This makes the performance of the recognition model a research bottleneck for the recognition system.
When using BVP for Wi-Fi cross-domain recognition, the applicant found that recognition systems based on the CGNN structure lack recognition stability. First, the applicant recognized 6 common interactive gestures using CGNN with an accuracy of about 90%. However, when the applicant increased the gesture categories to 22, that is, when the number of categories multiplied and similar categories appeared, the recognition accuracy of the system dropped by about 20%. Through experiments, the applicant found that model structures which, like CGNN, rely on convolution kernels to extract spatial features and use recursion to model the time dimension often lack recognition stability when facing high-complexity recognition tasks.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art by providing a gesture recognition method and system based on BVP and WiTransformer using WiFi.
The purpose of the invention can be realized by the following technical scheme:
a gesture recognition method based on BVP and WiTransformer by utilizing WiFi comprises the following steps:
acquiring data: receiving and recording a CSI mode caused by the motion of an identification object, deducing a BVP sequence based on CSI, wherein the BVP sequence is three-dimensional data and is a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the frame number of the BVPs, and the number of possible speed component values in the BVPs along an x axis and a y axis is recorded as N;
data processing: carrying out sequence filling and normalization processing on BVP sequences with different lengths to obtain processed BVP sequences;
identification: and inputting the BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, wherein the WiTransformer model is obtained by modifying a Transformer model framework.
Further, the step of acquiring data comprises the steps of:
T1, arranging a Wi-Fi signal transmitter and receivers, and receiving and recording the CSI pattern caused by the motion of the recognized subject; the CSI at measured carrier frequency f and time t, corresponding to the complex value of the channel frequency response CFR, is expressed as:

$$H(f,t) = e^{j\varepsilon(f,t)} \sum_{k=1}^{K} a_{k}(f,t)\, e^{-j 2\pi f \tau_{k}(t)}$$

where $a_{k}(f,t)$ is the amplitude representation of the initial phase offset and attenuation of the kth path; $e^{-j2\pi f \tau_{k}(t)}$ is the phase offset of the kth path, with propagation delay $\tau_{k}(t)$; and $e^{j\varepsilon(f,t)}$ is the phase error caused by time alignment offset, sampling frequency offset, and carrier frequency offset;

T2, representing the multipath signal phase by the corresponding DFS and separating the CFR caused by human activity from the CSI, the expression of the CSI being converted into:

$$H(f,t) = e^{j\varepsilon(f,t)} \left( H_{s} + \sum_{k \in P_{d}} a_{k}(f,t)\, e^{\,j 2\pi \int_{-\infty}^{t} f_{D_{k}}(u)\, du} \right)$$

where the constant $H_{s}$ is the sum of all steady-state signals with zero DFS, the summation term is the sum of all dynamic signals with non-zero DFS, and $f_{D_{k}}(t)$ corresponds to the signal frequency shift caused by body motion;

T3, computing the conjugate multiplication of the CSI of different antennas on a receiver, eliminating high-frequency noise and random offsets, and retaining only the measurements with obvious DFS multipath components, namely the signal measurements reflected by the sensed target;

T4, filtering the principal components out of the relevant CSI subcarriers by principal component analysis, and generating the power distribution in the time and frequency domains from the principal components by short-time Fourier transform; each snapshot of the time-frequency power distribution spectrum is a Doppler frequency shift spectrum DFSP;

T5, constructing a matrix $V_{bvp}$ of dimension N × N, where N is the number of possible values of the velocity component along each axis of the body coordinate system;

T6, projecting any velocity component $\vec{v} = (v_{x}, v_{y})$ of the gesture motion onto a frequency component of the DFSP, introducing the idea of compressed sensing, and casting the BVP estimation as an $\ell_{0}$ optimization problem; solving it yields the BVP matrix $V_{bvp}$ directly mapped to the gesture motion, thereby obtaining the BVP sequence.
Further, the sequence padding and normalization of the BVP sequences are performed as follows:
all BVP sequences are padded to equal length along the time dimension T with zero matrices $O \in 0^{N \times N}$; and since the overall magnitude of a BVP sequence is affected by the transmitter power, the two-dimensional value matrix of each BVP frame is normalized.
Further, the step of modifying the Transformer model framework comprises: adding a temporal-information stacking fusion and frame position coding module before the encoder of the Transformer model framework, where the BVP sequence is processed in this module by the following steps:

S1, stacking the BVP sequence so that it is treated as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are in essence the numbers of possible values of the velocity components along the x-axis and y-axis of a BVP, i.e., H = W = N, and C is in essence the number of BVP frames in the sequence, i.e., C = T;

S2, establishing a three-dimensional one-hot matrix $I^{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embedding the position information into $B_{bvp}$ as follows:

$$\tilde{B}_{bvp} = B_{bvp} + I^{H \times W \times C}$$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;

S3, dividing $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$$\tilde{B}_{bvp} \rightarrow \left[\, tb^{1}, tb^{2}, \ldots, tb^{N_{tb}} \right], \qquad tb^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C}, \quad N_{tb} = HW / T_{tb}^{2}$$

where $N_{tb}$ is the number of pipelines obtained by the division and $T_{tb}$ is the size of a three-dimensional matrix pipeline;

S4, flattening each three-dimensional matrix pipeline and mapping the flattened vectors to a Token sequence in the high-order feature space through a linear projection E, as follows:

$$x_{i} = \mathrm{Flatten}(tb^{i})\,E, \qquad E \in \mathbb{R}^{(T_{tb}^{2} \cdot C) \times D}$$

where $\mathrm{Flatten}(tb^{i})$ is the vector obtained by flattening the pipeline $tb^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $[x_{1}; x_{2}; \ldots; x_{N_{tb}}] \in \mathbb{R}^{N_{tb} \times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;

S5, placing a learnable classification token $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence $[x_{class}; x_{1}; \ldots; x_{N_{tb}}]$; the learnable classification token $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;

S6, embedding the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder, as follows:

$$z = [x_{class};\ x_{1};\ x_{2};\ \ldots;\ x_{N_{tb}}] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N_{tb}+1) \times D}$$

where $E_{pos}$ is a learnable position coding used to store the position information between the three-dimensional matrix pipelines.
Further, the step of modifying the Transformer model framework comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and using residual connections between the network modules.
Further, a classifier is added after the encoder.
A gesture recognition system based on BVP and WiTransformer using WiFi comprises a data acquisition module, a data processing module, and a recognition module;
the data acquisition module is used for: receiving and recording the CSI pattern caused by the motion of a recognized subject, and deriving a BVP sequence from the CSI, where the BVP sequence is three-dimensional data, i.e., a group of BVPs arranged along a time dimension T; the length of the BVP sequence is determined by the number of BVP frames, and the number of possible values of the velocity components along the x-axis and y-axis of a BVP is denoted N;
the data processing module is used for: performing sequence padding and normalization on BVP sequences of different lengths to obtain processed BVP sequences;
the recognition module is used for: inputting the processed BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, where the WiTransformer model is obtained by modifying the Transformer model framework.
Further, in the recognition module, the step of modifying the Transformer model framework comprises: adding a temporal-information stacking fusion and frame position coding module before the encoder of the Transformer model framework, where the BVP sequence is processed in this module by the following steps:

S1, stacking the BVP sequence so that it is treated as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are in essence the numbers of possible values of the velocity components along the x-axis and y-axis of a BVP, i.e., H = W = N, and C is in essence the number of BVP frames in the sequence, i.e., C = T;

S2, establishing a three-dimensional one-hot matrix $I^{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embedding the position information into $B_{bvp}$ as follows:

$$\tilde{B}_{bvp} = B_{bvp} + I^{H \times W \times C}$$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;

S3, dividing $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$$\tilde{B}_{bvp} \rightarrow \left[\, tb^{1}, tb^{2}, \ldots, tb^{N_{tb}} \right], \qquad tb^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C}, \quad N_{tb} = HW / T_{tb}^{2}$$

where $N_{tb}$ is the number of pipelines obtained by the division and $T_{tb}$ is the size of a three-dimensional matrix pipeline;

S4, flattening each three-dimensional matrix pipeline and mapping the flattened vectors to a Token sequence in the high-order feature space through a linear projection E, as follows:

$$x_{i} = \mathrm{Flatten}(tb^{i})\,E, \qquad E \in \mathbb{R}^{(T_{tb}^{2} \cdot C) \times D}$$

where $\mathrm{Flatten}(tb^{i})$ is the vector obtained by flattening the pipeline $tb^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $[x_{1}; x_{2}; \ldots; x_{N_{tb}}] \in \mathbb{R}^{N_{tb} \times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;

S5, placing a learnable classification token $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence $[x_{class}; x_{1}; \ldots; x_{N_{tb}}]$; the learnable classification token $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;

S6, embedding the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder, as follows:

$$z = [x_{class};\ x_{1};\ x_{2};\ \ldots;\ x_{N_{tb}}] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N_{tb}+1) \times D}$$

where $E_{pos}$ is a learnable position coding used to store the position information between the three-dimensional matrix pipelines.
Further, in the recognition module, the step of modifying the Transformer model framework comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and using residual connections between the network modules.
Further, a classifier is added after the encoder.
Compared with the prior art, the invention has the following beneficial effects:
(1) A temporal-information stacking fusion and frame position coding module is added before the encoder, in which BVP stacking, pipeline embedding, and two rounds of position-information embedding are performed. Compared with a two-stream structure, this module uses the stacked BVPs to fuse the three-dimensional spatio-temporal information in advance, and embeds the three-dimensional information pipelines into a one-dimensional information token sequence (Tokens) by matrix reshaping. On the premise that the recognition accuracy is not greatly reduced, the computational complexity of the encoder model is greatly reduced: only one encoder is needed to encode the three-dimensional spatio-temporal features.
(2) The encoder structure is modified. Compared with the original Transformer encoder, the position of the normalization layer is adjusted (moved from after to before each network module) and the layer normalization layer is replaced with a batch normalization layer, which accelerates model convergence while preventing overfitting and vanishing gradients.
Drawings
FIG. 1 is a schematic diagram of a reflection of a Wi-Fi signal after encountering an obstacle;
FIG. 2 is an architecture diagram of a WiTransformer;
FIG. 3 is a schematic diagram of the pipeline (Tube) division;
FIG. 4 shows the modified encoder structure;
FIG. 5 is an overall flow diagram of gesture recognition;
fig. 6 is a flowchart for generating BVP sequences;
FIG. 7 is a schematic diagram of the arrangement of transmitters and receivers;
fig. 8 is a time-frequency power distribution spectrum snapshot generated in a Wi-Fi antenna link.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The embodiments are implemented on the basis of the technical solution of the present invention, and detailed implementations and specific operation processes are given. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention, and the protection scope of the present invention is not limited to the following embodiments. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic may be included in at least one implementation of the invention. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The present specification provides method steps as in the examples or flow diagrams, but may include more or fewer steps based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In actual system or server product execution, the steps in the method according to the embodiment or the figures may be executed sequentially or in parallel (for example, in the context of parallel processors or multi-thread processing), or the execution order of the steps without timing limitation may be adjusted.
Example 1:
Through research, the inventors found that, influenced by other leading-edge fields (CV, NLP), existing Wi-Fi gesture recognition methods model features along network structures with CNNs and RNNs as backbones; for example, document [2] models the BVP by combining a gated recurrent unit (GRU) with a convolutional neural network (CNN). However, CNN-based models rely on the local feature extraction of convolution kernels and thus lack global feature correlation, so such systems cannot withstand problems such as easily confused features, an increase in classification categories, and partial feature loss. The structure of a recurrent neural network (RNN) can only model serially along a sequence; when facing long sequence features, RNN-based systems are limited by memory constraints and struggle to capture long-range feature dependencies and correlations.
After analysis, the inventors propose that modeling can be performed with a Transformer-based model framework. However, existing Transformer-based model frameworks focus mainly on single-dimensional modeling of time or space, while BVPs contain spatial and temporal information at the same time, and there is little relevant, transferable experience in the field of Wi-Fi-based gesture recognition (i.e., an embedding method for mapping joint spatio-temporal information into a high-dimensional feature space). The existing Vision Transformer (ViT) classification model is a framework that can be intuitively migrated to the BVP-based gesture recognition task. However, since there is no correlation between the feature maps of an image signal, the ViT architecture has no corresponding feature-extraction structure to supplement the positional relationships and correlations between feature maps.
To solve the above problems, the inventors propose a gesture recognition method based on BVP and WiTransformer using WiFi, which modifies the Transformer encoder, adds a temporal-information stacking fusion and frame position coding module to give it spatio-temporal feature extraction capability, and uses a channel position embedding structure to embed position codes into the otherwise unordered channels, so that a one-dimensional model can simultaneously extract the temporal information of the third dimension.
Before Wi-Fi gesture recognition is carried out, the WiTransformer must first be constructed and trained. In the application stage, as shown in FIG. 5, the CSI of the recognized subject is collected first, the CSI is then converted into a BVP sequence, and the sequence is finally fed to the trained WiTransformer for gesture recognition. The method comprises the following steps:
(1) Constructing a data set;
in this embodiment, a public data set is used as a training set to train the WiTransformer, and the public data set document [5] may be referred to as specific experimental device position parameters in the data set. The collected data includes 22 gesture instances from 8 positions and 5 orientations of 17 recognition objects in 3 rooms. Theoretically 44880 instances from 2040 (17 × 3 × 8 × 5) mixed multi-domains can be used for model training. For practical applications, consider repeated measurements and objective factor effects, please refer to public data set literature [5]. In other embodiments, the system can also build itself up
(2) Constructing and training a WiTransformer:
the architecture of the WiTransformer is shown in FIG. 2. Based on the characteristics of BVP data, the present application proposes a "Tube Embedding" (Tube Embedding) method to enable BVPs to be embedded in a higher dimensional feature space than the original transform encoder. Meanwhile, the internal operation unit of the original Transformer encoder structure is also adjusted, so that the Witransformer can adapt to the data characteristics of the BVP. It should be emphasized that, in fig. 2, the BVP sequence is represented as an image only for intuitive visualization, so as to facilitate understanding how BVP is inferred in the witransform, but there is an essential difference between the BVP sequence and BVP and image data.
The method for modifying the Transformer model framework comprises the following steps: adding a temporal-information stacking fusion and frame position coding module before the encoder of the Transformer model framework, where the BVP sequence is processed in this module by the following steps:
S1, stacking the BVP sequence so that it is treated as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are in essence the numbers of possible values of the velocity components along the x-axis and y-axis of a BVP, i.e., H = W = N, and C is in essence the number of BVP frames in the sequence, i.e., C = T;

The present application treats a BVP sequence as a piece of multi-channel image data, where H and W are concepts borrowed from images but in essence are the numbers of possible values of the velocity components along the x-axis and y-axis, i.e., H = W = N; accordingly, what corresponds to the number of image feature maps is the number of BVP frames, i.e., C = T. This process is similar to 3D convolution.

S2, establishing a three-dimensional one-hot matrix $I^{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embedding the position information into $B_{bvp}$ as follows:

$$\tilde{B}_{bvp} = B_{bvp} + I^{H \times W \times C}$$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;

S3, dividing $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$$\tilde{B}_{bvp} \rightarrow \left[\, tb^{1}, tb^{2}, \ldots, tb^{N_{tb}} \right], \qquad tb^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C}, \quad N_{tb} = HW / T_{tb}^{2}$$

where $N_{tb}$ is the number of pipelines obtained by the division and $T_{tb}$ is the size of a three-dimensional matrix pipeline;

To embed the three-dimensional BVP sequence data input to the model into a one-dimensional token (Token) sequence, the present application reshapes the BVP sequence $\tilde{B}_{bvp}$ into a series of three-dimensional matrix pipelines; $T_{tb}$, the size of the matrix pipelines obtained by the division, determines the effective sequence length input to the encoder, and its dimensions correspond to H and W.
All hidden-layer constant vector sizes used in the encoder are D, i.e., the core arithmetic unit MSA of the WiTransformer encoder does not change the vector dimension.
S4, flattening (Flatten) each three-dimensional matrix pipeline and mapping the flattened vectors to a Token sequence in the high-order feature space through a linear projection E, as follows:

$$x_{i} = \mathrm{Flatten}(tb^{i})\,E, \qquad E \in \mathbb{R}^{(T_{tb}^{2} \cdot C) \times D}$$

where $\mathrm{Flatten}(tb^{i})$ is the vector obtained by flattening the pipeline $tb^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $[x_{1}; x_{2}; \ldots; x_{N_{tb}}] \in \mathbb{R}^{N_{tb} \times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;

After reshaping, a "one-piece" complete BVP sequence (consisting of T channels of BVPs) is split into $N_{tb}$ Tubes, and the time dimension T of each Tube is retained. Each three-dimensional matrix pipeline is flattened, and the vectors produced by flattening are mapped into a Token sequence in the high-order feature space through the trainable linear projection E. As shown in FIG. 3, this process is referred to herein as Tube Embedding:

$$[x_{1}; x_{2}; \ldots; x_{N_{tb}}] = \left[\mathrm{Flatten}(tb^{1})E;\ \ldots;\ \mathrm{Flatten}(tb^{N_{tb}})E\right] \in \mathbb{R}^{N_{tb} \times D}$$
S5, placing a learnable classification token $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence $[x_{class}; x_{1}; \ldots; x_{N_{tb}}]$; the learnable classification token $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;

S6, embedding the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder, as follows:

$$z = [x_{class};\ x_{1};\ x_{2};\ \ldots;\ x_{N_{tb}}] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N_{tb}+1) \times D}$$

where $E_{pos}$ is a learnable position coding used to store the position information between the three-dimensional matrix pipelines; position 0 corresponds to $x_{class}$, and positions 1 to $N_{tb}$ correspond to the pipelines. In this embodiment, $N_{tb}$ is 9.
The core computational unit of the Transformer is MSA, which performs pairwise calculations over all Tokens. In this process, the relative position of each Tube within the BVP is not considered; more importantly, features cannot be processed sequentially along the sequence as RNNs do. Therefore, a learnable position coding is needed to store the relative position information between the Tubes. Spatial features are extracted with a Transformer encoder, and in this respect the BVP is the same as image data. Unlike image data, however, digital images do not require 3D segmentation along the "channel direction", because there is no dependency between image channels (between feature maps). In contrast, the BVPs within a BVP sequence depend on each other and have a temporal ordering, so the relative position information must be embedded before the segmentation. In response to these two requirements (the dependency relationship and the position information between BVPs): for the former, MSA is also computed along the pipeline depth direction in Tube Embedding, which is similar to the processing manner of Conv3D (3-dimensional convolution); for the latter, a Channel Position Embedding (CPE) method is proposed to supplement the position information, i.e., step S2. Specifically, the present application uses a three-dimensional one-hot matrix $I^{H \times W \times C}$ as a position coding to preserve the relative position information between VDMs. In practice, the position information is embedded into the BVP sequence by addition.
Through experiments, the applicant found that stacking the BVP sequence as a multi-channel picture and fusing the spatio-temporal information before inputting it into the network effectively alleviates this problem. In addition, experimental comparison shows that, compared with methods that separate spatio-temporal data, the stacking method that combines spatio-temporal data effectively shortens model training time and reduces training cost. Since the BVPs in a BVP sequence are correlated, the model architecture of the Vision Transformer cannot be applied directly; the present application therefore proposes the CPE structure and the Tube Embedding manner to implement MSA in three-dimensional space and finally achieve recognition.
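To make the data flow of this module concrete, the following is a minimal PyTorch sketch of the stacking fusion and frame position coding described above. The tensor sizes (N = 20, T = 30, T_tb = 5, D = 128), the class name, and the exact construction of the one-hot matrix I (here, a distinct one-hot code per frame) are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackFusionFramePositionCoding(nn.Module):
    """Stacks a BVP sequence into a multi-channel tensor, adds channel
    position coding (CPE), splits the result into Tubes, and projects each
    Tube to a D-dimensional token (Tube Embedding)."""

    def __init__(self, n=20, t=30, t_tb=5, dim=128):   # assumed sizes
        super().__init__()
        assert n % t_tb == 0 and t <= n * n
        self.n, self.t, self.t_tb = n, t, t_tb
        self.n_tb = (n // t_tb) ** 2                   # N_tb = HW / T_tb^2
        # CPE: one plausible construction of the one-hot matrix I -- a
        # distinct one-hot code per frame, reshaped to N x N and added.
        cpe = F.one_hot(torch.arange(t), num_classes=n * n).float()
        self.register_buffer("cpe", cpe.view(t, n, n))
        self.proj = nn.Linear(t_tb * t_tb * t, dim)    # linear projection E
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))          # x_class
        self.pos = nn.Parameter(torch.zeros(1, self.n_tb + 1, dim))  # E_pos

    def forward(self, bvp):             # bvp: (B, T, N, N), padded sequence
        b, s = bvp.size(0), self.n // self.t_tb
        x = bvp + self.cpe              # embed frame order by addition
        # reshape the N x N plane into s*s Tubes of size T_tb x T_tb x T
        x = x.view(b, self.t, s, self.t_tb, s, self.t_tb)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, self.n_tb, -1)
        x = self.proj(x)                # Flatten + E: tokens (B, N_tb, D)
        x = torch.cat([self.cls.expand(b, -1, -1), x], dim=1)
        return x + self.pos             # add learnable Tube position coding
```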
The method for modifying the Transformer model framework further comprises the following steps: adjusting the structure of the encoder of the Transformer model framework, removing the layer normalization (LN) layer after each network module of the encoder, adding a batch normalization (BN) layer before each network module, and using residual connections between the network modules.
The method for modifying the Transformer model framework further comprises the following steps: adding a classifier after the encoder; in this embodiment an MLP is used, but other multi-class classifiers may be used in other embodiments.
The encoder is the core computing unit in the Transformer model architecture. Specifically, as shown in FIG. 4, the modules of the modified encoder of the present application are connected in series with residual connections to prevent degradation of the network model; a batch normalization layer is added before each module to accelerate model convergence and improve the stability of network training, preventing overfitting and vanishing gradients. In the present application, a multi-layer perceptron (MLP) serves as the classifier for the final classification and mainly comprises nonlinear mappings activated by the ReLU function, which strengthens the nonlinear fitting capability of the model. The scoring function of the encoder's core computation module MSA follows the Scaled Dot-Product Attention of the Transformer [3]. The internal operation of the WiTransformer encoder is described below with a simple example, assuming the encoder is formed by stacking L spatio-temporal encoder layers:
$$z'_{l} = \mathrm{MSA}(\mathrm{BN}(z_{l-1})) + z_{l-1}, \qquad l = 1 \ldots L$$
$$z_{l} = \mathrm{MLP}(\mathrm{BN}(z'_{l})) + z'_{l}, \qquad l = 1 \ldots L$$
the global space-time characteristics of the BVP sequence are extracted in parallel by adopting a modified Transformer encoder. Depending on the capture of the MSA to the global feature, under the condition that the recognition task is difficult to be simplified and is gradually harsh, the recognition accuracy rate of the method is only reduced by about 3 percent, and the method is 1/7-1/5 of that of other existing space-time feature extraction models.
(3) The WiTransformer is trained using the dataset.
(4) Acquiring data: receiving and recording the CSI pattern caused by the motion of the recognized subject, and deriving a BVP sequence from the CSI, where the BVP sequence is three-dimensional data, i.e., a group of BVPs arranged along a time dimension T; the length of the BVP sequence is determined by the number of BVP frames, and the number of possible values of the velocity components along the x-axis and y-axis of a BVP is denoted N;
the data acquisition comprises the following steps:
T1, arranging a Wi-Fi signal transmitter and receivers, and receiving and recording the CSI pattern caused by the motion of the recognized subject;
in the sensing region, object motion causes changes in the dynamic reflection path by changing the length of the Wi-Fi signal propagation path. Then a particular pattern of signal features (e.g., CSI, DFS, or BVP) corresponding to the perceived object behavior may be used to characterize this motion process. In this embodiment, 1 Wi-Fi signal transmitter and at least 3 receivers are required. The transmitter can adopt a general commercial Wi-Fi wireless router (specific parameters can be purchased according to experiment needs), each receiver needs to be provided with an Intel 5300 wireless network card, and a notebook computer or a microcomputer can be adopted generally. In addition, a driver corresponding to the network card is also required, and a driver designed in document [4] is generally used. After the device is turned on, the device position theoretically does not affect the recognition result, and generally the device can be placed according to the mode of fig. 7.
As shown in FIG. 6, the BVP is obtained by inference from the CSI: after appropriate noise reduction and data preprocessing, the BVP sequence is derived from the CSI information using techniques such as Doppler frequency shift, coordinate transformation, and compressed sensing.
According to the multipath effect, in the frequency domain of an indoor multipath environment, when a Wi-Fi signal reaches the receiver through K different paths, the CSI at measured carrier frequency f and time t, corresponding to the complex value of the channel frequency response (CFR), is expressed as:

$$H(f,t) = e^{j\varepsilon(f,t)} \sum_{k=1}^{K} a_{k}(f,t)\, e^{-j 2\pi f \tau_{k}(t)}$$

where $a_{k}(f,t)$ is the amplitude representation of the initial phase offset and attenuation of the kth path; $e^{-j2\pi f \tau_{k}(t)}$ is the phase offset of the kth path, with propagation delay $\tau_{k}(t)$; and $e^{j\varepsilon(f,t)}$ is the phase error caused by time alignment offset, sampling frequency offset, and carrier frequency offset.

T2, representing the multipath signal phase by the corresponding DFS and separating the CFR caused by human activity from the CSI, the expression of the CSI is converted into:

$$H(f,t) = e^{j\varepsilon(f,t)} \left( H_{s} + \sum_{k \in P_{d}} a_{k}(f,t)\, e^{\,j 2\pi \int_{-\infty}^{t} f_{D_{k}}(u)\, du} \right)$$

where the constant $H_{s}$ is the sum of all steady-state signals with zero DFS, the summation term is the sum of all dynamic signals with non-zero DFS, and $f_{D_{k}}(t)$ corresponds to the signal frequency shift caused by body motion. At this point, the CFR caused by human motion has been isolated from the CSI.
T3, computing the conjugate multiplication of the CSI of different antennas on a receiver, eliminating high-frequency noise and random offsets, and retaining only the measurements with obvious DFS multipath components, namely the signal measurements reflected by the sensed target (a sketch of this operation follows);
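As an illustration of step T3, the sketch below assumes a packets x subcarriers x antennas CSI layout (an editorial assumption). Multiplying each antenna's CSI by the conjugate of a reference antenna cancels the common phase error $e^{j\varepsilon(f,t)}$, since that term is shared by all antennas of one receiver; subsequent filtering then removes high-frequency noise and residual offsets.

```python
import numpy as np


def conjugate_multiply(csi, ref_antenna=0):
    """csi: complex array (packets, subcarriers, antennas), assumed layout.
    Returns CSI with the common random phase error cancelled; the phase of
    the result reflects only path differences between antennas."""
    ref = np.conj(csi[:, :, ref_antenna:ref_antenna + 1])
    return csi * ref
```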
T4, filtering the principal components out of the relevant CSI subcarriers by principal component analysis (PCA), and generating the power distribution in the time and frequency domains from the principal components by short-time Fourier transform (STFT). FIG. 8 shows a snapshot of the time-frequency power distribution spectrum generated on a Wi-Fi antenna link. Along time, each snapshot is a Doppler frequency shift spectrum (DFSP), a matrix $D_{dfs}$ of dimension F × M, where F is the number of sampling points and M is the number of transceiver link pairs;
when a person performs a gesture, his body parts (e.g., arms and hands) move at different speeds. The signals of the multiple links caused by limb movement are superimposed at the receiver and form a corresponding DFSP. Heretofore, these DFS-based signal power distributions at different frequencies have been used to quantify the relationship between the speed of motion of different body parts of a human and specific human activities. It should be noted, however, that while DFS can already be used for activity recognition, DFS-based sensing is not cross-domain capable since its reasoning process is closely related to the location and orientation of the sensing object.
T5, constructing a matrix $V_{bvp}$ of dimension N × N, where N is the number of possible values of the velocity component along each axis of the body coordinate system;
A BVP sequence is composed of a series of VDMs, and each BVP can be quantized as the matrix $V_{bvp}$.
T6, converting the position information on which $V_{bvp}$ depends by coordinate transformation, and obtaining the BVP sequence using the compressed sensing technique.
The position information on which $V_{bvp}$ depends is converted from the global (environment) coordinate system into the local body coordinate system by coordinate transformation; the coordinate origin is the position of the sensed subject, and the positive x-axis coincides with the person's orientation. Specifically, assuming the transmitter and receiver positions of the ith link are known (as mentioned above, they can be regarded as settable), then in the body coordinate system an arbitrary velocity component $\vec{v} = (v_{x}, v_{y})$ has its signal power allocated to the following frequency component of the ith link in the DFSP:

$$f^{(i)}(\vec{v}) = a_{x}^{(i)} v_{x} + a_{y}^{(i)} v_{y}$$

where the coefficients $a_{x}^{(i)}$ and $a_{y}^{(i)}$ are determined by the transmitter and receiver positions, specifically by measuring the positions of the transmitter and receiver [2]; $\vec{v}$ is the velocity in the global coordinate system. FIG. 8 shows 3 velocity components $\vec{v}_{1}, \vec{v}_{2}, \vec{v}_{3}$ generated by a human body projected (allocated) separately onto the DFSPs of the three links.
Since the coefficients $a_{x}^{(i)}$ and $a_{y}^{(i)}$ are related only to the position of the ith link, the projection relationship of the BVP on the ith link is fixed. Therefore, the relationship between the DFSP and the BVP of the ith link can be modeled as:

$$D^{(i)} = c^{(i)} A^{(i)} V_{bvp}, \qquad A^{(i)} \in \{0,1\}^{F \times N^{2}}$$

where $c^{(i)}$ is a scaling factor determined by the propagation loss of the reflected signal; $A^{(i)}$ is the allocation matrix, whose element $A^{(i)}_{j,k} = 1$ if and only if $f_{j} = f^{(i)}(\vec{v}_{k})$, with $f_{j}$ the jth frequency sample in the DFSP and $\vec{v}_{k}$ the corresponding kth element of the vectorized BVP matrix. The BVP solution, combined with the Earth Mover's Distance (EMD), is then expressed as an $\ell_{0}$ optimization. Finally, the BVP corresponding to the motion is obtained from the above model using the compressed sensing technique.
(5) Processing data: a BVP can be understood as a two-dimensional Velocity Distribution Matrix (VDM), and a BVP sequence as a set of VDMs arranged along the time dimension (its data structure is detailed in the BVP generation above). Since the sampling duration of each sample cannot be accurately controlled during data acquisition, each BVP sequence has a different length, i.e., a different number of BVP frames. The present application pads all BVP sequences to equal length along the time dimension T with zero matrices $O \in 0^{N \times N}$, where N is the number of possible values of the velocity components along the x-axis and y-axis of a BVP. This process is referred to herein as Total Padding.
Since general commercial Wi-Fi devices have a power adjustment mechanism serving the communication function, the overall magnitude of the BVP may change with the transmitter power. The values therefore need to be adjusted to a fixed range, so the two-dimensional value matrix of each BVP frame is normalized (Normalization).
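A minimal sketch of Total Padding and per-frame normalization follows; per-frame max scaling is one plausible normalization scheme (the text above only requires a fixed value range), and the function name is an editorial choice.

```python
import numpy as np


def pad_and_normalize(frames, t_max):
    """frames: list of (N, N) BVP matrices of varying count. Normalizes
    each frame to a fixed value range (per-frame max scaling, one
    plausible scheme) and pads with zero matrices O in 0^(N x N) to
    length t_max (Total Padding)."""
    n = frames[0].shape[0]
    out = []
    for f in frames:
        m = np.abs(f).max()
        out.append(f / m if m > 0 else f)     # scale to [0, 1]
    out += [np.zeros((n, n))] * (t_max - len(out))
    return np.stack(out)                       # (t_max, N, N)
```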
(6) Inputting the processed BVP sequence into the trained WiTransformer model to obtain the gesture classification result.
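Composing the two module sketches given earlier, a hypothetical end-to-end WiTransformer might look as follows. The encoder depth (L = 6) and the tensor sizes are assumed; the 22 output classes follow the gesture set of this embodiment.

```python
import torch
import torch.nn as nn


class WiTransformer(nn.Module):
    """Hypothetical composition of the embedding and encoder sketches."""

    def __init__(self, n=20, t=30, t_tb=5, dim=128, depth=6, classes=22):
        super().__init__()
        self.embed = StackFusionFramePositionCoding(n, t, t_tb, dim)
        self.encoder = nn.Sequential(
            *[WiTransformerEncoderLayer(dim) for _ in range(depth)])
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, classes))  # MLP classifier

    def forward(self, bvp):                 # bvp: (B, T, N, N)
        z = self.encoder(self.embed(bvp))
        return self.head(z[:, 0])           # classify from the class token


model = WiTransformer()
logits = model(torch.randn(2, 30, 20, 20))  # two padded BVP sequences
pred = logits.argmax(dim=-1)                # predicted gesture classes
```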
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In one embodiment, the software programs of the present application may be executed by a processor to implement the above steps or functions. As such, the software programs (including associated data structures) of the present application can be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or a solution according to the aforementioned embodiments of the present application.
Example 2:
a gesture recognition system based on a BVP and a WiTransformer by utilizing WiFi comprises a data acquisition module, a data processing module and a recognition module;
the data acquisition module is used for: receiving and recording a CSI mode caused by the motion of an identification object, deducing a BVP sequence based on CSI, wherein the BVP sequence is three-dimensional data and is a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the frame number of the BVPs, and the number of possible values of speed components along an x axis and a y axis in the BVPs is recorded as N;
the data processing module is used for: carrying out sequence filling and normalization processing on BVP sequences with different lengths to obtain processed BVP sequences;
the identification module is to: and inputting the BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, wherein the WiTransformer model is obtained by modifying a Transformer model framework.
In the recognition module, the step of modifying the Transformer model framework comprises: adding a temporal-information stacking fusion and frame position coding module before the encoder of the Transformer model framework, where the BVP sequence is processed in this module by the following steps:

S1, stacking the BVP sequence so that it is treated as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are in essence the numbers of possible values of the velocity components along the x-axis and y-axis of a BVP, i.e., H = W = N, and C is in essence the number of BVP frames in the sequence, i.e., C = T;

S2, establishing a three-dimensional one-hot matrix $I^{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embedding the position information into $B_{bvp}$ as follows:

$$\tilde{B}_{bvp} = B_{bvp} + I^{H \times W \times C}$$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;

S3, dividing $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$$\tilde{B}_{bvp} \rightarrow \left[\, tb^{1}, tb^{2}, \ldots, tb^{N_{tb}} \right], \qquad tb^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C}, \quad N_{tb} = HW / T_{tb}^{2}$$

where $N_{tb}$ is the number of pipelines obtained by the division and $T_{tb}$ is the size of a three-dimensional matrix pipeline;

S4, flattening each three-dimensional matrix pipeline and mapping the flattened vectors to a Token sequence in the high-order feature space through a linear projection E, as follows:

$$x_{i} = \mathrm{Flatten}(tb^{i})\,E, \qquad E \in \mathbb{R}^{(T_{tb}^{2} \cdot C) \times D}$$

where $\mathrm{Flatten}(tb^{i})$ is the vector obtained by flattening the pipeline $tb^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $[x_{1}; x_{2}; \ldots; x_{N_{tb}}] \in \mathbb{R}^{N_{tb} \times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;

S5, placing a learnable classification token $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence $[x_{class}; x_{1}; \ldots; x_{N_{tb}}]$; the learnable classification token $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;

S6, embedding the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder, as follows:

$$z = [x_{class};\ x_{1};\ x_{2};\ \ldots;\ x_{N_{tb}}] + E_{pos}, \qquad E_{pos} \in \mathbb{R}^{(N_{tb}+1) \times D}$$

where $E_{pos}$ is a learnable position coding used to store the position information between the three-dimensional matrix pipelines.
In the recognition module, the step of modifying the Transformer model framework further comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and using residual connections between the network modules.
The method for modifying the Transformer model framework further comprises: adding a classifier after the encoder; in this embodiment an MLP is used, but other multi-class classifiers may be used in other embodiments.
The relevant references to which this application relates are as follows:
[1] Wang, W., Liu, A. X., Shahzad, M., Ling, K., & Lu, S. (2015). Understanding and modeling of WiFi signal based human activity recognition. Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. https://doi.org/10.1145/2789168.2790093
[2] Zhang, Y., Zheng, Y., Qian, K., Zhang, G., Liu, Y., Wu, C., & Yang, Z. (2021). Widar3.0: Zero-effort cross-domain gesture recognition with Wi-Fi. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/tpami.2021.3105387
[3] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Neural Information Processing Systems.
[4] Halperin, D., Hu, W., Sheth, A., & Wetherall, D. (2011). Tool release: Gathering 802.11n traces with channel state information. ACM SIGCOMM Computer Communication Review, 41(1), 53. https://doi.org/10.1145/1925861.1925870
[5] Yang, Z., Zhang, Y., Zhang, G., & Zheng, Y. Widar 3.0: WiFi-based activity recognition dataset. https://doi.org/10.21227/7znf-qp86
the foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A gesture recognition method based on BVP and WiTransformer by utilizing WiFi is characterized by comprising the following steps:
acquiring data: receiving and recording a CSI mode caused by the motion of an identification object, deducing a BVP sequence based on CSI, wherein the BVP sequence is three-dimensional data and is a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the frame number of the BVPs, and the number of possible velocity component values in the BVPs along an x axis and a y axis is recorded as N;
data processing: carrying out sequence filling and normalization processing on BVP sequences with different lengths to obtain processed BVP sequences;
identification: and inputting the BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, wherein the WiTransformer model is obtained by modifying a Transformer model framework.
2. The method of claim 1, wherein the step of obtaining data comprises the steps of:
T1: deploy a Wi-Fi signal transmitter and receiver, and receive and record the CSI patterns caused by the motion of the recognized object; at carrier frequency f and time t, the CSI is expressed as the complex value of the channel frequency response (CFR):

$H(f,t) = e^{-j\epsilon(f,t)} \sum_{k=1}^{K} a_k(f,t)\, e^{-j 2\pi f \tau_k(t)}$

where $a_k(f,t)$ is the amplitude representation of the initial phase offset and attenuation of the k-th path; $e^{-j 2\pi f \tau_k(t)}$ is the phase offset of the k-th path with propagation delay $\tau_k(t)$; and $e^{-j\epsilon(f,t)}$ is the phase error caused by time alignment offset, sampling frequency offset, and carrier frequency offset;
T2: represent the multipath signal phase with the corresponding DFS and separate the CFR caused by human activity from the CSI, converting the CSI expression into:

$H(f,t) = e^{-j\epsilon(f,t)} \Big( H_s(f) + \sum_{k \in P_d} a_k(f,t)\, e^{\,j 2\pi \int_{-\infty}^{t} f_{D_k}(u)\, du} \Big)$

where the constant $H_s$ is the sum of all steady-state signals with zero DFS; the term $\sum_{k \in P_d} a_k(f,t)\, e^{\,j 2\pi \int_{-\infty}^{t} f_{D_k}(u)\, du}$ is the sum of all dynamic signals with non-zero DFS; and $f_{D_k}$ corresponds to the signal frequency shift caused by body motion;
T3: compute the conjugate multiplication of the CSI from different antennas on the receiver to eliminate high-frequency noise and random offsets, keeping only the measurements with salient DFS multipath components, i.e. the signal measurements reflected off the sensing target;
T4: filter the principal components out of the correlated CSI subcarriers through principal component analysis, and generate the power distribution over the time and frequency domains from the principal components through a short-time Fourier transform; a snapshot of this time-frequency power distribution spectrum is the Doppler frequency shift spectrum (DFSP);
T5: construct a matrix $V_{bvp}$ of dimension N × N, where N is the number of possible values of the velocity component along each axis of the body coordinate system;
T6: project any velocity component $\vec{v} = (v_x, v_y)$ of the gesture motion onto a frequency component of the DFSP; introducing the idea of compressed sensing, treat the BVP estimation as an $l_0$ optimization problem, and solve it to obtain the BVP matrix $V_{bvp}$ directly mapped to the gesture motion, thereby obtaining the BVP sequence.
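For illustration, a minimal NumPy/SciPy sketch of the signal-processing chain in steps T3 and T4 (conjugate multiplication between antennas, principal component analysis, short-time Fourier transform) follows; the array shapes, the 1 kHz sampling rate, and the STFT window length are assumptions of the sketch, and the $l_0$ optimization of T5 and T6 is not shown.

```python
import numpy as np
from scipy.signal import stft

def csi_to_dfsp(csi, fs=1000):
    """T3-T4 sketch. csi: complex array of shape (time, antennas, subcarriers)."""
    # T3: conjugate-multiply antenna 0 with antenna 1; the common random
    # phase error term e^{-j eps(f,t)} cancels in the product
    conj = csi[:, 0, :] * np.conj(csi[:, 1, :])     # (time, subcarriers)
    # T4a: PCA across the correlated subcarriers, keep the principal component
    x = conj - conj.mean(axis=0)
    cov = x.conj().T @ x                            # Hermitian covariance
    _, vecs = np.linalg.eigh(cov)                   # eigenvalues in ascending order
    pc = x @ vecs[:, -1]                            # dominant component over time
    # T4b: STFT of the principal component -> time-frequency power distribution
    f, t, zxx = stft(pc, fs=fs, nperseg=256, return_onesided=False)
    dfsp = np.abs(zxx) ** 2                         # Doppler frequency shift spectrum
    return f, t, dfsp
```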
3. The method of claim 2, wherein the sequence filling and normalization processing of the BVP sequences is performed as follows:
all BVP sequences are padded to equal length along the time dimension T with zero matrices $O = \mathbf{0}_{N \times N}$; and, since the overall magnitude of a BVP sequence is affected by the transmitter power, the two-dimensional value matrix of each BVP frame is normalized.
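A short illustrative sketch of this padding and normalization follows; the per-frame max normalization is one reasonable assumed choice, since the claim prescribes normalization but not a specific scheme, and target_len is a hypothetical common sequence length.

```python
import numpy as np

def pad_and_normalize(bvp_seq, target_len):
    """bvp_seq: array (T, N, N) -> zero-padded, per-frame-normalized (target_len, N, N)."""
    t, n, _ = bvp_seq.shape
    out = np.zeros((target_len, n, n))              # zero matrices O = 0_{N x N}
    for i in range(min(t, target_len)):
        frame = bvp_seq[i]
        peak = np.abs(frame).max()
        # normalize each frame so transmitter power does not bias the magnitude
        out[i] = frame / peak if peak > 0 else frame
    return out
```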
4. The method of claim 1, wherein the adapting of the Transformer model framework comprises: adding a timing information stacking fusion and frame position coding module before the encoder of the Transformer model framework, wherein the BVP sequence is processed in the timing information stacking fusion and frame position coding module by the following steps:
S1: stack the BVP sequence so that it can be regarded as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are essentially the numbers of possible velocity component values in a BVP along the x-axis and y-axis, i.e. H = W = N, and C is essentially the number of BVP frames in the BVP sequence, i.e. C = T;
S2: establish a three-dimensional one-hot matrix $I_{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embed this position information into $B_{bvp}$:

$\tilde{B}_{bvp} = B_{bvp} + I_{H \times W \times C}$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;
S3: divide $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$\tilde{B}_{bvp} \rightarrow \{\, x_{tb}^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C} \,\}_{i=1}^{N_{tb}},\quad N_{tb} = HW / T_{tb}^{2}$

where $N_{tb}$ is the number of divided pipelines and $T_{tb}$ denotes the dimensions of a three-dimensional matrix pipeline;
S4: flatten each three-dimensional matrix pipeline, and map the flattened vectors into a Token sequence in the high-order feature space through a linear projection E:

$z_{tk} = [\mathrm{vec}(x_{tb}^{1})E;\ \mathrm{vec}(x_{tb}^{2})E;\ \ldots;\ \mathrm{vec}(x_{tb}^{N_{tb}})E],\quad E \in \mathbb{R}^{(T_{tb}^{2}C)\times D}$

where $\mathrm{vec}(x_{tb}^{i})$ denotes the vector obtained by flattening the three-dimensional matrix pipeline $x_{tb}^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $z_{tk} \in \mathbb{R}^{N_{tb}\times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;
S5: set a learnable classification label $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence

$z' = [x_{class};\ z_{tk}]$

where the learnable classification label $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;
S6: embed the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder:

$z = z' + E_{pos},\quad E_{pos} \in \mathbb{R}^{(N_{tb}+1)\times D}$

where $E_{pos}$ is a learnable position encoding used to store the position information between the three-dimensional matrix pipelines.
5. The method of claim 1, wherein the adapting of the Transformer model framework comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and connecting the network modules through residual connections.
6. The method of claim 5, wherein a classifier is added after the encoder.
7. A gesture recognition system based on BVP and WiTransformer by utilizing WiFi is characterized by comprising a data acquisition module, a data processing module and a recognition module;
the data acquisition module is used for: receiving and recording a CSI mode caused by the motion of an identification object, deducing a BVP sequence based on CSI, wherein the BVP sequence is three-dimensional data and is a group of BVPs arranged along a time dimension T, the length of the BVP sequence is determined by the frame number of the BVPs, and the number of possible speed component values in the BVPs along an x axis and a y axis is recorded as N;
the data processing module is used for: carrying out sequence filling and normalization processing on BVP sequences with different lengths to obtain processed BVP sequences;
the identification module is configured to: and inputting the BVP sequence into a trained WiTransformer model to obtain a gesture recognition result, wherein the WiTransformer model is obtained by modifying a Transformer model framework.
8. The system of claim 7, wherein modifying the Transformer model framework in the recognition module comprises: adding a timing information stacking fusion and frame position coding module before the encoder of the Transformer model framework, wherein the BVP sequence is processed in the timing information stacking fusion and frame position coding module by the following steps:
S1: stack the BVP sequence so that it can be regarded as multi-channel image data $B_{bvp} \in \mathbb{R}^{H \times W \times C}$, with each BVP frame of the sequence arranged as an image channel; H and W are essentially the numbers of possible velocity component values in a BVP along the x-axis and y-axis, i.e. H = W = N, and C is essentially the number of BVP frames in the BVP sequence, i.e. C = T;
S2: establish a three-dimensional one-hot matrix $I_{H \times W \times C}$ to store the relative position information between the BVPs of the sequence, and embed this position information into $B_{bvp}$:

$\tilde{B}_{bvp} = B_{bvp} + I_{H \times W \times C}$

where $\tilde{B}_{bvp}$ is the multi-channel image data after the position information is embedded;
S3: divide $\tilde{B}_{bvp}$ into $N_{tb}$ three-dimensional matrix pipelines (Tubes):

$\tilde{B}_{bvp} \rightarrow \{\, x_{tb}^{i} \in \mathbb{R}^{T_{tb} \times T_{tb} \times C} \,\}_{i=1}^{N_{tb}},\quad N_{tb} = HW / T_{tb}^{2}$

where $N_{tb}$ is the number of divided pipelines and $T_{tb}$ denotes the dimensions of a three-dimensional matrix pipeline;
S4: flatten each three-dimensional matrix pipeline, and map the flattened vectors into a Token sequence in the high-order feature space through a linear projection E:

$z_{tk} = [\mathrm{vec}(x_{tb}^{1})E;\ \mathrm{vec}(x_{tb}^{2})E;\ \ldots;\ \mathrm{vec}(x_{tb}^{N_{tb}})E],\quad E \in \mathbb{R}^{(T_{tb}^{2}C)\times D}$

where $\mathrm{vec}(x_{tb}^{i})$ denotes the vector obtained by flattening the three-dimensional matrix pipeline $x_{tb}^{i}$, E is a linear projection determined by training, and the resulting Token sequence is $z_{tk} \in \mathbb{R}^{N_{tb}\times D}$, where D is the constant hidden-layer vector size used in the encoder of the Transformer model framework;
S5: set a learnable classification label $x_{class}$ at the head of the Token sequence to obtain the preliminary input sequence

$z' = [x_{class};\ z_{tk}]$

where the learnable classification label $x_{class}$ is used to characterize the features of the entire Token sequence and participates in the final classification;
S6: embed the position information between the three-dimensional matrix pipelines into the preliminary input sequence to obtain the input sequence z fed to the encoder:

$z = z' + E_{pos},\quad E_{pos} \in \mathbb{R}^{(N_{tb}+1)\times D}$

where $E_{pos}$ is a learnable position encoding used to store the position information between the three-dimensional matrix pipelines.
9. The system of claim 7, wherein modifying the Transformer model framework in the recognition module comprises: adjusting the structure of the encoder of the Transformer model framework by removing the layer normalization layer after each network module of the encoder, adding a batch normalization layer before each network module, and connecting the network modules through residual connections.
10. The system of claim 9, wherein a classifier is added after the encoder.
CN202211000411.1A 2022-08-19 2022-08-19 Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi Withdrawn CN115393955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211000411.1A CN115393955A (en) 2022-08-19 2022-08-19 Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211000411.1A CN115393955A (en) 2022-08-19 2022-08-19 Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi

Publications (1)

Publication Number Publication Date
CN115393955A true CN115393955A (en) 2022-11-25

Family

ID=84119867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211000411.1A Withdrawn CN115393955A (en) 2022-08-19 2022-08-19 Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi

Country Status (1)

Country Link
CN (1) CN115393955A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116434334A (en) * 2023-03-28 2023-07-14 湖南工商大学 WiFi human body gesture recognition method based on transducer, electronic equipment and storage medium
CN116434334B (en) * 2023-03-28 2024-02-06 湖南工商大学 WiFi human body gesture recognition method based on transducer, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Hsieh et al. Deep learning-based indoor localization using received signal strength and channel state information
Zhao et al. Convolutional neural network and dual-factor enhanced variational Bayes adaptive Kalman filter based indoor localization with Wi-Fi
KR20230028249A (en) Apparatuses and methods for 3D posture estimation
Wang et al. A unified framework for guiding generative ai with wireless perception in resource constrained mobile edge networks
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
US12000945B2 (en) Visual and RF sensor fusion for multi-agent tracking
Zou et al. Joint adversarial domain adaptation for resilient WiFi-enabled device-free gesture recognition
He et al. A robust CSI-based Wi-Fi passive sensing method using attention mechanism deep learning
Yan et al. Device-free activity detection and wireless localization based on CNN using channel state information measurement
Kabir et al. CSI-IANet: An inception attention network for human-human interaction recognition based on CSI signal
Wei et al. RSSI-based location fingerprint method for RFID indoor positioning: a review
CN115393955A (en) Gesture recognition method and system based on BVP and WiTransformer by utilizing WiFi
Yan et al. Joint activity recognition and indoor localization with WiFi sensing based on multi-view fusion strategy
Bulugu Gesture recognition system based on cross-domain CSI extracted from Wi-Fi devices combined with the 3D CNN
Ayinla et al. SALLoc: An Accurate Target Localization In Wifi-Enabled Indoor Environments Via Sae-Alstm
Liu et al. UniFi: A Unified Framework for Generalizable Gesture Recognition with Wi-Fi Signals Using Consistency-guided Multi-View Networks
Shen et al. WiAgent: Link selection for CSI-based activity recognition in densely deployed wi-Fi environments
CN115469303A (en) Cognitive biological radar method and device for detecting human body posture and vital signs
Zhong et al. Point‐convolution‐based human skeletal pose estimation on millimetre wave frequency modulated continuous wave multiple‐input multiple‐output radar
Zhang et al. WiFi-Based Indoor Human Activity Sensing: A Selective Sensing Strategy and a Multi-Level Feature Fusion Approach
Bian et al. SimpleViTFi: A lightweight vision transformer model for Wi-Fi-based person identification
Tiku et al. A Scalable Framework for Indoor Localization Using Convolutional Neural Networks
Xu et al. Real-time robust and precise kernel learning for indoor localization under the internet of things
Huang et al. Sparse representation for device-free human detection and localization with COTS RFID
Chai et al. Tourist Street View Navigation and Tourist Positioning Based on Multimodal Wireless Virtual Reality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20221125)