CN111914709B - Deep learning-based action segmentation framework construction method for WiFi signal behavior recognition - Google Patents

Deep learning-based action segmentation framework construction method for WiFi signal behavior recognition

Info

Publication number
CN111914709B
CN111914709B
Authority
CN
China
Prior art keywords
state
action
data
csi
segmentation
Prior art date
Legal status
Active
Application number
CN202010718841.1A
Other languages
Chinese (zh)
Other versions
CN111914709A (en)
Inventor
肖春静
何海生
段宇晨
雷越
陈世名
Current Assignee
Henan University
Original Assignee
Henan University
Priority date
Filing date
Publication date
Application filed by Henan University
Priority to CN202010718841.1A
Publication of CN111914709A
Application granted
Publication of CN111914709B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention belongs to the technical field of behavior recognition and discloses a deep learning-based action segmentation framework construction method for WiFi signal behavior recognition, comprising the following steps: train a state inference model on CSI data; divide the CSI data into equal-length data segments with a sliding window of size ω and a sliding step of 1; infer the label of each data segment with the trained state inference model; determine the start point and end point of a human action based on the data segment labels and extract the action by segmentation; based on the action-segmented CSI data, adopt the deep learning model CNN as the behavior classification model, perform behavior classification, and calculate a confidence as feedback information; and integrate the feedback information from the classification model into the construction of the state inference model to complete the deep learning-based action segmentation framework. The invention avoids empirical noise removal and threshold calculation, and solves the problem of degraded action segmentation performance under mixed actions.

Description

Deep learning-based action segmentation framework construction method for WiFi signal behavior recognition
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a deep learning-based action segmentation framework construction method for WiFi signal behavior recognition.
Background
As an important component of the Internet of Things, behavior recognition plays a key role in many IoT applications (e.g., smart home and healthcare), which need to recognize different types of behaviors to improve quality of life. Existing behavior recognition can be divided into contact-based and contactless approaches; wearable sensors, smartphones, and cameras are typical contact-based examples, but behavior recognition based on WiFi signals has been greatly favored in recent years owing to its ubiquitous availability and contactless nature. By collecting Channel State Information (CSI), existing works explore recognition of different types of actions, such as fine-grained actions (gestures and sign language) and coarse-grained actions (walking and falling).
For CSI-based behavior recognition, after data acquisition and preprocessing, the remaining steps can be roughly divided into two parts: action segmentation and behavior classification. Action segmentation aims to find the points at which an action starts and ends in continuously received CSI data. Behavior classification assigns the results of action segmentation to their corresponding categories. As the input to behavior classification, the quality of action segmentation has an important impact on classification performance. Since CSI amplitude fluctuates much more strongly when an action occurs than when none occurs, most existing works, such as Tw-See (Wu, X.G., et al., TW-See: Human Activity Recognition Through the Wall With Commodity Wi-Fi Devices. IEEE Transactions on Vehicular Technology, 2019. 68(1): p. 306-319.) and Wi-Multi (Feng, C., et al., Wi-Multi: A Three-Phase System for Multiple Human Activity Recognition With Commercial WiFi Devices. IEEE Internet of Things Journal, 2019: p. 1-1.), find an optimal threshold to detect whether an action occurs, thereby designing threshold-based action segmentation methods: if the amplitude of the CSI fluctuation exceeds the set threshold, an action is deemed to have occurred; otherwise, no action occurred.
Although these threshold-based segmentation approaches have seen success, they have several drawbacks. First, the methods for noise removal and threshold calculation are typically determined from empirical observation, and the methods recommended by different researchers even conflict. For example, CARM (Wang, W., et al., Understanding and Modeling of WiFi Signal Based Human Activity Recognition, in Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. 2015, ACM: Paris, France. p. 65-76) and WiAG (Virmani, A. and M. Shahzad, Position and Orientation Agnostic Gesture Recognition Using WiFi, in Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services. 2017, ACM: Niagara Falls, New York, USA. p. 252-264) both indicate that noise can be effectively removed using Principal Component Analysis (PCA), but select different principal components for subsequent processing: CARM suggests using the second and third principal components, whereas WiAG selects only the third, having found that the second also contains unwanted noise. In contrast, Tw-See considers PCA unsuitable for through-wall WiFi signals and instead uses Or-PCA (an opposite robust PCA) to remove noise, selecting the first principal component for action segmentation.
Second, when threshold-based segmentation methods are applied in real scenarios, performance may degrade significantly. Existing research mainly focuses on segmenting fine-grained activities or coarse-grained activities separately and achieves good performance in isolation. In a real scenario, however, fine-grained and coarse-grained actions may occur alternately and at random. Because the CSI amplitude differs greatly between mixed actions, the optimal threshold for fine-grained (coarse-grained) actions may not apply to coarse-grained (fine-grained) actions. As shown in Fig. 1, the CSI waveforms of one coarse-grained action (left) and one fine-grained action (right) are presented with the first-order difference of one subcarrier as the y-axis, where solid lines mark the actual start and end points of the actions and broken lines mark the start and end points detected with a given threshold. According to Fig. 1, if a small threshold is selected to detect the start and end points, the fine-grained action can be extracted accurately, but excessive noise around the coarse-grained action is mistakenly treated as coarse-grained action data. Likewise, a large threshold can accurately detect the coarse-grained action, but some fine-grained actions may be treated as noise and ignored. It is therefore quite difficult to set a threshold that accurately extracts all types of actions. Our experiments also show that the accuracy of threshold-based methods on mixed actions is lower than when only one type of action is present; the segmentation performance of threshold-based methods may thus be severely degraded under mixed actions.
Third, action segmentation and behavior classification are closely related but are generally treated as two separate parts. Existing research usually performs action segmentation first and then classifies the segmented results, without considering the relationship between the two.
Disclosure of Invention
Aiming at the problems that noise removal and threshold calculation depend on experience and that segmentation performance degrades under mixed actions, the invention provides a deep learning-based action segmentation framework construction method for WiFi signal behavior recognition.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a deep learning-based action segmentation framework construction method aiming at WiFi signal behavior recognition comprises the following steps:
step 1: training a state inference model through the CSI data;
step 2: dividing the CSI data into equal-length data segments through a sliding window with the size omega, and setting the sliding step length to be 1;
step 3: deducing the label of the data segment by adopting a trained state deducing model;
step 4: determining a starting point and an ending point of a human body action based on the label of the data segment, and carrying out segmentation extraction on the human body action;
step 5: based on the CSI data with segmented actions, adopting a deep learning model CNN as a behavior classification model, performing behavior classification, and calculating confidence as feedback information;
step 6: and integrating feedback information from the classification model into the construction of the state inference model to complete the construction of the action segmentation framework based on the deep learning.
Further, the step 1 includes:
step 1.1: dividing the continuously received CSI data into data segments of the same size and defining four state labels for them: the stationary state, start state, motion state, and end state, represented by 1, 2, 3, and 4, respectively;
step 1.2: using a CNN architecture as the state inference model and training it with the state-labeled CSI data segments as training data.
Further, the step 1.1 includes:
in a given CSI data segment, the CSI data from t_end + ω/2 to t_end + ω/2 + ω is marked as the stationary state, the data from t_start - ω/2 to t_start + ω/2 as the start state, the data from (t_start + t_end - ω)/2 to (t_start + t_end + ω)/2 as the motion state, and the data from t_end - ω/2 to t_end + ω/2 as the end state; where t_start and t_end refer to the actual start and end points of the human action, respectively, and ω is the size of the data segment.
Further, the step 4 includes:
determining whether a human action starts: if the mode of the state labels changes from 1 to 2, set i - m/2 + 1 as the start point of the action, where m is the length of the data list over which the mode is computed;
determining whether the human action ends: if the mode of the state labels changes from 4 to 1, set i - m/2 + 1 - ω as the end point of the action.
Further, calculating the confidence as feedback information comprises:
calculating the confidence D_i from the maximum probability and the probability entropy as feedback information:
D_i = (α·M_i + (1 - α)·(1 - E_i)) · 1(y = y*)
where D_i is the confidence of the i-th sample, M_i is the maximum probability, E_i is the probability entropy, α is a parameter adjusting the weights of the maximum probability and the probability entropy, and y and y* are the predicted label and the actual label, respectively.
Further, the step 6 comprises:
after adding the feedback information, the objective function J(Φ) of the state inference model can be expressed as:
J(Φ) = -(1/|X|) · Σ_{i=1}^{|X|} D_i · log p(r_i | x_i; Φ)
where |X| is the number of input CSI data segments, i.e., the number of input samples, X is the input CSI data, D_i is the confidence of the i-th sample, p(r_i | x_i; Φ) is the prediction probability function of the i-th sample x_i, r_i is the state corresponding to sample x_i, and Φ = {W_f, b_f, W_c, b_c}, where W_f and b_f are the convolution parameters and W_c and b_c are the parameters of the fully connected layer.
Compared with the prior art, the invention has the beneficial effects that:
the motion segmentation based on the state inference model in the invention avoids the noise removal and threshold calculation which need to depend on experience, and solves the problem of motion segmentation performance degradation under the mixed motion. Unlike existing methods that use motion segmentation and behavior classification as two independent phases, the present invention designs a feedback mechanism between the two, which considers feedback information extracted from the behavior classification results to improve the state inference model for motion segmentation. A large number of experiments prove that the framework proposed by the invention is superior to the existing method for scenes with fine granularity and/or coarse granularity actions.
Drawings
FIG. 1 is a schematic diagram of threshold-based motion segmentation in a hybrid motion;
fig. 2 is a flowchart of the deep learning-based action segmentation framework construction method for WiFi signal behavior recognition according to an embodiment of the invention;
fig. 3 illustrates the four states of a first-order differential CSI action waveform for the method according to an embodiment of the invention;
fig. 4 compares behavior recognition performance on different types of actions for the method according to an embodiment of the invention;
FIG. 5 is a graph of behavior recognition accuracy and F1 score for mixed, coarse-grained, and fine-grained actions using the LSTM model alone;
FIG. 6 is a graph of behavior recognition accuracy and F1 score for mixed, coarse-grained, and fine-grained actions using the SVM model alone;
FIG. 7 is a graph comparing segmentation accuracy of WiAG, Tw-See, Wi-Multi, and DeepSeg for different types of actions;
FIG. 8 is a graph of action segmentation accuracy for labeled data of different sizes;
fig. 9 is a graph of behavior recognition accuracy for labeled data of different sizes.
Detailed Description
The invention is further illustrated by the following description of specific embodiments in conjunction with the accompanying drawings:
as shown in fig. 2, a method for constructing an action segmentation framework based on deep learning for WiFi signal behavior recognition includes:
step S101: training a state inference model on the CSI data;
step S102: dividing the CSI data into equal-length data segments with a sliding window of size ω, with the sliding step set to 1;
step S103: inferring the labels of the data segments with the trained state inference model;
step S104: determining the start point and end point of a human action based on the data segment labels, and extracting the action by segmentation;
step S105: based on the action-segmented CSI data, adopting the deep learning model CNN as the behavior classification model, performing behavior classification, and calculating a confidence as feedback information;
step S106: integrating the feedback information from the classification model into the construction of the state inference model to complete the deep learning-based action segmentation framework.
Further, the step S101 includes:
step S101.1: dividing the continuously received CSI data into data segments of the same size and defining four state labels for them: the stationary state, start state, motion state, and end state, represented by 1, 2, 3, and 4, respectively;
step S101.2: using a CNN architecture as the state inference model and training it with the state-labeled CSI data segments as training data.
Further, the step S101.1 includes:
in a given CSI data segment, the CSI data from t_end + ω/2 to t_end + ω/2 + ω is marked as the stationary state, the data from t_start - ω/2 to t_start + ω/2 as the start state, the data from (t_start + t_end - ω)/2 to (t_start + t_end + ω)/2 as the motion state, and the data from t_end - ω/2 to t_end + ω/2 as the end state; where t_start and t_end refer to the actual start and end points of the human action, respectively, and ω is the size of the data segment.
Specifically, in step S101: since the fluctuation ranges of various actions differ greatly, it is difficult to obtain an optimal threshold to segment mixed actions, so the segmentation performance of threshold-based methods under mixed actions may degrade significantly. Meanwhile, removing noise and calculating thresholds by manual search is easily influenced by personal experience and the environment. To solve this problem, we do not use a threshold to segment actions; instead, we translate the action segmentation task into a classification problem.
To train the state inference model, training data with action start and end labels is extracted from the CSI sequences. Specifically, the continuously received CSI is divided into data segments of the same size, and four state labels are defined for these segments. Stationary state: CSI data with no action in the segment; start state: the segment contains the CSI waveform of the action's beginning; motion state: CSI data with an action present throughout the segment; end state: the segment contains the CSI waveform of the action's ending. Fig. 3 gives an example of the four states on a first-order differential CSI action waveform, where the dashed lines are the actual start and end points of the action. The start state consists of half non-action and half action portions; the end state likewise contains both, but with the action portion in front. In contrast, the stationary state contains only the non-action portion and the motion state only the action portion.
Table 1: Start and end points of the four states

No. | State label | Start point | End point
1 | Stationary state | t_end + ω/2 | t_end + ω/2 + ω
2 | Start state | t_start - ω/2 | t_start + ω/2
3 | Motion state | (t_start + t_end - ω)/2 | (t_start + t_end + ω)/2
4 | End state | t_end - ω/2 | t_end + ω/2
Table 1 gives the details of the extracted data segment states. t_start and t_end refer to the actual start and end points of the action, respectively, and ω is the size of the data segment. Columns 3 and 4 of the table show the start and end points of each state. For example, in a given action sample, the CSI data from t_end + ω/2 to t_end + ω/2 + ω is marked as the stationary state. Note that the first-order difference of the CSI amplitude, which is more effective for action segmentation than the raw amplitude, is used here. By this definition, each action sample ultimately generates four data segments, which are then used to train the state inference model.
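As a concrete illustration of the Table 1 labeling scheme, the following sketch cuts the four labeled state segments out of one action sample. The function name, the integer index arithmetic, and the example values are illustrative assumptions, not part of the patent.

```python
# Labels for the four states, following the 1-4 encoding used above.
STATE_STATIONARY, STATE_START, STATE_MOTION, STATE_END = 1, 2, 3, 4

def extract_state_segments(csi, t_start, t_end, omega):
    """Cut four omega-length labeled segments out of one action sample.

    csi     : list of per-packet CSI feature values (first-order differences)
    t_start : actual start index of the action
    t_end   : actual end index of the action
    omega   : data segment size (sliding-window size)
    """
    # Segment boundaries as defined in Table 1 (integer division for indices).
    bounds = {
        STATE_STATIONARY: (t_end + omega // 2, t_end + omega // 2 + omega),
        STATE_START: (t_start - omega // 2, t_start + omega // 2),
        STATE_MOTION: ((t_start + t_end - omega) // 2, (t_start + t_end + omega) // 2),
        STATE_END: (t_end - omega // 2, t_end + omega // 2),
    }
    return {label: csi[lo:hi] for label, (lo, hi) in bounds.items()}
```

Each call yields the four training segments for one action sample; applied to every sample, this produces the labeled training set for the state inference model.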
This embodiment uses a CNN architecture as the state inference model, which can be expressed simply as:
L=CNN(X) (3-1)
where X is the input CSI data and L is the output of the last fully connected layer. The architecture consists of convolution layers, dropout layers, and max-pooling layers, with convolution parameters W_f and b_f. The state prediction probability function p(r|x; Φ) can be expressed as:
p(r|x; Φ) = softmax(W_c · L + b_c) (3-2)
where r is the state corresponding to sample x, W_c and b_c are the parameters of the fully connected layer, and Φ = {W_f, b_f, W_c, b_c} is the set of parameters required by the state inference model; the subscripts c and f merely distinguish the parameters and have no essential meaning. For a given training set {X}, the objective function of the state inference model using cross entropy is expressed as:
J(Φ) = -(1/|X|) · Σ_{i=1}^{|X|} log p(r_i | x_i; Φ) (3-3)
where r_i is the state corresponding to sample x_i and |X| is the number of input CSI data segments, i.e., the number of input samples;
adam's algorithm using default parameters is used as the optimization method.
Further, the step S104 includes:
determining whether a human action starts: if the mode of the state labels changes from 1 to 2, set i - m/2 + 1 as the start point of the action, where m is the length of the data list over which the mode is computed;
determining whether the human action ends: if the mode of the state labels changes from 4 to 1, set i - m/2 + 1 - ω as the end point of the action.
Specifically, after the state inference model is trained, the start and end points of an action are determined in three steps: (1) divide the CSI data stream into equal-length data segments with a sliding window of size ω and a sliding step of 1; (2) infer the labels of the data segments with the trained state inference model; (3) based on the data segment labels, determine the start and end points of the human action by studying changes in the mode of a group of data segment labels. In other words, if the mode changes from 1 (stationary state) to 2 (start state), the corresponding data segment is considered the start of the action; if the mode changes from 4 (end state) to 1 (stationary state), the data segment is considered the end of the action. Algorithm 1 gives the details of this procedure. First, the input CSI data is divided into segments of length ω (line 1); the trained state inference model predicts the state labels of the segments, and the predictions are stored in inferred_label (line 2). The start and end points of the action are then detected by traversing inferred_label (line 5). During detection, lines 8-13 determine whether an action starts: if the state mode changes from 1 (stationary) to 2 (start), i - m/2 + 1 is set as the start point (line 11). Lines 13-18 determine whether the action ends: if the state mode changes from 4 (end) to 1 (stationary), i - m/2 + 1 - ω is set as the end point (line 16). Here ω is subtracted because, when the state mode becomes 1, no part of the current data segment contains the action, so the end point of the action should lie at the beginning of this data segment rather than at its end.
After the human actions are segmented and extracted, the CSI data is used for behavior classification.
As an alternative embodiment, m = 10 is set, i.e., 10 state labels are grouped into a data list, the mode of the list is computed, and the start or end of the action is then inferred from changes in the mode.
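The mode-change rule above can be sketched as follows. `statistics.mode`, the function name, and the synthetic label stream in the test are assumptions made for illustration (in Python 3.8+, `statistics.mode` returns the first most common value on ties).

```python
from statistics import mode

def detect_actions(inferred_label, omega, m=10):
    """Scan inferred window labels and return (start, end) index pairs.

    inferred_label : state label (1-4) of each size-omega sliding window
    omega          : sliding-window size
    m              : length of the data list over which the mode is taken
    """
    actions, start, prev_mode = [], None, None
    for i in range(m, len(inferred_label) + 1):
        cur_mode = mode(inferred_label[i - m:i])
        if prev_mode == 1 and cur_mode == 2:
            # stationary -> start: mark the action's start point
            start = i - m // 2 + 1
        elif prev_mode == 4 and cur_mode == 1 and start is not None:
            # end -> stationary: mark the end point, shifted back by omega
            actions.append((start, i - m // 2 + 1 - omega))
            start = None
        prev_mode = cur_mode
    return actions
```

Taking the mode over m consecutive labels smooths out isolated misclassifications by the state inference model, so a single wrongly labeled window does not trigger a spurious action boundary.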
Table 2 motion segmentation algorithm based on state inference model
Specifically, in step S105, behavior classification is performed using the deep learning model CNN as the behavior classification model, which takes the action-segmented CSI data as input and outputs the probability distribution over the action categories.
Beyond behavior classification, the probability distribution is also used to compute a concentration measure that serves as feedback for the action segmentation algorithm. For an action sample, the concentration of its probability distribution reflects the confidence of the sample. When the predicted class matches the actual class, a high confidence indicates that the sample is a high-quality sample as desired for behavior recognition. The confidence is therefore used as feedback to the action segmentation algorithm, which is adjusted accordingly so as to output more of the desired samples.
For a given sample, we represent the concentration, i.e., the confidence, with a combination of the maximum probability and the probability entropy, and use it as feedback. The maximum probability is the largest of the output probabilities; the category with the highest probability is taken as the final category of the input data. The probability entropy reflects the probabilities other than the maximum. Together they comprehensively represent the probability distribution.
Specifically, assume that P = {p(1|x_i), p(2|x_i), ..., p(C|x_i)} is the probability output of the behavior classification model for an action segmentation sample, where C is the number of action categories. Each element p(c|x_i) is the probability that the input data x_i belongs to class c. The maximum probability M_i is calculated as:
M_i = max_{c ∈ {1, ..., C}} p(c|x_i) (3-4)
for sample x i Its probability entropy E i The calculation formula is as follows:
smaller E i Meaning that input samples with high confidence can be easily distinguished. For example, when only item 1 in P is 100%, the other items are 0, and the probability entropy is 0. This means that the sample with the highest confidence will be marked as class 1. Conversely, the flatter the probability distribution, the greater the probability entropy and, correspondingly, the harder it is to distinguish the class to which the input sample belongs.
Furthermore, combining the maximum probability and the probability entropy, the feedback information of sample x_i, i.e., the confidence D_i, is expressed as:
D_i = (α·M_i + (1 - α)·(1 - E_i)) · 1(y = y*) (3-6)
where α is a parameter that adjusts the weights of the two parts, and y and y* are the predicted label and the actual label, respectively. If the predicted class is the same as the actual class, the confidence is higher, i.e., a greater D_i is produced. These values are used as feedback to adjust the action segmentation algorithm and improve overall performance.
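A numeric sketch of equations (3-4) to (3-6): M_i and E_i follow the standard definitions above, while the exact combination into D_i (a weighted sum that is zeroed for a wrong prediction) is an assumed form consistent with the description, not a verbatim copy of the patent's formula.

```python
import math

def confidence(probs, actual_label, alpha=0.8):
    # (3-4) maximum probability M_i
    m_i = max(probs)
    # (3-5) probability entropy E_i (terms with p = 0 contribute nothing)
    e_i = -sum(p * math.log(p) for p in probs if p > 0)
    predicted = probs.index(m_i)
    if predicted != actual_label:
        # y != y*: no positive feedback for a misclassified sample (assumption)
        return 0.0
    # (3-6) confidence D_i: alpha trades off peak probability against flatness
    return alpha * m_i + (1 - alpha) * (1 - e_i)
```

A sharply peaked, correct prediction yields a D_i near 1, a flat distribution yields a small D_i, and a misclassified sample contributes no positive feedback, matching the qualitative behavior described above.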
Specifically, in step S106, to improve overall performance, we incorporate the feedback information from the classification model into the action segmentation algorithm and train the two jointly. The feedback information is applied mainly to the state inference model: training samples with high confidence play an important role in its training, and increasing the weight of such desired samples effectively improves classification performance.
In particular, after adding the feedback information, the objective function of the state inference model can be expressed as:
J(Φ) = -(1/|X|) · Σ_{i=1}^{|X|} D_i · log p(r_i | x_i; Φ) (3-7)
where D_i, the confidence of the i-th sample, is calculated with equation (3-6). Compared with the loss function in equation (3-3), this loss function takes into account the differing importance of different samples.
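The feedback-weighted objective can be contrasted with the unweighted cross entropy in a few lines; the probability rows, states, and confidence values below are illustrative.

```python
import math

def weighted_objective(prob_rows, states, confidences):
    # J(Phi) = -(1/|X|) * sum_i D_i * log p(r_i | x_i; Phi), equation (3-7)
    n = len(prob_rows)
    return -sum(d * math.log(p[r]) for p, r, d in zip(prob_rows, states, confidences)) / n
```

With all D_i = 1 this reduces to the unweighted objective of equation (3-3); lowering D_i for a low-confidence sample shrinks that sample's contribution to the loss, so the state inference model is pulled mainly toward high-confidence samples.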
Table 3 Joint training algorithm
Considering the new loss function, we jointly train the action segmentation algorithm and the behavior classification model to improve overall performance. Algorithm 2 gives the complete joint training procedure. The algorithm first pre-trains the state inference model L_seg and the behavior classification model L_act on labeled data (lines 1-2). After pre-training, training proceeds by iteratively updating the parameters of L_seg and L_act (lines 3-8). Specifically, Algorithm 1 extracts action data from the continuously received CSI, and the extracted samples are saved (line 4). For each sample i, the feedback D_i is computed with equation (3-6) (line 5); the state inference model is then optimized with the new loss function of equation (3-7) (line 6), and the behavior classification model is trained on the extracted samples (line 7). After all training is finished, the behavior classification model is used to classify the actions.
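The control flow of Algorithm 2 can be sketched with placeholder model objects; the method names (`pretrain`, `segment`, `predict`, `train`) and the loop count are assumptions standing in for the CNN models and CSI data of the patent.

```python
def joint_train(L_seg, L_act, labeled_data, csi_stream, compute_feedback, iterations=3):
    # lines 1-2: pre-train the state inference and behavior classification models
    L_seg.pretrain(labeled_data)
    L_act.pretrain(labeled_data)
    for _ in range(iterations):  # lines 3-8: iterative parameter updates
        samples = L_seg.segment(csi_stream)  # line 4: Algorithm 1 extracts action samples
        feedback = [compute_feedback(L_act.predict(x)) for x in samples]  # line 5: D_i per sample
        L_seg.train(samples, feedback)  # line 6: weighted loss of equation (3-7)
        L_act.train(samples)            # line 7: retrain the classifier
    return L_seg, L_act
```

The key point of the loop is the coupling: better segmentation yields cleaner classification samples, and the classifier's confidence in turn reweights the segmentation model's training.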
To verify the effect of the invention, the following experiments were performed:
for convenience of description, the action segmentation framework based on deep learning for WiFi signal behavior recognition constructed by the invention is simply called deep seg. In this section, we will evaluate the effectiveness of deep seg on coarse-grained, fine-grained, and mixed-action datasets while also studying the effects of behavioral classification models and training data sizes.
1. Experimental setup
To collect CSI data of human body motion, we deployed a common commercial WiFi device and a notebook computer with an Intel 5300 network card in a 10 × 6 m conference room. A router with one antenna served as the transmitter and a laptop with 3 antennas as the receiver, with the CSI sampling rate set to 50 packets per second. The collection tool acquires CSI data on 30 subcarriers from each transmit-receive antenna pair. Using these devices, 5 volunteers of different sizes and ages were asked to perform 10 actions, including 5 fine-grained actions (waving hands, lifting hands, pushing hands, drawing a circle, drawing an X) and 5 coarse-grained actions (punching, bending over, running, squatting, walking). For each CSI sequence, each volunteer performed the 5 fine-grained and 5 coarse-grained actions with time intervals between them, and the collection was repeated 30 times, yielding 30 samples per action per volunteer and 1500 samples in total, with 80% used as the training set and 20% as the test set.
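The sample counts quoted above follow directly from the collection protocol:

```python
# Dataset bookkeeping for the setup above: 5 volunteers, 10 actions,
# 30 repetitions each, split 80/20 into training and test sets.
volunteers, actions, repeats = 5, 10, 30
total_samples = volunteers * actions * repeats   # 1500 samples in total
train_samples = int(total_samples * 0.8)         # training set
test_samples = total_samples - train_samples     # test set
```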
During training, the learning rate was set to 0.006 and the mini-batch size to 16. The hyperparameter α in Equation (3-6) and the window size ω in Table 1 were empirically set to 0.8 and 120, respectively. Regarding implementation details of DeepSeg, Table 4 presents the CNN architecture of the state inference model, where Conv, T-Conv, BN, WN, FM, and NiN refer to convolution, transposed convolution, batch normalization, weight normalization, feature map, and Network in Network, respectively. The behavior classification model uses the same architecture, except that the input CSI matrix in row 1 is 200×30×3 and the stride in rows 5 and 9 is 5×2.
Table 4 CNN network architecture
2. Behavior recognition performance
We first demonstrate the performance of DeepSeg in behavior recognition. What makes DeepSeg unique is its action segmentation algorithm based on the state inference model; to exhibit the segmentation effect explicitly, we keep the behavior classification model fixed and compare DeepSeg against other segmentation techniques substituted for the state-inference-based action segmentation algorithm. The baseline methods are as follows: (1) WiAG: a typical threshold-based gesture extraction method that determines the beginning and end of a gesture by comparing the magnitude of the principal components with a given threshold. (2) Tw-See: an effective action segmentation algorithm for through-wall signals that eliminates the influence of small fluctuations in the CSI waveform and thereby accurately detects the start and end points of an action. (3) Wi-Multi: a novel action segmentation algorithm for multi-target noisy environments, which eliminates potential false detections by calculating the maximum eigenvalue of the correlation matrix of amplitude and calibrated phase, improving segmentation accuracy in noisy and/or multi-target environments. (4) NoFeedback: the proposed DeepSeg with the feedback mechanism removed; it uses the same segmentation algorithm and classification model as DeepSeg, but the action segmentation algorithm receives no feedback from the behavior classification model.
The behavior recognition results for mixed, coarse-grained, and fine-grained actions are shown in Fig. 4. For mixed actions, as shown in Fig. 4(a), NoFeedback and DeepSeg are significantly better than the three threshold-based segmentation methods, mainly because the deep learning model can effectively handle differences in the magnitudes of different actions, so the two deep-learning-based models identify the start and end of actions more accurately. Thus, although the same behavior recognition model is used, their recognition Accuracy and F1 Score are both significantly higher than those of the three threshold-based action segmentation methods. Of the two CNN-based algorithms, DeepSeg has the better recognition performance, indicating that the proposed DeepSeg framework with a feedback mechanism can further improve final recognition performance.
In addition, for the coarse-grained actions in Fig. 4(b), since the CSI amplitudes of these actions differ only slightly, all three threshold-based methods can separate the actions accurately and obtain high recognition accuracy. For example, the best-performing WiAG reaches nearly 94% accuracy, similar to the result of NoFeedback. However, all threshold-based methods still perform worse than the proposed DeepSeg, and a similar trend can be found for the fine-grained actions in Fig. 4(c). These results indicate that, compared with the threshold-based methods, DeepSeg effectively improves recognition performance even when only actions of the same type are present.
3. Effect of the behavior classification model
The analysis in the previous section shows that DeepSeg achieves better performance using the CNN architecture as the behavior classification model. To evaluate the effectiveness of the feedback mechanism in this framework, we keep the action segmentation algorithm and evaluate the behavior classification performance of DeepSeg with other machine learning models in place of the CNN architecture. There are three typical approaches to CSI-based behavior classification: Dynamic Time Warping (DTW), shallow learning, and deep-learning-based methods. Since DTW-based methods cannot output the classification probabilities of input samples that DeepSeg requires, a Support Vector Machine (SVM) and a Long Short-Term Memory network (LSTM) are selected as representatives of shallow learning and deep learning. These two models are widely used in CSI-based behavior recognition applications such as gesture recognition, fall detection, and activity recognition, and achieve good performance. For the SVM model, we use the same features and configuration as the study (Palipana, S., et al., FallDeFi: Ubiquitous Fall Detection using Commodity Wi-Fi Devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2018. 1(4): p. 1-25.). The LSTM model consists of three LSTM hidden layers and one fully connected layer.
Figs. 5 and 6 show the behavior recognition accuracy and F1 for Mixed, Coarse-grained, and Fine-grained actions using the LSTM model and the SVM model, respectively. Here, DeepSeg-LSTM in Fig. 5 refers to our proposed framework with LSTM in place of CNN for behavior classification, while NoFeedback-LSTM refers to the corresponding framework without the feedback mechanism; the same naming applies to DeepSeg-SVM and NoFeedback-SVM in Fig. 6. When LSTM is used for behavior classification, as shown in Fig. 5, DeepSeg-LSTM performs significantly better than NoFeedback-LSTM on different types of actions. For example, for mixed actions, the recognition accuracy of DeepSeg-LSTM is about 3% higher than that of NoFeedback-LSTM, and the SVM model in Fig. 6 shows a similar trend. These results show that the framework with the feedback mechanism is consistently better than the framework without it, and that the framework can be applied to different supervised learning models.
4. Action segmentation performance
We have demonstrated that DeepSeg consistently achieves better behavior recognition performance than the threshold-based methods. However, its effectiveness in action segmentation itself remains to be shown; we therefore compare the segmentation accuracy of the DeepSeg action segmentation algorithm with the three threshold-based segmentation methods WiAG (17), Tw-See (14), and Wi-Multi (15). For a given action, segmentation accuracy is defined as |A ∩ B| / max{|A|, |B|}, where A is the set of actual packet indices of the action, B is the predicted set, |A| and |B| are the numbers of packets in A and B, respectively, and max{|A|, |B|} is the larger of the two. Intuitively, accurate segmentation facilitates behavior classification; conversely, a behavior classification model can classify an action correctly even if its beginning and end are not detected exactly, so the accuracy of behavior classification may exceed the accuracy of action segmentation. If the predicted start and end points are close to the actual ones, a large portion of the action data can be extracted, which is sufficient for correct classification.
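The metric just defined is straightforward to compute over packet-index sets; a minimal sketch:

```python
def segmentation_accuracy(actual, predicted):
    """Segmentation accuracy |A ∩ B| / max{|A|, |B|} of one action,
    where A and B are the actual and predicted packet-index sets."""
    a, b = set(actual), set(predicted)
    return len(a & b) / max(len(a), len(b))

# Predicted boundaries offset slightly from the actual action:
# actual covers packets 100-199, prediction covers 110-204,
# so 90 shared packets out of max(100, 95) give 0.9.
acc = segmentation_accuracy(range(100, 200), range(110, 205))
```

Note how the metric penalizes both missed packets and over-extraction: a prediction much longer than the true action lowers the denominator's counterpart |B| into the max, reducing the score.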
Fig. 7 shows the segmentation accuracy for coarse-grained, fine-grained, and mixed actions. The solid lines represent the accuracy of the three threshold-based methods at different thresholds. The dashed line is the accuracy of DeepSeg, which is horizontal because DeepSeg requires no threshold. Note that Wi-Multi in Fig. 7(c) uses a dynamic threshold to segment actions; however, for a given CSI sequence the threshold is determined only by the sampling rate, so the x-axis in Fig. 7(c) shows the sampling rate, referred to as the threshold for convenience of description.
We observe three points from Fig. 7. First, regardless of threshold changes, the segmentation performance of DeepSeg is always better than that of the three threshold-based methods, especially for mixed actions. As shown in Fig. 7, the segmentation accuracy of DeepSeg for mixed actions is 10%, 9%, and 6% higher than that of WiAG, Tw-See, and Wi-Multi, respectively. For actions of similar granularity, such as coarse-grained actions, the threshold-based methods can often come close to DeepSeg, because the CSI amplitudes of such actions differ only slightly and their beginnings and ends can be detected accurately with a given threshold. Nevertheless, the three methods always perform worse than DeepSeg when detecting actions of different types. These analyses show that DeepSeg effectively improves segmentation performance compared with the threshold-based methods, especially for mixed actions.
Second, DeepSeg achieves very similar performance across action types. For example, the accuracy difference between mixed and fine-grained actions is less than 1%, showing that the proposed DeepSeg suits different types of actions and yields relatively stable performance. Third, for the three threshold-based methods, actions of different granularity require different thresholds to achieve good segmentation performance. For example, for WiAG in Fig. 7(a), at a threshold of 0.25 the segmentation accuracy of fine-grained actions reaches its maximum of 83.89%, but coarse-grained actions obtain only 81.83% accuracy, about 9% lower than their optimum at a threshold of 0.6. Similar trends hold for Tw-See in Fig. 7(b) and Wi-Multi in Fig. 7(c). The experimental results indicate that these threshold-based methods achieve good performance for actions of similar granularity, but their performance may degrade significantly on mixed actions, whereas the proposed DeepSeg obtains relatively stable, better performance across action types.
5. Effect of training data size
This section studies the impact of training data size on segmentation and recognition performance. The first 20% of the experimental samples are used as test data, different percentages of the remaining data are chosen as training sets, and mixed-action data is used for these experiments. DeepSeg is compared with the three threshold-based methods WiAG, Tw-See, and Wi-Multi. For DeepSeg, the training set is used to train the state inference model that segments actions on the test set; for the three threshold-based methods, the training set is used to find the best threshold for action segmentation.
The segmentation accuracy and the corresponding classification accuracy are shown in Figs. 8 and 9, respectively. As the proportion of labeled data increases in Fig. 8, DeepSeg obtains higher and higher segmentation accuracy, indicating that the amount of labeled data strongly affects the segmentation accuracy of DeepSeg. Furthermore, except when only a small amount of training data is available (e.g., 10%), DeepSeg is significantly better than the three threshold-based methods, because DeepSeg segments actions with a deep learning model, which requires a certain number of samples for training. Once the labeled data proportion reaches 20%, the accuracy of DeepSeg is significantly higher than that of the three threshold-based methods, and behavior classification in Fig. 9 shows a similar trend. These results indicate that DeepSeg needs a certain amount of training data to achieve good performance; however, the required amount of labeled data is modest (20% training data corresponds to only 25 samples per action category), so manually labeling actions is easy.
The deep-learning-based action segmentation framework DeepSeg for WiFi signal behavior recognition constructed by the invention replaces action segmentation algorithms that adjust thresholds empirically with an action segmentation algorithm based on a state inference model, avoiding the experience-dependent noise removal and threshold calculation and solving the degradation of action segmentation performance on mixed actions. Unlike existing methods that treat action segmentation and behavior classification as two independent stages, we design a feedback mechanism between the two to improve the action segmentation algorithm in view of feedback information extracted from the behavior classification results. Extensive experiments demonstrate that the proposed framework is superior to existing methods for scenes with fine-grained and/or coarse-grained actions.
The foregoing is merely illustrative of the preferred embodiments of this invention, and it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of this invention, and it is intended to cover such modifications and changes as fall within the true scope of the invention.

Claims (4)

1. The deep learning-based action segmentation framework construction method for WiFi signal behavior recognition is characterized by comprising the following steps of:
step 1: training a state inference model with CSI data;
step 2: dividing the CSI data into equal-length data segments with a sliding window of size ω, with the sliding step set to 1;
step 3: inferring the labels of the data segments using the trained state inference model;
step 4: determining the starting point and ending point of a human action based on the labels of the data segments, and segmenting and extracting the human action;
step 5: based on the action-segmented CSI data, performing behavior classification using the deep learning model CNN as the behavior classification model, and calculating the confidence as feedback information;
the calculating of the confidence includes:
calculating the confidence D_i using the maximum probability and the probability entropy as feedback information:
where D_i represents the confidence of the i-th sample, M_i is the maximum probability, E_i is the probability entropy, α is a parameter adjusting the weights of the maximum probability and the probability entropy, and y and y* are the predicted label and the actual label, respectively;
step 6: the feedback information from the classification model is merged into the construction of the state inference model, so that the construction of the action segmentation framework based on deep learning is completed;
the step 6 comprises the following steps:
after adding the feedback information, the objective function J(Φ) of the state inference model can be expressed as:
J(Φ) = −(1/|X|) Σ_{i=1}^{|X|} D_i log p(r_i | x_i; Φ)
where |X| represents the number of pieces of input CSI data, i.e., the number of input samples, X is the input CSI data, D_i represents the confidence of the i-th sample, p(r_i | x_i; Φ) represents the prediction probability of the i-th sample x_i, r_i represents the state corresponding to sample x_i, and Φ = {W_f, b_f, W_c, b_c}, where W_f and b_f are convolution parameters, and W_c and b_c are parameters of the fully connected layer.
2. The method for constructing a deep learning-based action segmentation framework for WiFi signal behavior recognition according to claim 1, wherein the step 1 includes:
step 1.1: dividing the continuously received CSI data into data segments of the same size, and defining four state labels for the data segments: a stationary state, a start state, a motion state, and an end state, represented by 1, 2, 3, and 4, respectively;
step 1.2: and training the state inference model by using the CNN architecture as the state inference model and using the CSI data segment with the state label as training data.
3. The deep learning-based action segmentation framework construction method for WiFi signal behavior recognition according to claim 2, wherein the step 1.1 includes:
in a given CSI data segment, the CSI data from t_end + ω/2 to t_end + ω/2 + ω is marked as the stationary state, the CSI data from t_start − ω/2 to t_start + ω/2 is marked as the start state, the CSI data from (t_start + t_end − ω)/2 to (t_start + t_end + ω)/2 is marked as the motion state, and the CSI data from t_end − ω/2 to t_end + ω/2 is marked as the end state; wherein t_start and t_end refer to the actual starting and ending points of the human action, respectively, and ω refers to the size of the data segment.
4. The deep learning-based action segmentation framework construction method for WiFi signal behavior recognition according to claim 2, wherein the step 4 includes:
determining whether a human action starts: if the mode of the state labels changes from 1 to 2, setting i − m/2 + 1 as the starting point of the action, where m represents the length of the data list over which the mode is calculated;
determining whether the human action ends: if the mode of the state labels changes from 4 to 1, setting i − m/2 + 1 − ω as the end point of the action.
CN202010718841.1A 2020-07-23 2020-07-23 Deep learning-based action segmentation framework construction method aiming at WiFi signal behavior recognition Active CN111914709B (en)

Publications (2)

Publication Number Publication Date
CN111914709A CN111914709A (en) 2020-11-10
CN111914709B true CN111914709B (en) 2024-02-06
