CN114118167A - Action sequence segmentation method for behavior recognition based on self-supervised few-shot learning - Google Patents

Action sequence segmentation method for behavior recognition based on self-supervised few-shot learning

Info

Publication number
CN114118167A
Authority
CN
China
Prior art keywords
sample
data
sensor
segmentation
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111471435.0A
Other languages
Chinese (zh)
Other versions
CN114118167B (en)
Inventor
肖春静
陈世名
韩艳会
康红霞
王一凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202111471435.0A priority Critical patent/CN114118167B/en
Publication of CN114118167A publication Critical patent/CN114118167A/en
Application granted
Publication of CN114118167B publication Critical patent/CN114118167B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/12 Classification; Matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of sensor-based action recognition, and discloses an action sequence segmentation method for behavior recognition based on self-supervised few-shot learning, comprising the following steps: constructing a self-supervised few-shot action sequence segmentation framework, SFTSeg; the framework is based on a twin neural network and takes a large number of labeled samples from source sensors, a small number of labeled samples from target sensors, and unlabeled samples from target sensors as input data; a cross-entropy loss function, a consistency regularization loss function, and a self-supervised loss function are respectively constructed to train the twin neural network; the trained SFTSeg is then used for state label prediction and activity segmentation. The method achieves good activity segmentation across different sensors and scenarios, requiring only a few labeled samples from the target sensor.

Description

Action sequence segmentation method for behavior recognition based on self-supervised few-shot learning
Technical Field
The invention belongs to the technical field of sensor-based action recognition, and particularly relates to an action sequence segmentation method for behavior recognition based on self-supervised few-shot learning.
Background
Human activity recognition is considered a key component of many emerging Internet of Things applications, such as smart homes and healthcare, where the quality of activity segmentation is crucial. Before activity classification, the continuously received sensor data is typically segmented into subsequences, each corresponding to a single activity. These segmentation results are then fed into the classification model for behavior recognition, so they have a significant impact on the performance of activity classification. Consequently, a great deal of research has been devoted to activity segmentation, covering both unsupervised methods and supervised models.
For unsupervised approaches in the activity segmentation task: both CPD (change point detection) and threshold-based approaches require a threshold to distinguish activity boundaries. However, choosing the optimal threshold demands considerable user experience and must be done according to the actual scenario. Furthermore, temporal-shape-based methods (such as FLOSS) require problem-specific information to determine time-constraint parameters, making them relatively environment-dependent.
For supervised methods in the activity segmentation task: although they can alleviate the problems of subjectivity and environment dependence, they require a large number of labeled samples from the target sensor to train the model, which is time-consuming, labor-intensive, and often impractical under real-world constraints.
Disclosure of Invention
To address the environment dependence of existing activity segmentation methods and their need for large numbers of labeled training samples, the invention provides an action sequence segmentation method for behavior recognition based on self-supervised few-shot learning, which achieves good activity segmentation across different sensors and scenarios and requires only a few labeled samples from the target sensor.
In order to achieve the purpose, the invention adopts the following technical scheme:
An action sequence segmentation method for behavior recognition based on self-supervised few-shot learning comprises the following steps:
Step 1: construct a self-supervised few-shot action sequence segmentation framework, SFTSeg; the framework is based on a twin neural network and takes a large number of labeled samples from source sensors, a small number of labeled samples from target sensors, and unlabeled samples from target sensors as input data; the labeled samples of the source and target sensors correspond to four state labels: static state, start state, motion state, and end state; a sample refers to an action sequence derived from sensor data;
Step 2: construct a cross-entropy loss function on the labeled samples of the source sensor to train the twin neural network;
Step 3: for the labeled samples of the target sensor, compress labeled samples of the source sensor into perturbations, inject them into the labeled samples of the target sensor as enhanced data, and construct a consistency regularization loss function to train the twin neural network;
Step 4: construct positive and negative sample pairs from the unlabeled samples of the target sensor, and construct a self-supervised loss function based on these pairs to train the twin neural network so that it can capture the characteristics of the unlabeled target samples;
Step 5: after steps 1-4 yield the trained SFTSeg, input a target sensor sample as a test sample into the trained SFTSeg, which predicts the state label of the test sample; the test sample is then segmented into activities according to the predicted state labels.
Further, step 3 comprises:
constructing the enhanced data according to the following rules:
A. the compressed labeled samples of the source sensor used as perturbations have the same class as the labeled samples of the target sensor;
B. the compressed labeled samples of the source sensor are added to the labeled samples of the target sensor along the warping path, which is generated by a dynamic time warping algorithm.
Further, step 4 comprises:
discretizing the action sequence into overlapping windows of fixed size w using a sliding window with sliding step l;
two windows are considered a positive sample pair if they satisfy the following constraints: the two windows are adjacent; the two windows contain the same number of change points, and the difference sets of the two windows do not contain any change point;
two windows are considered a negative sample pair if they satisfy the following constraints: the two windows are separated in time by more than a given minimum distance; the two windows contain different numbers of change points. A change point is a time point at which the behavior in the action sequence changes abruptly.
Further, step 4 also includes:
for positive sample pairs, first computing the SEP scores of the difference sets of each pair and then filtering the pairs according to these scores;
for negative sample pairs, dividing each sample of a pair into h disjoint parts, computing the SEP scores of all pairs of consecutive parts to obtain the highest SEP score of each sample, then computing the dissimilarity score of the pair from the two highest SEP scores, and eliminating negative pairs with lower dissimilarity scores.
Compared with the prior art, the invention has the following beneficial effects:
the invention proposes a self-supervised, sample-less motion sequence segmentation framework SFTSeg to segment the activity on motion sequence data and to implement sample-less learning and classification using a twin neural network. The conventional activity segmentation method is usually based on the same sensor, the method can enhance the identification accuracy of target sensor data by using source sensor data, and can realize good activity segmentation and identification effects by using few target sensor marking samples. The method realizes a few-sample activity segmentation technology, and adopts a twin neural network as a main realization method of few-sample learning. Aiming at three different data, the invention respectively designs different loss functions to enhance the training effect: aiming at the marked samples of the source sensor, constructing a cross entropy loss function to force the input samples to corresponding categories; in order to enhance the generalization capability of the target sensor data, a consistency regularization method is introduced, a labeled sample of a source sensor is used as disturbance, the disturbance is used as enhancement data and is injected into the labeled sample of the target sensor, and the limited labeled sample of the target sensor is utilized for model training; to mitigate the large amount of drift between the source domain and the target domain, an auto-supervised learning is introduced, and a twin neural network is trained by constructing a positive sample pair and a negative sample pair based on unlabeled samples of the target sensor, so that the twin neural network can capture the characteristics of the target data.
The invention addresses the environment dependence and designer subjectivity of unsupervised methods (such as change-point-based and threshold-based detection) in the activity segmentation task, achieving good activity segmentation across different sensors and scenarios. It also addresses the supervised methods' need for large amounts of labeled target sensor data (which is costly and constrained by many conditions), achieving good activity segmentation with only a few labeled target sensor samples.
Drawings
Fig. 1 is an example diagram of the four motion states extracted from one action sequence;
FIG. 2 is a flowchart of the action sequence segmentation method for behavior recognition based on self-supervised few-shot learning according to an embodiment of the present invention;
FIG. 3 is an example diagram of the difference between the warping path and the shortest path;
FIG. 4 is an example diagram of positive (negative) sample pairs taken from an action sequence;
FIG. 5 is an example diagram of activity start point detection;
FIG. 6 shows line graphs of segmentation performance (F1-score) for different amounts of labeled target data.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
Activity segmentation aims to determine the start and end times of an activity and is the first step in human activity recognition. Because it is difficult to collect large amounts of labeled data from target sensors, unsupervised methods such as CPD-based and threshold-based methods are widely used for activity segmentation. However, these methods suffer from experience and environment dependence. Therefore, we transform the activity segmentation task into a classification problem via the following steps: (1) first, the continuous time-series data is discretized into windows of equal size; (2) each window is then assigned to one of four state categories: static state, start state, motion state, and end state; (3) finally, the start and end points of activities are identified from the state labels. Here, the four state labels are defined as follows. Static state: the window is filled with time-series data containing no activity. Start state: the window contains the start of an activity. Motion state: the window is filled with time-series data of human activity. End state: the window contains the end of an activity. An example of these four states extracted from one activity is shown in Fig. 1, which plots the first-order difference of the WiFi Channel State Information (CSI) amplitude of one subcarrier over time. The vertical dashed lines are the actual start and end points of the activity.
Thus, the segmentation result depends largely on the state inference quality, and the activity segmentation problem becomes one of designing a suitable state inference model to predict the state labels of the discretized data from the target sensors. Since labeled samples are limited, we introduce few-shot learning for the state inference model. Given that collecting many labeled samples from source sensors is feasible, our goal becomes how to construct a robust few-shot learning model for state inference on sensor data using three types of input data: a large number of labeled samples from source sensors, a small number of labeled samples from target sensors, and unlabeled samples from target sensors. In practical applications, since target data may be collected in different scenarios (e.g., different people, environments, and sensor devices), their styles and characteristics can differ considerably. Therefore, the few-shot learning model should be able to handle large differences between the source and target data distributions.
Specifically, the action sequence segmentation method for behavior recognition based on self-supervised few-shot learning comprises the following steps:
Step 1: construct a self-supervised few-shot action sequence segmentation framework, SFTSeg; the framework is based on a twin neural network (Siamese network) and takes labeled samples of source sensors, labeled samples of target sensors, and unlabeled samples of target sensors as input data; the labeled samples of the source and target sensors correspond to four state labels: static state, start state, motion state, and end state; a sample refers to an action sequence derived from sensor data. Specifically, the framework is a state inference model based on a twin neural network;
Step 2: construct a cross-entropy loss function on the labeled samples of the source sensor to train the twin neural network;
Step 3: for the labeled samples of the target sensor, compress labeled samples of the source sensor into perturbations, inject them into the labeled samples of the target sensor as enhanced data, and construct a consistency regularization loss function to train the twin neural network;
Step 4: construct positive and negative sample pairs from the unlabeled samples of the target sensor, and build a self-supervised loss function on these pairs to train the twin neural network, so that it can capture the characteristics of the unlabeled target samples;
Step 5: after steps 1-4 yield the trained SFTSeg, input a target sensor sample as a test sample into the trained SFTSeg, which predicts the state label of the test sample; the test sample is then segmented into activities according to the predicted state labels.
On the basis of the above embodiment, the invention further provides another action sequence segmentation method for behavior recognition based on self-supervised few-shot learning, which specifically includes:
A. Overview of the state inference model
Specifically, to address the problems of subjectivity, environment dependence, and insufficient labeled target sensor data, we introduce a few-shot learning model to predict the state labels of discretized data and then use these labels to segment activities. However, unlike general few-shot learning, in the activity segmentation scenario there is a large shift between the source domain and the target domain, while the class labels are the same. To this end, we propose a self-supervised few-shot action sequence segmentation framework, SFTSeg, which is essentially a state inference model based on a twin neural network, as shown in Fig. 2.
Specifically, the framework is based on a twin neural network with three types of data as input: a large number of labeled samples from source sensors, a small number of labeled samples from target sensors, and unlabeled samples from target sensors, corresponding to the classification loss, the consistency regularization loss, and the self-supervised loss, respectively. First, since the labeled source samples and the labeled target samples share the same four categories, we construct a classification (cross-entropy) loss $L_{cl}$ based on the labeled source samples $D_{ls}$. Second, since a model trained on source sensor data alone may not accurately capture the features of the target sensor data, we exploit a consistency regularization loss $L_{cr}$ based on the limited labeled target samples $D_{lt}$ to enhance the smoothness of the model. Here, a labeled source sample $x^{ls}$ is shrunk and injected as a perturbation into a labeled target sample $x^{lt}$ to generate enhanced data $\tilde{x}^{lt}$, and the sample pair $(x^{lt}, \tilde{x}^{lt})$ is used to construct the consistency regularization loss $L_{cr}$. Third, to enhance generalization on the target sensor data, we design a self-supervised loss $L_{ss}$ based on sample pairs $(x_1^{ut}, x_2^{ut})$ extracted from the unlabeled target data by an auxiliary task that we design for time series. We further propose an adaptive weighting method to enhance this loss.
Specifically, we achieve few-shot learning through a twin neural network. A typical twin neural network employs a convolutional neural network trained on a large amount of labeled source sensor data to extract feature vectors, and performs few-shot classification by measuring the distance between a new sample and the samples of each class from the target sensor. A twin neural network consists of two branches with shared parameters, each using the same architecture, such as a convolutional neural network (CNN). Generally, the twin network is trained by minimizing a contrastive loss over sample pairs. Given an input sample pair $(x_1, x_2)$ and its feature vector pair $(f(x_1), f(x_2))$ obtained from the twin network, the distance between the feature vectors in the latent space is computed as

$$d_e = \|f(x_1) - f(x_2)\| \quad (1)$$

The contrastive loss function $L_{ct}$ is defined as:

$$L_{ct} = (1-y)\, d_e^2 + y\, \max(0,\; m - d_e)^2 \quad (2)$$

where y is the binary label assigned to the pair, i.e., y = 0 if $x_1$ and $x_2$ belong to the same class and y = 1 otherwise, and m is the margin.
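For illustration, a minimal PyTorch sketch of this contrastive objective over a batch of pairs might look as follows (the encoder producing f(x) is assumed to be any CNN branch of the twin network; names and shapes here are illustrative rather than taken from the patent):

```python
import torch
import torch.nn.functional as F

def pairwise_distance(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    # Eq. (1): Euclidean distance between feature vectors, per pair in the batch.
    return torch.norm(f1 - f2, dim=1)

def contrastive_loss(f1, f2, y, margin: float = 1.0) -> torch.Tensor:
    # Eq. (2): y = 0 for same-class pairs (pull together), y = 1 otherwise (push apart).
    d = pairwise_distance(f1, f2)
    return ((1 - y) * d.pow(2) + y * F.relu(margin - d).pow(2)).mean()
```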
B. Enhancing few-shot learning with consistency regularization
Here we strengthen the twin neural network model using the labeled data in two ways. First, for labeled source data, we use a cross-entropy (classification) loss for model training, since the source data has the same four classes as the target data. Second, for labeled target data, which is very scarce, we propose a data enhancement method for action sequences and design a consistency regularization loss that forces the enhanced data and the original data to have the same label distribution, strengthening model smoothness.
Classification loss. Unlike typical few-shot learning tasks, for activity segmentation the source sensor data and the target sensor data share the same four categories: static state, start state, motion state, and end state. Therefore, to take advantage of the labeled source sensor data, we train the network with a cross-entropy loss rather than the usual few-shot learning loss, which enhances the classification capability of the network. Let $D_{ls} = \{(x_i^{ls}, y_i)\}_{i=1}^{N}$ be the labeled sample set of the source sensors, where $y_i$ is the state label of $x_i^{ls}$. The classifier f is a function mapping the input feature space to the label space. Considering the labeled samples $D_{ls}$ of all source sensors, the cross-entropy (classification) loss is:

$$L_{cl} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j} y_{ij} \log f_j(x_i^{ls}; \theta) \quad (3)$$

where $\theta$ denotes the model parameters, $y_{ij}$ is the j-th element of the one-hot label of sample $x_i^{ls}$, and $f_j$ is the j-th element of f.
Consistency regularization. Due to the large shift between the source and target domains, a model trained only on source sensor data may not fully capture the features of the target sensor data and, accordingly, may not effectively predict the state labels of the target sensor data. We therefore introduce consistency regularization to enhance the generalization capability of the model using the limited labeled target sensor data. In other words, we design a data enhancement method for action sequence data that generates enhanced data used to build a consistency regularization loss.
Consistency regularization aims to ensure that the classifier assigns the same class label to a sample after a perturbation is injected. Although widely used perturbation methods such as random noise, Gaussian noise, and attenuation noise are effective for image and natural-language data, they are not suitable for time-series data due to its intrinsic properties. For example, image perturbation methods mainly generate pixel-level variations, whereas time-series data are waveforms varying over time and require temporally coherent variations. Furthermore, the enhanced data should have a style similar to the target sensor data, which helps the inference model learn the characteristics of the target sensor data.
To this end, we shrink a labeled sample of the source sensor as a perturbation and inject it into a labeled sample of the target sensor to generate enhanced data. The original sample (the labeled target sample) and the enhanced data are then input into the twin neural network, and the distance between their corresponding features is minimized to train the network.
Specifically, to generate enhanced data in the style of the target sensor data, we construct it according to two rules: (i) the labeled samples of the source sensor that are compressed as perturbations should be of the same class as the labeled samples of the target sensor; (ii) the compressed labeled samples of the source sensor are injected into the labeled samples of the target sensor along the warping path. Here, the warping path, generated with a dynamic time warping (DTW) algorithm, maps the elements of two data sequences so as to minimize the distance between them. An example of the warping path and the shortest path for two action sequence samples is shown in Fig. 3, where the solid black and gray lines represent the waveforms of two time-series samples, and the dashed gray lines represent the warping path in Fig. 3(a) and the shortest path in Fig. 3(b). When the solid black line is used as the perturbation, adding it to the solid gray line along the shortest path distorts the waveform sharply, as shown in Fig. 3(c). In contrast, the result obtained with the warping path in Fig. 3(d) preserves the basic shape while introducing some variation in the waveform.
Thus, the enhanced data $\tilde{x}^{lt}$ is computed as:

$$\tilde{x}^{lt} = \mathrm{Aggregate}\big(x^{lt},\, h(x^{ls})\big) \quad (4)$$

where $\mathrm{Aggregate}(x, x')$ sums the two sequences along the warping path, and $h(x)$ shrinks the amplitude of the sensor data, e.g., $h(x) = \gamma x$, where $\gamma \in (0,1)$ is a hyper-parameter that adjusts the degree of shrinkage.
Therefore, to penalize the feature distance between an original sample $x^{lt}$ and its enhanced data $\tilde{x}^{lt}$, the consistency regularization loss is computed as:

$$L_{cr} = \sum_{x^{lt} \in D_{lt}} \big\| f(x^{lt}) - f(\tilde{x}^{lt}) \big\|^2 \quad (5)$$

where $f(x)$ denotes the feature vector produced by the twin neural network and $D_{lt}$ is the labeled data set from the target sensors.
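Continuing the sketch, the consistency term then reduces to a squared feature distance between the two branches of the shared-weight network (illustrative, not the patent's exact code):

```python
import torch

def consistency_loss(encoder, x_lt: torch.Tensor, x_lt_aug: torch.Tensor) -> torch.Tensor:
    # Eq. (5): original and enhanced target samples should share feature representations.
    return (encoder(x_lt) - encoder(x_lt_aug)).pow(2).sum(dim=1).mean()
```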
C. Facilitating few-shot learning through self-supervision
To further enable the inference model to learn the characteristics of the target data, we incorporate a self-supervised technique into few-shot learning, which utilizes a large amount of unlabeled target data for model training. To this end, we propose an auxiliary task suited to time-series data to build the self-supervised loss, and we design adaptive weights that adjust the importance of each training sample pair to further enhance this loss.
Self-supervised loss. To train the twin network with a self-supervised method, an auxiliary task based on unlabeled data must be designed. Although effective auxiliary tasks exist in computer vision and natural language processing (e.g., image rotation, warping, and cropping), they are not applicable to continuous time-series data. For example, the widely used image rotation task in computer vision assigns the rotated image the same label as the original. However, for activity segmentation, when a sequence with a start-state label is rotated by 180°, it is easily confused with a sequence in the end state. As shown in Fig. 1, data with a start-state label rotated by 180° has a shape very similar to data with an end-state label.
To this end, we propose an auxiliary task suited to time-series data, which constructs many positive and negative sample pairs from unlabeled target data to train the twin network. Here, a positive sample pair means both samples have the same state label, and a negative sample pair the opposite. We regard two consecutive windows of similar shape as a positive pair and two well-separated windows of different shape as a negative pair. Specifically, we discretize the action sequence into overlapping windows of fixed size w using a sliding window with step l. Two windows are considered a positive pair if they satisfy the following constraints: (i) the two windows are adjacent; (ii) they contain the same number of change points, and the difference sets of the two windows do not contain any change point. Accordingly, two windows are considered a negative pair if: (i) the windows are sufficiently far apart, i.e., separated in time by more than a given minimum distance (e.g., 2 × w); (ii) they contain different numbers of change points, i.e., one window contains one change point and the other contains none. Here, a change point is a time point at which the behavior in the action sequence changes abruptly; for a time series containing motion data, an activity transition can be regarded as a change point. Thus, two consecutive windows containing the same number of change points should share the same state label and are treated as a positive pair, while two windows with different numbers of change points should have different state labels and are treated as a negative pair. We use the density-ratio-based method SEP [S. Aminikhanghahi, T. Wang, and D. J. Cook, "Real-time change point detection with application to smart home time series data," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 5, pp. 1010-1023, 2019] to detect change points; it determines change points by comparing a probability metric and a change score with corresponding thresholds, achieving good performance. A sketch of this pair-construction procedure follows.
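The sketch assumes change-point indices for the whole sequence have already been detected (e.g., by SEP); the window size w, stride l, and the 2 × w separation rule follow the text, while all function names are illustrative:

```python
def count_points(change_points, start, end):
    """Number of detected change points falling inside [start, end)."""
    return sum(start <= p < end for p in change_points)

def build_pairs(seq_len, change_points, w=120, l=10):
    starts = list(range(0, seq_len - w + 1, l))
    positives, negatives = [], []
    for a, b in zip(starts, starts[1:]):  # adjacent, overlapping windows
        same_count = count_points(change_points, a, a + w) == count_points(change_points, b, b + w)
        # difference sets of the two windows: [a, b) and [a + w, b + w)
        diff_clean = (count_points(change_points, a, b) == 0
                      and count_points(change_points, a + w, b + w) == 0)
        if same_count and diff_clean:
            positives.append((a, b))
    for a in starts:
        for b in starts:
            if b - a > 2 * w:  # far enough apart in time
                ca = count_points(change_points, a, a + w)
                cb = count_points(change_points, b, b + w)
                if {ca, cb} == {0, 1}:  # one window has a change point, the other none
                    negatives.append((a, b))
    return positives, negatives
```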
Fig. 4 gives an example of positive and negative sample pairs, where the vertical dashed lines are the true start and end points of the activity. As shown in Fig. 4, the pair $(x_1^{ut}, x_2^{ut})$ is positive because the two windows are adjacent, each contains exactly one change point, and the two difference sets contain no change point. In this example, the vertical dashed lines are change points, since they are activity transition points. Accordingly, the pair $(x_3^{ut}, x_4^{ut})$ is negative because the two windows are far apart and contain different numbers of change points.
According to these rules, a large number of positive and negative pairs can be obtained from the unlabeled target data. To improve sample quality, we further eliminate pairs with low confidence. Specifically, a larger SEP score means a higher probability that a change point is present. For positive pairs, since the difference sets of a pair should contain no change point, we discard pairs whose difference sets have high SEP scores. To do this, we first compute the SEP scores of the difference sets of a pair and then filter the pairs according to these scores. A difference set is divided equally into two parts, $x_{t-1}$ and $x_t$, each of length s, and their density ratio is calculated as:

$$g(x) = \frac{f_{t-1}(x)}{f_t(x)} \quad (6)$$

where $f_{t-1}(x)$ and $f_t(x)$ are the estimated probability densities of the two parts, respectively. The SEP change-point score is then constructed as:

$$\mathrm{SEP} = \max\Big(0,\; 1 - \min_{x}\, g(x)\Big) \quad (7)$$
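One illustrative way to compute such a score from a difference set is sketched below; the histogram-based density estimate is an assumption, since the patent only refers to "probability estimation densities":

```python
import numpy as np

def sep_score(segment: np.ndarray, bins: int = 16) -> float:
    """Eqs. (6)-(7): split a segment into halves and score how separated their densities are."""
    half = len(segment) // 2
    lo, hi = float(segment.min()), float(segment.max()) + 1e-9
    f_prev, _ = np.histogram(segment[:half], bins=bins, range=(lo, hi), density=True)
    f_curr, _ = np.histogram(segment[half:], bins=bins, range=(lo, hi), density=True)
    mask = f_curr > 0
    if not mask.any():
        return 0.0
    g = f_prev[mask] / f_curr[mask]       # density ratio of Eq. (6)
    return max(0.0, 1.0 - float(g.min()))  # separation score of Eq. (7)
```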
In this way, the SEP values $s_1$ and $s_2$ of the two difference sets of a positive pair can be computed. Then, to ensure the quality of these training samples and avoid overfitting, we eliminate 10% of the positive pairs:

$$f_{\mathrm{drop}} = \begin{cases} 1, & \bar{s} > \varepsilon \\ 0, & \text{otherwise} \end{cases} \quad (8)$$

where $f_{\mathrm{drop}}$ indicates whether a pair is rejected, $\bar{s}$ is the average of the two scores, and $\varepsilon$ is a threshold determined by the ranking of the SEP values of all positive pairs and the culling rate.
For negative pairs, we expect one sample to contain one change point and the other to contain none. To meet this requirement, we filter out pairs whose change points are not sufficiently distinct. To this end, we design a dissimilarity score based on the SEP score to eliminate negative pairs with low confidence. Specifically, each sample of a negative pair is divided into h disjoint parts, and the SEP scores of all pairs of consecutive parts are calculated using equation (7). Let $s_j$ denote the SEP score of the j-th and (j+1)-th parts. The highest SEP score is then:

$$s^{\max} = \max_{j \in \{1, \ldots, h-1\}} s_j \quad (9)$$

Thus, each sample of a negative pair yields a maximum SEP score. Suppose $s_1^{\max}$ and $s_2^{\max}$ are the maximum SEP scores of the samples with one and zero change points, respectively. The dissimilarity score of the pair is calculated as:

$$d_s = s_1^{\max} - s_2^{\max} \quad (10)$$

Negative pairs with lower dissimilarity scores are removed; except that $d_s$ replaces $\bar{s}$ (and the pairs with the lowest scores are culled), formula (8) is still adopted as the filtering method.
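Both filters can then be sketched as simple rank-based culls at a fixed 10% rate (again illustrative; `diff_scores` holds the two difference-set SEP scores of each positive pair and `dissim_scores` the dissimilarity scores of the negative pairs):

```python
def filter_positive(pairs, diff_scores, drop_rate=0.10):
    """Eq. (8): drop the positive pairs whose difference sets have the highest mean SEP scores."""
    means = [0.5 * (s1 + s2) for s1, s2 in diff_scores]
    keep = sorted(range(len(pairs)), key=lambda k: means[k])[: int(len(pairs) * (1 - drop_rate))]
    return [pairs[k] for k in sorted(keep)]

def filter_negative(pairs, dissim_scores, drop_rate=0.10):
    """Eqs. (9)-(10): drop the negative pairs with the lowest dissimilarity scores."""
    keep = sorted(range(len(pairs)), key=lambda k: -dissim_scores[k])[: int(len(pairs) * (1 - drop_rate))]
    return [pairs[k] for k in sorted(keep)]
```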
After culling 10% of the low-confidence negative and positive pairs, the remaining pairs train the twin neural network with the following self-supervised loss:

$$L_{ss} = (1-y)\, d_e^2 + y\, \max(0,\; m - d_e)^2 \quad (11)$$

where $d_e$ is the distance between the feature vectors $f(x_1^{ut})$ and $f(x_2^{ut})$ of the input sample pair $(x_1^{ut}, x_2^{ut})$, i.e., $d_e = \|f(x_1^{ut}) - f(x_2^{ut})\|$; y is the label assigned to the pair, i.e., y = 0 if $x_1^{ut}$ and $x_2^{ut}$ have the same state label and y = 1 otherwise; and m is a hyper-parameter for the margin.
Adaptive weighting. After the sample pairs are screened, the remaining positive and negative pairs have high confidence. However, different pairs may provide different clues for learning the data representation. In general, a sample without any activity data contributes fewer clues to learning the data representation; accordingly, pairs containing activity data provide more clues and should play a more important role in model training. Fig. 4 shows an example of two sample pairs: the positive pair contains activity data while the negative pair does not, so the positive pair deserves more attention during model training.
Since the amplitude range of sensor data is much larger when activity occurs than when there is none, sample pairs with larger fluctuation ranges are likely to contain activity data and should be emphasized in model training. We use the amplitude variances of a sample pair to estimate its fluctuation amplitude:

$$V_{\mathrm{pair}} = V_1 + V_2 \quad (12)$$

where $V_1$ and $V_2$ are the amplitude variances of the samples $x_1^{ut}$ and $x_2^{ut}$ of the pair, respectively. $V_{\mathrm{pair}}$ is then used as a weight to adjust the importance of the pair during model training. Taking this weight into account, the self-supervised loss in equation (11) becomes:

$$L_{ss}^{w} = V_{\mathrm{pair}} \Big[ (1-y)\, d_e^2 + y\, \max(0,\; m - d_e)^2 \Big] \quad (13)$$
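Putting the self-supervised pieces together, a sketch of the weighted loss follows; note that the variance sum in Eq. (12) is a reconstruction, since the original formula survives only as an image:

```python
import torch
import torch.nn.functional as F

def weighted_self_supervised_loss(f1, f2, y, x1, x2, margin: float = 1.0) -> torch.Tensor:
    # Eq. (11): contrastive loss over unlabeled target pairs (y = 0 same state, y = 1 otherwise).
    d = torch.norm(f1 - f2, dim=1)
    base = (1 - y) * d.pow(2) + y * F.relu(margin - d).pow(2)
    # Eqs. (12)-(13): weight each pair by its amplitude variance, a proxy for activity content.
    v_pair = x1.var(dim=1) + x2.var(dim=1)
    return (v_pair * base).mean()
```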
finally, after combining the classification penalty in equation (3), the consistency penalty in equation (5), and the weighted unsupervised penalty in equation (13), the final penalty function is illustrated as follows: :
Figure BDA0003392651770000127
in this loss, the consistency loss is based on the labeled data of the target sensor, while the self-supervised contrast loss is based on the unlabeled data of the target sensor. Therefore, models trained with these losses can effectively capture the features of the target sensor data and apply to the target sensor. The model adopts an Adam algorithm with default hyper-parameters as an optimization method.
D. Activity segmentation
After obtaining the trained state inference model, we predict the state labels of a given action sequence from the target sensor: we compare the distance between the target sample's feature vector and the sample vectors of each class, and the target sample is assigned the class whose samples are closest. The activity is then segmented according to the inferred state labels. Specifically, the start and end points of an activity are detected as follows. First, the continuous action sequence (sensor data stream) is divided into overlapping windows using a sliding window, each window of length w, with a sliding step of 1. Second, the state label of each window is inferred using the state inference model. Finally, the start and end points of the activity are identified by observing changes in the mode of a set of window labels, where the mode is the most frequently occurring value in the set. In other words, if the mode changes from 1 (static state) to 2 (start state), the corresponding window is regarded as the start of an activity; if the mode changes from 4 (end state) to 1 (static state), the window is regarded as the end of an activity.
For a more intuitive illustration, Fig. 5 gives an example of detecting an activity start point by observing mode changes, where the length m of the window-label list used to compute the mode is set to 10. In the figure, there are 18 data points: points 1 to 13 are static data and points 14 to 18 are activity data. The data is first divided into 11 overlapping windows, each of length w = 8. Second, for each window, its state label is inferred using the trained state inference model: the state labels of w1 to w6 are the static state (1), and those of w7 to w11 are the start state (2). Finally, each window is traversed to detect the start and end points of the activity. When checking w10, the mode of the state-label list from w1 to w10 is 1. When checking w11, the index of the current data point is i = 18 and the mode from w2 to w11 becomes 2, meaning the mode changes from 1 (static state) to 2 (start state), which indicates an activity start point. When several values occur with the same frequency, the mode is set to the largest of them. The start point $t_{start}$ is set to $i - m/2 + 1$; thus, in this example, with i = 18 and m = 10, $t_{start}$ equals the actual start point 14. After human activities are segmented, the data can be used for activity classification.
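The following runnable sketch reproduces the worked example above; the window state labels are taken as given here, whereas in the full system they come from the twin-network state inference:

```python
from collections import Counter

def mode(labels):
    """Most frequent label; ties are broken by taking the largest value, as in the text."""
    counts = Counter(labels)
    best = max(counts.values())
    return max(v for v, c in counts.items() if c == best)

def detect_starts(window_labels, w=8, m=10):
    """A mode change from 1 (static) to 2 (start) marks an activity start at i - m/2 + 1."""
    starts, prev = [], None
    for k in range(m, len(window_labels) + 1):
        cur = mode(window_labels[k - m:k])
        i = k + w - 1  # index of the last (1-based) data point covered by window k
        if prev == 1 and cur == 2:
            starts.append(i - m // 2 + 1)
        prev = cur
    return starts

# Worked example: 11 windows of length w = 8; w1..w6 are static (1), w7..w11 start state (2).
print(detect_starts([1] * 6 + [2] * 5))  # -> [14], the actual activity start point
```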
To verify the effect of the present invention, the following experiments were performed:
we used four data sets and evaluated the effectiveness of SFTSeg based on different sensor devices, users and environments. In addition, the contribution of individual components and the impact of training data size on performance were also investigated.
A. Experimental data and settings
Experimental data. We perform experiments on four behavior recognition data sets collected from different types of sensors, including WiFi devices, smartphones, and RFID tags.
HandGesture: This data set includes twelve hand gesture activities performed by two experimental subjects and captured by an inertial measurement unit. The activities include opening and closing a window, drinking water, watering flowers, cutting, chopping, stirring, reading a book, tennis forehand, tennis backhand, and catching a ball, and the activities are continuous.
USC-HAD: This data set consists of twelve human activities recorded from 14 subjects using a 3-axis accelerometer and a 3-axis gyroscope. Each activity category is repeated five times and includes walking forward, walking left, walking right, going upstairs, going downstairs, running forward, jumping up, sitting down, standing, sleeping, riding the elevator up, and riding the elevator down. Since these data are discontinuous, activity sets are manually and randomly stitched together for segmentation.
RFID: This data set contains data from six people, each performing poses between a wall and an RFID antenna, with nine passive RFID tags placed on the wall. The RFID data is a non-continuous data set consisting of twelve poses for each of the six subjects, so the data is likewise manually stitched together for the experiments.
WiFiAction: This data set consists of ten activities performed by five people, collected with WiFi devices using the Channel State Information (CSI) collection tool [D. Halperin, W. Hu, A. Sheth, and D. Wetherall, "Tool release: Gathering 802.11n traces with channel state information," SIGCOMM Comput. Commun. Rev., vol. 41, no. 1, p. 53, 2011]. These activities are continuous and include 1500 samples covering 5 fine-grained activities (hand swipe, hand raise, push, draw O, and draw X) and 5 coarse-grained activities (boxing, picking up, running, squatting, and walking).
Since source and target data may be collected with different sensor devices, people, and environments, in the following experiments WiFiAction data is used as the source data when evaluating the other data sets, and HandGesture data is used as the source data when evaluating WiFiAction. Furthermore, by default we choose three labeled samples per category from the target data for model training, i.e., the following experiments are performed in a 3-shot setting. Of the target sensor data, 80% of the unlabeled data is used for model training and the remaining 20% serves as the test set.
Evaluation metrics. The proposed SFTSeg and the baseline models are evaluated using two metrics. (i) F1-score: the harmonic mean of precision and recall. A predicted segmentation point is considered a true positive when it lies within a specified time window of a real boundary, and a false positive when it falls outside the time windows of all real boundaries. According to the sensors' sampling rates, the specified time windows are set to 0.3 and 0.5 seconds for the WiFiAction and HandGesture data sets, respectively, and to 2 seconds for the other data sets. (ii) RMSE: the root mean square error computed from the deviation between the true and predicted boundary times; RMSE is normalized to [0, 1] by the duration of the time series.
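For concreteness, the boundary F1 under this tolerance rule might be computed as follows (a sketch; the greedy one-to-one matching is an assumption the patent does not spell out):

```python
def boundary_f1(pred, true, tol):
    """Match each true boundary to at most one prediction within +/- tol seconds."""
    matched, used = 0, set()
    for p in pred:
        for k, t in enumerate(true):
            if k not in used and abs(p - t) <= tol:
                matched += 1
                used.add(k)
                break
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(true) if true else 0.0
    return 2 * precision * recall / (precision + recall) if matched else 0.0
```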
Experimental details. The learning rate of the model is set to 0.001 and the mini-batch size to 60. Based on experimental experience, the shrinkage factor γ of h(x) in equation (4), the margin m in equation (11), and the window size w are set to 0.05, 1, and 120, respectively. The CNN architecture in the twin network is the same as in our previous work [C. Xiao, Y. Lei, Y. Ma, F. Zhou, and Z. Qin, "DeepSeg: Deep-learning-based activity segmentation framework for activity recognition using WiFi," IEEE Internet of Things Journal, vol. 8, no. 7, pp. 5669-5681, 2021].
B. Baseline methods
To demonstrate the effectiveness and superiority of SFTSeg, eight segmentation methods of different types are chosen as baselines, including the threshold-based WiAG and Wi-Multi, the CPD-based AR1seg, SEPseg, and IGTS, the temporal-shape-based FLOSS and ESPRESSO, and the supervised method DeepSeg.
WiAG: A typical threshold-based gesture extraction and segmentation method, which identifies the beginning and end of a gesture by comparing the amplitude of the principal components of the data stream with a given threshold.
Wi-Multi: An activity segmentation algorithm for complex multi-subject environments. It eliminates potential false detections by computing the maximum eigenvalues of the correlation matrices of the amplitude and calibrated phase, improving accuracy in noisy environments and/or scenarios with multiple subjects.
AR1seg: A change-point detection method typical of the statistics literature, which uses a first-order autoregressive process to infer change points [S. Chakar, E. Lebarbier, C. Lévy-Leduc, and S. Robin, "A robust approach for estimating change-points in the mean of an AR(1) process," Bernoulli, vol. 23, no. 2, pp. 1408-1447, 2017].
SEPseg: A method for detecting change points in time-series data. The algorithm has been used to efficiently identify activity boundaries and recognize daily human activities [S. Aminikhanghahi, T. Wang, and D. J. Cook, "Real-time change point detection with application to smart home time series data," IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 5, pp. 1010-1023, 2019].
IGTS: An information-gain-based segmentation method, which estimates activity boundaries by maximizing the information gain of segments using dynamic programming.
FLOSS: A shape-based segmentation method, which segments activity data based on the observation that similarly shaped patterns should belong to the same category and occur close to each other in time.
ESPRESSO: An entropy- and shape-aware time-series segmentation method, which exploits both the entropy and the temporal shape properties of a time series to segment multidimensional time series.
DeepSeg: An activity segmentation method based on supervised learning. The framework employs a CNN as the state inference model to predict the state labels of discretized data and then identifies activity boundaries from the state labels.
C. Performance of activity segmentation
Table 1 shows the segmentation performance of the different methods on the four data sets, with the best results highlighted in bold. From these results we make the following observations.
First, the proposed SFTSeg consistently outperforms the baseline segmentation methods on all four data sets. Specifically, SFTSeg improves the F1-score over the best baseline method, DeepSeg, by 2.45%, 5.82%, 8.23%, and 1.92% on the HandGesture, USC-HAD, RFID, and WiFiAction data sets, respectively. These results show that SFTSeg can capture the features of the target data through the proposed consistency regularization and self-supervised losses, and can thus segment activities in the target data accurately from only a few labeled samples.
Second, the supervised method DeepSeg shows no significant advantage over the unsupervised methods, especially on RFID data. The main reason is that DeepSeg is designed for scenarios where the source and target data share the same distribution; with only a few labeled target samples, its competitiveness drops. This also explains why most work uses unsupervised methods to segment activities when labeled data is lacking. SFTSeg addresses the problem of limited labeled target data and achieves better performance than both the supervised and unsupervised approaches.
Third, among the unsupervised baselines, different data sets require different methods to obtain better segmentation results. For example, IGTS outperforms the other unsupervised methods on RFID data, but its segmentation results on HandGesture data are significantly worse than ESPRESSO's. These results support our claim that unsupervised segmentation methods are often affected by environment-related problems. In contrast, the proposed SFTSeg consistently achieves better performance across all data sets.
Table 1: segmentation performance comparison
D. Ablation experiment
Here we study the contributions of the basic components designed for SFTSeg, i.e., the consistency regularization loss, the self-supervised loss, and the adaptive weighting. We investigate the effects of the different components as follows: (i) SFTSeg-Base is the basic twin (Siamese) network model, optimizing the classification loss on labeled source data given in equation (3). (ii) SFTSeg-Consis is the twin model with our consistency regularization loss, as in equation (5). (iii) SFTSeg-Self is the twin network model with the self-supervised loss but without the adaptive weights, as in equation (11). (iv) SFTSeg-Weight is the twin network model with the self-supervised loss and adaptive weights, as in equation (13). (v) SFTSeg-Full is the proposed model containing all components. The segmentation results on the four data sets are shown in Table 2, with the best results highlighted in bold. From this table we observe the following. First, SFTSeg-Full achieves the best performance while SFTSeg-Base performs the worst, indicating that the main components we design greatly improve segmentation performance. Second, when combined with consistency regularization, SFTSeg-Consis achieves better results than SFTSeg-Base; this is because our enhancement method augments the limited labeled samples from the target domain, helping the model improve its generalization to the target domain. Third, SFTSeg-Self and SFTSeg-Weight outperform SFTSeg-Base by an even larger margin. This result verifies the primary motivation for designing SFTSeg: the self-supervised loss based on unlabeled target data enables the model to capture the features of the target domain and further improves segmentation performance.
Table 2: performance when considering different components
E. Role of target data size
SFTSeg attempts to mitigate the large shift between the source and target domains using target data. Here we therefore discuss the effect of target data size on segmentation performance. Specifically, we examine the results for 1-shot, 3-shot, and 5-shot settings as the amount of unlabeled target data changes (n in n-shot refers to the number of labeled samples per action category).
FIG. 6 shows the F1-score results when selecting different proportions of unlabeled target data, where (a) is the result for HandGesture, (b) for USC-HAD, (c) for RFID, and (d) for WiFiAction. RMSE results are omitted because they show the same trend. Fig. 6 shows that the segmentation performance of SFTSeg gradually improves as the amount of unlabeled data increases on all four data sets. This indicates that the amount of unlabeled data plays an important role in segmentation performance and that the self-supervised task we design can effectively use unlabeled target data to enhance model performance. In addition, the 5-shot performance is clearly better than the 1-shot performance. The reason is that more labeled target samples not only benefit consistency regularization in the training stage but also improve the distance computation between test samples and labeled target samples in the testing stage. Overall, these results indicate that SFTSeg can efficiently exploit both labeled and unlabeled target data to improve segmentation performance.
In summary, the invention proposes a self-supervised few-shot action sequence segmentation framework, SFTSeg, to segment activities in action sequence data. Whereas traditional action segmentation methods usually target the same sensor, the invention uses source sensor data to enhance segmentation accuracy on target sensor data and achieves good activity segmentation and recognition with only a few labeled target sensor samples. A twin neural network is adopted as the main framework for few-shot learning, realizing a few-shot activity segmentation technique. For the three kinds of input data, different loss functions are designed to enhance training: for the labeled samples of the source sensor, a cross-entropy loss forces input samples into their corresponding categories; to enhance generalization on target sensor data, consistency regularization is introduced, in which labeled source samples are shrunk into perturbations and injected into labeled target samples as enhanced data, and training on the enhanced data improves the model's generalization capability; to address the large difference between the source and target data distributions, self-supervised learning is introduced, and positive and negative sample pairs constructed from unlabeled target samples train the twin neural network so that it captures the characteristics of the target data and improves inference performance.
The invention addresses the environment dependence and designer subjectivity of unsupervised methods (such as change-point-based and threshold-based detection) in the activity segmentation task, achieving good activity segmentation across different sensors and scenarios. It also addresses the supervised methods' need for large amounts of labeled target data (costly and constrained by many conditions), achieving good activity segmentation with only a few labeled target sensor samples.
The above are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements shall also fall within the protection scope of the present invention.

Claims (4)

1. An action sequence segmentation method for behavior recognition based on self-supervised few-shot learning, characterized by comprising the following steps:
step 1: constructing an automatic supervision small sample action sequence segmentation framework SFTSeg; the framework is based on a twin neural network, and takes a marked sample of a source sensor, a marked sample of a target sensor and an unmarked sample of the target sensor as input data; the marking sample of the source sensor and the marking sample of the target sensor correspond to four state labels which are respectively in a static state, a starting state, a motion state and an ending state; the samples refer to a sequence of actions derived from sensor data;
step 2: constructing a cross entropy loss function for the labeled sample of the source sensor to carry out twin neural network training;
and step 3: for the labeled sample of the target sensor, taking the labeled sample of the source sensor as disturbance, injecting the disturbance into the labeled sample of the target sensor as enhancement data, and constructing a consistency regularization loss function to perform twin neural network training;
and 4, step 4: constructing a positive sample pair and a negative sample pair based on the unlabeled sample of the target sensor, and constructing an auto-supervision loss function based on the positive sample pair and the negative sample pair to train the twin neural network so that the twin neural network can capture the characteristics of the unlabeled sample of the target sensor;
and 5: and (3) obtaining the trained SFTSeg through the steps 1-4, inputting the sample of the target sensor serving as the test sample into the trained SFTSeg, predicting the state label of the test sample by the trained SFTSeg, and then performing activity segmentation on the test sample according to the predicted state label.
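As an illustration of step 5, the short sketch below converts a sequence of predicted per-window state labels into activity segments. It is a hypothetical post-processing step; the integer encoding (0 static, 1 start, 2 motion, 3 end) is an assumption for the example.

def labels_to_segments(states):
    """Turn a per-window state-label sequence into (start, end) activity segments."""
    segments, open_at = [], None
    for i, s in enumerate(states):
        if s == 1 and open_at is None:        # a start state opens a segment
            open_at = i
        elif s == 3 and open_at is not None:  # an end state closes it
            segments.append((open_at, i))
            open_at = None
    return segments

# Example: two activities embedded in static background.
print(labels_to_segments([0, 0, 1, 2, 2, 3, 0, 1, 2, 3, 0]))  # [(2, 5), (7, 9)]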
2. The action sequence segmentation method for behavior recognition based on self-supervised few-sample learning according to claim 1, wherein step 3 comprises:
constructing the enhanced data according to the following rules:
A. the compressed labeled sample of the source sensor used as the perturbation has the same class as the labeled sample of the target sensor;
B. the compressed labeled sample of the source sensor is added to the labeled sample of the target sensor along a warping path, the warping path being generated by a dynamic time warping algorithm.
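The sketch below illustrates the enhancement rule of claim 2 on 1-D sequences. The claim only fixes "compress the same-class source sample, then add it along the DTW warping path"; the compression factor and the averaging of many-to-one alignments are assumptions.

import numpy as np

def dtw_path(a, b):
    """Classic dynamic time warping on two 1-D sequences; returns the warp path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(a[i - 1] - b[j - 1]) + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        i, j = [(i - 1, j - 1), (i - 1, j), (i, j - 1)][step]
    return path[::-1]

def enhance(target, source, compress=0.3):
    """Inject the compressed same-class source sample into the target sample."""
    pert = compress * np.asarray(source, dtype=float)   # rule A: compression
    out = np.asarray(target, dtype=float).copy()
    aligned = {}                                        # rule B: add along warp path
    for i, j in dtw_path(out, pert):
        aligned.setdefault(i, []).append(pert[j])
    for i, vals in aligned.items():
        out[i] += np.mean(vals)   # average when one target step aligns to several
    return out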
3. The action sequence segmentation method for behavior recognition based on self-supervised few-sample learning according to claim 1, wherein step 4 comprises:
discretizing the action sequence into overlapping windows of fixed size w by means of a sliding window with sliding step length l;
regarding two windows as a positive sample pair if they satisfy the following constraints: the two windows are adjacent; the two windows contain the same number of change points; and the set difference of the two windows contains no change point;
regarding two windows as a negative sample pair if they satisfy the following constraints: the two windows are separated in time by more than a given minimum distance; the two windows contain different numbers of change points; wherein a change point is a time point at which the behavior of the action sequence changes abruptly.
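The following sketch enumerates window pairs under the rules of claim 3, given a list of candidate change points (which, in practice, would come from a preliminary detector, since the target data are unlabeled). The window size, stride, minimum separation, and the "or" combination of the two negative-pair constraints are assumptions.

from itertools import combinations

def make_pairs(change_points, seq_len, w=50, l=10, min_gap=200):
    starts = list(range(0, seq_len - w + 1, l))
    def n_cp(s):                       # change points inside window [s, s + w)
        return sum(s <= c < s + w for c in change_points)
    pos, neg = [], []
    for a, b in zip(starts, starts[1:]):          # adjacent window pairs
        # set difference of windows [a, a+w) and [b, b+w): [a, b) and [a+w, b+w)
        diff_has_cp = any(a <= c < b or a + w <= c < b + w for c in change_points)
        if n_cp(a) == n_cp(b) and not diff_has_cp:
            pos.append((a, b))
    for a, b in combinations(starts, 2):
        # the claim lists two constraints; an "or" combination is assumed here
        if b - a > min_gap or n_cp(a) != n_cp(b):
            neg.append((a, b))
    return pos, neg

pos, neg = make_pairs(change_points=[120, 480], seq_len=1000)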
4. The action sequence segmentation method for behavior recognition based on self-supervised few-sample learning according to claim 3, wherein step 4 further comprises:
for each positive sample pair, first calculating the SEP score of the difference set of the pair, and then filtering the positive sample pairs according to the SEP scores;
for each negative sample pair, dividing each sample of the pair into h disjoint parts, calculating the SEP scores of all pairs of consecutive parts, and taking the highest SEP score of each sample; then calculating the dissimilarity score of the negative sample pair from the highest SEP scores of its two samples; and eliminating negative sample pairs with lower dissimilarity scores.
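Since the SEP score itself is not defined in the claims, the sketch below uses a stand-in separability measure to show the shape of the filtering logic in claim 4. The function sep_score, both thresholds, and the way the dissimilarity score combines the two highest SEP scores are all assumptions.

import numpy as np

def sep_score(x, y):
    """Stand-in separability measure between two signal chunks (assumption)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return abs(x.mean() - y.mean()) / (x.std() + y.std() + 1e-8)

def keep_positive(diff_left, diff_right, tau_pos=0.5):
    # one reading of claim 4: keep a positive pair when its difference set
    # looks homogeneous, i.e. its two halves separate poorly
    return sep_score(diff_left, diff_right) < tau_pos

def highest_sep(sample, h=4):
    parts = np.array_split(np.asarray(sample, dtype=float), h)
    return max(sep_score(parts[k], parts[k + 1]) for k in range(h - 1))

def keep_negative(sample_a, sample_b, tau_neg=0.3, h=4):
    # dissimilarity score from the two highest SEP scores (combination assumed)
    dissim = abs(highest_sep(sample_a, h) - highest_sep(sample_b, h))
    return dissim >= tau_neg   # eliminate pairs with lower dissimilarity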
CN202111471435.0A 2021-12-04 2021-12-04 Action sequence segmentation method aiming at behavior recognition and based on self-supervision less sample learning Active CN114118167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111471435.0A CN114118167B (en) 2021-12-04 2021-12-04 Action sequence segmentation method aiming at behavior recognition and based on self-supervision less sample learning

Publications (2)

Publication Number Publication Date
CN114118167A true CN114118167A (en) 2022-03-01
CN114118167B CN114118167B (en) 2024-02-27

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116415152A (en) * 2023-04-21 2023-07-11 河南大学 Diffusion model-based self-supervision contrast learning method for human motion recognition

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222690A (en) * 2019-04-29 2019-09-10 浙江大学 A kind of unsupervised domain adaptation semantic segmentation method multiplying loss based on maximum two
CN110837836A (en) * 2019-11-05 2020-02-25 中国科学技术大学 Semi-supervised semantic segmentation method based on maximized confidence
US20200082221A1 (en) * 2018-09-06 2020-03-12 Nec Laboratories America, Inc. Domain adaptation for instance detection and segmentation
CN111914709A (en) * 2020-07-23 2020-11-10 河南大学 Action segmentation framework construction method based on deep learning and aiming at WiFi signal behavior recognition
WO2021057427A1 (en) * 2019-09-25 2021-04-01 西安交通大学 Pu learning based cross-regional enterprise tax evasion recognition method and system
CN112861758A (en) * 2021-02-24 2021-05-28 中国矿业大学(北京) Behavior identification method based on weak supervised learning video segmentation
US20210174093A1 (en) * 2019-12-06 2021-06-10 Baidu Usa Llc Video action segmentation by mixed temporal domain adaption
CN113408328A (en) * 2020-03-16 2021-09-17 哈尔滨工业大学(威海) Gesture segmentation and recognition algorithm based on millimeter wave radar

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAO Minghai; HUANG Zhancong: "Research on semi-supervised domain adaptation methods based on active learning", High Technology Letters (高技术通讯), no. 08, 15 August 2020 (2020-08-15) *
WANG Bowei; PAN Zongxu; HU Yuxin; MA Wen: "SAR target recognition based on Siamese CNN with few samples", Radar Science and Technology (雷达科学与技术), no. 06, 15 December 2019 (2019-12-15) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant