CN113220960A

CN113220960A - Unbalanced time series data classification method based on autonomous learning

Info

Publication number: CN113220960A
Application number: CN202110515698.0A
Authority: CN
Inventors: 王晓峰; 胡姣姣; 郭小红; 习英卓; 周轩; 冯冰清
Original assignee: Xian University of Technology; China Xian Satellite Control Center
Current assignee: Xian University of Technology; China Xian Satellite Control Center
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2021-08-06

Abstract

The invention discloses an unbalanced time series data classification method based on autonomous learning, which specifically comprises the following steps: step 1, processing the unbalanced time sequence data to construct a new sample; step 2, sequentially carrying out scale transformation and data segmentation on the new sample constructed in the step 1; step 3, constructing a deep convolutional neural network model based on the result obtained in the step 2; and 4, training the neural network model constructed in the step 3, and establishing an optimal time series data classification model according to the training result to perform time series classification. The method solves the problem that the detection precision of a minority class is seriously reduced due to the fact that a general learner is absolutely biased to the majority class, and the classification precision of the unbalanced time sequence data set is remarkably improved.

Description

Unbalanced time series data classification method based on autonomous learning

Technical Field

The invention belongs to the technical field of time series data classification, and relates to an unbalanced time series data classification method based on autonomous learning.

Background

The time sequence refers to data arranged according to the time sequence, and the data can directly reflect the state or degree of a certain object or phenomenon changing along with the time; time series data mining is to extract useful information related to time attributes, which is unknown in advance, from a large amount of time series data, and to guide activities of people such as society, economy, and life. In the field of aerospace measurement and control, a large amount of telemetering data are presented in a time series mode, the engineering data can directly reflect the operation state of an aircraft, and the data are classified and information and rules contained in the data are mined out, so that the research on equipment fault diagnosis technology is very important. Therefore, the time series data classification problem becomes an important research topic in engineering and academia.

The unbalanced time series data refers to a data set with a small number of samples far smaller than that of a large number of samples, for example, in aerospace measurement and control engineering, most of measured time series data are within a normal range, and only few abnormal values exist, so that the unbalanced time series data set is a typical unbalanced time series data set. In the binary classification problem, the detection accuracy and performance of the classifier are seriously reduced due to the imbalance of the data distribution, so that the result of a general classifier is seriously biased to a normal class, and the false detection rate of an abnormal class is very high. In practical application, a few classes are the focus of attention, and if the 'fault' is misdiagnosed as 'normal' and the faulty system continues to work, unpredictable consequences and loss can be caused.

The time series data classification is an important branch of time series data mining, the problem is different from other data classifications, signal values of each time point of the time series data do not exist independently, and the whole time series data is used as one input in the processing.

Disclosure of Invention

The invention aims to provide an unbalanced time series data classification method based on autonomous learning, which solves the problem that the detection precision of a minority class is seriously reduced due to the fact that a general learner is absolutely biased to the majority class, and remarkably improves the classification precision of an unbalanced time series data set.

The technical scheme adopted by the invention is that the method for classifying the unbalanced time series data based on the autonomous learning specifically comprises the following steps:

step 1, processing the unbalanced time sequence data to construct a new sample;

step 2, sequentially carrying out scale transformation and data segmentation on the new sample constructed in the step 1;

step 3, constructing a deep convolutional neural network model based on the result obtained in the step 2;

and 4, training the neural network model constructed in the step 3, and establishing an optimal time series data classification model according to the training result to perform time series classification.

The invention is also characterized in that:

the specific process of the step 1 is as follows:

step 1.1, let the data set be denoted as Q { Q }_j(m_j,n_j) J-1, 2, …, u, where m_jDenotes the time of the jth sample, n_jSignal value representing the jth sample, u representing the total number of data in the data set; in order to ensure that the distribution state of the data set is unchanged after unbalanced data processing, points in the data set are defined as the following 3 types: aggregation points, critical points, isolated points;

and step 1.2, generating a new sample according to the data set obtained in the step 1.1.

The specific process of step 1.1 is as follows:

in order to maintain the distribution state of the data set, a fuzzy clustering algorithm is adopted to carry out on the data set Q { Q }_j(m_j,n_j) 1,2, …, u clustering, dividing samples in the data set into 3 subsets: set of isolated pointsQ₁{q_1j(m_1j,n_1j)}，j＝1,2,…,u₁Critical point set Q₂{q_2j(m_2j,n_2j)}，j＝1,2,…,u₂And aggregation point set Q₃{q_3j(m_3j,n_3j)}，j＝1,2,…,u₃Wherein u is₁Denotes the number of outliers, u₂Representing the number of critical points, u₃Indicating the number of focal points, u₁+u₂+u₃U, the clustering centers of the isolated point set, the critical point set and the aggregation point set obtained by the clustering algorithm are respectively: r₁(m′₁,n′₁)、R₂(m′₂,n′₂)、R₃(m′₃,n′₃)。

The specific process of the step 1.2 is as follows:

step 1.2.1, order

Set of presentation points

Middle j₁Sample point to cluster center R₁(m′₁,n′₁) The distance of (a) to (b),

set of presentation points

Middle j₂Sample point to cluster center R₂(m′₂,n′₂) The distance of (a) to (b),

set of presentation points

Middle j₃Sample point to cluster center R₃(m′₃,n′₃) A distance of (1) to

Step 1.2.2, for point sets

A certain sample point q (m, n), the sample point q (m, n) to the point set

Cluster center R of₁(m′₁,n′₁) Is denoted as a, a ═ n-n'₁L, search for all sample points of the following equation (2):

and sequencing according to the morning and evening sequence of the time components, and recording the result as:

q₁₁(m₁₁,n₁₁),q₁₂(m₁₂,n₁₂),…,q_1g(m_1g,n_1g) (3)；

wherein g represents a set of points

The number of sample points satisfying equation (2).

At samples q (m, n) and q₁₁(m₁₁,n₁₁)、q₁₂(m₁₂,n₁₂)、…、q_1g(m_1g,n_1g) Respectively carrying out random linear interpolation between the signal component values to construct the signal component value of a new sample

As shown in the following equation (4):

wherein rand (0,1) represents a random number within the interval (0, 1);

constructing time component values of new samples

As shown in the following equation (5):

wherein m is_1hH is 1,2, …, g is sample q₁₁(m₁₁,n₁₁)、q₁₂(m₁₂,n₁₂)、…、q_1g(m_1g,n_1g) To finally obtain a newly generated sample as

Step 1.2.3, repeatedly executing step 1.2.2 until the point set is traversed

All sample points in (a);

step 1.2.4, respectively aligning point sets

And

performing as a set of points

The step 1.2.2-1.2.4, respectively obtaining a set of points

And

a new sample is generated;

step 1.2.5, and the new sample obtained in step 1.2.3 and the step1.2.4 the new samples obtained are merged into the data set Q { Q } in step 1.1_j(m_j,n_j) J-1, 2, …, u, a new data set can be generated

U represents the total amount of data in the newly generated data set after unbalanced data processing.

The specific process of the step 2 is as follows:

step 2.1, scale transformation;

for data sets

Wherein m is_jTime stamp representing the jth sample, n_jSignal value representing the jth sample, U representing the total number of data in the data set;

is provided with

Representing the scaled signal value of the jth sample, and

wherein,

step 2.2, data segmentation;

dividing data into fixed-size segments, adopting a sliding window of overlapped segments, namely, the window length of a window function w is T, moving by a fixed step length T to divide the sequence into equally-spaced time sequence segments, expressing a set of segmented time sequence segments by L, and L_iRepresenting the ith time sequence segment after segmentation, U is the total amount of data in the data set,

for the number of segments after segmentation, then

The range of each segment is:

the specific process of the step 3 is as follows:

constructing a deep convolutional neural network model, wherein the model comprises an input layer, 4 hidden layers, 1 fully-connected layer, a multi-layer perceptron and a classifier softmax;

the hidden layer comprises a convolutional layer C1, a pooling layer S2, a convolutional layer C3 and a pooling layer S4;

an input layer: time series data fragment { l with length of T obtained after scale transformation and time slicing processing_i}，

Inputting into a network model;

the deep convolutional neural network finally uses a softmax classifier to carry out logistic regression, and the probability value P of the output signal belonging to the class 1 or 2_r：

Here, the category 1 indicates a normal value, and the category 2 indicates an abnormal value.

The specific process of the step 4 is as follows:

training the data set by using the convolutional neural network model obtained in the step 3, outputting the probability that each time slice belongs to each category, and using the cross entropy as a cost function, as shown in the following formula (9):

H＝-∑y_klog p_k (9)；

wherein, y_kIndicates the desired tag type, p_kIs the actual output;

and performing error minimization training by taking an adaptive learning rate optimization algorithm Adam Optimizer as a back propagation training algorithm to obtain an optimal weight parameter, and establishing an optimal time series data classification model according to the optimal weight parameter to perform time series classification.

The invention has the following beneficial effects:

1. the invention provides an unbalanced time series data classification method based on autonomous learning aiming at unbalanced time series data from the data driving perspective, which comprises two stages of unbalanced time series data processing and time series data classification.

2. In the unbalanced time sequence data processing stage, a sampling method is adopted to divide a few types of samples into three types of aggregation points, critical points and isolated points, and then the time stamps and signal values are interpolated in each type.

3. In the time series data classification stage, a deep convolutional neural network model with 4 hidden layers is constructed, and feature extraction and classification are realized by utilizing the autonomous feature mapping capability of the convolutional neural network.

4. The method solves the problem that the detection precision of a minority class is seriously reduced because a general learner is absolutely biased to the majority class, and remarkably improves the classification precision of the unbalanced time sequence data set.

Drawings

FIG. 1 is a data generation process in an unbalanced time series data classification method based on autonomous learning according to the present invention;

FIG. 2 is a deep convolutional neural network model constructed in the unbalanced time series data classification method based on autonomous learning according to the present invention;

3(a) and 3(b) are the classification performance of the convolutional neural network under different hidden layer structures in the classification method of the unbalanced time series data based on the autonomous learning according to the present invention;

fig. 4(a) and 4(b) illustrate the classification performance of a convolutional neural network trained by using an original data set and an unbalanced data set after processing according to an unbalanced time series data classification method based on autonomous learning.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention discloses an unbalanced time series data classification method based on autonomous learning, which comprises the following specific steps:

step 1, processing unbalanced time series data;

and 1.1, processing a few kinds of data sets in the training data by adopting a sampling method. Let the data set be denoted as Q { Q }_j(m_j,n_j) J ═ 1,2, …, u), where m_jDenotes the time of the jth sample, n_jRepresents the signal value of the j-th sample and u represents the total amount of data in the data set. In order to ensure that the distribution state of the data set is unchanged after unbalanced data processing, points in the data set are defined as the following 3 types:

aggregation points: in the point set distribution, points that are distributed in the center of the point set and exhibit an aggregation state are distributed.

Critical point: in the point set distribution, points scattered at the edge of the range where the aggregation points are aggregated and limiting the range of the distribution of the aggregation points are called critical points.

Isolated points: in the point set distribution, the points are scattered at positions far from the aggregation range of the aggregation point, are located outside the edge formed by the critical points, and are in an isolated state.

Fig. 1 illustrates the distribution of 3 points.

In order to maintain the distribution state of the data set, a fuzzy clustering algorithm is adopted to carry out on the data set Q { Q }_j(m_j,n_j) Clustering (j ═ 1,2, …, u), samples in the dataset are divided into 3 subsets: set of isolated points Q₁{q_1j(m_1j,n_1j)}(j＝1,2,…,u₁) Critical point set Q₂{q_2j(m_2j,n_2j)}(j＝1,2,…,u₂) And aggregation point set Q₃{q_3j(m_3j,n_3j)}(j＝1,2,…,u₃) Wherein u is₁Denotes the number of outliers, u₂Representing the number of critical points, u₃Indicating the number of focal points, u₁+u₂+u₃U, the clustering centers of the isolated point set, the critical point set and the aggregation point set obtained by the clustering algorithm are respectively: r₁(m′₁,n′₁)、R₂(m′₂,n′₂)、R₃(m′₃,n′₃)。

Step 1.2, generating a new sample;

order to

Set of presentation points

set of presentation points

set of presentation points

Middle j₃Sample point to cluster center R₃(m′₃,n′₃) The distance of (c). Then

For point sets

All of them are the sameThis point q (m, n), this sample point-to-point set

Cluster center R of₁(m′₁,n′₁) Is denoted as a, a ═ n-n'₁Searching for all sample points satisfying equation (2):

q₁₁(m₁₁,n₁₁),q₁₂(m₁₂,n₁₂),…,q_1g(m_1g,n_1g) (3)；

wherein g represents a set of points

The number of sample points satisfying the expression (2).

At samples q (m, n) and q₁₁(m₁₁,n₁₁),q₁₂(m₁₂,n₁₂),…,q_1g(m_1g,n_1g) Respectively carrying out random linear interpolation between the signal component values to construct the signal component value of a new sample

Where rand (0,1) represents a random number within the interval (0, 1).

Constructing time component values of new samples

Wherein m is_1h(h-1, 2, …, g) is sample q₁₁(m₁₁,n₁₁),q₁₂(m₁₂,n₁₂),…,q_1g(m_1g,n_1g) To finally obtain a newly generated sample as

Point set

Until all sample points are traversed.

Point set

And

respectively repeating the point set

All procedures for producing a new sample (Point set)

Specific process and point set for generating new sample

The process of generating new samples is the same), all new generated samples are obtained, and the new sample points are combined into the original data set to generate a new data set

Step 2, preprocessing data;

step 2.1, scale transformation;

for dataCollection

Wherein m is_jTime stamp representing the jth sample, n_jRepresents the signal value of the j-th sample and U represents the total amount of data in the data set. The inconsistency of the data dimension affects the speed of the network learning, and in order to avoid the influence, the signal value needs to be subjected to scale transformation so as to realize dimension consistency. Is provided with

Representing the scaled signal value of the jth sample, and

wherein,

step 2.2, data segmentation;

time series data are mostly long sequences with time stamps, and signal values have a timing dependency. In order to enable the network model to learn this feature of time series data, thereby preserving the time-series dependency of the time series data, we divide the data into fixed-size segments. A sliding window of overlapping segments is used, i.e. the window function w has a window length T, and the sequence is divided into equally spaced time sequence segments with a fixed step T shift. The set of time-series fragments after segmentation is denoted by L, L_iRepresenting the ith time sequence segment after segmentation, U is the total amount of data in the data set,

the number of segments after segmentation. Then

The range of each segment is:

step 3, a deep convolution neural network model;

the method constructs a deep convolutional neural network model which comprises an input layer, 4 hidden layers, 1 fully-connected layer and a multi-layer perceptron, and softmax is used as a classifier. Model structure as shown in fig. 2, the hidden layer is used for feature extraction, and includes convolutional layer C1, pooling layer S2, convolutional layer C3 and pooling layer S4, two important operations of convolution and pooling, and the softmax classifier is mainly used for time series classification. The working process of the network model is described in detail below.

An input layer: time series data segments with the length of T and obtained after scale transformation and time slicing processing

Input into the network model.

We will describe the working process of the hidden layer by taking any time sequence segment l as an example.

Layer C1: the method of the invention uses a gaussian convolution kernel:

where σ denotes the convolution width, the radial range of action of the control function, and in this method we have found by experiment that σ is 0.1 optimal.

Let C1 layer have v₁Each size is n₁Of the convolution kernel

V is generated through convolution with C1 layers₁Length of c₁Feature vector of

c₁＝t-n₁+1

Wherein,

representing a feature vector, c₁Representing feature vectors

Length of (v)₁The number of the feature vectors is represented,

represents the bias of the C1 layer, conv (-) represents the convolution function, ReLU (-) represents the activation function.

Layer S2: assume that the S2 level has a step size of l₂Size a₂Pooling of windows, then feature vectors

V is generated after S2 layers₁Size of c₂Feature vector of

c₂＝(t-n₁+1-a₂)/t₂+1

Wherein,

representing a feature vector, c₂Representing feature vectors

Length of (v)₁The number of the feature vectors is represented,

represents the sharing weight of the S2 level,

represents the bias of the S2 layer, D (-) represents the downsampling function, ReLU (-) represents the activation function.

Layer C3: assume that the C3 layer has v₃Each size is n₃Of the convolution kernel

The feature vector W obtained at the S2 level_s ²V is generated by C3 layer convolution₃Size of c₃Feature vector of

c₃＝(t-n₁+1-a₂)/l₂-n₃+2

Wherein,

representing a feature vector, c₃Representing feature vectors

Length of (v)₃The number of the feature vectors is represented,

represents the bias of the Con3 layer, conv (-) represents the convolution function, and ReLU (-) represents the activation function.

Layer S4: assume that the S4 level has a step size of l₄Size a₄The pooling window of (1), then the feature vector obtained by the C3 layer

After passing through S4 layer, v is generated₃Size of c₄Feature vector of

c₄＝(t-n₁+1-a₂-n₃l₂+2l₂-a₄l₂)/l₂l₄+1

Wherein,

representing a feature vector, c₃Representing feature vectors

Length of (v)₃The number of the feature vectors is represented,

represents the sharing weight of the S4 level,

represents the bias of the S4 layer, D (-) represents the downsampling function, ReLU (-) represents the activation function.

Rasterization: finally will be

Sequentially generating a one-dimensional vector

Length c₅As shown in equation (13).

MP5 layer: the MP5 layer is a multi-layer perceptron that maps one set of vectors to another. We use here a three-layer sensor: one input layer, one hidden layer, and one output layer. Features after rasterization

Inputting the MP5 layer, performing feature mapping in hidden layer with its neuron number of o and v₅(o＝1,2,...,v₅) For the binary problem, the number of neurons in the output layer is 2 (i.e., equation (15); where r is 1 and 2).

Wherein,

represents the weight of the hidden layer in the MLP,

represents the bias of the hidden layer in MLP and tanh (-) represents the tanh activation function.

Wherein

Represents the weight of the output layer in the MLP,

indicating the bias of the output layer in MLP.

And (3) network output: the convolutional neural network finally uses a softmax classifier to carry out logistic regression, and the probability value P of the output signal belonging to the category 1 (normal value) or 2 (abnormal value)_rWhere r is 1, 2.

Step 4, classifying;

training the data set using a trained convolutional neural network model, outputting the probability that each time slice belongs to each class, using cross entropy (cross) as a cost function (see equation (17)):

H＝-∑y_klog p_k (17)

y_kindicates the desired tag type, p_kIs the actual output.

And performing error minimization training by taking an adaptive learning rate optimization algorithm AdamaOptizer as a back propagation training algorithm to obtain an optimal weight parameter, and establishing an optimal time series data classification model for time series classification.

Examples

An experiment platform: the deep learning platform adopted in the experiment is tensiorflow1.3.0, the interface is python3.5, and the computer hardware is configured to be an i7 processor, an 8GB installation memory and a 64-bit operating system.

Data set: and taking the rotating speed data and the temperature data of certain equipment in the actual engineering as experimental data.

Data set 1: rotational speed data for a device. The training data set contains 140281 signal values, of which there are 35707 outlier data values; in the test data set, the balanced data set a1 contains 5312 signal values, where there are 2656 abnormal data; the unbalanced data set B1 contains 1087 signal values, 170 of which are anomalous data.

Data set 2: temperature data for a device. The training data set contains 50001 signal values, with 3901 anomalous data values; in the test data set, the equilibrium data set a2 contained 9615 signal values, with 4807 anomalous data; the unbalanced data set B2 contains 9158 signal values, of which there are 2313 outliers.

In the experiment, when supervised training is carried out, the label of the normal value is marked as 1, and the label of the abnormal value is marked as 0.

1. Setting the number of hidden layers;

in order to establish an optimal convolutional neural network structure, the classification performance of convolutional neural network models of different hidden layers on an experimental data set is explored through experiments.

Firstly, processing a data set 1 and a data set 2 by using the unbalanced time series data processing algorithm described in the step 1, secondly, carrying out scale transformation and time slicing operation on the processed data sets, then sending the processed data sets into convolutional neural network models of different hidden layers for training, and then testing on a data set a to obtain the identification precision and the training loss value of the network models of different hidden layers.

Table 1 and table 2 are specific parameter settings of the hidden layer when the network model is trained using data set 1 and data set 2, respectively.

TABLE 1 parameter settings when training a network with dataset 1

Table 2 parameter settings when training a network using dataset 2

For data set 1, the period of the time series data is 150 time stamps, and the length 150 of one period is taken as an input length. Characteristic dimensions obtained by finally learning in the fully-connected layers in the network structures of different hidden layers are 3600, iteration times are 1000, and experimental results show that the classification identification precision is highest when the number of the hidden layers is 4. For data set 2, the period of the time-series data is 326 time stamps, and the length 163 of a half period is taken as an input length. The feature dimensions obtained by learning at the full-connection layer in the network structures of different hidden layers are 6000, the iteration times are 1000, and the experimental result shows that the classification identification precision is highest when the number of the hidden layers is 4.

Fig. 3(a) and (b) show the classification accuracy acc and the training loss respectively of the convolutional neural network model of four structures obtained by training with the data set 1 (fig. 3(a) corresponds to the data set 1) and the data set 2 (fig. 3(b) corresponds to the data set 2), wherein the vertical axis of the coordinate on the left represents the variation value of the training loss, and the vertical axis of the coordinate on the right represents the classification accuracy on the test set (the data sets a1 and a 2). The training loss curves of the four structures tend to zero at different speeds, so that the constructed convolutional network has no overfitting phenomenon in the learning process and has better generalization capability for the learning of time series data. The classification precision reaches more than 90% after 1000 times of training and iteration on the data set 1, and the classification precision of the convolutional neural network model with 4 hidden layers is stable to the first after 400 times of iteration, so that the method has better classification performance. The classification accuracy of the 4 network models on the data set 2 has fluctuation of different degrees, wherein the classification accuracy of the network model with 4 hidden layers vibrates violently in the interval of iteration times of [0,100], and then is slowly improved, and the classification accuracy of the other three network models reaches higher classification accuracy after 1000 times of iteration. Combining the above results, the present invention determines a convolutional neural network model containing 4 hidden layers for time series data classification.

2. Evaluating the index;

the invention uses the classification precision and the confusion matrix to evaluate the performance of the method, and the indexes are defined as follows.

(1) And (3) classification precision: acc ═ N '/N'

(18)；

Where N 'represents the correctly classified time series segments in the test data set and N' represents the total number of time series segments in the test data set.

The confusion matrix, also called error matrix, is a standard format for representing the accuracy evaluation, and is represented in a matrix form. For the binary problem, it is finally necessary to determine whether the result of the sample is 0 or 1, or "positive" or "negative". Four basic indicators can be defined, called primary indicators (bottom-most):

the True value is "Positive", and the number of time series data fragments classified as "Positive" by the model is marked as True Positive (TP);

the true value is "positive", and the number of time series data fragments classified as "Negative" by the model is marked as true positive (FN);

the true value is negative, and the number of time series data fragments classified as Positive by the model is marked as true Positive (FP);

the True value is "Negative", and the number of time-series data fragments classified as "Negative" by the model is marked as True positive (TN);

the 4 indices were used to generate a Confusion Matrix (fusion Matrix):

TABLE 3 confusion matrix

3. Performance evaluation;

in order to perform performance analysis on the proposed time series data classification model, tests are firstly performed on both classification accuracy and a confusion matrix, and finally comparison is performed with a typical time series data classification algorithm in the field of fault diagnosis.

In comparison of recognition accuracy, the validation set is tested using a balanced data set (data set a), and the validation set is tested using an unbalanced data set (data set B) in the calculation of the confusion matrix.

Fig. 4 shows the classification accuracy acc and loss results of the CNN model trained with the datasets before and after processing of unbalanced data, respectively. Fig. 4(a) shows the experimental result on the data set 1, where the blue line represents the classification result of the convolutional neural network model obtained by training the original data set, and the red line represents the result on the data set processed by the unbalanced data, which is obviously improved significantly. When the CNN model trained by the data set after unbalanced data processing is used for classification, the classification precision reaches over 90% after 200 iterations, and reaches 98.633% after 1000 iterations; when the CNN model trained by the original data set is classified, the classification precision is gradually improved in an unstable state after 600 iterations, but the classification precision does not reach 80%, and the classification precision reaches 87.402% after 1000 iterations. When the data set after unbalanced data processing is used for training the CNN model, the loss value is converged at a higher speed, and the loss value is reduced to 0.00548 when the data set is iterated for 1000 times; when the CNN model is trained by using the original data set, the loss value is reduced slowly, and the loss value is reduced to 0.244 after 1000 iterations. Fig. 4(b) shows the experimental results on the data set 2, wherein the blue line is the classification result of the convolutional neural network trained on the original data set, and the red line is the result on the data set after the unbalanced data processing, and it is obvious that the experimental results after the unbalanced data processing are significantly improved. When a CNN model obtained by training a data set after unbalanced data processing is used for classification, the classification precision reaches over 90% after 200 iterations, and reaches 96.48% after 1000 iterations; when the CNN model trained by the original data set is used for classification, the classification precision does not reach about 76% after 200 times of iteration, the classification precision is in an unstable state after 600 times of iteration, and the classification precision is not obviously improved after 1000 times of iteration. When a CNN model trained by the data set after unbalanced data processing is used for training, the loss value is converged at a higher speed, and the loss value is reduced to 0.000054 when the CNN model is iterated for 1000 times; when the CNN model is trained by using the original data set, the loss value is slowly reduced when the number of iterations is below 600, and the loss value is reduced to 0.00063 after 1000 iterations. The experimental results on the two data sets are integrated, the learning of the network model on the original data set excessively depends on the training data, so that the classification precision is low, the unbalanced data processing algorithm makes up the defects, the distribution difference among the data is reduced, the learning capability of the classifier on abnormal data is enhanced, and the classification performance of the classifier is further improved.

TABLE 4 confusion matrix (%)

TABLE 5 confusion matrix (%)

TABLE 6 confusion matrix (%)

TABLE 7 confusion matrix (%)

For the unbalanced data set, the classification accuracy of the network model has certain limitation, abnormal data are easily mistakenly classified into normal data, and after the unbalanced data are processed in the data set, the learning capacity of the model for the abnormal data is improved, and the error rate is reduced. Therefore, the unbalanced time series data processing algorithm provided by the invention has a better correction effect on the unbalanced data set classification, and the good performance of the proposed time series data classification model is proved.

TABLE 8 Classification accuracy of different algorithms on dataset 1

TABLE 9 Classification accuracy of different algorithms on dataset 2

Tables 8 and 9 show the classification results of the different time series data classification algorithms on dataset 1 and dataset 2, respectively. The experiments are respectively carried out on an original data set and a data set processed by unbalanced data, the feature extraction algorithm in the comparison method is Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), and the classifier is a Support Vector Machine (SVM) and a Neural Network (NN). No matter which classifier is used, the classification precision of the data set after unbalanced data processing is obviously improved, compared with other classifiers, the method provided by the invention does not need to combine feature extraction and the classifiers, but completes the autonomous learning of features and classification at one time, and has better adaptability to the change of data regularity.

The invention provides an unbalanced time sequence data classification method based on autonomous learning from the viewpoint of data driving aiming at unbalanced time sequence data. The method mainly comprises two stages of unbalanced data processing and time series data classification. In the unbalanced data processing stage, a sampling method is adopted to divide a few types of samples into three types of aggregation points, critical points and isolated points, and then the time stamps and signal values are interpolated in each type. In the time series data classification stage, the invention constructs a convolutional neural network model with 4 hidden layers, and realizes feature extraction and classification by utilizing the autonomous feature mapping capability of the convolutional neural network. The method solves the problem that the detection precision of a few classes is seriously reduced because a general learning model is absolutely biased to the majority classes, and remarkably improves the classification precision of the unbalanced time sequence data set.

Claims

1. An unbalanced time series data classification method based on autonomous learning is characterized in that: the method specifically comprises the following steps:

step 1, processing the unbalanced time sequence data to construct a new sample;

2. The method for classifying unbalanced time-series data based on autonomous learning according to claim 1, wherein: the specific process of the step 1 is as follows:

3. The method for classifying unbalanced time-series data based on autonomous learning according to claim 2, wherein: the specific process of the step 1.1 is as follows:

in order to maintain the distribution state of the data set, a fuzzy clustering algorithm is adopted to carry out on the data set Q { Q }_j(m_j,n_j) 1,2, …, u clustering, dividing samples in the data set into 3 subsets: set of isolated points Q₁{q_1j(m_1j,n_1j)}，j＝1,2,…,u₁Critical point set Q₂{q_2j(m_2j,n_2j)}，j＝1,2,…,u₂And aggregation point set Q₃{q_3j(m_3j,n_3j)}，j＝1,2,…,u₃Wherein u is₁Denotes the number of outliers, u₂Representing the number of critical points, u₃Indicating the number of focal points, u₁+u₂+u₃U, the clustering centers of the isolated point set, the critical point set and the aggregation point set obtained by the clustering algorithm are respectively: r₁(m′₁,n′₁)、R₂(m′₂,n′₂)、R₃(m′₃,n′₃)。

4. The method for classifying unbalanced time-series data based on autonomous learning according to claim 3, wherein: the specific process of the step 1.2 is as follows:

step 1.2.1, order

Set of presentation points

set of presentation points

set of presentation points

Step 1.2.2, for point sets

A certain sample point q (m, n), the sample point q (m, n) to the point set

q₁₁(m₁₁,n₁₁),q₁₂(m₁₂,n₁₂),…,q_1g(m_1g,n_1g) (3)；

wherein g represents a set of points

The number of sample points satisfying equation (2).

As shown in the following equation (4):

wherein rand (0,1) represents a random number within the interval (0, 1);

constructing time component values of new samples

As shown in the following equation (5):

Step 1.2.3, repeatedly executing step 1.2.2 until the point set is traversed

All sample points in (a);

step 1.2.4, respectively aligning point sets

And

performing as a set of points

The step 1.2.2-1.2.4, respectively obtaining a set of points

And

a new sample is generated;

step 1.2.5, merging the new sample obtained in step 1.2.3 and the new sample obtained in step 1.2.4 into the data set Q { Q } in step 1.1_j(m_j,n_j)}J-1, 2, …, u, a new data set can be generated

5. The method for classifying unbalanced time-series data based on autonomous learning according to claim 4, wherein: the specific process of the step 2 is as follows:

step 2.1, scale transformation;

for data sets

is provided with

Representing the scaled signal value of the jth sample, and

wherein,

step 2.2, data segmentation;

for the number of segments after segmentation, then

The range of each segment is:

6. the method for classifying unbalanced time-series data based on autonomous learning according to claim 5, wherein: the specific process of the step 3 is as follows:

Inputting into a network model;

7. The method for classifying unbalanced time-series data based on autonomous learning according to claim 6, wherein: the specific process of the step 4 is as follows:

H＝-∑y_klog p_k (9)；

wherein, y_kIndicates the desired tag type, p_kIs the actual output;