CN113220960A - Unbalanced time series data classification method based on autonomous learning - Google Patents
- Publication number: CN113220960A
- Application number: CN202110515698.0A
- Authority: CN (China)
- Legal status: Pending (assumed by the database; not a legal conclusion)
Classifications
- G06F16/906: Information retrieval; details of database functions; clustering and classification
- G06N3/045: Neural networks; architecture; combinations of networks
- G06N3/047: Probabilistic or stochastic networks
- G06N3/048: Activation functions
- G06N3/084: Learning methods; backpropagation, e.g. using gradient descent
Abstract
The invention discloses an unbalanced time series data classification method based on autonomous learning, which specifically comprises the following steps: step 1, processing the unbalanced time series data to construct new samples; step 2, sequentially carrying out scale transformation and data segmentation on the new samples constructed in step 1; step 3, constructing a deep convolutional neural network model based on the result obtained in step 2; and step 4, training the neural network model constructed in step 3 and establishing an optimal time series data classification model from the training result to perform time series classification. The method solves the problem that a general learner is absolutely biased toward the majority class, which severely degrades detection accuracy for the minority class, and significantly improves classification accuracy on unbalanced time series data sets.
Description
Technical Field
The invention belongs to the technical field of time series data classification, and relates to an unbalanced time series data classification method based on autonomous learning.
Background
A time series is data arranged in temporal order; such data directly reflect the state or degree of change of an object or phenomenon over time. Time series data mining extracts previously unknown, useful information related to time attributes from large amounts of time series data and guides social, economic and daily activities. In the field of aerospace measurement and control, a large amount of telemetry data is presented as time series; these engineering data directly reflect the operating state of an aircraft, so classifying the data and mining the information and rules they contain is very important for research on equipment fault-diagnosis technology. The time series data classification problem has therefore become an important research topic in both engineering and academia.
Unbalanced time series data refers to a data set in which the number of minority-class samples is far smaller than the number of majority-class samples. For example, in aerospace measurement and control engineering, most measured time series data lie within the normal range and only a few abnormal values exist, so such data form a typical unbalanced time series data set. In the binary classification problem, the imbalance of the data distribution severely degrades the detection accuracy and performance of the classifier: the result of a general classifier is heavily biased toward the normal class, and the false-detection rate for the abnormal class is very high. In practical applications the minority class is the focus of attention; if a "fault" is misdiagnosed as "normal" and the faulty system continues to work, unpredictable consequences and losses can result.
Time series data classification is an important branch of time series data mining. The problem differs from other data classification tasks in that the signal values at the individual time points of a time series do not exist independently; the whole time series is treated as one input during processing.
Disclosure of Invention
The invention aims to provide an unbalanced time series data classification method based on autonomous learning, which solves the problem that the detection precision of a minority class is seriously reduced due to the fact that a general learner is absolutely biased to the majority class, and remarkably improves the classification precision of an unbalanced time series data set.
The technical scheme adopted by the invention is that the method for classifying unbalanced time series data based on autonomous learning specifically comprises the following steps:
step 1, processing the unbalanced time series data to construct new samples;
step 2, sequentially carrying out scale transformation and data segmentation on the new samples constructed in step 1;
step 3, constructing a deep convolutional neural network model based on the result obtained in step 2;
and step 4, training the neural network model constructed in step 3, and establishing an optimal time series data classification model according to the training result to perform time series classification.
The invention is also characterized in that:
the specific process of the step 1 is as follows:
step 1.1, let the data set be denoted Q = {q_j(m_j, n_j)}, j = 1, 2, …, u, where m_j denotes the time of the jth sample, n_j the signal value of the jth sample, and u the total number of data in the data set; in order to ensure that the distribution state of the data set is unchanged after unbalanced-data processing, the points in the data set are defined as the following 3 types: aggregation points, critical points and isolated points;
and step 1.2, generating a new sample according to the data set obtained in the step 1.1.
The specific process of step 1.1 is as follows:
in order to maintain the distribution state of the data set, a fuzzy clustering algorithm is used to cluster the data set Q = {q_j(m_j, n_j)}, j = 1, 2, …, u, dividing the samples into 3 subsets: the isolated-point set Q_1 = {q_{1j}(m_{1j}, n_{1j})}, j = 1, 2, …, u_1, the critical-point set Q_2 = {q_{2j}(m_{2j}, n_{2j})}, j = 1, 2, …, u_2, and the aggregation-point set Q_3 = {q_{3j}(m_{3j}, n_{3j})}, j = 1, 2, …, u_3, where u_1 denotes the number of isolated points, u_2 the number of critical points and u_3 the number of aggregation points, with u_1 + u_2 + u_3 = u; the cluster centers of the isolated-point, critical-point and aggregation-point sets obtained by the clustering algorithm are R_1(m′_1, n′_1), R_2(m′_2, n′_2) and R_3(m′_3, n′_3), respectively.
The specific process of the step 1.2 is as follows:
step 1.2.1, let d_{1,j1} denote the distance from the j1-th sample point of the isolated-point set Q_1 to the cluster center R_1(m′_1, n′_1), d_{2,j2} the distance from the j2-th sample point of the critical-point set Q_2 to the cluster center R_2(m′_2, n′_2), and d_{3,j3} the distance from the j3-th sample point of the aggregation-point set Q_3 to the cluster center R_3(m′_3, n′_3), as given by formula (1);
step 1.2.2, for a sample point q(m, n) of the point set Q_1, the distance from q(m, n) to the cluster center R_1(m′_1, n′_1) of Q_1 is denoted a, a = |n − n′_1|; all sample points of Q_1 satisfying formula (2) are searched for,
sorted according to the chronological order of their time components, and the result is recorded as:
q_11(m_11, n_11), q_12(m_12, n_12), …, q_1g(m_1g, n_1g) (3);
random linear interpolation is carried out between the signal component values of sample q(m, n) and of q_11(m_11, n_11), q_12(m_12, n_12), …, q_1g(m_1g, n_1g) respectively to construct the signal component value ñ_h of a new sample, as shown in the following formula (4):
ñ_h = n + rand(0, 1) × (n_1h − n), h = 1, 2, …, g (4);
where rand(0, 1) denotes a random number within the interval (0, 1) and m_1h, h = 1, 2, …, g, is the time component of sample q_1h; the newly generated samples are finally obtained as q̃_h(m_1h, ñ_h), h = 1, 2, …, g;
step 1.2.3, step 1.2.2 is executed repeatedly until all sample points in the point set Q_1 have been traversed;
step 1.2.4, steps 1.2.2–1.2.3 are performed on the point sets Q_2 and Q_3 in the same way as for Q_1, obtaining the new samples generated from Q_2 and Q_3 respectively;
step 1.2.5, the new samples obtained in step 1.2.3 and step 1.2.4 are merged into the data set Q = {q_j(m_j, n_j)}, j = 1, 2, …, u, of step 1.1, generating a new data set Q̃ = {q̃_j(m_j, n_j)}, j = 1, 2, …, U, where U denotes the total amount of data in the newly generated data set after unbalanced-data processing.
The specific process of the step 2 is as follows:
step 2.1, scale transformation;
for the data set Q̃ = {q̃_j(m_j, n_j)}, j = 1, 2, …, U, obtained in step 1, where m_j denotes the timestamp of the jth sample, n_j the signal value of the jth sample, and U the total number of data in the data set, the signal values are scale-transformed so that their dimensions are consistent;
step 2.2, data segmentation;
the data are divided into fixed-size segments using a sliding window with overlapping segments: the window function w has window length T and is moved with a fixed step t, dividing the sequence into equally spaced time series segments; L denotes the set of segmented time series segments and l_i the ith segment after segmentation; U being the total amount of data in the data set, the number of segments after segmentation is ⌊(U − T)/t⌋ + 1, and L = {l_i}, i = 1, 2, …, ⌊(U − T)/t⌋ + 1.
The range of the ith segment is [(i − 1)t + 1, (i − 1)t + T].
the specific process of the step 3 is as follows:
constructing a deep convolutional neural network model, wherein the model comprises an input layer, 4 hidden layers, 1 fully connected layer, a multi-layer perceptron and a softmax classifier;
the hidden layer comprises a convolutional layer C1, a pooling layer S2, a convolutional layer C3 and a pooling layer S4;
an input layer: the time series data segments {l_i} of length T obtained after scale transformation and time-slicing processing are input into the network model;
the deep convolutional neural network finally performs logistic regression with the softmax classifier and outputs the probability value P_r that the signal belongs to class 1 or class 2, r = 1, 2;
here, class 1 denotes a normal value and class 2 denotes an abnormal value.
The specific process of the step 4 is as follows:
training the data set with the convolutional neural network model obtained in step 3, outputting the probability that each time slice belongs to each class, and using the cross entropy as the cost function, as shown in the following formula (9):
H = −Σ_k y_k log p_k (9);
where y_k denotes the desired label type and p_k the actual output;
and performing error minimization training by taking an adaptive learning rate optimization algorithm Adam Optimizer as a back propagation training algorithm to obtain an optimal weight parameter, and establishing an optimal time series data classification model according to the optimal weight parameter to perform time series classification.
The invention has the following beneficial effects:
1. the invention provides an unbalanced time series data classification method based on autonomous learning aiming at unbalanced time series data from the data driving perspective, which comprises two stages of unbalanced time series data processing and time series data classification.
2. In the unbalanced time sequence data processing stage, a sampling method is adopted to divide a few types of samples into three types of aggregation points, critical points and isolated points, and then the time stamps and signal values are interpolated in each type.
3. In the time series data classification stage, a deep convolutional neural network model with 4 hidden layers is constructed, and feature extraction and classification are realized by utilizing the autonomous feature mapping capability of the convolutional neural network.
4. The method solves the problem that the detection precision of a minority class is seriously reduced because a general learner is absolutely biased to the majority class, and remarkably improves the classification precision of the unbalanced time sequence data set.
Drawings
FIG. 1 is a data generation process in an unbalanced time series data classification method based on autonomous learning according to the present invention;
FIG. 2 is a deep convolutional neural network model constructed in the unbalanced time series data classification method based on autonomous learning according to the present invention;
FIGS. 3(a) and 3(b) show the classification performance of the convolutional neural network under different hidden-layer structures in the unbalanced time series data classification method based on autonomous learning according to the present invention;
FIGS. 4(a) and 4(b) show the classification performance of convolutional neural networks trained with the original data sets and with the processed unbalanced data sets in the unbalanced time series data classification method based on autonomous learning according to the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses an unbalanced time series data classification method based on autonomous learning, which comprises the following specific steps:
step 1, processing unbalanced time series data;
Step 1.1, the minority-class data set in the training data is processed with a sampling method. Let the data set be denoted Q = {q_j(m_j, n_j)}, j = 1, 2, …, u, where m_j denotes the time of the jth sample, n_j the signal value of the jth sample, and u the total amount of data in the data set. In order to ensure that the distribution state of the data set is unchanged after unbalanced-data processing, the points in the data set are defined as the following 3 types:
Aggregation points: in the point-set distribution, the points distributed at the center of the point set, exhibiting an aggregated state.
Critical points: in the point-set distribution, the points scattered at the edge of the range where the aggregation points gather, limiting the range of the aggregation-point distribution.
Isolated points: in the point-set distribution, the points scattered at positions far from the aggregation range of the aggregation points, located outside the edge formed by the critical points and in an isolated state.
FIG. 1 illustrates the distribution of the 3 types of points.
In order to maintain the distribution state of the data set, a fuzzy clustering algorithm is used to cluster the data set Q = {q_j(m_j, n_j)}, j = 1, 2, …, u, dividing the samples into 3 subsets: the isolated-point set Q_1 = {q_{1j}(m_{1j}, n_{1j})}, j = 1, 2, …, u_1, the critical-point set Q_2 = {q_{2j}(m_{2j}, n_{2j})}, j = 1, 2, …, u_2, and the aggregation-point set Q_3 = {q_{3j}(m_{3j}, n_{3j})}, j = 1, 2, …, u_3, where u_1 denotes the number of isolated points, u_2 the number of critical points and u_3 the number of aggregation points, with u_1 + u_2 + u_3 = u. The cluster centers of the isolated-point, critical-point and aggregation-point sets obtained by the clustering algorithm are R_1(m′_1, n′_1), R_2(m′_2, n′_2) and R_3(m′_3, n′_3), respectively.
Step 1.2, generating a new sample;
Let d_{1,j1} denote the distance from the j1-th sample point of the isolated-point set Q_1 to the cluster center R_1(m′_1, n′_1), d_{2,j2} the distance from the j2-th sample point of the critical-point set Q_2 to the cluster center R_2(m′_2, n′_2), and d_{3,j3} the distance from the j3-th sample point of the aggregation-point set Q_3 to the cluster center R_3(m′_3, n′_3), as given by formula (1).
For a sample point q(m, n) of the point set Q_1, the distance from this sample point to the cluster center R_1(m′_1, n′_1) of Q_1 is denoted a, a = |n − n′_1|. All sample points of Q_1 satisfying formula (2) are searched for,
sorted according to the chronological order of their time components, and the result is recorded as:
q_11(m_11, n_11), q_12(m_12, n_12), …, q_1g(m_1g, n_1g) (3);
Random linear interpolation is carried out between the signal component values of sample q(m, n) and of q_11(m_11, n_11), q_12(m_12, n_12), …, q_1g(m_1g, n_1g) respectively to construct the signal component value ñ_h of a new sample:
ñ_h = n + rand(0, 1) × (n_1h − n), h = 1, 2, …, g (4);
where rand(0, 1) denotes a random number within the interval (0, 1) and m_1h, h = 1, 2, …, g, is the time component of sample q_1h, so that the newly generated samples are q̃_h(m_1h, ñ_h), h = 1, 2, …, g. This procedure is repeated until all sample points of Q_1 have been traversed.
The whole procedure used for the point set Q_1 is then repeated on the point sets Q_2 and Q_3 (the specific process of generating new samples from Q_2 and Q_3 is the same as that for Q_1), yielding all newly generated samples; the new sample points are merged into the original data set to generate a new data set Q̃ = {q̃_j(m_j, n_j)}, j = 1, 2, …, U, where U denotes the total amount of data in the newly generated data set after unbalanced-data processing.
step 2, scale transformation and data segmentation;
step 2.1, scale transformation;
For the data set Q̃ = {q̃_j(m_j, n_j)}, j = 1, 2, …, U, where m_j denotes the timestamp of the jth sample, n_j the signal value of the jth sample, and U the total amount of data in the data set: inconsistency of the data dimensions affects the speed of network learning, and to avoid this influence the signal values are scale-transformed to achieve dimensional consistency. Let n̄_j denote the scaled signal value of the jth sample.
step 2.2, data segmentation;
Time series data are mostly long sequences with timestamps, and the signal values have a temporal dependency. So that the network model can learn this characteristic of time series data and the time-sequential dependency is preserved, the data are divided into fixed-size segments. A sliding window with overlapping segments is used: the window function w has window length T, and the sequence is divided into equally spaced time series segments by moving the window with a fixed step t. The set of segmented time series segments is denoted L, l_i denotes the ith segment after segmentation, U is the total amount of data in the data set, and the number of segments after segmentation is ⌊(U − T)/t⌋ + 1; then L = {l_i}, i = 1, 2, …, ⌊(U − T)/t⌋ + 1.
The range of the ith segment is [(i − 1)t + 1, (i − 1)t + T].
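The overlapping sliding-window segmentation can be sketched as follows; with U = 10, T = 4 and t = 2 it produces ⌊(U − T)/t⌋ + 1 = 4 equally spaced segments.

```python
def segment(series, window_len, step):
    """Split a sequence into overlapping fixed-size segments:
    window length T = window_len, fixed step t = step (step 2.2)."""
    if len(series) < window_len:
        return []
    return [series[i:i + window_len]
            for i in range(0, len(series) - window_len + 1, step)]

segs = segment(list(range(10)), window_len=4, step=2)
```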
the method constructs a deep convolutional neural network model which comprises an input layer, 4 hidden layers, 1 fully-connected layer and a multi-layer perceptron, and softmax is used as a classifier. Model structure as shown in fig. 2, the hidden layer is used for feature extraction, and includes convolutional layer C1, pooling layer S2, convolutional layer C3 and pooling layer S4, two important operations of convolution and pooling, and the softmax classifier is mainly used for time series classification. The working process of the network model is described in detail below.
An input layer: the time series data segments {l_i} of length T obtained after scale transformation and time-slicing processing are input into the network model.
We will describe the working process of the hidden layer by taking any time sequence segment l as an example.
Layer C1: the method of the invention uses a Gaussian convolution kernel
g(x) = (1/(√(2π)σ)) exp(−x²/(2σ²)),
where σ denotes the convolution width, controlling the radial range of action of the function; in this method it was found by experiment that σ = 0.1 is optimal.
Suppose layer C1 has v_1 convolution kernels k^{C1}_p of size n_1; convolution in layer C1 generates v_1 feature vectors W^{C1}_p of length c_1:
W^{C1}_p = ReLU(conv(l, k^{C1}_p) + b^{C1}_p), p = 1, 2, …, v_1,
c_1 = t − n_1 + 1,
where W^{C1}_p denotes a feature vector, c_1 the length of the feature vectors, v_1 their number, b^{C1}_p the bias of layer C1, conv(·) the convolution function and ReLU(·) the activation function.
Layer S2: suppose layer S2 has a pooling window of size a_2 and step l_2; the feature vectors W^{C1}_p then generate, after layer S2, v_1 feature vectors W^{S2}_p of length c_2:
W^{S2}_p = ReLU(β^{S2}_p D(W^{C1}_p) + b^{S2}_p), p = 1, 2, …, v_1,
c_2 = (t − n_1 + 1 − a_2)/l_2 + 1,
where W^{S2}_p denotes a feature vector, c_2 the length of the feature vectors, v_1 their number, β^{S2}_p the shared weight of layer S2, b^{S2}_p the bias of layer S2, D(·) the down-sampling function and ReLU(·) the activation function.
Layer C3: suppose layer C3 has v_3 convolution kernels k^{C3}_q of size n_3; the feature vectors W^{S2}_p obtained at layer S2 are convolved in layer C3 to generate v_3 feature vectors W^{C3}_q of length c_3:
W^{C3}_q = ReLU(conv(W^{S2}, k^{C3}_q) + b^{C3}_q), q = 1, 2, …, v_3,
c_3 = (t − n_1 + 1 − a_2)/l_2 − n_3 + 2,
where W^{C3}_q denotes a feature vector, c_3 the length of the feature vectors, v_3 their number, b^{C3}_q the bias of layer C3, conv(·) the convolution function and ReLU(·) the activation function.
Layer S4: suppose layer S4 has a pooling window of size a_4 and step l_4; the feature vectors W^{C3}_q obtained at layer C3 then generate, after layer S4, v_3 feature vectors W^{S4}_q of length c_4:
W^{S4}_q = ReLU(β^{S4}_q D(W^{C3}_q) + b^{S4}_q), q = 1, 2, …, v_3,
c_4 = (t − n_1 + 1 − a_2 − n_3 l_2 + 2 l_2 − a_4 l_2)/(l_2 l_4) + 1,
where W^{S4}_q denotes a feature vector, c_4 the length of the feature vectors, v_3 their number, β^{S4}_q the shared weight of layer S4, b^{S4}_q the bias of layer S4, D(·) the down-sampling function and ReLU(·) the activation function.
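The per-layer output lengths c_1 through c_4 can be checked numerically. The helper below chains the four formulas and, under example hyperparameters (hypothetical values chosen for the check, not taken from the patent), confirms that the closed-form expression for c_4 given in the text agrees with the step-by-step computation.

```python
def hidden_layer_lengths(t, n1, a2, l2, n3, a4, l4):
    """Chain the output lengths of layers C1, S2, C3 and S4."""
    c1 = t - n1 + 1                 # convolution with kernel size n1
    c2 = (c1 - a2) // l2 + 1        # pooling window a2, step l2
    c3 = c2 - n3 + 1                # convolution with kernel size n3
    c4 = (c3 - a4) // l4 + 1        # pooling window a4, step l4
    return c1, c2, c3, c4

# example hyperparameters (hypothetical)
t, n1, a2, l2, n3, a4, l4 = 150, 5, 2, 2, 5, 3, 2
c1, c2, c3, c4 = hidden_layer_lengths(t, n1, a2, l2, n3, a4, l4)

# closed form for c4 as given in the text
c4_closed = (t - n1 + 1 - a2 - n3 * l2 + 2 * l2 - a4 * l2) // (l2 * l4) + 1
```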
Rasterization: finally, the feature vectors W^{S4}_q are concatenated in order into a one-dimensional vector h of length c_5 = v_3 × c_4, as shown in formula (13).
Layer MP5: layer MP5 is a multi-layer perceptron that maps one set of vectors to another. A three-layer perceptron is used here: one input layer, one hidden layer and one output layer. The rasterized feature vector h is input into layer MP5, and feature mapping is performed in the hidden layer, whose number of neurons is v_5, indexed o = 1, 2, …, v_5; for the two-class problem the number of neurons in the output layer is 2 (i.e. formula (15), where r = 1, 2),
where the hidden-layer mapping uses the weight and bias of the hidden layer in the MLP together with the tanh activation function tanh(·),
and the output layer uses the weight and bias of the output layer in the MLP.
Network output: the convolutional neural network finally performs logistic regression with a softmax classifier and outputs the probability value P_r that the signal belongs to class 1 (normal value) or class 2 (abnormal value), where r = 1, 2.
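The softmax mapping from the two output-layer activations to the class probabilities P_r (r = 1, 2) can be sketched as:

```python
import math

def softmax(logits):
    """Convert output-layer activations into class probabilities."""
    m = max(logits)                         # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# hypothetical activations for class 1 (normal) and class 2 (abnormal)
p = softmax([2.0, 0.5])
```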
The convolutional neural network model is trained on the data set, outputting the probability that each time slice belongs to each class, with the cross entropy as the cost function (see formula (17)):
H = −Σ_k y_k log p_k (17)
where y_k denotes the desired label type and p_k the actual output.
Error-minimization training is performed with the adaptive-learning-rate optimization algorithm Adam optimizer as the back-propagation training algorithm to obtain the optimal weight parameters, and the optimal time series data classification model is established for time series classification.
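The cross-entropy cost of formula (17) for one example with a one-hot label can be sketched as:

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """H = -sum_k y_k * log(p_k); eps guards against log(0).

    y : one-hot desired label, e.g. [1, 0]
    p : predicted class probabilities
    """
    return -sum(yk * math.log(pk + eps) for yk, pk in zip(y, p))

# a confident correct prediction costs little; a confident wrong one costs much
low = cross_entropy([1, 0], [0.9, 0.1])
high = cross_entropy([1, 0], [0.1, 0.9])
```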
Examples
An experiment platform: the deep-learning platform adopted in the experiments is TensorFlow 1.3.0 with the Python 3.5 interface; the computer hardware is configured with an i7 processor, 8 GB of installed memory and a 64-bit operating system.
Data set: and taking the rotating speed data and the temperature data of certain equipment in the actual engineering as experimental data.
Data set 1: rotational speed data for a device. The training data set contains 140281 signal values, of which there are 35707 outlier data values; in the test data set, the balanced data set a1 contains 5312 signal values, where there are 2656 abnormal data; the unbalanced data set B1 contains 1087 signal values, 170 of which are anomalous data.
Data set 2: temperature data for a device. The training data set contains 50001 signal values, with 3901 anomalous data values; in the test data set, the equilibrium data set a2 contained 9615 signal values, with 4807 anomalous data; the unbalanced data set B2 contains 9158 signal values, of which there are 2313 outliers.
In the experiment, when supervised training is carried out, the label of the normal value is marked as 1, and the label of the abnormal value is marked as 0.
1. Setting the number of hidden layers;
in order to establish an optimal convolutional neural network structure, the classification performance of convolutional neural network models of different hidden layers on an experimental data set is explored through experiments.
First, data set 1 and data set 2 are processed with the unbalanced time series data processing algorithm described in step 1; second, the processed data sets undergo scale transformation and time slicing and are fed into convolutional neural network models with different numbers of hidden layers for training; the models are then tested on data sets A1 and A2 to obtain the recognition accuracy and training-loss values of the network models with different hidden layers.
Table 1 and table 2 are specific parameter settings of the hidden layer when the network model is trained using data set 1 and data set 2, respectively.
TABLE 1 parameter settings when training a network with dataset 1
Table 2 parameter settings when training a network using dataset 2
For data set 1, the period of the time series data is 150 timestamps, and the length of one period, 150, is taken as the input length. The feature dimension finally learned at the fully connected layer in the network structures with different hidden layers is 3600, the number of iterations is 1000, and the experimental results show that classification recognition accuracy is highest when the number of hidden layers is 4. For data set 2, the period of the time series data is 326 timestamps, and the length of half a period, 163, is taken as the input length. The feature dimension learned at the fully connected layer in the network structures with different hidden layers is 6000, the number of iterations is 1000, and the experimental results show that classification recognition accuracy is again highest when the number of hidden layers is 4.
FIGS. 3(a) and 3(b) show the classification accuracy acc and the training loss of the convolutional neural network models of four structures trained with data set 1 (FIG. 3(a)) and data set 2 (FIG. 3(b)); the left vertical axis represents the change in training loss and the right vertical axis the classification accuracy on the test sets (data sets A1 and A2). The training-loss curves of the four structures tend to zero at different speeds, showing that the constructed convolutional networks exhibit no overfitting during learning and generalize well on time series data. On data set 1 the classification accuracy exceeds 90% after 1000 training iterations, and the model with 4 hidden layers is the first to stabilize, after 400 iterations, so it has the better classification performance. On data set 2 the classification accuracy of the 4 network models fluctuates to different degrees: the accuracy of the model with 4 hidden layers oscillates sharply over the iteration interval [0, 100] and then improves slowly, while the other three models also reach high classification accuracy after 1000 iterations. Combining these results, the invention adopts a convolutional neural network model with 4 hidden layers for time series data classification.
2. Evaluation metrics;
the invention evaluates the performance of the method with classification accuracy and the confusion matrix; these metrics are defined as follows.
(1) Classification accuracy:

Acc = N′/N (18);

where N′ is the number of correctly classified time-series segments in the test data set and N is the total number of time-series segments in the test data set.
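As a minimal illustrative sketch (not part of the patent itself), equation (18) can be computed directly from the true and predicted labels of the test segments:

```python
def classification_accuracy(y_true, y_pred):
    """Acc = N'/N: fraction of correctly classified time-series segments."""
    assert len(y_true) == len(y_pred) and y_true
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# e.g. three of four test segments classified correctly gives Acc = 0.75
print(classification_accuracy([1, 1, 2, 2], [1, 2, 2, 2]))
```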
The confusion matrix, also called the error matrix, is a standard format for accuracy evaluation, expressed in matrix form. For a binary problem, each sample must finally be judged as 0 or 1, i.e., "positive" or "negative". Four basic counts, called primary indicators, can be defined:
the true value is "Positive" and the model classifies the time-series data segment as "Positive": the count of such segments is the True Positives (TP);
the true value is "Positive" and the model classifies the segment as "Negative": the count is the False Negatives (FN);
the true value is "Negative" and the model classifies the segment as "Positive": the count is the False Positives (FP);
the true value is "Negative" and the model classifies the segment as "Negative": the count is the True Negatives (TN).
These 4 counts are used to form the Confusion Matrix:
TABLE 3 Confusion matrix

| | Predicted positive | Predicted negative |
---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |
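A sketch of how the four counts of Table 3 can be tallied from true and predicted labels (function and variable names are illustrative, not taken from the patent):

```python
def binary_confusion_matrix(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary classification result."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}
```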
3. Performance evaluation;
to analyze the performance of the proposed time-series data classification model, tests are first performed on both classification accuracy and the confusion matrix, and the method is then compared with typical time-series data classification algorithms from the field of fault diagnosis.
For the accuracy comparison, the validation set is drawn from the balanced data set (data set A); for the confusion-matrix computation, it is drawn from the unbalanced data set (data set B).
Fig. 4 shows the classification accuracy acc and the loss of the CNN models trained on the data sets before and after unbalanced-data processing. Fig. 4(a) shows the results on data set 1: the blue line is the classification result of the convolutional neural network model trained on the original data set, and the red line is the result on the data set after unbalanced-data processing, which is clearly better. With the CNN model trained on the processed data set, classification accuracy exceeds 90% after 200 iterations and reaches 98.633% after 1000 iterations; with the model trained on the original data set, accuracy improves gradually but unstably after 600 iterations, stays below 80% for most of training, and reaches 87.402% after 1000 iterations. When the processed data set is used for training, the loss converges faster, falling to 0.00548 at 1000 iterations; with the original data set, the loss decreases slowly, falling to 0.244 after 1000 iterations. Fig. 4(b) shows the results on data set 2; again the blue line is the model trained on the original data set and the red line the model trained on the processed data set, and the improvement after unbalanced-data processing is evident.
With the CNN model trained on the processed data set, classification accuracy exceeds 90% after 200 iterations and reaches 96.48% after 1000 iterations; with the model trained on the original data set, accuracy reaches only about 76% after 200 iterations, remains unstable after 600 iterations, and shows no obvious improvement by 1000 iterations. When the processed data set is used for training, the loss converges faster, falling to 0.000054 at 1000 iterations; with the original data set, the loss decreases slowly below 600 iterations and falls to 0.00063 after 1000 iterations. Taking the results on both data sets together: learning on the original data set depends too heavily on the majority training data, giving low classification accuracy, while the unbalanced-data processing algorithm compensates for this deficiency, reduces the distribution difference between classes, strengthens the classifier's learning of abnormal data, and thereby improves its classification performance.
TABLE 4 confusion matrix (%)
TABLE 5 confusion matrix (%)
TABLE 6 confusion matrix (%)
TABLE 7 confusion matrix (%)
On the unbalanced data set, the classification accuracy of the network model is limited: abnormal data are easily misclassified as normal. After unbalanced-data processing of the data set, the model's ability to learn abnormal data improves and the error rate drops. The proposed unbalanced time-series data processing algorithm therefore corrects the classification bias on unbalanced data sets, confirming the good performance of the proposed time-series data classification model.
TABLE 8 Classification accuracy of different algorithms on dataset 1
TABLE 9 Classification accuracy of different algorithms on dataset 2
Tables 8 and 9 show the classification results of different time-series data classification algorithms on data set 1 and data set 2 respectively. The experiments are run both on the original data sets and on the data sets after unbalanced-data processing; the comparison methods use Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) for feature extraction, with Support Vector Machine (SVM) and Neural Network (NN) classifiers. Whichever classifier is used, classification accuracy improves markedly on the processed data sets. Unlike the compared methods, the proposed approach needs no separate combination of feature extraction and classifier: it learns features and classifies in a single pass, and adapts better to changes in data regularity.
Aiming at unbalanced time-series data, the invention provides, from a data-driven viewpoint, an unbalanced time-series data classification method based on autonomous learning. The method comprises two stages: unbalanced-data processing and time-series data classification. In the unbalanced-data processing stage, a sampling method divides the minority-class samples into three types of points (aggregation points, critical points and isolated points) and then interpolates time stamps and signal values within each type. In the classification stage, the invention constructs a convolutional neural network model with 4 hidden layers and uses the network's autonomous feature-mapping capability to perform feature extraction and classification. The method overcomes the severe drop in minority-class detection accuracy caused by a general learning model's bias toward the majority class, and markedly improves classification accuracy on unbalanced time-series data sets.
Claims (7)
1. An unbalanced time series data classification method based on autonomous learning is characterized in that: the method specifically comprises the following steps:
step 1, processing the unbalanced time sequence data to construct a new sample;
step 2, sequentially carrying out scale transformation and data segmentation on the new sample constructed in the step 1;
step 3, constructing a deep convolutional neural network model based on the result obtained in the step 2;
step 4, training the neural network model constructed in step 3, and establishing an optimal time-series data classification model according to the training result to perform time-series classification.
2. The method for classifying unbalanced time-series data based on autonomous learning according to claim 1, wherein: the specific process of the step 1 is as follows:
step 1.1, let the data set be denoted Q = {q_j(m_j, n_j)}, j = 1, 2, …, u, where m_j denotes the time of the j-th sample, n_j the signal value of the j-th sample, and u the total number of data points in the data set; to ensure that the distribution state of the data set is unchanged after unbalanced-data processing, the points in the data set are divided into the following 3 types: aggregation points, critical points, isolated points;
and step 1.2, generating a new sample according to the data set obtained in the step 1.1.
3. The method for classifying unbalanced time-series data based on autonomous learning according to claim 2, wherein: the specific process of the step 1.1 is as follows:
in order to maintain the distribution state of the data set, a fuzzy clustering algorithm is used to cluster the data set Q = {q_j(m_j, n_j)}, j = 1, 2, …, u, dividing its samples into 3 subsets: the isolated point set Q1 = {q1j(m1j, n1j)}, j = 1, 2, …, u1; the critical point set Q2 = {q2j(m2j, n2j)}, j = 1, 2, …, u2; and the aggregation point set Q3 = {q3j(m3j, n3j)}, j = 1, 2, …, u3, where u1, u2 and u3 denote the numbers of isolated, critical and aggregation points respectively and u1 + u2 + u3 = u; the cluster centers of the isolated point set, critical point set and aggregation point set obtained by the clustering algorithm are R1(m′1, n′1), R2(m′2, n′2) and R3(m′3, n′3) respectively.
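The patent specifies a fuzzy clustering algorithm; as an illustrative stand-in only, the split into three subsets can be sketched with a crisp 1-D k-means on the signal values (all names and the clustering choice here are assumptions, not the patent's algorithm):

```python
def cluster_three_subsets(points, iters=20):
    """Split samples (m_j, n_j) into 3 subsets by crisp k-means on the signal
    value n_j -- a simple stand-in for the patent's fuzzy clustering step."""
    vals = [n for _, n in points]
    centers = [min(vals), sum(vals) / len(vals), max(vals)]  # spread-out init
    groups = [[], [], []]
    for _ in range(iters):
        groups = [[], [], []]
        for p in points:
            k = min(range(3), key=lambda i: abs(p[1] - centers[i]))
            groups[k].append(p)  # assign to nearest center
        centers = [sum(n for _, n in g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups, centers
```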
4. The method for classifying unbalanced time-series data based on autonomous learning according to claim 3, wherein: the specific process of the step 1.2 is as follows:
step 1.2.1, let d1,j1 denote the distance from the j1-th sample point of the isolated point set Q1 to the cluster center R1(m′1, n′1); let d2,j2 denote the distance from the j2-th sample point of the critical point set Q2 to the cluster center R2(m′2, n′2); and let d3,j3 denote the distance from the j3-th sample point of the aggregation point set Q3 to the cluster center R3(m′3, n′3);
Step 1.2.2, for a sample point q(m, n) of the isolated point set Q1, denote by a the distance from q(m, n) to the cluster center R1(m′1, n′1) of Q1, a = |n − n′1|; search for all sample points of Q1 that satisfy equation (2):
sort them in chronological order of their time components, and record the result as:

q11(m11, n11), q12(m12, n12), …, q1g(m1g, n1g) (3);
random linear interpolation is performed between the signal component of sample q(m, n) and the signal components of q11(m11, n11), q12(m12, n12), …, q1g(m1g, n1g) respectively, to construct the signal component values n̂h of the new samples, as shown in the following equation (4):

n̂h = n + rand(0,1)·(n1h − n), h = 1, 2, …, g (4);

where rand(0,1) represents a random number within the interval (0, 1);
where m1h, h = 1, 2, …, g, is the time component of sample q1h(m1h, n1h); the newly generated samples are finally obtained as q̂h(m1h, n̂h), h = 1, 2, …, g;
Step 1.2.3, repeat step 1.2.2 until all sample points of the isolated point set Q1 have been traversed;
step 1.2.4, perform steps 1.2.2 to 1.2.3 on the critical point set Q2 and the aggregation point set Q3 in the same way as on Q1, obtaining the new samples generated from Q2 and Q3 respectively;
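The interpolation of step 1.2.2 can be sketched as follows. The linear form n + r·(n1h − n) is an assumed reading of "random linear interpolation" consistent with the surrounding text, not a verbatim reproduction of equation (4), and all names are illustrative:

```python
import random

def generate_new_samples(q, neighbors, rng=None):
    """Interpolate between sample q = (m, n) and each neighbor (m_1h, n_1h),
    keeping the neighbor's time stamp and blending the signal values.
    The form n + r*(n_1h - n) is an assumed reading of equation (4)."""
    rng = rng or random.Random(0)
    m, n = q
    new_samples = []
    for m1h, n1h in neighbors:
        r = rng.random()  # rand(0,1)
        new_samples.append((m1h, n + r * (n1h - n)))
    return new_samples
```

Each generated signal value lies between the original sample's value and the neighbor's value, so the minority-class distribution is densified without leaving its support.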
5. The method for classifying unbalanced time-series data based on autonomous learning according to claim 4, wherein: the specific process of the step 2 is as follows:
step 2.1, scale transformation;
for the data set {q_j(m_j, n_j)}, j = 1, 2, …, U, obtained after the unbalanced-data processing of step 1, where m_j is the time stamp of the j-th sample, n_j the signal value of the j-th sample, and U the total number of data points in the data set;
step 2.2, data segmentation;
the data are divided into fixed-size segments using a sliding window of overlapping segments: a window function w with window length T is moved by a fixed step length t, splitting the sequence into equally spaced time-series segments; L denotes the set of segmented time-series segments and l_i the i-th segment after segmentation; with U the total number of data points in the data set, the number of segments after segmentation is K = ⌊(U − T)/t⌋ + 1, so L = {l_i}, i = 1, 2, …, K;
the range of the i-th segment is [(i − 1)·t + 1, (i − 1)·t + T];
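The overlapped sliding-window segmentation of step 2.2 can be sketched as (names are illustrative; the segment-count formula assumes the standard sliding-window convention):

```python
def segment_series(series, T, t):
    """Split a sequence into overlapping windows of length T moved by step t."""
    if len(series) < T:
        return []
    K = (len(series) - T) // t + 1  # number of segments
    return [series[i * t : i * t + T] for i in range(K)]
```

With step t < T, consecutive segments share T − t points, which is what makes the segments overlap.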
6. the method for classifying unbalanced time-series data based on autonomous learning according to claim 5, wherein: the specific process of the step 3 is as follows:
constructing a deep convolutional neural network model, wherein the model comprises an input layer, 4 hidden layers, 1 fully-connected layer, a multi-layer perceptron and a softmax classifier;
the hidden layer comprises a convolutional layer C1, a pooling layer S2, a convolutional layer C3 and a pooling layer S4;
an input layer: the time-series data segments {l_i}, i = 1, 2, …, K, of length T obtained after scale transformation and time segmentation are input into the network model;
the deep convolutional neural network finally uses a softmax classifier for logistic regression, outputting the probability P_r that the signal belongs to class r (r = 1 or 2); with z_1 and z_2 the fully-connected outputs for the two classes, the standard softmax form is:

P_r = e^{z_r} / (e^{z_1} + e^{z_2}), r = 1, 2;
Here, the category 1 indicates a normal value, and the category 2 indicates an abnormal value.
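A minimal sketch of the softmax output over the two classes (the two-logit form is the standard softmax; the logit names are assumptions, not from the patent):

```python
import math

def softmax(logits):
    """Standard softmax: probabilities over the classes (here class 1, class 2)."""
    mx = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```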
7. The method for classifying unbalanced time-series data based on autonomous learning according to claim 6, wherein: the specific process of the step 4 is as follows:
the data set is used to train the convolutional neural network model obtained in step 3; the network outputs the probability that each time segment belongs to each class, and the cross entropy is used as the cost function, as shown in the following equation (9):
H = −∑_k y_k log p_k (9);

where y_k indicates the desired label and p_k is the actual output probability;
and performing error minimization training by taking an adaptive learning rate optimization algorithm Adam Optimizer as a back propagation training algorithm to obtain an optimal weight parameter, and establishing an optimal time series data classification model according to the optimal weight parameter to perform time series classification.
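The cross-entropy cost of equation (9) can be sketched as below; the Adam optimization itself is left to a deep-learning framework, and this only shows the cost function (the epsilon guard is an added implementation detail, not from the patent):

```python
import math

def cross_entropy(y_desired, p_actual, eps=1e-12):
    """H = -sum_k y_k * log(p_k); eps guards against log(0)."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_desired, p_actual))
```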
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110515698.0A CN113220960A (en) | 2021-05-12 | 2021-05-12 | Unbalanced time series data classification method based on autonomous learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113220960A true CN113220960A (en) | 2021-08-06 |
Family
ID=77094989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110515698.0A Pending CN113220960A (en) | 2021-05-12 | 2021-05-12 | Unbalanced time series data classification method based on autonomous learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113220960A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114327045A (en) * | 2021-11-30 | 2022-04-12 | 中国科学院微电子研究所 | Fall detection method and system based on category unbalanced signals |
CN115374859A (en) * | 2022-08-24 | 2022-11-22 | 东北大学 | Method for classifying unbalanced and multi-class complex industrial data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hsu et al. | Multiple time-series convolutional neural network for fault detection and diagnosis and empirical study in semiconductor manufacturing | |
Chadha et al. | Time series based fault detection in industrial processes using convolutional neural networks | |
CN113220960A (en) | Unbalanced time series data classification method based on autonomous learning | |
CN111325264A (en) | Multi-label data classification method based on entropy | |
Cheriguene et al. | A new hybrid classifier selection model based on mRMR method and diversity measures | |
Bommert | Integration of feature selection stability in model fitting | |
Nafis et al. | Facial expression recognition on video data with various face poses using deep learning | |
Karankar et al. | Comparative study of various machine learning classifiers on medical data | |
Li et al. | A two-phase filtering of discriminative shapelets learning for time series classification | |
Gomiasti et al. | Enhancing Lung Cancer Classification Effectiveness Through Hyperparameter-Tuned Support Vector Machine | |
Dubey et al. | Hybrid classification model of correlation-based feature selection and support vector machine | |
Liu et al. | MRD-NETS: multi-scale residual networks with dilated convolutions for classification and clustering analysis of spacecraft electrical signal | |
Bandyopadhyay et al. | Automated label generation for time series classification with representation learning: Reduction of label cost for training | |
Singh et al. | Dimensionality reduction for classification and clustering | |
Akar et al. | Open set recognition for time series classification | |
Oh et al. | Multivariate time series open-set recognition using multi-feature extraction and reconstruction | |
Singh et al. | SMOTE-LASSO-DeepNet Framework for Cancer Subtyping from Gene Expression Data | |
Tamura et al. | Time series classification using macd-histogram-based recurrence plot | |
Bandyopadhyay et al. | Hierarchical clustering using auto-encoded compact representation for time-series analysis | |
Chen et al. | TimeMIL: Advancing Multivariate Time Series Classification via a Time-aware Multiple Instance Learning | |
Sengupta et al. | A scoring scheme for online feature selection: Simulating model performance without retraining | |
Azmer et al. | Comparative analysis of classification techniques for leaves and land cover texture. | |
Jiang et al. | A novel feature extraction approach for microarray data based on multi-algorithm fusion | |
Baraniya et al. | Breast Cancer Classification and Recurrence Prediction Using Artificial Neural Networks and Machine Learning Techniques | |
Li et al. | CNN-LDNF: an image feature representation approach with multi-space mapping |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210806 |