CN113220960A - Unbalanced time series data classification method based on autonomous learning - Google Patents

Unbalanced time series data classification method based on autonomous learning Download PDF

Info

Publication number
CN113220960A
CN113220960A CN202110515698.0A CN202110515698A CN113220960A CN 113220960 A CN113220960 A CN 113220960A CN 202110515698 A CN202110515698 A CN 202110515698A CN 113220960 A CN113220960 A CN 113220960A
Authority
CN
China
Prior art keywords
data
sample
points
time
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110515698.0A
Other languages
Chinese (zh)
Inventor
王晓峰
胡姣姣
郭小红
习英卓
周轩
冯冰清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
China Xian Satellite Control Center
Original Assignee
Xian University of Technology
China Xian Satellite Control Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology, China Xian Satellite Control Center filed Critical Xian University of Technology
Priority to CN202110515698.0A priority Critical patent/CN113220960A/en
Publication of CN113220960A publication Critical patent/CN113220960A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/906Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an unbalanced time series data classification method based on autonomous learning, which specifically comprises the following steps: step 1, processing the unbalanced time sequence data to construct a new sample; step 2, sequentially carrying out scale transformation and data segmentation on the new sample constructed in the step 1; step 3, constructing a deep convolutional neural network model based on the result obtained in the step 2; and 4, training the neural network model constructed in the step 3, and establishing an optimal time series data classification model according to the training result to perform time series classification. The method solves the problem that the detection precision of a minority class is seriously reduced due to the fact that a general learner is absolutely biased to the majority class, and the classification precision of the unbalanced time sequence data set is remarkably improved.

Description

Unbalanced time series data classification method based on autonomous learning
Technical Field
The invention belongs to the technical field of time series data classification, and relates to an unbalanced time series data classification method based on autonomous learning.
Background
The time sequence refers to data arranged according to the time sequence, and the data can directly reflect the state or degree of a certain object or phenomenon changing along with the time; time series data mining is to extract useful information related to time attributes, which is unknown in advance, from a large amount of time series data, and to guide activities of people such as society, economy, and life. In the field of aerospace measurement and control, a large amount of telemetering data are presented in a time series mode, the engineering data can directly reflect the operation state of an aircraft, and the data are classified and information and rules contained in the data are mined out, so that the research on equipment fault diagnosis technology is very important. Therefore, the time series data classification problem becomes an important research topic in engineering and academia.
The unbalanced time series data refers to a data set with a small number of samples far smaller than that of a large number of samples, for example, in aerospace measurement and control engineering, most of measured time series data are within a normal range, and only few abnormal values exist, so that the unbalanced time series data set is a typical unbalanced time series data set. In the binary classification problem, the detection accuracy and performance of the classifier are seriously reduced due to the imbalance of the data distribution, so that the result of a general classifier is seriously biased to a normal class, and the false detection rate of an abnormal class is very high. In practical application, a few classes are the focus of attention, and if the 'fault' is misdiagnosed as 'normal' and the faulty system continues to work, unpredictable consequences and loss can be caused.
The time series data classification is an important branch of time series data mining, the problem is different from other data classifications, signal values of each time point of the time series data do not exist independently, and the whole time series data is used as one input in the processing.
Disclosure of Invention
The invention aims to provide an unbalanced time series data classification method based on autonomous learning, which solves the problem that the detection precision of a minority class is seriously reduced due to the fact that a general learner is absolutely biased to the majority class, and remarkably improves the classification precision of an unbalanced time series data set.
The technical scheme adopted by the invention is that the method for classifying the unbalanced time series data based on the autonomous learning specifically comprises the following steps:
step 1, processing the unbalanced time sequence data to construct a new sample;
step 2, sequentially carrying out scale transformation and data segmentation on the new sample constructed in the step 1;
step 3, constructing a deep convolutional neural network model based on the result obtained in the step 2;
and 4, training the neural network model constructed in the step 3, and establishing an optimal time series data classification model according to the training result to perform time series classification.
The invention is also characterized in that:
the specific process of the step 1 is as follows:
step 1.1, let the data set be denoted as Q { Q }j(mj,nj) J-1, 2, …, u, where mjDenotes the time of the jth sample, njSignal value representing the jth sample, u representing the total number of data in the data set; in order to ensure that the distribution state of the data set is unchanged after unbalanced data processing, points in the data set are defined as the following 3 types: aggregation points, critical points, isolated points;
and step 1.2, generating a new sample according to the data set obtained in the step 1.1.
The specific process of step 1.1 is as follows:
in order to maintain the distribution state of the data set, a fuzzy clustering algorithm is adopted to carry out on the data set Q { Q }j(mj,nj) 1,2, …, u clustering, dividing samples in the data set into 3 subsets: set of isolated pointsQ1{q1j(m1j,n1j)},j=1,2,…,u1Critical point set Q2{q2j(m2j,n2j)},j=1,2,…,u2And aggregation point set Q3{q3j(m3j,n3j)},j=1,2,…,u3Wherein u is1Denotes the number of outliers, u2Representing the number of critical points, u3Indicating the number of focal points, u1+u2+u3U, the clustering centers of the isolated point set, the critical point set and the aggregation point set obtained by the clustering algorithm are respectively: r1(m′1,n′1)、R2(m′2,n′2)、R3(m′3,n′3)。
The specific process of the step 1.2 is as follows:
step 1.2.1, order
Figure BDA0003061926390000031
Set of presentation points
Figure BDA0003061926390000032
Middle j1Sample point to cluster center R1(m′1,n′1) The distance of (a) to (b),
Figure BDA0003061926390000033
set of presentation points
Figure BDA0003061926390000034
Middle j2Sample point to cluster center R2(m′2,n′2) The distance of (a) to (b),
Figure BDA0003061926390000035
set of presentation points
Figure BDA0003061926390000036
Middle j3Sample point to cluster center R3(m′3,n′3) A distance of (1) to
Figure BDA0003061926390000037
Step 1.2.2, for point sets
Figure BDA0003061926390000038
A certain sample point q (m, n), the sample point q (m, n) to the point set
Figure BDA0003061926390000039
Cluster center R of1(m′1,n′1) Is denoted as a, a ═ n-n'1L, search for all sample points of the following equation (2):
Figure BDA00030619263900000310
and sequencing according to the morning and evening sequence of the time components, and recording the result as:
q11(m11,n11),q12(m12,n12),…,q1g(m1g,n1g) (3);
wherein g represents a set of points
Figure BDA0003061926390000041
The number of sample points satisfying equation (2).
At samples q (m, n) and q11(m11,n11)、q12(m12,n12)、…、q1g(m1g,n1g) Respectively carrying out random linear interpolation between the signal component values to construct the signal component value of a new sample
Figure BDA0003061926390000042
As shown in the following equation (4):
Figure BDA0003061926390000043
wherein rand (0,1) represents a random number within the interval (0, 1);
constructing time component values of new samples
Figure BDA0003061926390000044
As shown in the following equation (5):
Figure BDA0003061926390000045
wherein m is1hH is 1,2, …, g is sample q11(m11,n11)、q12(m12,n12)、…、q1g(m1g,n1g) To finally obtain a newly generated sample as
Figure BDA0003061926390000046
Step 1.2.3, repeatedly executing step 1.2.2 until the point set is traversed
Figure BDA0003061926390000047
All sample points in (a);
step 1.2.4, respectively aligning point sets
Figure BDA0003061926390000048
And
Figure BDA0003061926390000049
performing as a set of points
Figure BDA00030619263900000410
The step 1.2.2-1.2.4, respectively obtaining a set of points
Figure BDA00030619263900000411
And
Figure BDA00030619263900000412
a new sample is generated;
step 1.2.5, and the new sample obtained in step 1.2.3 and the step1.2.4 the new samples obtained are merged into the data set Q { Q } in step 1.1j(mj,nj) J-1, 2, …, u, a new data set can be generated
Figure BDA00030619263900000413
U represents the total amount of data in the newly generated data set after unbalanced data processing.
The specific process of the step 2 is as follows:
step 2.1, scale transformation;
for data sets
Figure BDA00030619263900000414
Wherein m isjTime stamp representing the jth sample, njSignal value representing the jth sample, U representing the total number of data in the data set;
is provided with
Figure BDA0003061926390000051
Representing the scaled signal value of the jth sample, and
Figure BDA0003061926390000052
wherein,
Figure BDA0003061926390000053
step 2.2, data segmentation;
dividing data into fixed-size segments, adopting a sliding window of overlapped segments, namely, the window length of a window function w is T, moving by a fixed step length T to divide the sequence into equally-spaced time sequence segments, expressing a set of segmented time sequence segments by L, and LiRepresenting the ith time sequence segment after segmentation, U is the total amount of data in the data set,
Figure BDA0003061926390000054
for the number of segments after segmentation, then
Figure BDA0003061926390000055
The range of each segment is:
Figure BDA0003061926390000056
the specific process of the step 3 is as follows:
constructing a deep convolutional neural network model, wherein the model comprises an input layer, 4 hidden layers, 1 fully-connected layer, a multi-layer perceptron and a classifier softmax;
the hidden layer comprises a convolutional layer C1, a pooling layer S2, a convolutional layer C3 and a pooling layer S4;
an input layer: time series data fragment { l with length of T obtained after scale transformation and time slicing processingi},
Figure BDA0003061926390000057
Inputting into a network model;
the deep convolutional neural network finally uses a softmax classifier to carry out logistic regression, and the probability value P of the output signal belonging to the class 1 or 2r
Figure BDA0003061926390000058
Here, the category 1 indicates a normal value, and the category 2 indicates an abnormal value.
The specific process of the step 4 is as follows:
training the data set by using the convolutional neural network model obtained in the step 3, outputting the probability that each time slice belongs to each category, and using the cross entropy as a cost function, as shown in the following formula (9):
H=-∑yklog pk (9);
wherein, ykIndicates the desired tag type, pkIs the actual output;
and performing error minimization training by taking an adaptive learning rate optimization algorithm Adam Optimizer as a back propagation training algorithm to obtain an optimal weight parameter, and establishing an optimal time series data classification model according to the optimal weight parameter to perform time series classification.
The invention has the following beneficial effects:
1. the invention provides an unbalanced time series data classification method based on autonomous learning aiming at unbalanced time series data from the data driving perspective, which comprises two stages of unbalanced time series data processing and time series data classification.
2. In the unbalanced time sequence data processing stage, a sampling method is adopted to divide a few types of samples into three types of aggregation points, critical points and isolated points, and then the time stamps and signal values are interpolated in each type.
3. In the time series data classification stage, a deep convolutional neural network model with 4 hidden layers is constructed, and feature extraction and classification are realized by utilizing the autonomous feature mapping capability of the convolutional neural network.
4. The method solves the problem that the detection precision of a minority class is seriously reduced because a general learner is absolutely biased to the majority class, and remarkably improves the classification precision of the unbalanced time sequence data set.
Drawings
FIG. 1 is a data generation process in an unbalanced time series data classification method based on autonomous learning according to the present invention;
FIG. 2 is a deep convolutional neural network model constructed in the unbalanced time series data classification method based on autonomous learning according to the present invention;
3(a) and 3(b) are the classification performance of the convolutional neural network under different hidden layer structures in the classification method of the unbalanced time series data based on the autonomous learning according to the present invention;
fig. 4(a) and 4(b) illustrate the classification performance of a convolutional neural network trained by using an original data set and an unbalanced data set after processing according to an unbalanced time series data classification method based on autonomous learning.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention discloses an unbalanced time series data classification method based on autonomous learning, which comprises the following specific steps:
step 1, processing unbalanced time series data;
and 1.1, processing a few kinds of data sets in the training data by adopting a sampling method. Let the data set be denoted as Q { Q }j(mj,nj) J ═ 1,2, …, u), where mjDenotes the time of the jth sample, njRepresents the signal value of the j-th sample and u represents the total amount of data in the data set. In order to ensure that the distribution state of the data set is unchanged after unbalanced data processing, points in the data set are defined as the following 3 types:
aggregation points: in the point set distribution, points that are distributed in the center of the point set and exhibit an aggregation state are distributed.
Critical point: in the point set distribution, points scattered at the edge of the range where the aggregation points are aggregated and limiting the range of the distribution of the aggregation points are called critical points.
Isolated points: in the point set distribution, the points are scattered at positions far from the aggregation range of the aggregation point, are located outside the edge formed by the critical points, and are in an isolated state.
Fig. 1 illustrates the distribution of 3 points.
In order to maintain the distribution state of the data set, a fuzzy clustering algorithm is adopted to carry out on the data set Q { Q }j(mj,nj) Clustering (j ═ 1,2, …, u), samples in the dataset are divided into 3 subsets: set of isolated points Q1{q1j(m1j,n1j)}(j=1,2,…,u1) Critical point set Q2{q2j(m2j,n2j)}(j=1,2,…,u2) And aggregation point set Q3{q3j(m3j,n3j)}(j=1,2,…,u3) Wherein u is1Denotes the number of outliers, u2Representing the number of critical points, u3Indicating the number of focal points, u1+u2+u3U, the clustering centers of the isolated point set, the critical point set and the aggregation point set obtained by the clustering algorithm are respectively: r1(m′1,n′1)、R2(m′2,n′2)、R3(m′3,n′3)。
Step 1.2, generating a new sample;
order to
Figure BDA0003061926390000081
Set of presentation points
Figure BDA0003061926390000082
Middle j1Sample point to cluster center R1(m′1,n′1) The distance of (a) to (b),
Figure BDA0003061926390000083
set of presentation points
Figure BDA0003061926390000084
Middle j2Sample point to cluster center R2(m′2,n′2) The distance of (a) to (b),
Figure BDA0003061926390000085
set of presentation points
Figure BDA0003061926390000086
Middle j3Sample point to cluster center R3(m′3,n′3) The distance of (c). Then
Figure BDA0003061926390000087
For point sets
Figure BDA0003061926390000088
All of them are the sameThis point q (m, n), this sample point-to-point set
Figure BDA0003061926390000089
Cluster center R of1(m′1,n′1) Is denoted as a, a ═ n-n'1Searching for all sample points satisfying equation (2):
Figure BDA00030619263900000810
and sequencing according to the morning and evening sequence of the time components, and recording the result as:
q11(m11,n11),q12(m12,n12),…,q1g(m1g,n1g) (3);
wherein g represents a set of points
Figure BDA0003061926390000091
The number of sample points satisfying the expression (2).
At samples q (m, n) and q11(m11,n11),q12(m12,n12),…,q1g(m1g,n1g) Respectively carrying out random linear interpolation between the signal component values to construct the signal component value of a new sample
Figure BDA0003061926390000092
Figure BDA0003061926390000093
Where rand (0,1) represents a random number within the interval (0, 1).
Constructing time component values of new samples
Figure BDA0003061926390000094
Figure BDA0003061926390000095
Wherein m is1h(h-1, 2, …, g) is sample q11(m11,n11),q12(m12,n12),…,q1g(m1g,n1g) To finally obtain a newly generated sample as
Figure BDA0003061926390000096
Point set
Figure BDA0003061926390000097
Until all sample points are traversed.
Point set
Figure BDA0003061926390000098
And
Figure BDA0003061926390000099
respectively repeating the point set
Figure BDA00030619263900000910
All procedures for producing a new sample (Point set)
Figure BDA00030619263900000911
Specific process and point set for generating new sample
Figure BDA00030619263900000912
The process of generating new samples is the same), all new generated samples are obtained, and the new sample points are combined into the original data set to generate a new data set
Figure BDA00030619263900000913
U represents the total amount of data in the newly generated data set after unbalanced data processing.
Step 2, preprocessing data;
step 2.1, scale transformation;
for dataCollection
Figure BDA00030619263900000914
Wherein m isjTime stamp representing the jth sample, njRepresents the signal value of the j-th sample and U represents the total amount of data in the data set. The inconsistency of the data dimension affects the speed of the network learning, and in order to avoid the influence, the signal value needs to be subjected to scale transformation so as to realize dimension consistency. Is provided with
Figure BDA00030619263900000915
Representing the scaled signal value of the jth sample, and
Figure BDA0003061926390000101
wherein,
Figure BDA0003061926390000102
step 2.2, data segmentation;
time series data are mostly long sequences with time stamps, and signal values have a timing dependency. In order to enable the network model to learn this feature of time series data, thereby preserving the time-series dependency of the time series data, we divide the data into fixed-size segments. A sliding window of overlapping segments is used, i.e. the window function w has a window length T, and the sequence is divided into equally spaced time sequence segments with a fixed step T shift. The set of time-series fragments after segmentation is denoted by L, LiRepresenting the ith time sequence segment after segmentation, U is the total amount of data in the data set,
Figure BDA0003061926390000104
the number of segments after segmentation. Then
Figure BDA0003061926390000105
The range of each segment is:
Figure BDA0003061926390000103
step 3, a deep convolution neural network model;
the method constructs a deep convolutional neural network model which comprises an input layer, 4 hidden layers, 1 fully-connected layer and a multi-layer perceptron, and softmax is used as a classifier. Model structure as shown in fig. 2, the hidden layer is used for feature extraction, and includes convolutional layer C1, pooling layer S2, convolutional layer C3 and pooling layer S4, two important operations of convolution and pooling, and the softmax classifier is mainly used for time series classification. The working process of the network model is described in detail below.
An input layer: time series data segments with the length of T and obtained after scale transformation and time slicing processing
Figure BDA0003061926390000111
Input into the network model.
We will describe the working process of the hidden layer by taking any time sequence segment l as an example.
Layer C1: the method of the invention uses a gaussian convolution kernel:
Figure BDA0003061926390000112
where σ denotes the convolution width, the radial range of action of the control function, and in this method we have found by experiment that σ is 0.1 optimal.
Let C1 layer have v1Each size is n1Of the convolution kernel
Figure BDA0003061926390000113
V is generated through convolution with C1 layers1Length of c1Feature vector of
Figure BDA0003061926390000114
Figure BDA0003061926390000115
c1=t-n1+1
Wherein,
Figure BDA0003061926390000116
representing a feature vector, c1Representing feature vectors
Figure BDA0003061926390000117
Length of (v)1The number of the feature vectors is represented,
Figure BDA0003061926390000118
represents the bias of the C1 layer, conv (-) represents the convolution function, ReLU (-) represents the activation function.
Layer S2: assume that the S2 level has a step size of l2Size a2Pooling of windows, then feature vectors
Figure BDA0003061926390000119
V is generated after S2 layers1Size of c2Feature vector of
Figure BDA00030619263900001110
Figure BDA00030619263900001111
c2=(t-n1+1-a2)/t2+1
Wherein,
Figure BDA00030619263900001112
representing a feature vector, c2Representing feature vectors
Figure BDA00030619263900001113
Length of (v)1The number of the feature vectors is represented,
Figure BDA00030619263900001114
represents the sharing weight of the S2 level,
Figure BDA00030619263900001115
represents the bias of the S2 layer, D (-) represents the downsampling function, ReLU (-) represents the activation function.
Layer C3: assume that the C3 layer has v3Each size is n3Of the convolution kernel
Figure BDA0003061926390000121
The feature vector W obtained at the S2 levels 2V is generated by C3 layer convolution3Size of c3Feature vector of
Figure BDA0003061926390000122
Figure BDA0003061926390000123
c3=(t-n1+1-a2)/l2-n3+2
Wherein,
Figure BDA0003061926390000124
representing a feature vector, c3Representing feature vectors
Figure BDA0003061926390000125
Length of (v)3The number of the feature vectors is represented,
Figure BDA0003061926390000126
represents the bias of the Con3 layer, conv (-) represents the convolution function, and ReLU (-) represents the activation function.
Layer S4: assume that the S4 level has a step size of l4Size a4The pooling window of (1), then the feature vector obtained by the C3 layer
Figure BDA0003061926390000127
After passing through S4 layer, v is generated3Size of c4Feature vector of
Figure BDA0003061926390000128
Figure BDA0003061926390000129
c4=(t-n1+1-a2-n3l2+2l2-a4l2)/l2l4+1
Wherein,
Figure BDA00030619263900001210
representing a feature vector, c3Representing feature vectors
Figure BDA00030619263900001211
Length of (v)3The number of the feature vectors is represented,
Figure BDA00030619263900001212
represents the sharing weight of the S4 level,
Figure BDA00030619263900001213
represents the bias of the S4 layer, D (-) represents the downsampling function, ReLU (-) represents the activation function.
Rasterization: finally will be
Figure BDA00030619263900001214
Sequentially generating a one-dimensional vector
Figure BDA00030619263900001215
Length c5As shown in equation (13).
Figure BDA00030619263900001216
MP5 layer: the MP5 layer is a multi-layer perceptron that maps one set of vectors to another. We use here a three-layer sensor: one input layer, one hidden layer, and one output layer. Features after rasterization
Figure BDA00030619263900001217
Inputting the MP5 layer, performing feature mapping in hidden layer with its neuron number of o and v5(o=1,2,...,v5) For the binary problem, the number of neurons in the output layer is 2 (i.e., equation (15); where r is 1 and 2).
Figure BDA0003061926390000131
Wherein,
Figure BDA0003061926390000132
represents the weight of the hidden layer in the MLP,
Figure BDA0003061926390000133
represents the bias of the hidden layer in MLP and tanh (-) represents the tanh activation function.
Figure BDA0003061926390000134
Wherein
Figure BDA0003061926390000135
Represents the weight of the output layer in the MLP,
Figure BDA0003061926390000136
indicating the bias of the output layer in MLP.
And (3) network output: the convolutional neural network finally uses a softmax classifier to carry out logistic regression, and the probability value P of the output signal belonging to the category 1 (normal value) or 2 (abnormal value)rWhere r is 1, 2.
Figure BDA0003061926390000137
Step 4, classifying;
training the data set using a trained convolutional neural network model, outputting the probability that each time slice belongs to each class, using cross entropy (cross) as a cost function (see equation (17)):
H=-∑yklog pk (17)
ykindicates the desired tag type, pkIs the actual output.
And performing error minimization training by taking an adaptive learning rate optimization algorithm AdamaOptizer as a back propagation training algorithm to obtain an optimal weight parameter, and establishing an optimal time series data classification model for time series classification.
Examples
An experiment platform: the deep learning platform adopted in the experiment is tensiorflow1.3.0, the interface is python3.5, and the computer hardware is configured to be an i7 processor, an 8GB installation memory and a 64-bit operating system.
Data set: and taking the rotating speed data and the temperature data of certain equipment in the actual engineering as experimental data.
Data set 1: rotational speed data for a device. The training data set contains 140281 signal values, of which there are 35707 outlier data values; in the test data set, the balanced data set a1 contains 5312 signal values, where there are 2656 abnormal data; the unbalanced data set B1 contains 1087 signal values, 170 of which are anomalous data.
Data set 2: temperature data for a device. The training data set contains 50001 signal values, with 3901 anomalous data values; in the test data set, the equilibrium data set a2 contained 9615 signal values, with 4807 anomalous data; the unbalanced data set B2 contains 9158 signal values, of which there are 2313 outliers.
In the experiment, when supervised training is carried out, the label of the normal value is marked as 1, and the label of the abnormal value is marked as 0.
1. Setting the number of hidden layers;
in order to establish an optimal convolutional neural network structure, the classification performance of convolutional neural network models of different hidden layers on an experimental data set is explored through experiments.
Firstly, processing a data set 1 and a data set 2 by using the unbalanced time series data processing algorithm described in the step 1, secondly, carrying out scale transformation and time slicing operation on the processed data sets, then sending the processed data sets into convolutional neural network models of different hidden layers for training, and then testing on a data set a to obtain the identification precision and the training loss value of the network models of different hidden layers.
Table 1 and table 2 are specific parameter settings of the hidden layer when the network model is trained using data set 1 and data set 2, respectively.
TABLE 1 parameter settings when training a network with dataset 1
Figure BDA0003061926390000151
Table 2 parameter settings when training a network using dataset 2
Figure BDA0003061926390000152
For data set 1, the period of the time series data is 150 time stamps, and the length 150 of one period is taken as an input length. Characteristic dimensions obtained by finally learning in the fully-connected layers in the network structures of different hidden layers are 3600, iteration times are 1000, and experimental results show that the classification identification precision is highest when the number of the hidden layers is 4. For data set 2, the period of the time-series data is 326 time stamps, and the length 163 of a half period is taken as an input length. The feature dimensions obtained by learning at the full-connection layer in the network structures of different hidden layers are 6000, the iteration times are 1000, and the experimental result shows that the classification identification precision is highest when the number of the hidden layers is 4.
Fig. 3(a) and (b) show the classification accuracy acc and the training loss respectively of the convolutional neural network model of four structures obtained by training with the data set 1 (fig. 3(a) corresponds to the data set 1) and the data set 2 (fig. 3(b) corresponds to the data set 2), wherein the vertical axis of the coordinate on the left represents the variation value of the training loss, and the vertical axis of the coordinate on the right represents the classification accuracy on the test set (the data sets a1 and a 2). The training loss curves of the four structures tend to zero at different speeds, so that the constructed convolutional network has no overfitting phenomenon in the learning process and has better generalization capability for the learning of time series data. The classification precision reaches more than 90% after 1000 times of training and iteration on the data set 1, and the classification precision of the convolutional neural network model with 4 hidden layers is stable to the first after 400 times of iteration, so that the method has better classification performance. The classification accuracy of the 4 network models on the data set 2 has fluctuation of different degrees, wherein the classification accuracy of the network model with 4 hidden layers vibrates violently in the interval of iteration times of [0,100], and then is slowly improved, and the classification accuracy of the other three network models reaches higher classification accuracy after 1000 times of iteration. Combining the above results, the present invention determines a convolutional neural network model containing 4 hidden layers for time series data classification.
2. Evaluating the index;
the invention uses the classification precision and the confusion matrix to evaluate the performance of the method, and the indexes are defined as follows.
(1) And (3) classification precision: acc ═ N '/N'
(18);
Where N 'represents the correctly classified time series segments in the test data set and N' represents the total number of time series segments in the test data set.
The confusion matrix, also called error matrix, is a standard format for representing the accuracy evaluation, and is represented in a matrix form. For the binary problem, it is finally necessary to determine whether the result of the sample is 0 or 1, or "positive" or "negative". Four basic indicators can be defined, called primary indicators (bottom-most):
the True value is "Positive", and the number of time series data fragments classified as "Positive" by the model is marked as True Positive (TP);
the true value is "positive", and the number of time series data fragments classified as "Negative" by the model is marked as true positive (FN);
the true value is negative, and the number of time series data fragments classified as Positive by the model is marked as true Positive (FP);
the True value is "Negative", and the number of time-series data fragments classified as "Negative" by the model is marked as True positive (TN);
the 4 indices were used to generate a Confusion Matrix (fusion Matrix):
TABLE 3 confusion matrix
Figure BDA0003061926390000171
3. Performance evaluation;
in order to perform performance analysis on the proposed time series data classification model, tests are firstly performed on both classification accuracy and a confusion matrix, and finally comparison is performed with a typical time series data classification algorithm in the field of fault diagnosis.
In comparison of recognition accuracy, the validation set is tested using a balanced data set (data set a), and the validation set is tested using an unbalanced data set (data set B) in the calculation of the confusion matrix.
Fig. 4 shows the classification accuracy acc and loss results of the CNN model trained with the datasets before and after processing of unbalanced data, respectively. Fig. 4(a) shows the experimental result on the data set 1, where the blue line represents the classification result of the convolutional neural network model obtained by training the original data set, and the red line represents the result on the data set processed by the unbalanced data, which is obviously improved significantly. When the CNN model trained by the data set after unbalanced data processing is used for classification, the classification precision reaches over 90% after 200 iterations, and reaches 98.633% after 1000 iterations; when the CNN model trained by the original data set is classified, the classification precision is gradually improved in an unstable state after 600 iterations, but the classification precision does not reach 80%, and the classification precision reaches 87.402% after 1000 iterations. When the data set after unbalanced data processing is used for training the CNN model, the loss value is converged at a higher speed, and the loss value is reduced to 0.00548 when the data set is iterated for 1000 times; when the CNN model is trained by using the original data set, the loss value is reduced slowly, and the loss value is reduced to 0.244 after 1000 iterations. Fig. 4(b) shows the experimental results on the data set 2, wherein the blue line is the classification result of the convolutional neural network trained on the original data set, and the red line is the result on the data set after the unbalanced data processing, and it is obvious that the experimental results after the unbalanced data processing are significantly improved. When a CNN model obtained by training a data set after unbalanced data processing is used for classification, the classification precision reaches over 90% after 200 iterations, and reaches 96.48% after 1000 iterations; when the CNN model trained by the original data set is used for classification, the classification precision does not reach about 76% after 200 times of iteration, the classification precision is in an unstable state after 600 times of iteration, and the classification precision is not obviously improved after 1000 times of iteration. When a CNN model trained by the data set after unbalanced data processing is used for training, the loss value is converged at a higher speed, and the loss value is reduced to 0.000054 when the CNN model is iterated for 1000 times; when the CNN model is trained by using the original data set, the loss value is slowly reduced when the number of iterations is below 600, and the loss value is reduced to 0.00063 after 1000 iterations. The experimental results on the two data sets are integrated, the learning of the network model on the original data set excessively depends on the training data, so that the classification precision is low, the unbalanced data processing algorithm makes up the defects, the distribution difference among the data is reduced, the learning capability of the classifier on abnormal data is enhanced, and the classification performance of the classifier is further improved.
TABLE 4 confusion matrix (%)
Figure BDA0003061926390000191
TABLE 5 confusion matrix (%)
Figure BDA0003061926390000192
TABLE 6 confusion matrix (%)
Figure BDA0003061926390000201
TABLE 7 confusion matrix (%)
Figure BDA0003061926390000202
For the unbalanced data set, the classification accuracy of the network model has certain limitation, abnormal data are easily mistakenly classified into normal data, and after the unbalanced data are processed in the data set, the learning capacity of the model for the abnormal data is improved, and the error rate is reduced. Therefore, the unbalanced time series data processing algorithm provided by the invention has a better correction effect on the unbalanced data set classification, and the good performance of the proposed time series data classification model is proved.
TABLE 8 Classification accuracy of different algorithms on dataset 1
Figure BDA0003061926390000203
TABLE 9 Classification accuracy of different algorithms on dataset 2
Figure BDA0003061926390000211
Tables 8 and 9 show the classification results of the different time series data classification algorithms on dataset 1 and dataset 2, respectively. The experiments are respectively carried out on an original data set and a data set processed by unbalanced data, the feature extraction algorithm in the comparison method is Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), and the classifier is a Support Vector Machine (SVM) and a Neural Network (NN). No matter which classifier is used, the classification precision of the data set after unbalanced data processing is obviously improved, compared with other classifiers, the method provided by the invention does not need to combine feature extraction and the classifiers, but completes the autonomous learning of features and classification at one time, and has better adaptability to the change of data regularity.
The invention provides an unbalanced time sequence data classification method based on autonomous learning from the viewpoint of data driving aiming at unbalanced time sequence data. The method mainly comprises two stages of unbalanced data processing and time series data classification. In the unbalanced data processing stage, a sampling method is adopted to divide a few types of samples into three types of aggregation points, critical points and isolated points, and then the time stamps and signal values are interpolated in each type. In the time series data classification stage, the invention constructs a convolutional neural network model with 4 hidden layers, and realizes feature extraction and classification by utilizing the autonomous feature mapping capability of the convolutional neural network. The method solves the problem that the detection precision of a few classes is seriously reduced because a general learning model is absolutely biased to the majority classes, and remarkably improves the classification precision of the unbalanced time sequence data set.

Claims (7)

1. An unbalanced time series data classification method based on autonomous learning is characterized in that: the method specifically comprises the following steps:
step 1, processing the unbalanced time sequence data to construct a new sample;
step 2, sequentially carrying out scale transformation and data segmentation on the new sample constructed in the step 1;
step 3, constructing a deep convolutional neural network model based on the result obtained in the step 2;
and 4, training the neural network model constructed in the step 3, and establishing an optimal time series data classification model according to the training result to perform time series classification.
2. The method for classifying unbalanced time-series data based on autonomous learning according to claim 1, wherein: the specific process of the step 1 is as follows:
step 1.1, let the data set be denoted as Q { Q }j(mj,nj) J-1, 2, …, u, where mjDenotes the time of the jth sample, njSignal value representing the jth sample, u representing the total number of data in the data set; in order to ensure that the distribution state of the data set is unchanged after unbalanced data processing, points in the data set are defined as the following 3 types: aggregation points, critical points, isolated points;
and step 1.2, generating a new sample according to the data set obtained in the step 1.1.
3. The method for classifying unbalanced time-series data based on autonomous learning according to claim 2, wherein: the specific process of the step 1.1 is as follows:
in order to maintain the distribution state of the data set, a fuzzy clustering algorithm is adopted to carry out on the data set Q { Q }j(mj,nj) 1,2, …, u clustering, dividing samples in the data set into 3 subsets: set of isolated points Q1{q1j(m1j,n1j)},j=1,2,…,u1Critical point set Q2{q2j(m2j,n2j)},j=1,2,…,u2And aggregation point set Q3{q3j(m3j,n3j)},j=1,2,…,u3Wherein u is1Denotes the number of outliers, u2Representing the number of critical points, u3Indicating the number of focal points, u1+u2+u3U, the clustering centers of the isolated point set, the critical point set and the aggregation point set obtained by the clustering algorithm are respectively: r1(m′1,n′1)、R2(m′2,n′2)、R3(m′3,n′3)。
4. The method for classifying unbalanced time-series data based on autonomous learning according to claim 3, wherein: the specific process of the step 1.2 is as follows:
step 1.2.1, order
Figure FDA0003061926380000021
Set of presentation points
Figure FDA0003061926380000022
Middle j1Sample point to cluster center R1(m′1,n′1) The distance of (a) to (b),
Figure FDA0003061926380000023
set of presentation points
Figure FDA0003061926380000024
Middle j2Sample point to cluster center R2(m′2,n′2) The distance of (a) to (b),
Figure FDA0003061926380000025
set of presentation points
Figure FDA0003061926380000026
Middle j3Sample point to cluster center R3(m′3,n′3) A distance of (1) to
Figure FDA0003061926380000027
Step 1.2.2, for point sets
Figure FDA0003061926380000028
A certain sample point q (m, n), the sample point q (m, n) to the point set
Figure FDA0003061926380000029
Cluster center R of1(m′1,n′1) Is denoted as a, a ═ n-n'1L, search for all sample points of the following equation (2):
Figure FDA00030619263800000210
and sequencing according to the morning and evening sequence of the time components, and recording the result as:
q11(m11,n11),q12(m12,n12),…,q1g(m1g,n1g) (3);
wherein g represents a set of points
Figure FDA00030619263800000211
The number of sample points satisfying equation (2).
At samples q (m, n) and q11(m11,n11)、q12(m12,n12)、…、q1g(m1g,n1g) Respectively carrying out random linear interpolation between the signal component values to construct the signal component value of a new sample
Figure FDA00030619263800000212
As shown in the following equation (4):
Figure FDA00030619263800000213
wherein rand (0,1) represents a random number within the interval (0, 1);
constructing time component values of new samples
Figure FDA0003061926380000031
As shown in the following equation (5):
Figure FDA0003061926380000032
wherein m is1hH is 1,2, …, g is sample q11(m11,n11)、q12(m12,n12)、…、q1g(m1g,n1g) To finally obtain a newly generated sample as
Figure FDA0003061926380000033
Step 1.2.3, repeatedly executing step 1.2.2 until the point set is traversed
Figure FDA0003061926380000034
All sample points in (a);
step 1.2.4, respectively aligning point sets
Figure FDA0003061926380000035
And
Figure FDA0003061926380000036
performing as a set of points
Figure FDA0003061926380000037
The step 1.2.2-1.2.4, respectively obtaining a set of points
Figure FDA0003061926380000038
And
Figure FDA0003061926380000039
a new sample is generated;
step 1.2.5, merging the new sample obtained in step 1.2.3 and the new sample obtained in step 1.2.4 into the data set Q { Q } in step 1.1j(mj,nj)}J-1, 2, …, u, a new data set can be generated
Figure FDA00030619263800000310
U represents the total amount of data in the newly generated data set after unbalanced data processing.
5. The method for classifying unbalanced time-series data based on autonomous learning according to claim 4, wherein: the specific process of the step 2 is as follows:
step 2.1, scale transformation;
for data sets
Figure FDA00030619263800000311
Wherein m isjTime stamp representing the jth sample, njSignal value representing the jth sample, U representing the total number of data in the data set;
is provided with
Figure FDA00030619263800000312
Representing the scaled signal value of the jth sample, and
Figure FDA00030619263800000313
wherein,
Figure FDA00030619263800000314
step 2.2, data segmentation;
dividing data into fixed-size segments, adopting a sliding window of overlapped segments, namely, the window length of a window function w is T, moving by a fixed step length T to divide the sequence into equally-spaced time sequence segments, expressing a set of segmented time sequence segments by L, and LiRepresenting the ith time sequence segment after segmentation, U is the total amount of data in the data set,
Figure FDA0003061926380000041
for the number of segments after segmentation, then
Figure FDA0003061926380000045
The range of each segment is:
Figure FDA0003061926380000042
6. the method for classifying unbalanced time-series data based on autonomous learning according to claim 5, wherein: the specific process of the step 3 is as follows:
constructing a deep convolutional neural network model, wherein the model comprises an input layer, 4 hidden layers, 1 fully-connected layer, a multi-layer perceptron and a classifier softmax;
the hidden layer comprises a convolutional layer C1, a pooling layer S2, a convolutional layer C3 and a pooling layer S4;
an input layer: time series data fragment { l with length of T obtained after scale transformation and time slicing processingi},
Figure FDA0003061926380000043
Inputting into a network model;
the deep convolutional neural network finally uses a softmax classifier to carry out logistic regression, and the probability value P of the output signal belonging to the class 1 or 2r
Figure FDA0003061926380000044
Here, the category 1 indicates a normal value, and the category 2 indicates an abnormal value.
7. The method for classifying unbalanced time-series data based on autonomous learning according to claim 6, wherein: the specific process of the step 4 is as follows:
training the data set by using the convolutional neural network model obtained in the step 3, outputting the probability that each time slice belongs to each category, and using the cross entropy as a cost function, as shown in the following formula (9):
H=-∑yklog pk (9);
wherein, ykIndicates the desired tag type, pkIs the actual output;
and performing error minimization training by taking an adaptive learning rate optimization algorithm Adam Optimizer as a back propagation training algorithm to obtain an optimal weight parameter, and establishing an optimal time series data classification model according to the optimal weight parameter to perform time series classification.
CN202110515698.0A 2021-05-12 2021-05-12 Unbalanced time series data classification method based on autonomous learning Pending CN113220960A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110515698.0A CN113220960A (en) 2021-05-12 2021-05-12 Unbalanced time series data classification method based on autonomous learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110515698.0A CN113220960A (en) 2021-05-12 2021-05-12 Unbalanced time series data classification method based on autonomous learning

Publications (1)

Publication Number Publication Date
CN113220960A true CN113220960A (en) 2021-08-06

Family

ID=77094989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110515698.0A Pending CN113220960A (en) 2021-05-12 2021-05-12 Unbalanced time series data classification method based on autonomous learning

Country Status (1)

Country Link
CN (1) CN113220960A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327045A (en) * 2021-11-30 2022-04-12 中国科学院微电子研究所 Fall detection method and system based on category unbalanced signals
CN115374859A (en) * 2022-08-24 2022-11-22 东北大学 Method for classifying unbalanced and multi-class complex industrial data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327045A (en) * 2021-11-30 2022-04-12 中国科学院微电子研究所 Fall detection method and system based on category unbalanced signals
CN115374859A (en) * 2022-08-24 2022-11-22 东北大学 Method for classifying unbalanced and multi-class complex industrial data

Similar Documents

Publication Publication Date Title
Hsu et al. Multiple time-series convolutional neural network for fault detection and diagnosis and empirical study in semiconductor manufacturing
Chadha et al. Time series based fault detection in industrial processes using convolutional neural networks
CN113220960A (en) Unbalanced time series data classification method based on autonomous learning
CN111325264A (en) Multi-label data classification method based on entropy
Cheriguene et al. A new hybrid classifier selection model based on mRMR method and diversity measures
Bommert Integration of feature selection stability in model fitting
Nafis et al. Facial expression recognition on video data with various face poses using deep learning
Karankar et al. Comparative study of various machine learning classifiers on medical data
Li et al. A two-phase filtering of discriminative shapelets learning for time series classification
Gomiasti et al. Enhancing Lung Cancer Classification Effectiveness Through Hyperparameter-Tuned Support Vector Machine
Dubey et al. Hybrid classification model of correlation-based feature selection and support vector machine
Liu et al. MRD-NETS: multi-scale residual networks with dilated convolutions for classification and clustering analysis of spacecraft electrical signal
Bandyopadhyay et al. Automated label generation for time series classification with representation learning: Reduction of label cost for training
Singh et al. Dimensionality reduction for classification and clustering
Akar et al. Open set recognition for time series classification
Oh et al. Multivariate time series open-set recognition using multi-feature extraction and reconstruction
Singh et al. SMOTE-LASSO-DeepNet Framework for Cancer Subtyping from Gene Expression Data
Tamura et al. Time series classification using macd-histogram-based recurrence plot
Bandyopadhyay et al. Hierarchical clustering using auto-encoded compact representation for time-series analysis
Chen et al. TimeMIL: Advancing Multivariate Time Series Classification via a Time-aware Multiple Instance Learning
Sengupta et al. A scoring scheme for online feature selection: Simulating model performance without retraining
Azmer et al. Comparative analysis of classification techniques for leaves and land cover texture.
Jiang et al. A novel feature extraction approach for microarray data based on multi-algorithm fusion
Baraniya et al. Breast Cancer Classification and Recurrence Prediction Using Artificial Neural Networks and Machine Learning Techniques
Li et al. CNN-LDNF: an image feature representation approach with multi-space mapping

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210806