CN107239789A - Industrial fault classification method for unbalanced data based on k-means - Google Patents
Industrial fault classification method for unbalanced data based on k-means

- Publication number: CN107239789A
- Application number: CN201710321424.1A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/23213—Non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. k-means clustering
- G06F18/214—Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
Abstract
The invention discloses a method for classifying unbalanced data based on k-means. The method first applies k-means to cluster the majority class according to the degree of imbalance, dividing it into N subclasses; together with the M minority classes, this yields an (M+N)-class multi-classification problem, which is then learned with a naive Bayes classifier. Compared with other existing methods, the invention retains the information of the original data to the greatest extent and better solves the classification of imbalanced class data while preventing over-fitting; classification accuracy is improved and the occurrence of over-fitting is reduced.
Description
Technical field
The invention belongs to the field of industrial process control, and in particular relates to an industrial process fault classification method for imbalanced class data.
Background technology
In industrial fault classification work, conventional classification techniques share a common premise: the amounts of data of the various classes in the training set are comparable. Reality is often otherwise; when one class has far more samples than the others, or one class has very few, i.e. when class-imbalanced data arise, directly applying traditional classification methods produces large errors.

In recent years, research on imbalanced class data has been a focus. Existing methods mainly approach the problem from two directions: the algorithm level and the sampling level; the present invention mainly improves conventional classification methods at the sampling level. Improved sampling methods fall broadly into two classes. One is over-sampling, i.e. resampling the minority class to balance the data; a major drawback of this approach is that it can inflate the data and produce over-fitting, so the practical effect is often unsatisfactory. The other is under-sampling, i.e. selecting part of the majority class according to some rule as training data and discarding the rest to balance the data; because part of the majority-class information is ignored, the trained classifier may lack precision. The advantage of the present invention is that, without changing the structure of the original data samples and without discarding or artificially adding sample data, a classifier with good performance is trained.
The content of the invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide an industrial fault classification method for unbalanced data based on k-means.

The object of the invention is achieved through the following technical solution: an industrial fault classification method for unbalanced data based on k-means, comprising the following steps:
(1) Collect data of the process under the normal operating condition and data of various faults to form the labeled training sample set for modeling. Assume there are C fault categories; with one normal class added, the total number of classes is C+1. Each class of sample data is $X_i \in R^{n_i \times m}$, i = 1, 2, ..., C+1, where n_i is the number of training samples, m is the number of process variables, and R is the set of real numbers. The complete labeled training sample set is then X = [X_1; X_2; ...; X_{C+1}]. Record the label information of all data: the label under the normal condition is 1, the label of fault 1 is 2, and so on, i.e. Y_i = [i, i, ..., i], i = 1, 2, ..., C+1, and the complete label set is Y = [Y_1; Y_2; ...; Y_{C+1}]. The normal-class data X_1 form the majority class and the remaining data form the minority classes; the degree of imbalance is N = 100, and the fault classes are assumed to contain similar amounts of data, i.e. n_2 ≈ n_3 ≈ ... ≈ n_{C+1}.
(2) Using the k-means clustering method, divide X_1 into N subsets whose sizes do not differ much, i.e. X_1 = [X_{11}; X_{12}; ...; X_{1N}], and assign new labels Y_1 = [Y_{11}; Y_{12}; ...; Y_{1N}].

(3) Combine the N subclasses from (2) with the C fault classes as the training set of an (N+C)-class multi-classification problem, and build a classifier using the naive Bayes method.

(4) Test the classifier from (3) on a test set, and classify all samples whose labels belong to Y_1 as the normal class.
The beneficial effect of the invention is that, by clustering the majority class, i.e. after processing the data samples, the method better solves the classification of unbalanced data. It neither changes the internal structure of the data nor adds or removes data, so the characteristic information of the original samples is preserved to the greatest extent; compared with other methods, classification accuracy is improved and the occurrence of over-fitting is reduced.
Brief description of the drawings
Fig. 1 is a schematic diagram of the results of naive Bayes applied directly;
Fig. 2 is a schematic diagram of the results of naive Bayes based on k-means.
Embodiment
The present invention addresses the fault classification problem of industrial processes. The method first uses k-means to cluster the majority class according to the degree of imbalance, dividing it into N subclasses; together with the M minority classes this forms an (M+N)-class multi-classification problem, which is finally learned with a naive Bayes classifier.
The key steps of the technical solution adopted by the present invention are as follows:
First step: collect data of the process under the normal operating condition and data of various faults to form the labeled training sample set for modeling. Assume there are C fault categories; with one normal class added, the total number of classes is C+1. Each class of sample data is $X_i \in R^{n_i \times m}$, i = 1, 2, ..., C+1, where n_i is the number of training samples, m is the number of process variables, and R is the set of real numbers. The complete labeled training sample set is then X = [X_1; X_2; ...; X_{C+1}]. Record the label information of all data: the label under the normal condition is 1, the label of fault 1 is 2, and so on, i.e. Y_i = [i, i, ..., i], i = 1, 2, ..., C+1, and the complete label set is Y = [Y_1; Y_2; ...; Y_{C+1}]. The normal-class data X_1 form the majority class and the remaining data form the minority classes; the degree of imbalance is N = 100, and the fault classes are assumed to contain similar amounts of data, i.e. n_2 ≈ n_3 ≈ ... ≈ n_{C+1}.
Second step: using the k-means clustering method, divide X_1 into N subsets whose sizes do not differ much, i.e. X_1 = [X_{11}; X_{12}; ...; X_{1N}], and assign new labels Y_1 = [Y_{11}; Y_{12}; ...; Y_{1N}].
(a) To divide X_1 into N classes, choose N suitable initial mean vectors, one per class, generally by selecting N samples at random, i.e. {q_1, q_2, ..., q_N}, with q_a = [q_{a1}, ..., q_{am}], a = 1, 2, ..., N.

(b) For each sample x_j = [p_{j1}, ..., p_{jm}], j = 1, 2, ..., n_1, compute its distance to each of the mean vectors; the Euclidean distance between the j-th sample and the a-th mean vector is
$$d_{ja}=\sum_{k=1}^{m}\left(p_{jk}-q_{ak}\right)^{2}$$
where j = 1, 2, ..., n_1 and a = 1, 2, ..., N. For sample x_j, if d_{ja} is minimal, x_j is assigned to class a.
(c) To avoid cluster results whose sizes differ greatly, which would defeat the purpose of the clustering, a threshold K is added in (b): once the number of samples in class a reaches K, d_{ja} is removed from the comparisons for the rest of this round, so that no further samples are added to class a until the next round.
(d) After G iterations, N subclasses are obtained, i.e. X_1 = [X_{11}; X_{12}; ...; X_{1N}], and the sample labels of the subclasses are replaced in turn by 1, 2, ..., N, giving Y_1 = [1, 2, ..., N]. At the same time the labels of the fault classes are changed in turn so that Y_b = [b, b, ..., b], where b = N+1, N+2, ..., N+C. The training set is then X = [X_1; X_2; ...; X_{N+C}], with $X_i \in R^{n_i \times m}$, i = 1, 2, ..., C+N, where n_i is the number of samples of the i-th class; likewise each sample is x_j = [p_{j1}, ..., p_{jm}], for every class i = 1, 2, ..., C+N.
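Steps (a)-(d) of the thresholded k-means can be sketched as follows. This is a minimal illustration, assuming numpy; the function name `capped_kmeans`, the toy data, and the randomized processing order are illustrative choices, not from the patent:

```python
import numpy as np

def capped_kmeans(X, N, K, G=20, seed=0):
    """Split majority-class samples X (n x m) into N subclusters,
    refusing to add more than K samples to any one cluster per round."""
    rng = np.random.default_rng(seed)
    # (a) pick N random samples as the initial mean vectors
    centers = X[rng.choice(len(X), size=N, replace=False)]
    for _ in range(G):
        counts = np.zeros(N, dtype=int)
        labels = np.empty(len(X), dtype=int)
        # (b)-(c) assign each sample to its nearest cluster that is not full
        for j in rng.permutation(len(X)):
            d = ((X[j] - centers) ** 2).sum(axis=1)  # squared Euclidean, Eq. (1)
            d[counts >= K] = np.inf                  # full clusters are skipped
            a = int(np.argmin(d))
            labels[j] = a
            counts[a] += 1
        # (d) recompute each cluster's mean vector
        for a in range(N):
            if counts[a] > 0:
                centers[a] = X[labels == a].mean(axis=0)
    return labels, centers

# toy majority class: two blobs, split into 4 capped subclusters
X = np.vstack([np.random.default_rng(1).normal(c, 0.1, (50, 2)) for c in (0.0, 1.0)])
labels, _ = capped_kmeans(X, N=4, K=30)
print(np.bincount(labels, minlength=4).max() <= 30)  # → True (no subset exceeds K)
```

The cap K plays the role of the patent's threshold: it keeps the N subclasses roughly balanced so the later naive Bayes priors are comparable.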
Third step: combine the N subclasses from the second step with the C fault classes as the training set of an (N+C)-class multi-classification problem, and build the classifier using the naive Bayes method.
(a) Compute the mean Mean_{ic} and variance Var_{ic} of each dimension c within each class i, together with the prior probability p_i of each class:
$$Mean_{ic}=\frac{1}{n_i}\sum_{t=1}^{n_i}p_{tc}$$
$$Var_{ic}=\frac{1}{n_i}\sqrt{\sum_{t=1}^{n_i}\left(p_{tc}-Mean_{ic}\right)^{2}}$$
$$p_i=\frac{n_i}{\sum_{t=1}^{C+N}n_t}$$
where i = 1, 2, ..., C+N and c = 1, 2, ..., m.
(b) According to the naive Bayes classification principle, for each sample z_k = [z_{k1}, z_{k2}, ..., z_{km}] of a test set containing U samples, compute its posterior probability p_{ki} of belonging to each class:
$$p_{ki}=p_i\times\prod_{j=1}^{m}\frac{1}{\sqrt{2\pi}\,Var_{ij}}\,e^{-\frac{\left(z_{kj}-Mean_{ij}\right)^{2}}{2\,Var_{ij}^{2}}}$$
where k = 1, 2, ..., U and i = 1, 2, ..., C+N. According to the computed posterior probabilities, each sample is assigned the class label with the largest probability.
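The scoring above is essentially Gaussian naive Bayes. A minimal sketch follows, assuming numpy; note that it uses the conventional per-class sample variance in the density (the patent's variance formula scales a root-sum-of-squares instead) and works in log space for numerical stability:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class feature means, variances and priors, as in Eqs. (2)-(4)."""
    stats = {}
    for i in np.unique(y):
        Xi = X[y == i]
        stats[i] = (Xi.mean(axis=0), Xi.var(axis=0) + 1e-9, len(Xi) / len(X))
    return stats

def posterior(z, stats):
    """Return the class with the largest posterior, cf. Eq. (5)."""
    scores = {}
    for i, (mean, var, prior) in stats.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mean) ** 2 / var)
        scores[i] = np.log(prior) + log_lik  # log posterior up to a constant
    return max(scores, key=scores.get)

# toy two-class check: class 0 centered at 0, class 1 centered at 5
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(5, 1, (40, 3))])
y = np.array([0] * 40 + [1] * 40)
stats = fit_gaussian_nb(X, y)
print(posterior(np.array([5.0, 5.0, 5.0]), stats))  # → 1
```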
Fourth step: for the data classified by the classifier of the third step, change the labels of samples labeled 1 to N back to 1, i.e. the normal class, and change the labels of samples labeled N+1 to N+C to 2 through C+1 respectively, thus completing the test of the classifier.
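The fourth step's relabeling is a simple mapping from the (N+C) internal labels back to the original C+1 classes. A minimal sketch (`restore_labels` is an illustrative name, not from the patent):

```python
def restore_labels(pred, N):
    """Collapse subcluster labels 1..N back to the normal class (label 1)
    and shift fault labels N+1..N+C back to 2..C+1."""
    return [1 if p <= N else p - N + 1 for p in pred]

# with N = 100: subclusters 3 and 100 -> normal (1); 101 -> fault 1 (2); 104 -> fault 4 (5)
print(restore_labels([3, 100, 101, 104], N=100))  # → [1, 1, 2, 5]
```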
The effectiveness of the invention is illustrated below with the example of a specific industrial process. The data come from the Tennessee Eastman (TE) chemical process experiment of the USA, whose prototype is an actual process flow of the Eastman Chemical Company. At present, the TE process has been widely studied as a typical object for chemical process fault detection and diagnosis. The whole TE process includes 41 measured variables and 12 manipulated (control) variables; the 41 measured variables comprise 22 continuous process measurements and 19 composition measurements, sampled every 3 minutes. The data include 21 batches of fault data; among these faults, 16 are known and 5 are unknown. Faults 1-7 are related to step changes of process variables, such as changes in the cooling water inlet temperature or in the feed components. Faults 8-12 are associated with a marked increase in the variability of some process variables. Fault 13 is a slow drift in the reaction kinetics, and faults 14, 15 and 21 are related to sticking valves. Faults 16-20 are unknown. To monitor the process, 44 process variables were chosen in total, as shown in Table 1. The implementation steps of the invention are set forth below in combination with this process:
1. Collect normal data and 4 kinds of fault data as training sample data, and carry out data preprocessing and normalization. In this experiment the normal condition and faults 1, 2, 6 and 14 were selected as training samples. Faults 1 and 2 are composition changes in stream 4. Fault 6 is caused by the loss of the A feed in stream 1, which eventually affects the A composition in stream 4. Fault 14 relates to the product separator bottom flow. The sampling time is 3 min; the normal condition contains 1000 labeled samples, and 10 labeled samples were selected for each of the remaining fault classes.
2. Divide the normal-condition data samples into 100 classes by k-means, ensuring that the sizes of the classes do not differ much. Then, using the naive Bayes method, learn on the training set of 104 classes in total formed with the 4 classes of fault data.

3. Test online classification: samples classified into the first 100 classes are mapped back to the normal class, and the labels of the 4 fault classes are reset accordingly.
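Under the assumption that scikit-learn is available, the experiment can be sketched end to end. `KMeans` here stands in for the thresholded variant of the second step, and the data are synthetic stand-ins for the TE measurements; the sizes (1000 normal samples, 4 faults of 10 samples, 100 subclasses) follow the experiment above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
m = 5                                                         # toy number of process variables
X_norm = rng.normal(0, 1, (1000, m))                          # 1000 normal-condition samples
faults = [rng.normal(3 + i, 0.5, (10, m)) for i in range(4)]  # 4 faults, 10 samples each

# step 2: split the majority class into 100 subclasses (labels 1..100)
sub = KMeans(n_clusters=100, n_init=4, random_state=0).fit_predict(X_norm)

# step 3: train naive Bayes on the resulting 104-class problem
X = np.vstack([X_norm] + faults)
y = np.concatenate([sub + 1] + [np.full(10, 101 + i) for i in range(4)])
clf = GaussianNB().fit(X, y)

# step 4: at test time, any label in 1..100 counts as "normal"
pred = clf.predict(np.vstack([X_norm[:5], [[3.0] * m]]))
print(pred <= 100)  # first five (normal) samples map to normal; the last point is a fault
```

The 100 balanced subclasses keep the naive Bayes priors of the normal region comparable to the fault priors, which is the mechanism the invention relies on.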
Table 1: Description of the monitored variables

Variable No. | Measured variable | Variable No. | Measured variable |
---|---|---|---|
1 | A feed rate | 22 | Separator cooling water outlet temperature |
2 | D feed rate | 23 | A mole fraction in stream 6 |
3 | E feed rate | 24 | B mole fraction in stream 6 |
4 | A+C feed rate | 25 | C mole fraction in stream 6 |
5 | Recycle flow | 26 | D mole fraction in stream 6 |
6 | Reactor feed rate | 27 | E mole fraction in stream 6 |
7 | Reactor pressure | 28 | F mole fraction in stream 6 |
8 | Reactor level | 29 | A mole fraction in stream 9 |
9 | Reactor temperature | 30 | B mole fraction in stream 9 |
10 | Purge rate | 31 | C mole fraction in stream 9 |
11 | Product separator temperature | 32 | D mole fraction in stream 9 |
12 | Product separator level | 33 | E mole fraction in stream 9 |
13 | Product separator pressure | 34 | F mole fraction in stream 9 |
14 | Product separator bottom flow | 35 | G mole fraction in stream 9 |
15 | Stripper level | 36 | H mole fraction in stream 9 |
16 | Stripper pressure | 37 | D mole fraction in stream 11 |
17 | Stripper bottom flow | 38 | E mole fraction in stream 11 |
18 | Stripper temperature | 39 | F mole fraction in stream 11 |
19 | Stripper steam flow | 40 | G mole fraction in stream 11 |
20 | Compressor work | 41 | H mole fraction in stream 11 |
21 | Reactor cooling water outlet temperature | | |
The above embodiment is intended to illustrate the present invention rather than to limit it; within the spirit of the present invention and the scope of the claims, any modifications and changes made to the present invention fall within the protection scope of the present invention.
Claims (4)
1. An industrial fault classification method for unbalanced data based on k-means, characterized by comprising the following steps:

(1) Collect data of the process under the normal operating condition and data of various faults to form the labeled training sample set for modeling. Assume there are C fault categories; with one normal class added, the total number of classes is C+1. Each class of sample data is $X_i \in R^{n_i \times m}$, i = 1, 2, ..., C+1, where n_i is the number of training samples, m is the number of process variables, and R is the set of real numbers. The complete labeled training sample set is then X = [X_1; X_2; ...; X_{C+1}]; record the label information of all data. The label under the normal condition is 1, the label of fault 1 is 2, and so on, i.e. the label set Y_i = [i, i, ..., i], i = 1, 2, ..., C+1, and the complete label set is Y = [Y_1; Y_2; ...; Y_{C+1}]. The normal-class data X_1 form the majority class and the remaining data form the minority classes; the degree of imbalance is N = 100, and the fault classes are assumed to contain similar amounts of data, i.e. n_2 ≈ n_3 ≈ ... ≈ n_{C+1}.

(2) Using the k-means clustering method, divide X_1 into N subsets whose sizes do not differ much, i.e. X_1 = [X_{11}; X_{12}; ...; X_{1N}], and assign new labels Y_1 = [Y_{11}; Y_{12}; ...; Y_{1N}].

(3) Combine the N subclasses from step (2) with the C fault classes as the training set of an (N+C)-class multi-classification problem, and build a classifier using the naive Bayes method.

(4) Test the classifier from step (3) on a test set, and classify all samples whose labels belong to Y_1 as the normal class.
2. The industrial fault classification method for unbalanced data based on k-means according to claim 1, characterized in that step (2) is specifically: first choose suitable initial mean vectors in the normal class X_1 and compute the distance between each sample x_j, j = 1, 2, ..., n_1, and these mean vectors; determine the cluster mark λ_j of each sample according to its nearest mean vector, then recompute the mean vector of each cluster, and repeat the above procedure G times. To keep the sizes of the clusters in the final result roughly equal, a threshold K is designed during the iterations: once the sample count of a cluster reaches the threshold, no further samples are added to that cluster. The specific thresholded k-means method is as follows:

(2.1) To divide X_1 into N classes, choose N suitable initial mean vectors as the initial mean vector of each class, generally by selecting N sample values at random, i.e. {q_1, q_2, ..., q_N}, with q_a = [q_{a1}; ...; q_{am}], a = 1, 2, ..., N.
(2.2) Compute the distance between each sample and the N mean vectors; the Euclidean distance between the j-th sample and the a-th mean vector is
$$d_{ja}=\sum_{k=1}^{m}\left(p_{jk}-q_{ak}\right)^{2}\qquad(1)$$
where j = 1, 2, ..., n_1 and a = 1, 2, ..., N. For sample x_j, if d_{ja} is minimal, x_j is assigned to class a, i.e. λ_j = a.
(2.3) To avoid cluster results whose sizes differ greatly, which would defeat the purpose of the clustering, the threshold K is added in step (2.2): once the number of samples in class a reaches K, d_{ja} is removed from the comparisons for the rest of this round, so that no further samples are added to class a until the next round.
(2.4) After G iterations, N subclasses are obtained, i.e. X_1 = [X_{11}; X_{12}; ...; X_{1N}], and the sample labels of the subclasses are replaced in turn by 1, 2, ..., N, giving Y_1 = [1, 2, ..., N]. At the same time the labels of the fault classes are changed in turn so that Y_b = [b, b, ..., b], where b = N+1, N+2, ..., N+C. The training set is then X = [X_1; X_2; ...; X_{N+C}], with $X_i \in R^{n_i \times m}$, where n_i is the number of samples of the i-th class; likewise each sample is x_j = [p_{j1}, ..., p_{jm}].
3. The industrial fault classification method for unbalanced data based on k-means according to claim 1, characterized in that step (3) is specifically: compute the mean and variance of each dimension of the (N+C) classes; then, for each sample of the test set, compute its posterior probability of belonging to each class, choose the class with the largest posterior probability, and assign the sample the corresponding label. The specific steps are as follows:

(3.1) Compute the mean Mean_{ic} and variance Var_{ic} of each dimension c within each class i, together with the prior probability p_i of each class, by the following formulas:
$$Mean_{ic}=\frac{1}{n_i}\sum_{t=1}^{n_i}p_{tc}\qquad(2)$$
$$Var_{ic}=\frac{1}{n_i}\sqrt{\sum_{t=1}^{n_i}\left(p_{tc}-Mean_{ic}\right)^{2}}\qquad(3)$$
$$p_i=\frac{n_i}{\sum_{t=1}^{C+N}n_t}\qquad(4)$$
where i = 1, 2, ..., C+N and c = 1, 2, ..., m.
(3.2) According to the naive Bayes classification principle, for each sample z_k = [z_{k1}, z_{k2}, ..., z_{km}] of a test set containing U samples, compute its posterior probability p_{ki} of belonging to each class by the following formula:
$$p_{ki}=p_i\times\prod_{j=1}^{m}\frac{1}{\sqrt{2\pi}\,Var_{ij}}\,e^{-\frac{\left(z_{kj}-Mean_{ij}\right)^{2}}{2\,Var_{ij}^{2}}}\qquad(5)$$
where k = 1, 2, ..., U and i = 1, 2, ..., C+N. According to the computed posterior probabilities, the sample is assigned the class label with the largest probability.
4. The industrial fault classification method for unbalanced data based on k-means according to claim 1, characterized in that step (4) is specifically: for the data classified in step (3), the labels of data samples labeled 1 to N are changed back to 1, i.e. the normal class, and the labels of data samples labeled N+1 to N+C are changed to 2 through C+1 respectively, thus completing the test of the classifier.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710321424.1A CN107239789A (en) | 2017-05-09 | 2017-05-09 | A kind of industrial Fault Classification of the unbalanced data based on k means |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107239789A true CN107239789A (en) | 2017-10-10 |
Family
ID=59984939
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109086412A (en) * | 2018-08-03 | 2018-12-25 | 北京邮电大学 | A kind of unbalanced data classification method based on adaptive weighted Bagging-GBDT |
CN109978009A (en) * | 2019-02-27 | 2019-07-05 | 广州杰赛科技股份有限公司 | Behavior classification method, device and storage medium based on wearable intelligent equipment |
WO2019169700A1 (en) * | 2018-03-08 | 2019-09-12 | 平安科技(深圳)有限公司 | Data classification method and device, equipment, and computer readable storage medium |
CN110309885A (en) * | 2019-07-05 | 2019-10-08 | 黑龙江电力调度实业有限公司 | Computer room state judging method based on big data |
CN111240279A (en) * | 2019-12-26 | 2020-06-05 | 浙江大学 | Confrontation enhancement fault classification method for industrial unbalanced data |
CN111833171A (en) * | 2020-03-06 | 2020-10-27 | 北京芯盾时代科技有限公司 | Abnormal operation detection and model training method, device and readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030204507A1 (en) * | 2002-04-25 | 2003-10-30 | Li Jonathan Qiang | Classification of rare events with high reliability |
CN101980202A (en) * | 2010-11-04 | 2011-02-23 | 西安电子科技大学 | Semi-supervised classification method of unbalance data |
CN104951809A (en) * | 2015-07-14 | 2015-09-30 | 西安电子科技大学 | Unbalanced data classification method based on unbalanced classification indexes and integrated learning |
CN106444706A (en) * | 2016-09-22 | 2017-02-22 | 宁波大学 | Industrial process fault detection method based on data neighborhood feature preservation |
Non-Patent Citations (4)
Title |
---|
潘俊等: "基于推进的非平衡数据分类算法研究", 《计算机工程与应用》 * |
蹇涛,李宏,郭跃健: "结合代价敏感及多数类分解的非平衡分类", 《计算机工程与应用》 * |
阿曼: "朴素贝叶斯分类算法的研究与应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
齐雯: "大型风电场等值建模及其并网稳定性研究", 《中国优秀硕士学位论文全文数据库 工程科技Ⅱ辑》 * |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20171010 |