CN108197280B

CN108197280B - Mining ability evaluation method based on industrial equipment data

Info

Publication number: CN108197280B
Application number: CN201810023192.6A
Authority: CN
Inventors: 董亚明; 许伟; 马贺贺
Original assignee: Shanghai Electric Group Corp
Current assignee: Shanghai Electric Group Corp
Priority date: 2018-01-10
Filing date: 2018-01-10
Publication date: 2022-05-13
Anticipated expiration: 2038-01-10
Also published as: CN108197280A

Abstract

The invention provides a mineability evaluation method based on industrial equipment data, in which a data evaluation index library and a data evaluation index rule library are provided. The data evaluation index rule library comprises a plurality of data evaluation rules; the data evaluation index library comprises a plurality of data evaluation indexes, each corresponding to a weight coefficient. The method comprises: providing a data set to be evaluated; selecting the corresponding data evaluation indexes from the data evaluation index library according to the data set; obtaining, from the data evaluation index rule library, the data evaluation rules that correspond one-to-one to the data evaluation indexes; obtaining the weight coefficient corresponding to each data evaluation index; and establishing a data mineability evaluation model from the weight coefficient and the data evaluation rule corresponding to each data evaluation index. The technical scheme overcomes the defect of the prior art, which addresses only the quality evaluation of industrial data and lacks any evaluation of its mineability.

Description

Mining ability evaluation method based on industrial equipment data

Technical Field

The invention relates to the technical field of industrial equipment data processing, and in particular to a mineability evaluation method based on industrial equipment data.

Background

With the continuous development of economy, the informatization of enterprises is gradually realized, so that the enterprises can adapt to the constantly changing competitive environment of market economy, and the maximum economic benefit is obtained. Data in industrial equipment plays a very important role in the construction process of accelerating enterprise informatization, however, with the continuous aging of enterprise equipment, sensor failure, instability of a transmission network and other problems, the data quality problem becomes increasingly prominent. The problems are mainly reflected in the aspects of inaccurate data, incomplete data, inconsistent data, unreliable data and the like, and the data with poor quality becomes an important factor influencing the subsequent mining and analyzing work of the enterprise on the data.

Depending on the mining purpose, the mining and analysis of industrial equipment data mainly comprises data regression, data classification and data clustering. Evaluating the overall quality and the mineability of the data before mining and analysis begins helps enterprises and users understand the overall quality level and mineability level of the data, judge from the evaluation result whether the data set is suitable for subsequent data mining and analysis, and obtain guidance and suggestions for carrying out that work.

However, existing approaches pay attention only to data quality evaluation. Most of them are applied to information data governance, data transaction evaluation and similar tasks in social organizations such as government bodies, enterprises and public institutions, and the main problem they solve is reducing the workload of maintaining and managing data quality check rules. They do not address data mineability evaluation. Where evaluation is involved at all, it relies only on common data quality evaluation indexes, which are insufficient for subsequent data mining and analysis. Because indexes related to data mineability are lacking, the evaluation result cannot accurately guide the user's subsequent data mining and analysis work.

Disclosure of Invention

In view of the above problems in the prior art concerning the mineability evaluation of industrial equipment data, a mineability evaluation method is provided that applies both data quality evaluation rules and data mineability evaluation rules to industrial data, so that a user can learn the overall quality level and mineability level of the data.

The specific technical scheme is as follows:

a mining ability evaluation method based on industrial equipment data is provided, wherein a data evaluation index database and a data evaluation index rule database are provided;

the data evaluation index rule base comprises a plurality of data evaluation rules;

the data evaluation index library comprises a plurality of data evaluation indexes, and each data evaluation index corresponds to a weight coefficient;

the method specifically comprises the following steps:

step S1, providing a data set to be evaluated;

step S2, selecting the corresponding data evaluation index from the data evaluation index library according to the data set;

step S3, obtaining the data evaluation rules corresponding to the data evaluation indexes one to one from the data evaluation index rule base;

step S4, acquiring the weight coefficient corresponding to each data evaluation index;

step S5, establishing a data mineability evaluation model according to the weight coefficient corresponding to each data evaluation index and the data evaluation rule corresponding to each data evaluation index.

Preferably, the data set is a matrix of m × n;

where m represents the number of data pieces and n represents the number of variables.

Preferably, the data evaluation indexes in the data evaluation index library include a data quality evaluation index and a data mineability evaluation index;

the evaluation rules in the data evaluation index rule base comprise data quality evaluation rules and data mineability evaluation rules;

the evaluation rule corresponding to the data quality evaluation index is a data quality evaluation rule;

and the evaluation rule corresponding to the data mineability evaluation index is a data mineability evaluation rule.

Preferably, the data quality evaluation rule corresponding to the data quality evaluation index includes:

when the data quality evaluation index is an accuracy index, the corresponding data quality evaluation rule is an accuracy rate R₁, and the accuracy rate R₁ is calculated as follows:

wherein a₁ denotes the data anomaly rate, a₂ denotes the data non-compliance rate, h_o denotes the number of abnormal data items in the data set, h_c denotes the number of non-compliant data items in the data set, r denotes the total number of data items in the data set, and p denotes the number of indexes used,

and/or

when the data quality evaluation index is an integrity index, the corresponding data quality evaluation rule is an integrity rate R₂, and the integrity rate R₂ is calculated as follows:

wherein b₁ denotes the missing value rate, b₂ denotes the missing variable rate, b₃ denotes the missing timestamp rate, h_m denotes the number of missing data items in the data set, h_v denotes the number of missing variables in the data set, and h_t denotes the number of missing timestamps in the data set;

when the data quality evaluation index is a reliability index, the corresponding data quality evaluation rule is a reliability rate R₃, and the reliability rate R₃ is calculated as follows:

R₃ = 1 − c₁;

wherein c₁ denotes the data out-of-range rate and h_r denotes the number of data items in the data set that fall outside the permitted range,

and/or

when the data quality evaluation index is a redundancy index, the corresponding data quality evaluation rule is a non-redundancy rate R₄, and the non-redundancy rate R₄ is calculated as follows:

wherein cc_uv denotes the correlation between a variable u and a variable v in the data set, u_i denotes the value of the variable u in row i, ū denotes the mean of the variable u, σ_u denotes the standard deviation of the variable u, v_i denotes the value of the variable v in row i, v̄ denotes the mean of the variable v, σ_v denotes the standard deviation of the variable v, d₁ denotes the data repetition rate, d₂ denotes the variable correlation rate, d₃ denotes the invalid variable rate, h_q denotes the number of duplicate data items in the data set, and h_s denotes the number of invalid variables in the data set.
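The symbols listed for cc_uv (row values, means and standard deviations of u and v) match the usual Pearson correlation coefficient. The following is a minimal sketch under that assumption; the function name `pearson_correlation` and the sample columns are hypothetical, and no assumption is made here about how cc_uv enters d₂ or R₄.

```python
import math


def pearson_correlation(u, v):
    """Pearson correlation between two equally long columns u and v.

    Assumption: cc_uv in the text is this standard coefficient.
    """
    m = len(u)
    if m != len(v) or m < 2:
        raise ValueError("u and v must have the same length (at least 2)")
    mean_u = sum(u) / m
    mean_v = sum(v) / m
    # Sums of cross-products and squared deviations; the 1/m factors cancel.
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    if ss_u == 0 or ss_v == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(ss_u * ss_v)


# Hypothetical sample columns: v is an exact multiple of u, w is its mirror.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 8.0]
w = [8.0, 6.0, 4.0, 2.0]
cc_uv = pearson_correlation(u, v)
cc_uw = pearson_correlation(u, w)
```

A pair of variables whose |cc_uv| is close to 1 carries largely redundant information, which is presumably why the correlation appears under the redundancy index.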

Preferably, the data mineability evaluation rule corresponding to the data mineability index includes:

when the data mineability index is a regression index, the corresponding data mineability evaluation rule is a regressable rate R₅, and the regressable rate R₅ is calculated as follows:

R₅ = e₁;

wherein e₁ denotes the regression goodness of fit, y denotes the dependent variable in the data set, ŷ denotes the predicted value of y, y_i is the value of the dependent variable y in row i, ȳ is the mean of y, ŷ_i is the value of ŷ in row i, and $\bar{\hat{y}}$ is the mean of ŷ.

Preferably, the data minerability evaluation rule corresponding to the data minerability index includes:

when the data mineability index is a classification index, the corresponding data mineability evaluation rule is a classifiable rate R₆, and the classifiable rate R₆ is calculated as follows:

R₆ = f₁;

wherein f₁ denotes the classification accuracy, y denotes the class labels in the data set, ŷ denotes the predicted class of y, y_i is the value of y in row i, and ŷ_i is the value of ŷ in row i;

when the data mineability index is a clustering index, the corresponding data mineability evaluation rule is a clusterable rate R₇, and the clusterable rate R₇ is calculated as follows:

R₇ = g₁;

wherein g₁ denotes the clustering accuracy, k is the total number of classes of data in the data set, and V_i denotes the number of data items correctly clustered for the i-th class of samples.

Preferably, a plurality of scoring units are adopted to respectively obtain the score of each data evaluation index;

each of the scoring units is independent of the data evaluation index;

the weight coefficient corresponding to each scoring unit is obtained as follows:

wherein the mean and the standard deviation σ_i are taken over the scores given by the i-th scoring unit to the data evaluation indexes, CV_i denotes the coefficient of variation of the scoring results of the i-th scoring unit, WE_i denotes the normalized weight coefficient of the i-th scoring unit, and k_e denotes the number of scoring units.

Preferably, the method of obtaining the composite score of each data evaluation index according to the weighting coefficient corresponding to each scoring unit is as follows:

wherein SC_j denotes the composite score of the j-th data evaluation index, and S_ij denotes the score given by the i-th scoring unit to the j-th data evaluation index.

Preferably, the normalized weight coefficient of each data evaluation index is obtained according to the data evaluation index comprehensive score, as follows:

wherein w_j denotes the normalized weight coefficient of the j-th data evaluation index, and k_s denotes the number of data evaluation indexes.
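The text names the quantities of the weighting strategy (CV_i, WE_i, SC_j, w_j) without showing how they combine. The sketch below is one plausible reading and rests on several assumptions, none confirmed by the text: CV_i is the population standard deviation of a unit's scores divided by their mean, WE_i is CV_i divided by the sum of all CV_i, SC_j is the WE-weighted sum of the scores S_ij, and w_j is SC_j divided by the sum of all SC_j. The function name `index_weights` and the score matrix are hypothetical.

```python
from statistics import mean, pstdev


def index_weights(scores):
    """Derive normalized index weights from a score matrix.

    scores[i][j] is the score S_ij given by scoring unit i to index j.
    All formulas below are assumptions (see the lead-in), not the
    patent's confirmed definitions.
    """
    k_e = len(scores)        # number of scoring units
    k_s = len(scores[0])     # number of data evaluation indexes
    if any(len(row) != k_s for row in scores):
        raise ValueError("every scoring unit must score every index")

    # Assumed: CV_i = sigma_i / mean_i for each scoring unit.
    cv = []
    for row in scores:
        row_mean = mean(row)
        if row_mean == 0:
            raise ValueError("a scoring unit with mean score 0 has no CV")
        cv.append(pstdev(row) / row_mean)

    total_cv = sum(cv)
    if total_cv == 0:
        raise ValueError("all scoring units gave constant scores")
    # Assumed: WE_i is CV_i normalized to sum to 1 over the scoring units.
    we = [c / total_cv for c in cv]

    # Assumed: SC_j is the WE-weighted sum of the scores for index j.
    sc = [sum(we[i] * scores[i][j] for i in range(k_e)) for j in range(k_s)]

    # Assumed: w_j is SC_j normalized to sum to 1 over the indexes.
    total_sc = sum(sc)
    w = [s / total_sc for s in sc]
    return we, sc, w


# Hypothetical scores from 5 scoring units for 5 indexes, on a 0-100 scale.
scores = [
    [70, 45, 65, 50, 95],
    [60, 40, 70, 55, 90],
    [75, 50, 60, 45, 100],
    [65, 35, 75, 60, 85],
    [80, 55, 55, 40, 95],
]
we, sc, w = index_weights(scores)
```

Under these assumptions both WE and w sum to 1, which at least agrees with the worked embodiment, where the five index weights sum to 1.00.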

Preferably, the data mineability evaluation model is as follows:

$$S = (B_5 - B_1) \times \sum_{j=1}^{k_s} w_j R_j$$

where S is the data mineability composite score of the data set, R_j denotes the score of the j-th data evaluation index, w_j denotes its normalized weight coefficient, and B₅ and B₁ are set parameters.

The technical scheme has the following advantages or beneficial effects: both data quality evaluation rules and data mineability evaluation rules are applied to the industrial data, so that a user can learn the overall quality level and mineability level of the data and be accurately guided in the subsequent mining and analysis of the data. This overcomes the defect of the prior art, in which only the quality of industrial data is evaluated and mineability evaluation is lacking.

Drawings

Embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The drawings are, however, to be regarded as illustrative and explanatory only and are not restrictive of the scope of the invention.

Fig. 1 is a flowchart of an embodiment of a mineability evaluation method based on industrial equipment data according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.

The technical scheme of the invention comprises a method for evaluating the mineability based on the industrial equipment data.

An embodiment of a mining ability evaluation method based on industrial equipment data is provided, wherein a data evaluation index database and a data evaluation index rule database are provided;

as shown in fig. 1, the method specifically comprises the following steps:

step S1, providing a data set to be evaluated;

The method addresses the defect of the prior art that, when industrial data are analyzed, only data quality is evaluated and data mineability evaluation is lacking.

In the method, based on a data set of industrial data, the evaluation rules corresponding to multiple evaluation indexes are selected and the weight coefficients corresponding to the data evaluation indexes are obtained. A data mineability evaluation model is then established from the weight coefficient and the data evaluation rule corresponding to each data evaluation index. By analyzing the data set with this model, a user can learn the overall quality level and the mineability level of the data.

In a preferred embodiment, the data set is a matrix of m x n;

In a preferred embodiment, the data evaluation indexes in the data evaluation index library include a data quality evaluation index and a data mineability evaluation index;

and the corresponding evaluation rule is a data minerability evaluation rule.

In a preferred embodiment, the data quality evaluation rule corresponding to the data quality evaluation index includes:

when the data quality evaluation index is an accuracy index, the corresponding data quality evaluation rule is an accuracy rate R₁, and the accuracy rate R₁ is calculated as follows:

wherein a₁ denotes the data anomaly rate, a₂ denotes the data non-compliance rate, h_o denotes the number of abnormal data items in the data set, h_c denotes the number of non-compliant data items in the data set, r denotes the total number of data items in the data set, and p denotes the number of indexes used, and/or

when the data quality evaluation index is a reliability index, the corresponding data quality evaluation rule is a reliability rate R₃, and the reliability rate R₃ is calculated as follows:

R₃ = 1 − c₁;

wherein c₁ denotes the data out-of-range rate and h_r denotes the number of data items in the data set that fall outside the permitted range.
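The reliability rule R₃ = 1 − c₁ can be illustrated with a short sketch. It assumes that the out-of-range rate is the out-of-range count divided by the total count, c₁ = h_r / r, which the text implies but does not spell out; the function name, the range limits and the sample readings are hypothetical.

```python
def reliability_rate(values, lower, upper):
    """Return R3 = 1 - c1 for one variable.

    Assumption: c1 = h_r / r, where h_r counts the values outside
    [lower, upper] and r is the total number of values.
    """
    r = len(values)
    if r == 0:
        raise ValueError("the data set is empty")
    # h_r: number of values that fall outside the permitted range.
    h_r = sum(1 for value in values if not (lower <= value <= upper))
    c1 = h_r / r
    return 1.0 - c1


# Hypothetical temperature readings with a permitted range of 0 to 100.
readings = [41.2, 43.0, 39.8, 120.5, 42.1, -5.0, 40.7, 41.9]
r3 = reliability_rate(readings, lower=0.0, upper=100.0)
print(r3)  # 2 of 8 readings are out of range, so R3 = 0.75
```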

v̄ denotes the mean of the variable v, σ_v denotes the standard deviation of the variable v, d₁ denotes the data repetition rate, d₂ denotes the variable correlation rate, d₃ denotes the invalid variable rate, h_q denotes the number of duplicate data items in the data set, and h_s denotes the number of invalid variables in the data set.

In a preferred embodiment, the data minerability evaluation rule corresponding to the data minerability index includes:

R₅＝e₁；

ŷ denotes the predicted value of y, y_i is the value of the dependent variable y in row i, ȳ is the mean of y, ŷ_i is the value of ŷ in row i, and $\bar{\hat{y}}$ is the mean of ŷ.
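The quantities listed for the regression fit e₁ (row values and means of both y and its prediction) are those that appear in the squared correlation between observed and predicted values. The sketch below assumes that e₁ is this squared correlation; the text does not confirm the exact formula, and the function name and sample values are hypothetical.

```python
def regression_fit(y, y_hat):
    """Squared correlation between observed y and predicted y_hat.

    Assumption: e1 = (sum of cross-deviations)^2 divided by the product
    of the two sums of squared deviations.
    """
    m = len(y)
    if m != len(y_hat) or m < 2:
        raise ValueError("y and y_hat must have the same length (at least 2)")
    mean_y = sum(y) / m
    mean_hat = sum(y_hat) / m
    cross = sum((a - mean_y) * (b - mean_hat) for a, b in zip(y, y_hat))
    ss_y = sum((a - mean_y) ** 2 for a in y)
    ss_hat = sum((b - mean_hat) ** 2 for b in y_hat)
    if ss_y == 0 or ss_hat == 0:
        raise ValueError("constant y or y_hat has no defined fit")
    return cross ** 2 / (ss_y * ss_hat)


# Hypothetical observed values and predictions from some regression model.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
e_1 = regression_fit(y, y_hat)
r_5 = e_1  # R5 = e1, as stated in the text
```

A value close to 1 means the chosen regression algorithm reproduces the dependent variable well, so the data set is well suited to regression tasks.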

In the above technical solution, it should be noted that the dependent variable y to be classified is specified in advance by the user, the remaining variables are denoted x, and the data are coarsely classified using a data classification algorithm. Such algorithms include, but are not limited to, commonly used data classification algorithms such as logistic regression, k-nearest neighbors, Bayesian methods, neural networks, support vector machines, Gaussian processes, decision trees, random forests and ensemble learning. The predicted class of the dependent variable y obtained after the data are classified by the algorithm is denoted ŷ.

when the data mineability index is a classification index, the corresponding data mineability evaluation rule is a classifiable rate R₆, and the classifiable rate R₆ is calculated as follows:

R₆ = f₁;

ŷ denotes the predicted class of y, y_i is the value of y in row i, and ŷ_i is the value of ŷ in row i.
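Since f₁ is described as the classification accuracy over the labels y_i and the predictions ŷ_i, a natural reading is the share of rows whose predicted class equals the true class. The sketch below assumes exactly that; the function name and the sample labels are hypothetical, and `f_1` here is the patent's symbol f₁, not the F1 score.

```python
def classification_accuracy(y, y_hat):
    """Share of rows in which the predicted class equals the true class.

    Assumption: f1 in the text is this plain accuracy.
    """
    if len(y) != len(y_hat) or not y:
        raise ValueError("y and y_hat must be non-empty and equally long")
    correct = sum(1 for true, pred in zip(y, y_hat) if true == pred)
    return correct / len(y)


# Hypothetical class labels and predictions from a coarse classifier.
labels = ["normal", "fault", "normal", "normal", "fault"]
predicted = ["normal", "fault", "fault", "normal", "fault"]
f_1 = classification_accuracy(labels, predicted)
r_6 = f_1  # R6 = f1, as stated in the text
print(r_6)  # 4 of 5 rows match, so R6 = 0.8
```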

In the above technical solution, it should be noted that the dependent variable y to be classified is specified in advance by the user, the remaining variables are denoted x, and the data are coarsely classified using a data classification algorithm. Such algorithms include, but are not limited to, commonly used data classification algorithms such as logistic regression, k-nearest neighbors, Bayesian methods, neural networks, support vector machines, Gaussian processes, decision trees, random forests and ensemble learning. The predicted class of the dependent variable y obtained after the data are classified by the algorithm is denoted ŷ.

R₇＝g₁；

In the above technical solution, it should be noted that the user specifies the class labels of the data in advance, and the data are coarsely clustered using a data clustering algorithm. Such algorithms include, but are not limited to, commonly used data clustering algorithms such as the k-means algorithm, the DBSCAN algorithm, the BIRCH algorithm, the mean-shift algorithm, the affinity propagation algorithm, the spectral clustering algorithm, the geometric clustering algorithm and the Gaussian mixture algorithm. Clustering the data with the algorithm yields V_i, the number of data items correctly clustered for the i-th class of samples.
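The clustering accuracy g₁ is defined through the per-class counts V_i, but the text gives neither the formula nor the rule for deciding when a data item is clustered correctly. The sketch below makes two assumptions: an item of class i counts as correctly clustered when it lands in the cluster that holds most items of that class, and g₁ is the sum of all V_i divided by the number of data items. The function name and the sample data are hypothetical.

```python
from collections import Counter


def clustering_accuracy(labels, clusters):
    """Return (g1, V), where V maps each class to its correct-item count.

    Assumptions: V_i is the size of the largest cluster among the items
    of class i, and g1 = sum(V_i) / number of items.
    """
    if len(labels) != len(clusters) or not labels:
        raise ValueError("labels and clusters must be non-empty and equally long")
    per_class = {}
    for label, cluster in zip(labels, clusters):
        per_class.setdefault(label, Counter())[cluster] += 1
    # For each class, count the items that fall in its majority cluster.
    v = {label: counts.most_common(1)[0][1] for label, counts in per_class.items()}
    g1 = sum(v.values()) / len(labels)
    return g1, v


# Hypothetical user-specified class labels and cluster assignments.
labels = ["a", "a", "a", "b", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1, 0]
g_1, v = clustering_accuracy(labels, clusters)
r_7 = g_1  # R7 = g1, as stated in the text
```

This simple majority rule lets two classes claim the same cluster; a stricter variant would match classes to clusters one-to-one before counting.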

In a preferred embodiment, a plurality of scoring units are adopted to respectively obtain the score of each data evaluation index;

each of the scoring units is independent of the data evaluation index;

In the above technical solution, the scoring units are mainly used for scoring in data mining, and the number N of scoring units is preferably greater than or equal to 5.

The scores for the different indices may be set as follows:

unimportant indexes score in [A1, A2), secondary indexes in [A2, A3), ordinary indexes in [A3, A4), important indexes in [A4, A5), and indispensable indexes in [A5, A6],

wherein A6 > A5 > A4 > A3 > A2 > A1.

For example, the boundaries may be set as A1 = 0, A2 = 20, A3 = 40, A4 = 60, A5 = 80 and A6 = 100.

In a preferred embodiment, the method for obtaining the composite score of each data evaluation index according to the weight coefficient corresponding to each scoring unit is as follows:

In a preferred embodiment, the normalized weight coefficient of each of the data evaluation indexes is obtained according to the data evaluation index comprehensive score, as follows:

Preferably, the data mineability evaluation model is as follows:

$$S = (B_5 - B_1) \times \sum_{j=1}^{k_s} w_j R_j$$

In the above technical solution, the composite score of the data mineability evaluation can be divided into four grades:

the fourth grade has a score in [B1, B2), which indicates that the data quality is poor and that the data are either unsuitable for subsequent data mining and analysis or unlikely to yield useful results from it;

the third grade has a score in [B2, B3), which indicates that the data quality is average and that subsequent data mining and analysis can be carried out only after standardized data preprocessing;

the second grade has a score in [B3, B4), which indicates that the data quality is fairly good and that subsequent data mining and analysis can be carried out after simple data preprocessing;

the first grade has a score in [B4, B5], which indicates that the data quality is good and that subsequent data mining and analysis can be carried out directly without data preprocessing;

wherein B5 > B4 > B3 > B2 > B1.

For example, the parameters may be set as B1 = 0, B2 = 60, B3 = 80, B4 = 90 and B5 = 100.
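The four-grade division can be expressed as a small lookup. This is a minimal sketch using the example boundaries B1 = 0, B2 = 60, B3 = 80, B4 = 90 and B5 = 100 from the text; the function name `mineability_grade` is hypothetical, and the half-open intervals follow the grade definitions, with only the first grade closed at B5.

```python
def mineability_grade(score, bounds=(0, 60, 80, 90, 100)):
    """Map a composite score S to its grade (1 is best, 4 is worst).

    bounds holds (B1, B2, B3, B4, B5) in increasing order.
    """
    b1, b2, b3, b4, b5 = bounds
    if not b1 <= score <= b5:
        raise ValueError("score lies outside [B1, B5]")
    if score < b2:
        return 4  # [B1, B2): poor quality, not suitable for mining
    if score < b3:
        return 3  # [B2, B3): average, needs standardized preprocessing
    if score < b4:
        return 2  # [B3, B4): fairly good, simple preprocessing suffices
    return 1      # [B4, B5]: good, can be mined directly


grade = mineability_grade(89.05)
print(grade)  # 89.05 lies in [80, 90), so the grade is 2
```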

The following is a description of a specific embodiment:

Firstly, a sample data set is obtained. The sample data set is assumed to come from DCS (distributed control system) data recording the operating parameters of a steam turbine generator unit in a power plant over a two-month period. The sampling interval is 1 min; because of missing values and other causes, the data contain 87769 rows in total. The data comprise 362 variables, such as generator active power, generator reactive power, generator frequency, stator coil outlet water temperature, stator coil cooling water flow, excitation-end cold hydrogen temperature and hydrogen purity.

The subsequent data mining task of the sample data mainly analyzes the trend change condition of the water outlet temperature of the stator coil of the generator.

The specific implementation flow is as follows:

confirming that the data set to be evaluated is 87769 × 362 data set, wherein the number of data pieces is 87769, and the number of variables is 362;

and selecting a proper data evaluation index from the data evaluation index library aiming at a specific data mining and analyzing task.

Assume that the evaluation indexes selected in this example are the accuracy index, the integrity index, the reliability index, the redundancy index and the regression index;

and selecting and calculating an evaluation rule corresponding to the index in the data evaluation index rule base. The results of calculating the data in this example are as follows:

the accuracy rate R₁ = 93.15%, the integrity rate R₂ = 87.63%, the reliability rate R₃ = 88.01%, the non-redundancy rate R₄ = 80.39%, and the regressable rate R₅ = 91.66%.

The weight coefficients of the evaluation indexes are calculated using the scoring units. In the weighting strategy, N = 5, where N denotes the number of scoring units;

A1 = 0, A2 = 20, A3 = 40, A4 = 60, A5 = 80 and A6 = 100, where A1 to A6 denote the score boundaries for the data evaluation indexes;

B1 = 0, B2 = 60, B3 = 80, B4 = 90 and B5 = 100, where B1 to B5 denote the set parameters;

the weight coefficients of the above five evaluation indexes are calculated to be w₁ = 0.20, w₂ = 0.13, w₃ = 0.19, w₄ = 0.15 and w₅ = 0.33, respectively.

The data mineability evaluation model is set to S = (B5 − B1) × (w₁ × R₁ + w₂ × R₂ + w₃ × R₃ + w₄ × R₄ + w₅ × R₅), and the composite mineability evaluation score of the target data set calculated from this model is S = 89.05.
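The composite score of this example can be reproduced directly from the stated weights and rates. This is a minimal sketch; the function name `mineability_score` is hypothetical, while the weights, rates and parameters B1 = 0 and B5 = 100 are the values given in the embodiment.

```python
def mineability_score(weights, rates, b1=0.0, b5=100.0):
    """Compute S = (B5 - B1) * sum(w_j * R_j)."""
    if len(weights) != len(rates):
        raise ValueError("each rate needs exactly one weight")
    return (b5 - b1) * sum(w * r for w, r in zip(weights, rates))


# Values from the embodiment: w1..w5 and R1..R5 (rates as fractions).
weights = [0.20, 0.13, 0.19, 0.15, 0.33]
rates = [0.9315, 0.8763, 0.8801, 0.8039, 0.9166]
score = mineability_score(weights, rates)
print(round(score, 2))  # 89.05
```

The weighted sum of the five rates is 0.890501, so the result agrees with the stated S = 89.05.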

The calculated composite score is then compared with the four-grade division of the data mineability composite score (fourth grade [0, 60), third grade [60, 80), second grade [80, 90), first grade [90, 100]). In this example S = 89.05 falls in the second grade [80, 90). The suggestion for subsequent mining and analysis at this grade is that the data quality is fairly good and that subsequent data mining and analysis can be carried out after simple data preprocessing.

Although the integrity rate R₂ (87.63%) and the non-redundancy rate R₄ (80.39%) of the data set are lower than the other indexes, the weighting strategy still gives the target data set a high final mineability score. The missing data reflected by the integrity index and the duplicated data reflected by the redundancy index therefore do not need extensive processing: deleting the rows with missing values and standardizing the data set is sufficient before subsequent data mining and analysis.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims

1. A mining ability evaluation method based on industrial equipment data is characterized in that a data evaluation index database and a data evaluation index rule database are provided;

the method specifically comprises the following steps:

step S1, providing a data set to be evaluated;

step S5, establishing a data mineability evaluation model according to the weight coefficient corresponding to each data evaluation index and the data evaluation rule corresponding to each data evaluation index;

the data evaluation index comprises a data minerability evaluation index, and the data minerability evaluation rule corresponding to the data minerability evaluation index comprises:

R₅＝e₁；

ȳ is the mean of y, ŷ_i is the value of ŷ in row i, and $\bar{\hat{y}}$ is the mean of ŷ.

2. The method of claim 1, wherein the dataset is a matrix of m x n;

3. The method of claim 1, wherein the data evaluation indexes in the data evaluation index library comprise data quality evaluation indexes;

the evaluation rules in the data evaluation index rule library comprise data quality evaluation rules, and the evaluation rule corresponding to a data quality evaluation index is a data quality evaluation rule.

4. The method according to claim 3, wherein the data quality evaluation rule corresponding to the data quality evaluation index includes:

wherein a₁ denotes the data anomaly rate, a₂ denotes the data non-compliance rate, h_o denotes the number of abnormal data items in the data set, h_c denotes the number of non-compliant data items in the data set, r denotes the total number of data items in the data set, and p denotes the number of indexes used, and/or

5. The method according to claim 3, wherein the data quality evaluation rule corresponding to the data quality evaluation index includes:

R₃＝1-c₁；

and/or

when the data quality evaluation index is a redundancy index, the corresponding data quality evaluation rule is a non-redundancy rate R₄, and the non-redundancy rate R₄ is calculated as follows:

6. The method according to claim 1, wherein the data minerability assessment rule corresponding to the data minerability index includes:

when the data mineability index is a classification index, the corresponding data mineability evaluation rule is a classifiable rate R₆, and the classifiable rate R₆ is calculated as follows:

R₆＝f₁；

ŷ denotes the predicted class of y, y_i is the value of y in row i, and ŷ_i is the value of ŷ in row i.

7. The method according to claim 1, wherein the data minerability assessment rule corresponding to the data minerability index includes:

R₇＝g₁；

wherein g₁ denotes the clustering accuracy, k is the total number of classes of data in the data set, and V_i denotes the number of data items correctly clustered for the i-th class of samples.

8. The data minerability assessment method according to claim 1, wherein a plurality of scoring units are used to obtain a score for each of the data assessment indexes;

each of the scoring units is independent of the data evaluation index;

wherein,

9. The method of claim 8, wherein the method of obtaining the composite score of each data evaluation index according to the weighting coefficient corresponding to each scoring unit is as follows:

10. The method of claim 8, wherein the normalized weight coefficient of each of the data evaluation indexes is obtained according to the data evaluation index composite score as follows:

11. The method of claim 8, wherein the data minerability assessment model is as follows: