CN108197280B - Mining ability evaluation method based on industrial equipment data - Google Patents


Info

Publication number
CN108197280B
Authority
CN
China
Prior art keywords
data
evaluation
index
evaluation index
rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810023192.6A
Other languages
Chinese (zh)
Other versions
CN108197280A (en)
Inventor
董亚明
许伟
马贺贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Electric Group Corp
Original Assignee
Shanghai Electric Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Electric Group Corp filed Critical Shanghai Electric Group Corp
Priority to CN201810023192.6A priority Critical patent/CN108197280B/en
Publication of CN108197280A publication Critical patent/CN108197280A/en
Application granted granted Critical
Publication of CN108197280B publication Critical patent/CN108197280B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

The invention provides a mineability evaluation method based on industrial equipment data, in which a data evaluation index library and a data evaluation index rule library are provided. The data evaluation index rule library comprises a plurality of data evaluation rules; the data evaluation index library comprises a plurality of data evaluation indexes, each corresponding to a weight coefficient. The method comprises: providing a data set to be evaluated; selecting the corresponding data evaluation indexes from the data evaluation index library according to the data set; obtaining, from the data evaluation index rule library, the data evaluation rules that correspond one-to-one to the selected data evaluation indexes; obtaining the weight coefficient corresponding to each data evaluation index; and establishing a data mineability evaluation model from the weight coefficient and the data evaluation rule corresponding to each data evaluation index. The technical solution overcomes the defect of the prior art, which addresses only the quality evaluation of industrial data and lacks mineability evaluation.

Description

Mining ability evaluation method based on industrial equipment data
Technical Field
The invention relates to the technical field of industrial equipment data processing, and in particular to a mineability evaluation method based on industrial equipment data.
Background
With continuing economic development, enterprises are gradually becoming informatized so that they can adapt to the constantly changing competitive environment of the market economy and maximize economic benefit. Data from industrial equipment plays a very important role in accelerating enterprise informatization. However, as equipment ages, sensors fail, and transmission networks become unstable, data quality problems are increasingly prominent. These problems mainly take the form of inaccurate, incomplete, inconsistent, and unreliable data, and poor-quality data has become an important factor affecting an enterprise's subsequent data mining and analysis work.
Depending on the mining purpose, the mining and analysis of industrial equipment data mainly comprises data regression, data classification, and data clustering. Effectively evaluating the overall quality and mineability of the data before mining and analysis begins helps enterprises and users understand the overall quality level and mineability level of the data, judge from the mineability evaluation result whether the data set is suitable for subsequent data mining and analysis, and obtain guidance and suggestions for carrying out that work.
However, existing approaches focus only on data quality evaluation. Most are applied to information data governance, data transaction evaluation, and the like for social organizations such as government agencies, enterprises, and public institutions, and the main problem they address is reducing the workload of maintaining and managing data quality check rules; data mineability evaluation is not involved. Even where evaluation is involved, only common data quality evaluation indexes are used, which are insufficient for subsequent data mining and analysis. Because indexes related to data mineability evaluation are lacking, the evaluation result cannot accurately guide the user's subsequent data mining and analysis work.
Disclosure of Invention
In view of the above problems in the prior art concerning mineability evaluation of industrial equipment data, a mineability evaluation method is provided that applies both data quality evaluation rules and data mineability evaluation rules to industrial data, so that a user can understand the overall quality level and mineability level of the data.
The specific technical scheme is as follows:
a mining ability evaluation method based on industrial equipment data is provided, wherein a data evaluation index database and a data evaluation index rule database are provided;
the data evaluation index rule base comprises a plurality of data evaluation rules;
the data evaluation index library comprises a plurality of data evaluation indexes, and each data evaluation index corresponds to a weight coefficient;
the method specifically comprises the following steps:
step S1, providing a data set to be evaluated;
step S2, selecting the corresponding data evaluation index from the data evaluation index library according to the data set;
step S3, obtaining the data evaluation rules corresponding to the data evaluation indexes one to one from the data evaluation index rule base;
step S4, acquiring the weight coefficient corresponding to each data evaluation index;
step S5, establishing a data mineability evaluation model according to the weight coefficient corresponding to each data evaluation index and the data evaluation rule corresponding to each data evaluation index.
Preferably, the data set is a matrix of m × n;
where m represents the number of data pieces and n represents the number of variables.
Preferably, the data evaluation indexes in the data evaluation index library include a data quality evaluation index and a data mineability evaluation index;
the evaluation rules in the data evaluation index rule base comprise data quality evaluation rules and data mineability evaluation rules;
the evaluation rule corresponding to a data quality evaluation index is a data quality evaluation rule;
and the evaluation rule corresponding to a data mineability evaluation index is a data mineability evaluation rule.
Preferably, the data quality evaluation rule corresponding to the data quality evaluation index includes:
when the data quality evaluation index is an accuracy index, the corresponding data quality evaluation rule is the accuracy rate R1, calculated as follows:
Figure BDA0001544188630000031
wherein a1 denotes the data anomaly rate, a2 denotes the data non-compliance rate, h_o denotes the number of abnormal data items in the data set, h_c denotes the number of non-compliant data items in the data set, r denotes the total number of data items in the data set, and p denotes the number of indexes used,
and/or
when the data quality evaluation index is an integrity index, the corresponding data quality evaluation rule is the integrity rate R2, calculated as follows:
Figure BDA0001544188630000032
Figure BDA0001544188630000033
wherein b1 denotes the missing-value rate, b2 denotes the missing-variable rate, b3 denotes the missing-timestamp rate, h_m denotes the number of missing data items in the data set, h_v denotes the number of missing variables in the data set, and h_t denotes the number of missing timestamps in the data set.
Preferably, the data quality evaluation rule corresponding to the data quality evaluation index includes:
when the data quality evaluation index is a reliability index, the corresponding data quality evaluation rule is the reliability rate R3, calculated as follows:
Figure BDA0001544188630000034
R3=1-c1
wherein c1 denotes the data out-of-range rate and h_r denotes the number of out-of-range data items in the data set,
and/or
when the data quality evaluation index is a redundancy index, the corresponding data quality evaluation rule is the non-redundancy rate R4, calculated as follows:
Figure BDA0001544188630000041
Figure BDA0001544188630000042
Figure BDA0001544188630000043
Figure BDA0001544188630000044
wherein cc_uv denotes the correlation between variable u and variable v in the data set, u_i denotes the value of variable u in row i, ū denotes the mean of variable u, σ_u denotes the standard deviation of variable u, v_i denotes the value of variable v in row i, v̄ denotes the mean of variable v, σ_v denotes the standard deviation of variable v, d1 denotes the data repetition rate, d2 denotes the variable correlation rate, d3 denotes the invalid-variable rate, h_q denotes the number of duplicate data items in the data set, and h_s denotes the number of invalid variables in the data set.
Preferably, the data mineability evaluation rule corresponding to the data mineability index includes:
when the data mineability index is a regression index, the corresponding data mineability evaluation rule is the regressable rate R5, calculated as follows:
Figure BDA0001544188630000045
R5=e1
wherein e1 denotes the regression goodness of fit, y denotes the dependent variable in the data set, ŷ denotes the predicted value of y, y_i is the value of the dependent variable y in row i, ȳ is the mean of y, ŷ_i is the value of ŷ in row i, and the mean of ŷ is written with an overbar in the formula.
Preferably, the data minerability evaluation rule corresponding to the data minerability index includes:
when the data mineability index is a classification index, the corresponding data mineability evaluation rule is the classifiable rate R6, calculated as follows:
Figure BDA0001544188630000046
R6=f1
wherein f1 denotes the classification accuracy, y denotes the class labels in the data set, ŷ denotes the predicted class of y, y_i is the value of y in row i, and ŷ_i is the value of ŷ in row i.
Preferably, the data minerability evaluation rule corresponding to the data minerability index includes:
when the data mineability index is a clustering index, the corresponding data mineability evaluation rule is the clusterable rate R7, calculated as follows:
Figure BDA0001544188630000053
R7=g1
wherein g1 denotes the clustering accuracy, k is the total number of classes of data in the data set, and V_i denotes the number of data items clustered accurately for the i-th class of samples.
Preferably, a plurality of scoring units are adopted to score each data evaluation index;
the scoring units are independent of the data evaluation indexes;
the weight coefficient corresponding to each scoring unit is obtained as follows:
Figure BDA0001544188630000054
Figure BDA0001544188630000055
wherein
Figure BDA0001544188630000056
denotes the mean of the scores given by the i-th scoring unit to the data evaluation indexes, σ_i denotes the standard deviation of the scores given by the i-th scoring unit to the data evaluation indexes, CV_i denotes the coefficient of variation of the i-th scoring unit's scores, WE_i denotes the normalized weight coefficient of the i-th scoring unit, and k_e denotes the number of scoring units.
Preferably, the method of obtaining the composite score of each data evaluation index according to the weighting coefficient corresponding to each scoring unit is as follows:
Figure BDA0001544188630000057
wherein SC_j denotes the composite score of the j-th data evaluation index, and S_ij denotes the score given by the i-th scoring unit to the j-th data evaluation index.
Preferably, the normalized weight coefficient of each data evaluation index is obtained according to the data evaluation index comprehensive score, as follows:
Figure BDA0001544188630000061
wherein w_j denotes the normalized weight coefficient of the j-th data evaluation index, and k_s denotes the number of data evaluation indexes.
Preferably, the data minerability assessment model is as follows:
S = (B5 - B1) × Σ_{j=1}^{k_s} (w_j × R_j)
where S is the composite data mineability score of the data set, w_j is the normalized weight coefficient of the j-th data evaluation index, R_j is the score of the j-th data evaluation index, and B5 and B1 are set parameters.
The technical solution has the following advantages or beneficial effects: both data quality evaluation rules and data mineability evaluation rules are applied to the industrial data, so that a user can understand the overall quality level and mineability level of the data and can be accurately guided in the subsequent mining and analysis of the data. This overcomes the defect of the prior art, which addresses only the quality evaluation of industrial data and lacks mineability evaluation.
Drawings
Embodiments of the present invention will now be described more fully hereinafter with reference to the accompanying drawings. The drawings are, however, to be regarded as illustrative and explanatory only and are not restrictive of the scope of the invention.
Fig. 1 is a flowchart of an embodiment of the mineability evaluation method based on industrial equipment data according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The invention is further described with reference to the following drawings and specific examples, which are not intended to be limiting.
The technical scheme of the invention comprises a method for evaluating the mineability based on the industrial equipment data.
An embodiment of a mining ability evaluation method based on industrial equipment data is provided, wherein a data evaluation index database and a data evaluation index rule database are provided;
the data evaluation index rule base comprises a plurality of data evaluation rules;
the data evaluation index library comprises a plurality of data evaluation indexes, and each data evaluation index corresponds to a weight coefficient;
as shown in fig. 1, the method specifically comprises the following steps:
step S1, providing a data set to be evaluated;
step S2, selecting the corresponding data evaluation index from the data evaluation index library according to the data set;
step S3, obtaining the data evaluation rules corresponding to the data evaluation indexes one to one from the data evaluation index rule base;
step S4, acquiring the weight coefficient corresponding to each data evaluation index;
step S5, establishing a data mineability evaluation model according to the weight coefficient corresponding to each data evaluation index and the data evaluation rule corresponding to each data evaluation index.
The method addresses the defect of the prior art that, when industrial data are analyzed, only data quality is evaluated and data mineability evaluation is lacking.
According to the method, data evaluation indexes are selected for a data set drawn from industrial data, the evaluation rule and the weight coefficient corresponding to each selected index are obtained, and a data mineability evaluation model is then established from the weight coefficient and the data evaluation rule corresponding to each data evaluation index. By analyzing the data set with this model, a user can understand the overall quality level and mineability level of the data.
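Steps S1 to S5 can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the names `build_model`, `rule_library`, and `weights` are hypothetical, each rule is assumed to return a rate in [0, 1], and the weighted-sum form of the model is taken from the worked example later in the description.

```python
from typing import Callable, Dict, Sequence

Dataset = Sequence[Sequence[float]]
Rule = Callable[[Dataset], float]  # assumed to return a rate in [0, 1]


def build_model(selected: Sequence[str],
                rule_library: Dict[str, Rule],
                weights: Dict[str, float],
                b1: float = 0.0,
                b5: float = 100.0) -> Callable[[Dataset], float]:
    """Bind the selected indexes to their rules and weights and return a scorer."""
    rules = {name: rule_library[name] for name in selected}   # step S3
    w = {name: weights[name] for name in selected}            # step S4

    def score(dataset: Dataset) -> float:                     # step S5
        return (b5 - b1) * sum(w[n] * rules[n](dataset) for n in selected)

    return score


# Toy usage with made-up rules and weights (not taken from the patent).
toy_rules = {"integrity": lambda d: 0.5, "reliability": lambda d: 1.0}
toy_weights = {"integrity": 0.4, "reliability": 0.6}
model = build_model(["integrity", "reliability"], toy_rules, toy_weights)  # step S2
toy_score = model([[1.0, 2.0], [3.0, 4.0]])                                # step S1
```

Keeping the rule library as a mapping from index name to callable mirrors the one-to-one correspondence between indexes and rules described in step S3.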
In a preferred embodiment, the data set is a matrix of m x n;
where m represents the number of data pieces and n represents the number of variables.
In a preferred embodiment, the data evaluation indexes in the data evaluation index library include a data quality evaluation index and a data mineability evaluation index;
the evaluation rules in the data evaluation index rule base comprise data quality evaluation rules and data mineability evaluation rules;
the evaluation rule corresponding to a data quality evaluation index is a data quality evaluation rule;
and the evaluation rule corresponding to a data mineability evaluation index is a data mineability evaluation rule.
In a preferred embodiment, the data quality evaluation rule corresponding to the data quality evaluation index includes:
when the data quality evaluation index is an accuracy index, the corresponding data quality evaluation rule is the accuracy rate R1, calculated as follows:
Figure BDA0001544188630000081
wherein a1 denotes the data anomaly rate, a2 denotes the data non-compliance rate, h_o denotes the number of abnormal data items in the data set, h_c denotes the number of non-compliant data items in the data set, r denotes the total number of data items in the data set, and p denotes the number of indexes used,
and/or
when the data quality evaluation index is an integrity index, the corresponding data quality evaluation rule is the integrity rate R2, calculated as follows:
Figure BDA0001544188630000082
Figure BDA0001544188630000083
wherein b1 denotes the missing-value rate, b2 denotes the missing-variable rate, b3 denotes the missing-timestamp rate, h_m denotes the number of missing data items in the data set, h_v denotes the number of missing variables in the data set, and h_t denotes the number of missing timestamps in the data set.
In a preferred embodiment, the data quality evaluation rule corresponding to the data quality evaluation index includes:
when the data quality evaluation index is a reliability index, the corresponding data quality evaluation rule is the reliability rate R3, calculated as follows:
Figure BDA0001544188630000084
R3=1-c1
wherein c1 denotes the data out-of-range rate and h_r denotes the number of out-of-range data items in the data set.
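A minimal sketch of the reliability rate, under a stated assumption: the formula image for c1 is not available in this text, so c1 is assumed here to be h_r / r, the number of out-of-range values divided by the total number of values. The function name and the per-variable `lower` and `upper` bounds are hypothetical inputs supplied by the user.

```python
import numpy as np


def reliability_rate(data, lower, upper) -> float:
    """R3 = 1 - c1, with c1 ASSUMED to be h_r / r (out-of-range values / all values)."""
    data = np.asarray(data, dtype=float)
    lower = np.asarray(lower, dtype=float)   # one lower bound per variable (column)
    upper = np.asarray(upper, dtype=float)   # one upper bound per variable (column)
    out_of_range = (data < lower) | (data > upper)   # bounds broadcast across rows
    c1 = out_of_range.sum() / data.size
    return float(1.0 - c1)


# Three rows, two variables; only the value 200 violates its [0, 100] range.
r3 = reliability_rate([[1, 10], [5, 50], [9, 200]], lower=[0, 0], upper=[10, 100])
```

Missing values (NaN) compare as False and are therefore not counted as out of range in this sketch; they would be handled by the integrity index instead.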
In a preferred embodiment, the data quality evaluation rule corresponding to the data quality evaluation index includes:
when the data quality evaluation index is a redundancy index, the corresponding data quality evaluation rule is the non-redundancy rate R4, calculated as follows:
Figure BDA0001544188630000091
Figure BDA0001544188630000092
Figure BDA0001544188630000093
Figure BDA0001544188630000094
wherein cc_uv denotes the correlation between variable u and variable v in the data set, u_i denotes the value of variable u in row i, ū denotes the mean of variable u, σ_u denotes the standard deviation of variable u, v_i denotes the value of variable v in row i, v̄ denotes the mean of variable v, σ_v denotes the standard deviation of variable v, d1 denotes the data repetition rate, d2 denotes the variable correlation rate, d3 denotes the invalid-variable rate, h_q denotes the number of duplicate data items in the data set, and h_s denotes the number of invalid variables in the data set.
In a preferred embodiment, the data minerability evaluation rule corresponding to the data minerability index includes:
when the data mineability index is a regression index, the corresponding data mineability evaluation rule is the regressable rate R5, calculated as follows:
Figure BDA0001544188630000095
R5=e1
wherein e1 denotes the regression goodness of fit, y denotes the dependent variable in the data set, ŷ denotes the predicted value of y, y_i is the value of the dependent variable y in row i, ȳ is the mean of y, ŷ_i is the value of ŷ in row i, and the mean of ŷ is written with an overbar in the formula.
In the above technical solution, it should be noted that the dependent variable y to be classified is specified in advance by the user and the remaining variables are set as x. The data are roughly classified using a data classification algorithm, which includes, but is not limited to, commonly used data classification algorithms such as logistic regression, k-nearest neighbors, Bayesian methods, neural networks, support vector machines, Gaussian processes, decision trees, random forests, and ensemble learning. The predicted category of the dependent variable y obtained after the data are classified by the algorithm is denoted ŷ.
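For the regressable rate, the where-clause names the means of both y and ŷ, which suggests (but does not confirm) that e1 is the squared correlation between observed and predicted values. The sketch below assumes that form; the formula image itself is not available, and the function name `regression_fit` is hypothetical.

```python
import numpy as np


def regression_fit(y, y_hat) -> float:
    """Squared Pearson correlation between y and its prediction (ASSUMED form of e1)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    dy, dyh = y - y.mean(), y_hat - y_hat.mean()
    return float((dy * dyh).sum() ** 2 / ((dy ** 2).sum() * (dyh ** 2).sum()))


# A prediction that is an exact linear function of y gives a fit of 1.
e1 = regression_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Under this assumption the fit lies in [0, 1], so it can be used directly as the rate R5 = e1.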
In a preferred embodiment, the data minerability evaluation rule corresponding to the data minerability index includes:
when the data mineability index is a classification index, the corresponding data mineability evaluation rule is the classifiable rate R6, calculated as follows:
Figure BDA0001544188630000102
R6=f1
wherein f1 denotes the classification accuracy, y denotes the class labels in the data set, ŷ denotes the predicted class of y, y_i is the value of y in row i, and ŷ_i is the value of ŷ in row i.
In the above technical solution, it should be noted that the dependent variable y to be classified is specified in advance by the user and the remaining variables are set as x. The data are roughly classified using a data classification algorithm, which includes, but is not limited to, commonly used data classification algorithms such as logistic regression, k-nearest neighbors, Bayesian methods, neural networks, support vector machines, Gaussian processes, decision trees, random forests, and ensemble learning. The predicted category of the dependent variable y obtained after the data are classified by the algorithm is denoted ŷ.
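A minimal sketch of the classification accuracy, assuming f1 is the fraction of rows in which the predicted class ŷ_i equals the label y_i (the usual definition of accuracy; the patent's formula image is not available to confirm it). The function name is hypothetical.

```python
from typing import Sequence


def classification_accuracy(y: Sequence, y_hat: Sequence) -> float:
    """Fraction of rows whose predicted class matches the label (assumed form of f1)."""
    if len(y) != len(y_hat) or not y:
        raise ValueError("y and y_hat must be non-empty and of equal length")
    matches = sum(1 for label, pred in zip(y, y_hat) if label == pred)
    return matches / len(y)


# Three of the four predictions match the labels.
f1 = classification_accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```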
In a preferred embodiment, the data minerability evaluation rule corresponding to the data minerability index includes:
when the data mineability index is a clustering index, the corresponding data mineability evaluation rule is the clusterable rate R7, calculated as follows:
Figure BDA0001544188630000107
R7=g1
wherein g1 denotes the clustering accuracy, k is the total number of classes of data in the data set, and V_i denotes the number of data items clustered accurately for the i-th class of samples.
In the above technical solution, it should be noted that the user specifies the class labels of the data in advance and that the data are roughly clustered using a data clustering algorithm, which includes, but is not limited to, commonly used data clustering algorithms such as k-means, DBSCAN, Birch, Mean-shift, Affinity Propagation, spectral clustering, geometric clustering, and Gaussian mixture models. Clustering the data with the algorithm yields V_i, the number of data items clustered accurately for the i-th class of samples.
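The clustering accuracy can be sketched as below, assuming g1 is the sum of the per-class counts V_i divided by the total number of data items m. That denominator is an assumption, since the formula image is missing, and the function name is hypothetical.

```python
from typing import Sequence


def clustering_accuracy(correct_per_class: Sequence[int], total: int) -> float:
    """Sum of accurately clustered counts V_i over all k classes, divided by m (assumed)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return sum(correct_per_class) / total


# Made-up counts: 3 classes, 100 data items, 90 of them clustered accurately.
g1 = clustering_accuracy([40, 30, 20], total=100)
```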
In a preferred embodiment, a plurality of scoring units are adopted to score each data evaluation index;
the scoring units are independent of the data evaluation indexes;
the weight coefficient corresponding to each scoring unit is obtained as follows:
Figure BDA0001544188630000111
Figure BDA0001544188630000112
wherein
Figure BDA0001544188630000113
denotes the mean of the scores given by the i-th scoring unit to the data evaluation indexes, σ_i denotes the standard deviation of the scores given by the i-th scoring unit to the data evaluation indexes, CV_i denotes the coefficient of variation of the i-th scoring unit's scores, WE_i denotes the normalized weight coefficient of the i-th scoring unit, and k_e denotes the number of scoring units.
In the above technical solution, the scoring units are mainly used for scoring in data mining, and the number N of scoring units is preferably greater than or equal to 5.
The scores for the different indexes may be set as follows:
the score of an unimportant index lies in [A1, A2), that of a secondary index in [A2, A3), that of a general index in [A3, A4), that of an important index in [A4, A5), and that of an indispensable index in [A5, A6],
wherein A6 > A5 > A4 > A3 > A2 > A1.
Specific values may be set as A1 = 0, A2 = 20, A3 = 40, A4 = 60, A5 = 80, and A6 = 100.
In a preferred embodiment, the method for obtaining the composite score of each data evaluation index according to the weight coefficient corresponding to each scoring unit is as follows:
Figure BDA0001544188630000114
wherein SC_j denotes the composite score of the j-th data evaluation index, and S_ij denotes the score given by the i-th scoring unit to the j-th data evaluation index.
In a preferred embodiment, the normalized weight coefficient of each of the data evaluation indexes is obtained according to the data evaluation index comprehensive score, as follows:
Figure BDA0001544188630000121
wherein w_j denotes the normalized weight coefficient of the j-th data evaluation index, and k_s denotes the number of data evaluation indexes.
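The weighting chain (scoring-unit weights, composite scores, normalized index weights) can be sketched as follows. Several forms are assumptions because the formula images are missing: CV_i is taken as σ_i divided by the unit's mean score, WE_i as CV_i divided by the sum of all CV_i, SC_j as the WE-weighted sum of S_ij over scoring units, and w_j as SC_j divided by the sum of all SC_j. In particular, whether a larger coefficient of variation should raise or lower a unit's weight is not confirmed by the text. The function name `index_weights` is hypothetical.

```python
import numpy as np


def index_weights(scores) -> np.ndarray:
    """scores[i][j] is the score S_ij given by scoring unit i to evaluation index j."""
    s = np.asarray(scores, dtype=float)
    cv = s.std(axis=1) / s.mean(axis=1)   # CV_i per scoring unit (population std)
    we = cv / cv.sum()                    # WE_i: ASSUMED normalization over k_e units
    sc = we @ s                           # SC_j = sum over i of WE_i * S_ij
    return sc / sc.sum()                  # w_j = SC_j / sum over j of SC_j


# Two made-up scoring units that agree index 0 matters twice as much as index 1.
w = index_weights([[80, 40], [60, 30]])
```

If every scoring unit gives the same score to all indexes, all CV_i are zero and the normalization is undefined; a real implementation would need a fallback such as equal unit weights.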
Preferably, the data minerability assessment model is as follows:
S = (B5 - B1) × Σ_{j=1}^{k_s} (w_j × R_j)
where S is the composite data mineability score of the data set, w_j is the normalized weight coefficient of the j-th data evaluation index, R_j is the score of the j-th data evaluation index, and B5 and B1 are set parameters.
In the above technical solution, the composite score of the data mineability evaluation can be divided into four grades. A fourth-grade score lies in [B1, B2) and indicates that the data quality is poor: the data are not suitable for subsequent data mining and analysis, or a useful analysis result is difficult to obtain from such work;
a third-grade score lies in [B2, B3) and indicates that the data quality is average: subsequent data mining and analysis can be carried out only after standardized data preprocessing;
a second-grade score lies in [B3, B4) and indicates that the data quality is relatively good: subsequent data mining and analysis can be carried out after simple data preprocessing;
a first-grade score lies in [B4, B5] and indicates that the data quality is good: subsequent data mining and analysis can be carried out directly, without data preprocessing, wherein B5 > B4 > B3 > B2 > B1.
For example, the parameters may be set as B1 = 0, B2 = 60, B3 = 80, B4 = 90, and B5 = 100.
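The four-grade division can be expressed as a small lookup. The sketch below uses the example thresholds B1 = 0, B2 = 60, B3 = 80, B4 = 90, B5 = 100 as defaults; the function name `grade` is hypothetical.

```python
from typing import Sequence


def grade(score: float, b: Sequence[float] = (0, 60, 80, 90, 100)) -> int:
    """Map a composite score to grade 1 (best) through 4 (worst)."""
    b1, b2, b3, b4, b5 = b
    if not b1 <= score <= b5:
        raise ValueError("score outside [B1, B5]")
    if score >= b4:
        return 1   # [B4, B5]: mining can proceed directly
    if score >= b3:
        return 2   # [B3, B4): simple preprocessing first
    if score >= b2:
        return 3   # [B2, B3): standardized preprocessing first
    return 4       # [B1, B2): not suitable for mining


grade_of_example = grade(89.05)   # falls in [80, 90), so grade 2
```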
The following is a description of a specific embodiment:
First, a sample data set is obtained. The sample data set is assumed to come from DCS (distributed control system) data recording the operating parameters of a power plant's steam turbine generator unit over a two-month period. The data are sampled at 1-minute intervals; because of missing values and other causes, the data contain 87769 rows in total. The data comprise 362 variables, such as generator active power, generator reactive power, generator frequency, stator coil outlet water temperature, stator coil cooling water flow, excitation-end cold hydrogen temperature, and hydrogen purity.
The subsequent data mining task for the sample data mainly analyzes the trend of the generator's stator coil outlet water temperature.
The specific implementation flow is as follows:
confirming that the data set to be evaluated is 87769 × 362 data set, wherein the number of data pieces is 87769, and the number of variables is 362;
and selecting a proper data evaluation index from the data evaluation index library aiming at a specific data mining and analyzing task.
Let us assume that the evaluation index selected in this example is; accuracy index, integrity index, reliability index, redundancy index and regression index;
and selecting and calculating an evaluation rule corresponding to the index in the data evaluation index rule base. The results of calculating the data in this example are as follows:
the accuracy rate R1 ═ 93.15%, the integrity rate R2 ═ 87.63%, the reliability rate R3 ═ 88.01%, the redundancy rate R4 ═ 80.39%, and the regression rate R5 ═ 91.66%.
The weight coefficients of the evaluation indexes are calculated using scoring units. In the weighting strategy, N = 5, where N is the number of scoring units;
A1 = 0, A2 = 20, A3 = 40, A4 = 60, A5 = 80, A6 = 100, where A denotes the score value corresponding to a data evaluation index;
B1 = 0, B2 = 60, B3 = 80, B4 = 90, B5 = 100, where B denotes the set parameters;
the weight coefficients of the five evaluation indexes are calculated as w1 = 0.20, w2 = 0.13, w3 = 0.19, w4 = 0.15 and w5 = 0.33, respectively.
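The weighting chain described in claims 8 to 10 (scoring-unit scores, coefficient of variation, unit weights, composite index scores, normalized index weights) could look roughly like the sketch below. The claim formulas are only available as images, so every normalization step here is an assumption, marked as such in the comments. The score matrix is hypothetical and is not the data behind the weights reported in this example.

```python
from statistics import mean, pstdev


def index_weights(scores):
    """scores[i][j] is the score given by scoring unit i to index j."""
    # Coefficient of variation of each scoring unit: sigma_i / mean_i.
    cv = [pstdev(row) / mean(row) for row in scores]
    # ASSUMPTION: unit weights WE_i are the CVs normalized to sum to 1.
    cv_total = sum(cv)
    we = [c / cv_total for c in cv]
    # ASSUMPTION: composite score SC_j is the WE-weighted sum over units.
    n_units, n_indexes = len(scores), len(scores[0])
    sc = [sum(we[i] * scores[i][j] for i in range(n_units))
          for j in range(n_indexes)]
    # ASSUMPTION: index weights w_j are the SC_j normalized to sum to 1.
    sc_total = sum(sc)
    return [s / sc_total for s in sc]


# Hypothetical scores from 5 scoring units for 5 indexes (values 0..100).
scores = [
    [60, 40, 60, 40, 100],
    [80, 40, 60, 60, 100],
    [60, 40, 60, 40, 80],
    [60, 60, 80, 60, 100],
    [40, 20, 40, 40, 100],
]
w = index_weights(scores)
print([round(x, 2) for x in w])
```

Whatever the exact formulas, the property that matters downstream is that the index weights are positive and sum to 1, as the reported values 0.20, 0.13, 0.19, 0.15 and 0.33 do.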
The data minerability evaluation model is set to S = (B5 - B1) × (w1 × R1 + w2 × R2 + w3 × R3 + w4 × R4 + w5 × R5), and the comprehensive minerability evaluation score of the target data set calculated with this model is S = 89.05.
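The arithmetic of this model can be checked with a few lines of Python. The rates, weights and set parameters are the ones reported in this example, with the rates written as fractions. The function and variable names are hypothetical.

```python
B1, B5 = 0, 100  # set parameters from this example

# Rates R1..R5 from this example, expressed as fractions.
rates = [0.9315, 0.8763, 0.8801, 0.8039, 0.9166]
# Weight coefficients w1..w5 from this example.
weights = [0.20, 0.13, 0.19, 0.15, 0.33]


def composite_score(rates, weights, b_low=B1, b_high=B5):
    """S = (B5 - B1) * sum_j(w_j * R_j)."""
    return (b_high - b_low) * sum(w * r for w, r in zip(weights, rates))


S = composite_score(rates, weights)
print(round(S, 2))  # 89.05
```

The weighted sum of the five rates is 0.890501, so scaling by B5 - B1 = 100 reproduces the reported score of 89.05.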
The calculated comprehensive score is then compared against the four-tier division rule for the data minerability evaluation comprehensive score (first tier [0, 60], second tier [60, 80], third tier [80, 90] and fourth tier [90, 100]). In this example S = 89.05 falls in the third tier [80, 90]. The recommendation for subsequent mining and analysis of data at this tier indicates that the data quality is good, and subsequent data mining and analysis can proceed after simple data preprocessing.
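The tier lookup can be sketched as follows. The boundaries reuse the set parameters B1 to B5 from this example. Treating them as tier boundaries with a top interval of [90, 100], and sending a score that lies exactly on a shared boundary to the higher tier, are assumptions, since the text does not say how ties are resolved. Tiers are numbered from the lowest interval upward, so [80, 90] is number 3 in this sketch. `tier` is a hypothetical helper name.

```python
from bisect import bisect_right

BOUNDS = [0, 60, 80, 90, 100]  # B1..B5 from this example


def tier(score, bounds=BOUNDS):
    """Return the 1-based index of the interval that contains score."""
    if not bounds[0] <= score <= bounds[-1]:
        raise ValueError("score outside the scoring range")
    # ASSUMPTION: a score on a shared boundary goes to the higher tier;
    # the top boundary itself stays in the last tier.
    return min(bisect_right(bounds, score), len(bounds) - 1)


print(tier(89.05))  # 3
```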
Although the integrity rate R2 (87.63%) and the non-redundancy rate R4 (80.39%) of the data set are lower than the other indexes, the weighting strategy still yields a high final data minerability evaluation score for the target data set. The missing data reflected by the integrity index and the duplicated data reflected by the redundancy index therefore do not need extensive treatment; simply deleting rows with missing values and standardizing the data set is enough before the subsequent data mining and analysis.
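A minimal sketch of that light preprocessing is shown below, using only the Python standard library. The text does not specify which standardization is meant, so z-score scaling per variable is an assumption, as is representing a missing value by `None`. The function names and sample rows are hypothetical.

```python
from statistics import mean, pstdev


def drop_missing_rows(rows):
    """Keep only rows in which every value is present (not None)."""
    return [row for row in rows if all(v is not None for v in row)]


def standardize(rows):
    """Z-score each column: (value - column mean) / column std."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    # A constant column has zero spread; map it to 0.0 instead of dividing.
    return [[(v - m) / s if s else 0.0 for v, m, s in zip(row, mus, sigmas)]
            for row in rows]


raw = [[1.0, 10.0], [None, 20.0], [3.0, 30.0]]
clean = standardize(drop_missing_rows(raw))
print(clean)  # [[-1.0, -1.0], [1.0, 1.0]]
```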
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (11)

1. A mining ability evaluation method based on industrial equipment data, characterized in that a data evaluation index library and a data evaluation index rule base are provided;
the data evaluation index rule base comprises a plurality of data evaluation rules;
the data evaluation index library comprises a plurality of data evaluation indexes, and each data evaluation index corresponds to a weight coefficient;
the method specifically comprises the following steps:
step S1, providing a data set to be evaluated;
step S2, selecting the corresponding data evaluation index from the data evaluation index library according to the data set;
step S3, obtaining the data evaluation rules corresponding to the data evaluation indexes one to one from the data evaluation index rule base;
step S4, acquiring the weight coefficient corresponding to each data evaluation index;
step S5, establishing a data minerability evaluation model according to the weight coefficient corresponding to each data evaluation index and the data evaluation rule corresponding to each data evaluation index;
the data evaluation index comprises a data minerability evaluation index, and the data minerability evaluation rule corresponding to the data minerability evaluation index comprises:
when the data minerability index is a regression index, the data minerability evaluation rule corresponding to the data minerability index is a regressive rate $R_5$, and the regressive rate $R_5$ is calculated as follows:
[Formula image FDA0003373149220000011]
$R_5 = e_1$
wherein $e_1$ denotes the goodness of fit of the regression, $y$ denotes the dependent variable in the data set, $\hat{y}$ denotes the predicted value of $y$, $y_i$ is the value of the dependent variable $y$ in row $i$, $\bar{y}$ is the mean of $y$, $\hat{y}_i$ is the value of $\hat{y}$ in row $i$, and $\bar{\hat{y}}$ is the mean of $\hat{y}$.
2. The method of claim 1, wherein the dataset is a matrix of m x n;
where m represents the number of data pieces and n represents the number of variables.
3. The method of claim 1, wherein the data evaluation indexes in the data evaluation index library comprise data quality evaluation indexes;
the evaluation rules in the data evaluation index rule base comprise data quality evaluation rules, and the evaluation rule corresponding to a data quality evaluation index is a data quality evaluation rule.
4. The method according to claim 3, wherein the data quality evaluation rule corresponding to the data quality evaluation index includes:
when the data quality evaluation index is an accuracy index, the corresponding data quality evaluation rule is an accuracy rate $R_1$, and the accuracy rate $R_1$ is calculated as follows:
[Formula image FDA0003373149220000021]
wherein $a_1$ denotes the data anomaly rate, $a_2$ denotes the data non-compliance rate, $h_o$ denotes the number of abnormal data items in the data set, $h_c$ denotes the number of non-compliant data items in the data set, $r$ denotes the total number of data items in the data set, and $p$ denotes the number of indexes used; and/or
when the data quality evaluation index is an integrity index, the corresponding data quality evaluation rule is an integrity rate $R_2$, and the integrity rate $R_2$ is calculated as follows:
[Formula image FDA0003373149220000022]
[Formula image FDA0003373149220000023]
wherein $b_1$ denotes the missing value rate, $b_2$ denotes the missing variable rate, $b_3$ denotes the missing timestamp rate, $h_m$ denotes the number of missing data items in the data set, $h_v$ denotes the number of missing variables in the data set, and $h_t$ denotes the number of missing timestamps in the data set.
5. The method according to claim 3, wherein the data quality evaluation rule corresponding to the data quality evaluation index includes:
when the data quality evaluation index is a reliability index, the corresponding data quality evaluation rule is a reliability rate $R_3$, and the reliability rate $R_3$ is calculated as follows:
[Formula image FDA0003373149220000024]
$R_3 = 1 - c_1$
wherein $c_1$ denotes the data out-of-range rate, and $h_r$ denotes the number of out-of-range data items in the data set; and/or
when the data quality evaluation index is a redundancy index, the corresponding data quality evaluation rule is a non-redundancy rate $R_4$, and the non-redundancy rate $R_4$ is calculated as follows:
[Formula image FDA0003373149220000031]
[Formula image FDA0003373149220000032]
[Formula image FDA0003373149220000033]
[Formula image FDA0003373149220000034]
wherein $cc_{uv}$ denotes the correlation between a variable $u$ and a variable $v$ in the data set, $u_i$ denotes the value of variable $u$ in row $i$, $\bar{u}$ denotes the mean of variable $u$, $\sigma_u$ denotes the standard deviation of variable $u$, $v_i$ denotes the value of variable $v$ in row $i$, $\bar{v}$ denotes the mean of variable $v$, $\sigma_v$ denotes the standard deviation of variable $v$, $d_1$ denotes the data repetition rate, $d_2$ denotes the variable correlation rate, $d_3$ denotes the invalid variable rate, $h_q$ denotes the number of duplicate data items in the data set, and $h_s$ denotes the number of invalid variables in the data set.
6. The method according to claim 1, wherein the data minerability assessment rule corresponding to the data minerability index includes:
when the data minerability index is a classification index, the corresponding data minerability evaluation rule is a classifiability rate $R_6$, and the classifiability rate $R_6$ is calculated as follows:
[Formula image FDA0003373149220000037]
$R_6 = f_1$
wherein $f_1$ denotes the classification accuracy, $y$ denotes the class labels in the data set, $\hat{y}$ denotes the predicted class of $y$, $y_i$ is the value of $y$ in row $i$, and $\hat{y}_i$ is the value of $\hat{y}$ in row $i$.
7. The method according to claim 1, wherein the data minerability assessment rule corresponding to the data minerability index includes:
when the data minerability index is a clustering index, the corresponding data minerability evaluation rule is a clusterability rate $R_7$, and the clusterability rate $R_7$ is calculated as follows:
[Formula image FDA0003373149220000041]
$R_7 = g_1$
wherein $g_1$ denotes the clustering accuracy, $k$ is the total number of classes of data in the data set, and $V_i$ denotes the number of data items clustered correctly for the $i$-th class of samples.
8. The data minerability assessment method according to claim 1, wherein a plurality of scoring units are used to obtain a score for each of the data assessment indexes;
each of the scoring units is independent of the data evaluation index;
the weight coefficient corresponding to each scoring unit is obtained as follows:
[Formula image FDA0003373149220000042]
[Formula image FDA0003373149220000043]
wherein [symbol image FDA0003373149220000044] denotes the mean of the scores given by the $i$-th scoring unit to the data evaluation indexes, $\sigma_i$ denotes the standard deviation of the scores given by the $i$-th scoring unit to the data evaluation indexes, $CV_i$ denotes the coefficient of variation of the scoring results of the scoring unit, $WE_i$ denotes the normalized weight coefficient of each scoring unit, and $k_e$ denotes the number of scoring units.
9. The method of claim 8, wherein the method of obtaining the composite score of each data evaluation index according to the weighting coefficient corresponding to each scoring unit is as follows:
[Formula image FDA0003373149220000045]
wherein $SC_j$ denotes the composite score of each data evaluation index, and $S_{ij}$ denotes the score given by the $i$-th scoring unit to the $j$-th data evaluation index.
10. The method of claim 8, wherein the normalized weight coefficient of each of the data evaluation indexes is obtained according to the data evaluation index composite score as follows:
[Formula image FDA0003373149220000051]
wherein $w_j$ denotes the normalized weight coefficient of each data evaluation index, and $k_s$ denotes the number of data evaluation indexes.
11. The method of claim 8, wherein the data minerability assessment model is as follows:
[Formula image FDA0003373149220000052]
wherein $S$ is the data minerability composite score of the data set, $R_j$ denotes the score of the $j$-th data evaluation index, and $B_5$ and $B_1$ are set parameters.
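For orientation, the three data minerability rates in claims 1, 6 and 7 could be computed along the lines of the sketch below. The claim formulas for $e_1$, $f_1$ and $g_1$ exist only as images, so each definition here is an assumption inferred from the symbol lists: $e_1$ as the squared Pearson correlation between $y$ and $\hat{y}$, $f_1$ as the share of correctly predicted rows, and $g_1$ as the sum of $V_i$ divided by the number of rows. All function names and sample values are hypothetical.

```python
def regression_fit(y, y_hat):
    """ASSUMED e_1: squared Pearson correlation of y and its prediction."""
    n = len(y)
    y_bar = sum(y) / n
    yh_bar = sum(y_hat) / n
    cov = sum((a - y_bar) * (b - yh_bar) for a, b in zip(y, y_hat))
    ss_y = sum((a - y_bar) ** 2 for a in y)
    ss_h = sum((b - yh_bar) ** 2 for b in y_hat)
    return cov ** 2 / (ss_y * ss_h)


def classification_accuracy(labels, predicted):
    """ASSUMED f_1: share of rows whose predicted class equals the label."""
    hits = sum(1 for a, b in zip(labels, predicted) if a == b)
    return hits / len(labels)


def clustering_accuracy(correct_per_class, total_rows):
    """ASSUMED g_1: sum of V_i over the k classes divided by the row count."""
    return sum(correct_per_class) / total_rows


print(regression_fit([1, 2, 3, 4], [2, 4, 6, 8]))
print(classification_accuracy(list("abac"), list("abcc")))
print(clustering_accuracy([40, 30, 20], 100))
```

Each function returns a value between 0 and 1, so its result can be used directly as $R_5$, $R_6$ or $R_7$ in the weighted model of claim 11.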
CN201810023192.6A 2018-01-10 2018-01-10 Mining ability evaluation method based on industrial equipment data Active CN108197280B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810023192.6A CN108197280B (en) 2018-01-10 2018-01-10 Mining ability evaluation method based on industrial equipment data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810023192.6A CN108197280B (en) 2018-01-10 2018-01-10 Mining ability evaluation method based on industrial equipment data

Publications (2)

Publication Number Publication Date
CN108197280A CN108197280A (en) 2018-06-22
CN108197280B true CN108197280B (en) 2022-05-13

Family

ID=62588653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810023192.6A Active CN108197280B (en) 2018-01-10 2018-01-10 Mining ability evaluation method based on industrial equipment data

Country Status (1)

Country Link
CN (1) CN108197280B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109144986A (en) * 2018-07-30 2019-01-04 上海电气集团股份有限公司 A kind of importance appraisal procedure of industrial equipment data
CN111126622B (en) * 2019-12-19 2023-11-03 中国银联股份有限公司 Data anomaly detection method and device
CN112114571B (en) * 2020-09-24 2021-11-30 中冶赛迪重庆信息技术有限公司 Industrial data processing method, system and equipment
CN112465001A (en) * 2020-11-23 2021-03-09 上海电气集团股份有限公司 Classification method and device based on logistic regression
CN112488528A (en) * 2020-12-01 2021-03-12 东莞中国科学院云计算产业技术创新与育成中心 Data set processing method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103247008A (en) * 2013-05-07 2013-08-14 国家电网公司 Quality evaluation method of electricity statistical index data
CN105976120A (en) * 2016-05-17 2016-09-28 全球能源互联网研究院 Electric power operation monitoring data quality assessment system and method
CN106599230A (en) * 2016-12-19 2017-04-26 北京天元创新科技有限公司 Method and system for evaluating distributed data mining model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130024739A (en) * 2011-08-31 2013-03-08 성균관대학교산학협력단 System and method for analyzing experience in real time

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于规则库的数据质量评估方法;刘芳;《计算机系统应用》;20171130;165-169页 *

Also Published As

Publication number Publication date
CN108197280A (en) 2018-06-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant