CN110544047A - Bad data identification method - Google Patents

Bad data identification method

Info

Publication number
CN110544047A
Authority
CN
China
Prior art keywords
data
clustering
algorithm
mean square error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910854363.4A
Other languages
Chinese (zh)
Inventor
娄建楼
贾俊奇
曲朝阳
李燕
孙博
王蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Electric Power University
Original Assignee
Northeast Dianli University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Dianli University
Priority to CN201910854363.4A
Publication of CN110544047A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q 10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q 50/06 Electricity, gas or water supply

Abstract

The invention discloses a bad data identification method comprising the following steps: S1, determining the initial cluster number for the PAM algorithm using an agglomerative hierarchical clustering algorithm together with the real Index model evaluation metric; S2, clustering the normal data with the PAM algorithm and calculating the mean square error of each class to obtain the class mean-square-error range of the normal data; S3, clustering the data to be tested with the gap statistic algorithm and obtaining the result; S4, comparing the cluster number of the data to be tested with the cluster number of the normal data obtained by the HC-Center clustering algorithm: if they are consistent, there is no bad data; otherwise, the mean square error of each class is calculated and checked against the class mean-square-error range of the normal data, and any class falling outside that range is regarded as bad data. The method overcomes the drawback that the PAM algorithm requires the initial cluster number to be set manually, improves clustering accuracy, and can cluster data efficiently and accurately.

Description

Bad data identification method
Technical Field
The invention relates to a bad data identification method, and in particular to a bad data identification method based on HC-Center clustering and the gap statistic algorithm.
Background
At present, information technology is developing rapidly and data is generated continuously. As the volume of data grows, data quality also attracts great concern. Bad data in the general sense refers to false data, inconsistent data, missing data and distorted data, but in actual production the bad data most worth attention are measured values carrying large measurement errors. The existence of such bad data may mislead decision makers and in turn affect the normal operation of the whole production system. A common example is the data generated by a thermal power generating unit: the large volume of data accurately reflects the operating state of every subsystem of the unit and hides information of great value for operation optimization. The information carried by these data keeps the whole unit running safely, stably and efficiently so that economic benefit is maximized, and identifying bad data therefore has notable significance and value.
In past research on bad data identification, scholars at home and abroad have proposed many identification methods drawing on data analysis, control theory, artificial-intelligence algorithms and other disciplines. Huang S et al. proposed a conventional bad data detection method based on estimation identification, but it can cause residual contamination and residual submersion during identification. Monsanta et al. provided a Spark-based method that identifies bad data with a parallel K-means algorithm; it addresses the lack of single-machine computing resources and the inability of the MapReduce framework to handle frequent iterative computation when traditional clustering algorithms are applied to massive high-dimensional data, but the K-means clustering algorithm cannot intelligently pre-judge the number of clusters.
Disclosure of the Invention
The main aim of the invention is to provide a bad data identification method based on HC-Center clustering and the gap statistic algorithm.
The technical scheme adopted by the invention is as follows. A bad data identification method comprises the following steps:
S1, determining the initial cluster number for the PAM algorithm using an agglomerative hierarchical clustering algorithm together with the real Index model evaluation metric;
S2, clustering the normal data with the PAM algorithm and calculating the mean square error of each class to obtain the class mean-square-error range of the normal data;
S3, clustering the data to be tested with the gap statistic algorithm and obtaining the result;
S4, comparing the cluster number of the data to be tested with the cluster number of the normal data obtained by the HC-Center clustering algorithm: if they are consistent, there is no bad data; otherwise, the mean square error of each class is calculated and checked against the class mean-square-error range of the normal data, and any class falling outside that range is regarded as bad data.
Further, the step S2 comprises:
S201, taking each sample point in the data set as an independent cluster;
S202, selecting suitable parameters for the Lance-Williams formula, calculating the proximity between clusters, and merging the two clusters with the minimum distance;
S203, recalculating the cluster centers;
S204, setting a threshold and judging whether the clustering meets the requirement using the real Index model evaluation metric;
S205, if the requirement is met, proceeding to step S206; otherwise, repeating steps S202, S203 and S204;
S206, taking the number of categories K obtained by the clustering as the initial cluster number for the PAM algorithm;
S207, randomly selecting K sample points of the data set as the central sample points of the PAM clusters, and assigning each non-central sample point to the cluster represented by its nearest central sample point according to a suitable distance formula;
S208, calculating the sum of the distances from the non-central sample points to their central sample points (the initial cost), replacing the central points one by one with non-central sample points, re-partitioning the clusters, and computing the total cost of each replacement round with the cost function;
S209, if the cost after a non-central sample point replaces a central sample point is smaller than the cost before the replacement, replacing that central sample point with the non-central sample point to form a new set of K central sample points;
S210, repeating steps S208 and S209 until the set of central sample points no longer changes;
S211, obtaining the final clustering result and calculating the mean square error of each class.
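The steps S201 to S211 can be summarized in the following minimal Python sketch. It assumes average linkage as the Lance-Williams parameter choice, uses a simple merge-distance threshold as a stand-in for the real Index evaluation of step S204, and takes the mean square error of each class about its medoid; the function names and the threshold value are illustrative and not taken from the patent.

```python
# Minimal sketch of the HC-Center procedure (steps S201-S211).
# Assumptions made for this sketch: average linkage as the Lance-Williams
# parameter choice, a simple merge-distance threshold in place of the
# patent's real Index check (S204), and the per-class mean square error
# taken about each medoid. Function names and the threshold are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

def estimate_k(data, threshold=1.0):
    """S201-S206: agglomerative clustering; cut where the merge distance exceeds the threshold."""
    merges = linkage(data, method="average")           # Lance-Williams (average linkage)
    labels = fcluster(merges, t=threshold, criterion="distance")
    return len(np.unique(labels))

def pam(data, k, max_iter=100, seed=0):
    """S207-S210: PAM (k-medoids) starting from k randomly chosen medoids."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(data), size=k, replace=False)
    for _ in range(max_iter):
        dist = cdist(data, data[medoids])               # distances to the current medoids
        labels = dist.argmin(axis=1)
        cost = dist.min(axis=1).sum()                   # S208: initial cost of this round
        best = medoids.copy()
        for ci in range(k):                             # S208-S209: try replacing each medoid
            for candidate in np.where(labels == ci)[0]:
                trial = medoids.copy()
                trial[ci] = candidate
                trial_cost = cdist(data, data[trial]).min(axis=1).sum()
                if trial_cost < cost:                   # keep the cheaper configuration
                    cost, best = trial_cost, trial
        if np.array_equal(best, medoids):               # S210: medoid set no longer changes
            break
        medoids = best
    labels = cdist(data, data[medoids]).argmin(axis=1)
    return medoids, labels

def hc_center(data, threshold=1.0):
    """Full HC-Center: estimate K, run PAM, return the mean square error of each class (S211)."""
    data = np.asarray(data, dtype=float)
    k = estimate_k(data, threshold)
    medoids, labels = pam(data, k)
    mse = [float(np.mean((data[labels == ci] - data[medoids[ci]]) ** 2)) for ci in range(k)]
    return k, labels, mse
```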
Furthermore, the bad data identification method further comprises the following steps:
For the normal data set, parameters are first selected for the Lance-Williams formula, agglomerative hierarchical clustering is performed and a threshold is defined; the real Index model evaluation metric is applied to the clustering result and compared with the threshold; if the metric does not meet the threshold requirement, the above steps are repeated, and once it does, the resulting number of categories K is taken as the initial cluster number for the subsequent PAM algorithm. K cluster centers are then selected randomly and the points are classified, the initial cost from every point to the K central points is calculated, the total cost of replacing each central sample point by each non-central sample point is computed with the cost function, and central and non-central sample points are swapped accordingly; the termination condition is that no central sample point is replaced any more, and after the clustering result is obtained the mean square error of each class is calculated. Meanwhile, the data to be tested are normalized in preprocessing to obtain squared-error data. F groups of reference distribution data sets are generated, the initial cluster number K is set to 1, the data set to be tested and the reference distribution data sets are clustered iteratively, and the cluster dispersion of the data to be tested and the corresponding expected value over the reference distribution data sets are calculated to obtain the gap value Gap(K); finally, the simulation error S(K) of the standard deviation generated by each group of reference distribution data sets is calculated. If Gap(K) < Gap(K+1) - S(K+1), K = K + 1 is set and the gap statistic steps are repeated; once Gap(K) >= Gap(K+1) - S(K+1), the current K is compared with the cluster number obtained by the HC-Center clustering algorithm on the normal data: if they are equal, the data to be tested contain no bad data; otherwise the mean square error of each class is calculated and it is judged whether it lies within the class mean-square-error range of the normal data, and if not, the data in that class are judged to be bad data.
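The gap statistic branch described above can be sketched as follows. It assumes uniform reference data sets drawn over the bounding box of the data and a plain k-means routine as the iterative clusterer for both the data to be tested and the references; these choices and the function names are illustrative, since the patent only specifies F reference distribution data sets and iterative clustering.

```python
# Sketch of the gap statistic selection of the cluster number for the data
# to be tested. Assumptions made for this sketch: uniform reference sets
# drawn over the bounding box of the data and plain k-means as the iterative
# clusterer; function names are illustrative and not taken from the patent.
import numpy as np
from scipy.cluster.vq import kmeans2

def within_dispersion(data, k):
    """Cluster dispersion W_K: sum of squared distances to the assigned centroid."""
    centroids, labels = kmeans2(data, k, minit="++")
    return np.sum((data - centroids[labels]) ** 2)

def gap_statistic_k(data, k_max=10, n_refs=10, seed=0):
    """Return the smallest K with Gap(K) >= Gap(K+1) - S(K+1)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 2):
        log_wk = np.log(within_dispersion(data, k))
        ref_log_w = []
        for _ in range(n_refs):                          # the F reference distribution data sets
            ref = rng.uniform(lo, hi, size=data.shape)
            ref_log_w.append(np.log(within_dispersion(ref, k)))
        ref_log_w = np.asarray(ref_log_w)
        gaps.append(ref_log_w.mean() - log_wk)           # Gap(K)
        s.append(ref_log_w.std() * np.sqrt(1.0 + 1.0 / n_refs))  # simulation error S(K)
    for k in range(1, k_max + 1):
        if gaps[k - 1] >= gaps[k] - s[k]:                # Gap(K) >= Gap(K+1) - S(K+1)
            return k
    return k_max
```

The returned K is then compared with the cluster number obtained by HC-Center clustering of the normal data, as described above.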
The invention has the following advantages:
The invention provides the HC-Center clustering algorithm, which combines the respective advantages of the agglomerative hierarchical clustering algorithm and the PAM algorithm, overcomes the drawback that the PAM algorithm requires the initial cluster number to be set manually, and improves clustering accuracy.
The traditional gap statistic algorithm is combined with the proposed HC-Center clustering algorithm to give a new bad data identification method whose main advantages are a higher identification accuracy for bad data and a lower probability of misjudgment and false alarm.
The HC-Center clustering algorithm, based on the agglomerative hierarchical clustering algorithm and the PAM algorithm, can cluster data efficiently and accurately.
In addition to the objects, features and advantages described above, the invention has other objects, features and advantages, which will become apparent from the following detailed description taken in conjunction with the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is a flow chart of a bad data identification method according to the present invention;
FIG. 2 is a flow chart of the HC-Center clustering algorithm of the present invention;
FIG. 3 is a comparison graph of the accuracy test of the present invention;
FIG. 4 is a graph of the results of preprocessing of normal data according to the present invention;
FIG. 5 is a graph comparing Gap(K) with Gap(K+1) - S(K+1) for the normal data of the present invention;
FIG. 6 is a comparison graph of normal data false positive rate of the present invention;
FIG. 7 is a graph of the results of data preprocessing with a single bad data according to the present invention;
FIG. 8 is a graph comparing Gap(K) with Gap(K+1) - S(K+1) for a single bad data point of the present invention;
FIG. 9 is a comparison graph of single bad data recognition rate of the present invention;
FIG. 10 is a graph of the results of data preprocessing with multiple bad data according to the present invention;
FIG. 11 is a graph comparing Gap(K) with Gap(K+1) - S(K+1) for multiple bad data according to the present invention;
FIG. 12 is a graph comparing a plurality of bad data identification rates according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
Referring to FIG. 1, a bad data identification method comprises the following steps:
S1, determining the initial cluster number for the PAM algorithm using an agglomerative hierarchical clustering algorithm together with the real Index model evaluation metric;
S2, clustering the normal data with the PAM algorithm and calculating the mean square error of each class to obtain the class mean-square-error range of the normal data;
S3, clustering the data to be tested with the gap statistic algorithm and obtaining the result;
S4, comparing the cluster number of the data to be tested with the cluster number of the normal data obtained by the HC-Center clustering algorithm: if they are consistent, there is no bad data; otherwise, the mean square error of each class is calculated and checked against the class mean-square-error range of the normal data, and any class falling outside that range is regarded as bad data.
Referring to FIG. 2, the step S2 comprises:
S201, taking each sample point in the data set as an independent cluster;
S202, selecting suitable parameters for the Lance-Williams formula, calculating the proximity between clusters, and merging the two clusters with the minimum distance;
S203, recalculating the cluster centers;
S204, setting a threshold and judging whether the clustering meets the requirement using the real Index model evaluation metric;
S205, if the requirement is met, proceeding to step S206; otherwise, repeating steps S202, S203 and S204;
S206, taking the number of categories K obtained by the clustering as the initial cluster number for the PAM algorithm;
S207, randomly selecting K sample points of the data set as the central sample points of the PAM clusters, and assigning each non-central sample point to the cluster represented by its nearest central sample point according to a suitable distance formula;
S208, calculating the sum of the distances from the non-central sample points to their central sample points (the initial cost), replacing the central points one by one with non-central sample points, re-partitioning the clusters, and computing the total cost of each replacement round with the cost function;
S209, if the cost after a non-central sample point replaces a central sample point is smaller than the cost before the replacement, replacing that central sample point with the non-central sample point to form a new set of K central sample points;
S210, repeating steps S208 and S209 until the set of central sample points no longer changes;
S211, obtaining the final clustering result and calculating the mean square error of each class.
The bad data identification method further comprises the following steps:
For the normal data set, parameters are first selected for the Lance-Williams formula, agglomerative hierarchical clustering is performed and a threshold is defined; the real Index model evaluation metric is applied to the clustering result and compared with the threshold; if the metric does not meet the threshold requirement, the above steps are repeated, and once it does, the resulting number of categories K is taken as the initial cluster number for the subsequent PAM algorithm. K cluster centers are then selected randomly and the points are classified, the initial cost from every point to the K central points is calculated, the total cost of replacing each central sample point by each non-central sample point is computed with the cost function, and central and non-central sample points are swapped accordingly; the termination condition is that no central sample point is replaced any more, and after the clustering result is obtained the mean square error of each class is calculated. Meanwhile, the data to be tested are normalized in preprocessing to obtain squared-error data. F groups of reference distribution data sets are generated, the initial cluster number K is set to 1, the data set to be tested and the reference distribution data sets are clustered iteratively, and the cluster dispersion of the data to be tested and the corresponding expected value over the reference distribution data sets are calculated to obtain the gap value Gap(K); finally, the simulation error S(K) of the standard deviation generated by each group of reference distribution data sets is calculated. If Gap(K) < Gap(K+1) - S(K+1), K = K + 1 is set and the gap statistic steps are repeated; once Gap(K) >= Gap(K+1) - S(K+1), the current K is compared with the cluster number obtained by the HC-Center clustering algorithm on the normal data: if they are equal, the data to be tested contain no bad data; otherwise the mean square error of each class is calculated and it is judged whether it lies within the class mean-square-error range of the normal data, and if not, the data in that class are judged to be bad data.
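The final comparison step can be expressed as the following short sketch. The function name, its inputs and the use of the class mean as the class center are illustrative assumptions; the patent only states that the per-class mean square error is checked against the class mean-square-error range of the normal data.

```python
# Sketch of the final decision step: compare the cluster number of the data
# to be tested with that of the normal data, then flag the classes whose
# mean square error falls outside the normal range. The function name, its
# inputs and the use of the class mean as the class center are assumptions.
import numpy as np

def class_mse(data, labels):
    """Mean square error of each class about its own class center."""
    return {int(c): float(np.mean((data[labels == c] - data[labels == c].mean(axis=0)) ** 2))
            for c in np.unique(labels)}

def identify_bad_classes(k_test, labels_test, data_test, k_normal, normal_mse_range):
    """Return the labels of classes judged to contain bad data (empty list if none)."""
    if k_test == k_normal:                    # cluster numbers agree: no bad data
        return []
    lo, hi = normal_mse_range                 # class mean-square-error range of the normal data
    mse = class_mse(np.asarray(data_test, dtype=float), np.asarray(labels_test))
    return [c for c, v in mse.items() if not (lo <= v <= hi)]
```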
Experimental results and analysis:
To verify the performance of the method on a large data set, data from the relevant equipment of a combustion system, recorded on November 15, 2018 in the DCS system of a certain power plant, are selected. Measurement data of the relevant equipment of the thermal power generating unit's combustion system are sampled once per minute; the total sampling time is 6 hours, giving 360 groups of measurement data, i.e. 156,960 data items. Coal mill A comprises 36 attribute values, coal mill B 36, coal mill C 24, coal mill D 24, coal mill E 24, the low passing wall temperature 25, secondary air door 1 36, secondary air door 2 32, the partition screen wall temperature 8, high passing wall temperature 1 24, high passing wall temperature 2 25, rear screen passing wall temperature 1 24, rear screen passing wall temperature 2 21, screen passing wall temperature 1 28, screen passing wall temperature 2 28, main parameter 1 24, and main parameter 2 9. All attribute values are numbered uniformly in the experiment to prevent confusion.
Purpose of the experiment:
In order to verify the novelty and correctness of the proposed theory, the following experiments are set up. The first experiment is an accuracy test of the HC-Center clustering algorithm, which verifies the accuracy advantage of the proposed HC-Center clustering algorithm by comparison with some existing clustering algorithms of the same type. The second experiment is a simulation experiment that identifies bad data by combining the HC-Center clustering algorithm with the gap statistic algorithm. Using the data of the relevant equipment of the thermal power generating unit's combustion system, a bad data identification simulation experiment verifies the novelty and correctness of the proposed bad data identification method and its higher accuracy compared with identification methods of the same type.
Results and analysis:
HC-Center accuracy analysis:
The HC-Center clustering algorithm proposed by the invention is a relatively stable clustering algorithm and has a considerable accuracy advantage over other clustering algorithms of the same type. In this experiment, the proposed HC-Center clustering algorithm, the K-Medoids algorithm and the traditional K-Means algorithm are applied in turn to three data sets from the combustion system: coal mill A (1.34 GB of data), secondary air door 1 (1.34 GB of data) and the high passing wall temperature (813.52 MB of data).
In the experiments of the invention, the Jaccard similarity coefficient is adopted as the measure of accuracy; its calculation formula is as follows:
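J(A, B) = |A ∩ B| / |A ∪ B|, i.e. the size of the intersection of the two sets divided by the size of their union.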
The Jaccard coefficient expresses the similarity of two sets as a ratio: the size of the intersection of the sample sets divided by the size of their union. In this experiment, the Jaccard coefficient equals the ratio of the correctly clustered data to the whole data set to be clustered.
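As an illustration of how this accuracy measure can be computed, the following sketch maps each predicted cluster to its majority reference class and takes the Jaccard ratio between the correctly clustered items and the full data set; the majority-vote mapping is an assumption made for this sketch, since the patent does not spell out how correct cluster membership is counted.

```python
# Illustrative sketch of the accuracy measure used in this experiment: map
# each predicted cluster to its majority reference class and take the Jaccard
# ratio between the correctly clustered items and the full data set. The
# majority-vote mapping is an assumption; integer class labels are assumed.
import numpy as np

def jaccard_accuracy(reference_labels, predicted_labels):
    reference = np.asarray(reference_labels)
    predicted = np.asarray(predicted_labels)
    correct = set()
    for cluster in np.unique(predicted):
        members = np.where(predicted == cluster)[0]
        majority = np.bincount(reference[members]).argmax()
        correct.update(members[reference[members] == majority].tolist())
    all_items = set(range(len(reference)))
    return len(correct & all_items) / len(correct | all_items)   # |A n B| / |A u B|

# Example: 5 of the 6 items land in the majority class of their cluster.
print(jaccard_accuracy([0, 0, 0, 1, 1, 1], [2, 2, 2, 5, 5, 2]))  # 0.833...
```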
FIG. 3 illustrates the accuracy advantage of the HC-Center clustering algorithm over the other two similar clustering methods on different data sets.
For the three data sets used in the experiment, the identification accuracy of the HC-Center clustering algorithm is higher than that of the traditional K-Means and K-Medoids algorithms. The main reason is that the proposed HC-Center clustering algorithm provides a relatively accurate initial cluster number through the agglomerative hierarchical clustering algorithm and the real Index model evaluation metric, so that the subsequent PAM algorithm has a more detailed and accurate clustering basis, giving higher identification accuracy than the other two algorithms of the same type.
Bad data identification simulation experiment:
Before the bad data identification simulation experiment, the data to be tested in the data set must be normalized. A BP neural network is trained with 300 groups of data and tested with the remaining 60 groups; the difference between the input and the output of the neural network for the data to be tested is obtained, and its square (i.e. the squared error) is taken as the basis of the gap statistic algorithm.
The 147th group of data in the data set was verified in advance with a bad data identification method based on state estimation (the robust least squares method) to ensure that it is a group of normal data containing no bad data. The preprocessing results of the 147th group of data are shown in FIG. 4.
Since the proposed HC-Center clustering algorithm is the preparatory work of the bad data identification simulation experiment, HC-Center clustering is performed on the 147th group of data and the mean square error of each class is calculated; the results are shown in Table 1 below.
TABLE 1 HC-Center clustering results for normal data
1) No bad data
To address misjudgment and similar problems in the bad data identification process, the same bad data identification experiment is first performed on normal data. In the clustering of the 147th group of data, the values of Gap(K) and Gap(K+1) - S(K+1) corresponding to different numbers of clusters are shown in Table 2 below, and their comparison is shown in FIG. 5.
TABLE 2 GSA algorithm results for normal data
As can be seen from FIG. 5, when Gap(K) >= Gap(K+1) - S(K+1) first holds, the optimal cluster number obtained by the gap statistic algorithm is 3, which equals the cluster number obtained by HC-Center clustering of the normal data, i.e. the 147th group of data contains no bad data. Verifying the mean square error of each class also confirms that the 147th group of data contains no bad data, which illustrates the correctness of the proposed bad data identification method on normal data.
On the basis of the original algorithm, the proposed bad data identification method based on the HC-Center clustering algorithm and the gap statistic algorithm takes the clustering result of the normal data and its class mean-square-error values as the reference during identification, which greatly reduces the misjudgment rate of bad data identification. For the case without bad data, FIG. 6 compares, over a number of tests, the misjudgment rate of the traditional GSA identification method, the improved GSA identification method, and the proposed bad data identification method based on the HC-Center algorithm and the gap statistic algorithm.
2) Single bad data
In the simulation experiment for identifying a single bad data point in the relevant equipment of the thermal power generating unit's combustion system, a single bad data point is first placed in the 147th group of data; in this experiment, the data item numbered 89 is the bad data. The preprocessing results of the 147th group of data containing a single bad data point are shown in FIG. 7. The values of Gap(K) and Gap(K+1) - S(K+1) for different cluster numbers during the clustering of this group are shown in Table 3 below, and their comparison is shown in FIG. 8.
TABLE 3 GSA algorithm results with a single bad data point
As can be seen from FIG. 8, the optimal cluster number obtained by the gap statistic algorithm is 4, which does not equal the cluster number obtained by HC-Center clustering of the normal data. The mean square error of each cluster is therefore calculated, and since the 4th cluster exceeds the mean-square-error range of the normal data after HC-Center clustering, the data in the 4th cluster are judged to be bad data. A check confirms that data item No. 89 lies in the 4th cluster, which illustrates the correctness of the proposed bad data identification method when identifying a single bad data point.
On the basis of the original algorithm, the proposed bad data identification method based on the HC-Center clustering algorithm and the gap statistic algorithm adds a comparison with the HC-Center cluster number of the normal data and a per-class mean-square-error calculation, which greatly improves the identification accuracy for bad data. Compared with the traditional GSA identification method and the improved GSA identification method of the same type, the proposed method achieves a far higher identification accuracy for a single bad data point. For the single-bad-data case, FIG. 9 compares, over a number of tests, the accuracy of the traditional GSA identification method, the improved GSA identification method and the proposed bad data identification method based on the HC-Center algorithm and the gap statistic algorithm.
3) Multiple bad data
In the simulation experiment for identifying multiple bad data in the relevant equipment of the thermal power generating unit's combustion system, multiple bad data are first placed in the 147th group of data; in this experiment, the data items numbered 52, 89, 101, 277, 301, 367 and 421 are the bad data. The preprocessing results of the 147th group of data containing multiple bad data are shown in FIG. 10; the values of Gap(K) and Gap(K+1) - S(K+1) for different cluster numbers during clustering are shown in Table 4, and their comparison is shown in FIG. 11.
TABLE 4 GSA Algorithm results with multiple bad data
As can be seen from FIG. 11, the optimal cluster number obtained by the gap statistic algorithm is 6, which does not equal the cluster number obtained by HC-Center clustering of the normal data. The mean square error of each cluster is calculated, and the 2nd, 5th and 6th clusters exceed the mean-square-error range of the normal data after HC-Center clustering, so the data in the 2nd, 5th and 6th clusters are judged to be bad data. A check confirms that the data items numbered 52, 89, 101, 277, 301, 367 and 421 fall in these clusters, which illustrates the correctness of the proposed bad data identification method when identifying multiple bad data.
Conclusion:
For the problem of bad data identification, and in view of the undefined initial cluster centers, poor clustering effect and low clustering accuracy of traditional clustering algorithms, the invention introduces the real Index model evaluation metric, integrates the advantages of the agglomerative hierarchical clustering algorithm and the PAM algorithm, and proposes the HC-Center clustering algorithm; experiments verify that the algorithm has good applicability and a clear advantage in accuracy. On this basis, the invention compares the cluster number of the normal data obtained by the HC-Center clustering algorithm with the optimal cluster number of the data to be tested obtained by the gap statistic algorithm, and identifies bad data using the class mean-square-error range as the judgment criterion. Experiments on combustion-system data of a thermal power generating unit under three conditions (no bad data, a single bad data point and multiple bad data) show that the method identifies bad data more accurately and objectively while avoiding residual contamination and residual submersion. With the arrival of the digital age and the further development of technology, the method can be further combined with a big-data environment to further improve its processing efficiency and accuracy in identifying bad data.
The PAM algorithm is improved by means of the agglomerative hierarchical clustering algorithm and the real Index model evaluation metric, the drawbacks of an undetermined initial cluster number and low clustering accuracy are overcome, and the HC-Center clustering algorithm is creatively proposed. For the problems of poor universality, lack of a reference for identification results and low identification accuracy in existing bad data identification methods, the invention fuses the gap statistic algorithm with the HC-Center clustering algorithm to provide a new bad data identification method:
(1) Determine the initial cluster number for the PAM algorithm using an agglomerative hierarchical clustering algorithm together with the real Index model evaluation metric.
(2) Cluster the normal data with the PAM algorithm and calculate the mean square error of each class to obtain the class mean-square-error range of the normal data.
(3) Cluster the data to be tested with the gap statistic algorithm to obtain the result.
(4) Compare the cluster number of the data to be tested with the cluster number of the normal data obtained by the HC-Center clustering algorithm. If they are consistent, there is no bad data; otherwise, the mean square error of each class is calculated and it is judged whether it lies within the class mean-square-error range of the normal data; if not, the data in that class are regarded as bad data.
The method processes the normal data set and the data set to be tested separately, and obtains the bad data in the data set by comparing the HC-Center clustering result of the normal data with the gap statistic result of the data to be tested.
For the normal data set, parameters are first selected for the Lance-Williams formula, agglomerative hierarchical clustering is performed and a threshold is defined; the real Index model evaluation metric is applied to the clustering result and compared with the threshold; if the metric does not meet the threshold requirement, the above steps are repeated, and once it does, the resulting number of categories K is taken as the initial cluster number for the subsequent PAM algorithm. K cluster centers are then selected randomly and the points are classified, the initial cost from every point to the K central points is calculated, the total cost of replacing each central sample point by each non-central sample point is computed with the cost function, and central and non-central sample points are swapped, with the termination condition that no central sample point is replaced any more; the mean square error of each class is calculated after the clustering result is obtained. Meanwhile, the data to be tested are normalized in preprocessing to obtain squared-error data. F groups of reference distribution data sets are generated, the initial cluster number K is set to 1, the data set to be tested and the reference distribution data sets are clustered iteratively, and the cluster dispersion of the data to be tested and the corresponding expected value over the reference distribution data sets are calculated to obtain the gap value Gap(K).
Finally, the simulation error S(K) of the standard deviation generated by each group of reference distribution data sets is calculated. If Gap(K) < Gap(K+1) - S(K+1), K = K + 1 is set and the steps of the gap statistic algorithm are repeated. Once Gap(K) >= Gap(K+1) - S(K+1), the current K is compared with the cluster number obtained by the HC-Center clustering algorithm on the normal data: if they are equal, the data to be tested contain no bad data; otherwise the mean square error of each class is calculated and it is judged whether it lies within the class mean-square-error range of the normal data, and if not, the data in that class are judged to be bad data.
For the problems exposed by existing bad data identification methods, the invention provides the HC-Center clustering algorithm, which combines the advantages of the agglomerative hierarchical clustering algorithm and the PAM algorithm, overcomes the drawback that the PAM algorithm requires the initial cluster number to be set manually, and improves clustering accuracy.
The traditional gap statistic algorithm is combined with the proposed HC-Center clustering algorithm to give a new bad data identification method whose main advantages are a higher identification accuracy for bad data and a lower probability of misjudgment and false alarm.
The HC-Center clustering algorithm, based on the agglomerative hierarchical clustering algorithm and the PAM algorithm, can cluster data efficiently and accurately.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A bad data identification method, characterized by comprising the following steps:
S1, determining the initial cluster number for the PAM algorithm using an agglomerative hierarchical clustering algorithm together with the real Index model evaluation metric;
S2, clustering the normal data with the PAM algorithm and calculating the mean square error of each class to obtain the class mean-square-error range of the normal data;
S3, clustering the data to be tested with the gap statistic algorithm and obtaining the result;
S4, comparing the cluster number of the data to be tested with the cluster number of the normal data obtained by the HC-Center clustering algorithm: if they are consistent, there is no bad data; otherwise, the mean square error of each class is calculated and checked against the class mean-square-error range of the normal data, and any class falling outside that range is regarded as bad data.
2. The bad data identification method of claim 1, wherein the step S2 comprises:
S201, taking each sample point in the data set as an independent cluster;
S202, selecting suitable parameters for the Lance-Williams formula, calculating the proximity between clusters, and merging the two clusters with the minimum distance;
S203, recalculating the cluster centers;
S204, setting a threshold and judging whether the clustering meets the requirement using the real Index model evaluation metric;
S205, if the requirement is met, proceeding to step S206; otherwise, repeating steps S202, S203 and S204;
S206, taking the number of categories K obtained by the clustering as the initial cluster number for the PAM algorithm;
S207, randomly selecting K sample points of the data set as the central sample points of the PAM clusters, and assigning each non-central sample point to the cluster represented by its nearest central sample point according to a suitable distance formula;
S208, calculating the sum of the distances from the non-central sample points to their central sample points (the initial cost), replacing the central points one by one with non-central sample points, re-partitioning the clusters, and computing the total cost of each replacement round with the cost function;
S209, if the cost after a non-central sample point replaces a central sample point is smaller than the cost before the replacement, replacing that central sample point with the non-central sample point to form a new set of K central sample points;
S210, repeating steps S208 and S209 until the set of central sample points no longer changes;
S211, obtaining the final clustering result and calculating the mean square error of each class.
3. The bad data identification method of claim 1, further comprising the following steps:
For the normal data set, parameters are first selected for the Lance-Williams formula, agglomerative hierarchical clustering is performed and a threshold is defined; the real Index model evaluation metric is applied to the clustering result and compared with the threshold; if the metric does not meet the threshold requirement, the above steps are repeated, and once it does, the resulting number of categories K is taken as the initial cluster number for the subsequent PAM algorithm; K cluster centers are then selected randomly and the points are classified, the initial cost from every point to the K central points is calculated, the total cost of replacing each central sample point by each non-central sample point is computed with the cost function, and central and non-central sample points are swapped, with the termination condition that no central sample point is replaced any more; the mean square error of each class is calculated after the clustering result is obtained; meanwhile, the data to be tested are normalized in preprocessing to obtain squared-error data; F groups of reference distribution data sets are generated, the initial cluster number K is set to 1, the data set to be tested and the reference distribution data sets are clustered iteratively, and the cluster dispersion of the data to be tested and the corresponding expected value over the reference distribution data sets are calculated to obtain the gap value Gap(K); finally, the simulation error S(K) of the standard deviation generated by each group of reference distribution data sets is calculated; if Gap(K) < Gap(K+1) - S(K+1), K = K + 1 is set and the gap statistic steps are repeated; once Gap(K) >= Gap(K+1) - S(K+1), the current K is compared with the cluster number obtained by the HC-Center clustering algorithm on the normal data: if they are equal, the data to be tested contain no bad data; otherwise the mean square error of each class is calculated and it is judged whether it lies within the class mean-square-error range of the normal data, and if not, the data in that class are judged to be bad data.
CN201910854363.4A 2019-09-10 2019-09-10 Bad data identification method Pending CN110544047A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910854363.4A CN110544047A (en) 2019-09-10 2019-09-10 Bad data identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910854363.4A CN110544047A (en) 2019-09-10 2019-09-10 Bad data identification method

Publications (1)

Publication Number Publication Date
CN110544047A true CN110544047A (en) 2019-12-06

Family

ID=68713366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910854363.4A Pending CN110544047A (en) 2019-09-10 2019-09-10 Bad data identification method

Country Status (1)

Country Link
CN (1) CN110544047A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016250A (en) * 2020-08-31 2020-12-01 华北电力大学 Flue gas SCR denitration system bad data identification method
CN113256096A (en) * 2021-05-18 2021-08-13 西华大学 Power grid fault diagnosis method considering false data injection attack
CN113673551A (en) * 2021-06-30 2021-11-19 国网山东省电力公司营销服务中心(计量中心) Method and system for identifying bad data of electric power metering
CN114490836A (en) * 2022-04-15 2022-05-13 国网天津市电力公司电力科学研究院 Data mining processing method suitable for electric vehicle charging fault

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056136A (en) * 2016-05-20 2016-10-26 浙江工业大学 Data clustering method for rapidly determining clustering center
CN106101102A (en) * 2016-06-15 2016-11-09 华东师范大学 A kind of exception flow of network detection method based on PAM clustering algorithm
CN108182257A (en) * 2017-12-29 2018-06-19 东北电力大学 A kind of GSA bad data detection and identification methods based on the optimization of areal concentration statistical method
CN109936848A (en) * 2019-03-01 2019-06-25 广东工业大学 A kind of detection method, device and the computer readable storage medium of puppet access point
GB201910401D0 (en) * 2019-07-19 2019-09-04 Centrica Plc System for distributed data processing using clustering

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056136A (en) * 2016-05-20 2016-10-26 浙江工业大学 Data clustering method for rapidly determining clustering center
CN106101102A (en) * 2016-06-15 2016-11-09 华东师范大学 A kind of exception flow of network detection method based on PAM clustering algorithm
CN108182257A (en) * 2017-12-29 2018-06-19 东北电力大学 A kind of GSA bad data detection and identification methods based on the optimization of areal concentration statistical method
CN109936848A (en) * 2019-03-01 2019-06-25 广东工业大学 A kind of detection method, device and the computer readable storage medium of puppet access point
GB201910401D0 (en) * 2019-07-19 2019-09-04 Centrica Plc System for distributed data processing using clustering

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAO MING: "Ten methods for determining the optimal clustering data", HTTPS://WWW.CNBLOGS.COM/THINK90/P/7133753.HTML *
CHEN YAPING ET AL.: "Combining the FCM clustering algorithm with an improved hierarchical clustering algorithm", SCIENCE TECHNOLOGY AND ENGINEERING *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016250A (en) * 2020-08-31 2020-12-01 华北电力大学 Flue gas SCR denitration system bad data identification method
CN113256096A (en) * 2021-05-18 2021-08-13 西华大学 Power grid fault diagnosis method considering false data injection attack
CN113256096B (en) * 2021-05-18 2022-07-12 西华大学 Power grid fault diagnosis method considering false data injection attack
CN113673551A (en) * 2021-06-30 2021-11-19 国网山东省电力公司营销服务中心(计量中心) Method and system for identifying bad data of electric power metering
CN114490836A (en) * 2022-04-15 2022-05-13 国网天津市电力公司电力科学研究院 Data mining processing method suitable for electric vehicle charging fault

Similar Documents

Publication Publication Date Title
WO2022110557A1 (en) Method and device for diagnosing user-transformer relationship anomaly in transformer area
CN110544047A (en) Bad data identification method
Campbell et al. Assessing colour-dependent occupation statistics inferred from galaxy group catalogues
CN109816031B (en) Transformer state evaluation clustering analysis method based on data imbalance measurement
CN112910859B (en) Internet of things equipment monitoring and early warning method based on C5.0 decision tree and time sequence analysis
CN111339297B (en) Network asset anomaly detection method, system, medium and equipment
CN112800231B (en) Power data verification method and device, computer equipment and storage medium
CN112926635B (en) Target clustering method based on iterative self-adaptive neighbor propagation algorithm
CN109298225B (en) Automatic identification model system and method for abnormal state of voltage measurement data
CN111176953B (en) Abnormality detection and model training method, computer equipment and storage medium
CN111046930A (en) Power supply service satisfaction influence factor identification method based on decision tree algorithm
CN114114039B (en) Method and device for evaluating consistency of single battery cells of battery system
CN111950645A (en) Method for improving class imbalance classification performance by improving random forest
CN113537321A (en) Network traffic anomaly detection method based on isolated forest and X-means
CN117478390A (en) Network intrusion detection method based on improved density peak clustering algorithm
CN114330440B (en) Distributed power supply load abnormality identification method and system based on simulation learning discrimination
CN114676749A (en) Power distribution network operation data abnormity judgment method based on data mining
CN116976574A (en) Building load curve dimension reduction method based on two-stage hybrid clustering algorithm
CN114048819A (en) Power distribution network topology identification method based on attention mechanism and convolutional neural network
Wirawan et al. Application of data mining to prediction of timeliness graduation of students (a case study)
CN112884167B (en) Multi-index anomaly detection method based on machine learning and application system thereof
CN107491576B (en) Missile component reliability analysis method based on performance degradation data
CN116541252B (en) Computer room fault log data processing method and device
Li et al. Research on Embedded Multifunctional Data Mining Technology Based on Granular Computing
CN112650770B (en) MySQL parameter recommendation method based on query work load analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination