Disclosure of Invention
In order to overcome the defects of the prior art, the present disclosure provides a method for identifying bad data of electric power metering, which locates inaccurate data points while reducing the misjudgment rate.
In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
in a first aspect, a method for identifying bad data of power metering is disclosed, which includes:
obtaining original electric power metering data and preprocessing the data;
clustering the preprocessed electric power metering data;
judging whether the data to be detected has inter-class similarity with the clustering result of the user to which the data belongs; if so, the data is judged to be accurate; if not, the data is judged to be suspicious and it is further judged whether the data to be detected has smoothness; if so, the data is judged to be accurate; otherwise, the data is judged to be inaccurate, i.e., bad data.
In a further technical solution, when the preprocessed electric power metering data are clustered, the clustering validity index C_c is combined with the k-means clustering algorithm, specifically comprising the following steps:
determining an initial clustering number k value;
selecting k samples from the n samples as initial clustering centers;
calculating the distance between each sample and the clustering center;
dividing the sample again according to the principle of minimum distance, namely the least sum of squared errors;
calculating the mean value of each type of sample as a new clustering center;
ending the iteration if the total change in the cluster centers between two successive iterations is smaller than a threshold value;
calculating the clustering validity index C_c;
selecting different k values, repeating the above steps and calculating the clustering validity index for each, and selecting the number of clusters k with the largest validity index value; this number of clusters and its clustering result are taken as optimal.
In a further technical solution, when judging whether the data to be detected has inter-class similarity, an inter-class similarity index δ(i) is defined:
δ(i) represents the inter-class similarity of the i-th data point on the load curve to be detected, LP_c(i) is the i-th data point on the load curve to be detected, and LP_d(i) is the i-th data point on the typical load curve of the class to which the load belongs. A threshold r is set: when δ(i) ∈ [-r, r], the data point is considered accurate data; otherwise, when δ(i) ∉ [-r, r], the data point is considered suspicious data.
In a further technical solution, the smoothness feature is used to further screen the suspicious data, and the smoothness index can be measured by comparing the suspicious data point with the two points before and after it.
In a further technical solution, assuming that the i-th point on the load curve LP_c is considered suspicious data, the smoothness index ε(i) is defined:
ε(i) represents the smoothness of the i-th data point on the load curve to be detected. As with the similarity index, a threshold u is set: when ε(i) ∈ [-u, u], the data point is considered accurate data; otherwise, when ε(i) ∉ [-u, u], the suspicious data point is identified as inaccurate data.
In a further embodiment, the thresholds r and u may be determined based on operational experience.
In a further technical solution, the method further comprises evaluating the accuracy of the metering data:
the accuracy of the metering data can be measured by comparing the number of inaccurate data points with the number of all sampled data points.
In a second aspect, a system for identifying bad data in power metering is disclosed, which comprises:
a power metering data acquisition module configured to: obtaining original electric power metering data and preprocessing the data;
a power metering data clustering module configured to: clustering the preprocessed electric power metering data;
a bad data determination module configured to: judge whether the data to be detected has inter-class similarity with the clustering result of the user to which the data belongs; if so, judge the data to be accurate; if not, further judge whether the data to be detected has smoothness; if so, judge the data to be accurate; otherwise, judge the data to be inaccurate, i.e., bad data.
The above one or more technical solutions have the following beneficial effects:
The invention introduces the correlation coefficient into clustering validity evaluation, uses the correlation coefficient to represent the distance between samples, and defines the clustering validity index C_c. Compared with the commonly used Xie-Beni index, which uses the Euclidean distance to calculate the intra-cluster and inter-cluster distances, the validity index C_c measures the effectiveness of the clustering algorithm from another perspective.
The clustering validity index provided by the invention yields different values for different clustering samples and different numbers of clusters. When C_c is at its maximum, the clustering effect is best. Therefore, the invention combines the traditional k-means clustering algorithm with the calculation of the validity index C_c, so that the optimal number of clusters can be determined by iteration.
The invention provides a method for identifying bad data of electric power metering based on intra-class similarity and self-smoothness. By combining the judgment of intra-class similarity between the load curve to be detected and the typical curve with the judgment of the smoothness of the load curve to be detected, misjudgments caused by relying on a single criterion can be reduced to a certain extent. By providing a quantitative accuracy index, accuracy, one of the quality characteristics of power metering data, is quantified and expressed more intuitively.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Example one
The embodiment discloses a method for identifying bad data of electric power metering, which firstly introduces a clustering effect evaluation index based on a correlation coefficient:
There are various evaluation indexes for clustering validity; fundamentally, they all judge validity quantitatively from the intra-class distance and the inter-class distance of the clustering result. A good clustering result should make the distance between classes as large as possible and the distance from each sample to its cluster center as small as possible. The invention introduces the correlation coefficient into clustering validity evaluation and uses it to represent the distance between samples, thereby constructing a new clustering validity evaluation index C_c. C_c uses the intra-class correlation coefficient and the inter-class correlation coefficient to reflect intra-class similarity and inter-class similarity at the same time. The index is defined as follows:
First, the intra-class correlation coefficient is defined as:
$$\alpha_{ci} = \frac{\sum_{b=1}^{m}(x_{cb}-\bar{x}_{c})(x_{ib}-\bar{x}_{i})}{\sqrt{\sum_{b=1}^{m}(x_{cb}-\bar{x}_{c})^{2}\,\sum_{b=1}^{m}(x_{ib}-\bar{x}_{i})^{2}}}$$
where α_ci represents the intra-class correlation coefficient of the i-th load curve in class c, x_cb represents the b-th data point on the typical load curve of the c-th clustering result, x̄_c represents the mean of the points on the class-c typical load curve, x_ib represents the b-th data point on the i-th curve in class c, x̄_i represents the mean of the i-th curve in class c, and m is the number of data points on each load curve.
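As an illustration of the intra-class coefficient, the following minimal sketch assumes that the coefficient is the ordinary Pearson correlation between a member curve and the typical curve of its class, which is what the symbol definitions above suggest. The function names `pearson` and `intra_class_coefficients` are hypothetical and do not come from the disclosure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length curves."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("curves must have the same length (at least 2 points)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A perfectly flat curve has no defined correlation.
        raise ValueError("correlation is undefined for a constant curve")
    return cov / math.sqrt(var_x * var_y)


def intra_class_coefficients(typical_curve, member_curves):
    """One coefficient per member curve, each measured against the class's typical curve."""
    return [pearson(typical_curve, curve) for curve in member_curves]
```

A member curve with the same shape as the typical curve scores close to 1 regardless of its magnitude, which is why a correlation-based measure suits shape-based load clustering.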
The inter-class correlation coefficient is defined as:
$$\beta_{cj} = \frac{\sum_{b=1}^{m}(x_{cb}-\bar{x}_{c})(x_{jb}-\bar{x}_{j})}{\sqrt{\sum_{b=1}^{m}(x_{cb}-\bar{x}_{c})^{2}\,\sum_{b=1}^{m}(x_{jb}-\bar{x}_{j})^{2}}}$$
where β_cj represents the inter-class correlation coefficient between the typical load curve of the c-th cluster and the typical load curve of the j-th cluster, x_cb represents the b-th data point on the typical load curve of the c-th cluster, x̄_c represents the mean of the data on the class-c typical load curve, x_jb represents the b-th data point on the class-j typical load curve, and x̄_j represents the mean of the data on the class-j typical load curve.
The clustering validity index C_c is defined:
where n is the total number of samples and k_c represents the number of samples contained in the c-th cluster. The correlation coefficient is no greater than 1 in absolute value, and a value closer to 1 indicates a stronger correlation between the two curves. The numerator is the sum of the intra-class correlation coefficients α_ci, the denominator max(β_cj) is the maximum of the inter-class correlation coefficients, and dividing the former by the latter gives the clustering validity index C_c. When the number of clusters k takes different values, the index C_c takes different values. The larger C_c is, the better the clustering effect it represents, so the optimal number of clusters is the k at which C_c is largest.
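The exact expression for the validity index is not reproduced in the text, so the sketch below rests on an assumed form: for each cluster, the sum of its intra-class coefficients is divided by the largest inter-class coefficient between its typical curve and any other typical curve, and the results are averaged over all n samples. The name `validity_index` and this specific combination are assumptions made for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length curves."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def validity_index(typical_curves, clusters):
    """ASSUMED form: (1/n) * sum over c of (sum_i alpha_ci) / max_{j != c} beta_cj.

    typical_curves: list of k typical (centre) curves.
    clusters: list of k lists, the member curves of each cluster.
    """
    if len(typical_curves) < 2:
        raise ValueError("at least two clusters are needed for an inter-class term")
    n = sum(len(members) for members in clusters)
    total = 0.0
    for c, (centre, members) in enumerate(zip(typical_curves, clusters)):
        alpha_sum = sum(pearson(centre, curve) for curve in members)
        beta_max = max(
            pearson(centre, other)
            for j, other in enumerate(typical_curves)
            if j != c
        )
        if beta_max <= 0:
            # The assumed ratio is only meaningful for positively correlated centres.
            raise ValueError("non-positive inter-class correlation")
        total += alpha_sum / beta_max
    return total / n
```

Under this assumed form the index grows when members track their own typical curve closely (large numerator) and when typical curves of different classes are dissimilar (small denominator), which matches the stated behaviour that a larger value indicates better clustering.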
Improved k-means clustering algorithm
Data normalization
In order to make load curves with large numerical differences comparable during clustering, the data must first be normalized. Because the hourly metering data points are used directly as features during clustering, there is no inconsistency of dimensions, so an extreme-value linear normalization formula is selected, as follows:
where LP(i) represents the raw data of the i-th point on the daily load curve, and the result of the formula is the normalized data of the i-th point on the daily load curve.
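The normalization formula itself is not reproduced in the text. One common reading of extreme-value linear normalization is min-max scaling, and the sketch below assumes that reading; dividing by the daily maximum alone is another plausible variant. The function name `min_max_normalize` and the handling of a flat curve are assumptions.

```python
def min_max_normalize(curve):
    """Scale a daily load curve linearly so its minimum maps to 0 and its maximum to 1."""
    lowest, highest = min(curve), max(curve)
    if highest == lowest:
        # Assumed convention: a flat curve carries no shape information.
        return [0.0 for _ in curve]
    span = highest - lowest
    return [(value - lowest) / span for value in curve]
```

For example, `min_max_normalize([2, 4, 6])` gives `[0.0, 0.5, 1.0]`, so two users with very different consumption levels but the same daily shape end up with the same normalized curve.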
The clustering validity index C_c defined above is combined with the k-means clustering algorithm to establish an improved k-means clustering algorithm in which the value of k is determined more conveniently. Referring to FIG. 1, the basic implementation steps are as follows:
(1) empirically determining a value of k;
(2) selecting k samples from the n samples as initial cluster centers: C_0, C_1, ..., C_{k-1};
(3) calculating the distance between each sample and each cluster center; here a sample is a daily load curve of a power consumer, i.e., a curve made up of one point per hour (24 points per day) or one point per 15 minutes (96 points per day) that reflects how the consumer's electricity consumption changes over time, and it can be obtained from an electric power metering device;
(4) dividing the samples again according to the principle of minimum distance (namely the least sum of squared errors);
(5) calculating the mean of each class of samples as the new cluster center;
(6) ending the iteration if the total change in the cluster centers between two successive iterations is smaller than a threshold value; otherwise, repeating steps (3), (4) and (5) until the condition is met;
(7) calculating the clustering validity index C_c;
(8) returning to step (1), selecting different k values to carry out the above steps, calculating the clustering validity index C_c for each, and selecting the number of clusters k with the largest C_c value; this is taken as the optimal number of clusters and clustering result.
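Steps (1) to (8) can be sketched as an ordinary k-means loop wrapped in a search over candidate k values. This is a minimal sketch under stated assumptions: Euclidean distance for the assignment step, seeded random selection of the initial centers, and a caller-supplied `score_fn` standing in for the calculation of C_c. The names `kmeans`, `choose_k` and `score_fn` are hypothetical.

```python
import math
import random


def kmeans(samples, k, tol=1e-6, max_iter=100, seed=0):
    """Plain k-means on equal-length curves; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = [list(s) for s in rng.sample(samples, k)]  # step (2)
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Steps (3)-(4): assign every sample to its nearest centre.
        labels = [
            min(range(k), key=lambda c: math.dist(s, centres[c]))
            for s in samples
        ]
        # Step (5): each new centre is the mean of its members.
        new_centres = []
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[c])  # keep the centre of an empty cluster
        # Step (6): stop once the centres have effectively stopped moving.
        shift = sum(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, labels


def choose_k(samples, candidate_ks, score_fn, **kmeans_kwargs):
    """Steps (7)-(8): cluster for each k and keep the run with the highest score."""
    best = None
    for k in candidate_ks:
        centres, labels = kmeans(samples, k, **kmeans_kwargs)
        score = score_fn(samples, centres, labels)
        if best is None or score > best[0]:
            best = (score, k, centres, labels)
    return best  # (score, k, centres, labels)
```

Passing the scoring rule as a function keeps the clustering loop independent of how the validity index is computed, so any candidate form of C_c can be plugged in without touching the iteration itself.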
In an embodiment, referring to FIG. 2, the method for identifying bad power metering data based on similarity and smoothness includes:
(1) Bad data discrimination
For the same user, electricity consumption in production and daily life follows a certain pattern on the same type of working day or holiday, i.e., the shapes of the load curves on different dates are similar. The load curve also changes regularly within a day: although electric loads may start suddenly, the change relative to adjacent times is limited, i.e., the load curve has a certain smoothness. Based on this, when discriminating inaccurate data, the similarity feature is used first, and inaccurate data are sought by examining the transverse similarity of the data in ratio form. Suppose LP_d is the typical daily load curve of a certain class and LP_c is a daily load curve to be detected. The inter-class similarity index δ(i) is defined:
δ(i) represents the inter-class similarity of the i-th data point on the load curve to be detected, LP_c(i) is the i-th data point on the load curve to be detected, and LP_d(i) is the i-th data point on the typical load curve of the class to which the load belongs. A threshold r is set: when δ(i) ∈ [-r, r], the data point is considered accurate data; otherwise, when δ(i) ∉ [-r, r], the data point is considered suspicious data.
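The expression for δ(i) is not reproduced in the text. Because the similarity is examined in ratio form and tested against an interval symmetric about zero, the sketch below assumes a relative deviation, δ(i) = (LP_c(i) − LP_d(i)) / LP_d(i). Both this form and the function name `suspicious_points` are assumptions for illustration.

```python
def suspicious_points(measured, typical, r):
    """Indices whose ASSUMED relative deviation from the typical curve falls outside [-r, r]."""
    if len(measured) != len(typical):
        raise ValueError("curves must have the same number of points")
    flagged = []
    for i, (lp_c, lp_d) in enumerate(zip(measured, typical)):
        if lp_d == 0:
            raise ValueError("typical curve value is zero; ratio is undefined")
        delta = (lp_c - lp_d) / lp_d
        if not -r <= delta <= r:
            flagged.append(i)
    return flagged
```

With a typical curve `[1.0, 2.0, 3.0, 4.0]`, a measured curve `[1.0, 2.0, 9.0, 4.0]` and r = 0.2, only index 2 is flagged, and it is flagged as suspicious rather than bad because the smoothness check has not yet been applied.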
The smoothness feature is then applied to further screen the suspicious data. The smoothness index can be measured by comparing the suspicious data point with the two points before and after it. Assuming that the i-th point on the load curve LP_c is considered suspicious data, the smoothness index ε(i) is defined:
ε(i) represents the smoothness of the i-th data point on the load curve to be detected. As with the similarity index, a threshold u is set: when ε(i) ∈ [-u, u], the data point is considered accurate data; otherwise, when ε(i) ∉ [-u, u], the suspicious data point is identified as inaccurate data.
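The two-stage decision can be sketched as follows. Neither formula is reproduced in the text, so both are assumed here: δ(i) is taken as the relative deviation from the typical curve, and ε(i) as the relative deviation of a point from the mean of its two neighbours (or its single neighbour at either end of the curve). The names `smoothness` and `find_bad_points` are hypothetical.

```python
def smoothness(curve, i):
    """ASSUMED epsilon(i): relative deviation of point i from the mean of its neighbours."""
    neighbours = [curve[j] for j in (i - 1, i + 1) if 0 <= j < len(curve)]
    reference = sum(neighbours) / len(neighbours)
    if reference == 0:
        raise ValueError("neighbour mean is zero; ratio is undefined")
    return (curve[i] - reference) / reference


def find_bad_points(measured, typical, r, u):
    """Return indices judged inaccurate by the similarity check followed by the smoothness check."""
    bad = []
    for i, (lp_c, lp_d) in enumerate(zip(measured, typical)):
        # Stage 1: similarity to the typical curve (assumed relative deviation).
        delta = (lp_c - lp_d) / lp_d
        if -r <= delta <= r:
            continue  # accurate data
        # Stage 2: only suspicious points reach the smoothness check.
        epsilon = smoothness(measured, i)
        if not -u <= epsilon <= u:
            bad.append(i)  # fails both criteria: inaccurate (bad) data
    return bad
```

A point that departs from the typical curve but still sits smoothly between its neighbours is kept as accurate, which is how combining the two criteria avoids flagging genuine but unusual consumption.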
In practical applications, the threshold values r and u may be determined empirically by grid operators.
(2) Assessment of the accuracy of metering data
The metering data consist of the values measured at each sampling point. The values at some sampling points may deviate from the true values because of metering device faults, interference with signal acquisition or transmission, and the like, resulting in inaccurate data. The accuracy of the metering data can therefore be measured by comparing the number of inaccurate data points with the number of all sampled data points. Accordingly, the invention defines a metering accuracy index μ to measure the accuracy of the metering data collected each day, defined as follows:
where n_b represents the number of inaccurate data points in one day's load metering data, and N represents the number of all sampling points in that day's load.
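The expression for μ is not reproduced in the text. The sketch below assumes the share of accurate points, μ = 1 − n_b / N; the plain ratio n_b / N would be an equally plausible reading of the description. The function name `accuracy_index` is hypothetical.

```python
def accuracy_index(n_bad, n_total):
    """ASSUMED mu: fraction of a day's sampling points that are not flagged as inaccurate."""
    if n_total <= 0:
        raise ValueError("the day must contain at least one sampling point")
    if not 0 <= n_bad <= n_total:
        raise ValueError("n_bad must lie between 0 and n_total")
    return 1.0 - n_bad / n_total
```

For a 96-point day with 3 inaccurate points, this assumed form gives 0.96875, i.e., about 96.9% of the day's samples are considered accurate.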
(3) Novel clustering effectiveness evaluation index
Clustering algorithms mostly use the Euclidean distance or the Mahalanobis distance when calculating the distance between a sample and a cluster center, and clustering validity is then evaluated with the same distance calculation method. This raises a problem of applicability: if the selected distance calculation method does not suit the objects being clustered, the resulting validity evaluation is not credible. The invention provides a clustering validity evaluation index based on the correlation coefficient, which evaluates the clustering result from a statistical perspective and reduces the one-sidedness of the evaluation, making the evaluation result more reasonable.
The traditional k-means clustering algorithm requires the number of clusters k to be specified manually, and the value of k directly affects the clustering effect. How to determine the optimal number of clusters k has long been an important part of clustering algorithm research. The clustering validity index provided by the invention yields different values for different clustering samples and different numbers of clusters. When C_c is at its maximum, the clustering effect is best. The invention provides an improved k-means clustering algorithm that combines the traditional k-means clustering algorithm with the calculation of the validity index C_c, so that the optimal number of clusters can be determined by iteration and the choice of the number of clusters becomes more objective.
The invention finds inaccurate data points in a user's daily load curve by measuring the intra-class similarity of the daily load curve and its own smoothness. If inaccurate data points are sought using intra-class similarity alone or self-smoothness alone, accurate data may be misjudged as inaccurate; by combining the two criteria, the invention reduces the misjudgment rate.
Example two
It is an object of this embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.
Example three
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.
Example four
The present embodiment aims to provide a system for identifying bad data in power metering, which includes:
a power metering data acquisition module configured to: obtain original electric power metering data and preprocess the data; the main hardware of this part comprises:
various electric energy meters installed at the user side: used for directly recording the user's electricity consumption and load for each hour or each specified time interval;
an electric power metering terminal: used for collecting and uploading the metering data recorded by the meters and for receiving control commands from the upper-level management end;
a transmission network: the network used for transmitting the metered and collected data, including an optical fiber private network, a wireless private network and the like;
a data server: for storage and analysis of historical metering data;
a power metering data clustering module configured to: clustering the preprocessed electric power metering data;
the main hardware of this part comprises:
an application server: used for storing and executing the clustering module program.
a bad data determination module configured to: judge whether the data to be detected has inter-class similarity; if so, judge the data to be accurate; if not, further judge whether the data to be detected has smoothness; if so, judge the data to be accurate; otherwise, judge the data to be inaccurate, i.e., bad data.
The main hardware of this part comprises:
an application server: used for storing and executing the bad data determination module program.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.