CN113673551A

CN113673551A - A method and system for identifying bad data in power metering

Info

Publication number: CN113673551A
Application number: CN202110741482.6A
Authority: CN
Inventors: 陈祉如; 代燕杰; 刘轶娟; 郭亮; 荆臻; 杜艳; 董贤光; 张志�; 赵曦
Original assignee: Shandong University; Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd; State Grid Shandong Electric Power Co Ltd; Marketing Service Center of State Grid Shandong Electric Power Co Ltd
Current assignee: Shandong University; Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd; State Grid Shandong Electric Power Co Ltd; Marketing Service Center of State Grid Shandong Electric Power Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-11-19
Anticipated expiration: 2041-06-30
Also published as: CN113673551B

Abstract

The present disclosure proposes a method and system for identifying bad power metering data, including: obtaining original power metering data and preprocessing it; clustering the preprocessed power metering data; judging whether the data to be detected has inter-class similarity with the clustering result of the user to which it belongs; if so, the data is accurate; if not, further judging whether the data has smoothness; if so, the data is accurate, otherwise the data is inaccurate, i.e. bad data. By proposing an accuracy quantification index, accuracy, one of the quality characteristics of power metering data, is quantified and expressed more intuitively.

Description

Method and system for identifying bad data of electric power metering

Technical Field

The disclosure belongs to the technical field of electric power metering data identification, and particularly relates to a method and a system for identifying bad data of electric power metering.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

In recent years, the power market has entered a stage of rapid development as an important means of optimizing the allocation of power resources, and power metering has become a fundamental link in that development. Electric power metering data contains rich information and is of great significance to the developing power market. By processing and analyzing power metering data, more can be learned about users' electricity consumption patterns, so that a better fitting and substitution scheme can be found when metering data is lost, and more valuable references can be provided for recommending retail packages to users once the retail market develops.

At present, power metering technology is gradually developing toward automation and intelligence, and the quality of electric power metering data has improved greatly compared with the era of manual metering. However, as the demand for electric power in production and daily life in China gradually increases, the quality of metering data is still unstable in actual operation because of metering faults in electric power meters, interference during data acquisition and transmission, and similar causes. The main indexes for measuring the quality of power metering data are integrity, timeliness and accuracy. Well-established methods exist for evaluating integrity and timeliness, but the evaluation method for accuracy, the most important index, is not yet mature.

Documents such as "Liu Li, Wang Gang, et al. Application of k-means clustering algorithm in load curve classification [J]. Power System Protection and Control, 2011, 39(23): 65-68+73" and "Liu Huizhou, Zhou Kaile, Hu Xiaojian. Bad load data identification and correction based on fuzzy load clustering [J]. Chinese Electric Power, 2013, 46(10): 29-34" describe methods for identifying bad data. The former identifies inaccurate data through either transverse similarity or longitudinal smoothness; the latter sets, from experience, the allowable variation range of the load values for each class of load curve and flags data as inaccurate when that range is exceeded. Both give methods for judging bad data, but each relies on a single criterion, so misjudgment may occur in practical application. The present method combines the transverse similarity and longitudinal smoothness of the load curve, considers the problem more comprehensively than the two methods above, and can effectively reduce the misjudgment rate. It is suited to finding inaccurate data caused by sudden changes in individual data points during metering data acquisition, due to electromagnetic interference and the like. It works on the acquired electric power metering data and identifies inaccurate data by analyzing the daily load curve formed by the active power in that data.

In summary, the technical problem with identifying bad power metering data in the prior art is as follows: most existing bad data identification methods need data other than active power, for example the line structure of the system from which the metering data is obtained. Unlike those methods, the present method focuses on users' electricity consumption patterns and habits. The load curve obtained from the electric power metering equipment is independent of the equipment and mode used to acquire the data and of the system's line structure, so the method is faster and more convenient, and inaccurate data can be identified in any situation where a load curve can be obtained.

Disclosure of Invention

In order to overcome the defects of the prior art, the present disclosure provides a method for identifying bad data of electric power metering, which finds inaccurate data points while effectively reducing the misjudgment rate.

In order to achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:

in a first aspect, a method for identifying bad data of power metering is disclosed, which includes:

obtaining original electric power metering data and preprocessing the data;

clustering the preprocessed electric power metering data;

judging whether the data to be detected has inter-class similarity with the clustering result of the user to which the data belongs; if so, the data is judged to be accurate; if not, the data is judged to be suspicious, and it is further judged whether the data to be detected has smoothness; if so, the data is judged to be accurate; otherwise, the data is judged to be inaccurate, i.e. bad data.

In a further technical solution, when the preprocessed electric power metering data is clustered, the clustering validity index C_c is combined with a k-means clustering algorithm, specifically comprising the following steps:

determining an initial clustering number k value;

selecting k samples from the n samples as initial clustering centers;

calculating the distance between each sample and the clustering center;

dividing the sample again according to the principle of minimum distance, namely the least sum of squared errors;

calculating the mean value of each type of sample as a new clustering center;

ending the iteration if the sum of the changes in the cluster centers between two successive iterations is smaller than a threshold;

calculating the clustering validity index C_c;

selecting different values of k, repeating the above steps and calculating the clustering validity index for each, and choosing the cluster number k with the largest validity index value, for which the cluster number and clustering result are considered optimal.

In a further technical solution, when judging whether the data to be detected has inter-class similarity, an inter-class similarity index δ(i) is defined:

where δ(i) represents the inter-class similarity of the ith data point on the load curve to be measured, LP_c(i) is the ith data point on the load curve to be measured, and LP_d(i) is the ith data point on the typical load curve of the class to which the load belongs. A threshold r is set: when δ(i) ∈ [-r, r], the data is considered accurate; otherwise, when δ(i) ∉ [-r, r], the data is considered suspect.

In a further technical solution, the smoothness characteristic is used to further screen the suspect data; the smoothness index can be measured by comparing the suspect data with the data at the points before and after it.

In a further technical solution, assuming that the ith point on the load curve LP_c is considered suspect data, the smoothness index ε(i) is defined:

where ε(i) represents the smoothness of the ith data point on the load curve to be measured. As with the similarity index, a threshold u is set: when ε(i) ∈ [-u, u], the data is considered accurate; otherwise, when ε(i) ∉ [-u, u], the suspect data is identified as inaccurate data.

In a further embodiment, the thresholds r and u may be determined based on operational experience.

In a further technical solution, the method also includes evaluating the accuracy of the metering data:

the accuracy of the metering data can be measured by comparing the number of inaccurate data points with the total number of sampled data points.

In a second aspect, a system for identifying bad data in power metering is disclosed, which comprises:

a power metering data acquisition module configured to: obtaining original electric power metering data and preprocessing the data;

a power metering data clustering module configured to: clustering the preprocessed electric power metering data;

a bad data determination module configured to: judge whether the data to be detected has inter-class similarity with the clustering result of the user to which the data belongs; if so, judge the data to be accurate; if not, further judge whether the data to be detected has smoothness; if so, judge the data to be accurate; otherwise, judge the data to be inaccurate, i.e. bad data.

The above one or more technical solutions have the following beneficial effects:

The invention introduces the correlation coefficient into clustering validity evaluation, uses the correlation coefficient to represent the distance between samples, and defines the clustering validity index C_c. Compared with the commonly used Xie-Beni index, which uses Euclidean distance to calculate the intra-cluster and inter-cluster distances, the validity index C_c measures the effectiveness of the clustering algorithm from a different perspective.

The clustering validity index proposed by the invention takes different values for different clustering samples and different cluster numbers. The clustering effect is best when C_c is at its maximum. The invention therefore combines the traditional k-means clustering algorithm with the calculation of the validity index C_c, so that the optimal cluster number can be determined by iteration.

The invention provides a method for identifying bad electric power metering data based on intra-class similarity and self-smoothness. Combining the intra-class similarity judgment between the load curve to be detected and the typical curve with the smoothness judgment of the load curve to be detected reduces, to a certain extent, the misjudgment caused by a single judgment criterion. By providing an accuracy quantification index, accuracy, one of the quality characteristics of power metering data, is quantified and expressed more intuitively.

Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.

FIG. 1 is a flow chart of an improved k-means clustering algorithm according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of a bad data identification method based on intra-class similarity and self-smoothness according to an embodiment of the disclosure.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.

Example one

This embodiment discloses a method for identifying bad data of electric power metering. It first introduces a clustering effect evaluation index based on the correlation coefficient:

Clustering validity has various evaluation indexes; fundamentally, they all judge validity quantitatively from the intra-class distance and inter-class distance of the clustering result. A good clustering result should have the largest possible distance between classes and the smallest possible distance from samples to their cluster centers. The invention introduces the correlation coefficient into clustering validity evaluation and uses it to represent the distance between samples, thereby constructing a new clustering validity evaluation index C_c. C_c uses the intra-class correlation coefficient and the inter-class correlation coefficient to reflect intra-class similarity and inter-class similarity at the same time. The index is defined as follows:

First, the intra-class correlation coefficient is defined as:

where α_ci represents the intra-class correlation coefficient of the ith load curve in class c, x_cb represents the bth data point on the typical load curve of the cth clustering result, x̄_c represents the mean of the points on the class-c typical load curve, x_ib represents the bth data point on the ith curve in class c, x̄_i represents the mean of the ith curve in class c, and m is the number of data points on each load curve.

The inter-class correlation coefficient is defined as:

where β_cj represents the inter-class correlation coefficient between the typical load curve of the cth cluster and the typical load curve of the jth cluster, x_cb represents the bth data point on the typical load curve of the cth cluster, x̄_c represents the mean of the data on the class-c typical load curve, x_jb represents the bth data point on the class-j typical load curve, and x̄_j represents the mean of the data on the class-j typical load curve.

The clustering validity index C_c is defined as:

where n is the total number of samples and k_c is the number of samples contained in the cth cluster. A correlation coefficient is at most 1, and the closer it is to 1, the stronger the correlation between the two curves. The numerator is the sum of the intra-class correlation coefficients, the denominator max(β_cj) is the maximum of the inter-class correlation coefficients, and the clustering validity index C_c is obtained by dividing the former by the latter. The value of C_c differs as the cluster number k takes different values; the clustering effect is best when C_c is at its maximum, which gives the optimal cluster number.
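The symbol definitions above are consistent with ordinary Pearson correlation coefficients. Under that assumption, the two coefficients and the index can be sketched as follows. The 1/n normalization and the restriction j ≠ c in the maximum are further assumptions rather than statements from the source, and the subscript in C_c is read as a label, not as the cluster index c used in the sum.

```latex
\[
\alpha_{ci} = \frac{\sum_{b=1}^{m} (x_{cb}-\bar{x}_c)(x_{ib}-\bar{x}_i)}
                   {\sqrt{\sum_{b=1}^{m} (x_{cb}-\bar{x}_c)^2}\,\sqrt{\sum_{b=1}^{m} (x_{ib}-\bar{x}_i)^2}}
\]
\[
\beta_{cj} = \frac{\sum_{b=1}^{m} (x_{cb}-\bar{x}_c)(x_{jb}-\bar{x}_j)}
                  {\sqrt{\sum_{b=1}^{m} (x_{cb}-\bar{x}_c)^2}\,\sqrt{\sum_{b=1}^{m} (x_{jb}-\bar{x}_j)^2}}
\]
\[
C_c = \frac{1}{n} \sum_{c=1}^{k} \frac{\sum_{i=1}^{k_c} \alpha_{ci}}{\max_{j \neq c} \beta_{cj}}
\]
```

With this reading, tight clusters (α_ci close to 1) whose typical curves are weakly correlated with each other (small β_cj) give a large C_c, which matches the statement that a larger C_c means a better clustering.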

Improved k-means clustering algorithm

Data normalization

So that load curves with large numerical differences are comparable during clustering, the data must first be normalized. Because the hourly metering data points are used directly as features for clustering, the dimensions are already uniform, and an extreme-value linear normalization formula is chosen:

where LP(i) represents the raw data of the ith point on the daily load curve, and the result of the formula is the normalized data of the ith point on the daily load curve.
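A minimal sketch of this normalization step, assuming that "extreme-value linear normalization" means the usual min-max scaling of each daily curve to [0, 1]. The function name `normalize_curve` and the handling of a flat curve are illustrative assumptions, not taken from the source.

```python
def normalize_curve(lp):
    """Min-max normalize one daily load curve to the range [0, 1].

    Assumed formula: (LP(i) - min(LP)) / (max(LP) - min(LP)).
    """
    lo, hi = min(lp), max(lp)
    if hi == lo:
        # A flat curve has no range to scale by; map it to all zeros (assumption).
        return [0.0 for _ in lp]
    return [(value - lo) / (hi - lo) for value in lp]
```

Scaling each curve by its own extremes keeps the shape of the curve, which is what the clustering compares, while removing differences in absolute consumption between users.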

The clustering validity index C_c defined above is combined with the k-means clustering algorithm to build an improved k-means clustering algorithm in which the value of k is easier to determine. Referring to FIG. 1, the basic implementation steps are as follows:

(1) determine a value of k empirically;

(2) select k samples from the n samples as initial cluster centers: C_0, C_1, ..., C_{k-1};

(3) calculate the distance between each sample and each cluster center; here a sample is a daily load curve of a power consumer, that is, a curve made up of one point per hour (24 points per day) or one point per 15 minutes (96 points per day) that reflects how the consumer's electricity consumption changes over time, and that can be obtained from an electric power metering device;

(4) reassign the samples according to the minimum-distance principle (that is, the minimum sum of squared errors);

(5) calculate the mean of each class of samples as the new cluster center;

(6) if the sum of the changes in the cluster centers between two successive iterations is smaller than a threshold, end the iteration; otherwise, repeat steps (3), (4) and (5) until the condition is met;

(7) calculate the clustering validity index C_c;

(8) return to step (1), select a different value of k and repeat the above steps, calculating the clustering validity index C_c each time; the cluster number k with the largest C_c value is taken as the optimal cluster number, and its result as the optimal clustering result.
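Steps (1) to (8) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the patented implementation: C_c is computed as the per-cluster sum of Pearson correlations with the cluster center, divided by the largest correlation with any other center, averaged over all n samples; the initial centers are evenly spaced samples; and the `floor` guard on the denominator is invented here because the source does not say how non-positive inter-class correlations are handled. All function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length curves."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def kmeans(samples, k, tol=1e-6, max_iter=100):
    """Steps (2)-(6): plain k-means; returns (centers, labels)."""
    n = len(samples)
    # Step (2): evenly spaced samples as initial centers (a deterministic choice, assumed).
    centers = [list(samples[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(max_iter):
        # Steps (3)-(4): assign each curve to the nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(s, centers[c])))
            for s in samples
        ]
        # Step (5): the mean of each cluster becomes its new center.
        new_centers = []
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # empty cluster: keep the old center
        # Step (6): stop once the total movement of the centers is below the threshold.
        shift = sum(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, labels


def validity_index(samples, centers, labels, floor=1e-6):
    """Step (7): assumed form of C_c; needs at least two clusters."""
    total = 0.0
    for c, center in enumerate(centers):
        intra = sum(pearson(center, s) for s, lab in zip(samples, labels) if lab == c)
        inter = max(pearson(center, other) for j, other in enumerate(centers) if j != c)
        # The floor is an assumption that avoids dividing by zero or a negative value.
        total += intra / max(inter, floor)
    return total / len(samples)


def choose_k(samples, candidates):
    """Steps (1) and (8): try each candidate k and keep the one with the largest C_c."""
    best = None
    for k in candidates:
        centers, labels = kmeans(samples, k)
        score = validity_index(samples, centers, labels)
        if best is None or score > best[0]:
            best = (score, k, centers, labels)
    return best
```

A call such as `choose_k(curves, range(2, 6))` returns a tuple `(score, k, centers, labels)`, where `curves` is a list of normalized daily load curves of equal length. The candidate range starts at 2 because the inter-class correlation is undefined for a single cluster.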

In this embodiment, referring to FIG. 2, the method for identifying bad power metering data based on similarity and smoothness includes:

(1) Bad data discrimination

On the same type of working day or holiday, the electricity consumption of a given user in production and daily life follows a certain pattern, so the shapes of the load curves on different dates are similar. The load curve also changes regularly within a day: although electrical loads can start suddenly, the change relative to adjacent times is limited, so the load curve has a certain smoothness. On this basis, the similarity feature is used first when discriminating inaccurate data. Inaccurate data is sought by examining the transverse similarity of the data in the form of a ratio. Suppose LP_d is the typical daily load curve of a certain class and LP_c is a daily load curve to be detected. An inter-class similarity index δ(i) is defined:

where δ(i) represents the inter-class similarity of the ith data point on the load curve to be measured, LP_c(i) is the ith data point on the load curve to be measured, and LP_d(i) is the ith data point on the typical load curve of the class to which the load belongs. A threshold r is set: when δ(i) ∈ [-r, r], the data is considered accurate; otherwise, when δ(i) ∉ [-r, r], the data is considered suspect.

The smoothness feature may then be applied to further screen the suspect data. The smoothness index can be measured by comparing the suspect data with the points before and after it. Assuming that the ith point on the load curve LP_c is considered suspect data, the smoothness index ε(i) is defined:

where ε(i) represents the smoothness of the ith data point on the load curve to be measured. As with the similarity index, a threshold u is set: when ε(i) ∈ [-u, u], the data is considered accurate; otherwise, when ε(i) ∉ [-u, u], the suspect data is identified as inaccurate data.

In practical applications, the threshold values r and u may be determined empirically by grid operators.
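The two-stage judgment above can be sketched as follows. The exact expressions for δ(i) and ε(i) are assumptions chosen to fit the description ("in the form of a ratio", intervals symmetric about zero): δ(i) is taken as the relative deviation of LP_c(i) from LP_d(i), and ε(i) as the relative deviation of LP_c(i) from the mean of its neighbouring points. The function names, the zero-reference guard and the end-point handling are illustrative only.

```python
def relative_deviation(value, reference):
    """(value - reference) / reference, with a guard for a zero reference."""
    if reference == 0:
        return 0.0 if value == 0 else float("inf")
    return (value - reference) / reference


def find_bad_points(lp_c, lp_d, r, u):
    """Return the indices of points on lp_c judged inaccurate.

    lp_c: daily load curve to be checked; lp_d: typical curve of its class;
    r, u: similarity and smoothness thresholds, chosen from operating experience.
    """
    bad = []
    for i, value in enumerate(lp_c):
        # Stage 1, similarity: delta(i) as relative deviation from the typical curve (assumed form).
        delta = relative_deviation(value, lp_d[i])
        if -r <= delta <= r:
            continue  # accurate: close enough to the typical curve
        # Stage 2, smoothness: eps(i) as relative deviation from the mean of the
        # neighbouring points (assumed form; only one neighbour exists at either end).
        neighbours = [lp_c[j] for j in (i - 1, i + 1) if 0 <= j < len(lp_c)]
        eps = relative_deviation(value, sum(neighbours) / len(neighbours))
        if not (-u <= eps <= u):
            bad.append(i)  # suspect and not smooth: inaccurate
    return bad
```

A point is reported only when it fails both checks, so a curve that is uniformly higher than the typical curve but still smooth is not flagged, which is how the combination reduces misjudgment compared with either check alone.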

(2) Assessment of accuracy of metrology data

Metering data consists of the values measured at each sampling point. Some of those values may deviate from the true value because of faults in the metering device, interference with signal acquisition or transmission, and the like, which produces inaccurate data. The accuracy of the metering data can therefore be measured by comparing the number of inaccurate data points with the total number of sampled data points. The invention accordingly defines a metering accuracy index μ to measure the accuracy of the metering data collected each day, defined as follows:

where n_b is the number of inaccurate data points in one day's load metering data, and N is the number of all sampling points in that day's load.
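As a small worked illustration, the index can be computed as below. The form μ = 1 - n_b / N is an assumption (the share of sampling points not flagged as inaccurate); the source only states that μ compares n_b with N. The function name is hypothetical.

```python
def accuracy_index(n_bad, n_total):
    """Assumed form of the accuracy index: mu = 1 - n_b / N."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_bad <= n_total:
        raise ValueError("n_bad must lie between 0 and n_total")
    return 1.0 - n_bad / n_total
```

For a 96-point daily curve with 24 points identified as inaccurate, this form gives μ = 0.75, and a curve with no inaccurate points gives μ = 1.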

(3) Novel clustering effectiveness evaluation index

When calculating the distance between a sample and a cluster center, clustering algorithms mostly choose the Euclidean distance or the Mahalanobis distance, and then evaluate clustering validity with the same distance calculation. This raises a problem of applicability: if the chosen distance calculation does not suit the objects being clustered, the resulting validity evaluation is not credible. The invention provides a clustering effect evaluation index based on the correlation coefficient, which evaluates the clustering result from a statistical point of view and reduces the one-sidedness of the evaluation, making the evaluation result more reasonable.

The traditional k-means clustering algorithm requires the cluster number k to be specified manually, and the value of k directly affects the clustering result. How to determine the optimal cluster number k has long been an important part of research on clustering algorithms. The clustering validity index proposed by the invention takes different values for different clustering samples and cluster numbers, and the clustering effect is best when C_c is at its maximum. The invention provides an improved k-means clustering algorithm that combines the traditional k-means clustering algorithm with the calculation of the validity index C_c, so that the optimal cluster number can be determined by iteration, making the choice of cluster number more scientific and objective.

The invention finds inaccurate data points in a user's daily load curve by measuring both the intra-class similarity of the daily load curve and its own smoothness. Searching for inaccurate data points through intra-class similarity alone or self-smoothness alone may misjudge accurate data as inaccurate; combining the two effectively reduces the misjudgment rate.

Example two

It is an object of this embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the program.

Example three

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.

Example four

The present embodiment aims to provide a system for identifying bad data in power metering, which includes:

a power metering data acquisition module configured to: obtain original electric power metering data and preprocess the data; the main hardware of this part comprises:

electric energy meters of various kinds installed on the user side: directly record the user's electricity consumption and load for each hour or specified time interval;

an electric power metering terminal: collects and uploads the metering data recorded by the meters, and receives control commands from the upper-level management end;

a transmission network: the network over which metering and acquisition data is transmitted, including an optical fiber private network, a wireless private network and the like;

a data server: for storage and analysis of historical metering data;

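a power metering data clustering module configured to: cluster the preprocessed electric power metering data; the main hardware of this part comprises: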

an application server: for clustering module program storage and execution.

a bad data determination module configured to: judge whether the data to be detected has inter-class similarity; if so, judge the data to be accurate; if not, further judge whether the data to be detected has smoothness; if so, judge the data to be accurate; otherwise, judge the data to be inaccurate, i.e. bad data.

The main hardware of this part comprises:

an application server: used for storing and executing the bad data determination module program.

The steps involved in the second, third and fourth embodiments above correspond to the method of the first embodiment; for details, see the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that causes the processor to perform any of the methods of the present disclosure.

Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims

1. A method for identifying bad data of electric power metering is characterized by comprising the following steps:

obtaining original electric power metering data and preprocessing the data;

clustering the preprocessed electric power metering data;

2. The method as claimed in claim 1, wherein, when the preprocessed power metering data is clustered, the clustering validity index C_c is combined with a k-means clustering algorithm, specifically comprising the following steps:

determining an initial clustering number k value;

selecting k samples from the n samples as initial clustering centers;

calculating the distance between each sample and the clustering center;

dividing the samples again according to the principle of minimum distance, namely sum of squares of errors;

calculating the mean value of each type of sample as a new clustering center;

calculating a clustering validity index C_c;

3. The method as claimed in claim 1, wherein, when determining whether the data to be measured has inter-class similarity, an inter-class similarity index δ(i) is defined:

The data is considered suspect data.

4. The method as claimed in claim 3, wherein the suspected data is further screened by using the smoothness, and the smoothness index can be measured by comparing the data of two points before and after the suspected data.

5. The method as claimed in claim 1, wherein, assuming that the ith point on the load curve LP_c is considered suspect data, the smoothness index ε(i) is defined:

The suspect data is identified as inaccurate data.

6. The method as claimed in claim 5, wherein the threshold r is determined empirically;

the threshold u is determined empirically.

7. The method as claimed in claim 1, further comprising evaluating the accuracy of the metering data:

8. A bad data identification system for electric power metering is characterized by comprising:

9. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of claims 1 to 7 when executing the program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of the preceding claims 1 to 7.