CN117370744A - Dynamic cleaning method and system for abnormal power consumption data of power consumer

Info

Publication number: CN117370744A
Application number: CN202311666866.1A
Authority: CN
Inventors: 李晓辉; 韩可欣; 刘伟东; 吕伟嘉; 刘小琛; 李祯祥; 王崇; 葛磊蛟; 张革; 赵宏伟; 骆文涛; 骆斌; 杜天硕
Original assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Marketing Service Center of State Grid Tianjin Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd; Marketing Service Center of State Grid Tianjin Electric Power Co Ltd
Priority date: 2023-12-07
Filing date: 2023-12-07
Publication date: 2024-01-09

Abstract

The invention discloses a method and a system for dynamically cleaning abnormal electricity consumption data of power users. The method comprises the following steps: calculating the local anomaly factor of every data point; clustering the local anomaly factor values, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by adaptive clustering as the anomaly judgment threshold; identifying the abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into a missing-data dynamic filling program; treating the blank missing data points as the local missing type; filling the data points of the local missing type with the prediction of a least squares regression model; filling the data intervals of the long-term missing type with the prediction of a random forest model; and evaluating the dynamic cleaning effect on the original data. The method and system clean abnormal electricity consumption data of power users quickly and effectively, and meet the cleaning and usage requirements of current multi-source big data in power distribution networks.

Description

Dynamic cleaning method and system for abnormal power consumption data of power consumer

Technical Field

The invention relates to the field of processing of abnormal power consumption data of power users, in particular to a dynamic cleaning method and a dynamic cleaning system for abnormal power consumption data of power users.

Background

With the development of intelligent acquisition terminals, data acquisition systems upload massive multi-source measurement data on the electricity consumption of power users to data centers, and the data of the power distribution network is managed and read through databases. Because of short-term sensor failures, external interference, transmission errors and other factors, the acquired data is neither complete nor reliable: the original data contains abnormal and missing values, so data cleaning is needed to improve data quality. In traditional cleaning methods, however, the threshold for judging abnormal data depends entirely on manual setting, which leads to misjudgment, so these methods are not suitable for cleaning the multi-source big data of the power distribution network.

Disclosure of Invention

Aiming at the technical problems pointed out in the background technology, the invention provides a method and a system for dynamically cleaning abnormal power consumption data of a power user.

In order to achieve the purpose of the invention, the technical scheme provided by the invention is as follows:

In one aspect

The invention provides a dynamic cleaning method for abnormal data of electricity consumption of an electric power user, which comprises the following steps:

step 1: calculating the anomaly factor of every data point in the input original data to obtain the local anomaly factor value of each data point, and sorting the values in ascending order;

step 2: using a Gaussian mixture model to cluster the sorted local anomaly factor values into two classes, normal data and abnormal data, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by the adaptive clustering as the anomaly judgment threshold;

step 3: comparing, in a loop, the local anomaly factor value of each data point in the original data with the anomaly judgment threshold, judging data points whose local anomaly factor value is smaller than the threshold to be normal data and all other data points to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into a missing-data dynamic filling program;

step 4: determining the duration of the missing data in the original data, dividing the missing data into two missing types, local missing and long-term missing, according to that duration, and treating the blank missing data points as the local missing type;

step 5: taking data points close to the data points of the local missing type as training data, establishing and training a least squares regression model, and filling the data points of the local missing type with the prediction of the least squares regression model;

step 6: setting the number of decision trees of a random forest model, taking data points close to the long-term missing data interval, before and after it, as training data, establishing and training the random forest model, and filling the long-term missing data interval with the prediction of the random forest model;

step 7: evaluating the dynamic cleaning effect on the original data.
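By way of non-limiting illustration, steps 3 and 4 above might be sketched in Python as follows. The function names, the use of `None` for a missing value, and the cut-off `long_gap_len` that separates local gaps from long-term gaps are assumptions made for this sketch; the text does not state a specific duration limit.

```python
def remove_anomalies(series, lof_values, threshold):
    """Step 3: blank every point whose LOF value is not below the threshold."""
    cleaned, blanked = [], set()
    for i, (value, lof) in enumerate(zip(series, lof_values)):
        if value is not None and lof is not None and lof >= threshold:
            cleaned.append(None)   # abnormal point becomes a blank missing point
            blanked.add(i)
        else:
            cleaned.append(value)  # normal point, or already missing in the raw data
    return cleaned, blanked


def find_gaps(series):
    """Return (start, end) index pairs, end exclusive, for each run of None."""
    gaps, start = [], None
    for i, value in enumerate(series):
        if value is None and start is None:
            start = i
        elif value is not None and start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(series)))
    return gaps


def classify_gaps(series, blanked, long_gap_len):
    """Step 4: split gaps into local and long-term missing types.

    Gaps made only of blanked anomaly points are always treated as local;
    long_gap_len is a hypothetical cut-off chosen by the caller.
    """
    local, long_term = [], []
    for start, end in find_gaps(series):
        only_blanked = all(i in blanked for i in range(start, end))
        if only_blanked or end - start < long_gap_len:
            local.append((start, end))
        else:
            long_term.append((start, end))
    return local, long_term


raw = [1.0, 1.1, None, None, None, None, 1.2, 50.0, 1.3]
lof = [1.0, 1.0, None, None, None, None, 1.1, 8.0, 1.0]
cleaned, blanked = remove_anomalies(raw, lof, threshold=6.0)
local, long_term = classify_gaps(cleaned, blanked, long_gap_len=3)
print(local, long_term)  # [(7, 8)] [(2, 6)]
```

In this sketch the removed anomaly at index 7 is routed to the local filling path, while the four-point gap already present in the raw data is routed to the long-term filling path.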

In another aspect

Corresponding to the method, the invention also provides a system for dynamically cleaning abnormal electricity consumption data of power users, which comprises an anomaly factor calculation unit, an anomaly judgment threshold calculation unit, a blank missing data point forming unit, a missing data duration judging unit, a local missing type data filling unit, a long-term missing type data interval filling unit and an evaluation unit:

the anomaly factor calculation unit is used for calculating the anomaly factor of every data point in the input original data to obtain the local anomaly factor value of each data point, and sorting the values in ascending order;

the anomaly judgment threshold calculation unit is used for clustering the sorted local anomaly factor values into normal data and abnormal data with a Gaussian mixture model, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by the adaptive clustering as the anomaly judgment threshold;

the blank missing data point forming unit is used for comparing, in a loop, the local anomaly factor value of each data point in the original data with the anomaly judgment threshold, judging data points whose local anomaly factor value is smaller than the threshold to be normal data and all other data points to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into the missing-data dynamic filling program;

the missing data duration judging unit is used for determining the duration of the missing data in the original data, dividing the missing data into two missing types, local missing and long-term missing, according to that duration, and treating the blank missing data points as the local missing type;

the local missing type data filling unit is used for taking data points close to the data points of the local missing type as training data, establishing and training a least squares regression model, and filling the data points of the local missing type with the prediction of the least squares regression model;

the long-term missing type data interval filling unit is used for setting the number of decision trees of the random forest model, taking data points close to the long-term missing data interval, before and after it, as training data, establishing and training the random forest model, and filling the long-term missing data interval with the prediction of the random forest model;

the evaluation unit is used for evaluating the dynamic cleaning effect on the original data.
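As a non-limiting illustration of what the local missing type data filling unit does, the following Python sketch fits an ordinary least squares line to the valid points on either side of a short gap and fills the gap with the fitted values. The choice of the time index as the only regressor, the function name `fill_local_gap` and the `window` parameter are assumptions of this sketch; the text does not specify the regression features.

```python
def fill_local_gap(series, start, end, window=4):
    """Fill series[start:end] (all None) with a least squares line.

    Training data are the valid points within `window` positions before
    and after the gap; at least one such point is assumed to exist.
    """
    left = [(i, series[i]) for i in range(max(0, start - window), start)
            if series[i] is not None]
    right = [(i, series[i]) for i in range(end, min(len(series), end + window))
             if series[i] is not None]
    train = left + right
    n = len(train)
    mean_t = sum(t for t, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    # Closed-form ordinary least squares for y = intercept + slope * t.
    sxx = sum((t - mean_t) ** 2 for t, _ in train)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in train)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_t
    filled = list(series)
    for i in range(start, end):
        filled[i] = intercept + slope * i
    return filled


series = [2.0, 4.0, 6.0, None, 10.0, 12.0, 14.0]
print(fill_local_gap(series, 3, 4)[3])  # 8.0
```

A closed-form fit of this kind costs only a few arithmetic operations per gap, which matches the stated concern with the running time of filling massive data.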

Compared with the prior art, the invention has the following beneficial effects: the dynamic cleaning scheme cleans abnormal electricity consumption data of power users quickly and effectively, meets the cleaning and usage requirements of current multi-source big data in power distribution networks, and is convenient to popularize and apply in industry.

Drawings

Fig. 1 is a schematic flow chart of a method provided in an embodiment of the present application.

Detailed Description

The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

In traditional cleaning methods, the threshold for judging abnormal data depends entirely on manual setting, which leads to misjudgment, so these methods are not suitable for cleaning the multi-source big data of the power distribution network. In addition, missing-data filling often focuses too heavily on filling accuracy and ignores the running time of the algorithm, so filling efficiency is poor for the massive missing data of power users. Current methods for cleaning the electricity consumption big data of power users therefore have certain limitations.

As shown in fig. 1, the method for dynamically cleaning abnormal data of electricity consumption of a power consumer provided in this embodiment includes the following steps:

step 7: evaluating the dynamic cleaning effect on the original data.

Preferably, the step 1 is specifically as follows:

step 1.1: calculating the k-th reachable distance between the data points in the original data. Let $d(x_a, x_b)$ denote the distance between data point $x_a$ and data point $x_b$, and let $d_k(x_a)$ denote the k-th distance of $x_a$, that is, the distance between $x_a$ and the data point that is k-th nearest to $x_a$ among all data points. The k-th reachable distance from data point $x_a$ to data point $x_b$, written $\mathrm{reach}_k(x_a, x_b)$, is the larger of $d_k(x_b)$ and $d(x_a, x_b)$, as in the following formula:

$$\mathrm{reach}_k(x_a, x_b) = \max\{\, d_k(x_b),\ d(x_a, x_b) \,\}$$

step 1.2: calculating the local reachable density of all data points from the k-th reachable distances obtained. Let $N_k(x_a)$ denote the k-distance neighborhood of data point $x_a$. The local reachable density $\mathrm{lrd}_k(x_a)$ of $x_a$ is the inverse of the average reachable distance between $x_a$ and all data points in $N_k(x_a)$, and represents the density of $x_a$ relative to the data points in its surrounding neighborhood, as in the following formula:

$$\mathrm{lrd}_k(x_a) = \left( \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \mathrm{reach}_k(x_a, x_b) \right)^{-1}$$

where $|N_k(x_a)|$ is the number of data points within the k-distance of data point $x_a$;

step 1.3: calculating the local anomaly factor value of each data point in the original data; the local anomaly factor value $\mathrm{LOF}_k(x_a)$ of data point $x_a$ is given by the following formula:

$$\mathrm{LOF}_k(x_a) = \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \frac{\mathrm{lrd}_k(x_b)}{\mathrm{lrd}_k(x_a)}$$

where $\mathrm{LOF}_k(x_a)$ is the average of the ratios of the local reachable density of the data points in the k-distance neighborhood $N_k(x_a)$ of data point $x_a$ to the local reachable density of $x_a$; the larger $\mathrm{LOF}_k(x_a)$, the more likely data point $x_a$ is an outlier.

It should be noted that: k-distance: for any data point o in the X sub-data sets, the distance from o to the point that is k-th nearest to it is called the k-distance of data point o. k-distance neighborhood: the neighborhood formed by all data points whose distance from o is not greater than the k-distance of o is called the k-distance neighborhood. Reachable distance: for any data points A and B in the X sub-data sets, the distance from data point A to data point B is the reachable distance.
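By way of non-limiting illustration, step 1 might be sketched in plain Python as follows. The sketch assumes the usual local outlier factor construction, in which the reachable distance to a neighbor is the larger of that neighbor's k-distance and the actual distance; the function name `lof_scores`, the Euclidean metric and the sample load values are assumptions of this sketch, and duplicate points (which would give a zero reachable distance) are not handled specially.

```python
import math


def lof_scores(points, k):
    """Local anomaly factor of every point; points are equal-length tuples."""
    n = len(points)
    # Pairwise Euclidean distances d(x_a, x_b).
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neighbors = [], []
    for a in range(n):
        others = sorted((dist[a][b], b) for b in range(n) if b != a)
        dk = others[k - 1][0]  # k-distance of x_a
        k_dist.append(dk)
        # k-distance neighborhood: every point no farther than the k-distance.
        neighbors.append([b for d, b in others if d <= dk])
    # Local reachable density: inverse of the mean reachable distance.
    lrd = []
    for a in range(n):
        reach = [max(k_dist[b], dist[a][b]) for b in neighbors[a]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)
    # Local anomaly factor: mean ratio of neighbor density to own density.
    return [sum(lrd[b] / lrd[a] for b in neighbors[a]) / len(neighbors[a])
            for a in range(n)]


loads = [10.0, 10.5, 9.8, 10.2, 10.1, 30.0, 9.9]  # illustrative load readings
scores = lof_scores([(v,) for v in loads], k=3)
ranked = sorted(scores)  # ascending order, as required by step 1
print(scores.index(max(scores)))  # 5, the index of the 30.0 reading
```

The pairwise distance table makes this sketch quadratic in the number of points, so it is suitable only for illustrating the calculation on small samples.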

Preferably, the step 2 is specifically as follows:

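step 2.1: randomly initializing the parameters $\theta = \{\pi_g, \mu_g, \Sigma_g\}$ of the EM algorithm and executing the E step of the EM algorithm, as in the following formula:

$$\gamma_{ig}^{(t)} = \frac{\pi_g\, \mathcal{N}(x_i \mid \mu_g, \Sigma_g)}{\sum_{j \in G} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

where: $x_i$ is the i-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model, and $j$ indexes the Gaussian models; $\pi_g$ is the weight of the g-th Gaussian model in the Gaussian mixture model; $\mathcal{N}(x_i \mid \mu_g, \Sigma_g)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$. The E step computes, for the local anomaly factor value of each data point in the original data, the posterior probability $\gamma_{ig}^{(t)}$ of the Gaussian mixture model in the t-th iteration, that is, the probability that the data point belongs to the g-th Gaussian model after this iteration.

step 2.2: executing the M step of the EM algorithm, as in the following formula:

$$\pi_g^{(t)} = \frac{1}{|I|} \sum_{i \in I} \gamma_{ig}^{(t)}, \qquad \mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\, x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}}, \qquad \Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} \left(x_i - \mu_g^{(t)}\right)\left(x_i - \mu_g^{(t)}\right)^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}}$$

where $I$ is the set of all $x$ and $T$ is the set of all $t$; the parameter estimate $\theta^{(t)}$ of the t-th iteration is obtained by maximizing the likelihood function of the parameters, and the relevant parameters of each Gaussian model are updated with this estimate to obtain the Gaussian mixture model after the t-th iteration;

step 2.3: calculating the log-likelihood function $\ln L(\theta^{(t)})$ of the Gaussian mixture model after the t-th iteration, as in the following formula:

$$\ln L(\theta^{(t)}) = \sum_{i \in I} \ln \sum_{g \in G} \pi_g^{(t)}\, \mathcal{N}\!\left(x_i \mid \mu_g^{(t)}, \Sigma_g^{(t)}\right)$$

step 2.4: alternately executing the E step and the M step of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.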
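By way of non-limiting illustration, the EM iteration of step 2 might be sketched for one-dimensional local anomaly factor values and two Gaussian models as follows. Several choices here are assumptions of this sketch and are not taken from the text: the parameters are initialized deterministically at the smallest and largest values (the text calls for random initialization), a small variance floor is applied for numerical safety, and the threshold is taken to be the smallest value assigned to the Gaussian model with the larger mean.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(values, max_iter=200, tol=1e-8):
    """EM for a 1-D mixture of two Gaussian models."""
    n = len(values)
    spread = (max(values) - min(values)) ** 2 / 4 or 1.0
    weights, variances = [0.5, 0.5], [spread, spread]
    means = [min(values), max(values)]  # deterministic start (assumption)
    prev_ll, resp = -math.inf, []
    for _ in range(max_iter):
        # E step: posterior probability of each Gaussian model per value.
        resp, ll = [], 0.0
        for x in values:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([j / total for j in joint])
        # M step: re-estimate weights, means and variances.
        for g in range(2):
            ng = sum(r[g] for r in resp)
            weights[g] = ng / n
            means[g] = sum(r[g] * x for r, x in zip(resp, values)) / ng
            var = sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, values)) / ng
            variances[g] = max(var, 1e-6)  # variance floor (assumption)
        # Stop when the log-likelihood has converged.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances, resp


def anomaly_threshold(lof_values):
    """Smallest LOF value assigned to the abnormal (larger-mean) model."""
    values = sorted(lof_values)
    _, means, _, resp = fit_two_gaussians(values)
    abnormal = 0 if means[0] > means[1] else 1
    return min(x for x, r in zip(values, resp) if r[abnormal] > 0.5)


print(anomaly_threshold([1.02, 0.95, 6.0, 1.0, 1.2, 7.5, 1.05, 1.1]))  # 6.0
```

Because the threshold comes from the fitted mixture, it moves with the data rather than being set by hand, which is the point of the adaptive clustering in step 2.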

It should be noted that the random forest model is essentially an ensemble learning method in machine learning: it combines a plurality of decision trees through the ensemble learning idea, its basic unit is the decision tree, each decision tree is a classifier, and the final output is obtained by averaging the results of all decision trees. When a decision tree is applied to a regression problem, node splitting can be controlled through the variance: the smaller the node variance, the better the selected feature value is for splitting the node.

The sample variance is:

$$\sigma^2 = \frac{1}{N} \sum_{f \in X} \left( y_f - \frac{1}{N} \sum_{c \in X} y_c \right)^2$$

where: X is a sub-data set; $N$ is the number of sample points in the f-th sub-sample; $y_f$ is the value of the f-th sub-sample point; $y_c$ is the actual value of the c-th sub-sample point. The full set of sample points is split repeatedly into different node spaces, each node yields a predicted value, and the average of the predicted values of all nodes is the final prediction of the decision tree.

Preferably, in step 7, the dynamic cleaning effect on the original data is evaluated with the regression evaluation index $R^2$.

The regression evaluation index $R^2$ is calculated as follows:

$$R^2 = 1 - \frac{\sum_{f=1}^{n} \left( y_f - \hat{y}_f \right)^2}{\sum_{f=1}^{n} \left( y_f - \bar{y} \right)^2}$$

where the random forest model uses bootstrap to randomly sample the original data into X sub-data sets that form the sub-sample set; $y_f$ is the value of the f-th sub-sample point; $n$ is the length of the missing data; $\bar{y}$ is the mean of the sub-sample points; and $\hat{y}_f$ is the model predicted value.

The quality of the cleaning is judged according to the magnitude of $R^2$.
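By way of non-limiting illustration, the regression evaluation index can be computed as in the following Python sketch. The function name `r_squared` and the sample values are assumptions of this sketch; the second call shows that filling every missing point with the mean of the true values yields an index of zero.

```python
def r_squared(actual, predicted):
    """Regression evaluation index: 1 - residual sum / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


actual = [2.0, 4.0, 6.0, 8.0]
print(r_squared(actual, actual))      # 1.0, a perfect fill
print(r_squared(actual, [5.0] * 4))   # 0.0, same as filling with the mean
```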

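Preferably, in step 7, the dynamic cleaning effect on the original data is also evaluated with the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as evaluation indices of the data prediction effect, as in the following formulas:

$$M_{APE} = \frac{1}{N} \sum_{f=1}^{N} \left| \frac{y_f - \hat{y}_f}{y_f} \right| \times 100\%, \qquad R_{MSE} = \sqrt{\frac{1}{N} \sum_{f=1}^{N} \left( y_f - \hat{y}_f \right)^2}$$

where $N$ is the number of sample points in the f-th sub-sample, $y_f$ is the value of the f-th sub-sample point, and $\hat{y}_f$ is the model predicted value; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction and the better the effect.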
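By way of non-limiting illustration, the two error indices can be computed as in the following Python sketch. The function names and sample values are assumptions of this sketch, and the percentage error assumes that no true value is zero.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the units of the data."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mape(actual, predicted))  # about 10 percent
print(rmse(actual, predicted))  # about 15.81
```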

Correspondingly, the invention also provides a power consumer electricity consumption abnormal data dynamic cleaning system which comprises an abnormal factor calculating unit, an abnormal judgment threshold calculating unit, a blank missing data point forming unit, a missing data time length judging unit, a local missing type data filling unit, a long-term missing type data interval filling unit and an evaluation unit:

the long-term missing type data interval filling unit is used for setting the number of decision trees of a random forest model, taking data points which are similar to each other before and after the long-term missing type data interval as training data, building the random forest model and training, and filling the long-term missing type data interval by adopting a random forest model prediction result;

Preferably, the anomaly factor calculation unit is configured to perform the following:

$$\mathrm{LOF}_k(x_a) = \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \frac{\mathrm{lrd}_k(x_b)}{\mathrm{lrd}_k(x_a)}$$

where $N_k(x_a)$ is the k-distance neighborhood of data point $x_a$ and $\mathrm{lrd}_k$ is the local reachable density; $\mathrm{LOF}_k(x_a)$ is the average of the ratios of the local reachable density of the data points in $N_k(x_a)$ to the local reachable density of $x_a$; the larger $\mathrm{LOF}_k(x_a)$, the more likely data point $x_a$ is an outlier.

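Preferably, the blank missing data point forming unit is configured to perform the following:

$$\gamma_{ig}^{(t)} = \frac{\pi_g\, \mathcal{N}(x_i \mid \mu_g, \Sigma_g)}{\sum_{j \in G} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

where: $x_i$ is the i-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model, and $j$ indexes the Gaussian models; $\pi_g$ is the weight of the g-th Gaussian model in the Gaussian mixture model; $\mathcal{N}(x_i \mid \mu_g, \Sigma_g)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$. The formula gives, for the local anomaly factor value of each data point in the original data, the posterior probability $\gamma_{ig}^{(t)}$ of the Gaussian mixture model in the t-th iteration, that is, the probability that the data point belongs to the g-th Gaussian model after this iteration.

Step 2.2: executing the M step of the EM algorithm, as in the following formula:

$$\pi_g^{(t)} = \frac{1}{|I|} \sum_{i \in I} \gamma_{ig}^{(t)}, \qquad \mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\, x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}}, \qquad \Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} \left(x_i - \mu_g^{(t)}\right)\left(x_i - \mu_g^{(t)}\right)^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}}$$

where $I$ is the set of all $x$.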

Preferably, the evaluation unit is used for evaluating the dynamic cleaning effect on the original data with the regression evaluation index $R^2$.

The regression evaluation index $R^2$ is calculated as follows:

$$R^2 = 1 - \frac{\sum_{f=1}^{n} \left( y_f - \hat{y}_f \right)^2}{\sum_{f=1}^{n} \left( y_f - \bar{y} \right)^2}$$

where the random forest model uses bootstrap to randomly sample the original data into X sub-data sets that form the sub-sample set; $y_f$ is the value of the f-th sub-sample point; $n$ is the length of the missing data; $\bar{y}$ is the mean of the sub-sample points; and $\hat{y}_f$ is the model predicted value.

If $R^2 = 0$, the method of the invention has the same effect as the mean filling method in the prior art; if $R^2 < 0$, the method of the invention has a worse effect than the mean filling method in the prior art; if $R^2 > 0$, the method of the invention has a better effect than the mean filling method in the prior art, and the larger the value of $R^2$, the better the effect.

Preferably, the evaluation unit is used for evaluating the dynamic cleaning effect on the original data, and also adopts the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as evaluation indices of the data prediction effect, as in the following formulas:

$$M_{APE} = \frac{1}{N} \sum_{f=1}^{N} \left| \frac{y_f - \hat{y}_f}{y_f} \right| \times 100\%, \qquad R_{MSE} = \sqrt{\frac{1}{N} \sum_{f=1}^{N} \left( y_f - \hat{y}_f \right)^2}$$

where $N$ is the number of sample points, $y_f$ is the value of the f-th sub-sample point, and $\hat{y}_f$ is the model predicted value. The smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction of the algorithm and the better the effect. The indices reflect the degree of fit and the size of the error between the results calculated by the algorithm and the true values: $R_{MSE}$ measures the absolute deviation between the true load value and the predicted value, $M_{APE}$ measures the relative deviation between the true load value and the predicted value, and $R^2$ measures the degree of fit between the predicted value and the true load value. Together, these indices evaluate the performance of the algorithm from different aspects.

The foregoing details of the optional implementation of the embodiment of the present invention have been described in detail with reference to the accompanying drawings, but the embodiment of the present invention is not limited to the specific details of the foregoing implementation, and various simple modifications may be made to the technical solution of the embodiment of the present invention within the scope of the technical concept of the embodiment of the present invention, and these simple modifications all fall within the protection scope of the embodiment of the present invention.

Claims

1. The dynamic cleaning method for the abnormal data of the electricity consumption of the power consumer is characterized by comprising the following steps:

step 6: setting the number of decision trees of a random forest model, taking data points which are similar to each other before and after a long-term missing type data interval as training data, building the random forest model and training, and filling the long-term missing type data interval by adopting a random forest model prediction result;

step 7: and evaluating the dynamic cleaning effect of the original data.

2. The method for dynamically cleaning abnormal data of electricity consumption of an electric power consumer according to claim 1, wherein the step 1 is specifically as follows:

step 1.1: calculating the k-th reachable distance between the data points in the original data, wherein $d(x_a, x_b)$ denotes the distance between data point $x_a$ and data point $x_b$; $d_k(x_a)$ denotes the k-th distance of $x_a$, that is, the distance between $x_a$ and the data point that is k-th nearest to $x_a$ among all data points; and the k-th reachable distance $\mathrm{reach}_k(x_a, x_b)$ from data point $x_a$ to data point $x_b$ is the larger of $d_k(x_b)$ and $d(x_a, x_b)$, as in the following formula:

$$\mathrm{reach}_k(x_a, x_b) = \max\{\, d_k(x_b),\ d(x_a, x_b) \,\}$$

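3. The method for dynamically cleaning abnormal data of electricity consumption of an electric power consumer according to claim 2, wherein the step 2 is specifically as follows:

$$\gamma_{ig}^{(t)} = \frac{\pi_g\, \mathcal{N}(x_i \mid \mu_g, \Sigma_g)}{\sum_{j \in G} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

wherein: $x_i$ is the i-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model, and $j$ indexes the Gaussian models; $\pi_g$ is the weight of the g-th Gaussian model in the Gaussian mixture model; $\mathcal{N}(x_i \mid \mu_g, \Sigma_g)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$; and $\gamma_{ig}^{(t)}$ is the posterior probability of the Gaussian mixture model, in the t-th iteration, for the local anomaly factor value of each data point in the original data;

Step 2.2: executing the M step of the EM algorithm, as in the following formula:

$$\pi_g^{(t)} = \frac{1}{|I|} \sum_{i \in I} \gamma_{ig}^{(t)}, \qquad \mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\, x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}}, \qquad \Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} \left(x_i - \mu_g^{(t)}\right)\left(x_i - \mu_g^{(t)}\right)^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}}$$

wherein $I$ is the set of all $x$.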

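4. The method for dynamically cleaning abnormal data of electricity consumption of an electric power consumer according to claim 3, wherein in step 7, the dynamic cleaning effect on the original data is evaluated with the regression evaluation index $R^2$;

the regression evaluation index $R^2$ is calculated as follows:

$$R^2 = 1 - \frac{\sum_{f=1}^{n} \left( y_f - \hat{y}_f \right)^2}{\sum_{f=1}^{n} \left( y_f - \bar{y} \right)^2}$$

wherein $y_f$ is the value of the f-th sub-sample point, $n$ is the length of the missing data, $\bar{y}$ is the mean of the sub-sample points, and $\hat{y}_f$ is the model predicted value.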

5. The method for dynamically cleaning abnormal data of electricity consumption of an electric power consumer according to claim 4, wherein in step 7, the dynamic cleaning effect on the original data is evaluated, and the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ are also adopted as evaluation indices of the data prediction effect, as in the following formulas:

$$M_{APE} = \frac{1}{N} \sum_{f=1}^{N} \left| \frac{y_f - \hat{y}_f}{y_f} \right| \times 100\%, \qquad R_{MSE} = \sqrt{\frac{1}{N} \sum_{f=1}^{N} \left( y_f - \hat{y}_f \right)^2}$$

wherein $N$ is the number of sample points in the f-th sub-sample, $y_f$ is the value of the f-th sub-sample point, and $\hat{y}_f$ is the model predicted value; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction and the better the effect.

6. The power consumption abnormal data dynamic cleaning system for the power users is characterized by comprising an abnormal factor calculation unit, an abnormal judgment threshold calculation unit, a blank missing data point forming unit, a missing data time length judgment unit, a local missing type data filling unit, a long-term missing type data interval filling unit and an evaluation unit:

7. The power consumer electricity consumption anomaly data dynamic cleaning system of claim 6, wherein the anomaly factor calculation unit is configured to perform the following:

；

step 1.2: calculating the local reachable density of all data points from the k-th reachable distances obtained, wherein $N_k(x_a)$ denotes the k-distance neighborhood of data point $x_a$, and the local reachable density $\mathrm{lrd}_k(x_a)$ of $x_a$ is the inverse of the average reachable distance $\mathrm{reach}_k$ between $x_a$ and all data points in $N_k(x_a)$ and represents the density of $x_a$ relative to the data points in its surrounding neighborhood, as in the following formula:

$$\mathrm{lrd}_k(x_a) = \left( \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \mathrm{reach}_k(x_a, x_b) \right)^{-1}$$

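8. The power consumer electricity anomaly data dynamic cleaning system of claim 7, wherein the blank missing data point forming unit is configured to perform the following:

$$\gamma_{ig}^{(t)} = \frac{\pi_g\, \mathcal{N}(x_i \mid \mu_g, \Sigma_g)}{\sum_{j \in G} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

wherein: $x_i$ is the i-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model, and $j$ indexes the Gaussian models; $\pi_g$ is the weight of the g-th Gaussian model in the Gaussian mixture model; $\mathcal{N}(x_i \mid \mu_g, \Sigma_g)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$; and $\gamma_{ig}^{(t)}$ is the posterior probability of the Gaussian mixture model, in the t-th iteration, for the local anomaly factor value of each data point in the original data;

Step 2.2: executing the M step of the EM algorithm, as in the following formula:

$$\pi_g^{(t)} = \frac{1}{|I|} \sum_{i \in I} \gamma_{ig}^{(t)}, \qquad \mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\, x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}}, \qquad \Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} \left(x_i - \mu_g^{(t)}\right)\left(x_i - \mu_g^{(t)}\right)^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}}$$

wherein $I$ is the set of all $x$.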

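9. The system for dynamically cleaning abnormal electricity consumption data of electric power consumer according to claim 8, wherein the evaluation unit is configured to evaluate the dynamic cleaning effect on the original data with the regression evaluation index $R^2$;

the regression evaluation index $R^2$ is calculated as follows:

$$R^2 = 1 - \frac{\sum_{f=1}^{n} \left( y_f - \hat{y}_f \right)^2}{\sum_{f=1}^{n} \left( y_f - \bar{y} \right)^2}$$

wherein $y_f$ is the value of the f-th sub-sample point, $n$ is the length of the missing data, $\bar{y}$ is the mean of the sub-sample points, and $\hat{y}_f$ is the model predicted value.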

10. The system for dynamically cleaning abnormal electricity consumption data of electric power consumer according to claim 9, wherein the evaluation unit is configured to evaluate the dynamic cleaning effect on the original data, and further adopts the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as evaluation indices of the data prediction effect, as in the following formulas:

$$M_{APE} = \frac{1}{N} \sum_{f=1}^{N} \left| \frac{y_f - \hat{y}_f}{y_f} \right| \times 100\%, \qquad R_{MSE} = \sqrt{\frac{1}{N} \sum_{f=1}^{N} \left( y_f - \hat{y}_f \right)^2}$$

wherein $N$ is the number of sample points in the f-th sub-sample, $y_f$ is the value of the f-th sub-sample point, and $\hat{y}_f$ is the model predicted value; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction and the better the effect.