CN117370744A - Dynamic cleaning method and system for abnormal power consumption data of power consumer - Google Patents
Dynamic cleaning method and system for abnormal power consumption data of power consumer
- Publication number
- CN117370744A (application number CN202311666866.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- data points
- missing
- abnormal
- local
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
- G06F18/15—Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/27—Regression, e.g. linear or logistic regression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a dynamic cleaning method and system for abnormal electricity consumption data of power users. The method comprises: calculating the local anomaly factor value of every data point; clustering the local anomaly factor values, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by adaptive clustering as the anomaly judgment threshold; identifying the abnormal data, removing all of it to form blank missing data points, and inputting the blank missing data points together with the original data into a missing data dynamic filling program; treating the blank missing data points as the local missing type; filling local missing type data points with the prediction result of a least squares regression model; filling long-term missing type data intervals with the prediction result of a random forest model; and evaluating the dynamic cleaning effect on the original data. The method and system can clean abnormal electricity consumption data of power users quickly and effectively, and meet the current requirements for cleaning and using the multi-source big data of the power distribution network.
Description
Technical Field
The invention relates to the field of processing of abnormal power consumption data of power users, in particular to a dynamic cleaning method and a dynamic cleaning system for abnormal power consumption data of power users.
Background
With the development of intelligent acquisition terminals, data acquisition systems upload massive multi-source measurement data on the electricity consumption of power users to a data center, and the data of the power distribution network is managed and read through a database. Because of short-term sensor failures, external interference, transmission errors and other factors, the acquired data is not fully complete or reliable, so the original data contains abnormal and missing values, and data cleaning is needed to improve data quality. However, in traditional cleaning methods the threshold for judging abnormal data depends entirely on manual setting, which leads to misjudgment, so such methods are not suitable for cleaning the multi-source big data of the power distribution network.
Disclosure of Invention
Aiming at the technical problems pointed out in the background technology, the invention provides a method and a system for dynamically cleaning abnormal power consumption data of a power user.
In order to achieve the purpose of the invention, the technical scheme provided by the invention is as follows:
In one aspect, the invention provides a dynamic cleaning method for abnormal electricity consumption data of power users, comprising the following steps:
step 1: calculating the local anomaly factor value of every data point in the input original data, and sorting the values in ascending order;
step 2: using a Gaussian mixture model, clustering the sorted local anomaly factor values into two classes, normal data and abnormal data, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by the adaptive clustering as the anomaly judgment threshold;
step 3: comparing, one by one, the local anomaly factor value of each data point in the original data with the anomaly judgment threshold; judging data points whose local anomaly factor value is smaller than the threshold to be normal data and all other data points to be abnormal data; removing all abnormal data to form blank missing data points; and inputting the blank missing data points together with the original data into a missing data dynamic filling program;
step 4: determining the duration of each run of missing data in the original data, dividing the missing data by duration into two missing types, local missing and long-term missing, and treating the blank missing data points as the local missing type;
step 5: taking the data points adjacent to the local missing type data points as training data, building and training a least squares regression model, and filling the local missing type data points with the prediction result of the least squares regression model;
step 6: setting the number of decision trees of a random forest model, taking the data points immediately before and after the long-term missing type data interval as training data, building and training the random forest model, and filling the long-term missing type data interval with the prediction result of the random forest model;
step 7: evaluating the dynamic cleaning effect on the original data.
In another aspect, corresponding to the method, the invention also provides a dynamic cleaning system for abnormal electricity consumption data of power users, comprising an abnormal factor calculation unit, an anomaly judgment threshold calculation unit, a blank missing data point forming unit, a missing data time length judging unit, a local missing type data filling unit, a long-term missing type data interval filling unit and an evaluation unit:
the abnormal factor calculation unit is used for calculating the local anomaly factor value of every data point in the input original data and sorting the values in ascending order;
the anomaly judgment threshold calculation unit is used for clustering the sorted local anomaly factor values into normal data and abnormal data using a Gaussian mixture model, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by the adaptive clustering as the anomaly judgment threshold;
the blank missing data point forming unit is used for comparing, one by one, the local anomaly factor value of each data point in the original data with the anomaly judgment threshold, judging data points whose local anomaly factor value is smaller than the threshold to be normal data and all other data points to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into the missing data dynamic filling program;
the missing data time length judging unit is used for determining the duration of each run of missing data in the original data, dividing the missing data by duration into two missing types, local missing and long-term missing, and treating the blank missing data points as the local missing type;
the local missing type data filling unit is used for taking the data points adjacent to the local missing type data points as training data, building and training a least squares regression model, and filling the local missing type data points with the prediction result of the least squares regression model;
the long-term missing type data interval filling unit is used for setting the number of decision trees of the random forest model, taking the data points immediately before and after the long-term missing type data interval as training data, building and training the random forest model, and filling the long-term missing type data interval with the prediction result of the random forest model;
the evaluation unit is used for evaluating the dynamic cleaning effect on the original data.
Compared with the prior art, the invention has the beneficial effects that: by utilizing the dynamic cleaning scheme for the power consumption abnormal data of the power users, the power consumption abnormal data of the power users can be quickly and effectively cleaned, the cleaning and use requirements of the current multi-source big data of the power distribution network are met, and the method is convenient to popularize and use in industry.
Drawings
Fig. 1 is a schematic flow chart of a method provided in an embodiment of the present application.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In traditional cleaning methods, the threshold for judging abnormal data depends entirely on manual setting, which leads to misjudgment, so such methods are not suitable for cleaning the multi-source big data of the power distribution network. In addition, the filling of missing data often focuses too heavily on filling accuracy and neglects the running time of the algorithm, so filling massive missing data of power users is inefficient. Current methods for cleaning big data on the electricity consumption of power users therefore have certain limitations.
As shown in fig. 1, the method for dynamically cleaning abnormal data of electricity consumption of a power consumer provided in this embodiment includes the following steps:
step 1: calculating the local anomaly factor value of every data point in the input original data, and sorting the values in ascending order;
step 2: using a Gaussian mixture model, clustering the sorted local anomaly factor values into two classes, normal data and abnormal data, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by the adaptive clustering as the anomaly judgment threshold;
step 3: comparing, one by one, the local anomaly factor value of each data point in the original data with the anomaly judgment threshold; judging data points whose local anomaly factor value is smaller than the threshold to be normal data and all other data points to be abnormal data; removing all abnormal data to form blank missing data points; and inputting the blank missing data points together with the original data into a missing data dynamic filling program;
step 4: determining the duration of each run of missing data in the original data, dividing the missing data by duration into two missing types, local missing and long-term missing, and treating the blank missing data points as the local missing type;
step 5: taking the data points adjacent to the local missing type data points as training data, building and training a least squares regression model, and filling the local missing type data points with the prediction result of the least squares regression model;
step 6: setting the number of decision trees of a random forest model, taking the data points immediately before and after the long-term missing type data interval as training data, building and training the random forest model, and filling the long-term missing type data interval with the prediction result of the random forest model;
step 7: evaluating the dynamic cleaning effect on the original data.
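Steps 4 and 5 can be illustrated with a minimal Python sketch. The text does not state where the boundary between local and long-term missing lies, nor how many neighbouring points are used for training, so `max_local_len` and `window` below are hypothetical parameters, and a straight-line fit over the time index is only one possible form of least squares regression model. Missing points are represented as `None`.

```python
def find_gaps(series):
    """Return (start_index, length) for every run of None values."""
    gaps, i = [], 0
    while i < len(series):
        if series[i] is None:
            start = i
            while i < len(series) and series[i] is None:
                i += 1
            gaps.append((start, i - start))
        else:
            i += 1
    return gaps


def classify_gap(length, max_local_len=3):
    """Step 4: label a gap by its duration (the cutoff is an assumption)."""
    return "local" if length <= max_local_len else "long-term"


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def fill_local_gaps(series, window=4, max_local_len=3):
    """Step 5: fill local gaps from a line fitted to nearby known points."""
    filled = list(series)
    for start, length in find_gaps(series):
        if classify_gap(length, max_local_len) != "local":
            continue  # long-term gaps are left for the random forest model
        end = start + length
        lo, hi = max(0, start - window), min(len(series), end + window)
        idx = [i for i in range(lo, hi) if series[i] is not None]
        if len(idx) < 2:
            continue  # not enough neighbours to fit a line
        slope, intercept = fit_line(idx, [series[i] for i in idx])
        for i in range(start, end):
            filled[i] = slope * i + intercept
    return filled
```

For example, `fill_local_gaps([1.0, 2.0, None, 4.0, 5.0])` fits a line through the four known points and fills the gap from it, while a run of five consecutive `None` values is classified as long-term and left untouched by this function.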
Preferably, the step 1 is specifically as follows:
step 1.1: calculating the k-th reachable distance between data points for all data points in the original data. Let $d(x_a, x_b)$ represent the distance between data points $x_a$ and $x_b$, and let $d_k(x_a)$ represent the k-th distance of $x_a$, that is, the distance between $x_a$ and the data point that is k-th nearest to $x_a$ among all data points. The k-th reachable distance from data point $x_a$ to data point $x_b$, written $\mathrm{reach\_dist}_k(x_a, x_b)$, takes the maximum of $d_k(x_b)$ and $d(x_a, x_b)$, as in the following formula:
$$\mathrm{reach\_dist}_k(x_a, x_b) = \max\{\, d_k(x_b),\ d(x_a, x_b) \,\};$$
step 1.2: calculating the local reachable density of all data points from the k-th reachable distances obtained. Let $N_k(x_a)$ represent the k-th distance neighborhood of data point $x_a$. The local reachable density $\mathrm{lrd}_k(x_a)$ of data point $x_a$ is the inverse of the average reachable distance between $x_a$ and all data points in its k-th distance neighborhood $N_k(x_a)$; $\mathrm{lrd}_k(x_a)$ represents the density of data point $x_a$ relative to the data points in its surrounding neighborhood, as in the following formula:
$$\mathrm{lrd}_k(x_a) = \left( \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \mathrm{reach\_dist}_k(x_a, x_b) \right)^{-1};$$
in the formula, $|N_k(x_a)|$ represents the number of data points within the k-th distance neighborhood of data point $x_a$;
step 1.3: calculating the local anomaly factor value of each data point in the original data; the local anomaly factor value $\mathrm{LOF}_k(x_a)$ of data point $x_a$ is given by the following formula:
$$\mathrm{LOF}_k(x_a) = \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \frac{\mathrm{lrd}_k(x_b)}{\mathrm{lrd}_k(x_a)};$$
in the formula, $\mathrm{LOF}_k(x_a)$ represents the average of the ratios of the local reachable density of the data points in the k-th distance neighborhood $N_k(x_a)$ to the local reachable density of data point $x_a$; the larger $\mathrm{LOF}_k(x_a)$ is, the more likely data point $x_a$ is an outlier.
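Steps 1.1 to 1.3 can be sketched in plain Python as follows. This is an illustration under assumptions rather than the patented implementation: data points are given as tuples (one element per feature), the distance is Euclidean, the reachable distance uses the k-distance of the neighbour as in the standard local outlier factor definition, and the function name `lof_scores` is hypothetical. Exact duplicate points would make a reachable distance zero and are not handled here.

```python
import math


def lof_scores(points, k):
    """Return the local anomaly factor of every point (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neigh = [], []
    for a in range(n):
        others = sorted((dist[a][b], b) for b in range(n) if b != a)
        kd = others[k - 1][0]          # distance to the k-th nearest point
        k_dist.append(kd)
        # k-distance neighborhood: every point no farther than the k-distance
        neigh.append([b for d, b in others if d <= kd])

    def reach_dist(a, b):
        # step 1.1: larger of the neighbour's k-distance and the direct distance
        return max(k_dist[b], dist[a][b])

    # step 1.2: local reachable density = inverse of the mean reachable distance
    lrd = []
    for a in range(n):
        mean_reach = sum(reach_dist(a, b) for b in neigh[a]) / len(neigh[a])
        lrd.append(1.0 / mean_reach)

    # step 1.3: mean ratio of the neighbours' density to the point's own density
    return [sum(lrd[b] / lrd[a] for b in neigh[a]) / len(neigh[a])
            for a in range(n)]
```

Calling `lof_scores([(1.0,), (1.1,), (0.9,), (1.05,), (5.0,)], k=2)` gives values near 1 for the four clustered readings and a much larger value for the isolated reading `5.0`, which is the behaviour step 1.3 describes.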
It should be noted that: k-distance: for any data point o in the data set, the distance from o to the k-th nearest data point is called the k-distance of data point o; k-distance neighborhood: the set of all data points whose distance from o is not greater than the k-distance of o is called the k-distance neighborhood of o; reachable distance: for any two data points A and B in the data set, the reachable distance from A to B is the larger of the k-distance and the direct distance between A and B, as defined in step 1.1.
Preferably, the step 2 is specifically as follows:
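step 2.1: randomly initializing the parameters of the EM algorithm, $\theta^{(0)} = \{\pi_g^{(0)}, \mu_g^{(0)}, \Sigma_g^{(0)}\}$, and executing the E step of the EM algorithm, as in the following formula:
$$\gamma_{ig}^{(t)} = \frac{\pi_g^{(t-1)}\, \mathcal{N}\!\left(x_i \mid \mu_g^{(t-1)}, \Sigma_g^{(t-1)}\right)}{\sum_{j \in G} \pi_j^{(t-1)}\, \mathcal{N}\!\left(x_i \mid \mu_j^{(t-1)}, \Sigma_j^{(t-1)}\right)};$$
wherein: $x_i$ represents the i-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model, and $j$ is an index over these Gaussian models; $\pi_g$ is the weight of the g-th Gaussian model in the Gaussian mixture model; $\mathcal{N}(x \mid \mu_g, \Sigma_g)$, with mean $\mu_g$ and covariance matrix $\Sigma_g$, is the probability distribution function of a single Gaussian model. The formula calculates, for the local anomaly factor value of each data point in the original data, the posterior probability $\gamma_{ig}^{(t)}$ under the Gaussian mixture model in the t-th iteration, i.e. the probability that the data point belongs to the g-th Gaussian model after this iteration.
step 2.2: executing the M step of the EM algorithm, as in the following formulas:
$$\mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\, x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}};$$
$$\Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} \left(x_i - \mu_g^{(t)}\right)\left(x_i - \mu_g^{(t)}\right)^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}};$$
$$\pi_g^{(t)} = \frac{1}{|I|} \sum_{i \in I} \gamma_{ig}^{(t)};$$
wherein $I$ is the index set of all data points $x_i$, and $T$ is the set of all iterations $t$; the parameter estimates $\theta^{(t)} = \{\pi_g^{(t)}, \mu_g^{(t)}, \Sigma_g^{(t)}\}$ of the t-th iteration are obtained by maximizing the likelihood function of the parameters, and the relevant parameters of each Gaussian model in the mixture are updated according to these estimates to obtain the Gaussian mixture model after the t-th iteration;
step 2.3: calculating the log-likelihood function $L^{(t)}$ of the Gaussian mixture model after the t-th iteration, as in the following formula:
$$L^{(t)} = \sum_{i \in I} \ln \sum_{g \in G} \pi_g^{(t)}\, \mathcal{N}\!\left(x_i \mid \mu_g^{(t)}, \Sigma_g^{(t)}\right);$$
step 2.4: alternately executing the E step and the M step of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.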
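The E step, M step and stopping rule of steps 2.1 to 2.4, followed by the threshold selection of step 2, can be sketched as below for one-dimensional local anomaly factor values and two Gaussian components. Several points are assumptions made for illustration: the sketch starts from the smallest and largest value instead of a random initialization so that the result is reproducible, it floors the variance at `var_floor` to avoid degenerate components, and it reads "the value on the boundary" as the smallest value assigned to the abnormal (higher-mean) component. The names `fit_gmm_1d` and `anomaly_threshold` are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(values, max_iter=200, tol=1e-8, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture by EM."""
    xs = sorted(values)
    n = len(xs)
    mean_all = sum(xs) / n
    var_all = max(sum((x - mean_all) ** 2 for x in xs) / n, var_floor)
    weights, means, variances = [0.5, 0.5], [xs[0], xs[-1]], [var_all, var_all]
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E step: posterior probability of each component for each value
        resp, ll = [], 0.0
        for x in xs:
            joint = [weights[g] * normal_pdf(x, means[g], variances[g]) for g in range(2)]
            total = sum(joint) or 1e-300   # guard against numerical underflow
            resp.append([j / total for j in joint])
            ll += math.log(total)          # log-likelihood of the current parameters
        # M step: re-estimate mean, variance and weight of each component
        for g in range(2):
            n_g = sum(r[g] for r in resp)
            if n_g < 1e-12:
                continue
            means[g] = sum(r[g] * x for r, x in zip(resp, xs)) / n_g
            variances[g] = max(
                sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, xs)) / n_g,
                var_floor)
            weights[g] = n_g / n
        # stop when the log-likelihood no longer changes
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances


def anomaly_threshold(lof_values):
    """Smallest value assigned to the higher-mean (abnormal) component."""
    weights, means, variances = fit_gmm_1d(lof_values)
    abnormal = 0 if means[0] > means[1] else 1
    normal = 1 - abnormal
    flagged = []
    for x in lof_values:
        p_abn = weights[abnormal] * normal_pdf(x, means[abnormal], variances[abnormal])
        p_nor = weights[normal] * normal_pdf(x, means[normal], variances[normal])
        if p_abn > p_nor and x > means[normal]:
            flagged.append(x)
    return min(flagged) if flagged else math.inf
```

With the threshold returned by `anomaly_threshold`, step 3 reduces to a comparison: values below the threshold are kept as normal and all other values are removed as abnormal.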
It should be noted that the random forest model is essentially an ensemble learning method in machine learning: it combines multiple decision trees following the ensemble learning idea, its basic unit is the decision tree, each decision tree is an individual predictor, and the final output is obtained by averaging the results of all the decision trees. When a decision tree is applied to a regression problem, node splitting can be controlled by the variance: the smaller the variance of the resulting nodes, the better the selected feature and split value are for splitting the node.
The sample variance is:
$$\sigma^2 = \frac{1}{N} \sum_{f \in X} \left( y_f - \frac{1}{N} \sum_{c \in X} y_c \right)^2;$$
wherein: $X$ is a sub-data set; $N$ is the number of sample points in the sub-sample; $y_f$ is the value of the f-th sub-sample point; $y_c$ is the actual value of the c-th sub-sample point. The whole set of sample points is split repeatedly into different node spaces, each node yields a predicted value, and the average of the predicted values of all the nodes is the final prediction result of the decision tree.
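Variance-controlled node splitting can be illustrated with a small sketch for a single feature. It is a simplified assumption of how one regression tree node might choose its split, not the procedure of any particular random forest library: every midpoint between consecutive distinct feature values is tried, and the one with the smallest size-weighted variance of the two child nodes is kept. The names `variance` and `best_split` are hypothetical.

```python
def variance(ys):
    """Population variance of a list of target values."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys) / len(ys)


def best_split(xs, ys):
    """Return (threshold, weighted child variance) of the best split,
    or (None, parent variance) if no split lowers the variance."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_threshold, best_score = None, variance(ys)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * variance(left) + len(right) * variance(right)) / n
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_score = score
    return best_threshold, best_score
```

For `xs = [1, 2, 3, 10, 11, 12]` and `ys = [5, 5, 5, 9, 9, 9]`, the split at `6.5` separates the two groups perfectly, so the weighted child variance drops from 4 to 0. A full tree repeats this choice recursively, and a random forest averages many such trees grown on bootstrap samples.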
Preferably, in step 7, the dynamic cleaning effect on the original data is evaluated using the regression evaluation index $R^2$;
the regression evaluation index $R^2$ is calculated as follows:
$$R^2 = 1 - \frac{\sum_{f=1}^{n} \left( y_f - \hat{y}_f \right)^2}{\sum_{f=1}^{n} \left( y_f - \bar{y} \right)^2};$$
wherein the random forest model uses bootstrap to randomly sample the original data into X sub-data sets that form the sub-sample set; $y_f$ is the value of the f-th sub-sample point, $n$ is the length of the missing data, $\bar{y}$ is the mean value of the sub-sample points, and $\hat{y}_f$ is the model prediction value;
the quality of the cleaning is judged by the magnitude of $R^2$.
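Preferably, in step 7, the dynamic cleaning effect on the original data is also evaluated using the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as evaluation indexes of the data prediction effect, as in the following formulas:
$$M_{APE} = \frac{1}{N} \sum_{f=1}^{N} \left| \frac{y_f - \hat{y}_f}{y_f} \right| \times 100\%;$$
$$R_{MSE} = \sqrt{\frac{1}{N} \sum_{f=1}^{N} \left( y_f - \hat{y}_f \right)^2};$$
wherein $N$ is the number of sample points in the sub-sample, $y_f$ is the value of the f-th sub-sample point and $\hat{y}_f$ is the model prediction value; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction result and the better the effect.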
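The three evaluation indexes of step 7 can be computed with a few lines of Python. This is a minimal sketch using the textbook definitions of the coefficient of determination, mean absolute percentage error and root mean square error; the function names are hypothetical, and it assumes the true values are non-zero (for the percentage error) and not all identical (for the coefficient of determination).

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, in the units of the data."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))
```

In practice the true values would be readings deliberately hidden from the filling program, and the predictions would be the values it filled in for them.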
Correspondingly, the invention also provides a power consumer electricity consumption abnormal data dynamic cleaning system which comprises an abnormal factor calculating unit, an abnormal judgment threshold calculating unit, a blank missing data point forming unit, a missing data time length judging unit, a local missing type data filling unit, a long-term missing type data interval filling unit and an evaluation unit:
the abnormal factor calculation unit is used for calculating the local anomaly factor value of every data point in the input original data and sorting the values in ascending order;
the anomaly judgment threshold calculation unit is used for clustering the sorted local anomaly factor values into normal data and abnormal data using a Gaussian mixture model, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by the adaptive clustering as the anomaly judgment threshold;
the blank missing data point forming unit is used for comparing, one by one, the local anomaly factor value of each data point in the original data with the anomaly judgment threshold, judging data points whose local anomaly factor value is smaller than the threshold to be normal data and all other data points to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into the missing data dynamic filling program;
the missing data time length judging unit is used for determining the duration of each run of missing data in the original data, dividing the missing data by duration into two missing types, local missing and long-term missing, and treating the blank missing data points as the local missing type;
the local missing type data filling unit is used for taking the data points adjacent to the local missing type data points as training data, building and training a least squares regression model, and filling the local missing type data points with the prediction result of the least squares regression model;
the long-term missing type data interval filling unit is used for setting the number of decision trees of a random forest model, taking the data points immediately before and after the long-term missing type data interval as training data, building and training the random forest model, and filling the long-term missing type data interval with the prediction result of the random forest model;
the evaluation unit is used for evaluating the dynamic cleaning effect on the original data.
Preferably, the anomaly factor calculation unit is configured to perform the following:
step 1.1: calculating the k-th reachable distance between data points for all data points in the original data. Let $d(x_a, x_b)$ represent the distance between data points $x_a$ and $x_b$, and let $d_k(x_a)$ represent the k-th distance of $x_a$, that is, the distance between $x_a$ and the data point that is k-th nearest to $x_a$ among all data points. The k-th reachable distance from data point $x_a$ to data point $x_b$, written $\mathrm{reach\_dist}_k(x_a, x_b)$, takes the maximum of $d_k(x_b)$ and $d(x_a, x_b)$, as in the following formula:
$$\mathrm{reach\_dist}_k(x_a, x_b) = \max\{\, d_k(x_b),\ d(x_a, x_b) \,\};$$
step 1.2: calculating the local reachable density of all data points from the k-th reachable distances obtained. Let $N_k(x_a)$ represent the k-th distance neighborhood of data point $x_a$. The local reachable density $\mathrm{lrd}_k(x_a)$ of data point $x_a$ is the inverse of the average reachable distance between $x_a$ and all data points in its k-th distance neighborhood $N_k(x_a)$; $\mathrm{lrd}_k(x_a)$ represents the density of data point $x_a$ relative to the data points in its surrounding neighborhood, as in the following formula:
$$\mathrm{lrd}_k(x_a) = \left( \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \mathrm{reach\_dist}_k(x_a, x_b) \right)^{-1};$$
in the formula, $|N_k(x_a)|$ represents the number of data points within the k-th distance neighborhood of data point $x_a$;
step 1.3: calculating the local anomaly factor value of each data point in the original data; the local anomaly factor value $\mathrm{LOF}_k(x_a)$ of data point $x_a$ is given by the following formula:
$$\mathrm{LOF}_k(x_a) = \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \frac{\mathrm{lrd}_k(x_b)}{\mathrm{lrd}_k(x_a)};$$
in the formula, $\mathrm{LOF}_k(x_a)$ represents the average of the ratios of the local reachable density of the data points in the k-th distance neighborhood $N_k(x_a)$ to the local reachable density of data point $x_a$; the larger $\mathrm{LOF}_k(x_a)$ is, the more likely data point $x_a$ is an outlier.
Preferably, the anomaly judgment threshold calculation unit is configured to perform the following:
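step 2.1: randomly initializing the parameters of the EM algorithm, $\theta^{(0)} = \{\pi_g^{(0)}, \mu_g^{(0)}, \Sigma_g^{(0)}\}$, and executing the E step of the EM algorithm, as in the following formula:
$$\gamma_{ig}^{(t)} = \frac{\pi_g^{(t-1)}\, \mathcal{N}\!\left(x_i \mid \mu_g^{(t-1)}, \Sigma_g^{(t-1)}\right)}{\sum_{j \in G} \pi_j^{(t-1)}\, \mathcal{N}\!\left(x_i \mid \mu_j^{(t-1)}, \Sigma_j^{(t-1)}\right)};$$
wherein: $x_i$ represents the i-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model, and $j$ is an index over these Gaussian models; $\pi_g$ is the weight of the g-th Gaussian model in the Gaussian mixture model; $\mathcal{N}(x \mid \mu_g, \Sigma_g)$, with mean $\mu_g$ and covariance matrix $\Sigma_g$, is the probability distribution function of a single Gaussian model. The formula calculates, for the local anomaly factor value of each data point in the original data, the posterior probability $\gamma_{ig}^{(t)}$ under the Gaussian mixture model in the t-th iteration, i.e. the probability that the data point belongs to the g-th Gaussian model after this iteration.
step 2.2: executing the M step of the EM algorithm, as in the following formulas:
$$\mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\, x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}};$$
$$\Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} \left(x_i - \mu_g^{(t)}\right)\left(x_i - \mu_g^{(t)}\right)^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}};$$
$$\pi_g^{(t)} = \frac{1}{|I|} \sum_{i \in I} \gamma_{ig}^{(t)};$$
wherein $I$ is the index set of all data points $x_i$, and $T$ is the set of all iterations $t$; the parameter estimates $\theta^{(t)} = \{\pi_g^{(t)}, \mu_g^{(t)}, \Sigma_g^{(t)}\}$ of the t-th iteration are obtained by maximizing the likelihood function of the parameters, and the relevant parameters of each Gaussian model in the mixture are updated according to these estimates to obtain the Gaussian mixture model after the t-th iteration;
step 2.3: calculating the log-likelihood function $L^{(t)}$ of the Gaussian mixture model after the t-th iteration, as in the following formula:
$$L^{(t)} = \sum_{i \in I} \ln \sum_{g \in G} \pi_g^{(t)}\, \mathcal{N}\!\left(x_i \mid \mu_g^{(t)}, \Sigma_g^{(t)}\right);$$
step 2.4: alternately executing the E step and the M step of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.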
Preferably, the evaluation unit evaluates the dynamic cleaning effect on the original data using the regression evaluation index $R^2$;
the regression evaluation index $R^2$ is calculated as follows:
$$R^2 = 1 - \frac{\sum_{f=1}^{n} \left( y_f - \hat{y}_f \right)^2}{\sum_{f=1}^{n} \left( y_f - \bar{y} \right)^2};$$
wherein the random forest model uses bootstrap to randomly sample the original data into X sub-data sets that form the sub-sample set; $y_f$ is the value of the f-th sub-sample point, $n$ is the length of the missing data, $\bar{y}$ is the mean value of the sub-sample points, and $\hat{y}_f$ is the model prediction value;
the quality of the cleaning is judged by the magnitude of $R^2$.
If R is 2 =0, illustrating that the method of the present invention has the same effect as the mean filling method in the prior art; if R is 2 < 0, the method of the invention has poorer effect compared with the mean value filling method in the prior art; if R is 2 More than 0, the method of the invention has better effect than the mean value filling method in the prior art, and R 2 The larger the value, the better the effect.
Preferably, the evaluation unit also evaluates the dynamic cleaning effect on the original data using the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as evaluation indexes of the data prediction effect, as in the following formulas:
$$M_{APE} = \frac{1}{N} \sum_{f=1}^{N} \left| \frac{y_f - \hat{y}_f}{y_f} \right| \times 100\%;$$
$$R_{MSE} = \sqrt{\frac{1}{N} \sum_{f=1}^{N} \left( y_f - \hat{y}_f \right)^2};$$
wherein $N$ is the number of sample points in the sub-sample, $y_f$ is the value of the f-th sub-sample point and $\hat{y}_f$ is the model prediction value; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction result of the algorithm and the better the effect.
The indexes reflect the degree of fit and the size of the error between the result calculated by the algorithm and the true value: $R_{MSE}$ measures the absolute deviation between the true load value and the predicted value, $M_{APE}$ measures the relative deviation between the true load value and the predicted value, and $R^2$ measures the degree of fit between the predicted value and the true load value. Together these indexes evaluate the performance of the algorithm from different aspects.
The foregoing details of the optional implementation of the embodiment of the present invention have been described in detail with reference to the accompanying drawings, but the embodiment of the present invention is not limited to the specific details of the foregoing implementation, and various simple modifications may be made to the technical solution of the embodiment of the present invention within the scope of the technical concept of the embodiment of the present invention, and these simple modifications all fall within the protection scope of the embodiment of the present invention.
Claims (10)
1. A dynamic cleaning method for abnormal electricity consumption data of power consumers, characterized by comprising the following steps:
step 1: calculating the local anomaly factor of every data point in the input original data to obtain the local anomaly factor values of all data points, and sorting the values in ascending order;
step 2: using a Gaussian mixture model, clustering the sorted local anomaly factor values into two classes, normal data and abnormal data, and taking the local anomaly factor value on the boundary between the normal data points and the abnormal data points obtained by the adaptive clustering as the anomaly judgment threshold;
step 3: comparing, in a loop, the local anomaly factor value of each data point in the original data with the anomaly judgment threshold; judging data points whose local anomaly factor value is smaller than the anomaly judgment threshold to be normal data and all other data points to be abnormal data; removing all abnormal data to form blank missing data points; and inputting the blank missing data points and the original data into a missing data dynamic filling program;
step 4: determining the duration of the missing data in the original data, dividing the data into two missing types, local missing and long-term missing, according to the missing duration, and treating the blank missing data points as the local missing type;
step 5: taking data points close to the data points of the local missing type as training data, establishing and training a least squares regression model, and filling the data points of the local missing type with the prediction result of the least squares regression model;
step 6: setting the number of decision trees of a random forest model, taking data points close to the long-term missing type data interval, before and after it, as training data, building and training the random forest model, and filling the long-term missing type data interval with the prediction result of the random forest model;
step 7: evaluating the dynamic cleaning effect on the original data.
2. The dynamic cleaning method for abnormal electricity consumption data of power consumers according to claim 1, characterized in that step 1 is specifically as follows:
step 1.1: calculating the $k$-th reachable distance between all data points in the original data; $d(x_a, x_b)$ denotes the distance between data point $x_a$ and data point $x_b$; $d_k(x_a)$ denotes the $k$-th distance of $x_a$, that is, the distance between $x_a$ and the data point ranked $k$-th by distance from $x_a$ among all data points; $\mathrm{reach}_k(x_a, x_b)$ denotes the $k$-th reachable distance from data point $x_a$ to data point $x_b$, taken as the larger of $d_k(x_b)$ and $d(x_a, x_b)$, as follows:
$$\mathrm{reach}_k(x_a, x_b) = \max\{d_k(x_b),\ d(x_a, x_b)\};$$
step 1.2: calculating the local reachable density of all data points from the obtained $k$-th reachable distances; $N_k(x_a)$ denotes the $k$-th distance neighborhood of data point $x_a$; the local reachable density $\mathrm{lrd}_k(x_a)$ of data point $x_a$ is the inverse of the average reachable distance between $x_a$ and all data points in its $k$-th distance neighborhood $N_k(x_a)$, and represents the density of $x_a$ relative to the surrounding neighborhood data points, as follows:
$$\mathrm{lrd}_k(x_a) = \left(\frac{1}{|N_k(x_a)|}\sum_{x_b \in N_k(x_a)} \mathrm{reach}_k(x_a, x_b)\right)^{-1};$$
where $|N_k(x_a)|$ denotes the number of all data points within the $k$-th distance of data point $x_a$;
step 1.3: calculating the local anomaly factor value of each data point in the original data, the local anomaly factor value $\mathrm{LOF}_k(x_a)$ of data point $x_a$ being given by:
$$\mathrm{LOF}_k(x_a) = \frac{1}{|N_k(x_a)|}\sum_{x_b \in N_k(x_a)} \frac{\mathrm{lrd}_k(x_b)}{\mathrm{lrd}_k(x_a)};$$
where $\mathrm{LOF}_k(x_a)$ is the average of the ratios of the local reachable densities of the data points in the $k$-th distance neighborhood $N_k(x_a)$ to the local reachable density of data point $x_a$; the larger $\mathrm{LOF}_k(x_a)$ is, the more likely data point $x_a$ is an outlier.
3. The dynamic cleaning method for abnormal electricity consumption data of power consumers according to claim 2, characterized in that step 2 is specifically as follows:
step 2.1: randomly initializing the parameters $\theta = \{\pi_g, \mu_g, \Sigma_g\}$ of the EM algorithm and executing the E step of the EM algorithm, as follows:
$$\gamma_{ig}^{(t)} = \frac{\pi_g\,\mathcal{N}(x_i \mid \mu_g, \Sigma_g)}{\sum_{j \in G} \pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)};$$
where $x_i$ denotes the $i$-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model and $j$ is an index over these models; $\pi_g$ is the weight of the $g$-th Gaussian model in the Gaussian mixture model; $\mathcal{N}(x_i \mid \mu_g, \Sigma_g)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$; this step calculates the posterior probability $\gamma_{ig}^{(t)}$ of the local anomaly factor value of each data point in the original data under the Gaussian mixture model in the $t$-th iteration;
step 2.2: executing the M step of the EM algorithm, as follows:
$$\pi_g^{(t)} = \frac{1}{|I|}\sum_{i \in I} \gamma_{ig}^{(t)};$$
$$\mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\, x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}};$$
$$\Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\,(x_i - \mu_g^{(t)})(x_i - \mu_g^{(t)})^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}};$$
where $I$ is the set of all $x$ and $T$ is the set of all $t$; the parameter estimates $\theta^{(t)}$ of the $t$-th iteration are obtained by maximizing the likelihood function of the parameters, and the relevant parameters of each Gaussian model are updated according to these estimates to obtain the Gaussian mixture model after the $t$-th iteration;
step 2.3: calculating the log-likelihood function $L^{(t)}$ of the Gaussian mixture model after the $t$-th iteration, as follows:
$$L^{(t)} = \sum_{i \in I} \ln \sum_{g \in G} \pi_g^{(t)}\,\mathcal{N}(x_i \mid \mu_g^{(t)}, \Sigma_g^{(t)});$$
step 2.4: alternately executing the E step and the M step of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.
4. The dynamic cleaning method for abnormal electricity consumption data of power consumers according to claim 3, characterized in that in step 7 the dynamic cleaning effect on the original data is evaluated using the regression evaluation index $R^2$;
the regression evaluation index $R^2$ is calculated as follows:
$$R^2 = 1 - \frac{\sum_{f=1}^{n}\left(y_f - \hat{y}_f\right)^2}{\sum_{f=1}^{n}\left(y_f - \bar{y}\right)^2};$$
where the random forest model uses bootstrap to randomly sample the original data into X sub-data sets that form a sub-sample set, $y_f$ is the value of the $f$-th sub-sample point, $n$ is the length of the missing data, $\bar{y}$ is the mean value of the sub-sample points, and $\hat{y}_f$ is the model predicted value;
the quality of the cleaning effect is judged according to the magnitude of $R^2$.
5. The dynamic cleaning method for abnormal electricity consumption data of power consumers according to claim 4, characterized in that in step 7 the dynamic cleaning effect on the original data is also evaluated using the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as evaluation indexes of the data prediction effect, as follows:
$$M_{APE} = \frac{1}{N}\sum_{f=1}^{N}\left|\frac{y_f - \hat{y}_f}{y_f}\right| \times 100\%;$$
$$R_{MSE} = \sqrt{\frac{1}{N}\sum_{f=1}^{N}\left(y_f - \hat{y}_f\right)^2};$$
where $N$ is the number of sample points in the $f$-th sub-sample; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction result and the better the effect.
6. A dynamic cleaning system for abnormal electricity consumption data of power consumers, characterized by comprising an anomaly factor calculation unit, an anomaly judgment threshold calculation unit, a blank missing data point forming unit, a missing data duration judging unit, a local missing type data filling unit, a long-term missing type data interval filling unit and an evaluation unit:
the anomaly factor calculation unit is used for calculating the local anomaly factor of every data point in the input original data to obtain the local anomaly factor values of all data points, and sorting the values in ascending order;
the anomaly judgment threshold calculation unit is used for clustering the sorted local anomaly factor values into two classes, normal data and abnormal data, using a Gaussian mixture model, and taking the local anomaly factor value on the boundary between the normal data points and the abnormal data points obtained by the adaptive clustering as the anomaly judgment threshold;
the blank missing data point forming unit is used for comparing, in a loop, the local anomaly factor value of each data point in the original data with the anomaly judgment threshold, judging data points whose local anomaly factor value is smaller than the anomaly judgment threshold to be normal data and all other data points to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points and the original data into a missing data dynamic filling program;
the missing data duration judging unit is used for determining the duration of the missing data in the original data, dividing the data into two missing types, local missing and long-term missing, according to the missing duration, and treating the blank missing data points as the local missing type;
the local missing type data filling unit is used for taking data points close to the data points of the local missing type as training data, establishing and training a least squares regression model, and filling the data points of the local missing type with the prediction result of the least squares regression model;
the long-term missing type data interval filling unit is used for setting the number of decision trees of a random forest model, taking data points close to the long-term missing type data interval, before and after it, as training data, building and training the random forest model, and filling the long-term missing type data interval with the prediction result of the random forest model;
the evaluation unit is used for evaluating the dynamic cleaning effect on the original data.
7. The dynamic cleaning system for abnormal electricity consumption data of power consumers according to claim 6, characterized in that the anomaly factor calculation unit is configured to perform the following:
step 1.1: calculating the $k$-th reachable distance between all data points in the original data; $d(x_a, x_b)$ denotes the distance between data point $x_a$ and data point $x_b$; $d_k(x_a)$ denotes the $k$-th distance of $x_a$, that is, the distance between $x_a$ and the data point ranked $k$-th by distance from $x_a$ among all data points; $\mathrm{reach}_k(x_a, x_b)$ denotes the $k$-th reachable distance from data point $x_a$ to data point $x_b$, taken as the larger of $d_k(x_b)$ and $d(x_a, x_b)$, as follows:
$$\mathrm{reach}_k(x_a, x_b) = \max\{d_k(x_b),\ d(x_a, x_b)\};$$
step 1.2: calculating the local reachable density of all data points from the obtained $k$-th reachable distances; $N_k(x_a)$ denotes the $k$-th distance neighborhood of data point $x_a$; the local reachable density $\mathrm{lrd}_k(x_a)$ of data point $x_a$ is the inverse of the average reachable distance between $x_a$ and all data points in its $k$-th distance neighborhood $N_k(x_a)$, and represents the density of $x_a$ relative to the surrounding neighborhood data points, as follows:
$$\mathrm{lrd}_k(x_a) = \left(\frac{1}{|N_k(x_a)|}\sum_{x_b \in N_k(x_a)} \mathrm{reach}_k(x_a, x_b)\right)^{-1};$$
where $|N_k(x_a)|$ denotes the number of all data points within the $k$-th distance of data point $x_a$;
step 1.3: calculating the local anomaly factor value of each data point in the original data, the local anomaly factor value $\mathrm{LOF}_k(x_a)$ of data point $x_a$ being given by:
$$\mathrm{LOF}_k(x_a) = \frac{1}{|N_k(x_a)|}\sum_{x_b \in N_k(x_a)} \frac{\mathrm{lrd}_k(x_b)}{\mathrm{lrd}_k(x_a)};$$
where $\mathrm{LOF}_k(x_a)$ is the average of the ratios of the local reachable densities of the data points in the $k$-th distance neighborhood $N_k(x_a)$ to the local reachable density of data point $x_a$; the larger $\mathrm{LOF}_k(x_a)$ is, the more likely data point $x_a$ is an outlier.
8. The dynamic cleaning system for abnormal electricity consumption data of power consumers according to claim 7, characterized in that the blank missing data point forming unit is configured to perform the following:
step 2.1: randomly initializing the parameters $\theta = \{\pi_g, \mu_g, \Sigma_g\}$ of the EM algorithm and executing the E step of the EM algorithm, as follows:
$$\gamma_{ig}^{(t)} = \frac{\pi_g\,\mathcal{N}(x_i \mid \mu_g, \Sigma_g)}{\sum_{j \in G} \pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)};$$
where $x_i$ denotes the $i$-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model and $j$ is an index over these models; $\pi_g$ is the weight of the $g$-th Gaussian model in the Gaussian mixture model; $\mathcal{N}(x_i \mid \mu_g, \Sigma_g)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$; this step calculates the posterior probability $\gamma_{ig}^{(t)}$ of the local anomaly factor value of each data point in the original data under the Gaussian mixture model in the $t$-th iteration;
step 2.2: executing the M step of the EM algorithm, as follows:
$$\pi_g^{(t)} = \frac{1}{|I|}\sum_{i \in I} \gamma_{ig}^{(t)};$$
$$\mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\, x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}};$$
$$\Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)}\,(x_i - \mu_g^{(t)})(x_i - \mu_g^{(t)})^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}};$$
where $I$ is the set of all $x$ and $T$ is the set of all $t$; the parameter estimates $\theta^{(t)}$ of the $t$-th iteration are obtained by maximizing the likelihood function of the parameters, and the relevant parameters of each Gaussian model are updated according to these estimates to obtain the Gaussian mixture model after the $t$-th iteration;
step 2.3: calculating the log-likelihood function $L^{(t)}$ of the Gaussian mixture model after the $t$-th iteration, as follows:
$$L^{(t)} = \sum_{i \in I} \ln \sum_{g \in G} \pi_g^{(t)}\,\mathcal{N}(x_i \mid \mu_g^{(t)}, \Sigma_g^{(t)});$$
step 2.4: alternately executing the E step and the M step of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.
9. The dynamic cleaning system for abnormal electricity consumption data of power consumers according to claim 8, characterized in that the evaluation unit is configured to evaluate the dynamic cleaning effect on the original data using the regression evaluation index $R^2$;
the regression evaluation index $R^2$ is calculated as follows:
$$R^2 = 1 - \frac{\sum_{f=1}^{n}\left(y_f - \hat{y}_f\right)^2}{\sum_{f=1}^{n}\left(y_f - \bar{y}\right)^2};$$
where the random forest model uses bootstrap to randomly sample the original data into X sub-data sets that form a sub-sample set, $y_f$ is the value of the $f$-th sub-sample point, $n$ is the length of the missing data, $\bar{y}$ is the mean value of the sub-sample points, and $\hat{y}_f$ is the model predicted value;
the quality of the cleaning effect is judged according to the magnitude of $R^2$.
10. The dynamic cleaning system for abnormal electricity consumption data of power consumers according to claim 9, characterized in that the evaluation unit, when evaluating the dynamic cleaning effect on the original data, also adopts the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as evaluation indexes of the data prediction effect, as follows:
$$M_{APE} = \frac{1}{N}\sum_{f=1}^{N}\left|\frac{y_f - \hat{y}_f}{y_f}\right| \times 100\%;$$
$$R_{MSE} = \sqrt{\frac{1}{N}\sum_{f=1}^{N}\left(y_f - \hat{y}_f\right)^2};$$
where $N$ is the number of sample points in the $f$-th sub-sample; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction result and the better the effect.
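The anomaly detection part of the claims (local anomaly factor values, then a two-class Gaussian mixture to obtain the anomaly judgment threshold) can be illustrated with a small sketch. It is not the patented implementation: it assumes one-dimensional data, the standard local outlier factor definition, a two-component one-dimensional mixture, a deterministic initialization in place of the random one, and a variance floor added for numerical safety. All function names and the sample loads are hypothetical.

```python
import math

def lof_scores(points, k):
    """Local outlier factor of each 1-D point (steps 1.1 to 1.3)."""
    n = len(points)
    dist = [[abs(a - b) for b in points] for a in points]
    k_dist, neigh = [], []
    for a in range(n):
        others = sorted((dist[a][b], b) for b in range(n) if b != a)
        dk = others[k - 1][0]                            # k-th distance of point a
        k_dist.append(dk)
        neigh.append([b for d, b in others if d <= dk])  # k-distance neighborhood

    def reach(a, b):
        # k-th reachable distance from a to b
        return max(k_dist[b], dist[a][b])

    # Local reachable density: inverse of the mean reachable distance.
    lrd = [len(neigh[a]) / sum(reach(a, b) for b in neigh[a]) for a in range(n)]
    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [sum(lrd[b] for b in neigh[a]) / (len(neigh[a]) * lrd[a]) for a in range(n)]

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_threshold(values, iters=100, var_floor=1e-6):
    """Fit a 2-component 1-D Gaussian mixture by EM; return the smallest value
    assigned to the high-mean (abnormal) component."""
    # Assumption: deterministic start (min and max as means) instead of random.
    mu = [min(values), max(values)]
    mean_v = sum(values) / len(values)
    var0 = max(sum((v - mean_v) ** 2 for v in values) / len(values), var_floor)
    var, pi = [var0, var0], [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E step: posterior probability of each component for each value.
        resp = []
        for x in values:
            w = [pi[g] * normal_pdf(x, mu[g], var[g]) for g in range(2)]
            total = sum(w)
            resp.append([wg / total for wg in w])
        # M step: update weights, means and variances.
        for g in range(2):
            ng = sum(r[g] for r in resp)
            mu[g] = sum(r[g] * x for r, x in zip(resp, values)) / ng
            var[g] = max(sum(r[g] * (x - mu[g]) ** 2 for r, x in zip(resp, values)) / ng,
                         var_floor)
            pi[g] = ng / len(values)
    abnormal = [x for x, r in zip(values, resp) if r[1] > r[0]]
    return min(abnormal)

# Hypothetical load readings: eight regular values and two obvious outliers.
loads = [50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 120.0, 5.0]
scores = lof_scores(loads, k=3)
threshold = gmm_threshold(scores)
flagged = [x for x, s in zip(loads, scores) if s >= threshold]
```

In this sketch a fixed iteration count replaces the log-likelihood convergence test, and points whose local anomaly factor is smaller than the threshold are treated as normal, matching the comparison rule of step 3. Duplicate readings would make a reachable distance zero and need extra handling.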
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311666866.1A CN117370744A (en) | 2023-12-07 | 2023-12-07 | Dynamic cleaning method and system for abnormal power consumption data of power consumer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311666866.1A CN117370744A (en) | 2023-12-07 | 2023-12-07 | Dynamic cleaning method and system for abnormal power consumption data of power consumer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117370744A true CN117370744A (en) | 2024-01-09 |
Family
ID=89393251
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311666866.1A Pending CN117370744A (en) | 2023-12-07 | 2023-12-07 | Dynamic cleaning method and system for abnormal power consumption data of power consumer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117370744A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118035227A (en) * | 2024-04-15 | 2024-05-14 | 山东云擎信息技术有限公司 | Data intelligent processing method and system based on big data evaluation |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111639978A (en) * | 2020-06-08 | 2020-09-08 | 武汉理工大学 | Electronic commerce event driving type demand forecasting method based on Prophet-random forest |
CN112926627A (en) * | 2021-01-28 | 2021-06-08 | 电子科技大学 | Equipment defect time prediction method based on capacitive equipment defect data |
CN113468796A (en) * | 2021-04-13 | 2021-10-01 | 广西电网有限责任公司南宁供电局 | Voltage missing data identification method based on improved random forest algorithm |
CN113886375A (en) * | 2021-09-29 | 2022-01-04 | 东北电力大学 | Wind power data cleaning method based on isolated forest and local outlier factors |
US20220382263A1 (en) * | 2021-04-30 | 2022-12-01 | Dalian University Of Technology | Distributed industrial energy operation optimization platform automatically constructing intelligent models and algorithms |
CN117113162A (en) * | 2023-05-23 | 2023-11-24 | 南华大学 | Eddar-rock structure background discrimination and graphic method integrating machine learning |
Non-Patent Citations (1)
Title |
---|
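Mei Yujie: "Machine-learning-based dynamic cleaning method for abnormal and missing data in distribution networks", Power System Protection and Control, pages 159 - 168 * |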
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109145957B (en) | Method and device for identifying and processing abnormal indexes of power distribution network based on big data | |
CN107612016B (en) | Planning method of distributed power supply in power distribution network based on maximum voltage correlation entropy | |
CN109978079A (en) | A kind of data cleaning method of improved storehouse noise reduction self-encoding encoder | |
CN109742788B (en) | New energy power station grid-connected performance evaluation index correction method | |
CN111259953A (en) | Equipment defect time prediction method based on capacitive equipment defect data | |
CN113191253A (en) | Non-invasive load identification method based on feature fusion under edge machine learning | |
CN116522268B (en) | Line loss anomaly identification method for power distribution network | |
CN116008714B (en) | Anti-electricity-stealing analysis method based on intelligent measurement terminal | |
CN116911806B (en) | Internet + based power enterprise energy information management system | |
CN116821832A (en) | Abnormal data identification and correction method for high-voltage industrial and commercial user power load | |
CN110212592B (en) | Thermal power generating unit load regulation maximum rate estimation method and system based on piecewise linear expression | |
CN114298136A (en) | Wind speed prediction method based on local mean decomposition and deep learning neural network | |
CN113886375A (en) | Wind power data cleaning method based on isolated forest and local outlier factors | |
CN107844872B (en) | Short-term wind speed forecasting method for wind power generation | |
CN113379116A (en) | Cluster and convolutional neural network-based line loss prediction method for transformer area | |
CN117370744A (en) | Dynamic cleaning method and system for abnormal power consumption data of power consumer | |
CN111864728B (en) | Important equipment identification method and system for reconfigurable power distribution network | |
CN112287605A (en) | Flow check method based on graph convolution network acceleration | |
CN112329971A (en) | Modeling method of investment decision model of power transmission and transformation project | |
CN116307844A (en) | Low-voltage transformer area line loss evaluation analysis method | |
CN114118592B (en) | Smart power grids power consumption end short-term energy consumption prediction system | |
CN111160675B (en) | Power grid vulnerability assessment method considering operation reliability | |
CN114239999A (en) | Element reliability parameter optimization analysis method based on cross entropy important sampling | |
CN114417918A (en) | Method for extracting wind power plant signal characteristics and denoising optimization data | |
Ji et al. | Cost Prediction of Distribution Network Project Based on DART Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||