CN117370744A - Dynamic cleaning method and system for abnormal power consumption data of power consumer - Google Patents

Dynamic cleaning method and system for abnormal power consumption data of power consumer

Info

Publication number
CN117370744A
Authority
CN
China
Prior art keywords
data
data points
missing
abnormal
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311666866.1A
Other languages
Chinese (zh)
Inventor
李晓辉
韩可欣
刘伟东
吕伟嘉
刘小琛
李祯祥
王崇
葛磊蛟
张革
赵宏伟
骆文涛
骆斌
杜天硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Marketing Service Center of State Grid Tianjin Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Marketing Service Center of State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Tianjin Electric Power Co Ltd, and Marketing Service Center of State Grid Tianjin Electric Power Co Ltd
Priority to CN202311666866.1A
Publication of CN117370744A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G06F18/15 Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/27 Regression, e.g. linear or logistic regression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a system for dynamically cleaning abnormal electricity consumption data of power users. The method comprises: calculating the local anomaly factor of every data point; clustering the local anomaly factor values, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by adaptive clustering as the anomaly judgment threshold; identifying abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into a missing-data dynamic filling program; treating the blank missing data points as the local missing type; filling local-missing data points with the predictions of a least squares regression model; filling long-term-missing data intervals with the predictions of a random forest model; and evaluating the dynamic cleaning effect on the original data. The method and system can clean abnormal electricity consumption data of power users quickly and effectively, and meet the current requirements for cleaning and using multi-source big data of the power distribution network.

Description

Dynamic cleaning method and system for abnormal power consumption data of power consumer
Technical Field
The invention relates to the field of processing abnormal electricity consumption data of power users, and in particular to a method and a system for dynamically cleaning such data.
Background
With the development of intelligent acquisition terminals, data acquisition systems upload massive multi-source measurement data on the electricity consumption of power users to a data center, where the data of the power distribution network is managed and read through a database. Because of short-term sensor failures, external interference, transmission errors and other factors, the acquired data is not fully complete or reliable, so the original data contains abnormal and missing values, and data cleaning is needed to improve data quality. In traditional cleaning methods, however, the threshold for judging abnormal data depends entirely on manual setting and leads to misjudgment, so these methods are not suitable for cleaning the multi-source big data of the power distribution network.
Disclosure of Invention
To address the technical problems identified in the background, the invention provides a method and a system for dynamically cleaning abnormal electricity consumption data of power users.
To achieve this purpose, the invention provides the following technical scheme.
In one aspect, the invention provides a method for dynamically cleaning abnormal electricity consumption data of power users, comprising the following steps:
Step 1: calculate the local anomaly factor of every data point in the input original data, and sort the resulting local anomaly factor values in ascending order;
Step 2: using a Gaussian mixture model, cluster the sorted local anomaly factor values into two classes, normal data and abnormal data, and take the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by adaptive clustering as the anomaly judgment threshold;
Step 3: compare, in a loop, the local anomaly factor value of every data point in the original data with the anomaly judgment threshold; judge data points whose local anomaly factor value is smaller than the threshold to be normal data and all others to be abnormal data; remove all abnormal data to form blank missing data points; and input the blank missing data points together with the original data into a missing-data dynamic filling program;
Step 4: determine the duration of each run of missing data in the original data, divide the missing data by duration into two missing types, local missing and long-term missing, and treat the blank missing data points as the local missing type;
Step 5: take data points adjacent to the local-missing data points as training data, build and train a least squares regression model, and fill the local-missing data points with the model's predictions;
Step 6: set the number of decision trees of a random forest model, take data points adjacent to the long-term-missing data interval, before and after it, as training data, build and train the random forest model, and fill the long-term-missing data interval with the model's predictions;
Step 7: evaluate the dynamic cleaning effect on the original data.
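Step 4 can be illustrated with a short Python sketch that finds each run of missing readings and labels it by duration. The function name `classify_gaps`, the use of `None` for a missing reading, and the cut-off `long_gap_min_len=4` are assumptions made for this example; the text does not state a duration cut-off. The rule that points removed as anomalies are always treated as local missing is not modeled here.

```python
from typing import List, Optional, Tuple


def classify_gaps(
    series: List[Optional[float]], long_gap_min_len: int = 4
) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, kind) for every run of missing values.

    A run shorter than long_gap_min_len is labeled "local", otherwise
    "long_term". The cut-off is a hypothetical value chosen for the example.
    """
    gaps = []
    i, n = 0, len(series)
    while i < n:
        if series[i] is None:
            j = i
            # Advance j to the first non-missing reading after the run.
            while j < n and series[j] is None:
                j += 1
            kind = "long_term" if (j - i) >= long_gap_min_len else "local"
            gaps.append((i, j, kind))
            i = j
        else:
            i += 1
    return gaps


# Illustrative readings: one single-point gap and one five-point gap.
readings = [1.2, None, 1.4, 1.3, None, None, None, None, None, 1.1]
print(classify_gaps(readings))
```

Each returned tuple can then be routed to the corresponding filler: local gaps to the least squares model of step 5 and long-term gaps to the random forest model of step 6.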
In another aspect, corresponding to the method, the invention also provides a system for dynamically cleaning abnormal electricity consumption data of power users, comprising an anomaly factor calculation unit, an anomaly judgment threshold calculation unit, a blank missing data point forming unit, a missing data duration judging unit, a local-missing data filling unit, a long-term-missing data interval filling unit and an evaluation unit:
the anomaly factor calculation unit is used for calculating the local anomaly factor of every data point in the input original data and sorting the resulting local anomaly factor values in ascending order;
the anomaly judgment threshold calculation unit is used for clustering the sorted local anomaly factor values into normal data and abnormal data using a Gaussian mixture model, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by adaptive clustering as the anomaly judgment threshold;
the blank missing data point forming unit is used for comparing, in a loop, the local anomaly factor value of every data point in the original data with the anomaly judgment threshold, judging data points whose local anomaly factor value is smaller than the threshold to be normal data and all others to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into the missing-data dynamic filling program;
the missing data duration judging unit is used for determining the duration of each run of missing data in the original data, dividing the missing data by duration into two missing types, local missing and long-term missing, and treating the blank missing data points as the local missing type;
the local-missing data filling unit is used for taking data points adjacent to the local-missing data points as training data, building and training a least squares regression model, and filling the local-missing data points with the model's predictions;
the long-term-missing data interval filling unit is used for setting the number of decision trees of the random forest model, taking data points adjacent to the long-term-missing data interval, before and after it, as training data, building and training the random forest model, and filling the long-term-missing data interval with the model's predictions;
the evaluation unit is used for evaluating the dynamic cleaning effect on the original data.
Compared with the prior art, the invention has the following beneficial effects: the dynamic cleaning scheme cleans abnormal electricity consumption data of power users quickly and effectively, meets the current requirements for cleaning and using multi-source big data of the power distribution network, and is convenient to deploy in industry.
Drawings
Fig. 1 is a schematic flow chart of a method provided in an embodiment of the present application.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
In traditional cleaning methods, the threshold for judging abnormal data depends entirely on manual setting and leads to misjudgment, so these methods are not suitable for cleaning the multi-source big data of the power distribution network. In addition, methods for filling missing data often focus too heavily on filling accuracy and neglect the running time of the algorithm, so they fill the massive missing data of power users inefficiently. Current methods for cleaning the electricity consumption big data of power users are therefore limited.
As shown in Fig. 1, the method for dynamically cleaning abnormal electricity consumption data of power users provided in this embodiment comprises the following steps:
Step 1: calculate the local anomaly factor of every data point in the input original data, and sort the resulting local anomaly factor values in ascending order;
Step 2: using a Gaussian mixture model, cluster the sorted local anomaly factor values into two classes, normal data and abnormal data, and take the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by adaptive clustering as the anomaly judgment threshold;
Step 3: compare, in a loop, the local anomaly factor value of every data point in the original data with the anomaly judgment threshold; judge data points whose local anomaly factor value is smaller than the threshold to be normal data and all others to be abnormal data; remove all abnormal data to form blank missing data points; and input the blank missing data points together with the original data into a missing-data dynamic filling program;
Step 4: determine the duration of each run of missing data in the original data, divide the missing data by duration into two missing types, local missing and long-term missing, and treat the blank missing data points as the local missing type;
Step 5: take data points adjacent to the local-missing data points as training data, build and train a least squares regression model, and fill the local-missing data points with the model's predictions;
Step 6: set the number of decision trees of a random forest model, take data points adjacent to the long-term-missing data interval, before and after it, as training data, build and train the random forest model, and fill the long-term-missing data interval with the model's predictions;
Step 7: evaluate the dynamic cleaning effect on the original data.
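Step 5 can be sketched as an ordinary least squares line fitted to the known readings on either side of a local gap. This is a minimal illustration under stated assumptions: the names `fit_line` and `fill_local_gap`, the use of the time index as the only regressor, and the neighborhood size `window=3` are not specified by the text and are chosen here only for the example.

```python
from typing import List, Optional, Tuple


def fit_line(ts: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * t."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    # With a single training point the slope is undefined; fall back to 0.
    slope = sxy / sxx if sxx > 0 else 0.0
    return y_mean - slope * t_mean, slope


def fill_local_gap(
    series: List[Optional[float]], start: int, end: int, window: int = 3
) -> List[Optional[float]]:
    """Fill series[start:end] using known points within `window` of the gap."""
    left = [i for i in range(max(0, start - window), start)
            if series[i] is not None]
    right = [i for i in range(end, min(len(series), end + window))
             if series[i] is not None]
    idx = left + right
    if not idx:
        raise ValueError("no known neighbors to train on")
    intercept, slope = fit_line([float(i) for i in idx],
                                [series[i] for i in idx])
    filled = list(series)  # leave the input untouched
    for i in range(start, end):
        filled[i] = intercept + slope * i
    return filled


# Illustrative data lying on y = t + 1 with one missing reading at index 3.
data = [1.0, 2.0, 3.0, None, 5.0, 6.0, 7.0]
print(fill_local_gap(data, 3, 4))
```

A richer regressor set (for example time of day or temperature) could replace the time index without changing the structure of the sketch.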
Preferably, the step 1 is specifically as follows:
Step 1.1: calculate the $k$-th reachable distance between the data points in the original data. Let $d(x_a, x_b)$ denote the distance between data points $x_a$ and $x_b$, and let $d_k(x)$ denote the $k$-th distance of a data point $x$, that is, the distance between $x$ and the data point that is $k$-th nearest to it among all data points. The $k$-th reachable distance from $x_a$ to $x_b$, written $\mathrm{reach\text{-}dist}_k(x_a, x_b)$, is the larger of the $k$-th distance and $d(x_a, x_b)$:

$$\mathrm{reach\text{-}dist}_k(x_a, x_b) = \max\{\, d_k(x_b),\ d(x_a, x_b) \,\}$$

Step 1.2: from the $k$-th reachable distances, calculate the local reachable density of every data point. Let $N_k(x_a)$ denote the $k$-th distance neighborhood of $x_a$. The local reachable density $\mathrm{lrd}_k(x_a)$ is the reciprocal of the average reachable distance between $x_a$ and all data points in $N_k(x_a)$, and describes how dense $x_a$ is relative to the data points in its surrounding neighborhood:

$$\mathrm{lrd}_k(x_a) = \left( \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \mathrm{reach\text{-}dist}_k(x_a, x_b) \right)^{-1}$$

where $|N_k(x_a)|$ denotes the number of data points within the $k$-th distance of $x_a$;

Step 1.3: calculate the local anomaly factor value of each data point in the original data. The local anomaly factor value $\mathrm{LOF}_k(x_a)$ of data point $x_a$ is:

$$\mathrm{LOF}_k(x_a) = \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \frac{\mathrm{lrd}_k(x_b)}{\mathrm{lrd}_k(x_a)}$$

that is, $\mathrm{LOF}_k(x_a)$ is the average ratio of the local reachable density of the data points in the $k$-th distance neighborhood $N_k(x_a)$ to the local reachable density of $x_a$. The larger $\mathrm{LOF}_k(x_a)$ is, the more likely $x_a$ is an abnormal point.
It should be noted that: $k$-distance: for any data point $o$ in the X sub-datasets, the distance from $o$ to its $k$-th nearest point is called the $k$-distance of $o$. $k$-distance neighborhood: the set of all data points whose distance from $o$ is not greater than the $k$-distance of $o$ is called the $k$-distance neighborhood of $o$. Reachable distance: for any data points A and B in the X sub-datasets, the reachable distance is the distance from data point A to data point B.
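Steps 1.1 to 1.3 can be sketched in plain Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `lof_scores`, the Euclidean metric, the choice `k=2`, the density guard `1e-12`, and the sample readings are assumptions made for the example. The reachable distance uses the conventional local outlier factor definition, in which the $k$-distance is taken at the neighbor.

```python
import math
from typing import List, Sequence


def lof_scores(points: List[Sequence[float]], k: int) -> List[float]:
    """Local anomaly factor of every point, O(n^2) brute force."""
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist: List[float] = []          # k-distance of each point
    neighbors: List[List[int]] = []   # k-distance neighborhood of each point
    for a in range(n):
        others = sorted((dist[a][b], b) for b in range(n) if b != a)
        kd = others[k - 1][0]
        k_dist.append(kd)
        # Ties at the k-distance are included in the neighborhood.
        neighbors.append([b for d, b in others if d <= kd])

    def reach(a: int, b: int) -> float:
        # Step 1.1: reachable distance from a to b.
        return max(k_dist[b], dist[a][b])

    # Step 1.2: local reachable density (guarded against zero distance).
    lrd = []
    for a in range(n):
        mean_reach = sum(reach(a, b) for b in neighbors[a]) / len(neighbors[a])
        lrd.append(1.0 / max(mean_reach, 1e-12))

    # Step 1.3: average density ratio of the neighbors to the point itself.
    return [
        sum(lrd[b] / lrd[a] for b in neighbors[a]) / len(neighbors[a])
        for a in range(n)
    ]


# Illustrative one-dimensional readings; the last one is far from the rest.
sample = [(1.0,), (1.1,), (0.9,), (1.05,), (5.0,)]
scores = lof_scores(sample, k=2)
print(sorted(scores))  # step 1 ends by sorting the values in ascending order
```

Points inside the dense group score close to 1, while the isolated reading scores far higher, which is what the threshold in step 2 then separates.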
Preferably, the step 2 is specifically as follows:
Step 2.1: randomly initialize the parameters $\theta^{(0)} = \{\pi_g^{(0)}, \mu_g^{(0)}, \Sigma_g^{(0)}\}$ of the EM algorithm and execute step E of the EM algorithm:

$$\gamma_{ig}^{(t)} = \frac{\pi_g^{(t-1)}\, \mathcal{N}\bigl(x_i \mid \mu_g^{(t-1)}, \Sigma_g^{(t-1)}\bigr)}{\sum_{j \in G} \pi_j^{(t-1)}\, \mathcal{N}\bigl(x_i \mid \mu_j^{(t-1)}, \Sigma_j^{(t-1)}\bigr)}$$

where $x_i$ denotes the $i$-th data point; $G$ is the set of Gaussian models in the mixture and $j$ is an index over it; $\pi_g$ is the weight of the $g$-th Gaussian model in the Gaussian mixture model; and $\mathcal{N}(x \mid \mu_g, \Sigma_g)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$. This step computes, for the local anomaly factor value of each data point in the original data, the posterior probability $\gamma_{ig}^{(t)}$ under the Gaussian mixture model in the $t$-th iteration, that is, the probability that the data point belongs to the $g$-th Gaussian model after this iteration.
Step 2.2: execute step M of the EM algorithm:

$$\mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}}, \qquad \Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} \bigl(x_i - \mu_g^{(t)}\bigr)\bigl(x_i - \mu_g^{(t)}\bigr)^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}}, \qquad \pi_g^{(t)} = \frac{1}{|I|} \sum_{i \in I} \gamma_{ig}^{(t)}$$

where $I$ is the set of all $x$ and $T$ is the set of all $t$. The parameter estimate $\theta^{(t)}$ of the $t$-th iteration is obtained by maximizing the likelihood function with respect to the parameters, and the parameters of each Gaussian model are updated with this estimate to obtain the Gaussian mixture model after the $t$-th iteration;
Step 2.3: calculate the log-likelihood function of the Gaussian mixture model after the $t$-th iteration:

$$\ln L\bigl(\theta^{(t)}\bigr) = \sum_{i \in I} \ln \sum_{g \in G} \pi_g^{(t)}\, \mathcal{N}\bigl(x_i \mid \mu_g^{(t)}, \Sigma_g^{(t)}\bigr)$$

Step 2.4: alternate step E and step M of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.
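The EM iteration of steps 2.1 to 2.4 can be sketched for the one-dimensional case of local anomaly factor values with two Gaussian components. Several choices here are assumptions made for the example, not statements from the text: the names `fit_two_gaussians` and `anomaly_threshold`, the deterministic initialization at the minimum and maximum value (the text initializes randomly), the variance floor `1e-6`, and the reading of "the value on the boundary" as the smallest value assigned to the higher-mean component.

```python
import math
from typing import List, Tuple


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(
    values: List[float], max_iter: int = 200, tol: float = 1e-8
) -> Tuple[List[float], List[float], List[float]]:
    """EM for a two-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    n = len(values)
    grand_mean = sum(values) / n
    var0 = max(sum((v - grand_mean) ** 2 for v in values) / n, 1e-6)
    weights, means, variances = [0.5, 0.5], [min(values), max(values)], [var0, var0]
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E step: posterior probability of each component for each value.
        resp, ll = [], 0.0
        for x in values:
            dens = [weights[g] * gaussian_pdf(x, means[g], variances[g]) for g in range(2)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
            ll += math.log(total)
        # M step: re-estimate weight, mean and variance of each component.
        for g in range(2):
            rg = sum(r[g] for r in resp)
            if rg < 1e-12:
                continue
            weights[g] = rg / n
            means[g] = sum(r[g] * x for r, x in zip(resp, values)) / rg
            variances[g] = max(
                sum(r[g] * (x - means[g]) ** 2 for r, x in zip(resp, values)) / rg, 1e-6
            )
        # Stop when the log-likelihood (of the pre-update parameters) stalls.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return weights, means, variances


def anomaly_threshold(lof_values: List[float]) -> float:
    """Smallest value that the fitted mixture assigns to the abnormal cluster."""
    weights, means, variances = fit_two_gaussians(lof_values)
    abnormal = 0 if means[0] > means[1] else 1
    normal = 1 - abnormal
    flagged = []
    for x in lof_values:
        d_abn = weights[abnormal] * gaussian_pdf(x, means[abnormal], variances[abnormal])
        d_nor = weights[normal] * gaussian_pdf(x, means[normal], variances[normal])
        # Only values above the normal mean can be abnormal in this sketch.
        if x > means[normal] and d_abn > d_nor:
            flagged.append(x)
    return min(flagged) if flagged else math.inf


lof_values = [0.9, 0.95, 1.0, 1.05, 1.1, 1.0, 0.98, 1.02, 6.0, 7.5]
threshold = anomaly_threshold(lof_values)
abnormal_points = [v for v in lof_values if v >= threshold]  # step 3 comparison
print(threshold, abnormal_points)
```

With this reading of the boundary, every value below the threshold is judged normal and every value at or above it abnormal, which matches the comparison rule of step 3.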
It should be noted that the random forest model is an ensemble learning method in machine learning: it combines multiple decision trees, the decision tree being its basic unit. Each decision tree is a base learner, and the final output is the average of the results of all decision trees. When a decision tree is applied to a regression problem, node splitting can be controlled by variance: the smaller the node variance produced by a candidate feature value, the better that feature value is for splitting the node.
The sample variance is:

$$\sigma^2(X) = \frac{1}{N} \sum_{f=1}^{N} \Bigl( y_f - \frac{1}{N} \sum_{c=1}^{N} y_c \Bigr)^2$$

where $X$ is a sub-dataset; $N$ is the number of sample points in the sub-sample; $y_f$ is the value of the $f$-th sub-sample point; and $y_c$ is the actual value of the $c$-th sub-sample point. The sample points are split successively into different node spaces, each node yields a predicted value, and the mean of the predicted values of all nodes is the final prediction of the decision tree.
Preferably, in step 7, the dynamic cleaning effect on the original data is evaluated with the regression evaluation index $R^2$, calculated as follows:

$$R^2 = 1 - \frac{\sum_{f=1}^{n} \bigl(y_f - \hat{y}_f\bigr)^2}{\sum_{f=1}^{n} \bigl(y_f - \bar{y}\bigr)^2}$$

where the random forest model uses bootstrap to randomly sample the original data into X sub-datasets that form the sub-sample set; $y_f$ is the value of the $f$-th sub-sample point; $n$ is the length of the missing data; $\bar{y}$ is the mean of the sub-sample points; and $\hat{y}_f$ is the model's predicted value;
the quality of the cleaning is judged by the magnitude of $R^2$.
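The $R^2$ index can be computed with a few lines of Python. The function name `r_squared` and the sample values are assumptions made for this sketch; the formula is the standard coefficient of determination, in which predicting the mean everywhere gives exactly zero.

```python
from typing import List


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


truth = [1.0, 2.0, 3.0]
print(r_squared(truth, [1.0, 2.0, 3.0]))  # perfect fill
print(r_squared(truth, [2.0, 2.0, 2.0]))  # filling every point with the mean
```

Because mean filling scores zero, a positive value indicates a fill that beats mean filling and a negative value one that is worse.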
Preferably, in step 7, the dynamic cleaning effect on the original data is also evaluated with the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as indices of the data prediction effect:

$$M_{APE} = \frac{1}{N} \sum_{f=1}^{N} \left| \frac{y_f - \hat{y}_f}{y_f} \right| \times 100\%$$

$$R_{MSE} = \sqrt{\frac{1}{N} \sum_{f=1}^{N} \bigl(y_f - \hat{y}_f\bigr)^2}$$

where $N$ is the number of sample points in the sub-sample. The smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction and the better the effect.
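The two error indices can be sketched in Python as follows. The function names `mape` and `rmse` and the sample load values are assumptions made for the example; note that the percentage error is undefined when an actual value is zero, which the sketch rejects explicitly.

```python
import math
from typing import List


def mape(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean square error, in the unit of the data."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


true_load = [100.0, 200.0]
filled_load = [110.0, 180.0]
print(mape(true_load, filled_load), rmse(true_load, filled_load))
```

The first index is relative and the second absolute, so a fill can look good on one and poor on the other when the load level varies widely.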
Correspondingly, the invention also provides a system for dynamically cleaning abnormal electricity consumption data of power users, comprising an anomaly factor calculation unit, an anomaly judgment threshold calculation unit, a blank missing data point forming unit, a missing data duration judging unit, a local-missing data filling unit, a long-term-missing data interval filling unit and an evaluation unit:
the anomaly factor calculation unit is used for calculating the local anomaly factor of every data point in the input original data and sorting the resulting local anomaly factor values in ascending order;
the anomaly judgment threshold calculation unit is used for clustering the sorted local anomaly factor values into normal data and abnormal data using a Gaussian mixture model, and taking the local anomaly factor value on the boundary between the normal and abnormal sample data points obtained by adaptive clustering as the anomaly judgment threshold;
the blank missing data point forming unit is used for comparing, in a loop, the local anomaly factor value of every data point in the original data with the anomaly judgment threshold, judging data points whose local anomaly factor value is smaller than the threshold to be normal data and all others to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into the missing-data dynamic filling program;
the missing data duration judging unit is used for determining the duration of each run of missing data in the original data, dividing the missing data by duration into two missing types, local missing and long-term missing, and treating the blank missing data points as the local missing type;
the local-missing data filling unit is used for taking data points adjacent to the local-missing data points as training data, building and training a least squares regression model, and filling the local-missing data points with the model's predictions;
the long-term-missing data interval filling unit is used for setting the number of decision trees of a random forest model, taking data points adjacent to the long-term-missing data interval, before and after it, as training data, building and training the random forest model, and filling the long-term-missing data interval with the model's predictions;
the evaluation unit is used for evaluating the dynamic cleaning effect on the original data.
Preferably, the anomaly factor calculation unit is configured to perform the following:
Step 1.1: calculate the $k$-th reachable distance between the data points in the original data. Let $d(x_a, x_b)$ denote the distance between data points $x_a$ and $x_b$, and let $d_k(x)$ denote the $k$-th distance of a data point $x$, that is, the distance between $x$ and the data point that is $k$-th nearest to it among all data points. The $k$-th reachable distance from $x_a$ to $x_b$, written $\mathrm{reach\text{-}dist}_k(x_a, x_b)$, is the larger of the $k$-th distance and $d(x_a, x_b)$:

$$\mathrm{reach\text{-}dist}_k(x_a, x_b) = \max\{\, d_k(x_b),\ d(x_a, x_b) \,\}$$

Step 1.2: from the $k$-th reachable distances, calculate the local reachable density of every data point. Let $N_k(x_a)$ denote the $k$-th distance neighborhood of $x_a$. The local reachable density $\mathrm{lrd}_k(x_a)$ is the reciprocal of the average reachable distance between $x_a$ and all data points in $N_k(x_a)$, and describes how dense $x_a$ is relative to the data points in its surrounding neighborhood:

$$\mathrm{lrd}_k(x_a) = \left( \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \mathrm{reach\text{-}dist}_k(x_a, x_b) \right)^{-1}$$

where $|N_k(x_a)|$ denotes the number of data points within the $k$-th distance of $x_a$;

Step 1.3: calculate the local anomaly factor value of each data point in the original data. The local anomaly factor value $\mathrm{LOF}_k(x_a)$ of data point $x_a$ is:

$$\mathrm{LOF}_k(x_a) = \frac{1}{|N_k(x_a)|} \sum_{x_b \in N_k(x_a)} \frac{\mathrm{lrd}_k(x_b)}{\mathrm{lrd}_k(x_a)}$$

that is, $\mathrm{LOF}_k(x_a)$ is the average ratio of the local reachable density of the data points in the $k$-th distance neighborhood $N_k(x_a)$ to the local reachable density of $x_a$. The larger $\mathrm{LOF}_k(x_a)$ is, the more likely $x_a$ is an abnormal point.
Preferably, the blank missing data point forming unit is configured to perform the following:
Step 2.1: randomly initialize the parameters $\theta^{(0)} = \{\pi_g^{(0)}, \mu_g^{(0)}, \Sigma_g^{(0)}\}$ of the EM algorithm and execute step E of the EM algorithm:

$$\gamma_{ig}^{(t)} = \frac{\pi_g^{(t-1)}\, \mathcal{N}\bigl(x_i \mid \mu_g^{(t-1)}, \Sigma_g^{(t-1)}\bigr)}{\sum_{j \in G} \pi_j^{(t-1)}\, \mathcal{N}\bigl(x_i \mid \mu_j^{(t-1)}, \Sigma_j^{(t-1)}\bigr)}$$

where $x_i$ denotes the $i$-th data point; $G$ is the set of Gaussian models in the mixture and $j$ is an index over it; $\pi_g$ is the weight of the $g$-th Gaussian model in the Gaussian mixture model; and $\mathcal{N}(x \mid \mu_g, \Sigma_g)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$. This step computes, for the local anomaly factor value of each data point in the original data, the posterior probability $\gamma_{ig}^{(t)}$ under the Gaussian mixture model in the $t$-th iteration, that is, the probability that the data point belongs to the $g$-th Gaussian model after this iteration.
Step 2.2: execute step M of the EM algorithm:

$$\mu_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} x_i}{\sum_{i \in I} \gamma_{ig}^{(t)}}, \qquad \Sigma_g^{(t)} = \frac{\sum_{i \in I} \gamma_{ig}^{(t)} \bigl(x_i - \mu_g^{(t)}\bigr)\bigl(x_i - \mu_g^{(t)}\bigr)^{\top}}{\sum_{i \in I} \gamma_{ig}^{(t)}}, \qquad \pi_g^{(t)} = \frac{1}{|I|} \sum_{i \in I} \gamma_{ig}^{(t)}$$

where $I$ is the set of all $x$ and $T$ is the set of all $t$. The parameter estimate $\theta^{(t)}$ of the $t$-th iteration is obtained by maximizing the likelihood function with respect to the parameters, and the parameters of each Gaussian model are updated with this estimate to obtain the Gaussian mixture model after the $t$-th iteration;
Step 2.3: calculate the log-likelihood function of the Gaussian mixture model after the $t$-th iteration:

$$\ln L\bigl(\theta^{(t)}\bigr) = \sum_{i \in I} \ln \sum_{g \in G} \pi_g^{(t)}\, \mathcal{N}\bigl(x_i \mid \mu_g^{(t)}, \Sigma_g^{(t)}\bigr)$$

Step 2.4: alternate step E and step M of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.
Preferably, the evaluation unit is used for evaluating the dynamic cleaning effect on the original data with the regression evaluation index $R^2$, calculated as follows:

$$R^2 = 1 - \frac{\sum_{f=1}^{n} \bigl(y_f - \hat{y}_f\bigr)^2}{\sum_{f=1}^{n} \bigl(y_f - \bar{y}\bigr)^2}$$

where the random forest model uses bootstrap to randomly sample the original data into X sub-datasets that form the sub-sample set; $y_f$ is the value of the $f$-th sub-sample point; $n$ is the length of the missing data; $\bar{y}$ is the mean of the sub-sample points; and $\hat{y}_f$ is the model's predicted value;
the quality of the cleaning is judged by the magnitude of $R^2$.
If R is 2 =0, illustrating that the method of the present invention has the same effect as the mean filling method in the prior art; if R is 2 < 0, the method of the invention has poorer effect compared with the mean value filling method in the prior art; if R is 2 More than 0, the method of the invention has better effect than the mean value filling method in the prior art, and R 2 The larger the value, the better the effect.
Preferably, the evaluation unit also evaluates the dynamic cleaning effect on the original data with the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as indices of the data prediction effect:

$$M_{APE} = \frac{1}{N} \sum_{f=1}^{N} \left| \frac{y_f - \hat{y}_f}{y_f} \right| \times 100\%$$

$$R_{MSE} = \sqrt{\frac{1}{N} \sum_{f=1}^{N} \bigl(y_f - \hat{y}_f\bigr)^2}$$

where $N$ is the number of sample points in the sub-sample. The smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the algorithm's final prediction and the better the effect.
These indices reflect the degree of fit and the size of the error between the algorithm's result and the true value: $R_{MSE}$ measures the absolute deviation between the true load value and the predicted value, $M_{APE}$ measures the relative deviation between them, and $R^2$ measures how well the predicted values fit the true load values. Together the indices evaluate the performance of the algorithm from different aspects.
The optional implementations of the embodiment of the invention have been described in detail above with reference to the accompanying drawings. The embodiment is not limited to the specific details of these implementations: various simple modifications may be made to the technical solution within the scope of its technical concept, and all such modifications fall within the protection scope of the embodiment of the invention.

Claims (10)

1. A dynamic cleaning method for abnormal power consumption data of power consumers, characterized by comprising the following steps:
step 1: calculating the anomaly factor of every data point in the input original data to obtain the local anomaly factor values of all data points, and sorting the values in ascending order;
step 2: using a Gaussian mixture model to cluster the sorted local anomaly factor values into two classes, normal data and abnormal data, and taking the local anomaly factor value on the boundary between the normal sample data points and the abnormal sample data points obtained by the adaptive clustering as the anomaly judgment threshold;
step 3: comparing, in a loop, the local anomaly factor value of every data point in the original data with the anomaly judgment threshold, judging the data points whose local anomaly factor value is smaller than the anomaly judgment threshold to be normal data and all other data points to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into a missing data dynamic filling program;
step 4: determining the duration of the missing data in the original data, dividing the data by missing duration into two missing types, local missing and long-term missing, and treating the blank missing data points as the local missing type;
step 5: taking the data points adjacent to the data points of the local missing type as training data, building and training a least squares regression model, and filling the data points of the local missing type with the prediction result of the least squares regression model;
step 6: setting the number of decision trees of a random forest model, taking the data points adjacent to the long-term missing type data interval, before and after it, as training data, building and training the random forest model, and filling the long-term missing type data interval with the prediction result of the random forest model;
step 7: evaluating the dynamic cleaning effect on the original data.
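Steps 3 to 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the cut-off `max_local_len` between local and long-term gaps, the neighbourhood size `window`, the straight-line regression over the time index and all function names are hypothetical. Long-term gaps are only reported, since the random forest filling of step 6 is outside this sketch.

```python
def blank_anomalies(values, factor_values, threshold):
    """Step 3: keep points whose factor is below the threshold, blank the rest."""
    return [v if s < threshold else None for v, s in zip(values, factor_values)]


def find_gaps(series):
    """Step 4 helper: return (start, length) for each run of None values."""
    gaps, i = [], 0
    while i < len(series):
        if series[i] is None:
            j = i
            while j < len(series) and series[j] is None:
                j += 1
            gaps.append((i, j - i))
            i = j
        else:
            i += 1
    return gaps


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fill_local_gaps(series, max_local_len=3, window=4):
    """Steps 4-5: fill short gaps by regression, return long gaps unfilled."""
    filled, long_gaps = list(series), []
    for start, length in find_gaps(series):
        if length > max_local_len:  # hypothetical cut-off between the two types
            long_gaps.append((start, length))
            continue
        lo = max(0, start - window)
        hi = min(len(series), start + length + window)
        xs = [i for i in range(lo, hi) if series[i] is not None]
        if len(xs) < 2:
            continue  # not enough neighbours to fit a line
        a, b = fit_line(xs, [series[i] for i in xs])
        for i in range(start, start + length):
            filled[i] = a + b * i
    return filled, long_gaps


# Hypothetical load curve with one spike whose factor exceeds the threshold.
load = [1.0, 2.0, 3.0, 50.0, 5.0, 6.0, 7.0]
factors = [1.0, 1.0, 1.1, 4.0, 1.0, 1.0, 1.1]
blanked = blank_anomalies(load, factors, threshold=2.0)
filled, long_gaps = fill_local_gaps(blanked)
```

In this example the spike at index 3 is blanked and then refilled from the line fitted through its neighbours, while any gap longer than `max_local_len` would be returned in `long_gaps` for a separate model.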
2. The dynamic cleaning method for abnormal power consumption data of power consumers according to claim 1, wherein step 1 is specifically as follows:
step 1.1: calculating the $k$-th reachable distance between all data points in the original data; $d(x_a, x_b)$ denotes the distance between data point $x_a$ and data point $x_b$; $d_k(x_a)$ denotes the $k$-th distance of $x_a$, that is, the distance between $x_a$ and the data point that is $k$-th closest to $x_a$ among all data points; $d_k^{reach}(x_a, x_b)$ is the $k$-th reachable distance from data point $x_a$ to data point $x_b$, taken as the larger of $d_k(x_b)$ and $d(x_a, x_b)$, as follows:
$$d_k^{reach}(x_a, x_b) = \max\left\{ d_k(x_b),\ d(x_a, x_b) \right\}$$
step 1.2: calculating the local reachable density of all data points from the obtained $k$-th reachable distances; $N_k(x_a)$ denotes the $k$-th distance neighborhood of data point $x_a$; the local reachable density $\rho_k(x_a)$ of data point $x_a$ is the reciprocal of the average reachable distance from all data points in the $k$-th distance neighborhood $N_k(x_a)$ to $x_a$, and represents the density of $x_a$ relative to the data points in its surrounding neighborhood, as follows:
$$\rho_k(x_a) = \left( \frac{1}{\left|N_k(x_a)\right|} \sum_{x_b \in N_k(x_a)} d_k^{reach}(x_a, x_b) \right)^{-1}$$
where $\left|N_k(x_a)\right|$ denotes the number of all data points within the $k$-th distance of data point $x_a$;
step 1.3: calculating the local anomaly factor value of each data point in the original data, the local anomaly factor value $LOF_k(x_a)$ of data point $x_a$ being as follows:
$$LOF_k(x_a) = \frac{1}{\left|N_k(x_a)\right|} \sum_{x_b \in N_k(x_a)} \frac{\rho_k(x_b)}{\rho_k(x_a)}$$
where $LOF_k(x_a)$ is the average of the ratios of the local reachable density of the data points in the $k$-th distance neighborhood $N_k(x_a)$ to the local reachable density of data point $x_a$; the larger $LOF_k(x_a)$ is, the more likely data point $x_a$ is an outlier.
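The three sub-steps correspond to what is commonly called the local outlier factor (LOF). The sketch below is a minimal pure-Python rendering under stated assumptions: Euclidean distance is assumed because the claim does not name a metric, ties at the $k$-th distance are included in the neighborhood, and the data must not contain more than $k$ identical points (otherwise an average reachable distance of zero would occur). The function name `lof_values` and the sample data are hypothetical.

```python
import math


def lof_values(points, k):
    """Local anomaly factor of every point; points are tuples of floats."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Step 1.1: k-th distance and k-th distance neighborhood of each point.
    k_dist, neigh = [], []
    for a in range(n):
        others = sorted((dist[a][b], b) for b in range(n) if b != a)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neigh.append([b for d, b in others if d <= kd])  # ties are included

    def reach(a, b):
        # k-th reachable distance from a to b: larger of k-distance(b), d(a, b).
        return max(k_dist[b], dist[a][b])

    # Step 1.2: local reachable density = 1 / average reachable distance.
    density = []
    for a in range(n):
        mean_reach = sum(reach(a, b) for b in neigh[a]) / len(neigh[a])
        density.append(1.0 / mean_reach)

    # Step 1.3: average ratio of neighbor densities to the point's own density.
    return [
        sum(density[b] for b in neigh[a]) / (len(neigh[a]) * density[a])
        for a in range(n)
    ]


# Hypothetical one-dimensional readings; the last one is far from the rest.
readings = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
factors = lof_values(readings, k=2)
```

Points inside the evenly spaced group obtain a factor of about 1, while the isolated reading obtains a clearly larger value, which is the property the later thresholding step relies on.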
3. The dynamic cleaning method for abnormal power consumption data of power consumers according to claim 2, wherein step 2 is specifically as follows:
step 2.1: randomly initializing the parameters $\theta$ of the EM algorithm and executing the E step of the EM algorithm, as follows:
$$\gamma_{ig}^{(t)} = \frac{\pi_g\, \phi\left(x_i \mid \mu_g, \Sigma_g\right)}{\sum_{j \in G} \pi_j\, \phi\left(x_i \mid \mu_j, \Sigma_j\right)}$$
where $x_i$ denotes the $i$-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model and $j$ is an index over them; $\pi_g$ is the weight of the $g$-th Gaussian model in the Gaussian mixture model; $\phi\left(x_i \mid \mu_g, \Sigma_g\right)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$; in this step the posterior probability $\gamma_{ig}^{(t)}$ of each Gaussian model is calculated for the local anomaly factor value of each data point in the original data at the $t$-th iteration;
step 2.2: executing the M step of the EM algorithm, where $I$ is the set of all $x$ and $T$ is the set of all $t$: the parameter estimates $\hat{\theta}^{(t)}$ of the $t$-th iteration are obtained by maximizing the likelihood function of the parameters, and the relevant parameters of each Gaussian model in the mixture are updated according to these estimates to obtain the Gaussian mixture model after the $t$-th iteration;
step 2.3: calculating the log-likelihood function $\ln L\left(\hat{\theta}^{(t)}\right)$ of the Gaussian mixture model after the $t$-th iteration, as follows:
$$\ln L\left(\hat{\theta}^{(t)}\right) = \sum_{x_i \in I} \ln \sum_{g \in G} \pi_g\, \phi\left(x_i \mid \mu_g, \Sigma_g\right)$$
step 2.4: alternately executing the E step and the M step of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.
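The EM loop of steps 2.1 to 2.4 can be sketched for the one-dimensional case of local anomaly factor values as follows. This is a minimal sketch under stated assumptions rather than the claimed procedure: the M step uses the standard closed-form Gaussian mixture updates, the means start at the smallest and largest value instead of a random initialization so that the result is reproducible, and the threshold is read as the smallest value assigned to the abnormal cluster, which is one possible interpretation of the boundary value. All names and the sample values are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(xs, max_iter=200, tol=1e-8, var_floor=1e-6):
    """EM for a two-component 1-D Gaussian mixture; returns params, posteriors."""
    n = len(xs)
    mean = sum(xs) / n
    spread = max(sum((x - mean) ** 2 for x in xs) / n, var_floor)
    # Assumption: deterministic start (min and max) instead of a random one.
    mus, variances, weights = [min(xs), max(xs)], [spread, spread], [0.5, 0.5]
    prev_loglik, post = -math.inf, []
    for _ in range(max_iter):
        # E step: posterior probability of each component for each value.
        post, loglik = [], 0.0
        for x in xs:
            joint = [weights[g] * normal_pdf(x, mus[g], variances[g]) for g in range(2)]
            total = max(sum(joint), 1e-300)  # guard against numerical underflow
            loglik += math.log(total)
            post.append([j / total for j in joint])
        # Stop when the log-likelihood has converged.
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
        # M step: standard weighted updates of mean, variance and weight.
        for g in range(2):
            resp = max(sum(p[g] for p in post), 1e-300)
            mus[g] = sum(p[g] * x for p, x in zip(post, xs)) / resp
            var = sum(p[g] * (x - mus[g]) ** 2 for p, x in zip(post, xs)) / resp
            variances[g] = max(var, var_floor)
            weights[g] = resp / n
    return weights, mus, variances, post


def anomaly_threshold(xs):
    """Smallest value assigned to the abnormal (higher-mean) component."""
    _, mus, _, post = fit_two_gaussians(xs)
    normal = 0 if mus[0] <= mus[1] else 1
    abnormal = [
        x for x, p in zip(xs, post)
        if p[1 - normal] > p[normal] and x > mus[normal]
    ]
    return min(abnormal) if abnormal else math.inf


# Hypothetical sorted factor values: nine normal points and three outliers.
factor_values = [0.95, 0.98, 1.0, 1.0, 1.02, 1.03, 1.05, 1.08, 1.1, 4.8, 5.0, 5.3]
threshold = anomaly_threshold(factor_values)
flagged = [x for x in factor_values if x >= threshold]
```

Values below `threshold` are treated as normal and the rest as abnormal, matching the comparison rule of step 3 in claim 1.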
4. The dynamic cleaning method for abnormal power consumption data of power consumers according to claim 3, wherein in step 7 the dynamic cleaning effect on the original data is evaluated by using the regression evaluation index $R^2$ to evaluate the algorithm;
the regression evaluation index $R^2$ is calculated as follows:
$$R^2 = 1 - \frac{\sum_{f=1}^{n}\left(y_f - \hat{y}_f\right)^2}{\sum_{f=1}^{n}\left(y_f - \bar{y}\right)^2}$$
where the random forest model uses bootstrap to randomly sample the original data into $X$ sub-data sets that form the sub-sample set, $y_f$ is the value of the $f$-th sub-sample point, $n$ is the length of the missing data, $\bar{y}$ is the mean value of the sub-sample points, and $\hat{y}_f$ is the model predicted value;
the quality of the cleaning scheme is judged according to the magnitude of $R^2$.
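The bootstrap sampling mentioned in this claim can be illustrated with a short Python sketch. It assumes plain sampling with replacement in which every sub-data set has the same size as the original data; the function name, the seed and the sample values are hypothetical and only serve the illustration.

```python
import random


def bootstrap_subsets(data, n_subsets, seed=0):
    """Draw n_subsets samples of len(data) points each, with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(n_subsets)]


# Hypothetical load readings sampled into X = 3 sub-data sets.
readings = [3.1, 2.9, 3.4, 3.0, 3.2]
subsets = bootstrap_subsets(readings, n_subsets=3)
```

Each sub-data set would then train one decision tree of the random forest, and the averaged tree predictions give the model predicted value that enters $R^2$.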
5. The dynamic cleaning method for abnormal power consumption data of power consumers according to claim 4, wherein in step 7, in evaluating the dynamic cleaning effect on the original data, the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ are also adopted as evaluation indexes of the data prediction effect, as follows:
$$M_{APE} = \frac{1}{N}\sum_{f=1}^{N}\left|\frac{y_f - \hat{y}_f}{y_f}\right| \times 100\%, \qquad R_{MSE} = \sqrt{\frac{1}{N}\sum_{f=1}^{N}\left(y_f - \hat{y}_f\right)^2}$$
where $N$ is the number of sample points in the sub-sample; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction result and the better the effect.
6. A dynamic cleaning system for abnormal power consumption data of power consumers, characterized by comprising an anomaly factor calculation unit, an anomaly judgment threshold calculation unit, a blank missing data point forming unit, a missing data duration judgment unit, a local missing type data filling unit, a long-term missing type data interval filling unit and an evaluation unit:
the anomaly factor calculation unit is used for calculating the anomaly factor of every data point in the input original data to obtain the local anomaly factor values of all data points, and sorting the values in ascending order;
the anomaly judgment threshold calculation unit is used for clustering the sorted local anomaly factor values into normal data and abnormal data by using a Gaussian mixture model, and taking the local anomaly factor value on the boundary between the normal sample data points and the abnormal sample data points obtained by the adaptive clustering as the anomaly judgment threshold;
the blank missing data point forming unit is used for comparing, in a loop, the local anomaly factor value of every data point in the original data with the anomaly judgment threshold, judging the data points whose local anomaly factor value is smaller than the anomaly judgment threshold to be normal data and all other data points to be abnormal data, removing all abnormal data to form blank missing data points, and inputting the blank missing data points together with the original data into the missing data dynamic filling program;
the missing data duration judgment unit is used for determining the duration of the missing data in the original data, dividing the data by missing duration into two missing types, local missing and long-term missing, and treating the blank missing data points as the local missing type;
the local missing type data filling unit is used for taking the data points adjacent to the data points of the local missing type as training data, building and training a least squares regression model, and filling the data points of the local missing type with the prediction result of the least squares regression model;
the long-term missing type data interval filling unit is used for setting the number of decision trees of a random forest model, taking the data points adjacent to the long-term missing type data interval, before and after it, as training data, building and training the random forest model, and filling the long-term missing type data interval with the prediction result of the random forest model;
the evaluation unit is used for evaluating the dynamic cleaning effect on the original data.
7. The dynamic cleaning system for abnormal power consumption data of power consumers according to claim 6, wherein the anomaly factor calculation unit is configured to perform the following:
step 1.1: calculating the $k$-th reachable distance between all data points in the original data; $d(x_a, x_b)$ denotes the distance between data point $x_a$ and data point $x_b$; $d_k(x_a)$ denotes the $k$-th distance of $x_a$, that is, the distance between $x_a$ and the data point that is $k$-th closest to $x_a$ among all data points; $d_k^{reach}(x_a, x_b)$ is the $k$-th reachable distance from data point $x_a$ to data point $x_b$, taken as the larger of $d_k(x_b)$ and $d(x_a, x_b)$, as follows:
$$d_k^{reach}(x_a, x_b) = \max\left\{ d_k(x_b),\ d(x_a, x_b) \right\}$$
step 1.2: calculating the local reachable density of all data points from the obtained $k$-th reachable distances; $N_k(x_a)$ denotes the $k$-th distance neighborhood of data point $x_a$; the local reachable density $\rho_k(x_a)$ of data point $x_a$ is the reciprocal of the average reachable distance from all data points in the $k$-th distance neighborhood $N_k(x_a)$ to $x_a$, and represents the density of $x_a$ relative to the data points in its surrounding neighborhood, as follows:
$$\rho_k(x_a) = \left( \frac{1}{\left|N_k(x_a)\right|} \sum_{x_b \in N_k(x_a)} d_k^{reach}(x_a, x_b) \right)^{-1}$$
where $\left|N_k(x_a)\right|$ denotes the number of all data points within the $k$-th distance of data point $x_a$;
step 1.3: calculating the local anomaly factor value of each data point in the original data, the local anomaly factor value $LOF_k(x_a)$ of data point $x_a$ being as follows:
$$LOF_k(x_a) = \frac{1}{\left|N_k(x_a)\right|} \sum_{x_b \in N_k(x_a)} \frac{\rho_k(x_b)}{\rho_k(x_a)}$$
where $LOF_k(x_a)$ is the average of the ratios of the local reachable density of the data points in the $k$-th distance neighborhood $N_k(x_a)$ to the local reachable density of data point $x_a$; the larger $LOF_k(x_a)$ is, the more likely data point $x_a$ is an outlier.
8. The dynamic cleaning system for abnormal power consumption data of power consumers according to claim 7, wherein the blank missing data point forming unit is configured to perform the following:
step 2.1: randomly initializing the parameters $\theta$ of the EM algorithm and executing the E step of the EM algorithm, as follows:
$$\gamma_{ig}^{(t)} = \frac{\pi_g\, \phi\left(x_i \mid \mu_g, \Sigma_g\right)}{\sum_{j \in G} \pi_j\, \phi\left(x_i \mid \mu_j, \Sigma_j\right)}$$
where $x_i$ denotes the $i$-th data point; $G$ is the set of Gaussian models in the Gaussian mixture model and $j$ is an index over them; $\pi_g$ is the weight of the $g$-th Gaussian model in the Gaussian mixture model; $\phi\left(x_i \mid \mu_g, \Sigma_g\right)$ is the probability distribution function of a single Gaussian model with mean $\mu_g$ and covariance matrix $\Sigma_g$; in this step the posterior probability $\gamma_{ig}^{(t)}$ of each Gaussian model is calculated for the local anomaly factor value of each data point in the original data at the $t$-th iteration;
step 2.2: executing the M step of the EM algorithm, where $I$ is the set of all $x$ and $T$ is the set of all $t$: the parameter estimates $\hat{\theta}^{(t)}$ of the $t$-th iteration are obtained by maximizing the likelihood function of the parameters, and the relevant parameters of each Gaussian model in the mixture are updated according to these estimates to obtain the Gaussian mixture model after the $t$-th iteration;
step 2.3: calculating the log-likelihood function $\ln L\left(\hat{\theta}^{(t)}\right)$ of the Gaussian mixture model after the $t$-th iteration, as follows:
$$\ln L\left(\hat{\theta}^{(t)}\right) = \sum_{x_i \in I} \ln \sum_{g \in G} \pi_g\, \phi\left(x_i \mid \mu_g, \Sigma_g\right)$$
step 2.4: alternately executing the E step and the M step of the EM algorithm until the log-likelihood function converges or the number of iterations reaches the set maximum.
9. The dynamic cleaning system for abnormal power consumption data of power consumers according to claim 8, wherein the evaluation unit is configured to evaluate the dynamic cleaning effect on the original data by using the regression evaluation index $R^2$ to evaluate the algorithm;
the regression evaluation index $R^2$ is calculated as follows:
$$R^2 = 1 - \frac{\sum_{f=1}^{n}\left(y_f - \hat{y}_f\right)^2}{\sum_{f=1}^{n}\left(y_f - \bar{y}\right)^2}$$
where the random forest model uses bootstrap to randomly sample the original data into $X$ sub-data sets that form the sub-sample set, $y_f$ is the value of the $f$-th sub-sample point, $n$ is the length of the missing data, $\bar{y}$ is the mean value of the sub-sample points, and $\hat{y}_f$ is the model predicted value;
the quality of the cleaning scheme is judged according to the magnitude of $R^2$.
10. The dynamic cleaning system for abnormal power consumption data of power consumers according to claim 9, wherein the evaluation unit, in evaluating the dynamic cleaning effect on the original data, also adopts the mean absolute percentage error $M_{APE}$ and the root mean square error $R_{MSE}$ as evaluation indexes of the data prediction effect, as follows:
$$M_{APE} = \frac{1}{N}\sum_{f=1}^{N}\left|\frac{y_f - \hat{y}_f}{y_f}\right| \times 100\%, \qquad R_{MSE} = \sqrt{\frac{1}{N}\sum_{f=1}^{N}\left(y_f - \hat{y}_f\right)^2}$$
where $N$ is the number of sample points in the sub-sample; the smaller the values of $M_{APE}$ and $R_{MSE}$, the smaller the error of the final prediction result and the better the effect.
CN202311666866.1A 2023-12-07 2023-12-07 Dynamic cleaning method and system for abnormal power consumption data of power consumer Pending CN117370744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311666866.1A CN117370744A (en) 2023-12-07 2023-12-07 Dynamic cleaning method and system for abnormal power consumption data of power consumer

Publications (1)

Publication Number Publication Date
CN117370744A (en) 2024-01-09

Family

ID=89393251

Country Status (1)

Country Link
CN (1) CN117370744A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118035227A (en) * 2024-04-15 2024-05-14 山东云擎信息技术有限公司 Data intelligent processing method and system based on big data evaluation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639978A (en) * 2020-06-08 2020-09-08 武汉理工大学 Electronic commerce event driving type demand forecasting method based on Prophet-random forest
CN112926627A (en) * 2021-01-28 2021-06-08 电子科技大学 Equipment defect time prediction method based on capacitive equipment defect data
CN113468796A (en) * 2021-04-13 2021-10-01 广西电网有限责任公司南宁供电局 Voltage missing data identification method based on improved random forest algorithm
CN113886375A (en) * 2021-09-29 2022-01-04 东北电力大学 Wind power data cleaning method based on isolated forest and local outlier factors
US20220382263A1 (en) * 2021-04-30 2022-12-01 Dalian University Of Technology Distributed industrial energy operation optimization platform automatically constructing intelligent models and algorithms
CN117113162A (en) * 2023-05-23 2023-11-24 南华大学 Eddar-rock structure background discrimination and graphic method integrating machine learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
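Mei Yujie: "Dynamic cleaning method for abnormal and missing data of distribution networks based on machine learning", Power System Protection and Control (《电力系统保护与控制》), pages 159 - 168 *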

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination